By some accounts Nginx now serves most of the world's top sites, and it has
become an enterprise product, so I was very surprised that I couldn't
find a single mention of a problem in its PID file handling that I've
been observing for a while.
On a restart the old Nginx 'master' process can remain active
for some time, until all of its active connections close and it
terminates. When it finally exits it deletes the PID file on the
file-system, but that file no longer belongs to it: it belongs to the
new Nginx 'master' process, which was spawned and has already written
its own PID into the file (unless you prevented the new instance
from starting in the first place while a PID file exists on the
file-system).
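The race is easy to see with plain files standing in for the two master
processes; nothing below touches a real nginx, the paths are scratch
files for the demonstration:

```shell
# Simulate the restart race with a scratch file instead of real
# nginx masters; the file path is just a temp file for the demo.
PIDFILE=$(mktemp)

echo 1000 > "$PIDFILE"   # old master wrote its PID at startup
echo 2000 > "$PIDFILE"   # restart: new master overwrites the same file
rm -f "$PIDFILE"         # old master finally exits and deletes the file

# From now on, any tool that signals nginx via the PID file fails:
cat "$PIDFILE" 2>/dev/null || echo "no PID file to signal"
```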
This leads to many issues down the road; here are the most severe ones
I experienced. The monitoring system alarms at 3 in the morning about
full RAID arrays when in reality Nginx kept file-descriptors open on
huge logs that were deleted long ago: the log rotation jobs simply
failed to send USR1 to it, because there was no PID file on the
file-system. Then sysadmins and configuration management agents alike
fail to activate new configuration by reloading (or again, restarting)
the Nginx service, their signals sent into the aether because there's
no PID file on the file-system. That's where most of my surprise came
from: how in the world is everyone else successfully automating their
web farms when 10% of your configuration updates fail to apply on
average? And that's only 100 servers once you are past 1000 nodes...
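For reference, log rotation relies on exactly that file: nginx reopens
its logs when the master receives USR1. A rotation post-step looks
roughly like the sketch below (the PID file path is the conventional
default and an assumption here; adjust for your install). The
missing-file branch is the silent failure described above:

```shell
# Sketch of a rotation post-step: signal nginx to reopen its logs,
# but report instead of failing silently when the PID file is gone.
# The PIDFILE default is the conventional location, an assumption.
PIDFILE="${PIDFILE:-/var/run/nginx.pid}"

if [ -s "$PIDFILE" ]; then
    kill -USR1 "$(cat "$PIDFILE")"    # nginx reopens access/error logs
    STATUS=signalled
else
    STATUS=missing                    # the failure mode described above
fi
echo "reopen-logs: $STATUS"
```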
The Nginx developers proclaimed this a feature and invalidated the bug
report. Officially this is the intended behavior: "you start new
nginx instance before old nginx instance was terminated. It's not
something allowed - you have to wait for an old master process to exit
before starting a new one.". That's acceptable to me, but then I
wonder how in the world everyone else is successfully doing
high-availability with their web farms. If you have a CDN origin and
edge nodes are pulling 2GB videos from it, those connections are going
to take a while to close; meanwhile your origin is failing all the
health checks from your HA frontends and it gets failed out...
The final official answer is that Nginx should never, ever be
restarted: every configuration update can be applied by a reload (send
HUP to the master process). Unfortunately that doesn't work in
practice (how in the world is everyone else...); in my experience
Nginx fails to apply many configuration changes on a reload. If that
is the true bug I sometimes hit (e.g. a new FastCGI caching policy
failing to activate, new SSL certificates failing to activate, and so
on) I understand it and I accept it. However I remain of the opinion
that smarter PID file handling is a simple fix, and a useful thing to
have.
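A reload is just a HUP delivered through the PID file. The mechanism
can be demonstrated end-to-end with a toy background process standing
in for the nginx master; everything here is a stand-in, only the
`kill -HUP "$(cat pidfile)"` step is the real operation:

```shell
# Demonstrate "reload = HUP via the PID file" with a toy master.
PIDFILE=$(mktemp)
FLAG=$(mktemp -u)

# Toy master: install a HUP handler, publish its PID, then idle.
sh -c 'trap "touch \"$2\"; exit 0" HUP
       echo $$ > "$1"
       while :; do sleep 1; done' toy "$PIDFILE" "$FLAG" &

until [ -s "$PIDFILE" ]; do sleep 1; done   # wait for the PID file

kill -HUP "$(cat "$PIDFILE")"               # the "reload" signal
wait                                        # toy master handles HUP, exits
[ -f "$FLAG" ] && echo "reload handled"
```

If the old master had already deleted that PID file, the `cat` would
fail and the signal would never be sent, which is the whole problem.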
What can you do in this situation to avoid 3 AM wake-up calls over a
false positive, without giving up HA? The init script can maintain its
own PID file, a clone of the one the Nginx 'master' created at the
time it started, and rely on it for all future actions; your log
rotation jobs can do the same. This hack will certainly never be
shipped by an OS distribution, but many operations teams already
package their own Nginx because of all the extra modules modern web
servers require (media streaming, Lua scripting, Real IP...).
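A minimal sketch of that hack, with illustrative paths and function
names: the clone is taken once, right after the master starts, and is
the only file ever consulted for signals afterwards.

```shell
# Sketch of the init-script hack (paths and function names are
# illustrative): clone the PID file right after the master starts,
# then signal only through the clone, which nothing ever deletes.
NGINX_PIDFILE="${NGINX_PIDFILE:-/var/run/nginx.pid}"
SAFE_PIDFILE="${SAFE_PIDFILE:-/var/run/nginx.init.pid}"

clone_pidfile() {
    # call from the init script's start action, once nginx is up
    cp "$NGINX_PIDFILE" "$SAFE_PIDFILE"
}

signal_master() {
    # usage: signal_master HUP  (reload) / signal_master USR1 (logs)
    kill -s "$1" "$(cat "$SAFE_PIDFILE")"
}
```

Log rotation jobs then call `signal_master USR1` against the clone,
which keeps working even after the old master has deleted the real
PID file on its way out.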