29.10.2013 22:11

Nginx PID file handling and HA

By some accounts Nginx now serves most of the world's top sites, and it is also an enterprise product, so I was very surprised when I couldn't find a single mention of a problem in PID file handling that I've been observing for a while.

On a restart the old Nginx 'master' process can remain active for some time, until all active connections on it close and it terminates. When it finally terminates it deletes the PID file from the file-system, even though that file no longer belongs to it. By then the file belongs to the new Nginx 'master' process, which was spawned in the meantime and already wrote its own PID into it (unless you prevented the new master from starting in the first place while a PID file exists on the file-system).
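The sequence of events described above can be replayed with plain files. This is only a toy simulation, not Nginx code: the PIDs 100 and 200 and the function names are made up for illustration, and "old master" and "new master" are just function calls acting on a file.

```python
import os
import tempfile


def write_pid_file(path, pid):
    """What a freshly started master does: record its PID in the file."""
    with open(path, "w") as f:
        f.write(f"{pid}\n")


def naive_exit(path):
    """What the exiting old master effectively does: unlink the PID file
    without checking whose PID it currently holds."""
    if os.path.exists(path):
        os.unlink(path)


def simulate_restart(path, old_pid=100, new_pid=200):
    """Replay the restart; return True if a PID file survives at the end."""
    write_pid_file(path, old_pid)  # old master is running
    write_pid_file(path, new_pid)  # restart: new master overwrites the file
    naive_exit(path)               # old master finally drains and exits
    return os.path.exists(path)


if __name__ == "__main__":
    pid_path = os.path.join(tempfile.mkdtemp(), "nginx.pid")
    print("PID file present after restart:", simulate_restart(pid_path))
```

At the end the new master (PID 200 in this sketch) is still "running", but nothing on the file-system points to it any more.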

This leads to many issues down the road. Here are some of the most severe that I experienced. First, the monitoring system alarming at 3 in the morning about full RAID arrays, when in reality Nginx kept open file-descriptors on huge logs deleted long ago: log rotation jobs simply failed to send USR1 to it, since there was no PID file on the file-system. Then failures from sysadmins and configuration management agents alike to activate new configuration by reloading (or again restarting) the Nginx service: signals being sent into the aether, because there's no PID file on the file-system. That's where most of my surprise came from. How in the world is everyone else successfully automating their web farms when 10% of configuration updates fail to apply on average? What, that's only 100 servers when you are past 1000 nodes...

Nginx developers proclaimed this a feature and invalidated the bug report. Officially this is the intended behavior: "you start new nginx instance before old nginx instance was terminated. It's not something allowed - you have to wait for an old master process to exit before starting a new one." That's acceptable to me, but then I wonder how in the world is everyone else successfully doing high-availability with their web farms? If you have a CDN origin and edge nodes are pulling 2GB videos from it, those connections are going to take a while to close. Meanwhile your origin is failing all the health checks from your HA frontends and it gets failed out...

The final official solution is that Nginx should never ever be restarted. Every configuration update can be applied by a reload (send HUP to the master process). Unfortunately that doesn't work in practice (how in the world is...); in my experience Nginx fails to apply many configuration changes on a reload. If that is the true bug I sometimes hit (i.e. a new FastCGI caching policy failed to activate, new SSL certificates failed to activate, etc.), I understand it and I accept it. However, I remain of the opinion that smarter PID file handling is a simple fix, and a useful thing to have.
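To make "smarter PID file handling" concrete, here is a minimal sketch of the idea: an exiting process removes the PID file only if the file still records its own PID. This is my illustration in Python, not Nginx's actual code or a proposed patch, and the function names are hypothetical.

```python
import os
import tempfile


def read_pid(path):
    """Return the PID stored in path, or None if missing or unparsable."""
    try:
        with open(path) as f:
            return int(f.read().strip())
    except (FileNotFoundError, ValueError):
        return None


def remove_pid_file_if_owned(path, my_pid):
    """Unlink path only if it still records my_pid.

    Returns True if the file was removed, False if it was left alone.
    """
    if read_pid(path) != my_pid:
        return False  # the file belongs to someone else (or is gone)
    try:
        os.unlink(path)
    except FileNotFoundError:
        return False
    return True


if __name__ == "__main__":
    pid_path = os.path.join(tempfile.mkdtemp(), "nginx.pid")
    with open(pid_path, "w") as f:
        f.write("200\n")  # the new master already owns the file
    print("old master removed it:", remove_pid_file_if_owned(pid_path, 100))
    print("file still there:", os.path.exists(pid_path))
```

A small window remains between reading the file and unlinking it, so this narrows the problem rather than eliminating it. It would still cover the common case where the new master wrote its PID long before the old one exits.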

What can you do in this situation to avoid 3AM wake-up calls for a false positive, while not giving up HA? The init script can maintain its own PID file, a clone of the one the Nginx 'master' created at the time it started, and rely on it for all future actions. So can your log rotation jobs. This hack will certainly never be shipped by an OS distribution - but many operations already package their own Nginx because of all the extra modules modern web servers require (media streaming, Lua scripting, Real IP...).
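The shadow PID file hack can be sketched as follows. A real init script would be shell; this is a Python sketch of the same logic, and the function names and any paths you pass in are my own assumptions, not something Nginx or a distribution provides.

```python
import os
import tempfile


def read_pid(path):
    """Return the PID stored in path, or None if missing or unparsable."""
    try:
        with open(path) as f:
            return int(f.read().strip())
    except (FileNotFoundError, ValueError):
        return None


def snapshot_pid_file(nginx_pid_path, shadow_path):
    """Right after start: clone the master's PID into our own shadow file."""
    pid = read_pid(nginx_pid_path)
    if pid is None:
        raise RuntimeError(f"no usable PID in {nginx_pid_path}")
    tmp_path = shadow_path + ".tmp"
    with open(tmp_path, "w") as f:
        f.write(f"{pid}\n")
    os.replace(tmp_path, shadow_path)  # atomic rename on the same file-system
    return pid


def signal_master(shadow_path, sig):
    """Send sig to the PID in the shadow file; False if it cannot be sent."""
    pid = read_pid(shadow_path)
    if pid is None:
        return False
    try:
        os.kill(pid, sig)
    except ProcessLookupError:
        return False  # stale shadow file, that process is gone
    return True


if __name__ == "__main__":
    workdir = tempfile.mkdtemp()
    nginx_pid = os.path.join(workdir, "nginx.pid")
    shadow = os.path.join(workdir, "nginx-init.pid")
    with open(nginx_pid, "w") as f:
        f.write(f"{os.getpid()}\n")  # stand-in for the real master's PID
    snapshot_pid_file(nginx_pid, shadow)
    os.unlink(nginx_pid)  # the old master "cleans up" the wrong file
    print("master still reachable:", signal_master(shadow, 0))
```

A log rotation job would then call something like `signal_master(shadow, signal.SIGUSR1)` instead of reading the Nginx PID file, and a reload would send `signal.SIGHUP` the same way. The sketch does not guard against PID reuse, so a stricter version would also check that the PID still belongs to an Nginx master before signalling it.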


Written by anrxc | Filed under work