06.11.2011 00:17

Pulling strings

After one year of managing a network of 10 servers with Cfengine I'm currently building two clusters of 50 servers with Puppet (which I'm using for the first time), and have various notes to share. With my experience I had a feeling Cfengine just isn't right for this project, and didn't consider it seriously. These servers are all running Debian GNU/Linux and Puppet felt natural because of the good Debian integration, and the number of users whom also produced a lot of resources. Chef was out of the picture soon because of the scary architecture; CouchDB, Solr and RabbitMQ... coming from Cfengine this seemed like a bad joke. You probably need to hire a Ruby developer when it breaks. Puppet is somewhat better in this regard.

Puppet master needs Ruby, and has a built-in file server using WEBrick. My first disappointment with Puppet was WEBrick. Though PuppetLabs claim you can scale it up to 20 servers, that proved way off, the built-in server has problems serving as little as 5 agents/servers, and you get to see many dropped connections and failed catalog transfers. I was forced to switch to Mongrel and Nginx as frontend very early in the project, on both clusters. This method works much better (even though Apache+Passenger is the recommended method now from PuppetLabs), and it's not a huge complication compared to WEBrick (and Cfengine which doesn't make you jump through any hoops). Part of the reason for this failure is my pull interval, which is 5 minutes with a random sleep time of up to 3 minutes to avoid harmonics (which is still a high occurrence with these intervals and WEBrick fails miserably). In production a customer can not wait on 30/45 minute pull intervals to get his IP address whitelisted for a service, or some other mundane task, it must happen within 10 minutes... but I'll come to these kind of unrealistic ideas a little later.

Unlike the Cfengine article I have no bootstrapping notes, and no code/modules to share. By default the fresh started puppet agent will look for a host called "puppet" and pull in what ever you defined to bootstrap servers in your manifests. As for modules, I wrote a ton of code and though I'd like to share it, my employer owns it. But unlike Cfengine v3 there's a lot of resources out there for Puppet which can teach you everything you need to know, so I don't feel obligated to even ask.

Interesting enough, published modules would not help you get your job done. You will have to write your own, and your team members will have to learn how to use your modules, which also means writing a lot of documentation. Maybe my biggest disappointment is getting disillusioned by a lot of Puppet advocates and DevOps prophets. I found articles and modules most of them write, and experiences they share are simplistic - have nothing to do with the real world. It's like they host servers in (magic?) environments where everything is done in one way and all servers are identical. Hosting big websites and their apps is a much, much different affair.

Every customer does things differently, and I had to write custom modules for each of them. Just between these two clusters a module managing Apache is different, and you can abstract your code a lot but you reach a point where you simply can't push it any more. Or if you can, you create a mess that is unusable by your team members, and I'm trying to make their jobs better not make them miserable. One customer uses an Isilon NAS, the other has a content distribution network, one uses Nginx as a frontend, other has chrooted web servers, one writes logs to a NFS, other to a Syslog cluster... Now imagine this on a scale with 2,000 customers and 3 times the servers and most of the published infrastructure design guidelines become laughable. Instead you find your self implementing custom solutions, and inventing your own rules, best that you can...

I'm ultimately here to tell you that the projects are in a better state then they would be with the usual cluster management policy. My best moment was an e-mail from a team member saying "I read the code, I now understand it [Puppet]. This is fucking awesome!". I knew at that moment I managed to build something good (or good enough), despite the shortcomings I found, and with nothing more than using PuppetLabs resources. Actually, that is not completely honest. Because I did buy and read the book Pro Puppet which contains an excellent chapter on using Git for collaboration on modules between sysadmins and developers, with proper implementation of development, testing and production (Puppet)environments.


Written by anrxc | Permalink | Filed under work, code, books