11.01.2014 23:00

Load balancing Redis and MySQL with HAproxy

It's a common occurrence to have two or more load balancers as HA frontends to databases at high traffic sites. I've used the open-source HAproxy like this, and have seen others use it. Building this infrastructure and getting the traffic distributed evenly is not really the topic I'd like to write about; what happens after you do is.

Using HAproxy like this in front of replicated database backends is tricky: a flap on one part of the network can make one or more frontends activate the backup backends. Then you have a form of split-brain scenario on your hands, with updates occurring simultaneously on all masters in a replicated set. Redis doesn't do multi-master replication, and it's even easier to get in trouble with just one HA frontend if the old primaries are reactivated before you have synced them with the new ones.

One way to avoid this problem is building smarter infrastructure: offloading health checks and role direction to an independent arbiter. But having one arbiter makes it a single point of failure, and having more makes it another replicated nightmare to solve. I was never keen on this approach because solving it reliably is an engineering challenge each time, and I have the good sense to know when it can be done better by smarter people.

Last year I was pestering the HAproxy developers to implement some cheap features as a start. For example, a new directive that would keep the old primary permanently offline after a fail-over to the backup, which would be more reliable than gaming health check counters. The request was of course denied, they are not in the business of writing hacks. They always felt that agents are the best approach, and that the Loadbalancer.org associates might even come up with a common 'protocol' for health and director agents.

But the developers heard my case, and I presume those of others who discussed the same infrastructure. HAproxy 1.5, which is about to be released as the new stable branch (source: mailing list), implements peering. Peering works with the help of stick-tables, whose other improvements will bring many advancements to handling bad and unwanted traffic, but that's another topic (see the HAproxy blog).

Peering synchronizes server entries in stick-tables between multiple HAproxy instances over TCP connections, so a backend failing health checks on one HA frontend will be removed from all of them. Using the documentation linked above, here's an example:

peers HAPEERS
    peer fedb01 192.168.15.10:1307
    peer fedb02 192.168.15.20:1307

backend users
    mode tcp
    option tcplog
    option mysql-check user haproxy
    stick-table type ip size 20k peers HAPEERS
    stick on dst
    balance roundrobin
    server mysql10 192.168.15.33:3306 maxconn 500 check port 3306 inter 2s
    server mysql12 192.168.15.34:3306 maxconn 500 check port 3306 inter 2s backup

When talking about Redis in particular I'd like to emphasize the improvements to health checks in HAproxy 1.5, which allow us to query Redis nodes about their role directly, and fail over only if a backend has become the new master. If Redis Sentinel is enabled and the cluster elects a new master, HAproxy will fail traffic over to it transparently. Using the documentation linked above, here's an example:
backend messages
    mode tcp
    option tcplog
    option tcp-check
    #tcp-check send AUTH\ foobar\r\n
    #tcp-check expect +OK
    tcp-check send PING\r\n
    tcp-check expect +PONG
    tcp-check send info\ replication\r\n
    tcp-check expect string role:master
    tcp-check send QUIT\r\n
    tcp-check expect string +OK
    server redis15 192.168.15.40:6379 maxconn 1024 check inter 1s
    server redis17 192.168.15.41:6379 maxconn 1024 check inter 1s
    server redis19 192.168.15.42:6379 maxconn 1024 check inter 1s


Written by anrxc | Permalink | Filed under work

29.10.2013 22:11

Nginx PID file handling and HA

Nginx by some accounts now serves most of the world's top sites, and has become an enterprise product, so I was very surprised that I couldn't find a single mention of a problem in its PID file handling that I've been observing for a while.

On a restart the old Nginx 'master' process can remain active for some time, until all active connections on it close and it terminates. When it finally does, it deletes the PID file on the file-system even though it no longer belongs to it; it belongs to the new Nginx 'master' process, which was spawned and has already written its own PID into the file (unless you prevent a new master from starting in the first place while a PID file exists on the file-system).

This leads to many issues down the road. Here are some of the most severe that I experienced: the monitoring system alarming at 3 in the morning about full RAID arrays when in reality Nginx kept open file-descriptors on huge logs deleted long ago, because log rotation jobs simply failed to send USR1 to it with no PID file on the file-system. Then failures from sysadmins and configuration management agents alike to activate new configuration by reloading (or again restarting) the Nginx service, the signals being sent into the aether because there's no PID file on the file-system. That's where most of my surprise came from: how in the world is everyone else successfully automating their web farms when 10% of your configuration updates fail to apply on average? What, that's only 100 servers once you are past 1000 nodes...
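
For illustration, a typical logrotate snippet of the kind that silently stops working once the PID file is gone; the paths are the usual defaults and not taken from any particular package:
/var/log/nginx/*.log {
    daily
    rotate 30
    compress
    postrotate
        # no PID file means no USR1, and Nginx keeps writing to the deleted logs
        [ -f /var/run/nginx.pid ] && kill -USR1 "$(cat /var/run/nginx.pid)"
    endscript
}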

Nginx developers proclaimed this a feature and invalidated the bug report. Officially this is the intended behavior: "you start new nginx instance before old nginx instance was terminated. It's not something allowed - you have to wait for an old master process to exit before starting a new one.". That's acceptable to me, but then I wonder how in the world everyone else is successfully doing high availability with their web farms. If you have a CDN origin and edge nodes are pulling 2GB videos from it, those connections are going to take a while to close; meanwhile your origin is failing all the health checks from your HA frontends and gets failed out...

The final official solution is that Nginx should never, ever be restarted. Every configuration update can be applied by a reload (send HUP to the master process). Unfortunately that doesn't work in practice (how in the world is...), Nginx fails to apply many configuration changes on a reload in my experience. If that is the true bug I sometimes hit (i.e. a new FastCGI caching policy failing to activate, new SSL certificates failing to activate, etc.) I understand it and I accept it. However I remain of the opinion that smarter PID file handling is a simple fix, and a useful thing to have.

What can you do in this situation to avoid 3AM wake-up calls for a false positive, while not giving up HA? The init script can maintain its own PID file, a clone of the one the Nginx 'master' created at the time it started, and rely on it for all future actions; so can your log rotation jobs. This hack will certainly never be shipped by an OS distribution, but many operations already package their own Nginx because of all the extra modules modern web servers require (media streaming, Lua scripting, Real IP...).
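
Here's a minimal sketch of that init script hack, assuming standard paths; the clone file name is arbitrary and the rest of the script is omitted:
NGINX_PID=/var/run/nginx.pid
CLONE_PID=/var/run/nginx.init.pid

start() {
    /usr/sbin/nginx -c /etc/nginx/nginx.conf || return 1
    # wait for the new master to write its PID, then keep a clone of it
    while [ ! -s "$NGINX_PID" ]; do sleep 1; done
    cp "$NGINX_PID" "$CLONE_PID"
}

reload() {
    kill -HUP "$(cat "$CLONE_PID")"
}

stop() {
    kill -QUIT "$(cat "$CLONE_PID")" && rm -f "$CLONE_PID"
}
Log rotation jobs then send USR1 to the PID from the clone file instead of trusting the one Nginx manages.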


Written by anrxc | Permalink | Filed under work

29.06.2013 23:01

SysV init on Arch Linux

Arch Linux distributes systemd as its init daemon, and has finally deprecated SysV this June. I could always appreciate the elegance of Arch's simple design and its packaging framework, both of which make it trivial for any enthusiast to run his own init daemon, be it openrc, upstart or SysV. To my surprise this didn't seem to be the prevailing view, and many users converted their workstations to other distributions. This irrational behavior also led to censorship on the users mailing lists, which made it impossible to reach out to other UNIX enthusiasts interested in keeping SysV usable as a specialized (and unofficial) user group.

When rc.d scripts started disappearing from official packages, I rescued those I could and packaged them as rcdscripts-aic. There was no user group, just me, and in expectation of other rc.d providers I added my initials as the suffix to the package and made a decision not to monopolize /etc/rc.d/ or /usr/share/rcdscripts/ to avoid conflict. Apparently no other provider showed up, but I still use /usr/share/rcdscripts-aic/ without strict guidelines on how to make use of the scripts in that directory (copy or symlink to /etc/rc.d/ and /etc/conf.d/?).

Later this month Arch Linux also deprecated the directories /bin, /sbin and /usr/sbin in favor of /usr/bin. Since initscripts was at this point an obsolete, unsupported and unmaintained piece of code, SysV became unusable. Again, with no other provider available to me, I forked it and packaged initscripts-aic. At least sysvinit found a maintainer and I didn't have to take over that as well.

The goal is providing a framework around SysV init for hobbyists and UNIX enthusiasts to boot their SysV systems. Stable basic boot is the priority for me, new features are not. This is not some anti-systemd revolution; I do not wish to associate myself with any systemd trolling, and I do not want my packages censored and deleted from the Arch User Repository.


Written by anrxc | Permalink | Filed under code

24.06.2013 19:29

Hosting with Puppet - Design

Two years ago I was a small-time Cfengine user moving to Puppet on a large installation, and more specifically introducing it to a managed hosting provider (which is an important factor driving my whole design and decision making process later). I knew how important it was going to be to get the base design right, and I did a lot of research on Puppet infrastructure design guidelines, but with superficial results. I was disappointed; the DevOps crowd was producing tons of material on configuration management, couldn't at least a small part of it be applicable to large installations? I didn't see it that way then, but maybe that knowledge was being reserved for consulting gigs. After criticizing, it is only fair that I write something of my own on the subject.

First of all, a lot has happened since. Wikimedia decided to release all their Puppet code to the public. I learned a lot, even if most of it was what not to do - but that was the true knowledge to be gained. One of the most prominent Puppet Forge contributors, example42 labs, released the next generation of their Puppet modules, and the quality has increased immensely. The level of abstraction is high, and for the first time I felt the Forge could possibly become a provider for me. Then 8 months ago the annual PuppetConf conference hosted engineers from Mozilla and Nokia talking about the design and scaling challenges they faced running Puppet in a big enterprise. When someone with more than 2,000 servers shares their experiences with you, soak it up.

* Puppet design principles


Running Puppet in a hosting operation is a very specific use case. Most resources available to you will concern running one or two web applications, on a hopefully standardized software stack across a dozen servers all managed by Puppet. But here you are a level above that, running thousands of such apps and sites, across hundreds of development teams that have nothing in common. If they are developing web-apps in Lisp you are there to facilitate it, not to tell stories about Python.

Some teams are heavily involved with their infrastructure, others depend entirely on you. Finally, there are "non-managed" teams which only need you to manage hardware for them, but you still want to provide them with a hosted Puppet service. All this influences my design heavily, but must not define it. If it works for 100 apps it must work for 1 just the same, so the design principles below are universal.

- Object oriented


Do not treat manifests like recipes. Stop writing node manifests. Write modules.

Huge manifests with endless instructions, if conditionals, and node (server) logic are a trap. They introduce an endless cycle of "squeezing in just one more hack" until the day you throw it all away and re-factor from scratch. This is one of the lessons I learned from Wikimedia.

Write modules (see Modular services and Module levels) that are abstracted. Within modules write small abstracted classes with inheritance in mind (see Inheritance), and write defined types (defines) for resources that have to be instantiated many times. Write and distribute templates where possible, not static files, to reduce the chances of human error, to reduce the number of files maintained by your team, and finally the number of files compiled into catalogs (which is a scaling concern).

Here's a stripped down module sample to clarify this topic, and those discussed below:
# - modules/nfs/manifests/init.pp
class nfs (
    $args = 'UNSET'
    ){

    # Abstract package and service names, Arch, Debian, RedHat...
    package { 'portmap': ensure => 'installed', }
    service { 'portmap': ensure => 'running', }
}
# - modules/nfs/manifests/disable.pp
class nfs::disable inherits nfs {
    Service['portmap'] { ensure => 'stopped', }
}
# - modules/nfs/manifests/server.pp
class nfs::server (
    $args = 'UNSET'
    ){

    package  { 'nfs-kernel-server': ensure => 'installed', }
    @service { 'nfs-kernel-server': ensure => 'running', }
}
# - modules/nfs/manifests/mount.pp
define nfs::mount (
    $arg  = 'UNSET',
    $args = 'UNSET'
    ){

    mount { $arg: device => $args['foo'], }
}
# - modules/nfs/manifests/config.pp
define nfs::config (
    $args = 'UNSET'
    ){

    # configure idmapd, configure exports...
}

- Modular services


Maintain clear roles and responsibilities between modules. Do not allow overlap.

Maybe it's true that a server will never run PHP without an accompanying web server, but that's not a good reason to bundle PHP management into the apache2 module. The same principle prevents combining mod_php and PHP-FPM management into a single module. Write php5, phpcgi and phpfpm modules, and use them with the Apache2, Lighttpd and Nginx web servers interchangeably.
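
A hypothetical sketch of that separation (team and class names are illustrative): the same phpfpm module backs either web server, and neither web server module manages PHP itself.
# - teams/shop/manifests/webserver.pp
class shop::webserver {
    include ::nginx
    include ::phpfpm
}
# - teams/blog/manifests/webserver.pp
class blog::webserver {
    include ::apache2
    include ::phpfpm
}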

- Module levels


Exploit modulepath support. Multiple module paths are supported, and they can greatly improve your design.

Reserve the default /etc/puppet/modules path for modules exposing the top level API (for lack of a better term). These modules should define your policy for all the software you standardize on, how a software distribution is installed and how it's managed: iptables, sudo, logrotate, dcron, syslog-ng, sysklogd, rsyslog, nginx, apache2, lighttpd, php5, phpcgi, phpfpm, varnish, haproxy, tomcat, fms, darwin, mysql, postgres, redis, memcached, mongodb, cassandra, supervisor, postfix, qmail, puppet itself, puppetmaster, pepuppet (enterprise edition), pepuppetmaster...
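
For example, a minimal puppet.conf sketch of such a layout; the second path matches the teams example below, adjust it to your own:
# - /etc/puppet/puppet.conf
[master]
    modulepath = /etc/puppet/modules:/etc/puppet/teams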

Use the lower level modules for defining actual policy and configuration for development teams in organizations (or customers in the enterprise), and their servers. Here's an example:
- /etc/puppet/teams/t1000/
  |_ /etc/puppet/teams/t1000/files/
     |_ php5/
        |_ apc.ini
  |_ /etc/puppet/teams/t1000/manifests/
     |_ init.pp
     |_ services.pp
     |_ services/
        |_ encoder.pp
     |_ webserver.pp
     |_ webserver/
        |_ production.pp
     |_ users/
        |_ virtual.pp
  |_ /etc/puppet/teams/t1000/templates/
     |_ apache2/
        |_ virtualhost.conf.erb
For heavily involved teams the "services" classes are here to enable them to manage their own software, code deployments and similar tasks.

- Inheritance


Understand class inheritance, and use it to abstract your code to allow for black-sheep servers.

These servers are always present - that one server in 20 which does things "just a little differently".
# - teams/t1000/manifests/init.pp
class t1000 {
    include ::iptables

    class { '::localtime': timezone => 'Etc/UTC', }

    include t1000::users::virtual
}
# - teams/t1000/manifests/webserver.pp
class t1000::webserver inherits t1000 {
    include ::apache2

    ::apache2::config { 't1000-webcluster':
        keep_alive_timeout  => 10,
        keep_alive_requests => 300,
        name_virtual_hosts  => [ "${ipaddress_eth1}:80", ],
    }
}
# - teams/t1000/manifests/webserver/production.pp
class t1000::webserver::production inherits t1000::webserver {
    include t1000::services::encoder

    ::apache2::vhost { 'foobar.com':
        content => 't1000/apache2/virtualhost.conf.erb',
        options => {
            'listen'  => "${ipaddress_eth1}:80",
            'aliases' => [ 'prod.foobar.com', ],
        },
    }
}
Understand how resources are inherited across classes. This will not work:
# - teams/t1000/manifests/webserver/legacy.pp
class t1000::webserver::legacy inherits t1000::webserver {
    include ::nginx

    # No, you won't get away with it
    Service['apache2'] { ensure => 'stopped', }
}
Only a sub-class inheriting its parent class can override resources of that parent class. But this is not a deal breaker, once you understand it. Remember our "nfs::disable" class from an earlier example, which inherited its parent class "nfs" and proceeded to override a service resource?
# - teams/t1000/manifests/webserver/legacy.pp
class t1000::webserver::legacy inherits t1000::webserver {
    include ::nginx

    include ::apache2::disable
}
This was the simplest scenario. Consider these as well: a legacy server needs to run MySQL v5.1 in a cluster of v5.5 nodes, a server needs Nginx h264 streaming support compiled into the nginx binary and its provider is a special package, a server needs PHP 5.2 to run a legacy e-commerce system...

- Function-based classifiers


Export only bottom level classes of bottom level modules to the business, as node classifiers:
# - manifests/site.pp (or External Node Classifier)
node 'man0001' { include t1000::webserver::production }
This leaves system engineers free to define system policy with 100% flexibility, and allows them to handle complex infrastructure. They in turn must ensure the business is never lacking: a server either functions as a production webserver or it does not, and it must never include top level API classes.

- Dynamic arguments


Do not limit your templates to a fixed number of features.

Use hashes to support optional, arbitrary settings that can be passed on to resources in defines. When a developer asks for a new feature there is nothing to modify and nothing to re-factor; the options hash (in the earlier "apache2::vhost" example) is extended and the template is expanded as needed with new conditionals.
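
Here's a rough sketch of the idea with hypothetical names, not the real module discussed above; the define simply hands the hash to the template, and the template decides what to do with whatever keys are present:
# - modules/apache2/manifests/vhost.pp
define apache2::vhost (
    $content = 'UNSET',
    $options = {}
    ){

    file { "/etc/apache2/sites-available/${name}.conf":
        content => template($content),
        notify  => Service['apache2'],
    }
}
# - modules/apache2/templates/virtualhost.conf.erb
<VirtualHost <%= @options['listen'] %>>
    ServerName <%= @name %>
<% if @options['aliases'] -%>
    ServerAlias <%= @options['aliases'].join(' ') %>
<% end -%>
</VirtualHost>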

- Convergence


Embrace continuous repair. Design for it.

Is it to my benefit to go wild on class relationships to squeeze everything into a single Puppet run? If just one thing changes, the whole policy breaks apart. Instead, micro-manage class dependencies and resource requirements. If a webserver refused to start because a Syslog-ng FIFO was missing, we know it will succeed on the next run. Within a few runs we can deploy whole clusters across continents.

There is however one specific here which is not universal: a hosting operation needs to keep agent run intervals frequent to keep up with an endless stream of support requests. Different types of operations can get away with 45-60 minute intervals, and sometimes use them for one reason or another (i.e. scaling issues). I have followed the work of Mark Burgess (author of Cfengine) for years and agree with Cfengine's 5 minute intervals for just about any purpose.

- Configuration abstraction


Know how much to abstract, and where to draw the line.

Services like Memcache and MongoDB have a small set of run-time parameters. Their respective "*::config" defines can easily abstract their whole configuration files into a dozen arguments expanded into variables of a single template. Others like Redis support hundreds of run-time parameters, but if you consider that >80% of Redis servers run in production with default parameters, even 100 arguments accepted by "redis::config" is not too much. For any given server you will provide 3-4 arguments, the rest will be filled from default values, and yet when you truly need to deploy an odd-ball Redis server the flexibility to do so is there without the need to maintain a hundred redis.conf copies.
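
For any given server the call then stays small; an illustrative example with hypothetical argument names:
::redis::config { 'cache01':
    port             => 6380,
    maxmemory        => '16gb',
    maxmemory_policy => 'allkeys-lru',
}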

Services like MySQL and Apache2 can exist in an endless number of states, which cannot be abstracted. Or, to be honest, they can, but you will make your team miserable when you set out to make their jobs better. This is where you draw the line. For the most complex software distributions abstract only the fundamentals and commonalities needed to deploy the service. Handle everything else through "*::dotconf", "*::vhost", "*::mods" etc. defines.

- Includes


Make use of includes in services which support them, and fake them in those that don't.

Includes allow us to maintain small fundamental configuration files, which include site specific configuration from small configuration snippets dropped into their conf.d directories. This is a useful feature when trying to abstract and bring together complex infrastructures.

Services which do not support includes by default can fake them. Have the "*::dotconf" define install configuration snippets and then call an exec resource to assemble the primary configuration file from the individual snippets in the improvised conf.d directory (an alternative approach is provided by puppet-concat). This functionality also allows you to manage shared services across shared servers, where every team provides a custom snippet in their own repository. They all end up on the shared server (after review) without the need to manage a single file across many teams (opening all kinds of access-control questions).
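
A rough sketch of faking an include, for a hypothetical "foo" service that only reads a single /etc/foo/foo.conf:
# - modules/foo/manifests/dotconf.pp
define foo::dotconf (
    $content = 'UNSET'
    ){

    file { "/etc/foo/conf.d/${name}.conf":
        content => template($content),
        notify  => Exec['foo-assemble'],
    }
}
# - modules/foo/manifests/assemble.pp
class foo::assemble {
    # runs only when a snippet changes, then rebuilds the primary file
    exec { 'foo-assemble':
        command     => '/bin/sh -c "/bin/cat /etc/foo/conf.d/*.conf > /etc/foo/foo.conf"',
        refreshonly => true,
        notify      => Service['foo'],
    }
}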

- Service controls


Do not allow Puppet to become the enemy of the junior sysadmin.

Every defined type managing a service resource should include three mandatory arguments; let's call them onboot, autorestart and autoreload. On clustered setups it is not considered useful to bring broken or outdated members back into the pool on boot; it is also not considered useful to automatically restart such a service if it is detected as "crashed" while it's actually down for maintenance; and oftentimes it is not useful to restart such a service when a configuration change is detected (and in the process flush 120GB of data from memory).

Balance these arguments and provide sane defaults for every single service on its own merits. If you do not, downtime will occur. You will also have sysadmins stopping Puppet agents the moment they log in, naturally forgetting to start them again, and 2 weeks later you realize half of your cluster is not in compliance (Puppet monitoring is important, but that is an implementation detail).
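
A rough sketch of how such arguments can map onto a service resource; the names are illustrative and the configuration file is assumed to be declared elsewhere in the module:
# - modules/redis/manifests/service.pp
define redis::service (
    $onboot      = true,
    $autorestart = true,
    $autoreload  = false
    ){

    if $autorestart {
        # the agent will start the service again if it finds it stopped
        service { "redis_${name}": ensure => 'running', enable => $onboot, }
    } else {
        # only boot behaviour is managed, the run state is left alone
        service { "redis_${name}": enable => $onboot, }
    }

    if $autoreload {
        # wire the config file to the service only when automatic reloads are wanted
        File["/etc/redis/${name}.conf"] ~> Service["redis_${name}"]
    }
}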

- API documentation


Document everything. Use RDoc markup and auto-generate HTML with puppet doc.

At the top of every manifest: document every single class, every single define, every single one of their arguments, and every single variable they search for or declare, and provide multiple usage examples for each class and define. Finally, include contact information, a bug tracker link and any copyright notices.

Puppet includes a tool to auto-generate documentation from these headers and comments in your code. Have it run periodically to refresh your API documentation, and export it to your operations and development teams. It's not just a nice thing to do for them; it is going to save you from re-inventing the wheel on your Wiki system. Your Wiki now only needs the theory documented: what is Puppet, what is revision control, how to commit a change... and these bring me to the topics of implementation and change management, which are beyond the scope of design.
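
As an example, a periodic job along these lines can rebuild the HTML documentation (flags as in the Puppet 2.7/3.x era, paths are illustrative):
puppet doc --mode rdoc \
    --outputdir /srv/www/puppetdoc \
    --modulepath /etc/puppet/modules:/etc/puppet/teams \
    --manifestdir /etc/puppet/manifests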


Written by anrxc | Permalink | Filed under work, code

31.05.2013 23:11

Jacks and Masters

I haven't written a single word this year. It's a busy one for me, building, scaling and supporting more big sites. Interesting problems were solved, bugs were found... but still, I didn't feel like I stumbled onto anything worthy of publishing that hasn't been rehashed a thousand times already through blogs and eventually change-logs. But thinking about my lack of material while catching up on my podcast backlog gave me an idea: to write something about the sysadmin role in all this.

Many times in the last year I returned to two books as best practices guides for building and scaling web-apps. Those are Scalability Rules and Web Operations. I recommend these books to any sysadmin interested in web operations, as they share experiences from engineers working on the top sites and there's just no other way to gain these insights unless you join them.

That brings me back to my podcast backlog. Episode 38 of the DevOps Cafe had a very interesting guest (Dave Zwieback) who talked a lot about hiring sysadmins, and about generalist vs. specialist roles in systems administration today. I am a generalist, which in itself is fine, but there's a big difference between "jack of all trades, master of some" and being a "master of none". I've been thinking about it since the podcast, as I wasn't thrilled with what I had been doing lately, that is, jumping through a lot of new technologies to facilitate all kinds of new frameworks web developers use. Oftentimes that means skipping some kind of "natural" learning course; instead you learn just enough to deploy and manage it, while the real knowledge comes only after a certain period of time spent debugging problems when it breaks apart.

Now to tie both parts of the text together. If you want to join the site reliability engineers at one of the top sites, how do you justify drawing a blank when asked to explain how Varnish malloc storage works internally, while claiming you built huge caches with it? The Jack issue is amplified if you consider that there is now a first, or even a second, generation of sysadmins who have never stepped into a data-center and are missing out on hardware and networking experience. The appropriate name that comes to mind is "the cloud generation", and I'm a part of it.


Written by anrxc | Permalink | Filed under work, books

09.12.2012 05:24

Hybrid IRCD for Arch Linux

Hybrid IRCD has been a favorite of mine for many years. I tried it once because a Croatian IRC network ran it, and it stuck with me. I'm very happy to announce that Hybrid packages for Arch Linux are available in the AUR as of today. I worked on them as a side project for a while and finished today, thanks to the blizzard that kept me inside this weekend. The Hybrid server is available as ircd-hybrid, and Hybserv2 services are available as ircd-hybrid-serv. They adhere to the standards set by all other ircd providers, the default configuration for both is usable out of the box, and examples for connecting the services to the server are included. They were built and tested on both architectures; the only component not tested by me is the systemd service files.


Written by anrxc | Permalink | Filed under main, code

09.12.2012 04:58

GNU/Linux and ThinkPad T420

I got a new workstation last month, a 14" laptop from the ThinkPad T series. The complete guide for TuxMobil about installing Arch Linux on it is here.

It replaced a (thicker and heavier) 13" HP ProBook 4320s which I used for a little over a year before giving up on it. In some ways the ProBook was excellent: certified for SUSE Linux, it had complete Linux support down to the most insignificant hardware components. In other ways it was the worst laptop I have ever used. That ProBook series has chiclet-style keyboards, and I had no idea just how horrible they can be. Completely flat keys, widely spread and with bad feedback, caused me a lot of wrist pain. Even after a year I never got used to the keyboard, and I was making a lot of typos; on average I would mistype even my login every second boot. At the most basic level my job can be described as "typist", so all this is just plain unacceptable.

The touchpad, however, is even worse than the keyboard. It's a "clickpad", with one big surface serving as both the touchpad area and the button area. To get it into a usable state a number of patches are needed, coupled with extensive user-space configuration. But even after a year of tweaking it was never just right. The most basic of operations, like selecting text, dragging windows or pressing the middle button, is an exercise in patience. Sadly clickpads are present in a huge number of laptops today.

Compared to the excellent UltraNav device in the ThinkPad, they are worlds apart. The same is true of the keyboard in the T420, which is simply the best laptop keyboard I've ever used. I stand behind these words, as I just ordered another T420 for personal use. One could say these laptops are in different categories, but that's not entirely true: I had to avoid the latest ThinkPad models because of the chiclet-style keyboards they now have. Lenovo is claiming that's "keyboard evolution"; to me they just seem cheaper to produce, and this machine could be the last ThinkPad I'll ever own. If this trend continues I don't know where to turn next for decent professional-grade hardware.


Written by anrxc | Permalink | Filed under main, desktop, work

01.10.2012 01:53

Net-installing Arch Linux

Recently I had to figure out the most efficient way of net-installing Arch Linux on remote servers, one that fits into our existing deployment process for many other operating systems, which runs DHCP and TFTP daemons serving various operating system images.

The Arch Linux PXE wiki put me on the right track and I downloaded the archboot-x86_64 ISO, which I temporarily mounted so I could copy the key parts of the image:

# wget http://mirrors.kernel.org/archlinux/iso/archboot/2012.06/archlinux-2012.06-1-archboot-x86_64.iso 
# mkdir /mnt/archiso
# mount -o loop,ro archlinux-2012.06-1-archboot-x86_64.iso /mnt/archiso
Let's say the TFTP daemon serves images using pxelinux, chrooted in /srv/tftpboot. The images are stored in the images/ sub-directory and the top level pxelinux.cfg configuration gets copied from the appropriate images/operating-system/ directory automatically based on the operating system selection in the provisioning tool:
# mkdir -p images/arch/arch-installer/amd64/
# cp -ar /mnt/archiso/boot/* images/arch/arch-installer/amd64/
The boot directory of the archboot ISO contains the kernel and initrd images, and a syslinux installation. I proceeded to create the pxelinux configuration to boot them, ignoring syslinux:
# cd images/arch/
# mkdir arch-installer/amd64/pxelinux.cfg/
# emacs arch-installer/amd64/pxelinux.cfg/default

  prompt 1
  timeout 1
  label linux
    kernel images/arch/arch-installer/amd64/vmlinuz_x86_64
    append initrd=images/arch/arch-installer/amd64/initramfs_x86_64.img gpt panic=60 vga=normal loglevel=3

# ln -s arch-installer/amd64/pxelinux.cfg ./pxelinux.cfg
To better visualize the end result, here's the final directory layout:
arch-installer/
arch-installer/amd64/
arch-installer/amd64/grub/*
arch-installer/amd64/pxelinux.cfg/
arch-installer/amd64/pxelinux.cfg/default
arch-installer/amd64/syslinux/*
arch-installer/amd64/initramfs_x86_64.img
arch-installer/amd64/vmlinuz_x86_64
arch-installer/amd64/vmlinuz_x86_64_lts
pxelinux.cfg/
pxelinux.cfg/default
I left open the possibility of including i686 images in the future, but that is not likely ever to happen due to the almost non-existent demand for this operating system on our servers. Because of that I didn't spend any time on further automation, like automated RAID assembly or package pre-selection. On the servers I deployed, assembling big RAID arrays manually was tedious, but really nothing novel compared to the dozens you have to rebuild or create every day.

From a fast mirror the base operating system installs from the Arch [core] repository in a few minutes, and support for a variety of boot loaders is included, with my favorite being syslinux, which on Arch Linux has an excellent installer script, "syslinux-install_update", with RAID auto-detection. I also like the fact that the 2012.06-1 archboot ISO still includes the curses menu-based installer, which was great for package selection and for the step where the base configuration files are listed for editing. Supposedly the latest desktop images now only have helper scripts for performing installations - but I wouldn't know for sure, as I haven't booted an ISO in a long time; Arch is an operating system you install only once, the day you buy the workstation.

Another good thing, purely from the deployment standpoint, is the rolling-release nature, as the image can be used to install the latest version of the operating system at any time. Or at least until the systemd migration, which might obsolete the image, but I dread that day for other reasons - I just don't see its place on servers, or in our managed service with dozens of proprietary software distributions. But right now we can deploy Arch Linux halfway around the globe in 10 minutes, life is great.


Written by anrxc | Permalink | Filed under work

26.08.2012 03:35

More on Redis

In managed hosting you're not often present in the design stages of new applications, and sometimes you end up supporting strange infrastructure. Or at least that was my experience in the past. So little by little I found myself supporting huge (persistent) Redis databases, against my better judgment.

Someone sent me a link to the Redis Sentinel beta announcement last month. It may even make it into the 2.6 release... but I had to implement all of this on my own long ago. A lot of developers I supported didn't even want to use the 2.4 branch (in my opinion just the memory fragmentation improvements are more than enough reason to ditch 2.2 forever). Another highly anticipated Redis feature, the Redis Cluster, may not even make it into the 2.6 release. That's too bad; there are too many Redis features that are always "just around the corner", yet I have a feeling I'll be supporting Redis 2.4 for at least another 3 years, with all its flaws and shortcomings (I scratched the surface in my last article with AOF corruption, and the not-so-cheap hardware needed for reliable persistent storage).

Typically I would split members of a Redis cluster across 2 or more power sources and switches. But that's just common sense for any HA setup, as is not keeping all your masters together. Redis doesn't have multi-master replication, so a backup 'master' is always passive, and is just another slave of the primary with slaves of its own. If the primary master fails, only half of the slaves have to be failed over to the backup master. This has its problems (i.e. double the complexity of replication monitoring by the fail-over agents), but the benefits outweigh failing over a whole cluster to the new master. That could take half a day, as fail-over is an expensive operation (it is a full re-sync from the master). You can find replication implementation details here.
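
As an illustration, this is the kind of check a fail-over agent can run against each node; the host is made up, while the role and master_link_status fields are part of the standard INFO output:
redis-cli -h 192.168.1.20 -p 6379 info | egrep 'role:|master_link_status:|master_last_io'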

If you can't allow slaves to serve stale data (tunable in redis.conf) you need enough redundancy in the pool to be able to run at half capacity for at least a few hours, until at least one of the outdated slaves is fully re-synced to its new master. And that finally brings me to knowing when the right time to fail over is.

Any half-decent load balancer can export the status of a backend pool through an API, or just an HTTP page (if yours can't, it's time to use the open-source HAproxy). That information is ripe for exploiting to our advantage, but we need to be wary of false positives. I can't share my own solutions, but you will want all N slaves confirming that the master pool is truly degraded, and to initiate fail-overs one by one to avoid harmonics if you are serving stale data, or all at once if you aren't. For all that you will need them to communicate with each other, and a simple message broker can do the job well.
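
For example, with HAproxy and a stats socket configured in haproxy.cfg ("stats socket /var/run/haproxy.stat"), each agent can pull the state of an illustrative backend pool itself:
echo "show stat" | socat stdio /var/run/haproxy.stat | grep '^redis-masters,'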

As I am writing these last notes I realize I haven't mentioned another fundamental part of any Redis deployment I do - backups. This article documents the persistence implementation in Redis, and explains that RDB engine snapshots are good for taking backups. The RDB file is never modified directly, and snapshots are renamed into their final destination atomically only when they are complete. From there it's trivial to write a script that initiates a background save, waits until it's done, and transfers the fresh snapshot off site.
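
A rough sketch of such a script, run on the Redis node itself; the snapshot path and destination are illustrative, adjust them to the "dir" setting in redis.conf:
#!/bin/sh
DUMP=/var/lib/redis/dump.rdb
DEST=backup01:/srv/backups/redis

before=$(redis-cli LASTSAVE)
redis-cli BGSAVE >/dev/null

# LASTSAVE returns the time of the last successful save, it changes once
# the background save we just triggered completes
while [ "$(redis-cli LASTSAVE)" = "$before" ]; do
    sleep 10
done

rsync -a "$DUMP" "$DEST/dump-$(date +%F).rdb"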


Written by anrxc | Permalink | Filed under work

05.03.2012 00:06

Infrastructure you can blog about

I spent the last 5 months planning and building new infrastructure for one of the biggest websites out there. I was working around the clock while developers were rewriting the site, throwing away an ancient code base and replacing it with a modern framework. I found no new interesting topics to write about in that time, being completely focused on the project, while the RSS feed of this journal was constantly the most requested resource on the web server. I'm sorry there was nothing new for you there. But I learned some valuable lessons during the project, and they might be interesting enough to write about. Everything I learned about Puppet, which was also a part of this project, I shared in my previous entry. I'll focus on other parts of the cluster this time.

Here's a somewhat simplified representation of the cluster:
[Network diagram]

Following the traffic path, the first thing you may ask yourself is "why is Varnish Cache behind HAproxy?". Indeed, placing it in front in order to serve as many clients as soon as possible is logical. Varnish Cache is good software, but often unstable (the developers are very quick to fix bugs given a proper bug report, I must say). Varnish Cache plugins (so-called vmods) are even more unstable, crashing Varnish often and degrading cache efficiency. This is why HAproxy in front is imperative, to route around crashed instances. And it's the same old HAproxy that has proven itself balancing numerous high-availability setups. Also, Varnish Cache as a load balancer is a nice try, but I won't be using it as such any time soon. Another thing you may ask is "how is Varnish Cache logging requests to Syslog when it has no Syslog support?". I found FIFOs work well enough - and remember the traffic is enormous, so that says a lot.
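
A rough sketch of the FIFO approach, with illustrative paths and syslog facility; syslog-ng can also read a FIFO directly with a pipe() source:
mkfifo /var/run/varnish-access.fifo
# a reader has to be attached before varnishncsa opens the FIFO for writing
logger -t varnish -p local5.info < /var/run/varnish-access.fifo &
varnishncsa -a -w /var/run/varnish-access.fifo &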

Even though Rsyslog has the more mature threaded implementation, I can't see myself using it over syslog-ng on big log servers in the near future. Hopefully threaded syslog-ng only gets better, resolving this dilemma for me for all time. The configuration of rsyslog feels awkward (though admittedly syslog-ng is not a joy to configure either). The version packaged in Debian stable has bugs, one of which made it impossible to apply different templates to different network sources, which is a huge problem when it's going to be around for years. I had to resort to packaging my own, but ultimately dropped it completely for the non-threaded syslog-ng, which is working pretty well.

The last thing worth sharing is my Redis experience. It's really good software (e.g. as an alternative to Memcached), but ultimately I feel disappointed with the replication implementation. Replication, with persistence engines in use and with databases over 25GB in size, is a nightmare to administrate. When a slave (re)connects to a master it initiates a SYNC, which triggers a SAVE on the master, and a full resync is performed. This is an extremely expensive operation, and it makes cluster-wide automatic fail-over to a backup master very hard to implement right. I've also experienced AOF corruption which could not be detected by redis-check-aof. This makes BGREWRITEAOF jobs critical to execute regularly, but with big databases this is another extremely expensive operation, especially if performed under traffic. The following has proven itself the best solution for high-performing Redis servers: 4x147GB 15k SAS disks in (h/w) RAID10, and Xeon 5000 series CPUs.
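
For the AOF rewrites even a plain cron entry on each node, scheduled for the lowest traffic window, goes a long way; an illustrative example:
# rewrite the append-only file nightly, outside of peak traffic
30 4 * * * /usr/bin/redis-cli bgrewriteaof >/dev/null 2>&1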

While working on this, the running joke was that I'm building infrastructure you can blog about (but otherwise do little else with). But it does do a little more than just look impressive on paper.


Written by anrxc | Permalink | Filed under work