16.01.2021 03:25
FreeBSD pkg signing with an agent
Preamble: if you want the code and don't care for my ramblings here you go -
http://git.sysphere.org/freebsd-pkgsign/
Coming from GNU/Linux, where gpg-agent was available to
facilitate key management when signing repositories or packages, I
missed that feature. FreeBSD however uses SSL keys, not GPG. But those
keys can be read by the ssh-agent, and we can work with
that. The recent SolarWinds supply chain attack is a good reminder to
safeguard your software delivery pipeline.
If you announced PUBKEY repositories to your users or
customers up until this point, you will have to switch them to
FINGERPRINTS instead, in order to utilize the
pkg-repo(8) support for an external
signing_command.
The Python Paramiko library makes communication with an agent
simple, and it is readily available as the py37-paramiko
package (or port), so I went with that. There was however a small
setback (with RSA sign flags), but more about that at the bottom of the
article. If you would prefer a simpler implementation of the agent
protocol, and a self-sufficient tool, I found sshovel to be
pretty good (and confirmed signing is implemented well enough to work
for this purpose). I didn't have time to strip out its (now unnecessary)
encryption code, and more importantly didn't have time to port
sshovel to Python 3 (as Python 2 is deprecated in FreeBSD).
We are all used to digests of public keys serving as fingerprints and
identifiers. However, Paramiko derives fingerprints from the public key
in the SSH format. For simplicity I decided to go with the flow and
reference keys by Paramiko fingerprints. The "--dump"
argument is implemented as a helper in pkgsign to list
Paramiko fingerprints of all keys found in ssh-agent. But
before we dump fingerprints: if your keys sit on the file-system
without a passphrase (which they really shouldn't), it's time to put
a passphrase on them now (and don't forget to shred the old
copies). Here's a crash course on ssh-agent operation, and how
to get pkgsign to connect to it:
$ ssh-agent -t 1200s >~/.ssh/ssh-agent.info
$ source ~/.ssh/ssh-agent.info
$ ssh-add myprivatekey.key.enc
Enter passphrase: [PASSPHRASE]
$ ./pkgsign --dump
INFO: found ssh-agent key [FINGERPRINT]

If you wanted to automate key loading through some associative array
etc., it would be beneficial to rename your private key to match the
fingerprint, but you don't have to. For the public key however it is
expected (unless you change the default behavior). This is because
converting the public key obtained directly from the agent to the PEM
pkcs8 format (which pkg-repo(8) expects in return) would be
more code than this entire thing. It is much simpler to just read the
public key from the file-system and be done with it.
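As an aside, a Paramiko fingerprint of an agent key is just the MD5
digest of the public key blob in SSH wire format. A minimal sketch of
how a "--dump" style listing could derive it (the helper name is mine,
the real pkgsign may differ):

```python
import hashlib
from binascii import hexlify

def paramiko_fingerprint(key_blob: bytes) -> str:
    """MD5 digest of the public key blob as returned by ssh-agent;
    the same value paramiko.PKey.get_fingerprint() is built on."""
    return hexlify(hashlib.md5(key_blob).digest()).decode()

# With a live agent it would be used roughly like this:
#   from paramiko.agent import Agent
#   for key in Agent().get_keys():
#       print(paramiko_fingerprint(key.asbytes()))
```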
# ln -s /usr/local/etc/ssl/public/mypublickey.pub /usr/local/etc/ssl/public/[FINGERPRINT].pub

The ownership/permissions/chflags scheme on the encrypted private key
and its parent directories is up to you. Or plug it in on external
media, or a cryptokey, or scp it in only when needed, or shut down the
signing server after signing... This is crucial: agent availability is
an improvement, but don't get complacent because of it.
When pkg-repo(8) is used with signing_command, the data to
be signed is piped to the specified command. In addition pkgsign
expects a fingerprint passed to it as an argument. Why all this messing
around with fingerprints at all? Because the ability to use different
keys for different repositories is important, because it aids
automation, and because you don't want your repository signed by some
OpenSSH key by mistake. To explore some possibilities let's consider
this simplified cog of an imaginary automated system:
#!/usr/bin/env bash
declare -A REPO_KEYS
REPO_KEYS['xfce']=FINGERPRINT11111111111111111111
REPO_KEYS['gnome']=FINGERPRINT22222222222222222222

# /path/to/repos/xfce/FreeBSD:12:amd64/
ARG=$1
SOFTWARE_DISTRIB="${ARG%/*/}"
SOFTWARE_DISTRIB="${SOFTWARE_DISTRIB##/*/}"
SOFTWARE_DISTRIB_KEY="${REPO_KEYS[$SOFTWARE_DISTRIB]}"

/usr/sbin/pkg repo $ARG \
    signing_command: ssh signing-server /path/to/pkgsign ${SOFTWARE_DISTRIB_KEY}

How to bootstrap your users, or convert existing ones to the new
repository format, is explained in the manual very well, but let's go
over it anyway. Since the command to generate the fingerprint may look
intimidating to users you could instead opt to pregenerate it and host
it alongside the public key:
# mkdir -p /usr/local/etc/pkg/keys
# mkdir -p /usr/local/etc/pkg/fingerprints/YOURORG/trusted
# mkdir -p /usr/local/etc/pkg/fingerprints/YOURORG/revoked
# fetch -o "/usr/local/etc/pkg/keys/YOURORG.pub" https://www2.you.com./YOURORG.pub
# sh -c '( echo "function: sha256"; \
    echo "fingerprint: $(sha256 -q /usr/local/etc/pkg/keys/YOURORG.pub)"; ) \
    >/usr/local/etc/pkg/fingerprints/YOURORG/trusted/fingerprint'
# emacs /usr/local/etc/pkg/repos/YOURORG.conf
...
#signature_type: "PUBKEY",
#pubkey: "/usr/local/etc/pkg/keys/YOURORG.pub",
signature_type: "FINGERPRINTS",
fingerprints: "/usr/local/etc/pkg/fingerprints/YOURORG",
...

If you want to evaluate pkgsign with OpenSSL pkeyutl first, to confirm
all of this is possible, you can do so for example like this (but only
after patching Paramiko as explained in the paragraph below):
$ echo -n "Hello" | \
    openssl dgst -sign myprivatekey.key.enc -sha256 -binary >signature-cmp
$ echo Hello | \
    ./pkgsign --debug [FINGERPRINT] >/dev/null
$ echo -n "Hello" | \
    openssl sha256 -binary | openssl pkeyutl -verify -sigfile signature-Hello \
    -pubin -inkey mypublickey.pub -pkeyopt digest:sha256
Signature Verified Successfully

Now for the bad news. To make this project happen I had to patch
Paramiko to add support for RSA sign flags. I submitted the patch
upstream but haven't heard anything back yet. It would be nice of them
to accept it, but even if it takes a very long time the changes are
luckily very minor, and it is trivial to keep carrying them forward in
a py37-paramiko port.
--- paramiko/agent.py	2021-01-15 23:03:50.387801224 +0100
+++ paramiko/agent.py	2021-01-15 23:04:34.667800388 +0100
@@ -407,12 +407,12 @@
     def get_name(self):
         return self.name
 
-    def sign_ssh_data(self, data):
+    def sign_ssh_data(self, data, flags=0):
         msg = Message()
         msg.add_byte(cSSH2_AGENTC_SIGN_REQUEST)
         msg.add_string(self.blob)
         msg.add_string(data)
-        msg.add_int(0)
+        msg.add_int(flags)
         ptype, result = self.agent._send_message(msg)
         if ptype != SSH2_AGENT_SIGN_RESPONSE:
             raise SSHException("key cannot be used for signing")
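To illustrate what the new flags argument carries on the wire, here is
a hand-rolled sketch of the SSH2_AGENTC_SIGN_REQUEST message the
patched method sends. Flag value 2 corresponds to
SSH_AGENT_RSA_SHA2_256 in the SSH agent protocol draft, requesting an
rsa-sha2-256 signature instead of the legacy SHA-1 one. This is an
independent illustration, not pkgsign code:

```python
import struct

SSH2_AGENTC_SIGN_REQUEST = 13
SSH_AGENT_RSA_SHA2_256 = 2   # request an rsa-sha2-256 signature
SSH_AGENT_RSA_SHA2_512 = 4   # or rsa-sha2-512

def _string(b: bytes) -> bytes:
    # SSH wire-format string: uint32 length followed by the bytes
    return struct.pack(">I", len(b)) + b

def sign_request(key_blob: bytes, data: bytes,
                 flags: int = SSH_AGENT_RSA_SHA2_256) -> bytes:
    """Build the SSH2_AGENTC_SIGN_REQUEST frame that would be written
    to the agent socket; with flags=0 you get the old SHA-1 behavior."""
    body = (bytes([SSH2_AGENTC_SIGN_REQUEST])
            + _string(key_blob)
            + _string(data)
            + struct.pack(">I", flags))
    # the full frame on the wire is length-prefixed as well
    return _string(body)
```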
06.03.2019 22:22
My new hobby
A few years ago, sitting in an emergency room, I realized I'm not
getting any younger and if I want to enjoy some highly physical
outdoor activities for grownups these are the very best years I have
left to go and do them. Instead of aggravating my RSI with further
repetitive motions on the weekends (i.e. trying to learn how to suck
less at programming) I mostly wrench on an old BMW coupe and drive it
to the mountains (documenting that journey, and the discovery of
German engineering failures, was best left to social media and
enthusiast forums).
Around the same time I switched jobs, and the most interesting stuff I
encounter now is exactly what I can't write about, because it would
disclose too much about our infrastructure. If you are interested in
HAProxy for the enterprise you can follow development on the official
blog.
22.10.2014 22:51
SysV init on Arch Linux, and Debian
Arch Linux distributes
systemd as its init daemon, and has deprecated SysV init in
June 2013. Debian is doing the same now, and we see panic and terror
sweep through that community, especially since this time thousands of
my sysadmin colleagues are affected. But as with Arch Linux we are
witnessing irrational behavior, loud protests all the way to
the BSD camp, and public threats
of Debian forking. Yet all that
is needed, and let's face it much simpler to achieve, is organizing a
specialized user group interested in keeping SysV (or your
alternative) usable in your favorite GNU/Linux distribution, with
members that support one another, exactly as
I
wrote back then about Arch Linux.
Unfortunately I'm not aware of any such group forming in the Arch
Linux community
around sysvinit,
and I've been running SysV init alone as my PID 1 since then. It was
not a big deal, but I don't always have time or the willpower to break
my personal systems after a 60-hour work week, and the real problems
are yet to come anyway - if
(when),
for example,
udev stops working without systemd as PID 1. If you had a
support group, especially one with a few coding gurus among you,
chances are they would solve a difficult problem
first, and everyone benefits. On other occasions an enthusiastic
user would solve it first, saving the gurus from a lousy weekend.
For anyone else left standing in the cheapest part of the stadium,
like me, maybe uselessd
as a drop-in replacement is the way to go after major subsystems stop
working in our favorite GNU/Linux distributions. I personally like
what they reduced systemd to (inspired by the
suckless.org philosophy?), but
chances are that without support the project ends inside 2 years, and
we would be back here duct-taping in isolation.
11.01.2014 23:00
Load balancing Redis and MySQL with HAproxy
It's a common occurrence to have two and more load balancers as HA
frontends to databases at high traffic sites. I've used the
open-source HAproxy like this, and
have seen others use it. Building this infrastructure and getting the
traffic distributed evenly is not really the topic I'd like to write
about, but what happens after you do.
Using HAproxy like this in front of replicated database
backends is tricky: a flap on one part of the network can make one or
more frontends activate the backup backends. Then you have a
form of split-brain scenario on your hands, with updates occurring
simultaneously to all masters in a replicated set. Redis doesn't do
multi-master replication, and it's even easier to get in trouble, with
just one HA frontend, if the old primaries happen to be reactivated
before you synced them with the new ones.
One way to avoid this problem is building smarter
infrastructure. Offloading health checks and role directing to an
independent arbiter. But having one makes it a single point of
failure, having more makes it another replicated nightmare to solve. I
was never keen on this approach because solving it reliably is an
engineering challenge each time, and I have the good sense of knowing
when it can be done better by smarter people.
Last year I was pestering HAproxy developers to implement cheap
features as a start: say, a new special directive to keep the old
primary permanently offline after a fail-over to backup
happens, which would be more reliable than gaming health-check
counters. The request was of course denied; they are not in it to write
hacks. They always felt agents are the best approach, and that
the Loadbalancer.org associates might even come up with a
common 'protocol' for health and director agents.
But the developers heard my case, and I presume those of others who
discussed the same infrastructure. HAproxy 1.5, which is about to be
released as the new stable branch (source: mailing list), implements
peering. Peering
works with the help of stick-tables, whose other improvements will
bring many advancements to handling bad and unwanted traffic, but
that's another topic
(see HAproxy
blog).
Peering synchronizes server entries in stick-tables between many
HAproxy instances over TCP connections, and a backend failing health
checks on one HA frontend will be removed from all. Using
documentation linked above here's an example:
peers HAPEERS
    peer fedb01 192.168.15.10:1307
    peer fedb02 192.168.15.20:1307

backend users
    mode tcp
    option tcplog
    option mysql-check user haproxy
    stick-table type ip size 20k peers HAPEERS
    stick on dst
    balance roundrobin
    server mysql10 192.168.15.33:3306 maxconn 500 check port 3306 inter 2s
    server mysql12 192.168.15.34:3306 maxconn 500 check port 3306 inter 2s backup

#backend uploads

When talking about Redis in particular I'd like to emphasize the
improvements in HAproxy 1.5 health checks, which allow us to query
Redis nodes about their role directly, and fail over only if a backend
became the new master. If Redis Sentinel is enabled and the cluster
elects a new master, HAproxy will fail traffic over to it
transparently. Using the documentation linked above, here's an example:
backend messages
    mode tcp
    option tcplog
    option tcp-check
    #tcp-check send AUTH\ foobar\r\n
    #tcp-check expect +OK
    tcp-check send PING\r\n
    tcp-check expect +PONG
    tcp-check send info\ replication\r\n
    tcp-check expect string role:master
    tcp-check send QUIT\r\n
    tcp-check expect string +OK
    server redis15 192.168.15.40:6379 maxconn 1024 check inter 1s
    server redis17 192.168.15.41:6379 maxconn 1024 check inter 1s
    server redis19 192.168.15.42:6379 maxconn 1024 check inter 1s
29.10.2013 22:11
Nginx PID file handling and HA
Nginx by some accounts now serves most of the world's top sites, and
has become an enterprise product, so I was very surprised when I
couldn't find a single mention of a problem in PID file handling that
I've been observing for a while.
On a restart the old Nginx 'master' process can remain active
for some time, until all active connections on it close and it
terminates. When it finally exits it deletes the PID file from the
file-system, but that file no longer belongs to it; it belongs to the
new Nginx 'master' process, which was spawned and already
wrote its own PID into the file (unless you prevented it
from starting in the first place while a PID file exists on the
file-system).
This leads to many issues down the road; here are some of the most
severe that I experienced. The monitoring system alarming at 3 in the
morning about full RAID arrays, when in reality Nginx kept open
file-descriptors on huge logs deleted long ago; log rotation jobs
simply failed to send USR1 to it, since there was no PID file
on the file-system. Then failures, from sysadmins and configuration
management agents alike, to activate new configuration by reloading (or
again restarting) the Nginx service: signals being sent into the
aether, since there is no PID file on the file-system. That's where
most of my surprise came from; how in the world is everyone else
successfully automating their web farms when 10% of your configuration
updates fail to apply on average? What, that's only 100 servers when
you are past 1000 nodes...
Nginx developers proclaimed this a feature and invalidated the bug
report. Officially this is the intended behavior: "you start new
nginx instance before old nginx instance was terminated. It's not
something allowed - you have to wait for an old master process to exit
before starting a new one.". That's acceptable to me, but then I
wonder: how in the world is everyone else successfully doing
high-availability with their web farms? If you have a CDN origin and
edge nodes are pulling 2GB videos from it, those connections are going
to take a while to close; meanwhile your origin is failing all the
health checks from your HA frontends and it gets failed out...
The final official solution is that Nginx should never, ever be
restarted. Every configuration update can be applied by a reload (send
HUP to the master process). Unfortunately that doesn't work
in practice (how in the world is...); Nginx fails to apply many
configuration changes on a reload in my experience. If that is the
true bug I sometimes hit (i.e. a new FastCGI caching policy failed to
activate, new SSL certificates failed to activate, etc.) I
understand it and I accept it. However I remain of the opinion that
smarter PID file handling is a simple fix, and a useful thing to
have.
What can you do in this situation to avoid 3AM wake-up calls for a
false positive, while not giving up HA? The init script can maintain
its own PID file, a clone of the one the Nginx 'master'
created at the time it started, and rely on it for all future actions;
so can your log rotation jobs. This hack will certainly never be
distributed by an OS distribution - but many operations already
package their own Nginx because of all the extra modules modern web
servers require (media streaming, Lua scripting, Real IP...).
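A minimal sketch of the hack in Python, with invented paths and helper
names; a real init script would do the same in shell:

```python
import os
import shutil

# Hypothetical paths; an init script would typically use /var/run.
PIDFILE = "/var/run/nginx.pid"      # written (and later deleted) by nginx
CLONE = "/var/run/nginx.init.pid"   # our private copy, owned by the script

def snapshot_pidfile(pidfile=PIDFILE, clone=CLONE):
    """Copy the master's PID file right after startup, before the old
    master gets a chance to delete it on its delayed exit."""
    shutil.copyfile(pidfile, clone)

def signal_master(sig, clone=CLONE):
    """Send a signal (e.g. SIGHUP for reload, SIGUSR1 for log rotation)
    using the clone, which survives the original's deletion."""
    with open(clone) as f:
        pid = int(f.read().strip())
    os.kill(pid, sig)
```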
29.06.2013 23:01
SysV init on Arch Linux
Arch Linux
distributes systemd as its init daemon, and has finally
deprecated SysV this June. I could always appreciate the
elegance in Arch's simple design, and its packaging framework. Both
make it trivial for any enthusiast to run their own init daemon,
be it openrc, upstart or SysV. To my surprise this
didn't seem to be the prevailing view, and many users converted their
workstations to other distributions. This irrational behavior also led
to censorship on the users mailing
lists, which made it impossible to reach out to other UNIX enthusiasts
interested in keeping SysV usable as a specialized
(and unofficial) user group.
When rc.d scripts started disappearing from official
packages, I rescued those I could and packaged them
as rcdscripts-aic. There
was no user group, just me, and in expectation of other rc.d
providers I added my initials as the suffix to the package, and made a
decision not to monopolize /etc/rc.d/
or /usr/share/rcdscripts/ to avoid conflicts. No
other provider ever showed up, but I still
use /usr/share/rcdscripts-aic/ without strict guidelines on how
to make use of the scripts in that directory (copy or symlink
to /etc/rc.d/ and /etc/conf.d/?).
Later this month Arch Linux also
deprecated
directories /bin, /sbin and /usr/sbin in
favor
of /usr/bin. Since initscripts
was at this point an obsolete, unsupported and unmaintained piece of
code, SysV became unusable. Again with no other provider available to
me I forked, and
packaged initscripts-aic. At
least sysvinit
found a maintainer and I didn't have to take over that as well.
The goal is providing a framework around SysV init for hobbyists and
UNIX enthusiasts to boot their SysV systems. A stable basic boot is the
priority for me, new features are not. There is no systemd revolution
here; I do not wish to associate myself with any systemd trolling. I do
not want my packages censored and deleted from
the Arch User Repository.
24.06.2013 19:29
Hosting with Puppet - Design
Two years ago I was a small-time
Cfengine user moving
to Puppet on a large
installation, and more specifically introducing it to a managed
hosting provider (which is an important factor driving my whole design
and decision making process later). I knew how important it was going
to be to get the base design right, and I did a lot of research on
Puppet infrastructure design guidelines, but with superficial results.
I was disappointed; the DevOps crowd was producing tons of material
on configuration management, couldn't at least a small part be
applicable to large installations? I didn't see it that way then, but
maybe that knowledge was being reserved for consulting gigs. After
criticizing, it is only fair that I write something of my own on the
subject.
First of all, a lot has happened since. Wikimedia decided
to release
all their Puppet code to the public. I learned a lot, even if most of
it was what not to do - but that was the true knowledge to be
gained. One of the most
prominent Puppet Forge
contributors, example42 labs,
released
the next
generation of their Puppet modules, and the quality has increased
immensely. The level of abstraction is high, and for the first time I
felt the Forge can possibly become a provider for me. Then 8
months ago the annual PuppetConf conference hosted engineers
from Mozilla
and Nokia
talking about design and scaling challenges they faced running Puppet
in a big enterprise. Someone with >2,000 servers sharing their
experiences with you, soak it up.
* Puppet design principles
Running Puppet in a hosting operation is a very specific use case. Most resources available to you will concern running one or two web applications, on a hopefully standardized software stack across a dozen servers all managed by Puppet. But here you are a level above that, running thousands of such apps and sites, across hundreds of development teams that have nothing in common. If they are developing web-apps in Lisp you are there to facilitate it, not to tell stories about Python.
Some teams are heavily involved with their infrastructure, others depend entirely on you. Finally, there are "non-managed" teams which only need you to manage hardware for them, but you still want to provide them with a hosted Puppet service. All this influences my design heavily, but must not define it. If it works for 100 apps it must work for 1 just the same, so the design principles below are universal.
- Object oriented
Do not treat manifests like recipes. Stop writing node manifests. Write modules.
Huge manifests with endless instructions, if conditionals, and node (server) logic are a trap. They introduce an endless cycle of "squeezing in just one more hack" until the day you throw it all away and re-factor from scratch. This is one of the lessons I learned from Wikimedia.
Write modules (see Modular services and Module levels) that are abstracted. Within modules write small abstracted classes with inheritance in mind (see Inheritance), and write defined types (defines) for resources that have to be instantiated many times. Write and distribute templates where possible, not static files, to reduce chances of human error, to reduce number of files to be maintained by your team, and finally number of files compiled into catalogs (which concerns scaling).
Here's a stripped down module sample to clarify this topic, and those discussed below:
# - modules/nfs/manifests/init.pp
class nfs (
    $args = 'UNSET'
){
    # Abstract package and service names; Arch, Debian, RedHat...
    package { 'portmap':
        ensure => 'installed',
    }
    service { 'portmap':
        ensure => 'running',
    }
}

# - modules/nfs/manifests/disable.pp
class nfs::disable inherits nfs {
    Service['portmap'] {
        ensure => 'stopped',
    }
}

# - modules/nfs/manifests/server.pp
class nfs::server (
    $args = 'UNSET'
){
    package { 'nfs-kernel-server':
        ensure => 'installed',
    }
    @service { 'nfs-kernel-server':
        ensure => 'running',
    }
}

# - modules/nfs/manifests/mount.pp
define nfs::mount (
    $arg  = 'UNSET',
    $args = 'UNSET'
){
    mount { $arg:
        device => $args['foo'],
    }
}

# - modules/nfs/manifests/config.pp
define nfs::config (
    $args = 'UNSET'
){
    # configure idmapd, configure exports...
}
- Modular services
Maintain clear roles and responsibilities between modules. Do not allow overlap.
Maybe it's true that a server will never run PHP without an accompanying web server, but that's not a good reason to bundle PHP management into the apache2 module. The same principle prevents combining mod_php and PHP-FPM management into a single module. Write php5, phpcgi and phpfpm modules, and use them with the Apache2, Lighttpd and Nginx web servers interchangeably.
- Module levels
Exploit modulepath support. Multiple module paths are supported, they can greatly improve your design.
Reserve the default /etc/puppet/modules path for modules exposing the top level API (for lack of a better term). These modules should define your policy for all the software you standardize on, how a software distribution is installed and how it's managed: iptables, sudo, logrotate, dcron, syslog-ng, sysklogd, rsyslog, nginx, apache2, lighttpd, php5, phpcgi, phpfpm, varnish, haproxy, tomcat, fms, darwin, mysql, postgres, redis, memcached, mongodb, cassandra, supervisor, postfix, qmail, puppet itself, puppetmaster, pepuppet (enterprise edition), pepuppetmaster...
Use the lower level modules for defining actual policy and configuration for development teams in organizations (or customers in the enterprise), and their servers. Here's an example:
- /etc/puppet/teams/t1000/
  |_ /etc/puppet/teams/t1000/files/
     |_ php5/
        |_ apc.ini
  |_ /etc/puppet/teams/t1000/manifests/
     |_ init.pp
     |_ services.pp
     |_ services/
        |_ encoder.pp
     |_ webserver.pp
     |_ webserver/
        |_ production.pp
     |_ users/
        |_ virtual.pp
  |_ /etc/puppet/teams/t1000/templates/
     |_ apache2/
        |_ virtualhost.conf.erb

For heavily involved teams the "services" classes are here to enable them to manage their own software, code deployments and similar tasks.
- Inheritance
Understand class inheritance, and use it to abstract your code to allow for black-sheep servers.
These servers are always present - that one server in 20 which does things "just a little differently".
# - teams/t1000/manifests/init.pp
class t1000 {
    include ::iptables
    class { '::localtime':
        timezone => 'Etc/UTC',
    }
    include t1000::users::virtual
}

# - teams/t1000/manifests/webserver.pp
class t1000::webserver inherits t1000 {
    include ::apache2
    ::apache2::config { 't1000-webcluster':
        keep_alive_timeout  => 10,
        keep_alive_requests => 300,
        name_virtual_hosts  => [ "${ipaddress_eth1}:80", ],
    }
}

# - teams/t1000/manifests/webserver/production.pp
class t1000::webserver::production inherits t1000::webserver {
    include t1000::services::encoder
    ::apache2::vhost { 'foobar.com':
        content => 't1000/apache2/virtualhost.conf.erb',
        options => {
            'listen'  => "${ipaddress_eth1}:80",
            'aliases' => [ 'prod.foobar.com', ],
        },
    }
}

Understand how resources are inherited across classes. This will not work:
# - teams/t1000/manifests/webserver/legacy.pp
class t1000::webserver::legacy inherits t1000::webserver {
    include ::nginx
    # No, you won't get away with it
    Service['apache2'] {
        ensure => 'stopped',
    }
}

Only a sub-class inheriting its parent class can override resources of that parent class. But this is not a deal breaker once you understand it. Remember our "nfs::disable" class from an earlier example, which inherited its parent class "nfs" and proceeded to override a service resource?

# - teams/t1000/manifests/webserver/legacy.pp
class t1000::webserver::legacy inherits t1000::webserver {
    include ::nginx
    include ::apache2::disable
}

This was the simplest scenario. Consider these as well: a legacy server needs to run MySQL v5.1 in a cluster of v5.5 nodes, a server needs Nginx h264 streaming support compiled into the nginx binary and its provider is a special package, a server needs PHP 5.2 to run a legacy e-commerce system...
- Function-based classifiers
Export only bottom level classes of bottom level modules to the business, as node classifiers:
# - manifests/site.pp (or External Node Classifier)
node 'man0001' {
    include t1000::webserver::production
}

This leaves system engineers to define system policy with 100% flexibility, and allows them to handle complex infrastructure. They in turn must ensure the business is never lacking; a server either functions as a production webserver or it does not, and it must never include top level API classes.
- Dynamic arguments
Do not limit your templates to a fixed number of features.
Use hashes to add support for optional arbitrary settings that can be passed onto resources in defines. When a developer asks for a new feature there is nothing to modify and nothing to re-factor; the options hash (from the earlier "apache2::vhost" example) is extended, and the template is expanded as needed with new conditionals.
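The idea in miniature, outside Puppet: the options hash flows straight through to the template, so a new setting extends the hash and a template conditional, never the interface (all names here are invented for illustration):

```python
def render_vhost(name, options):
    """Expand optional settings from a hash; unknown keys need no
    interface change, only a new conditional in the template."""
    lines = [f"<VirtualHost {options.get('listen', '*:80')}>",
             f"    ServerName {name}"]
    # an optional setting: absent from the hash means absent from output
    for alias in options.get("aliases", []):
        lines.append(f"    ServerAlias {alias}")
    lines.append("</VirtualHost>")
    return "\n".join(lines)
```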
- Convergence
Embrace continuous repair. Design for it.
It may seem beneficial to go wild on class relationships and squeeze everything into a single Puppet run, but then the whole policy breaks apart the moment just one thing changes. Instead, micro-manage class dependencies and resource requirements. If a webserver refused to start because a Syslog-ng FIFO was missing, we know it will succeed on the next run. Within a few runs we can deploy whole clusters across continents.
There is however a specific here which is not universal: a hosting operation needs to keep agent run intervals frequent to keep up with an endless stream of support requests. Different types of operations can get away with 45-60 minute intervals, and sometimes use them for one reason or another (i.e. scaling issues). I followed the work of Mark Burgess (author of Cfengine) for years, and agree with Cfengine's 5-minute intervals for just about any purpose.
- Configuration abstraction
Know how much to abstract, and where to draw the line.
Services like Memcache and MongoDB have a small set of run-time parameters. Their respective "*::config" defines can easily abstract their whole configuration files into a dozen arguments expanded into variables of a single template. Others like Redis support hundreds of run-time parameters, but if you consider that >80% of Redis servers run in production with default parameters, even 100 arguments accepted by "redis::config" is not too much. For any given server you will provide 3-4 arguments, the rest will be filled from default values, and yet when you truly need to deploy an odd-ball Redis server the flexibility to do so is there, without the need to maintain a hundred redis.conf copies.
Services like MySQL and Apache2 can exist in an endless number of states, which cannot be abstracted. Or to be honest they can, but you make your team miserable when you set out to make their jobs better. This is where you draw the line. For the most complex software distributions abstract only the fundamentals and commonalities needed to deploy the service. Handle everything else through "*::dotconf", "*::vhost", "*::mods" etc. defines.
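Stripped of Puppet specifics, such a "redis::config" define is just a defaults merge: a handful of per-server arguments layered over a full default set. A toy illustration, with made-up parameter names and values:

```python
# A tiny subset of redis.conf defaults; the real define would carry a
# default for every supported run-time parameter.
REDIS_DEFAULTS = {
    "port": 6379,
    "maxmemory": "0",
    "appendonly": "no",
    "tcp-backlog": 511,
}

def redis_config(overrides):
    """Return the effective configuration: the few arguments a server
    actually passes, merged over the complete default set."""
    conf = dict(REDIS_DEFAULTS)
    conf.update(overrides)
    return "\n".join(f"{k} {v}" for k, v in sorted(conf.items()))
```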
- Includes
Make use of includes in services which support them, and those that don't.
Includes allow us to maintain small fundamental configuration files, which include site specific configuration from small configuration snippets dropped into their conf.d directories. This is a useful feature when trying to abstract and bring together complex infrastructures.
Services which do not support includes by default can fake them. Have the "*::dotconf" define install configuration snippets, and then call an exec resource to assemble the primary configuration file from the individual snippets in the improvised conf.d directory (an alternative approach is provided by puppet-concat). This functionality also allows you to manage shared services across shared servers, where every team provides a custom snippet in their own repository. They all end up on the shared server (after review) without the need to manage a single file across many teams (which opens all kinds of access-control questions).
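The assembly step of a faked conf.d boils down to "concatenate the snippets in lexical order"; this is what the exec resource would run, sketched here in Python with an illustrative directory layout:

```python
import os

def assemble(confd_dir, target):
    """Rebuild the primary configuration file from snippets dropped
    into an improvised conf.d directory, in lexical order (00-, 10-,
    20-...), so ordering between teams stays deterministic."""
    parts = []
    for name in sorted(os.listdir(confd_dir)):
        with open(os.path.join(confd_dir, name)) as f:
            parts.append(f.read())
    with open(target, "w") as f:
        f.write("".join(parts))
```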
- Service controls
Do not allow Puppet to become the enemy of the junior sysadmin.
Every defined type managing a service resource should include 3 mandatory arguments, let's call them: onboot, autorestart, and autoreload. On clustered setups it is not considered useful to bring back broken or outdated members into the pool on boot, it is also not considered useful to automatically restart such service if detected as "crashed" while it's actually down for maintenance, and often times it is not useful to restart such a service when a configuration change is detected (and in the process flush 120GB of data from memory).
Balance these arguments and provide sane defaults for every single service on its own merits. If you do not, downtime will occur. You will also have sysadmins stopping Puppet agents the moment they log in, naturally forgetting to start them again, and 2 weeks later you realize half of your cluster is not in compliance (Puppet monitoring is important, but is an implementation detail).
- API documentation
Document everything. Use RDoc markup and auto-generate HTML with puppet doc.
At the top of every manifest: document every single class, every single define, every single of their arguments, every single variable they search for or declare, and provide multiple usage examples for each class and define. Finally include contact information, bug tracker link and any copyright notices.
Puppet includes a tool to auto generate documentation from these headers and comments in your code. Have it run periodically refreshing your API documentation, and export it to your operations and development teams. It's not just a nice thing to do for them, it is going to save you from re-inventing the wheel on your Wiki system. Your Wiki now only needs the theory documented; what is Puppet, what is revision control, how to commit a change... and these bring me to the topics of implementation and change management, which are beyond the scope of design.
31.05.2013 23:11
Jacks and Masters
I haven't written a single word this year. It's a busy one for me,
building, scaling and supporting more big sites. Interesting problems
were solved, bugs were found... but still, I didn't feel like I
stumbled onto anything worthy of publishing that hasn't been rehashed
a thousand times already through blogs and eventually
change-logs. But thinking about my lack of material while catching up
on my podcast backlog gave me an idea to write something about the
sysadmin role in all this.
Many times in the last year I returned to two books as best-practices
guides for building and scaling web-apps. Those are Scalability Rules
and Web Operations. I recommend these books to any sysadmin
interested in web operations, as they share experiences from
engineers working on the top sites, and there's just no other way to
gain these insights unless you join them.
That brings me back to my podcast backlog. Episode 38 of the DevOps
Cafe had a very interesting guest (Dave Zwieback) who talked a lot
about hiring sysadmins, and about generalist vs. specialist roles in
systems administration today. I am a generalist, which in itself is
fine, but there's a big difference between "jack of all trades,
master of some" and being a "master of none". I've been thinking
about it since the podcast, as I wasn't thrilled with what I was
doing lately, that is, jumping through a lot of new technologies to
facilitate all kinds of new frameworks web developers use. Oftentimes
that means skipping any "logical" learning course of R&D; instead you
learn enough to deploy and manage it, while the real knowledge comes
within a certain period of time spent debugging problems when it
breaks apart in production.
Now to tie both parts of the text together. If you want to join the
site reliability engineers at one of the top sites, how do you
justify drawing a blank when asked to explain how Varnish malloc
storage works internally, while claiming you built huge caches with
it? The Jack issue is amplified if you consider there is now a first,
or even a second, generation of sysadmins who have never stepped into
a data-center and are missing out on hardware and networking
experience. An appropriate name that comes to mind is "the cloud
generation", and I'm a part of it.
09.12.2012 05:24
Hybrid IRCD for Arch Linux
Hybrid IRCD has been a favorite of mine for many years. I tried it once because a Croatian IRC network ran it, and it stuck with me. I'm very happy to announce that Hybrid packages for Arch Linux are available in AUR from today. I worked on them as a side project for a while and finished today thanks to the blizzard that kept me inside this weekend. The Hybrid server is available as ircd-hybrid, and the Hybserv2 services are available as ircd-hybrid-serv. They adhere to the standards set by all other ircd providers, the default configuration for both is usable out of the box, and examples for connecting the services to the server are included. They were built and tested on both arches; the only components not tested by me are the systemd service files.
09.12.2012 04:58
GNU/Linux and ThinkPad T420
I got a new workstation last month, a 14" laptop from the ThinkPad T
series. The complete guide
for TuxMobil about installing
Arch Linux on it
is here.
It replaced a (thicker and heavier) 13" HP ProBook 4320s which I used
for a little over a year before giving up on it. In some ways the
ProBook was excellent: certified for SUSE Linux, it had complete
Linux support down to the most insignificant hardware components. In
other ways it was the worst laptop I ever used. That ProBook series
has chiclet-style keyboards, and I had no idea just how horrible they
can be. The completely flat, widely spread keys with bad feedback
caused me a lot of wrist pain. Even after a year I never got used to
the keyboard, and I was making a lot of typos; on average I would
mistype even my login every second boot. At the most basic level my
job can be described as "typist", so all this is just plain
unacceptable.
The touchpad, however, is worse than the keyboard. It's a "clickpad",
with one big surface serving as both the touchpad area and the button
area. To get it into a usable state a number of patches are needed,
coupled with extensive user-space configuration. But even after a
year of tweaking it was never just right. The most basic of
operations, like selecting text, dragging windows or pressing the
middle button, is an exercise in patience. Sadly, clickpads are
present in a huge number of laptops today.
Compared to the excellent UltraNav device in the ThinkPad they are
worlds apart. The same is true of the keyboard in the T420, which is
simply the best laptop keyboard I've ever used. I stand behind these
words, as I just ordered another T420 for personal use. One could say
these laptops are in different categories, but that's not entirely
true. I had to avoid the latest ThinkPad models because of the
chiclet-style keyboards they now have. Lenovo claims that's "keyboard
evolution"; to me they just seem cheaper to produce, and this machine
could be the last ThinkPad I'll ever own. If this trend continues I
don't know where to turn next for decent professional-grade hardware.