Devops malarkey

Success. Failure. Cake.

Tumbleweed

… And then everybody left.

I could write a thing about burnout, but I was too fried to notice when it happened. The interesting/exciting/disturbing thing about being properly stressed is that it becomes entirely normal and you only realise something is broken when the music stops. Then all the things in your head that you’ve been ignoring pitch up at once and are like ‘HELLO’ and waving cartoon cricket-bats with nails and broken glass embedded.

Re-inventing the Wheel - Square and on Fire

I am incompetent and I can’t make Vagrant work.

At least, that’s the excuse I’ve been making for not joining in with the rest of the floor in using that for Puppet (among other things) development. Instead, my dev-rig is a VM running as a puppetmaster that’s tracking the changes I make to a given branch in our Git repo via the magic of post-commit hooks and another VM in which I can run ‘puppet agent --debug --blah --server (first VM)’. Once in a while I remember to blow away the second VM so I can make sure everything builds in the right order. However, even with snapshots it’s just slightly too painful to happen regularly.

Meanwhile, quite a lot of the recent developments at Future have involved rigs of between eight and twelve boxes. Generating a worthwhile test/dev version of one of those is rather tiresome because even if you’ve got the spare horsepower lying about, you have to spend yea-long wiring it all together, sanitising the static and/or test data and it all quickly gets completely oh god why did i even get out of bed i should have been a farmer like my dad mind you if i had done that i’d have been out of bed at six to go feed the sheep on long barrow bank perhaps not after all…

So when a mildly broken Dell R620 arrived back from one of the DCs coincidentally with me wanting to have a play with this Docker business, it all seemed a bit convenient.

I am incompetent and I can’t make Docker work. On Debian.

LXC, on the other hand, was slightly simpler than falling off a log.

Given that it’s simple to build a puppetmaster that’s the same as one of the live ones, and that all the machine config I currently care about is in a manifest, it should be pretty easy to generate a container and have it puppet itself up tolerably quickly.

This indeed turned out to be the case. However, having to hand-allocate IP addresses and fiddle about with container naming such that they picked up the configs in use on the live rig was all a bit too hands-on and really not what I wanted.

DNSMasq fixed the first problem. It is a surprisingly useful tool.
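
For what it’s worth, the dnsmasq end of it is tiny. Something like the sketch below does the job - the bridge name, address range, domain and file path here are made up rather than lifted from our rig:

file { '/etc/dnsmasq.d/lxc-testrig':
  # DHCP and local DNS for the container bridge, in one small lump of config.
  content => "interface=lxcbr0\ndhcp-range=10.0.3.50,10.0.3.250,12h\nlocal=/test.lan/\nexpand-hosts\n",
  notify  => Service['dnsmasq'],
}

service { 'dnsmasq':
  ensure => running,
}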

A rakefile which read a list of made-up machine names, generated softlinks to the actual hiera node configs and then instantiated the relevant containers fixed the second.

I also spent quite some time building a Wheezy ‘image’ that minimised apt-get as much as possible.

Result - fully puppeted containers come up in circa a minute. Somewhat longer if you have to install PHP. If I didn’t have quite such a rational hatred of golden images and all who sail in them, it would likely be faster still.

The next part is a bit fiddly.

The example problem I now have is that some parts of my collection of yea-many VMs want to connect to other parts. For instance if I have a redis slave, I need to know what the master’s IP address might be during the puppet run. At Future, we generate a location fact and use that in our hiera, er, hierarchy to configure things like message brokers, smarthosts and DNS ordering. I could just add yet another location - testbox, or something - allocate a block of IP space and then add some extra indirection. And then I could do that again for each person and/or project that wanted to run up a test-rig. At which point one has just run into a behaviour pattern that should probably be named ‘It’s OK, I can fix that for you.’

I first came across this in, er, 1991 when doing some NHS-related coding. One of the other chaps had written a thing which had to deal with, oh I don’t know, ten items or something. Because he was a forward-thinking sort, he allocated sixteen slots in his array and beetled off all smug for a coffee and a corned beef sandwich. As you might expect, a few months later one site or other had a list of seventeen items and a bug report. ‘It’s OK, I can fix that for you!’ went our chap and expanded the array to the clearly ludicrous value of some twenty-three slots…

There’s scope for an Eric Berne knockoff book of tiresome technical behaviour antipatterns, isn’t there?

Anyway, I’m using DHCP, and I wanted the entire edifice to work with little or no extra typing.

CoreOS’s etcd looked like a good fit. Emit salient facts to the etcd database when bringing up (say) the redis master, then query same via Garethr’s hiera-etcd when bringing up the slave. Profit!
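
In Puppet terms the idea looks roughly like the sketch below. The etcd host and port, the key name and the redis class interface are all placeholders rather than the finished module:

# On the master: shove the salient fact into etcd once redis is happy.
# (You'd want a proper guard on this so it doesn't fire on every run.)
exec { 'announce-redis-master':
  command => "/usr/bin/curl -s -X PUT http://etcd.test.lan:4001/v2/keys/redis/master -d value=${::ipaddress}",
  require => Service['redis-server'],
}

# On the slave: hiera-etcd makes that key look like any other hiera data.
$master_ip = hiera('redis/master')

class { 'redis':
  slaveof => $master_ip,
}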

That bit did take a little tinkering to get right.

It seems to me that the notion of a reactive puppet configuration is really rather interesting. Other people may well be screaming in terror and jabbering about things like ‘deadly embrace’ and ‘terrible feedback loops are fine for the Jesus and Mary Chain (Or A Place to Bury Strangers if that’s too retro for you) but have no place in a theoretically stable configuration.’ However, just as a top-down decision process enforced by rigid hierarchy is a hateful idea for a workplace environment, so it is for a machine environment.

TL;DR - code in Github, patches welcome.

Treating People Like Dicks (Distance Learning Edition)

Today one of the old Solaris boxes expired. Well, I say ‘box’ and ‘expired’. I mean ‘1U Solaris-X86-what-were-we-thinking machine’ and ‘fell into maintenance mode while I was eating breakfast’. And, in a truly extraordinary amount of digression and rambling, when I say ‘what were we thinking’ I probably mean ‘The kit had actually managed to serve MySQL tolerably reliably for some 1500 days.’

I don’t know if you lot remember the uptime wars, but they were medium sized in the late nineties. Rather like Sleeper or Menswear, but with fewer annoying tunes and rather more waiting around. We learned better as soon as someone equated long uptimes with being an obvious target for some bollix with a copy of Metasploit.

Anyway. A machine that hadn’t been restarted since it was shoved in a rack, that was host to a pile of Solaris zones. What could possibly have gone wrong with that?

It transpired that one set of binary logs or another had experienced a Jolly Interesting Time and had managed to confuse the zpool enough that the alleged hypervisor had thrown a strop and gone into maintenance mode. Which, um, okay…

Thankfully there are no beard-fondling Solaris types around to tell me that the next move was a Bad Idea, but mucking out the disks, clearing maintenance mode and restarting the beast looked to be the least-worst option.

That is, until we discovered that the running network config had never been written back to various bits of /etc and indeed there were no build notes or valid excuses on either the deceased wiki or the somewhat shiny new one.

There now followed a swearing competition.

I suspect that what happened in 2009 was pretty much like what happened this morning. After the eighth or twelfth reboot, the people wanting the databases back won over ‘I would like to make this network config survive a reboot surely the combined wit of the Sun/Oracle doc and three dozen assorted blogs and HOWTOs can’t all be missing the vital something or other that we can’t spot either…’

It’s still an unpleasant trick, though.

Short Commercial Break

Trigger warning: contains talk of horrible old Unix kit running horrible old Unix

There’s a good chance I’m probably a massive arsehole and I get paid for it. Which, I dunno, maybe I’m supposed to be pleased with myself about it because being disruptive is seen as a good thing these days. The last time I came across people being described as such, they were the hyperactive (or just sugared-up) kids at junior school who seemed to be convinced that it was all about them and if it wasn’t they’d throw a massive strop and wander round with a lower lip hung out like a soup-plate. I’m assuming a disruptive technology doesn’t have a howling fit in the middle of the organic vegetable section of the supermarket if it doesn’t get its own way, but then I wouldn’t be surprised if it did.

The thing about the arseholedom is that it isn’t malevolent in the slightest; it’s more a case of going up to someone and asking them why they’re nailing their legs to the table. You get this very weird selection of looks when you do things like that. As if they’re expecting you to come up with something sarcastic about using the contact adhesive on the shelf. Then they’ll say something like ‘Well we’ve always nailed our legs to the table in this department because it keeps the bees from flying to Winchcombe.’ Which, um, okay…

I mean, there’s no answer to that. Especially when some manager piles out of the end office going “D’you want the bees to go to Winchcombe? Do you? Because that’s what’s going to happen if you don’t buck your ideas up and crack on with that hammering.”

But you have to try. You point to one of the chairs in the corner and suggest that using those would be much less unpleasant. That’s when the trouble really starts. The manager goes pop-eyed and kicks off about ‘You smart buggers in IT think you know everything coming down here with your ideas I don’t have time for ideas there’s barely enough time to send Bob here down to the hardware shop for more nails, what with the bleeding and the Tetanus jabs and now you want us to cross-train to chairs I’m glad you think we all sit around with glue-guns like you wasters someone should sort you lot out once and for all.’

So you pull a chair over and they look at you like you just shat out a railway station.

What this is really about is that years ago (HP-UX 10.20 ago, in fact) I was given an HP9000 to look after. In poking around the filesystem to see what dreadful sort of albatross I’d been handed, I found a whole pile of cron-jobs that ran scripts to monitor sendmail and some more scripts that re-started sendmail and further scripts that tested the state of earlier scripts. It all seemed a bit pointless because even then sendmail could more or less be left alone to generate remote root exploits and sometimes deliver mail.

I asked one of the longer-serving chaps and he came over all leg-nailing. Apparently it wasn’t to be touched because sendmail was dreadfully unreliable and crashed every half hour.

I nodded, smiled and went off to throw away all the junk and upgrade the sendmail install to $latest.

It didn’t crash.

The point being that writing long-lived daemon processes is really very well known science and instead of mucking about with multiple layers of monitoring and backup, you’re much better off making the daemon work right.

There Are Two Hard Problems in Computer Science: Caching

Title stolen from one of the myriad on the internet more cleverer and witty than wot I am. However, Octopress seems to have added its own twist, so you’ll have to do the rest of the ‘joke’ yourselves…

Years ago, not long after a visit to the Anarchist Bookshop and having become mildly peeved with the names of computers at Previous Employ (The failover pair named after the (in)appropriate Southpark characters, the ones that were funny if you were twelve… Mind, we were all twelve; that was part of the fun. Mind also that our American management decided to call us all ‘spanners’ because of I don’t know what made up terrible morale-boosting exercise. Tip for the MBAs out there - if your entire English team has a fit of the giggles in an ‘all hands’, you have just said something hysterically inappropriate and they are not going to let on until you have the t-shirts printed), I started naming kit I built after anarchists. I think I got as far as kropotkin and bakunin before the option of voluntary redundancy came up and I followed my political convictions and ran pell-mell towards the £MONEY.

The Americans had something of a sense of humour failure (or actually maybe they didn’t in retrospect) and started naming machines nasdaq, bourse et al.

Last year, self & Sam(oth) started calling the notion of Devops ‘anarcho-syndicalism in action’.

Actually, I think he found the reference elsewhere, but it totally struck home because a lot of the alleged problems that the modern middle class white male technocratic elite have to put up with (only decent latte halfway across town, nowhere to dry yr bike kit in the office) are best approached with an eye to Solidarity (with other teams. Don’t let ‘managers’ or ‘stakeholders’ play at divide and conquer), Direct Action (fix those problems yourselves. You know your environment best. ‘Management’ ‘control’ is bollocks) and Workers’ Self-Management (do not replicate process with code. Optimise it out. Build the environment in which you wish to work. No-one will do it for you.)

And, obviously, this is a debased and pitiful version of a full-on political movement. Which is generally home to misogynistic rape-apologist dickheads it seems. (Who act like the polis at the first sign of trouble because that’s the only model of dissent-management they have. There is a policeman inside all of our heads it must be stopped.)

You may imagine my lack of surprise at discovering a tool called ‘Serf’, which lives at ‘serfdom.io’.

Again, that could well be irony so sufficiently advanced that it is indistinguishable from reality. However, such Hayek-followers as I have come across didn’t hold with that sort of malarkey.

I guess this sort of thinking fits in well with the sort of sods who talk about being ‘disruptive’ but actually just want other people to provide free services for which they can charge rent.

There’s probably another ‘talk’ in this, but I think it’s the sort of thing better done by the likes of Shanley Kane.

Giving Stuff Away on the Internet Is Probably a Good Thing.

For reasons that will become sadly apparent when these posts are read in the wrong order, I’ve been engaged in the job of interviewing people who’ve expressed interest in the notion of coming to work for Future. At least one of those people was keen to point out that they’d been looking at our code on Github and wanted to come along and play with it.

Which was nice.

Small Parts Isolated and Deployed

It seems that one of the things that people new to Puppet (and sometimes by extension, automated CI/CD rigs) try to do is brickhammer their existing deployment chains into the thing. You can go look at the mailing list and about once a week, someone will go ‘I need Puppet to manage this thumping great source directory which we will distribute to $list-of-servers and then build in situ. How do I make Puppet do a ./configure && make && make install?’

To which the answer is ‘No.’ and the answer to that is stropping because $reasons.

If you or your organisation still want to do that sort of thing, my suggestion is that you bin the terrible Unix systems you’re using and try one of the many free (or indeed expensive) versions that come with 1990s features like a package-management system. Mind, if you’re using Gentoo for production systems then I can’t help you. Please stop reading there is nothing for you here.

Of course you can’t package up everything you might wish to bung on a server from a distance. There are also going to be rules-lawyers hunting out corner cases in order to prove me wrong. Which, I don’t know, seems to be the broken behaviour pattern of those who’re somehow proud of keeping some ancient and spavined code-management technique alive into the C21st. Don’t do that either, you’re just making your own life hard. Or you’re working for an organisation ditto and why are you doing that?

Our own rules are entirely arbitrary and look like this:

Rebuilt Debian packages and/or backports and/or wonky Ruby code that has a config file and an initscript are served as .debs from our own repo. Building your own Debian repository is desperately simple - there’s a rough sketch of what that involves just after these rules.

Website code is managed through the magic of git, or the nearly-magic of svn. Not via puppet. The site furniture is instantiated via some puppet, but deploys happen via MCollective. Sinatra-based webapps also fit here, even though they’re wonky Ruby code with config files and initscripts. We may fix this. Or not. Who can say?

Tomcat apps are emitted from the end of a Jenkins-based chain and largely manage themselves. Getting Puppet involved just seems to confuse things.
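
As promised above, here is roughly what ‘desperately simple’ looks like for the in-house repo. It’s a sketch - paths and names are illustrative and the auto-upload/incoming side is left out:

# A minimal reprepro-backed repository, managed like anything else.
file { '/srv/repo/conf/distributions':
  content => "Origin: Future\nLabel: Future\nCodename: wheezy\nArchitectures: amd64 i386\nComponents: main\nDescription: in-house and rebuilt packages\n",
}

package { 'reprepro':
  ensure => installed,
}

# After which adding a package is just:
#   reprepro -b /srv/repo includedeb wheezy wonky-ruby-thing_1.2-1_amd64.deb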

The new special case that prompted this ramble is a Java app that’s going to sit on some edge servers. The last thing that happens in that Jenkins chain is that the app is packaged up as a .deb. Ok, a Java-style .deb, so the file-layout would make a Debian packager shit themselves with hatred, but still. Since our package generation has been mostly ‘by hand’ up until now, I’d never bothered with hacking up the auto-upload bits of reprepro. For the Jenkins stuff to work properly, I had to fix that. Thus when there’s a new build of the Java app, it appears moments later (depending on cronjob) in our Debian repository.

At that point, I thought it would be a good thing to have the repository-uploader send a message to the event-logger so we could see that there was a new version of code and something should probably be done about it. Not long after that, I realised that the ‘something’ might as well be automated, too. So actually, the repository-uploader will emit a message to a relevant topic on our message-bus, which will trigger an ‘apt-get update’ on the servers where that app is installed. If we’re feeling brave and the Puppet code that manages the app has ‘ensure => latest’ in the package statement, then they’ll go on and install that newly updated version.

Which is kind of exactly the behaviour one would expect from a continuous deployment rig.
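
The receiving end of that is almost embarrassingly small. A sketch, with a made-up package name - the message-bus plumbing that actually runs the ‘apt-get update’ isn’t shown:

# On the edge servers: once the repo metadata has been refreshed, the next
# puppet run (or an MCollective-triggered one) installs whatever is newest.
package { 'edge-app':
  ensure => latest,
}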

I Had That Janov Bloke in the Back of My Cab Once

Here’s a non-technical thing that’s been wandering round my head: brogrammers are more or less exactly what you can expect in an environment run by old-school Unix admins. Or rather, they emerged as a species in reaction to an environment which itself was a reaction.

I guess I’d better unpack that and provide some material so people can go TL;DR.

Brogrammers.

So (i). Brogrammer. You can go look that up on the internet, because that’s what sensible people do. If they come across a term or statement they’re not sure about, they can poke about the internet for a bit, gather information from several sources and perhaps come to a useful conclusion. It’s not, y’know, required, but it’s nice when it happens and makes them look much less like dicks than the sort of people who’ll just stand there going ‘No! Tell me what you mean!’

I would also ask you to go read this: what your culture really says, because it crystallised (or began the process of precipitation or whatever) a lot of what this ramble may or may not be about. I have no particular axe to grind with that piece because I am a white English bloke in my mid forties, and if I’ve been a participant in any of the scenarios listed I’ve not had the wit to realise it. It rings true, though. True enough that I suspect the ‘if’ in that preceding sentence is a ‘when’.

Finding Places to Put Things

I suspect this blog-thing will just contain sporadic apologies for lack of content for most of its lifetime.

Anyway.

This time the excuses have been brought to you by the words ‘fail’, ‘power’, ‘generator’, ‘contactor’, ‘250A supply’, ‘melted’ and the phrases ‘boot that filer from a different Vol0’, ‘can you smell smoke?’ and ‘Oh hell not again’.

As you might imagine, it’s been busy and the DR plan has been tested and found interesting.

We’re still Barberising and Hiera-ing up our shonky collection of Puppet modules. I’d say that they’re getting less shonky by the day, but it’s taking longer than that. I hesitate to talk about ‘patterns’, because… Actually, I think that’s an example of self-taught-hacker anti-intellectualism, which is just as much rubbish as its opposite.

So. The Barberis(ing|ed) pattern is a fine thing and, when used in combination with the wonder that is Hiera, allows us to do more things in simpler code.

However. One of the modules that I’d been putting off refactoring (so ‘patterns’ are suspect but ‘refactoring’ is fine, eh?) was the one that manages our NSD install and thus the DNS for quite a number of domains, some of which contain rather popular websites.

NSD is the authoritative-only nameserver daemon written by NLnet Labs, who are a top bunch of chaps. We abandoned BIND after one too many vulnerability notices.

I’d been putting off the work because the v0.1 module just drops the entirety of the zone-files directory under ../files/ and lets Puppet do the work of synchronising the files across the nameservers. It’s not as if it’s a terrible thing to do at first glance - Puppet’s file-serving means you can stop faffing with hand-brewed rsync scripts for managing the out-of-band DNS data, and if you’ve got your Puppet tree in a sensible SCM, you get version control ‘for free’.
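
For the curious, the v0.1 approach amounted to little more than the following - module name, paths and service name approximate:

# Serve the whole zone directory out of the module's files/ dir and let
# Puppet sync it across the nameservers.
file { '/etc/nsd3/zones':
  ensure  => directory,
  recurse => true,
  source  => 'puppet:///modules/nsd/zones',
  notify  => Service['nsd3'],
}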

However (again), great lumps of org-specific data like that shouldn’t really, we are told, be held within the module tree. It’s not necessarily obvious where the data should go, though. Nor is it terribly obvious how you connect it back to the Puppet module and have changes in the one signal the other to perform tasks.

Well, it is if you look at the right corners of the Internet, but this thing is mostly me groping around and trying stuff out as a warning to others.

NSD installation and management goes in the now-Barberised NSD module.

This also deposits code that rebuilds the NSD config file when a domain is added or removed. And indeed the out-of-band master list of domains, which semi-obviously has to travel separately from the zonefiles for $reasons.

(It’s about this time that someone-who-is-not-me would be going ‘Why isn’t all this domain gubbins in a nice database somewhere, then all zone maintenance would be a simple “SELECT mumble FROM yinglebart WHERE tewkesbury ISNT something”’, which would be very shortly before I hauled out the sarcasm-throwing machine.)

The zonefiles live in a git repo of their own. That repo is cloned down onto the master DNS server(s) and kept current via the magic of post-commit hooks. Meanwhile, there’s a file resource in the NSD module which looks like this:

# Audit the content of HEAD so Puppet notices whenever the post-commit hook
# moves the branch, and poke the rebuild when it does.
file { '/var/lib/nsd3/.git/HEAD':
  audit   => content,
  notify  => Exec['rebuild'],
}

# Only ever runs when something notifies it - i.e. when HEAD has changed.
exec { 'rebuild':
  command     => '/etc/nsd3/code/refresh.sh',
  refreshonly => true,
}

… Which is lifted wholesale from here. Either we’ve found one of the non-terrible use cases for this hack, or I’ll be writing another rambling post in a few months when I’ve had a better idea.

Actually Just Testing Something Else

You’d think, after all this time, that I could bosh things together and have them perform some semblance of useful work, right?

You’d think…

… Argh. ‘-’ characters in directory names? Surely not…