Wednesday 29 October 2014

On taking the time to see what happened

I kind of want to watch the video of the failed Antares launch from last night, for context on the photo of the launch site that NASA recently published on Google+… but I saw the photos of the explosion on Twitter shortly after it happened, and just kind of shivered. Knowing perfectly well that no one was killed in the accident didn’t help—it was clearly a huge blow for the company. As someone on Twitter said, spaceflight is hard. NASA and Roscosmos have it pretty well sorted at this point, having been in the game longer than anyone else (with full credit due, of course, to CNSA and ISRO), so it really does bear pointing out that Orbital Sciences and SpaceX have only been putting rockets into orbit since about 1990 and 2008, respectively. It’s not that SpaceX is necessarily doing launches and rockets better than Orbital. Orbital has way more experience. SpaceX just hasn’t had a rocket blow up on the pad yet. Hopefully that never happens, but the reality is that there’s fundamentally little difference between a rocket and a bomb.

I’ve been trying to stay on top of the commercial American launches, mostly because the advent of commercial spaceflight is really exciting to me. I’ve seen two Falcon 9 launches so far—the most recent one, and (if I recall correctly) its first mission to the ISS—but I haven’t had the opportunity to see anything from Orbital Sciences and Wallops. Maybe it’s a stronger cultural affinity for Cape Canaveral; as far as I knew until the last year or so, the only launch facility NASA had was in Florida. But every time a Virginia launch is mentioned, I secretly hope that this time it’ll be visible from Toronto. I look at the maps of where the launch will be visible from, and at what angle, and I’m always a little disappointed that Toronto is well outside the arc. I took the opportunity to see the launch of STS-135 when my family travelled to Orlando for a Disney World/Universal Studios holiday. Our timing coincided with the launch date, and having never seen a Space Shuttle launch before, my wife, my sister-in-law, and I took my then-six-month-old son from Greater Orlando to Titusville to visit Kennedy Space Center. We didn’t make it to Titusville for the launch, but we saw it, pretty clearly, from the roughly twenty miles away that we were when the countdown hit the last few minutes. That was an incredible experience, even from that distance (because, by God, you can hear it), and I’d love to have that experience again.

I’m also interested in the Antares launch, and specifically its failure, from a process engineering perspective. A few people on Twitter and Google+ noted that, as soon as the rocket exploded, Orbital Sciences’ Twitter feed went silent. Reports came in from NASA around the same time that the Orbital mission controllers were giving witness statements and securing the telemetry they’d received from the rocket up until that point. In their business, this is absolutely critical for figuring out what caused the incident, so that it can be avoided in the future. Rockets are expensive, so having all that cash go up in flames is a disaster.

But in technology, we can certainly learn from this. So often, when something goes wrong on a server, particularly a production server, our first response is simply to fix it and get the website running again. Don’t get me wrong; this is important, too—in an industry where companies can live or die on uptime, getting the broken services fixed as soon as possible matters. But preventing the problem from happening again is equally important, because if you’re constantly fighting fires, you can’t improve your offering. When something goes wrong, and you have the option, your first response should be to remove the broken machine from the load balancer. Disconnect it from any message queues that it might be listening to, but otherwise keep the environment untouched, so that you can perform some forensic analysis and discover what went wrong.

In addition to redundancy, you also need logging. Oh my good good God, you need logging. Yes, logs take up disk space. That’s what services like logrotate are for—logs take up space, sure, but gzipped logs take up a tenth of that space. And if you haven’t looked at those logs for, let’s say, six months… you probably have a solid enough service that you don’t need them any more. And if, for business reasons, you think you might… you can afford to buy more disks and archive your logs to tape. In the grand scheme of things, disks are cheap, but tape is cheaper.
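
To put a number on that, here’s a minimal sketch, in Python, of what logrotate’s compress directive is doing for you anyway: gzip anything in a (hypothetical) application log directory that hasn’t been touched in a month, and report how much space comes back. The directory, suffix, and age threshold are assumptions for illustration, not anything from a real setup.

```python
#!/usr/bin/env python
# Minimal sketch: compress stale logs and report the space reclaimed.
# LOG_DIR, the ".log" suffix, and MAX_AGE_DAYS are hypothetical values;
# in real life, logrotate's `compress` directive does this for you.
import gzip
import os
import shutil
import time

LOG_DIR = "/var/log/myapp"   # hypothetical application log directory
MAX_AGE_DAYS = 30            # anything older than this gets gzipped

cutoff = time.time() - MAX_AGE_DAYS * 86400
reclaimed = 0

for name in os.listdir(LOG_DIR):
    path = os.path.join(LOG_DIR, name)
    if not name.endswith(".log") or os.path.getmtime(path) > cutoff:
        continue
    with open(path, "rb") as src, gzip.open(path + ".gz", "wb") as dst:
        shutil.copyfileobj(src, dst)
    reclaimed += os.path.getsize(path) - os.path.getsize(path + ".gz")
    os.remove(path)

print("Reclaimed %.1f MB" % (reclaimed / 1048576.0))
```

Plain-text logs compress absurdly well, so the “logs are eating my disk” objection usually evaporates once rotation and compression are actually switched on.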

So, ultimately, what’s the takeaway for the software industry? Log everything you can. Track everything you can. And when the shit hits the fan, stop and gather information before you do anything. I know cloud computing gives us the option (when we plan it out well) of just dropping a damaged cloud instance on the floor, spinning up a new one, and walking away, but if you do that without even trying to diagnose what went wrong, you’ll never fix the underlying problem.
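
Before you drop that damaged instance on the floor, you can at least grab its story. Here’s a minimal sketch of the kind of forensic bundle I mean (process list, open sockets, kernel ring buffer, and the logs themselves); the specific commands and paths are my assumptions, not anything prescribed. Copy the result somewhere safe, then rebuild to your heart’s content.

```python
#!/usr/bin/env python
# Minimal sketch: bundle up a sick machine's state *before* replacing it.
# The commands and paths below are illustrative assumptions; add whatever
# your own post-mortems tend to need (app dumps, database state, etc.).
import subprocess
import tarfile
import time

stamp = time.strftime("%Y%m%d-%H%M%S")
bundle = "/tmp/forensics-%s.tar.gz" % stamp

# Snapshots of running state: process list, sockets, kernel ring buffer.
snapshots = {
    "ps.txt": ["ps", "auxww"],
    "sockets.txt": ["ss", "-tunap"],
    "dmesg.txt": ["dmesg"],
}

with tarfile.open(bundle, "w:gz") as tar:
    for name, cmd in snapshots.items():
        out = "/tmp/%s" % name
        with open(out, "wb") as fh:
            subprocess.call(cmd, stdout=fh, stderr=subprocess.STDOUT)
        tar.add(out, arcname=name)
    # And the logs themselves -- the whole point of keeping them.
    tar.add("/var/log", arcname="var-log")

print("Wrote %s; copy it off the box, then rebuild away." % bundle)
```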

Saturday 4 October 2014

A lesson learned the very hard way

Two nights ago, I took a quick look at a website I run with a few friends. It’s a sort of book recommendation site, where you describe some problem you’re facing in your life, and we recommend a book to help you through it. It’s fun to try to find just the right book for someone else, and it really makes you consider what you keep on your shelves.

But alas, it wasn’t responding well—the images were all fouled up, and when I tried to open up a particular article, the content was replaced by the text “GEISHA format” over and over again. So now I’m worried. Back to the homepage, and the entire thing—markup and everything—has been replaced by this text.

First things first: has anyone else ever heard of this attack? I can’t find a thing about it on Google, other than five or six other sites that had been hit by it when Googlebot indexed them, one of them at least a year ago.

So anyway, I tried to SSH in, with no response. Pop onto my service provider’s control panel to access the console (much as I wish I had the machine colocated, or even physically present in my home, I just can’t afford the hardware and the bandwidth fees), and that isn’t looking good, either.

All right, restart the server.

Now HTTP has gone completely nonresponsive. And when I access the console, it’s booted into initramfs instead of a normal Linux login. This thing is hosed. So I click the “Rescue Mode” button on my control panel, but it just falls into an error state. I can’t even rescue the thing. At this point, I’m assuming I’ve been shellshocked.

Very well. Open a ticket with support, describing my symptoms, asking if there’s any hope of getting my data back. I’m assuming, at this point, the filesystem’s been shredded. But late the next morning, I hear back. They’re able to access Rescue Mode, but the filesystem can’t fsck properly. Not feeling especially hopeful, I switch on Rescue Mode and log in.

And everything’s there. My Maildirs, my Subversion repositories, and all the sites I was hosting. Holy shit!

I promptly copied all that important stuff down to my personal computer, over the course of a few hours, and allowed Rescue Mode to end and the machine to restart into its broken state. All right, I think, this is my cosmic punishment for not upgrading the machine from Ubuntu Hardy LTS, and not keeping the security packages up to date. Install a new OS, with the latest version of Ubuntu they offer, and keep the bastard thing up to date.

Except that it doesn’t quite work that well. On trying to rebuild the new OS image… it goes into an error state again.

Well and truly hosed.

I spun up a new machine, in a new DC, and I’m in the process of reinstalling all the software packages and restoring the databases. Subversion’s staying out; this is definitely the straw that broke the camel’s back in terms of moving my personal projects to Git. Mail comes last, because setting up mail is such a pain in the ass.

And monitoring this time! And backups! Oh my God.

Let this be a lesson: if you aren’t monitoring it, you aren’t running a service. Keep backups (and, if you have the infrastructure, periodically try to restore from them). And keep your servers up to date. Especially security updates!
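
Monitoring doesn’t have to start out fancy, either. Here’s a minimal sketch of the kind of check that would have caught this days earlier: fetch the front page and fail loudly if it’s down or serving text it should never serve. The URL is a placeholder, and the “cron mails you when it exits non-zero” part is an assumption; swap in Nagios, Pingdom, or whatever you actually run.

```python
#!/usr/bin/env python
# Minimal sketch of a dumb-but-honest uptime check, meant for cron.
# The URL is a placeholder; the marker string is the text from the
# defacement above, which should obviously never appear in a real page.
import sys
import urllib2  # urllib.request on Python 3

URL = "http://example.com/"    # placeholder: your site here
BAD_MARKER = "GEISHA format"   # text that should never show up

try:
    body = urllib2.urlopen(URL, timeout=10).read()
except Exception as exc:
    sys.exit("DOWN: %s (%s)" % (URL, exc))

if BAD_MARKER in body:
    sys.exit("DEFACED: %s is serving %r" % (URL, BAD_MARKER))

print("OK: %s" % URL)
```

Cron emails anything a job writes to stderr (assuming the box can send mail), so even a fifteen-liner like this beats finding out from a friend two days later.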

And, might I add, many many thanks to the Rackspace customer support team. They really saved my bacon here.