Brian and Susan


Production Outage October 1, 2008

Filed under: Work — Brian @ 2:04 pm
I took the failboat out for a joyride.

(or how I rode the failboat out for a joyride)

Last night I took down about 6 servers for work.  Important ones, too.  It was supposed to be just a small fix to the zoning on our main SAN switches for something unusual that happened yesterday during the day (human error by a new team member, whom I also failed to stop at the time).  The original problem wasn’t a huge deal, but it created a small mess that needed to be cleaned up.

Apparently, I committed the fix wrong while cleaning it up later that night.  It’s really easy to do.  I’m kind of surprised it doesn’t happen more often.  There’s an A-side switch and a B-side switch (for redundancy), and if you accidentally activate changes on the B-side with the A-side selected, bad things happen.  And they did.  Every server lost connection with the B-side of our storage network.  Fortunately, I only made the mistake on one side, which means everything stayed connected on the A-side, so most servers could still see their storage but just had fewer “paths” to get to it.  Unfortunately, not all servers are configured correctly to handle losing paths (or worse, still aren’t even connected to both sides).  Those were the ones that went down.  Thank God it was only one side.  Accidentally bringing down all 200 physical servers and another 400 virtual servers would have been a total disaster, even after business hours.  I try not to think about it, but the fact that it’s that easy for me to do something like this is scary to say the least.
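To make the path-redundancy point concrete, here is a minimal sketch in Python.  Everything in it is made up for illustration: the server names, the `Server` record, and the `multipath_ok` flag are assumptions of mine, not anything pulled from our environment or from a vendor tool.  It just models the rule described above: a server rides out the loss of one side only if it is connected to the other side and is configured to fail over to it.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Server:
    """Hypothetical inventory record for one SAN-attached server."""
    name: str
    fabrics: frozenset   # which sides it is connected to: subset of {"A", "B"}
    multipath_ok: bool   # True if path failover is configured correctly


def survives_fabric_loss(server: Server, lost: str) -> bool:
    """Return True if the server keeps its storage when one side goes away."""
    remaining = server.fabrics - {lost}
    if not remaining:
        # Connected only to the side that was lost: no paths left.
        return False
    if lost not in server.fabrics:
        # Never used the lost side, so nothing changes for this server.
        return True
    # Had paths on both sides: survival depends on failover being set up.
    return server.multipath_ok


def servers_at_risk(servers, lost: str):
    """Names of servers expected to go down if the given side is lost."""
    return [s.name for s in servers if not survives_fabric_loss(s, lost)]


# Illustrative inventory only.
inventory = [
    Server("db01", frozenset({"A", "B"}), True),       # redundant and configured
    Server("app07", frozenset({"A", "B"}), False),     # redundant, bad multipath
    Server("legacy03", frozenset({"B"}), True),        # only on the B-side
    Server("web12", frozenset({"A"}), True),           # only on the A-side
]

print(servers_at_risk(inventory, "B"))
```

With this pretend inventory, losing the B-side flags `app07` and `legacy03`, which are the two kinds of servers that went down on me: the misconfigured ones and the ones that were never connected to both sides in the first place.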

There’s a root cause meeting going on now.  No doubt the production control police will be grilling my team lead as to why there was no change ticket for this ‘change’ when it really wasn’t even supposed to be a change at all.  It hasn’t been a very good couple of months for infrastructure.  We’ve had a lot of stupid outages lately.  This does not help my team out at all.  The demand for only touching production devices on Saturdays will definitely rise.  Those meetings can be extremely annoying, too.  When something like this happens, you get to try and explain what happened to the production control team, people who generally have no clue what you’re talking about and seem to think that “if only you had created a ticket” or “if only this had been done on a Saturday” nothing would have happened.  The truth is that outages are going to happen eventually.  People make mistakes, and hardware fails.  Saying “we must have 24/7 availability” does not instantly make it so, especially when you don’t wish to pay for that kind of environment and also expect your employees to work terrible hours and pretend that it’s “just part of their job.”  It may be part of the job, but the way infrastructure employees work around here goes far beyond what I consider their core job, especially when compared to the rest of IT.  I have managed to dodge most of the after-hours work on this team due to most of my contributions being scripting; most others have not been so lucky.

Things like this make me question a future in the IT department, at least on a lower-level infrastructure team.  I’m not a nights and weekends kinda guy when it comes to my job, especially if I’m not getting paid overtime and/or a shift differential of some sort.  Back in the day, IT apparently used to get overtime here as well as reimbursement for being forced to travel between two data centers, but those days are long gone for most all companies now because they became “too expensive.”  Instead, we’re supposed to take “comp time” where if you work a Saturday, you take off the next Friday or something.  The main problem is that pretty much no one ever takes their full share of comp time, usually because they have too much crap going on during the day to just take off.  Another problem is that having some random day off during the week does not make up for losing a Saturday.  When I have Saturday off, I can hang out with my wife and other friends/family.  When I have Friday off, I sit at home by myself doing nothing while everyone else is at work.

It isn’t that I’m not paid well, it’s that I’m paid the same as other people on application teams who almost never have to wait until after-hours to actually do their work.  Those teams also deal with a much lower degree of personal risk.  If they make a mistake, at worst they probably just screw up their one application.  If I make a mistake, I can easily bring down the entire IT infrastructure of my business unit.  Managing millions of dollars of SAN equipment has been very satisfying.  It really is cool stuff.  But it touches almost every single server (hundreds and hundreds) in our environment.  Centralized storage makes things much more efficient, flexible, and generally more awesome.  It also means any outages to your SAN are absolutely devastating.  When servers lose their storage, they basically fall over and enter a coma which can only be remedied by the server teams very tediously logging into every single one and rebooting.  That’s assuming no data corruption, which takes a lot more work to fix.  It’s also assuming people remember how to bring up a server+application which might not have been rebooted in over a year.  Oh, and if an outage does happen during the day and some of the really important stuff goes down, God save you from the wrath of everyone else in IT because you just negatively impacted a very important part of their quarterly bonus.  People get angry when you’re the reason they get paid less.

I’m sure the idea is that everyone in IT is equal and no one should be treated differently.  It’s just not true.  Some jobs are harder or more important than others.  Having worked on one of the most critical teams, if not the most critical, I can say that it’s just not worth it if there isn’t some other incentive.  Why would I want to work on a team that has to do most everything after-hours, where mistakes cost the company tens of thousands of dollars an hour, when I could be compensated the same for working on a team that has no real time restriction and whose mistakes have a small area of effect?  It makes no sense.  The only time anyone really pays attention to us is when something breaks.  I get the idea that some members of management have a lot of unrealistic expectations, which we have really brought on ourselves by being so lucky in the past.  They have no real understanding of the technical complexity of what it takes to guarantee 24/7 availability, and yet they want to demand it on a whim.

I don’t think any of this is unique to where I work, either.  It’s a larger problem with the way most businesses view their IT department.  Technology is expensive, and people who are able and willing to manage it really well are hard to find.  A lot of businesses see how much money they spend on IT and feel they’re entitled to demand just about anything.  I can understand why they would feel that way.  IT costs are ridiculous, but it still doesn’t mean anyone can act like they own the IT employees.  We try to have lives outside of work just like anyone else.

Lucky for me, I don’t have to deal with the worst problems much longer.  Oddly enough, according to HR, today is actually my first official day off the SAN team’s roster and onto the General Systems team where I will be doing application support.  As part of the college grad hire program, you get moved around after 2 years.  My physical move happens later this month where I’ll be going back to the main building.  While I’m very excited about the move, I feel bad for those I’m leaving.  They’ll have to keep dealing with stupid stuff like this.

Of course, maybe I’m wrong about how easy the applications team has it.  I’ll find out soon enough.