Brian and Susan

Just another WordPress.com weblog

PNCB October 4, 2008

Filed under: Play — Susan @ 3:13 pm

Since Brian posted about his non-trip (hopefully a trip-to-be), I thought I would share about the trip I took in August. Shortly after I graduated I began volunteering for the Pediatric Nursing Certification Board as an external item writer. I was thrilled to learn that I was invited to attend a workshop on writing test questions – and I was even more thrilled to learn that it would be paid for! I am definitely a big nerd in that I really like to study, especially when it is on my own terms and timeline, so writing test questions in my spare time is actually a lot of fun. It also keeps me on my toes on topics in primary care.


I headed off to Reston, VA for the conference with nervous excitement about traveling alone. I had to navigate 3 airports going to and from the conference. The first part of the trip was a breeze. I had a non-stop flight to Washington D.C., then caught a shuttle to the hotel we stayed in (very nice btw). The hotel was in the middle of a large shopping area in a nice town about 10 minutes away from the Dulles airport. I had brought several books and was really excited about watching cable, but was very disappointed to find that apparently the hotel signed up for all-sports TV – no Food Network, TLC, MTV, VH1, Bravo – nada. Nonetheless I had plenty of reading material, so after grabbing some dinner I sat down to enjoy a good book. Then the power went out.

Not just a little power outage, but a full city block. I went out in the hallway to see if anyone else knew what was going on and heard the maids yelling in another language. Then the fire alarm started sounding and an intercom said to exit the building, so I took that as my cue to head downstairs. On the way down I befriended another hotel guest who happened to be in the Navy. We went out into the plaza outside the hotel and waited for about 2 hours with other people in the area for the lights to come back on. Several rumors swirled about the cause of the outage, but the true reason remains a mystery. All I know is that it’s hard not to be a little freaked out when the power goes out across the whole city you’re in, all alone, just a few miles from the nation’s capital. Thank goodness my freak-out threshold has been raised over time or I would have been a mess. I actually had a decent time talking with the guy I met and some of the other people waiting outside. And it turned out that the guy I met was a Christian, which was cool.

The workshop was really great. I had another small freak-out moment when one of my flights home was severely delayed and I thought I’d be spending the night in an airport in New Jersey, but once again God totally helped me out: I got a direct flight to Nashville and got home 4 hours earlier than I thought I would. Overall it was a great experience. As an added bonus, I got to fly in a tiny propeller plane!


Production Outage October 1, 2008

Filed under: Work — Brian @ 2:04 pm
I took the failboat out for a joyride.

(or how I rode the failboat out for a joyride)

Last night I took down about 6 servers for work. Important ones, too. It was supposed to be just a small fix to the zoning on our main SAN switches for something unusual that happened yesterday during the day (human error caused by a new team member whom I also didn’t stop at the time). The original problem wasn’t a huge deal, but it created a small mess that needed to be cleaned up.

Apparently, I committed the fix wrong while cleaning it up later that night.  It’s really easy to do.  I’m kind of surprised it doesn’t happen more often.  There’s an A-side switch and a B-side switch (for redundancy), and if you accidentally activate changes on the B-side with the A-side selected, bad things happen.  And they did.  Every server lost connection with the B-side of our storage network.  Fortunately, I only made the mistake on one side, which means everything stayed connected on the A-side, so most servers could still see their storage but just had fewer “paths” to get to it.  Unfortunately, not all servers are configured correctly to handle losing paths (or worse, still aren’t even connected to both sides).  Those were the ones that went down.  Thank God it was only one side.  Accidentally bringing down all 200 physical servers and another 400 virtual servers would have been a total disaster, even after business hours.  I try not to think about it, but the fact that it’s that easy for me to do something like this is scary to say the least.
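
To give a rough idea of why only some boxes died, here’s a little back-of-the-napkin sketch (not anything we actually run at work – the host names and path counts are completely made up) that models hosts with paths on both the A-side and B-side fabrics and shows which ones would lose all their storage if one entire fabric disappears:

# A minimal sketch of the dual-fabric idea: every host should have storage
# paths on both the A-side and B-side fabrics, so losing one fabric only
# removes half the paths. Hosts connected only to the failed side lose
# everything. All names and numbers below are hypothetical.

hosts = {
    # host: paths per fabric
    "app01": {"A": 2, "B": 2},   # properly multipathed
    "db01":  {"A": 4, "B": 4},   # properly multipathed
    "web07": {"A": 0, "B": 2},   # only cabled/zoned to the B-side
    "old03": {"A": 0, "B": 1},   # legacy box with a single path
}

def surviving_paths(paths, failed_fabric):
    """Paths left after an entire fabric (A or B) goes away."""
    return sum(n for fabric, n in paths.items() if fabric != failed_fabric)

failed = "B"  # the side where the changes got activated by mistake
for host, paths in sorted(hosts.items()):
    left = surviving_paths(paths, failed)
    status = "still up (fewer paths)" if left > 0 else "DOWN - lost all storage"
    print(f"{host}: {left} path(s) left -> {status}")

Run it and app01 and db01 just lose half their paths, while web07 and old03 go dark – which is basically what happened to us, except with a few hundred servers instead of four.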

There’s a root cause meeting going on now. No doubt the production control police will be grilling my team lead as to why there was no change ticket for this ‘change’ when it really wasn’t even supposed to be a change at all. It hasn’t been a very good couple of months for infrastructure. We’ve had a lot of stupid outages lately. This does not help my team out at all. The demand for only touching production devices on Saturdays will definitely rise. Those meetings can be extremely annoying, too. When something like this happens, you get to try to explain what happened to the production control team, people who generally have no clue what you’re talking about and seem to think that “if only you had created a ticket” or “if only this had been done on a Saturday” nothing would have happened. The truth is that outages are going to happen eventually. People make mistakes, and hardware fails. Saying “we must have 24/7 availability” does not instantly make it so, especially when you aren’t willing to pay for that kind of environment yet still expect your employees to work terrible hours and pretend that it’s “just part of their job.” It may be part of the job, but the way infrastructure employees work around here goes far beyond what I consider their core job, especially when compared to the rest of IT. I have managed to dodge most of the after-hours work on this team because most of my contributions are scripting; most others have not been so lucky.

Things like this make me question a future in the IT department, at least on a lower-level infrastructure team.  I’m not a nights and weekends kinda guy when it comes to my job, especially if I’m not getting paid overtime and/or a shift differential of some sort.  Back in the day, IT apparently used to get overtime here as well as reimbursement for being forced to travel between two data centers, but those days are long gone for most all companies now because they became “too expensive.”  Instead, we’re supposed to take “comp time” where if you work a Saturday, you take off the next Friday or something.  The main problem is that pretty much no one ever takes their full share of comp time, usually because they have too much crap going on during the day to just take off.  Another problem is that having some random day off during the week does not make up for losing a Saturday.  When I have Saturday off, I can hang out with my wife and other friends/family.  When I have Friday off, I sit at home by myself doing nothing while everyone else is at work.

It isn’t that I’m not paid well; it’s that I’m paid the same as other people on application teams who almost never have to wait until after hours to actually do their work. Those teams also deal with a much lower degree of personal risk. If they make a mistake, at worst they probably just screw up their one application. If I make a mistake, I can easily bring down the entire IT infrastructure of my business unit. Managing millions of dollars of SAN equipment has been very satisfying. It really is cool stuff. But it touches almost every single server (hundreds and hundreds) in our environment. Centralized storage makes things much more efficient, flexible, and generally more awesome. It also means any outage to your SAN is absolutely devastating. When servers lose their storage, they basically fall over and enter a coma, which can only be remedied by the server teams very tediously logging into every single one and rebooting. That’s assuming no data corruption, which takes a lot more work to fix. It’s also assuming people remember how to bring up a server and application that might not have been rebooted in over a year. Oh, and if an outage does happen during the day and some of the really important stuff goes down, God save you from the wrath of everyone else in IT, because you just negatively impacted a very important part of their quarterly bonus. People get angry when you’re the reason they get paid less.
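
Just to illustrate how tedious that cleanup gets, here’s a rough sketch of the kind of sweep the server teams end up doing by hand after something like this – the hosts.txt file and the plain ping check are stand-ins I made up; a real check would look at multipath and application state, not just whether the box answers a ping:

# Walk a host list, see who still answers, and build a worklist of boxes
# somebody has to log into and reboot. Purely illustrative.
import subprocess

def is_reachable(host, timeout_s=2):
    """One ICMP ping (Linux-style flags); True if the host answers."""
    result = subprocess.run(
        ["ping", "-c", "1", "-W", str(timeout_s), host],
        stdout=subprocess.DEVNULL,
        stderr=subprocess.DEVNULL,
    )
    return result.returncode == 0

with open("hosts.txt") as f:            # one hostname per line (hypothetical)
    hosts = [line.strip() for line in f if line.strip()]

needs_attention = [h for h in hosts if not is_reachable(h)]

print(f"{len(hosts) - len(needs_attention)} of {len(hosts)} hosts responding")
for host in needs_attention:
    print(f"REBOOT/CHECK: {host}")

Now picture that worklist in the nightmare case – all 200 physical and 400 virtual servers – with applications nobody has restarted in over a year, and you get an idea of the evening.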

I’m sure the idea is that everyone in IT is equal and no one should be treated differently. It’s just not true. Some jobs are harder or more important than others. Having worked on one of the most critical teams (if not the most critical), I can say that it’s just not worth it if there isn’t some other incentive. Why would I want to work on a team that has to do almost everything after hours, where mistakes cost the company tens of thousands of dollars an hour, when I could be compensated the same for working on a team that has no real time restrictions and whose mistakes have a small area of effect? It makes no sense. The only time anyone really pays attention to us is when something breaks. I get the sense that some members of management have a lot of unrealistic expectations, which we have really brought on ourselves by being so lucky in the past. They have no real understanding of the technical complexity of what it takes to guarantee 24/7 availability, and yet they want to demand it on a whim.

I don’t think any of this is unique to where I work, either. It’s a larger problem with the way most businesses view their IT departments. Technology is expensive, and people who are able and willing to manage it really well are hard to find. A lot of businesses see how much money they spend on IT and feel they’re entitled to demand just about anything. I can understand why they would feel that way. IT costs are ridiculous, but that still doesn’t mean anyone can act like they own the IT employees. We try to have lives outside of work just like anyone else.

Lucky for me, I don’t have to deal with the worst problems much longer. Oddly enough, according to HR, today is actually my first official day off the SAN team’s roster and onto the General Systems team, where I’ll be doing application support. As part of the college grad hire program, you get moved around after 2 years. My physical move happens later this month, when I’ll go back to the main building. While I’m very excited about the move, I feel bad for those I’m leaving. They’ll have to keep dealing with stupid stuff like this.

Of course, maybe I’m wrong about how easy the application teams have it. I’ll find out soon enough.