Sep 01 2008
Bill, a former colleague of mine who is very technically savvy, has a little corner of the blogspace called www.notgoodenough.net.
Here he posts about the everyday dumb issues that he comes across that make his working life that little bit harder. By day, he manages some very complex infrastructure for a company of over 1000 staff, and we often catch up for a coffee to compare notes on the latest head-shaking piece of dumbness that we have recently encountered. Now doing "not good enough" stories for Microsoft is simply too easy, so that’s why I like the fact that Bill goes after the likes of IBM and Packeteer as well.
The thing that is really scary when you read his stories is the sheer unbridled freakin' *lameness* of some of the things that he has encountered. My own stories are very similar, and it reinforces to me that when something goes pear-shaped enough to cost time and money, the underlying cause is rarely clever.
My personal case in point…
I put my heart and soul into my first Windows/SQL Server cluster. A lot of time went into planning and design, and I configured all of it: the fully redundant SAN, the blade servers that it ran on and the redundant network. I ran disk I/O stress-testing tools against various SAN disk configuration options and compared stats. All in all, I tested the crap out of it for days, pulling out cables, shutting down the SAN switches, terminating services, powering down blades, etc. Finally, I handed testing over to other staff members who were not able to kill it either. (We even had it independently reviewed by a local integrator that specialised in this area.)
All tests passed and into production it went, along with a lovingly documented technical specification as well as an operations guide for the system administrators.
For 3 months it worked perfectly, until a power outage took out one of the blade servers. This time the cluster did not fail-over correctly and SQL Server was unavailable. When we investigated why, it turned out that the person who set up the various service accounts had made an error with one of them that no-one noticed. The cluster service account had been set for an expiring password. Of course, the account eventually expired and therefore made all the good work that went into the design redundant when the cluster failed to fail-over.
What this example highlights is that it is very easy to be brought undone by something simple and stupid. Given the massive amount of complexity at the various layers of this particular cluster, we missed something obvious/dumb and paid for it.
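A check for this kind of thing is cheap to automate. Here is a minimal sketch in Python of what such an audit might look like. It is not what we actually ran; it assumes the accounts have already been exported from Active Directory into a list, and that service accounts follow a hypothetical `svc-` naming prefix. The `Account` class and the function name are made up for illustration. The two bit flags are the standard `userAccountControl` values for "account disabled" and "password never expires".

```python
from dataclasses import dataclass

# Bit flags from Active Directory's userAccountControl attribute.
ACCOUNT_DISABLED = 0x0002
DONT_EXPIRE_PASSWORD = 0x10000


@dataclass
class Account:
    """Hypothetical record for one exported directory account."""
    name: str                  # account name, e.g. "svc-cluster"
    user_account_control: int  # raw userAccountControl value


def expiring_service_accounts(accounts, prefix="svc-"):
    """Return names of enabled service accounts whose passwords can expire.

    The "svc-" prefix is an assumed naming convention, not a standard.
    """
    flagged = []
    for acct in accounts:
        if not acct.name.lower().startswith(prefix):
            continue  # not a service account under the assumed naming scheme
        if acct.user_account_control & ACCOUNT_DISABLED:
            continue  # disabled accounts cannot break anything by expiring
        if not acct.user_account_control & DONT_EXPIRE_PASSWORD:
            flagged.append(acct.name)
    return flagged


if __name__ == "__main__":
    sample = [
        Account("svc-sqlagent", 0x10200),  # normal account, never expires
        Account("svc-cluster", 0x0200),    # normal account, password expires
        Account("jsmith", 0x0200),         # ordinary user, ignored
    ]
    print(expiring_service_accounts(sample))  # ['svc-cluster']
```

Whether a flagged account should get a non-expiring password or a scheduled rotation is a policy decision; the point is only that somebody gets told before the cluster does.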
The same applies in the field of IT security. In the typical malicious hack, Tom Cruise does not descend the conveniently sized air conditioning chute and steal the data. The reality is much more mundane. Most hacks take advantage of human flaws via social engineering or via exploitation of known vulnerabilities. In fact most hacks are surprisingly lowbrow and the hackers more likely resemble Eric Cartman than Tom Cruise. About the only thing they will slide down is a slice of pizza on its way to their stomach.
A great example from Bill is his fun with an IBM BladeCenter. This is basically a chassis that allows a high density of servers, conserving power consumption and rack space. IP networking and fibre channel networking are provided by built-in modules on the chassis. The problem Bill had was that after a power failure *none* of the servers could talk to the rest of the network and were therefore unavailable. Obviously not good when 1000 users are twiddling their thumbs. To quote from Bill’s story…
We have an IBM BladeCenter with two Management Modules, two Nortel Ethernet switches and two Brocade Fibre Channel switches (I/O Modules).
We have had an issue with the BladeCenter where after an outage the external ports on the I/O Modules come up disabled. I had to connect to each of the modules and enable the external ports. Now, the first time this happened I assumed that some twit (me) had forgotten to save the configuration of the I/O Modules. So, of course I made extra sure that I saved the configuration.
The next outage we had the same thing happened. The external ports were disabled. However, it was obvious that the configuration had been saved because all the other settings (VLANs, etc.) were correct.
Later I discovered by accident that there’s a setting in the Management Module that *overrides* the I/O Modules. This setting is tucked away in the Admin/Power/Restart screen when *all other* configuration is accessed via the Configuration screen. And it seems that this setting defaults to disabled.
Grr, that is such a lame root cause and yet very difficult to find if you’re not a BladeCenter guru. Bill agrees:
Now, I can perhaps think of a reason for allowing the Management Module to override the I/O Modules (maybe – if you want to disable all external I/O to a particular module, although we can do that by connecting to the modules themselves, the place where we would normally configure them). But why default to disabled? And if we enable the ports on the I/O modules themselves, shouldn’t the above setting also change to enabled?
It’s just not good enough!
Got your own "not good enough" story?
I think most people in the IT industry have their own versions of the "not good enough" story. Feel free to post them here in the comments or via Bill’s blog. I’d love to hear how something insanely complex (yet pitched to make things simpler) was undone via something really silly and seemingly insignificant.
It doesn’t have to be SharePoint either (although I have mentioned a few of mine already on this blog).
Here are some good (albeit extreme) ones to get your thinking caps on…
- For all the complexity and investment in technology, the offsite-backup methodology at Ohio state was for an *intern* to take the tapes home, and 800,000 social security numbers were lost! http://tech.blorge.com/Structure:%20/2007/07/26/800000-stolen-social-security-numbers-a-22-year-old-scapegoat/
- "Engineer deletes cloud – hehe" – http://www.theregister.co.uk/2008/08/28/flexiscale_outage/
- A complex Cisco network that is a design/engineering masterpiece is held hostage by a rogue Cisco CCIE nerd. http://www.pcworld.com/businesscenter/article/148669/the_story_behind_san_franciscos_rogue_network_admin.html
- Google has the biggest cluster of boxes in the world but someone forgot to renew the AdWords SSL certificate: http://www.pdxtc.com/wpblog/google/even-google-makes-mistakes-expired-ssl/
- Microsoft took their entire online presence off the air for almost a day because of a DNS mistake: http://www.wired.com/science/discoveries/news/2001/01/41412
Thanks for reading