Thursday, July 16, 2015

Dual Controllers, Single Point of Failure?

We’ve all (hopefully) heard the term before: “Single Point of Failure.”  This phrase strikes fear in the hearts of people in management: the idea that if this one important resource has issues, then everything dependent on it fails.  It’s the weak link, the gremlin in your environment, and according to our old friend Murphy, “Anything that can go wrong will go wrong.”  And you’d better believe this SPOF gremlin is going to rear its ugly head at the most inopportune and painful time.  Just ask any veteran IT professional.

So how do we combat these SPOF gremlins?  We build in redundancy, we limit failure domains, we vigilantly monitor our environments, and we alert on any changes or anomalies.  That way, when failures do occur, we have either an automatic failover or a near-immediate fix that will keep our users happily clicking away.

So let’s apply this to the topic of storage, specifically a storage array.  Forget the network connections to the array for now; let’s home in on the modern storage array chassis itself.  These arrays are often equipped with multiple network connections, power supplies, disks, processors, memory banks, etc.  The vendors will tell you: “We have dual controllers, everything is mirrored the instant it is brought into the array, so this is not a single point of failure.”

So are they correct?  Will a dual controller storage array be able to keep the SPOF gremlins at bay?  I wish I could give you a conclusive answer, because I suspect that some storage manufacturers are nearing the point where the odds of a failure bringing an entire dual controller array down are comically small.  But let’s ponder this…and I’m speaking from a painful past experience here.  What is protecting you from failures within the operating system that runs the array?  The usual answers are “We have the best engineers in the industry,” “We run our revisions through rigorous tests to ensure stability,” and “We guarantee 99.999% uptime.”

Interestingly enough, five-nines of reliability still allows for roughly 5 minutes and 15 seconds of downtime a year.  Think of the damage a SPOF gremlin could do in that amount of time.  Yeah, it will be painful, and it will likely take longer than 5 minutes to fully recover.
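If you want to check that number yourself, here is a quick sketch of the arithmetic.  It assumes a 365-day year and treats the uptime guarantee as a simple percentage of that year; the function name is just something I made up for illustration.

```python
SECONDS_PER_YEAR = 365 * 24 * 60 * 60  # assumes a non-leap year


def downtime_budget_seconds(availability_percent: float) -> float:
    """Seconds of downtime per year permitted by a given uptime percentage."""
    return (1 - availability_percent / 100) * SECONDS_PER_YEAR


for uptime in (99.9, 99.99, 99.999):
    # Round to whole seconds first so the minute/second split stays clean.
    total = round(downtime_budget_seconds(uptime))
    minutes, seconds = divmod(total, 60)
    print(f"{uptime}% uptime allows about {minutes} min {seconds} s of downtime per year")
```

For 99.999% this works out to about 315 seconds, which is where the 5 minutes and 15 seconds comes from.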

So what do we do?  Well, if you have high tier workloads that require constant uptime, then it’s probably a good idea to look at replication technology.  Storage arrays often have some sort of storage replication built in as a feature.  If that doesn’t work out, many applications and services have replication features of their own that will provide a similar solution.
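To see why a replica helps where dual controllers alone do not, here is a rough back-of-the-envelope model.  Every number in it is hypothetical, and it assumes failures are independent, which is generous: two arrays running the same buggy OS revision can fail the same way.  Treat it as a sketch of the reasoning, not a prediction for any real array.

```python
def parallel(availability: float, copies: int = 2) -> float:
    """Availability of redundant copies, assuming independent failures."""
    return 1 - (1 - availability) ** copies


def series(*parts: float) -> float:
    """Availability of a chain where every part must be up."""
    result = 1.0
    for part in parts:
        result *= part
    return result


# Hypothetical figures, chosen only for illustration.
controller = 0.999   # one controller on its own
array_os = 0.9999    # shared array operating system (the potential SPOF)

# Dual controllers are redundant, but both depend on the same OS.
single_array = series(parallel(controller), array_os)

# A second, replicated array makes the whole array redundant.
replicated = parallel(single_array)

print(f"dual controllers only: {single_array:.6f}")
print(f"with a replica array:  {replicated:.8f}")
```

The point of the model is that the single array can never be more available than the OS both controllers share, no matter how good the controllers are.  The replica is what gets you past that ceiling, at least as long as the two arrays don’t share the same gremlin.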

My best advice is: continue to be vigilant with your monitoring and don’t let your guard down.  Those gremlins are out there somewhere, and when they strike, you need to be ready.  Let us help with planning your defenses and maintaining your uptime goals.  We have the expertise to identify the single points of failure (they can be very sneaky) and to combat them.  After all, if you take on the gremlins yourself, could you be considered a SPOF?


