RAID in the 21st Century

Storagebod’s musings on storage availability got me thinking about RAID (Redundant Array of Independent Disks) technologies and how they have evolved to handle ever larger drive and array sizes.  RAID is, after all, a risk mitigation technique.  Disk drives fail.  Sometimes this is a pure mechanical failure.  Other times, the drive media may develop bad sectors which render portions of the drive unreadable.  In either case, data has been lost and must be recovered from redundant data stored within the array.

Historically, the most popular RAID technology has been RAID-5 (striping with distributed parity).  RAID-5 performs well, has excellent storage efficiency, and is reliable enough for most common uses.  RAID-1 (mirroring) and RAID-10 (mirroring and striping) are also common and have typically been used where RAID-5 does not perform well enough.  RAID-1/10 are also considered to be more resilient to failure than RAID-5.  This additional performance and resiliency come at the expense of greatly reduced storage efficiency.

Recently, RAID-5 has been showing its age.  As drive sizes have become ever larger, the amount of time required to reconstruct the data from a failed drive has increased as well.  This has led to uncomfortably long periods of time where a single bad sector discovered during array reconstruction can wipe out an entire RAID array.  Statistically speaking, RAID-5 still seems to be working well for enterprise Fibre Channel drives, but I have become uncomfortable with RAID-5 arrays constructed from large SATA drives.  (I define large as 500+ GB.)  I expect my discomfort to increase as drive sizes continue to grow.

RAID-6 (striping with two independently calculated parity values) is one possible solution to the problem of data integrity exposure during array reconstruction.  With RAID-6, three things (as opposed to two with RAID-5) have to go wrong before data is lost.  This is dramatically more reliable than RAID-5, and still much more efficient than RAID-1.  RAID-6, however, is not a perfect solution.

The problem with RAID-6 is that most implementations are slow.  The additional I/O operations and parity calculations required by RAID-6 impose a significant performance penalty on write operations.  Clever implementations such as Intelligent Write Caching (available in DS8000 R4.2+), or the hybrid WAFL/RAID-DP approach taken by Data ONTAP (available in NetApp FAS and IBM N series arrays), significantly reduce the performance penalty of RAID-6.  In fact, DS8000 Intelligent Write Caching makes RAID-6 arrays on the DS8000 perform almost as well as pre-Intelligent Write Caching RAID-5 arrays.

So what about XIV?  XIV uses a completely different storage scheme named RAID-X.  RAID-X is a radical re-think of the way we store data in an array.  It is a hybrid of mirroring, massive parallelism, and dynamic balancing of system resources.  RAID-X’s goals are simple: make commodity level hardware perform at enterprise levels, make storage administration dramatically simpler, provide consistent performance in the face of wildly varying I/O requirements, and be able to seamlessly adapt to ever increasing hard drive sizes.

There’s a lot of misinformation about how RAID-X works.  In an attempt to clear matters up, I offer the following.

A fully populated XIV frame contains 15 modules.  Each module contains 12 disk drives, cache, and processor resources.  This gives us a total of 180 disk drives.  Today’s shipping XIVs use 1 TB SATA disk drives, so the raw capacity of the frame is approximately 180 TB.  (In reality, it is a bit smaller since a 1 TB drive doesn’t actually hold 1 TB of data, but that is a topic for another post.)  All data is mirrored internally and some space is set aside for spare capacity and system metadata.  This gives a usable capacity of 79 TB per fully populated XIV frame.
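The capacity arithmetic above can be checked with a few lines of Python.  This is only a sketch using the nominal figures quoted in this post; the post gives only the final usable figure, so the overhead for spares and metadata is derived here as a difference, and the constant names are my own.

```python
# Nominal figures taken from the post.
MODULES = 15
DRIVES_PER_MODULE = 12
DRIVE_TB = 1.0      # nominal size; a real 1 TB drive holds slightly less
USABLE_TB = 79.0    # usable capacity quoted for a full frame

drives = MODULES * DRIVES_PER_MODULE      # 180 drives
raw_tb = drives * DRIVE_TB                # 180.0 TB raw
mirrored_tb = raw_tb / 2                  # 90.0 TB after mirroring

# Derived, not stated in the post: what is left over for
# spare capacity and system metadata combined.
reserved_tb = mirrored_tb - USABLE_TB     # 11.0 TB

print(f"{drives} drives, {raw_tb:.0f} TB raw, {mirrored_tb:.0f} TB mirrored")
print(f"{reserved_tb:.0f} TB reserved, {USABLE_TB:.0f} TB usable")
```

In other words, mirroring accounts for half of the raw capacity, and roughly another 11 TB (nominal) goes to spares and metadata.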

When data comes into XIV, it is divided into 1 MB “partitions”.  Each partition is allocated to a drive using a pseudo-random distribution algorithm.  A duplicate copy of each partition is written to another drive with the requirement that the copy not reside within the same module as the original.  This protects against a module failure.  A global distribution table tracks the location of each partition and its associated duplicate.  When a failure occurs, the system knows exactly which partitions are no longer protected and immediately begins creating new copies to restore redundancy.  This is where the parallelism of the design comes into play.  The entire machine goes to work re-creating the missing redundancy, so very little work has to be done by any one component.  This allows XIV to rebuild a failed 1 TB drive in minutes as opposed to the hours it would take in traditional RAID implementations.
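To make the placement rule concrete, here is a toy model in Python.  It is not XIV's actual algorithm, which this post does not describe; Python's `random` module stands in for the pseudo-random distribution, a plain dictionary stands in for the global distribution table, and the function names are hypothetical.

```python
import random

MODULES = 15
DRIVES_PER_MODULE = 12
N_DRIVES = MODULES * DRIVES_PER_MODULE  # 180


def module_of(drive):
    """Drives 0-11 sit in module 0, 12-23 in module 1, and so on."""
    return drive // DRIVES_PER_MODULE


def place_partitions(n_partitions, seed=0):
    """Build a toy distribution table: partition -> (primary, mirror)."""
    rng = random.Random(seed)
    table = {}
    for p in range(n_partitions):
        primary = rng.randrange(N_DRIVES)
        mirror = rng.randrange(N_DRIVES)
        # Re-draw until the mirror lands in a different module.
        while module_of(mirror) == module_of(primary):
            mirror = rng.randrange(N_DRIVES)
        table[p] = (primary, mirror)
    return table


def exposed_partitions(table, failed_drive):
    """Map each partition that lost a copy to the drive holding the survivor."""
    exposed = {}
    for p, (primary, mirror) in table.items():
        if primary == failed_drive:
            exposed[p] = mirror
        elif mirror == failed_drive:
            exposed[p] = primary
    return exposed


table = place_partitions(100_000)
exposed = exposed_partitions(table, failed_drive=0)
survivors = set(exposed.values())
print(f"{len(exposed)} partitions exposed, "
      f"surviving copies spread over {len(survivors)} drives")
```

Because a mirror never shares a module with its original, the surviving copies of a failed drive's partitions can only sit on the 168 drives in the other 14 modules.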

The most common FUD point raised against RAID-X is that it is vulnerable to a double-drive failure and since data is spread across the entire machine, the failure of any two drives will cause data loss.  While this makes a great talking point on a competitive slide, it is simply not the case.  Allow me to explain.

As I pointed out above, RAID is a risk mitigation technique.  The most common ways to mitigate the risk of data loss are to decrease the probability that a critical failure combination can occur, and/or decrease the window of time where there is insufficient redundancy to protect against a second failure.  RAID-6 takes the former approach.  RAID-10 and RAID-X take the combination approach.  Both RAID-10 and RAID-X reduce the probability that a critical combination of failures will occur by keeping two copies of each bit of data.  Unlike RAID-5, where the failure of any two drives in the array can cause data loss, RAID-10 and RAID-X require that a specific pair of drives fail.  In the case of RAID-X, there is a much larger population of drives than a typical RAID-10 array can handle, so the probability of having the specific pair of failures required to lose data is even lower than RAID-10.

Another difference between RAID-X and RAID-10 is the window of vulnerability between the time a drive fails and the time when full redundancy is restored.  RAID-10 has to copy the entire contents of the surviving member of the pair to a spare drive.  The copy time is directly proportional to the size of the drive since only one source volume is copying to one target volume.  While much faster than RAID-5, it can still take a while to copy a 1 TB drive.  The limiting factor is the transfer rate supported by the single source and target volume pair.

RAID-X operates differently.  When a drive fails, the global distribution table immediately knows the locations of all data that are no longer redundant.  This non-redundant data is evenly distributed among 168 drives (the 180 drives in the frame minus the 12 in the failed drive’s own module).  The system immediately goes to work creating redundant copies of all exposed data.  This is done as a fully parallel, any-to-any operation with all surviving 179 drives participating.  Each drive only has to carry out about 0.6% (1/179) of the effort required to restore redundancy.  In addition, and this is key, only partitions that actually contain user data need to be copied.  If the system is only 50% full, RAID-X only needs to copy 500 GB worth of data to fully recover from a failed drive.  Contrast this with RAID-10, which has to copy the entire drive regardless of the amount of user data actually stored on the drive.  Between the inherent parallelism of the design and the intelligence of the copy process, RAID-X can completely recover from a failed 1 TB drive in as little as 15 minutes.
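A rough back-of-the-envelope calculation shows why the parallel approach shortens the window so much.  The sketch below is illustrative only: the 50 MB/s single-pair copy rate is an assumption I picked for the example, not a measured figure, and the "implied rate" simply divides the data evenly across the 179 surviving drives while ignoring host I/O and the fact that each drive both reads and writes.

```python
def per_drive_share(surviving_drives):
    """Fraction of the rebuild each surviving drive carries."""
    return 1.0 / surviving_drives


def single_pair_copy_hours(drive_gb, rate_mb_s):
    """RAID-10 style: one source drive copies everything to one spare."""
    return drive_gb * 1000 / rate_mb_s / 3600


def implied_per_drive_rate_mb_s(data_gb, surviving_drives, minutes):
    """Per-drive rate needed to move data_gb in the given time, in parallel."""
    return data_gb * 1000 / surviving_drives / (minutes * 60)


# ASSUMPTION: sustained copy rate for a single source/target pair.
ASSUMED_PAIR_RATE_MB_S = 50.0

share_pct = per_drive_share(179) * 100
raid10_hours = single_pair_copy_hours(1000, ASSUMED_PAIR_RATE_MB_S)
full_rate = implied_per_drive_rate_mb_s(1000, 179, 15)   # full 1 TB drive
half_rate = implied_per_drive_rate_mb_s(500, 179, 15)    # system 50% full

print(f"Share per drive: {share_pct:.2f}%")
print(f"Single-pair copy of 1 TB at the assumed rate: {raid10_hours:.1f} hours")
print(f"Per-drive rate for a 15-minute rebuild: {full_rate:.1f} MB/s (full), "
      f"{half_rate:.1f} MB/s (50% full)")
```

Under these assumptions, a 15-minute rebuild asks only a few MB/s of each drive, while a single source/target pair needs hours to move the same terabyte.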

I hope this sheds some light on how XIV’s RAID-X really works.  As with many new and creative approaches to old problems, there are a lot of misunderstandings, misinformation, and outright FUD in the marketplace concerning RAID-X.  I firmly believe that RAID-X is at least as reliable as any other mirroring technology and has further advantages, not all of which I have been able to include here.

Thank you for reading this rather lengthy post.  I look forward to continuing the conversation.

Posted in IBM, Storage.

8 Responses

  1. Techmute says

    If you’re talking a 1MB host side allocation, then yes, it is almost inconceivable that you’d lose both drives. But, given that all allocations are spread across the maximum number of drives possible (at 1MB chunks) and are greater than 1MB, isn’t it fairly likely you’ll lose both copies of a chunk?

    What percentage of XIV arrays are typically used? You cite 50% above, is that when most users hit the array’s IOP limit?

    What happens if you lose a backend switch?

  2. Techmute says

    I just realized I got the math on my last comment wrong (with regard to allocation size). I’ll either post a new comment this afternoon or else I’ll just post an update to my blog.

  3. K.T. Stevenson says

    @techmute Array utilization is all over the map. I arbitrarily chose 50% as an example to illustrate the intelligent rebuild process.

    If you lose a back-end switch, then internal bandwidth is constrained until it is repaired.

    I’ll be looking for your blog update.

  4. SRJ says

    Excellent overview. One statement, however, I do not think is accurate:

    “Unlike RAID-5, where the failure of any two drives in the array can cause data loss, RAID-10 and RAID-X require that a specific pair of drives fail. In the case of RAID-X, there is a much larger population of drives than a typical RAID-10 array can handle, so the probability of having the specific pair of failures required to lose data is even lower than RAID-10.”

    This is definitely not true for the reason you gave. If I lose a drive in a R10 array, I will only lose data if a very specific second drive fails. If I lose a drive in an XIV with RAID-X, I am guaranteed to lose data if ANY ONE of 168 other drives fails before the rebuild finishes. Using math alone (ignoring other factors such as exposure time, etc…) I’d say the odds of losing one of those 168 drives are greater than the odds of losing that one specific drive in a R10 array. In fact, I’d say using math alone (again, ignoring other factors), the probabilities of losing a second drive are actually much worse than RAID-5. As a matter of fact, it’s even worse than many RAID-5 sets with an LVM doing striping over them!

    So – I disagree with the statement as written. HOWEVER – I don’t think this is actually what matters, although the FUD masters will definitely latch onto this and try to make it seem worse than it actually is…and they have, and they will continue to do it.

    Regardless, the probability of losing drives irrespective of other factors is really not what’s important. Probability of losing DATA is what’s actually important, taking into account all the factors which affect the probability of losing data. This is where I think you should have made your point…because it is actually fairly strong in XIV’s favor. Well, you actually did make this point with some of your examples, but the way in which you wrote the point itself, isn’t actually true.

    I used the caveat above “using math alone (ignoring other factors)” because I think the probability of losing data is higher with RAID-10 (and RAID-5 without a doubt!) than with RAID-X on XIV. You already pointed out some of the obvious reasons for this (orders of magnitude less exposure time in non-redundant state, completely non-stressful rebuild time, etc…), but there are several more which are less talked-about.

    Gotta run…more later.

  5. Tony Pearson says

    Keith, great post! Yes, IBM XIV is very resilient against double drive failures. While a double drive failure has yet to cause any IBM XIV customer to lose data, if it ever were to happen, only a few GB of data would be inaccessible, and the files of the affected LUNs could be identified and recovered in less time than a RAID-5 rebuild. See my post for details here:

  6. Sam says

    SRJ, I think you misunderstood how the 1MB chunk is placed. It is kept only on 2 disks across different modules and not on all the disks. Hence, the chance of those 2 disks failing at the same time is far smaller than your argument about a second failure among all the available disks suggests.

  7. Aleef says

    @SAM: SRJ was actually right. I think you misunderstood how the data chunks are placed. Just imagine one drive failed and the mirrored data is spread across all the other drives. Pretty much any new failure will result in data loss. The only difference here is that you will not lose all the information that was on the failed drives, but only the data whose original AND mirrored copies were both on them.

  8. K.T. Stevenson says

    Absolutely simultaneous double drive failures are exceedingly rare in any device. Most “double drive failures” are actually a corrupt block discovered during rebuild.

    The key to any “RAID” scheme is to repair the damage before the second failure occurs. XIV is designed to: 1) keep a minimum level of redundancy at all times; 2) create additional redundancy when a drive begins to report errors; 3) only “rebuild” data that no longer has an acceptable level of redundancy.

    #3 is the big one when compared to RAID-5/6. In most cases, a failed XIV drive requires much less than a drive’s worth of data to be redistributed across the array. The whole machine goes into motion to accomplish this redistribution, so it’s pretty quick. Traditional RAID schemes are going to rebuild an entire drive capacity worth of bits, whether or not those bits actually need to be recovered. This process is limited by the number of drives in the array which is typically much smaller than an XIV.

    At any rate, this post is almost a year old now, so I’m closing it down for comments.