Storagebod’s musings on storage availability got me thinking about RAID (Redundant Array of Independent Disks) technologies and how they have evolved to handle ever larger drive and array sizes. RAID is, after all, a risk mitigation technique. Disk drives fail. Sometimes this is a pure mechanical failure. Other times, the drive media may develop bad sectors which render portions of the drive unreadable. In either case, data has been lost that must be recovered via redundant data stored within the array.
Historically, the most popular RAID technology has been RAID-5 (striping with distributed parity). RAID-5 performs well, has excellent storage efficiency, and is reliable enough for most common uses. RAID-1 (mirroring) and RAID-10 (mirroring and striping) are also common and have typically been used where RAID-5 does not perform well enough. RAID-1/10 are also considered to be more resilient to failure than RAID-5. This additional performance and resiliency come at the expense of greatly reduced storage efficiency.
Recently, RAID-5 has been showing its age. As drive sizes have become ever larger, the amount of time required to reconstruct the data from a failed drive has increased as well. This has led to uncomfortably long periods of time where a single bad sector discovered during array reconstruction can wipe out an entire RAID array. Statistically speaking, RAID-5 still seems to be working well for enterprise Fibre Channel drives, but I have become uncomfortable with RAID-5 arrays constructed from large SATA drives. (I define large as 500+ GB.) I expect my discomfort to increase as drive sizes continue to grow.
RAID-6 (striping with two independently calculated parity values) is one possible solution to the problem of data integrity exposure during array reconstruction. With RAID-6, three things (as opposed to two with RAID-5) have to go wrong before data is lost. This is dramatically more reliable than RAID-5, and still much more efficient than RAID-1. RAID-6, however, is not a perfect solution.
The problem with RAID-6 is that most implementations are slow. The additional I/O operations and parity calculations required by RAID-6 impose a significant performance penalty on write operations. Clever implementations such as Intelligent Write Caching (available in DS8000 R4.2+), or the hybrid WAFL/RAID-DP approach taken by Data ONTAP (available in NetApp FAS and IBM N series arrays), significantly reduce the performance penalty of RAID-6. In fact, DS8000 Intelligent Write Caching makes RAID-6 arrays on the DS8000 perform almost as well as pre-Intelligent Write Caching RAID-5 arrays.
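The write penalty described above can be put in rough numbers using the textbook small-write I/O counts for each RAID level. This is a sketch, not a measurement of any particular array, and the 75 IOPS-per-drive figure is an assumption chosen for illustration:

```python
# Classic small-write penalty per host write (read-modify-write):
#   RAID-1/10: write data twice                               = 2 I/Os
#   RAID-5:    read data + read parity + write both           = 4 I/Os
#   RAID-6:    as RAID-5, plus a second parity read and write = 6 I/Os
penalties = {"RAID-1/10": 2, "RAID-5": 4, "RAID-6": 6}

iops_per_drive = 75   # assumed figure for a 7,200 RPM SATA drive
drives = 8            # assumed small array for illustration

for level, penalty in penalties.items():
    # Random-write IOPS the host sees once every back-end I/O is counted
    host_write_iops = drives * iops_per_drive / penalty
    print(f"{level}: ~{host_write_iops:.0f} host write IOPS")
```

The point of techniques like Intelligent Write Caching is to collapse those back-end I/Os by coalescing writes in cache, which is why a well-implemented RAID-6 can approach naive RAID-5 performance.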
So what about XIV? XIV uses a completely different storage scheme named RAID-X. RAID-X is a radical re-think of the way we store data in an array. It is a hybrid of mirroring, massive parallelism, and dynamic balancing of system resources. RAID-X’s goals are simple: make commodity level hardware perform at enterprise levels, make storage administration dramatically simpler, provide consistent performance in the face of wildly varying I/O requirements, and be able to seamlessly adapt to ever increasing hard drive sizes.
There’s a lot of misinformation about how RAID-X works. In an attempt to clear matters up, I offer the following.
A fully populated XIV frame contains 15 modules. Each module contains 12 disk drives, cache, and processor resources. This gives us a total of 180 disk drives. Today’s shipping XIVs use 1 TB SATA disk drives, so the raw capacity of the frame is approximately 180 TB. (In reality, it is a bit smaller since a 1 TB drive doesn’t actually hold 1 TB of data, but that is a topic for another post.) All data is mirrored internally and some space is set aside for spare capacity and system metadata. This gives a usable capacity of 79 TB per fully populated XIV frame.
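The capacity arithmetic above can be sketched quickly. The roughly 11 TB reserved for spare capacity and metadata is inferred from the figures in this post, not taken from XIV documentation:

```python
# Back-of-the-envelope XIV capacity sketch (decimal TB throughout).
MODULES = 15
DRIVES_PER_MODULE = 12
DRIVE_TB = 1.0                       # 1 TB SATA drives

drives = MODULES * DRIVES_PER_MODULE  # 180 drives
raw_tb = drives * DRIVE_TB            # ~180 TB raw
mirrored_tb = raw_tb / 2              # mirroring halves it: 90 TB
reserved_tb = mirrored_tb - 79        # spares + metadata (inferred): ~11 TB
usable_tb = mirrored_tb - reserved_tb # 79 TB usable, per the post

print(f"{drives} drives, {raw_tb:.0f} TB raw, {usable_tb:.0f} TB usable")
```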
When data comes into XIV, it is divided into 1 MB “partitions”. Each partition is allocated to a drive using a pseudo-random distribution algorithm. A duplicate copy of each partition is written to another drive with the requirement that the copy not reside within the same module as the original. This protects against a module failure. A global distribution table tracks the location of each partition and its associated duplicate. When a failure occurs, the system knows exactly which partitions are no longer protected and immediately begins creating new copies to restore redundancy. This is where the parallelism of the design comes into play. The entire machine goes to work re-creating the missing redundancy, so very little work has to be done by any one component. This allows XIV to rebuild a failed 1 TB drive in minutes as opposed to the hours it would take in traditional RAID implementations.
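A minimal sketch of the placement rule just described. The real distribution algorithm is proprietary, so the random placement and the (module, drive) coordinates here are stand-ins; the only part taken from the post is the constraint that a copy never lands in the same module as its original:

```python
import random

MODULES = 15
DRIVES_PER_MODULE = 12

def place_partition(rng):
    """Pick a (module, drive) location for a partition, plus a mirror
    location that must fall in a different module than the primary."""
    primary = (rng.randrange(MODULES), rng.randrange(DRIVES_PER_MODULE))
    while True:
        mirror = (rng.randrange(MODULES), rng.randrange(DRIVES_PER_MODULE))
        if mirror[0] != primary[0]:   # never the same module as the primary
            return primary, mirror

rng = random.Random(42)  # seeded for reproducibility
# A toy "global distribution table": partition id -> (primary, mirror)
table = {pid: place_partition(rng) for pid in range(10_000)}
```

Because the table records both locations for every partition, a drive failure immediately yields the exact set of partitions that lost redundancy: scan the table for entries whose primary or mirror sat on the failed drive.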
The most common FUD point raised against RAID-X is that it is vulnerable to a double-drive failure: since data is spread across the entire machine, the failure of any two drives will supposedly cause data loss. While this makes a great talking point on a competitive slide, it is simply not the case. Allow me to explain.
As I pointed out above, RAID is a risk mitigation technique. The most common ways to mitigate the risk of data loss are to decrease the probability that a critical failure combination can occur, and/or decrease the window of time where there is insufficient redundancy to protect against a second failure. RAID-6 takes the former approach. RAID-10 and RAID-X combine both approaches. Both RAID-10 and RAID-X reduce the probability that a critical combination of failures will occur by keeping two copies of each bit of data. Unlike RAID-5, where the failure of any two drives in the array can cause data loss, RAID-10 and RAID-X require that a specific pair of drives fail. In the case of RAID-X, the population of drives is much larger than a typical RAID-10 array can handle, so the probability of hitting the specific pair of failures required to lose data is even lower than RAID-10.
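The combinatorics behind this argument can be illustrated. This sketch treats the drives as strictly paired mirrors, which oversimplifies RAID-X's partition-level mirroring, but it shows why a larger drive population shrinks the fraction of two-drive combinations that are critical:

```python
from math import comb

def critical_pair_fraction_raid5(n):
    # RAID-5: losing any second drive in the group during rebuild
    # loses data, so every two-drive combination is critical.
    return comb(n, 2) / comb(n, 2)   # always 1.0

def critical_pair_fraction_mirror(n):
    # Paired mirroring: only the n/2 mirrored pairs are critical,
    # out of C(n, 2) possible two-drive combinations.
    # This simplifies to 1 / (n - 1).
    return (n // 2) / comb(n, 2)

print(critical_pair_fraction_mirror(8))    # 8-drive RAID-10: ~0.143
print(critical_pair_fraction_mirror(180))  # 180-drive pool:  ~0.0056
```

Under this simplified model, a 180-drive pool has roughly a 25x smaller fraction of critical two-drive combinations than an 8-drive RAID-10, which is the direction of the author's argument even if the real RAID-X math differs in detail.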
Another difference between RAID-X and RAID-10 is the window of vulnerability between the time a drive fails and the time when full redundancy is restored. RAID-10 has to copy the entire contents of the surviving member of the pair to a spare drive. The copy time is directly proportional to the size of the drive, since a single source drive copies to a single target drive. While much faster than RAID-5, it can still take a while to copy a 1 TB drive. The limiting factor is the transfer rate supported by that single source/target pair.
RAID-X operates differently. When a drive fails, the global distribution table immediately knows the locations of all data that is no longer redundant. Because copies never share a module with their originals, this non-redundant data is evenly distributed among the 168 drives outside the failed drive's module. The system immediately goes to work creating redundant copies of all exposed data. This is done as a fully parallel, any-to-any operation with all 179 surviving drives participating, so each drive carries out only about 0.6% of the effort required to restore redundancy. In addition, and this is key, only partitions that actually contain user data need to be copied. If the system is only 50% full, RAID-X only needs to copy 500 GB worth of data to fully recover from a failed drive. Contrast this to RAID-10, which has to copy the entire drive regardless of the amount of user data actually stored on it. Between the inherent parallelism of the design and the intelligence of the copy process, RAID-X can completely recover from a failed 1 TB drive in as little as 15 minutes.
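An idealized model makes the rebuild-time difference concrete. The 50 MB/s per-drive copy rate is an assumption, and real arrays throttle rebuild I/O to protect host workloads, so treat the numbers as relative rather than absolute:

```python
def rebuild_minutes(data_gb, participating_drives, mb_per_s_per_drive):
    """Idealized rebuild time: the copy workload is spread evenly
    across all participating drives, with no throttling or contention."""
    total_mb = data_gb * 1000            # decimal GB -> MB
    per_drive_mb = total_mb / participating_drives
    return per_drive_mb / mb_per_s_per_drive / 60

COPY_RATE = 50  # MB/s per drive -- assumed, not a measured figure

# RAID-10 style: one source copies a whole 1 TB drive to one spare.
raid10 = rebuild_minutes(1000, 1, COPY_RATE)      # ~333 minutes
# RAID-X style: 500 GB of live data (50% full system), with roughly
# 170 drives sharing the copy work in parallel.
raidx = rebuild_minutes(500, 170, COPY_RATE)      # ~1 minute

print(f"RAID-10: {raid10:.0f} min, RAID-X model: {raidx:.1f} min")
```

The model lands well under the post's 15-minute figure, which is consistent with real systems pacing rebuild I/O rather than running drives flat out; the point is the orders-of-magnitude gap, driven by both the parallelism and the copy-only-live-data optimization.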
I hope this sheds some light on how XIV’s RAID-X really works. As with many new and creative approaches to old problems, there are a lot of misunderstandings, misinformation, and outright FUD in the marketplace concerning RAID-X. I firmly believe that RAID-X is at least as reliable as any other mirroring technology and has further advantages, not all of which I have been able to include here.
Thank you for reading this rather lengthy post. I look forward to continuing the conversation.