

Moving On…

I’m preparing to change the focus of this blog.  There is a lot that can be said about enterprise storage, but for the most part, the topic is being well covered by other people.  I’ve come to the realization that while storage system design is a fascinating topic, storage is a means to an end and not the end in itself.  Please don’t misunderstand me.  I remain passionate about the DS8000 and its capabilities.  I know it better than any other device I have worked with in my 14 years as an IT professional.  It’s an amazing machine.  The DS8000 is only now beginning to realize the capabilities inherent in its design.  The best is yet to come.  IBM may not be able to out-market its competitors, but we out-engineer them at every turn.

I am not entirely certain what my new focus will be.  It will certainly be broader than just storage.  I am going to take a couple of weeks off to decide what direction to take.

I am also leaving IBM.

I will be back in a couple of weeks with a new focus, a new employer, and new topics to discuss.  I hope to see all of you then.

Posted in Other.


RAID in the 21st Century

Storagebod’s musings on storage availability got me thinking about RAID (Redundant Array of Independent Disks) technologies and how they have evolved to handle ever larger drive and array sizes.  RAID is, after all,  a risk mitigation technique.  Disk drives fail.  Sometimes this is a pure mechanical failure.  Other times, the drive media may develop bad sectors which render portions of the drive unreadable.  In either case, data has been lost that must be recovered via redundant data stored within the array.

Historically, the most popular RAID technology has been RAID-5 (striping with distributed parity).  RAID-5 performs well, has excellent storage efficiency, and is reliable enough for most common uses.  RAID-1 (mirroring) and RAID-10 (mirroring and striping) are also common and have typically been used where RAID-5 does not perform well enough.  RAID-1/10 are also considered to be more resilient to failure than RAID-5.  This additional performance and resiliency comes at the expense of greatly reduced storage efficiency.

Recently, RAID-5 has been showing its age.  As drive sizes have become ever larger, the amount of time required to reconstruct the data from a failed drive has increased as well.  This has led to uncomfortably long periods of time where a single bad sector discovered during array reconstruction can wipe out an entire RAID array.  Statistically speaking, RAID-5 still seems to be working well for enterprise Fibre Channel drives, but I have become uncomfortable with RAID-5 arrays constructed from large SATA drives.  (I define large as 500+ GB.)  I expect my discomfort to increase as drive sizes continue to grow.
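
To put a rough number on that exposure, here is a minimal Python sketch of the chance of hitting at least one unreadable sector while rebuilding a RAID-5 array.  The 8-drive array width and the unrecoverable read error rate of 1 in 10^14 bits are illustrative assumptions, not figures from this post or from any particular drive's data sheet, and the model treats read errors as independent.

```python
import math

def rebuild_failure_probability(drive_gb, drives_in_array, ure_per_bit=1e-14):
    """Chance of at least one unrecoverable read error during a RAID-5 rebuild.

    A rebuild must read every surviving drive in full.  ure_per_bit is an
    assumed error rate (hypothetical; check your own drive's data sheet).
    """
    surviving_drives = drives_in_array - 1
    bits_read = surviving_drives * drive_gb * 1e9 * 8
    # 1 - (1 - p)^n, computed with log1p/expm1 to stay accurate for tiny p
    return -math.expm1(bits_read * math.log1p(-ure_per_bit))

for size_gb in (146, 500, 1000):
    p = rebuild_failure_probability(size_gb, drives_in_array=8)
    print(f"{size_gb:>5} GB drives: {p:.1%} chance of a read error during rebuild")
```

Under these assumptions the risk climbs steadily with drive size, which is the trend that makes large-drive RAID-5 arrays uncomfortable.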

RAID-6 (striping with two independently calculated parity values) is one possible solution to the problem of data integrity exposure during array reconstruction.  With RAID-6, three things (as opposed to two with RAID-5) have to go wrong before data is lost.  This is dramatically more reliable than RAID-5, and still much more efficient than RAID-1.  RAID-6, however, is not a perfect solution.

The problem with RAID-6 is that most implementations are slow.  The additional I/O operations and parity calculations required by RAID-6 impose a significant performance penalty on write operations.  Clever implementations such as Intelligent Write Caching (available in DS8000 R4.2+), or the hybrid WAFL/RAID-DP approach taken by Data ONTAP (available in NetApp FAS and IBM N series arrays) significantly reduce the performance penalty of RAID-6.  In fact, DS8000 Intelligent Write Caching makes RAID-6 arrays on the DS8000 perform almost as well as pre-Intelligent Write Caching RAID-5 arrays.

So what about XIV?  XIV uses a completely different storage scheme named RAID-X.  RAID-X is a radical re-think of the way we store data in an array.  It is a hybrid of mirroring, massive parallelism, and dynamic balancing of system resources.  RAID-X’s goals are simple: make commodity level hardware perform at enterprise levels, make storage administration dramatically simpler, provide consistent performance in the face of wildly varying I/O requirements, and be able to seamlessly adapt to ever increasing hard drive sizes.

There’s a lot of misinformation about how RAID-X works.  In an attempt to clear matters up, I offer the following.

A fully populated XIV frame contains 15 modules.  Each module contains 12 disk drives, cache, and processor resources.  This gives us a total of 180 disk drives.  Today’s shipping XIVs use 1 TB SATA disk drives, so the raw capacity of the frame is approximately 180 TB.  (In reality, it is a bit smaller since a 1 TB drive doesn’t actually hold 1 TB of data, but that is a topic for another post.)  All data is mirrored internally and some space is set aside for spare capacity and system metadata.  This gives a usable capacity of 79 TB per fully populated XIV frame.
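
The capacity arithmetic above can be checked with a few lines of Python.  This is only a sketch using the nominal figures quoted in this post; how the leftover space divides between spares, metadata, and the gap between nominal and real drive capacity is not stated here, so it is reported as a single lump.

```python
MODULES = 15
DRIVES_PER_MODULE = 12
DRIVE_TB = 1          # nominal capacity per drive
USABLE_TB = 79        # usable capacity quoted in the post

drives = MODULES * DRIVES_PER_MODULE     # total drives in a full frame
raw_tb = drives * DRIVE_TB               # nominal raw capacity
mirrored_tb = raw_tb / 2                 # every partition is stored twice
overhead_tb = mirrored_tb - USABLE_TB    # spares, metadata, nominal-vs-real gap

print(f"{drives} drives, {raw_tb} TB raw, {mirrored_tb:.0f} TB after mirroring")
print(f"{overhead_tb:.0f} TB set aside, leaving {USABLE_TB} TB usable")
```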

When data comes in to XIV, it is divided into 1 MB “partitions”.  Each partition is allocated to a drive using a pseudo-random distribution algorithm.  A duplicate copy of each partition is written to another drive with the requirement that the copy not reside within the same module as the original.  This protects against a module failure.  A global distribution table tracks the location of each partition and its associated duplicate.  When a failure occurs, the system knows exactly which partitions are no longer protected and immediately begins creating new copies to restore redundancy.  This is where the parallelism of the design comes into play.  The entire machine goes to work re-creating the missing redundancy, so very little work has to be done by any one component.  This allows XIV to rebuild a failed 1 TB drive in minutes as opposed to the hours it would take in traditional RAID implementations.
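
The placement rule and the distribution table can be sketched in Python as follows.  This is an illustration of the idea only: the actual XIV distribution algorithm is not described in this post, so Python's `random` module stands in for it, and the function names are hypothetical.

```python
import random

MODULES = 15
DRIVES_PER_MODULE = 12

def place_partition(rng):
    """Pick (module, drive) slots for a partition and its copy in different modules."""
    primary_module, mirror_module = rng.sample(range(MODULES), 2)  # always distinct
    primary = (primary_module, rng.randrange(DRIVES_PER_MODULE))
    mirror = (mirror_module, rng.randrange(DRIVES_PER_MODULE))
    return primary, mirror

def build_distribution_table(num_partitions, seed=0):
    """Stand-in for the global distribution table: one (primary, mirror) pair per partition."""
    rng = random.Random(seed)
    return [place_partition(rng) for _ in range(num_partitions)]

def exposed_partitions(table, failed_drive):
    """Return (partition index, surviving copy) for every partition that lost a copy."""
    exposed = []
    for index, (primary, mirror) in enumerate(table):
        if primary == failed_drive:
            exposed.append((index, mirror))
        elif mirror == failed_drive:
            exposed.append((index, primary))
    return exposed

table = build_distribution_table(20000)
exposed = exposed_partitions(table, failed_drive=(3, 7))
sources = {survivor for _, survivor in exposed}
print(f"{len(exposed)} partitions exposed, surviving copies spread over {len(sources)} drives")
```

Because the surviving copies are scattered across many drives, the re-copy work is naturally spread out rather than funneled through a single source drive.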

The most common FUD point raised against RAID-X is that it is vulnerable to a double-drive failure and since data is spread across the entire machine, the failure of any two drives will cause data loss.  While this makes a great talking point on a competitive slide, it is simply not the case.  Allow me to explain.

As I pointed out above, RAID is a risk mitigation technique.  The most common ways to mitigate the risk of data loss are to decrease the probability that a critical failure combination can occur, and/or decrease the window of time where there is insufficient redundancy to protect against a second failure.  RAID-6 takes the former approach.  RAID-10 and RAID-X take the combination approach.  Both RAID-10 and RAID-X reduce the probability that a critical combination of failures will occur by keeping two copies of each bit of data.  Unlike RAID-5, where the failure of any two drives in the array can cause data loss, RAID-10 and RAID-X require that a specific pair of drives fail.  In the case of RAID-X, there is a much larger population of drives than a typical RAID-10 array can handle, so the probability of having the specific pair of failures required to lose data is even lower than RAID-10.

Another differentiation between RAID-X and RAID-10 is the window of vulnerability between the time a drive fails, and the time when full redundancy is restored.  RAID-10 has to copy the entire contents of the surviving member of the pair to a spare drive.  The copy process is directly proportional to the size of the drive since only one source volume is copying to one target volume.  While much faster than RAID-5, it can still take a while to copy a 1 TB drive.  The limiting factor is the transfer rate supported by the single source and target volume pair.

RAID-X operates differently.  When a drive fails, the global distribution table immediately knows the locations of all data that are no longer redundant.  This non-redundant data is evenly distributed among 168 drives (the 12 drives in each of the other 14 modules, since a copy never shares a module with its original).  The system immediately goes to work creating redundant copies of all exposed data.  This is done as a fully parallel, any-to-any operation with all surviving 179 drives participating.  Each drive only has to carry out 0.6% of the effort required to restore redundancy.  In addition, and this is key, only partitions that actually contain user data need to be copied.  If the system is only 50% full, RAID-X only needs to copy 500 GB worth of data to fully recover from a failed drive.  Contrast this with RAID-10, which has to copy the entire drive regardless of the amount of user data actually stored on the drive.  Between the inherent parallelism of the design and the intelligence of the copy process, RAID-X can completely recover from a failed 1 TB drive in as little as 15 minutes.
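
The figures above follow from simple arithmetic, sketched here in Python.  Like the paragraph above, the sketch treats the work as evenly split across the 179 surviving drives; it does not model transfer rates, so it says nothing about elapsed time.

```python
def raidx_rebuild(drive_gb=1000, fill_fraction=0.5, total_drives=180):
    """Data to re-copy after one drive fails, and each survivor's share of the work."""
    surviving = total_drives - 1
    data_gb = drive_gb * fill_fraction   # only partitions holding user data are copied
    share = 1 / surviving                # fraction of the effort per surviving drive
    per_drive_gb = data_gb / surviving
    return data_gb, share, per_drive_gb

def raid10_rebuild_gb(drive_gb=1000):
    """RAID-10 copies the whole surviving mirror, regardless of how full it is."""
    return drive_gb

data_gb, share, per_drive_gb = raidx_rebuild()
print(f"RAID-X:  {data_gb:.0f} GB total, {share:.1%} of the effort "
      f"({per_drive_gb:.1f} GB) per surviving drive")
print(f"RAID-10: {raid10_rebuild_gb()} GB through a single source/target pair")
```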

I hope this sheds some light on how XIV’s RAID-X really works.  As with many new and creative approaches to old problems, there are a lot of misunderstandings, misinformation, and outright FUD in the marketplace concerning RAID-X.  I firmly believe that RAID-X is at least as reliable as any other mirroring technology and has further advantages, not all of which I have been able to include here.

Thank you for reading this rather lengthy post.  I look forward to continuing the conversation.

Posted in IBM, XIV.



Happy Holidays!

May you and yours have a joyous holiday season!  I’ll be back in early 2010.
Creative Commons License photo credit: kodamatic

Posted in Other.


Another Take on ‘Going to 11’

I’m not going to attempt to take credit for starting a resurgence of the Spinal Tap “our amps go to 11” meme, but since I referenced it in Inside the DS8700 Part 2, I’ve seen it in quite a few new places.  Perhaps I’m just sensitized to it now.  The latest place I’ve seen it is on Randall Munroe’s brilliant xkcd webcomic.  (Fair warning, Randall’s sense of humor is a little unusual and some of his comics may be considered by some to be unsafe for work.)

Click to be taken to the punchline


I almost never fail to be amused by Randall’s work, but this one almost caused me to choke on my coffee!  It’s very, VERY, good.  It’s also all too true in the technology industry.  For those of you who aren’t familiar with xkcd, I hope you enjoy it as much as I do.

Posted in Amusing.



Inside the DS8700 Part 3 – Summary and Wrap-up

I’ve spent the past few weeks describing some of the technological underpinnings of IBM’s new DS8700.  In Part 1, I discussed the new POWER6 based processor complexes.  In Part 2, I examined the move from RIO-G buses to a PCI Express fabric.  Today, I am going to wrap up this series with a summary and some odds-and-ends that didn’t fit under any of the other topics.

Being Green

To get started, let’s talk about being green.  As companies have realized that being green (environmental) can directly translate to having more green ($$$), IT departments have come under scrutiny.  Let’s face it.  IT is a power-hungry activity.  After all, it’s no coincidence that more and more datacenters are being built next to power plants.  Customers have begun looking at metrics like work per watt, capacity per watt, and other measurements of power efficiency.  It’s no longer enough to be fast.  Efficiency is also a requirement.

The DS8700 takes advantage of the energy efficient design of the POWER6 processor to deliver highly efficient performance.  As an example, the DS8700 is capable of delivering 10 IOPS/watt using traditional spinning disk.  This is over 50% more IOPS/watt than the DS8300, which was already quite efficient.  Install solid state drives into the DS8700 and this number jumps even higher.  This makes the DS8700 an attractive consolidation vehicle for older, less energy efficient storage devices.  Going green to save green couldn’t be easier.
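
As a quick sanity check on those efficiency numbers, the short Python sketch below works backward from the two figures quoted above.  The DS8300 value it prints is only an upper bound implied by "over 50% more", not a published specification.

```python
ds8700_iops_per_watt = 10.0
minimum_improvement = 0.50   # "over 50% more IOPS/watt than the DS8300"

# If the DS8700 is more than 50% better, the DS8300 must sit below this value.
ds8300_upper_bound = ds8700_iops_per_watt / (1 + minimum_improvement)

print(f"Implied DS8300 efficiency: below {ds8300_upper_bound:.2f} IOPS/watt")
```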

New Management Interface

As with prior releases, the R5 microcode includes enhancements to the DS8000 management GUI.  The DS8000 line has always been a customer configurable device.  There never has been a requirement to contract a vendor engineer to come and configure your device for you.  Starting in R3, IBM began a re-work of the GUI to make the configuration process faster and more intuitive.  The R5 GUI contains new visualizations that make it easier to see the relationships between logical constructs and the underlying physical hardware.  It also contains a new real-time performance graph to help storage administrators see what is going on under the covers of the machine.

DS8000 R5 Real-time Performance View


DS8000 R5 Hardware Visualizer


Summary

To summarize, I’m going to quote from IBM’s DS8700 announcement presentation.

The DS8700 announcement introduces the most advanced model in IBM’s DS8000 lineup with up to over a 150% boost in performance.  This new hardware refresh not only offers much higher performance, it also builds on the DS8000’s unrivaled reputation for reliability and investment protection by maintaining its IBM POWER-based architecture over generations of new models.

This release underscores the commitment to our flagship enterprise disk platform and enables us to continue providing an ideal combination of optimized performance, scalability, reliability, and value that our most demanding customers expect from IBM.

That sums it up rather nicely, doesn’t it?

Posted in DS8000, IBM.



Inside the DS8700 Part 2 – PCI Express

Welcome back to this behind-the-scenes look at the DS8700.  In Part 1, I examined the POWER6 based processor complexes that form the heart of the DS8700.  Today, I’m looking at the PCI Express gen2 I/O fabric that makes up the backbone of the most advanced version of IBM’s flagship enterprise disk product.

There are many design decisions made during the development of a storage subsystem.  One of the most fundamental is the interconnect topology used to connect all the components in the machine.  The DS8100 and DS8300 use a high-speed bus known as a RIO-G loop to connect the processor complexes and the PCI-X based I/O towers.  This has been a very successful design (as thousands of our customers can attest), but for the DS8700 we wanted more.  To borrow a phrase from Spinal Tap, we wanted the DS8700 to go to ‘11’.

We still have a RIO-G loop in the DS8700, but it is only used to connect the two POWER6 processor complexes together for synchronization and control purposes.  The big change is in how we connect the processor complexes to the I/O towers that contain the back-end disk adapters and the front-end host adapters.  For these connections, the DS8700 has replaced the RIO-G loops with a fabric of point-to-point PCI Express connections.  Each I/O tower has a dedicated 2 GB/s connection to each of the processor complexes.  This translates into a significant increase in the amount of data throughput we can sustain with the DS8700.

Making the move to PCI Express has brought more than increased performance to the DS8700.  It has also allowed us to further raise the bar in terms of reliability.  PCI Express adapters are intelligent devices.  Transient bit or CRC errors that can freeze other I/O technologies are caught in the PCI Express adapter and handled by the adapter itself.  Persistent errors can be dealt with by gracefully degrading the data transmission speed and notifying the processor complex of the problem.  By using smarter adapters that can self-heal, we add another layer of reliability to an already highly available system.

In my next article on DS8700 internals, I’ll be stepping away from the hardware and taking a look at some of the microcode enhancements in the R5 code that powers the DS8700.

Update 1:  Added link to YouTube clip that illustrates taking things to ‘11’.  Thanks David!

Posted in DS8000, IBM.



Inside the DS8700 Part 1 – POWER6

Power6 CPU


This is the first in a series of posts taking a behind-the-scenes look at the new DS8700.  Today, I’m taking a look at the POWER6 based processor complexes that make up the heart of the DS8700.

One of the strengths of the DS8000 platform is the use of IBM’s industry leading POWER Systems server technology as the foundation of the DS8000.  From the RS64 processors in the IBM ESS, to the POWER5 processors in the DS8100 and DS8300, to the POWER6 processors in the DS8700, having a fully integrated processor and memory subsystem has been key to delivering the highest levels of balanced throughput and performance in a disk subsystem.  The POWER6 processor has been a runaway hit in our server product line, and the DS8700 benefits from two years of real-world acceptance (and success!) as well as operational experience.  We’re also well positioned to take advantage of future advances in the POWER processor line without needing to make drastic changes to the fundamental architecture of the DS8000.

Here are some vital statistics of the DS8700 central electronics complexes, or CECs, as we like to call them.

  • Dual IBM 4.7 GHz POWER6 based controllers, available with either 2 or 4 CPUs per controller.
  • Up to 128 GB of system memory for cache and NVS available in the 2-way controllers.
  • Up to 384 GB of system memory for cache and NVS available in the 4-way controllers.
  • PCI Express Generation 2 internal I/O fabric (Covered in Part 2 of this series.)

In addition to the bigger/faster/stronger aspects of the move to POWER6, there are advances in server design that also come into play with the DS8700.  This is particularly true in the areas of Reliability, Availability, and Serviceability (RAS).  For example, each POWER6 processor has an internal processor recovery unit.  Before a machine instruction is dispatched to any of the POWER6’s nine(!) execution units, the recovery unit takes a snapshot of the processor state.  Should a fault be detected during the execution of that machine instruction, the processor state can be recovered and the instruction retried.  Should faults continue to be detected, we can use the recovery snapshot to re-create the processor state on another processor in the system, and execute the instruction there.  If necessary, the system can then dynamically de-allocate the failing components and schedule a support call, all without affecting access to data!  In fact, IBM torture tested this design by irradiating an operating POWER6 system with a high-energy proton beam while measuring the processor error recovery activities.  Shooting your DS8700 with a particle beam will void your warranty (and is certainly not recommended), but it’s nice to know that our engineers take their testing so seriously (and have access to seriously interesting test equipment).
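
The snapshot, retry, and migrate sequence described above can be pictured with a small software analogy in Python.  This is purely conceptual: the real recovery unit is hardware inside the processor, and every name here (`TransientFault`, `run_with_recovery`, the `max_retries` parameter) is hypothetical.

```python
import copy

class TransientFault(Exception):
    """Stand-in for a hardware fault detected during execution (hypothetical)."""

def run_with_recovery(instruction, state, processors, max_retries=1):
    """Run instruction(state, cpu); on a fault, restore the snapshot and retry or migrate.

    processors is ordered: the first entry is the original processor, the
    rest are alternates that can take over from the same snapshot.
    """
    checkpoint = copy.deepcopy(state)            # snapshot taken before dispatch
    for cpu in processors:
        for _ in range(max_retries + 1):
            working = copy.deepcopy(checkpoint)  # always restart from the snapshot
            try:
                return instruction(working, cpu), cpu
            except TransientFault:
                continue                         # retry here, then move to the next cpu
    raise RuntimeError("no healthy processor could complete the instruction")
```

The design point the analogy tries to capture is that a fault never leaves half-finished state behind, because every attempt starts from the same snapshot.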

Before I forget… these new POWER6 resiliency features are in addition to Chipkill memory, redundant bit steering, spare cache lines, and advanced predictive failure analysis algorithms that have been carried forward from the DS8100 and DS8300.  All of this together  helps the DS8700 to deliver greater than five-nines availability.
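
For a sense of scale, five-nines availability translates into only a few minutes of downtime per year.  The Python sketch below does the conversion; the 365.25-day year is a simplifying assumption.

```python
def allowed_downtime_minutes(availability, days_per_year=365.25):
    """Minutes of downtime per year permitted at a given availability level."""
    return (1 - availability) * days_per_year * 24 * 60

for label, availability in (("three nines", 0.999),
                            ("four nines", 0.9999),
                            ("five nines", 0.99999)):
    minutes = allowed_downtime_minutes(availability)
    print(f"{label}: {minutes:.2f} minutes of downtime per year")
```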

So that’s a brief look at the new POWER6 processors in the DS8700.  Next, I’ll be taking a look at the new PCI-Express Generation 2 interconnect fabric in the DS8700.  Post timing will be dependent upon my workload as I’m a field guy with a large territory to support.

UPDATE:  Looks like we have a new DS8700 press release for your enjoyment.

Update 2:  Fixed a fat-finger error in the system memory size for 2-way controllers.

Posted in DS8000, IBM.



Announcement Day – October 20th Edition

Dynamic Infrastructure


It’s Announcement Day at IBM and today is a big one!  Today’s announcements are all about IBM’s Dynamic Infrastructure initiatives.  For those who’ve missed it, Dynamic Infrastructure is all about building a smarter, more flexible, more efficient, and more cost-effective computing foundation to enable the next generation of information management.  It’s worth taking a look at the DI and Smarter Planet websites.  There are some cool ideas there that are worthy of consideration.

Today is a mega-announcement day.  There are over a dozen separate hardware and software announcements today.  The last time I saw an announcement this big was when we rolled out self-encrypting disks on the DS8000 earlier this year.  There’s a lot to cover, and I’ll be writing a series of posts this week covering the highlights of the storage hardware portions of the announcement.

In a nutshell, here are the new storage hardware announcements from IBM:

  • DS8700 – The next generation of the DS8000 has arrived.  With POWER6 processors, a new internal I/O topology, significantly improved performance, and improvements in Reliability, Availability, and Serviceability (RAS), it’s the right box for our most demanding customers.
  • N series – SAS drives, high-density storage expansions, SnapManager for Hyper-V, and native FCoE support.
  • DS5000/DS4000 – 600 GB fibre channel disk drives are now available.

Lots going on, and lots of reading to do.  Check back later this week for more details on the announcements.

Posted in Announcements, IBM.



Back from Vacation

South Arch at Twin Arches


I’ve just returned from a much needed vacation at the Big South Fork National River and Recreation Area.  For those who haven’t heard of it, BSFNRRA is 125,000 acres of wilderness straddling the Kentucky/Tennessee border in the eastern part of both states.  It’s full of sandstone arches, deep river gorges, historic sites, hiking trails, and wildlife.

I spent three days in the backcountry with my family.  We got a little wet and muddy, but a good time was had by all.  Now I have to start digging out from all of the email that piled up while I was away.  Busy week this week!

Posted in Other.



Announcement Day – October 6th Edition

It’s Tuesday, and that means that it’s time for IBM product announcements.  October is a big month for storage announcements, as is evidenced by the number of online training sessions I’ve been instructed to attend.  Today’s announcements are just the start.

So what’s new this week?  Five new DS5000 enhancements are now available for order.

  1. Dual 1 Gigabit Ethernet iSCSI host ports for the DS5100 and DS5300.  This is in addition to the existing 4 gigabit Fibre Channel and 8 gigabit Fibre Channel connectivity options.  With the addition of iSCSI, the DS5000 now has the ability to fit just about any host connectivity need.
  2. 32 GB & 64 GB cache memory options for the DS5100 and DS5300.  These are available as both initial order and field upgrade options.
  3. 73 GB Solid State Drives (SSDs).  A 73 GB high-performance SSD option is now available for the EXP5000 storage expansion.  SSDs are supported on both the DS5100 and DS5300.
  4. Attachment support for up to 448 disk drives on the DS5100.  The DS5100 is now capable of supporting 448 drives (up from the prior limit of 256 drives).  Drives can be a mixture of SSD, Fibre Channel, and SATA.
  5. High-bandwidth OM3 fiber optic cable option.  These are 10m host-side OM3 fiber optic cables with LC/LC connectors.

All of these enhancements are planned to be available on October 16, 2009 with the exception of the 64 GB cache option.  The 64 GB cache option has a planned availability date of December 21, 2009.

For more information on these announcements or any other IBM product, please contact your local IBM representative.

Posted in Announcements, IBM.
