

Getting the Word Out – Creative Uses of Technology

Traditionally, getting the word out was one of the largest challenges in emergency management. Now our nearly ubiquitous communications devices give us a multitude of ways to both send and receive information updates during emergencies. (So long as the infrastructure holds up, but that is a topic for other times.)

There is now so much information out there that the problem has shifted to organizing and processing it all. Over the past few months, I’ve seen a marked increase in the creative use of technology and social media to both get the word out and help people organize the flood of information that is coming in.

One of the oldest initiatives is Google’s Crisis Response project. They produce the excellent Crisislanding site. I spent a lot of time watching Crisislanding during Hurricane Irene and was impressed by the depth of information available in a single location.

Twitter also played a significant role during Hurricane Irene. I appreciated how quickly Twitter provided verified account status to many of the Red Cross accounts. Having “verified” accounts made it easier to quickly identify trusted sources of information. I hope that other social media services will follow Twitter’s lead.

So now we have the wildfires in Texas. I’m struggling to imagine a fire that has burned an area the size of Connecticut, but that’s what people are dealing with. Again, connecting people with information and services is of paramount importance. I came upon a public Google Docs spreadsheet that is tracking needs, offers, shelter locations, and other real-time information concerning the fire.

Google Docs Spreadsheet

These are only a few examples of how the information game is changing in the context of emergency management. I’m looking forward to seeing how social media continues to play a role in getting the word out and helping those affected by disasters better communicate with friends, family and loved ones.

Posted in Emergency Mgmt.


Hurricane Irene Overhyped? Required Reading Below

Like many, I closely followed the progress of Hurricane Irene last week. I was not surprised as it weakened and ceased to be a coastal threat. I will admit to having badly underestimated the extent of the inland flooding that followed. Irene was a devastating storm, just not in the way the media expected.

Now that the storm is over, the Washington Monument is still standing, and Armageddon failed to visit NYC, it seems that many are complaining that Irene was overhyped. I disagree, but have found it instructive to examine why people feel that it was overhyped.

Rather than restate what others have already written, I’m going to post links to the best rebuttals and analyses of Irene’s alleged ‘overhyped-ness’. They make for interesting food for thought.

 

Posted in Emergency Mgmt.


First Look at VIOS 2.2 SP01

As expected, the first phase of what has been called “VIOS Next Generation” or “NextGen VIOS” was released on December 9th as “VIOS 2.2 SP01”.  I recently installed it on my test cluster and put it through its paces to see what was included in the nearly 900MB download.

First of all, read the README closely before installing this code.  It contains more than a few important caveats.  Some notable ones are:

  • The reject option of updateios is not supported in this release.  Once you install this service pack, you are committed.
  • The new shared storage pool functionality requires 4 GB of RAM in the VIO server.
  • There is a maximum of one (1) VIOS node per shared storage cluster in this release.
  • VIO servers that host shared storage pools may not participate in Live Partition Mobility operations or Partition Suspend/Resume Operations.
  • VIO clients that make use of storage from shared storage pools are not supported for Live Partition Mobility.

There are more caveats that may apply in your environment, so again, please carefully read the README before applying the code.

This VIOS level identifies itself as:
$ ioslevel
2.2.0.11-FP-24 SP-01
The underlying AIX is:
$ oem_setup_env
# oslevel -s
6100-04-07-1036

So we’re dealing with a very recent AIX 6.1 TL as the underlying system.  It’s probably not a coincidence that this AIX 6.1 TL introduced support for Cluster Aware AIX.
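For readers who don't read these strings every day, the 'oslevel -s' output can be pulled apart mechanically. Here is a minimal Python sketch, assuming the usual reading of the four fields as base level, technology level, service pack, and a build date in YYWW form. The AixLevel type and parse_oslevel function are names invented for this illustration, not part of AIX or VIOS.

```python
from typing import NamedTuple


class AixLevel(NamedTuple):
    """Fields of an 'oslevel -s' string (hypothetical helper type)."""
    release: str           # e.g. "6.1"
    technology_level: int  # TL
    service_pack: int      # SP
    build_year: int        # assumed: first two digits of the last field
    build_week: int        # assumed: last two digits of the last field


def parse_oslevel(text: str) -> AixLevel:
    """Split a string such as '6100-04-07-1036' into its four fields."""
    base, tl, sp, build = text.strip().split("-")
    release = f"{base[0]}.{base[1]}"
    return AixLevel(
        release=release,
        technology_level=int(tl),
        service_pack=int(sp),
        build_year=2000 + int(build[:2]),
        build_week=int(build[2:]),
    )


level = parse_oslevel("6100-04-07-1036")
print(level)
```

The same function works for the 5.3 levels, since only the first two digits of the base field are used for the release.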

I built a one-node cluster and assigned two LUNs to it as members of a storage pool.  It is obvious from naming conventions that the VIOS clustering code is built on Cluster Aware AIX.  This, I think, is a good thing.  As mentioned in the README, only a single node “cluster” is supported.  In fact, there is no way to add a second node via the standard VIOS interface.  I was able to coerce a second node into the cluster by dropping into the oem_setup_env and running AIX commands, but this rendered the VIOS level environment inoperable.  Removing the cluster became problematic at this point as well.

Given the limitations of the environment, I didn’t experiment with assigning storage.  Nagger has an overview of the storage assignments on his AIXpert Blog.  Thankfully, legacy vSCSI and NPIV storage management techniques still work in this release, so it is safe to use in production.  Only the new shared storage pool functionality is limited.

Summary:

The new shared storage pool functionality is obviously not yet ready for production use.  I share Nigel’s assessment of this being a “Preview release”.  I do not believe that this will be useful until at least 2-node clusters are supported and the limitations on Live Partition Mobility lifted.  According to the technology roadmaps I have seen, these functions are in the works and will be released over the next year.

Overall, I remain enthusiastic over the NextGen VIOS strategy.  My largest environments mostly use vSCSI for storage management, and I’m ready for better storage management techniques to be a part of VIOS.  This release shows me that IBM is hard at work at making this goal a reality.

Posted in AIX, Virtualization.



VIOS 2.2 SP1. A Third Way to Manage Virtual Storage?

One of the more powerful features of IBM’s PowerVM product suite is the ability to share I/O resources via the Virtual I/O Server (VIOS).  VIOS is a virtual appliance running in its own partition(s) and is responsible for managing shared access to network and storage devices. For the purposes of this post, I am going to focus on storage virtualization and the coming changes in VIOS 2.2 SP01 due out later this month.


Posted in AIX, IBM, Storage, Virtualization.



HSCLA29A Error During AIX Live Partition Mobility

AIX, VIOS, and HMC error messages sometimes leave a lot to be desired.  As an example, I received the following while validating that one of my partitions would successfully migrate between POWER 595 frames.

Google didn’t come up with too many hits, and the official IBM online documentation says “Contact your next level of support.”  Not exactly what I was looking for.

As it happens, this lpar recently had new LUNs assigned to it.  On a hunch, I ran ‘lsdev -vpd’ on the destination VIO servers and searched for the serial numbers of the new LUNs.  They didn’t exist.  A quick ‘cfgdev’ on each of the destination VIO servers fixed the problem.  Apparently, I forgot to do that when we added the storage.

To summarize: Error HSCLA29A appears to mean that LUNs are missing from the destination VIO servers.
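The check I did by hand is easy to script if you save the 'lsdev -vpd' output from each destination VIO server to a file. Below is a rough Python sketch of the idea. The missing_serials function, the serial numbers, and the sample text are all invented for illustration, and real 'lsdev -vpd' output is laid out differently, which is why the sketch only does a plain substring search.

```python
def missing_serials(expected_serials, vpd_text):
    """Return the serial numbers that do not appear anywhere in vpd_text."""
    return [serial for serial in expected_serials if serial not in vpd_text]


# Invented sample data standing in for captured 'lsdev -vpd' output.
sample_vpd = """
hdisk4  Serial Number: 75ABC0100
hdisk5  Serial Number: 75ABC0101
"""

# Serial numbers of the LUNs the migrating partition needs (hypothetical).
new_luns = ["75ABC0101", "75ABC0102"]

for serial in missing_serials(new_luns, sample_vpd):
    print(f"LUN {serial} not visible; try cfgdev on this VIO server")
```

An empty result on every destination VIO server would suggest the LUNs are visible there, at least as far as this simple text match can tell.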

Posted in AIX, Virtualization.



Hanging up the Tech Blog

After months of telling myself, “I’ll update the tech blog soon”, I am finally admitting that I don’t enjoy owning a tech blog. And since I don’t enjoy it, it seems silly to keep it around as a testament to the fact that I don’t write very often.

I enjoy my job as an IT consultant. I enjoy presenting, teaching, and learning new things about servers, storage, and virtualization. I’ve discovered, however, that I don’t enjoy continuing my day job into the little bits of my day that aren’t consumed by work. Thus, the end of this failed writing experiment.

To my Twitter followers…
First, thank you for following me. I am going to continue a Twitter presence, but my tweets will likely drift away from things IT into things that interest me on a personal level. So, if you find tweets along the lines of “Sunny and hot in Louisville today, wish I were on the river.” to be an annoyance, you probably want to unfollow me. I’m not saying that I’ll never tweet IT again, but it won’t be a focus.

To the (dozen or so) people who have been reading this blog, again thank you. It will be going away soon if for no other reason than to save the $36/mo it costs in hosting fees.

To the real techbloggers out there, especially the independent ones, keep up the good work. I look forward to continuing to read your thoughts and ideas.

—–
Updated 2010-07-15: I’ve been prevailed upon not to remove this. I may even update it every now and then. Thanks for the kind words.

Posted in Other.


Is Your AIX Up to Date?

Nigel Griffiths (a name that should be familiar to all AIX administrators) has updated the AIXpert Blog with a reminder that new AIX service packs recently became available.  He’s recommending the following:

  • AIX 5.3:
    • TL09 SP7
    • TL10 SP4
    • TL11 SP4
    • TL12 SP1
  • AIX 6:
    • TL02 SP7
    • TL03 SP4
    • TL04 SP4
    • TL05 SP1

If you’re not on one of these levels, you probably should be.  While good conservative systems administration practices are a virtue, being too conservative is a vice.  IBM only supports a given technology level (TL) for around two years.  Being forced into an upgrade because fixes are no longer being created for your release is not a comfortable situation.

In general, I advocate running the latest SP of the N-1 TL.  This model has worked well for me.  It keeps my systems reasonably fresh while minimizing the upgrade risks.  I’ve also found that going from one AIX TL to another is almost always a painless upgrade.
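Applied to the levels listed above, the N-1 rule works out as follows. This is a small Python sketch of my rule of thumb only. The RECOMMENDED table simply restates the list from this post, and n_minus_one is a name made up for the example.

```python
# Technology level -> latest service pack, copied from the list in this post.
RECOMMENDED = {
    "AIX 5.3": {9: 7, 10: 4, 11: 4, 12: 1},
    "AIX 6": {2: 7, 3: 4, 4: 4, 5: 1},
}


def n_minus_one(levels):
    """Return (TL, SP) for the second-newest technology level."""
    ordered = sorted(levels)
    if len(ordered) < 2:
        raise ValueError("need at least two technology levels")
    tl = ordered[-2]
    return tl, levels[tl]


for release, levels in RECOMMENDED.items():
    tl, sp = n_minus_one(levels)
    print(f"{release}: TL{tl:02d} SP{sp}")
```

For the list above, that selects TL11 SP4 on AIX 5.3 and TL04 SP4 on AIX 6.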

For more information on AIX technology levels be sure to check out Fix Central and the Fix Level Recommendation Tool.

Posted in AIX, IBM.



Back in the Server World

My short hiatus away from blogging accidentally turned into a three-month-long sabbatical.  I have a new employer, a new set of responsibilities, and most importantly, a new focus.  I will no longer be focusing on storage.  Instead, I will be looking at servers and systems.  Storage will still be featured from time to time, but it will no longer be the primary topic of my writings.

In many ways, servers and systems are “home” for me.  I started my career on the mainframe, moved to *nix, dabbled in Windows, and only fairly recently was focused exclusively on storage.  I’ve spent time with OSF/1, HP-UX, AIX, and of course, Linux.  My HP experience is pre-Itanium, so it is quite dated.  My Linux experience dates back to kernel 0.99, and my AIX experience goes back to AIX/6000 v3.

Despite my employer being an IBM value added reseller, I am going to try to stay away from marketing and focus on technology.  My efforts at being a product cheerleader could be charitably described as “unfortunate”.  In my new job, I work for the services side of the house, so I am much more hands on with the technology and spend less time reading marketing glossies.  I am hoping that this approach will work better.

Posted in Other.


Moving On…

I’m preparing to change the focus of this blog.  There is a lot that can be said about enterprise storage, but for the most part, the topic is being well covered by other people.  I’ve come to the realization that while storage system design is a fascinating topic, storage is a means to an end and not the end in itself.  Please don’t misunderstand me.  I remain passionate about the DS8000 and its capabilities.  I know it better than any other device I have worked with in my 14 years as an IT professional.  It’s an amazing machine.  The DS8000 is only now beginning to realize the capabilities inherent in its design.  The best is yet to come.  IBM may not be able to out-market its competitors, but we out-engineer them at every turn.

I am not entirely certain what my new focus will be.  It will certainly be broader than just storage.  I am going to take a couple of weeks off to decide what direction to take.

I am also leaving IBM.

I will be back in a couple of weeks with a new focus, a new employer, and new topics to discuss.  I hope to see all of you then.

Posted in Other.


RAID in the 21st Century

Storagebod’s musings on storage availability got me thinking about RAID (Redundant Array of Independent Disks) technologies and how they have evolved to handle ever larger drive and array sizes.  RAID is, after all,  a risk mitigation technique.  Disk drives fail.  Sometimes this is a pure mechanical failure.  Other times, the drive media may develop bad sectors which render portions of the drive unreadable.  In either case, data has been lost that must be recovered via redundant data stored within the array.

Historically, the most popular RAID technology has been RAID-5 (striping with distributed parity).  RAID-5 performs well, has excellent storage efficiency, and is reliable enough for most common uses.  RAID-1 (mirroring) and RAID-10 (mirroring and striping) are also common and have typically been used where RAID-5 does not perform well enough.  RAID-1/10 are also considered to be more resilient to failure than RAID-5.  This additional performance and resiliency comes at the expense of greatly reduced storage efficiency.

Recently, RAID-5 has been showing its age.  As drive sizes have become ever larger, the amount of time required to reconstruct the data from a failed drive has increased as well.  This has led to uncomfortably long periods of time where a single bad sector discovered during array reconstruction can wipe out an entire RAID array.  Statistically speaking, RAID-5 still seems to be working well for enterprise Fibre Channel drives, but I have become uncomfortable with RAID-5 arrays constructed from large SATA drives.  (I define large as 500+ GB.)  I expect my discomfort to increase as drive sizes continue to grow.

RAID-6 (striping with two independently calculated parity values) is one possible solution to the problem of data integrity exposure during array reconstruction.  With RAID-6, three things (as opposed to two with RAID-5) have to go wrong before data is lost.  This is dramatically more reliable than RAID-5, and still much more efficient than RAID-1.  RAID-6, however, is not a perfect solution.
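To put rough numbers on the efficiency comparison, here is a short Python sketch using the textbook capacity formulas: one drive's worth of parity for RAID-5, two for RAID-6, and half the capacity for mirroring. The eight-drive array width is an assumption picked for the example, and usable_fraction is a made-up helper name.

```python
def usable_fraction(level, drives):
    """Fraction of raw capacity left for data in a single array."""
    if level == "RAID-5":
        if drives < 3:
            raise ValueError("RAID-5 needs at least 3 drives")
        return (drives - 1) / drives   # one drive's worth of parity
    if level == "RAID-6":
        if drives < 4:
            raise ValueError("RAID-6 needs at least 4 drives")
        return (drives - 2) / drives   # two drives' worth of parity
    if level in ("RAID-1", "RAID-10"):
        return 0.5                     # every block is stored twice
    raise ValueError(f"unknown RAID level: {level}")


ARRAY_WIDTH = 8  # assumed array size, for illustration only
for level in ("RAID-5", "RAID-6", "RAID-10"):
    print(f"{level}: {usable_fraction(level, ARRAY_WIDTH):.1%} usable")
```

The gap between RAID-5 and RAID-6 narrows as arrays get wider, while mirroring stays at 50% regardless of width.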

The problem with RAID-6 is that most implementations are slow.  The additional I/O operations and parity calculations required by RAID-6 impose a significant performance penalty on write operations.  Clever implementations such as Intelligent Write Caching (available in DS8000 R4.2+), or the hybrid WAFL/RAID-DP approach taken by Data ONTAP (available in NetApp FAS and IBM N series arrays) significantly reduce the performance penalty of RAID-6.  In fact, DS8000 Intelligent Write Caching makes RAID-6 arrays on the DS8000 perform almost as well as pre-Intelligent Write Caching RAID-5 arrays.

So what about XIV?  XIV uses a completely different storage scheme named RAID-X.  RAID-X is a radical re-think of the way we store data in an array.  It is a hybrid of mirroring, massive parallelism, and dynamic balancing of system resources.  RAID-X’s goals are simple: make commodity level hardware perform at enterprise levels, make storage administration dramatically simpler, provide consistent performance in the face of wildly varying I/O requirements, and be able to seamlessly adapt to ever increasing hard drive sizes.

There’s a lot of misinformation about how RAID-X works.  In an attempt to clear matters up, I offer the following.

A fully populated XIV frame contains 15 modules.  Each module contains 12 disk drives, cache, and processor resources.  This gives us a total of 180 disk drives.  Today’s shipping XIVs use 1 TB SATA disk drives, so the raw capacity of the frame is approximately 180 TB.  (In reality, it is a bit smaller since a 1 TB drive doesn’t actually hold 1 TB of data, but that is a topic for another post.)  All data is mirrored internally and some space is set aside for spare capacity and system metadata.  This gives a usable capacity of 79 TB per fully populated XIV frame.
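The capacity arithmetic in that paragraph can be written out in a few lines of Python. All inputs are the nominal figures quoted above. The "reserved" number is simply the gap between half of the raw capacity and the quoted usable capacity, so it lumps together spares, metadata, and the difference between a nominal and an actual terabyte.

```python
MODULES = 15            # modules in a fully populated frame
DRIVES_PER_MODULE = 12  # drives in each module
DRIVE_TB = 1            # nominal drive size in TB
USABLE_TB = 79          # usable capacity quoted in this post

drives = MODULES * DRIVES_PER_MODULE
raw_tb = drives * DRIVE_TB
mirrored_tb = raw_tb / 2                 # every partition is stored twice
reserved_tb = mirrored_tb - USABLE_TB    # spares, metadata, rounding
usable_ratio = USABLE_TB / raw_tb

print(f"drives:             {drives}")
print(f"raw capacity:       {raw_tb} TB")
print(f"after mirroring:    {mirrored_tb:.0f} TB")
print(f"reserved (nominal): {reserved_tb:.0f} TB")
print(f"usable / raw:       {usable_ratio:.1%}")
```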

When data comes in to XIV, it is divided into 1 MB “partitions”.  Each partition is allocated to a drive using a pseudo-random distribution algorithm.  A duplicate copy of each partition is written to another drive with the requirement that the copy not reside within the same module as the original.  This protects against a module failure.  A global distribution table tracks the location of each partition and its associated duplicate.  When a failure occurs, the system knows exactly which partitions are no longer protected and immediately begins creating new copies to restore redundancy.  This is where the parallelism of the design comes into play.  The entire machine goes to work re-creating the missing redundancy, so very little work has to be done by any one component.  This allows XIV to rebuild a failed 1 TB drive in minutes as opposed to the hours it would take in traditional RAID implementations.
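The placement rule is easier to see in code. The following Python sketch is a toy model and not XIV's actual algorithm: it assumes a plain uniform random choice of drive, and the names place_partition, build_table, and exposed_partitions are invented for this example. The only properties it tries to capture are the ones described above: each partition has two copies, the copies never share a module, and a table records where both copies live.

```python
import random

MODULES = 15
DRIVES_PER_MODULE = 12


def place_partition(rng):
    """Pick (module, drive) for a partition and for its copy in another module."""
    primary = (rng.randrange(MODULES), rng.randrange(DRIVES_PER_MODULE))
    other_modules = [m for m in range(MODULES) if m != primary[0]]
    mirror = (rng.choice(other_modules), rng.randrange(DRIVES_PER_MODULE))
    return primary, mirror


def build_table(num_partitions, seed=0):
    """Toy distribution table: one (primary, mirror) pair per partition."""
    rng = random.Random(seed)
    return [place_partition(rng) for _ in range(num_partitions)]


def exposed_partitions(table, failed_drive):
    """Indexes of partitions that lost one of their two copies."""
    return [i for i, copies in enumerate(table) if failed_drive in copies]


table = build_table(5000)
exposed = exposed_partitions(table, failed_drive=(0, 0))
print(f"{len(exposed)} of {len(table)} partitions need a new copy")
```

Because the table records both locations, finding the exposed partitions after a failure is a lookup, not a scan of the disks.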

The most common FUD point raised against RAID-X is that it is vulnerable to a double-drive failure and since data is spread across the entire machine, the failure of any two drives will cause data loss.  While this makes a great talking point on a competitive slide, it is simply not the case.  Allow me to explain.

As I pointed out above, RAID is a risk mitigation technique.  The most common ways to mitigate the risk of data loss are to decrease the probability that a critical failure combination can occur, and/or decrease the window of time where there is insufficient redundancy to protect against a second failure.  RAID-6 takes the former approach.  RAID-10 and RAID-X take the combination approach.  Both RAID-10 and RAID-X reduce the probability that a critical combination of failures will occur by keeping two copies of each bit of data.  Unlike RAID-5, where the failure of any two drives in the array can cause data loss, RAID-10 and RAID-X require that a specific pair of drives fail.  In the case of RAID-X, there is a much larger population of drives than a typical RAID-10 array can handle, so the probability of having the specific pair of failures required to lose data is even lower than RAID-10.

Another differentiation between RAID-X and RAID-10 is the window of vulnerability between the time a drive fails, and the time when full redundancy is restored.  RAID-10 has to copy the entire contents of the surviving member of the pair to a spare drive.  The copy time is directly proportional to the size of the drive since only one source volume is copying to one target volume.  While much faster than RAID-5, it can still take a while to copy a 1 TB drive.  The limiting factor is the transfer rate supported by the single source and target volume pair.

RAID-X operates differently.  When a drive fails, the global distribution table immediately knows the locations of all data that are no longer redundant.  This non-redundant data is evenly distributed among 168 drives.  The system immediately goes to work creating redundant copies of all exposed data.  This is done as a fully parallel, any-to-any operation with all surviving 179 drives participating.  Each drive only has to carry out 0.6% of the effort required to restore redundancy.  In addition, and this is key, only partitions that actually contain user data need to be copied.  If the system is only 50% full, RAID-X only needs to copy 500 GB worth of data to fully recover from a failed drive.  Contrast this to RAID-10,  which has to copy the entire drive regardless of the amount of user data actually stored on the drive.  Between the inherent parallelism of the design and the intelligence of the copy process, RAID-X can completely recover from a failed 1 TB drive in as little as 15 minutes.
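The figures in that paragraph follow from simple division. The Python sketch below redoes the arithmetic under an idealized assumption that the work is split perfectly evenly across the surviving drives. The rebuild_figures function is a name made up for the example, and the per-drive amount is derived here, not a number published for XIV.

```python
def rebuild_figures(total_drives=180, drive_gb=1000, fill=0.5):
    """Idealized rebuild arithmetic after a single drive failure."""
    survivors = total_drives - 1        # drives taking part in the rebuild
    share = 1 / survivors               # fraction of the work per drive
    data_gb = drive_gb * fill           # only user data has to be copied
    per_drive_gb = data_gb * share      # even split, an idealization
    return survivors, share, data_gb, per_drive_gb


survivors, share, data_gb, per_drive_gb = rebuild_figures()
print(f"surviving drives:  {survivors}")
print(f"share per drive:   {share:.1%}")
print(f"data to re-copy:   {data_gb:.0f} GB")
print(f"per-drive portion: {per_drive_gb:.1f} GB")
```

Compare that per-drive portion with the full 1 TB that a single RAID-10 source drive has to stream to its replacement.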

I hope this sheds some light on how XIV’s RAID-X really works.  As with many new and creative approaches to old problems, there are a lot of misunderstandings, misinformation, and outright FUD in the marketplace concerning RAID-X.  I firmly believe that RAID-X is at least as reliable as any other mirroring technology and has further advantages, not all of which I have been able to include here.

Thank you for reading this rather lengthy post.  I look forward to continuing the conversation.

Posted in IBM, Storage.
