From: Ed Wilts [ewilts@ewilts.org]
Sent: Thursday, September 26, 2002 1:01 PM
To: Info-VAX@Mvb.Saic.Com
Subject: Re: Hardware Mirroring 'vs' Software Mirroring ?

"Main, Kerry" wrote in message news:...
> Re: HBVS vs. HW RAID
>
> Both have advantages and disadvantages.

I guess it's time for me to step in and discuss the realities of life...

In an ideal world, a single drive fails cleanly, notifies you that it's
failed, you come in the next business day, do a hot swap, the data cleanly
copies, and users never notice the difference.

Now wake up and smell the roses. This year alone, I've seen the following:

- A single drive failed and caused every drive on that shelf to fault. In
  this particular case, no raidset had two drives on the shelf or I would
  have lost the raidset. 2 of the raidsets picked up hot spares in the
  cabinet and rebuilt cleanly - the other 4 waited for manual intervention
  and ran in reduced mode.

- A drive in a raidset failed on a Friday night. A hot spare kicked in and
  the raidset was rebuilt. A 2nd drive in the raidset failed Sunday night,
  and when I came in Monday I had 2 dead drives and a reduced raidset.

- A raidset that was used exclusively for archive data logged soft read
  errors every month or two. However, when that raidset got used for more
  active data, drives suddenly started dying. One drive failed hard and a
  hot spare was rolled in. While the raidset was being rebuilt, another
  drive failed. I lost the raidset. Fortunately, my data was mirrored to
  another data center via volshad, so I lost no data. If I hadn't shadowed,
  I would have lost about 50GB of active production data and would have
  been spinning tape. I expect I would have been down for >12 hours.

I've seen bugs in the OS dealing with disk drives - shadow copies going
the wrong way, drives merging when they shouldn't, crash dumps going to
never-never land, etc. I've seen a single non-shadowed drive fail and take
out a cluster while 2-3K users were online, and I had some good chats with
Engineering folks about how this should not have happened but did.

My drive-hanging-the-bus situation is the 2nd time in less than 5 years
that I've seen it. The first time, with older HSJ firmware, the HSJ would
reboot itself if it couldn't reset the bus, at which point the raidset
failed over to the redundant controller, which tried to reset the bus,
couldn't, rebooted itself, and you were screwed. At the time (end of 1997)
I was told that the problem was rare enough that it wasn't worthwhile
fixing, but I helped push a solution along and the controllers now won't
ping-pong. Good thing too, or I would have lost a data center last week.

Disk drives are evil little pieces of spinning metal just waiting to hurt
you. Be afraid... very, very afraid. Assume you'll experience double-disk
failures, hung SCSI busses, and failures right after you go home for the
weekend. The more drives you have, the bigger the hurt, and unfortunately
the bigger the probability of getting hurt.

Now tell your bosses what the risks are and mitigate those risks. You
won't all build redundant data centers, but you should be prepared for
what can happen and what the impact will be when it does. Document your
recovery plans. That little disk drive in the corner is waiting to fail as
you're reading this posting :-).

I've been managing VMS systems for 20 years and can still sleep soundly.
Paranoia is a wonderful thing to have when you're the system manager.

        .../Ed
        mailto:ewilts@ewilts.org
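
P.S. For anyone who wants to put rough numbers on "more drives, more
hurt", here's a back-of-the-envelope sketch. The figures are illustrative
assumptions only (a 3% annual failure rate per drive, independent
failures, a 12-hour rebuild window), not measurements from any of the
incidents above:

# Illustrative only: assumes a 3% annual failure rate per drive,
# independent failures, and a 12-hour rebuild onto a hot spare.
AFR = 0.03                 # assumed annual failure rate per drive
HOURS_PER_YEAR = 24 * 365
REBUILD_HOURS = 12         # assumed time to rebuild onto a hot spare

def p_any_failure(n_drives, hours):
    """Chance that at least one of n_drives fails within 'hours'."""
    p_one = AFR * hours / HOURS_PER_YEAR
    return 1 - (1 - p_one) ** n_drives

# Chance that some drive in a 100-drive cabinet fails this year:
print(p_any_failure(100, HOURS_PER_YEAR))   # ~0.95 -- near certainty

# Chance that one of the 5 surviving members of a 6-drive raidset dies
# during the 12-hour rebuild (i.e. you lose the raidset):
print(p_any_failure(5, REBUILD_HOURS))      # ~0.0002 per rebuild

Keep in mind that independent failures are the optimistic case; the shelf
fault and the archive raidset above are exactly the kinds of correlated
failures this arithmetic ignores.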