From: Ed Wilts [ewilts@ewilts.org]
Sent: Thursday, September 26, 2002 1:01 PM
To: Info-VAX@Mvb.Saic.Com
Subject: Re: Hardware Mirroring 'vs' Software Mirroring ?

"Main, Kerry" wrote in message news:...
> Re: HBVS vs. HW RAID
>
> Both have advantages and disadvantages.

I guess it's time for me to step in and discuss the realities of life...

In an ideal world, a single drive fails cleanly, notifies you that it's
failed, you come in the next business day, do a hot swap, the data cleanly
copies, and users never notice the difference.

Now wake up and smell the roses. This year alone, I've seen the following:

- A single drive failed and caused every drive on that shelf to fault. In
  this particular case, no raidset had two drives on the shelf or I would
  have lost the raidset. 2 of the raidsets picked up hot spares in the
  cabinet and rebuilt cleanly - the other 4 waited for manual intervention
  and ran in reduced mode.

- A drive in a raidset failed on a Friday night. A hot spare kicked in and
  the raidset was rebuilt. A 2nd drive in the raidset failed Sunday night,
  and when I came in Monday I had 2 dead drives and a reduced raidset.

- A raidset that was used exclusively for archive data logged soft read
  errors every month or two. However, when that raidset got used for more
  active data, drives suddenly started dying. One drive failed hard and a
  hot spare was rolled in. While the raidset was being rebuilt, another
  drive failed. I lost the raidset. Fortunately, my data was mirrored to
  another data center via volshad, so I lost no data. If I hadn't shadowed,
  I would have lost about 50GB of active production data and would have
  been spinning tape. I expect I would have been down for >12 hours.

I've seen bugs in the OS dealing with disk drives - shadow copies going
the wrong way, drives merging when they shouldn't, crash dumps going to
never-never land, etc. I've seen a single non-shadowed drive fail and take
out a cluster while 2-3K users were online, and I had some good chats with
Engineering folks about how this should not have happened but did.

My drive-hanging-the-bus situation is the 2nd time in less than 5 years
that I've seen it. The first time, with older HSJ firmware, the HSJ would
reboot itself if it couldn't reset the bus, at which point the raidset
failed over to the redundant controller, which tried to reset the bus,
couldn't, rebooted itself, and you were screwed. At the time (end of 1997)
I was told that the problem was rare enough that it wasn't worthwhile
fixing, but I helped push a solution along and the controllers now won't
ping-pong. Good thing too, or I would have lost a data center last week.

Disk drives are evil little pieces of spinning metal just waiting to hurt
you. Be afraid... very, very afraid. Assume you'll experience double-disk
failures, hung SCSI busses, and failures right after you go home for the
weekend. The more drives you have, the bigger the hurt, and unfortunately
the bigger the probability of getting hurt.

Now tell your bosses what the risks are and mitigate those risks. You
won't all build redundant data centers, but you should be prepared for
what can happen and what the impact will be when it does. Document your
recovery plans. That little disk drive in the corner is waiting to fail as
you're reading this posting :-).

I've been managing VMS systems for 20 years and can still sleep soundly.
Paranoia is a wonderful thing to have when you're the system manager.

        .../Ed
        mailto:ewilts@ewilts.org
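
P.S. For anyone who wants to put rough numbers on "more drives, more
hurt", here's a back-of-the-envelope sketch. The figures are illustrative
assumptions only (a 3% annual failure rate per drive, independent
failures, a 12-hour rebuild window), not measurements from any of the
incidents above:

# Illustrative only: assumes a 3% annual failure rate per drive,
# independent failures, and a 12-hour rebuild onto a hot spare.
AFR = 0.03                 # assumed annual failure rate per drive
HOURS_PER_YEAR = 24 * 365
REBUILD_HOURS = 12         # assumed time to rebuild onto a hot spare

def p_any_failure(n_drives, hours):
    """Chance that at least one of n_drives fails within 'hours'."""
    p_one = AFR * hours / HOURS_PER_YEAR
    return 1 - (1 - p_one) ** n_drives

# Chance that some drive in a 100-drive cabinet fails this year:
print(p_any_failure(100, HOURS_PER_YEAR))   # ~0.95 -- near certainty

# Chance that one of the 5 surviving members of a 6-drive raidset dies
# during the 12-hour rebuild (i.e. you lose the raidset):
print(p_any_failure(5, REBUILD_HOURS))      # ~0.0002 per rebuild

Keep in mind that independent failures are the optimistic case; the shelf
fault and the archive raidset above are exactly the kinds of correlated
failures this arithmetic ignores.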