AH residence



John Nebel
03-17-2004, 06:44 PM
Dual-redundant fibre-channel disk controllers run the AH disks and all the AH data is stored redundantly, i.e., two real-time copies.

It's pretty difficult to get reliability in the computer realm that comes anywhere close to the level of reliability to which JBL aficionados are accustomed.

johnaec
03-17-2004, 06:58 PM
AH disks?

John

John Nebel
03-17-2004, 07:14 PM
John

Sorry, Audio Heritage - Ann and I have got accustomed to the shorthand "ah".

We help Don with the technical computer details.

John

JuniorJBL
03-17-2004, 08:49 PM
What part of CO are you in?
Maybe I can help out with some things AH as well. I am in the business as well. PM me and let me know if anything is needed. Nice Rig!!
Shane:D

Mr. Widget
03-17-2004, 09:08 PM
Oh sorry, wrong thread.:eek:

Don McRitchie
03-18-2004, 07:38 PM
Originally posted by John Nebel
We help Don with the technical computer details.

They do more than that. For those that aren't aware, John hosts our entire site at no cost to us. This is a very considerable contribution. We have two domains that, between them, hold around 800MB of data and use more than 30GB of bandwidth per month. John began hosting our site late last fall and I am not aware of any time that the site has been down since then. John and Ann have been invaluable in maintaining and supporting this site.

BTW, the new disk array is kewl.

johnaec
03-18-2004, 07:46 PM
Being IT manager where I work, I can really appreciate what John's doing! Kudos!!

boputnam
03-19-2004, 07:18 AM
Originally posted by John Nebel
Dual-redundant fibre-channel disk controllers

Dang it John!! Now I need one of THEM, too!! :banghead:

John Nebel
03-19-2004, 07:54 AM
Bo,

Heh, the cache backup batteries are lead-acid. You'd have to consider the disposal issues - just went through that with a pile of them - they have to be replaced every two years.

John

boputnam
03-19-2004, 08:01 AM
We could market them as "alternative energy" sources - probably worth a bundle out here... :rotfl:

Hell, down in the Haight, they'll pay-up for anything with "acid" in it... :shock:

John Nebel
03-19-2004, 09:14 AM
Bo,

Your good cheer and sense of humor are great!

Continuing on in my role as straight man...

Yesterday one of the disk shelves went offline; essentially, 8 disks disappeared, so to speak.

The three AH volumes - vBulletin, web, and ftp - fell over onto the surviving copy - that much was supposed to happen. The interesting bit is that the controller then reestablished redundancy on the remaining disks since they had enough free space.

The volumes which were not redundant disappeared from the hosts, but in the case of VMS, the important one was shadowed - it's more than a million e-mail messages - and it just dropped out of the shadow set and the operating system said goodbye and continued.

%%%%%%%%%%% OPCOM 18-MAR-2004 16:40:39.70 %%%%%%%%%%% (from node HERA at 18-MAR-2004 16:40:39.68)
Device $1$DGA300: (HERA PGB, CRONUS) is offline.
Mount verification is in progress

%%%%%%%%%%% OPCOM 18-MAR-2004 16:41:02.31 %%%%%%%%%%% (from node HERA at 18-MAR-2004 16:41:02.30)
$1$DGA300: (HERA PGB, CRONUS) has been removed from shadow set.

This weekend I'll try to find out why what broke, did.

Now I know why vraid1 is the default option. Disks are cheap - wine drinking time is dear.

John

boputnam
03-19-2004, 09:34 AM
Originally posted by John Nebel
...The three AH volumes - vBulletin, web, and ftp - fell over onto the surviving copy - that much was supposed to happen. The interesting bit is that the controller then reestablished redundancy on the remaining disks since they had enough free space....

%%%%%%%%%%% OPCOM 18-MAR-2004 16:40:39.70 %%%%%%%%%%% (from node HERA at 18-MAR-2004 16:40:39.68)
Device $1$DGA300: (HERA PGB, CRONUS) is offline.
Mount verification is in progress

%%%%%%%%%%% OPCOM 18-MAR-2004 16:41:02.31 %%%%%%%%%%% (from node HERA at 18-MAR-2004 16:41:02.30)
$1$DGA300: (HERA PGB, CRONUS) has been removed from shadow set....

Reminds me of that dog cartoon, where the owner is going on-and-on and all the mutt hears is...

:blah: :blah: :blah: "nipper" :blah: :blah: :blah:

You, sir, live in a very strange world. I am now beginning to believe you can read the Matrix code as it runs down the screen. And, you have yet to pleasure us with photos of your incredible set-up - the JBLs, that is...

Hofmannhp
03-19-2004, 09:46 AM
Originally posted by Don McRitchie
They do more than that. For those that aren't aware, John hosts our entire site at no cost to us. This is a very considerable contribution. ......John and Ann have been invaluable in maintaining and supporting this site.

Hi All,

That's a fine job John is doing here for zero bucks... I can imagine that John deserves some donations from forum members once the job for Don (the speaker project) is done.

HP

John Nebel
03-19-2004, 10:49 AM
HP

Thanks, but it's better to direct funds to the general good through the mechanism established by Don.

I'm doing it for the fun of it. The pics were posted because we are, how does one say this, somewhat or maybe somewhat, crazy and the pics seemed to fit in.

That was an intentional understatement.

The stuff about disks does have me a little baffled and it is curious to watch these things which have small minds of their own. Rest assured that I DON'T TRUST THEM and I don't believe the manufacturer's representations and there is a backup strategy.

Merely having a 2nd array sitting next to the first is not enough - one has to be prepared to give up the entire concept and use something old fashioned and slower and more reliable if need be and I was certainly thinking of that yesterday while watching the automatic processes play out and wondering what the hell was to happen next.

John

Hofmannhp
03-19-2004, 11:54 AM
ok John,

I'd like to see how it works with your machines, in among the JBL stuff.
BTW, which RAID level do you use for the disks, and what kind of controller is used?

HP

John Nebel
03-19-2004, 12:25 PM
HP,

http://h18006.www1.hp.com/products/storageworks/enterprise/index.html

The controller is called HSV110 and they are sandwiched between the loop switches in the pic at the beginning of the thread. Each disk shelf - one is at the top of the pic - has two power supplies - white and black cables - connected to separate power controllers on separate power circuits. The shelf is on two fiber loops connected to separate loop switches. There is a separate controller on each loop. The four switches are for the left and right sides of the cab, bottom and top.

The controllers are connected to separate SAN switches and each host computer has a separate fibre-channel card on a separate I/O bus connected to each SAN switch.

The object is to survive the failure of any single component including a disk shelf which comprises up to 14 x 306 GB disks. A cab can have 12 shelves - the one pictured has 6 and has lower-capacity disks as the cost/storage density trade-off very much favors that.
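The "survive any single component" goal can be checked on paper with a small graph model. The sketch below is only an illustration under assumptions: the node names are invented, each controller is assumed to attach to just one SAN switch and one loop, and power supplies and data mirroring across shelves are left out. It is not how the array itself verifies redundancy.

```python
from collections import deque

# Hypothetical, simplified model of one host's two paths to one disk shelf.
# Node names are invented for illustration only.
LINKS = [
    ("host", "hba_a"), ("hba_a", "san_a"), ("san_a", "ctrl_a"),
    ("ctrl_a", "loop_a"), ("loop_a", "shelf"),
    ("host", "hba_b"), ("hba_b", "san_b"), ("san_b", "ctrl_b"),
    ("ctrl_b", "loop_b"), ("loop_b", "shelf"),
]

def reachable(links, start, goal, failed=None):
    """Breadth-first search from start to goal, ignoring links that touch the failed node."""
    neighbours = {}
    for a, b in links:
        if failed in (a, b):
            continue
        neighbours.setdefault(a, set()).add(b)
        neighbours.setdefault(b, set()).add(a)
    seen, queue = {start}, deque([start])
    while queue:
        node = queue.popleft()
        if node == goal:
            return True
        for nxt in neighbours.get(node, ()):
            if nxt not in seen:
                seen.add(nxt)
                queue.append(nxt)
    return False

def single_points_of_failure(links, start="host", goal="shelf"):
    """Return the intermediate components whose loss cuts start off from goal."""
    components = {n for link in links for n in link} - {start, goal}
    return sorted(c for c in components
                  if not reachable(links, start, goal, failed=c))
```

For the dual-path model above, `single_points_of_failure(LINKS)` returns an empty list; drop one side of the fabric and every remaining component on the other side shows up in the result.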

I'm using RAID 0, 1 and 0+1 - 0+1 in DEC parlance being stripes + controller based mirroring. The critical data use VMS host-based shadowing with 3-member shadow sets, one member at a DT (disaster-tolerance) site at what was formerly the USAF Space Command :)

A lot less complicated than passive crossover networks which completely confound me.

John

Hofmannhp
03-19-2004, 12:38 PM
Originally posted by John Nebel
HP,
Each disk shelf - one is at the top of the pic - has two power supplies - white and black cables - connected to separate power controllers on separate power circuits. The shelf is on two fiber loops connected to separate loop switches. There is a separate controller on each loop. The four switches are for the left and right sides of the cab, bottom and top.

The controllers are connected to separate SAN switches and each host computer has a separate fibre-channel card on a separate I/O bus connected to each SAN switch.

John

thanks John,

More security can't be had... an interesting system (virtual RAID). Don't stop going this way for a stable AH
:)

HP

John Nebel
03-19-2004, 01:00 PM
HP,

Nice 4435 avatar!

It was almost scary when I realized that the controllers were working around the shelf failure to reestablish redundancy.

John

jtgyn
03-19-2004, 03:20 PM
G'Day John,
I am impressed, you have the right tools for the job.
An EVA and VMS... is it a cluster?

Regards Scott

John Nebel
03-19-2004, 03:54 PM
Scott,

Yes, a VMS cluster for the critical applications and Tru64 for all else. The cluster has a member in a remote location for DT and the gov't agency owning that site uses us reciprocally.

The VMS and Tru64 hosts both share the HSGs and EVAs and use an ESL for backup.

John

John Nebel
03-23-2004, 09:12 AM
So it looks like the AH disk residence is better than I'd thought. I'd bought a spare set of controllers - and everything else - but the manual doesn't address the issue of controller replacement.

A fellow from DECUS, Germany provided the helpful bit below and someone else slipped a pdf of the HP internal document on the subject over the transom.

Interesting that a controller can be swapped without interrupting I/O.

Where is that knocking on wood smilie?

"John,
it is just a matter of swapping the controller! I have worked with EVA since firmware version 1 and we had controllers changed several times. You can even replace both controllers at the same time, because the metadata is on the disks (no, I haven't seen that myself, yet)

Best regards,
Uwe"

mikebake
03-26-2004, 07:11 AM
John, thank you very much for your kind donation of gear and expertise.

MBB

Hofmannhp
03-26-2004, 10:34 AM
Originally posted by John Nebel
....................of the HP internal ...............


Hi All,

I have to tell you, that it's not me......and I got no harddrive in me...:cool:


HP

John Nebel
07-29-2004, 08:09 AM
It looks like the EVA finally works right - there have been no problems for a couple of months now. "Looks like" is used as a qualification so as to not tempt the gods into striking it with lightning. One can't be too careful about those things.

It took about a year from when the device was first purchased to getting it really working in production. Recently it has cheerfully withstood a power failure while it was releveling data and two separate disk failures. Releveling is the black magic which occurs whenever disks are added or removed - all the data in a disk group are moved around so every physical disk has the same amount. This is concurrent with normal operation and there are doubly redundant batteries to protect cache memory against power failures.
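As a toy picture of what releveling has to accomplish, here is a minimal sketch that plans moves until every disk holds the same amount. The function name, the greedy pairing and the example numbers are all assumptions made for illustration; the EVA's actual algorithm is not documented in this thread and is surely more involved.

```python
def relevel(used_gb):
    """Plan moves so every disk ends up holding the same amount of data.

    used_gb maps a (hypothetical) disk name to the GB it currently holds.
    Returns a list of (source, destination, gb) moves. Greedy illustration only.
    """
    target = sum(used_gb.values()) / len(used_gb)
    surplus = {d: u - target for d, u in used_gb.items() if u > target}
    deficit = {d: target - u for d, u in used_gb.items() if u < target}
    moves = []
    for src in sorted(surplus):
        for dst in sorted(deficit):
            if surplus[src] <= 1e-9:
                break  # this source is already down to the target
            amount = min(surplus[src], deficit[dst])
            if amount <= 1e-9:
                continue  # this destination is already full enough
            moves.append((src, dst, amount))
            surplus[src] -= amount
            deficit[dst] -= amount
    return moves
```

Adding an empty third disk to two disks holding 90 GB each, `relevel({"d1": 90, "d2": 90, "d3": 0})` plans two moves of 30 GB onto the new disk, leaving 60 GB on each.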

I confess I don't have a clue what MBTF means. SCSI disks are meant to have a 1,000,000+ MBTF and I've seen several disk failures in the last month.

I'm always surprised how difficult it is to put a complex piece of equipment into a production operation, usually a lot more work than one imagines.

... and until recently I'd not realized how complex a speaker system is - it's just a box, right?

Don C
07-29-2004, 09:27 AM
It is MTBF. Mean Time Between Failures, and as applied to disk drives, it is a useless figure. I did some reading up on this a while back. The time is calculated by estimating what would happen if you ran the disk for the suggested lifetime, usually three to five years. They don't exactly advertise how long they actually suggest the disk be used, though. Anyway, they then would (in the estimate) replace it with a new one, and continue. At some theoretical point in the future, half of the drives used in the test will have failed at some point in their useful life. That's the MTBF. Well, that's my understanding of how they get those absurd numbers; I'm not actually in that business or anything. Since the drive manufacturers cannot test the actual MTBF, as it would take years, they use this method that they made up. They might as well just make up the numbers. A resulting number that far exceeds the useful life of the disk sounds good in the ads. I guess it's legal.
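One common textbook reading, offered here as an assumption and not as the drive makers' actual method, treats MTBF as a rate for a large population of drives inside their service life, with a constant failure rate. Under that reading a 1,000,000 hour MTBF says nothing about how long one drive lasts; it converts into a small chance of failure per drive per year. A minimal sketch:

```python
import math

HOURS_PER_YEAR = 8760  # 24 hours * 365 days

def annual_failure_probability(mtbf_hours):
    """Chance that one drive fails within a year of continuous use,
    assuming a constant failure rate (exponential model)."""
    return 1.0 - math.exp(-HOURS_PER_YEAR / mtbf_hours)

def mtbf_in_years(mtbf_hours):
    """The MTBF figure expressed in years of continuous running."""
    return mtbf_hours / HOURS_PER_YEAR
```

With `mtbf_hours=1_000_000` this works out to a bit under 1% per drive per year, while the same figure expressed in years is over a century, which is why it cannot sensibly be read as a lifetime.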

Robh3606
07-29-2004, 09:51 AM
There are many ways to accelerate life tests to determine MTBF. The most common are elevated temperature, vibration levels, humidity. Basically an environment that is intentionally hostile and beats the daylights out of the device. The numbers they do publish, at least for Established Reliability Parts, ERMIL Mil parts, are real with years of testing to back them up.

Rob:)

John Nebel
07-29-2004, 01:27 PM
Thanks Don and Rob, and it looks like MBTF may not mean a lot for a disk drive.

I did read a Seagate paper on disk reliability a while back and it made things sound like the disks would run forever.

They are not cheap disks. The 10K disks were this price until the 15K versions came out...

$2,367.39 Hewlett Packard 293568-B22 72GB 15K RPM 2GB FC HD UPG Fibre Channel 15K RPM

I don't pay that much, but it is indicative of the fact that the disks are made to high standards and they are the ones the military buys.

If PC disks are on the same price/reliability curve, save your data!

IBM appears to have quit the disk manufacturing business due to reliability problems verging on scandal.

John

PS.

It may be that the operating environment beats the disks up with a high I/O load well beyond that which the MBTF tests subject them to and the disks are more sensitive to I/O load than to temperature or other environmental factors.

Ian Mackenzie
07-29-2004, 01:53 PM
John,

I well recall the Cook's tour I did of the AH back office in Boulder.

Out of sight but certainly not out of mind.

Say hi to Ann for me.


Ian

Steve Schell
08-01-2004, 09:34 PM
John, I'd like to thank you from the bottom of my heart for hosting the site and doing such a marvelous job of it. As far as I'm concerned, the Lansing Heritage site now has four co-founders!

jandregg
08-02-2004, 06:12 AM
John, if you meant MTBF it stands for mean time between failures.

John Nebel
08-02-2004, 06:42 AM
Originally posted by jandregg
John, if you meant MTBF it stands for mean time between failures.

Sorry for calling it MBTF, but it makes me unhappy thinking about it. :mad:

What is a puzzle is that the observed MTBF is a factor of 1000 lower than the published number of 1M+ hr for a set of 80 recent disks, not all bought at the same time from the same supplier. Makes me think the MTBF figure is bogus. If one should expect 5/80 = 6% of disks to fail within 1000 hours, this would have a curious implication for PC users who typically don't use redundant configurations and for the disk manufacturer, one of the few left in the US.

John

"The following reliability specifications assume correct host/drive operational interface, including all interface timings, power supply voltages, environmental requirements and drive mounting constraints (see Section 8.4).

Seek Errors: Less than 10 in 10**8 seeks

Read Error Rates [1]

Recovered Data: Less than 10 errors in 10**12 bits transferred (OEM default settings)

Unrecovered Data: Less than 1 sector in 10**15 bits transferred (OEM default settings)

Miscorrected Data: Less than 1 sector in 10**21 bits transferred

MTBF: 1,200,000 hours

Service Life: 5 years

Preventive Maintenance: None required

[1] Error rate specified with automatic retries and data correction with ECC enabled and all flaws reallocated."
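A rough check of the figures in this post, taking 5 failures among 80 disks within 1000 hours at face value. Two assumptions are mine: that all 80 disks ran for the full 1000 hours, and that failures follow a constant rate so the count is roughly Poisson. The function names are made up for the sketch.

```python
import math

def observed_mtbf(drive_count, hours_each, failures):
    """Point estimate: total device-hours divided by the number of failures."""
    return drive_count * hours_each / failures

def prob_at_least(failures, expected):
    """Poisson probability of seeing `failures` or more when `expected` are predicted."""
    below = sum(math.exp(-expected) * expected ** k / math.factorial(k)
                for k in range(failures))
    return 1.0 - below

PUBLISHED_MTBF = 1_200_000          # hours, from the quoted spec
DRIVES, HOURS, FAILED = 80, 1000, 5  # figures from the post

estimate = observed_mtbf(DRIVES, HOURS, FAILED)   # device-hours per failure
ratio = PUBLISHED_MTBF / estimate                 # how far below the spec
expected = DRIVES * HOURS / PUBLISHED_MTBF        # failures the spec predicts
chance = prob_at_least(FAILED, expected)          # odds of 5+ if the spec held
```

On these assumptions the estimate is 16,000 hours, about 75 times below the published figure, and the spec would predict well under one failure for the whole set. If the disks had actually been in service for fewer hours, the gap would be wider still. Either way, the chance of seeing five or more failures if the published number held is vanishingly small.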

John Nebel
08-16-2004, 06:02 AM
... just to make sure things were backed-up. After last night's VCS upgrade, the controllers started complaining about disk drive firmware.

Even a disk drive has an operating system. Humans no longer are in control.

"This drive code load procedure will bring all drives in the Hewlett-Packard StorageWorks Enterprise Virtual Array to the current versions.

This version is EXTREMELY DANGEROUS!!!!! It should only be used if the drive firmware needs to be upgraded immediately.

The version loads all drives in parallel. Under rare circumstances, it is possible for many (or ALL) drives to permanently fail. It should only be used if the data has been backed up and time is available to restore the data.

This file updates the table to version 3. It works for releases VCS v3.02x.

It includes the following models and versions:

MODEL - - - - REVISION
========================

----(10K RPM DRIVES)----
BD01853526 - 3BEG

BD03653525 - 3BEG
BD03654499 - 3BE3
BD03655B28 - HP05
BD03656ABA - HP09
BD03658223 - HP00

BD07254498 - 3BE3
BD07255B29 - HP05
BD07256ABB - HP09
BD07257582 - HP06
BD07258224 - HP00

BD14655B2A - HP05
BD14656ABC - HP09
BD14657583 - HP06
BD14658225 - HP00

BD30058226 - HP00

----(15K RPM DRIVES)----
BF03654564 - 3BE6
BF036574C9 - HP05
BF03655B2B - HP05

BF0725754B - HP05
BF07255B2C - HP05

----(NEARLINE DRIVES)---
ND2505823A - HP00"