2014/06/15

RAID++: Erasures aren't Errors

I'm following the Communications / Transmission theory nomenclature: "Erasures" means drive failures ('no signal', so the location of the loss is known) and "Errors" means symbols received incorrectly.

With 5,000 x 2.5" 2TB drives in a single rack, what hardware problems will we experience?
Worldwide, it's assumed a single vendor might sell 100,000 of these arrays.
  • How many read errors should we expect? "Errors"
  • How many drives will fail in a year? "Erasures"
    • I don't have numbers for other parts like fans, PSUs, boards, connectors, RAM.
    • How many RAID rebuilds?
    • How long will they take?
    • How many dual-failures might we expect?
We can use RAID-1 to set the minimum baseline {cost, performance, data-protection/data-loss} against which to compare other schemes. Calculations, here. A rough back-of-envelope sketch of the rebuild and dual-failure questions follows below.
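Here is that sketch (Python), using figures quoted in the Specs section below (3% AFR, 2TB drives, ~1Gbps sustained transfer). The model of a rebuild as a full-drive stream at the sustained rate, and of a dual failure as the surviving mirror partner dying during that rebuild window, is my simplification, not a measurement.

# Back-of-envelope RAID-1 baseline, using figures quoted in the Specs section below.
# My simplifications: a rebuild streams the full drive at the sustained rate, and a
# "dual failure" is the surviving mirror partner failing during that rebuild window.
DRIVES_PER_RACK  = 5_000
ARRAYS_WORLDWIDE = 100_000
DRIVE_TB         = 2
AFR              = 0.03        # 3% annualised failure rate (real-world figure)
SUSTAINED_MB_S   = 125         # ~1Gbps sustained transfer
HOURS_PER_YEAR   = 8_766

rebuilds_per_rack_year = DRIVES_PER_RACK * AFR                            # ~150
rebuild_hours = (DRIVE_TB * 1e12 / (SUSTAINED_MB_S * 1e6)) / 3_600        # ~4.4hrs

# Chance the specific mirror partner also fails inside one rebuild window:
p_partner_fails = AFR * rebuild_hours / HOURS_PER_YEAR                    # ~1.5e-5
dual_per_rack_year  = rebuilds_per_rack_year * p_partner_fails            # ~0.002
dual_worldwide_year = dual_per_rack_year * ARRAYS_WORLDWIDE               # ~230

print(f"Rebuilds per rack per year:   {rebuilds_per_rack_year:.0f}")
print(f"Rebuild time (full stream):   {rebuild_hours:.1f} hours")
print(f"Dual failures per rack/year:  {dual_per_rack_year:.4f}")
print(f"Dual failures worldwide/year: {dual_worldwide_year:.0f}")

On these simplifications: roughly 150 rebuilds per rack per year at about 4.5 hours each, and a couple of hundred mirror-pair dual failures per year across the 100,000-array worldwide fleet.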

Specs

I'm going to assume 100% power-on time (24/7 operations) and a 64% Duty Cycle, excluding RAID activity (rebuilds and data scrubbing).
With average 64KB transfers, this gives throughput of 20TB/drive/year of user data. Only hi-spec drives are designed for this duty cycle and 24/7 usage.

AFR (Annualised Failure Rate), derived from MTBF (AFR ≈ powered-on hours per year ÷ MTBF), is spec'd at 0.8%-1.2%, while real-world reports suggest 3% AFRs (roughly a 250,000hr MTBF); the 3% figure is used here.
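As a quick cross-check on those figures (a minimal sketch; the 8,766 powered-on hours per year follows from the 24/7 assumption above):

# MTBF <-> AFR conversion, assuming 8,766 powered-on hours per year (24/7 operation).
HOURS_PER_YEAR = 8_766
print(f"250,000hr MTBF -> {HOURS_PER_YEAR / 250_000:.1%} AFR")   # ~3.5%, the ~3% used here
print(f"1% AFR -> {HOURS_PER_YEAR / 0.01:,.0f}hr MTBF")          # ~876,600hr, the spec-sheet ballpark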

Drive design life is thought to be 5 years, accounting for both wear & tear and effects like lubricant evaporation. Vendors cannot control the use or environment of their products.

Drive Bit Error Rate (BER or UBER for Unrecoverable BER) is quoted at 10^-14 for consumer-grade drives and 10^-16 for hi-spec drives. Exactly what this means with 4KB (32kbit) blocks is unclear: drive vendors don't publish on this topic. This piece will use the raw figure and assume only single-bit errors within a block, though the literature suggests media defects cause long error bursts, just as scratches and pinholes do on Optical disks.
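To put those UBER figures against the assumed 20TB/drive/year of user data, a sketch under the single-bit, independent-error simplification just stated:

# Expected unrecoverable read errors ("Errors") per year from the quoted UBER figures,
# against 20TB/drive/year of user data, assuming independent single-bit errors.
BITS_PER_TB     = 8e12
TB_PER_YEAR     = 20
DRIVES_PER_RACK = 5_000

for label, uber in (("consumer 1e-14", 1e-14), ("hi-spec  1e-16", 1e-16)):
    per_drive = TB_PER_YEAR * BITS_PER_TB * uber
    print(f"{label}: {per_drive:.3f} errors/drive/year, "
          f"{per_drive * DRIVES_PER_RACK:,.0f} errors/rack/year")

On the raw figures, a consumer-grade drive sees an unrecoverable read error or two per year, so a 5,000-drive rack of them sees thousands; the hi-spec figure is a hundred times lower.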

For around 20 years, since drives moved to LBA (Logical Block Addressing), they've silently handled "bad blocks" or 'hard' media errors. Data is probably collected by S.M.A.R.T. monitoring, but I've no good studies to guide me. One of the large 2009 studies (Google?) said there was little correlation between S.M.A.R.T. errors and sudden drive failure. Firmware on consumer drives does automatic retries (6-8 times) in the face of errors, resulting in occasional long response times, as each retry loses a full drive revolution (8.4-11.1msec). Storage Arrays disable this automatic drive retry/recovery because they have higher-level error recovery. However, drives still need to handle 'soft' errors due to tracking errors induced by outside factors, such as the noise and vibration found in high-density Arrays. This piece ignores 'soft' errors and bad-block handling.
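A quick sanity check on the retry penalty (a sketch; the 6-8 retry count is the consumer-firmware behaviour described above, and one full revolution lost per retry is assumed):

# Worst-case added response time when consumer firmware retries a failing read,
# losing one full platter revolution per retry (6-8 retries).
for rpm in (5_400, 7_200):
    rev_ms = 60_000 / rpm                          # 11.1msec @ 5,400rpm, 8.3msec @ 7,200rpm
    for retries in (6, 8):
        print(f"{rpm:>5}rpm, {retries} retries: +{retries * rev_ms:.0f}msec")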

There are two main drive speeds available, 5,400RPM and 7,200RPM, with 10,000RPM available at a premium in power & cooling, price and perhaps reliability. The rough throughput figures are below, with average seek time guessed at 3msec. At ~1Gbps sustained transfer rates, a 4KB block (32Kbit) takes 0.032msec to read/write, below our precision, while a 64KB transfer (0.5Mbit) takes 0.5msec, which does affect IO/sec.

RPM      Avg latency   IO/sec (0ms seek)   IO/sec (3ms seek)   IO/sec (3ms seek + 0.5ms trf)
 5,400     5.6msec           180                116.88               110.43
 7,200     4.2msec           240                139.54               130.44
10,000     3.0msec           333                166.67               153.85
Drive throughput vs RPM (approx.)
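The table can be approximately reproduced (to within rounding) from half a revolution of rotational latency plus seek plus transfer time; the 3msec seek and 0.5msec 64KB transfer are the guesses above. A sketch:

# Reproduce the table: IO/sec = 1000 / (avg rotational latency + seek + transfer), all in msec.
for rpm in (5_400, 7_200, 10_000):
    latency  = 60_000 / rpm / 2                    # half a revolution
    io_0     = 1_000 / latency
    io_3     = 1_000 / (latency + 3.0)             # + 3msec average seek
    io_3_trf = 1_000 / (latency + 3.0 + 0.5)       # + 0.5msec for a 64KB transfer at ~1Gbps
    print(f"{rpm:>6}rpm  {latency:.1f}msec  {io_0:.0f}  {io_3:.2f}  {io_3_trf:.2f}")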


