SteveJ-on-IT: 2012-01

2012/01/31

Why Fibre Channel SANs will be dead in 5 years.

I won't be buying shares in any Fibre Channel-based tech stocks as I think the technology will be dead within 5 years for two reasons:

Enterprise Storage Arrays will stop being used for high-intensity random I/O, instead being used for "seek and stream", and
PCI Flash or SCM Storage will become the low-latency "Tier 0" Storage of choice because of speed (latency), cost and simplicity.

Update 1-Feb-2012:
Another article by Fusion-io, "Getting the most out of flash storage", provides extra links:

Flash storage moves closer to CPUs: STEC is moving into PCI Flash cards.
EMC: Flash could spell doom for Fibre Channel but don't talk about PCI Flash challenges.
Flash storage in post-PC devices advances, general background on Flash memory.

Elsewhere I've written that Jim Gray's observation, "Disk is the new Tape", should be "Disk is the new CD".
That is, Enterprise Storage is best suited for "Seek and Stream", not random I/O.
Enterprise Storage Arrays will need to provide in this future:

reliable persistent storage and archives,
high capacity and best $ per MB, and
high bandwidth streaming IO.
and if you were being honest, vendor-neutral management and protocols, in-place upgrades, any-time snapshots/backups and flexible zero-downtime expansion and reconfiguration.
The current need for "fork-lift upgrades" and vendor incompatibility are disadvantageous to customers.

How do you create a Storage Network that needs to be fast, cheap, simple, robust/reliable, secure and scalable when "super low-latency, zero jitter and non-blocking IO" is taken out of the mix?

Ethernet and nothing else.

Whether Layer 2 protocols, like Coraid's ATA-over-Ethernet, or Layer 3 protocols, e.g. the slower, higher overhead but routable iSCSI, dominate is still an open question.
Both have strengths and weaknesses and can be used together very effectively without conflicts to maximise ROI's and minimise Enterprise Storage costs, both CapEx and OpEx.

Ethernet is around 10-100 times cheaper than Fibre Channel to install and configure and requires only a fraction of the Administration and support, because Enterprises already have well resourced and competent Networking teams. Network Engineers are in much better supply than SAN specialists, so wages are more reasonable and availability much, much higher.
Technical managers are also much better able to assess their competency/capability, both when hiring and when firing.

As well, Ethernet has a current growth path of 40Gbps and 100Gbps with 10Gbps widely available now for servers.
Fibre Channel may improve sometime in the future to 12Gbps, but that's an uncertain roadmap, and with a global market in the tens of thousands versus millions for Ethernet, the cost differential will only grow.

Fibre Channel has become a very poor choice when bandwidth is the primary "figure of merit".

In access latency, PCI-based Flash Memory, such as Fusion-io's, will always beat SAN-based Storage Arrays by a rather large margin.

It's there in the physics and unbeatable...

All the interfaces, line delays, buffering and switching - out and back on a SAN - mean that even if the Storage Array's SSDs had zero latency, the SAN path would still be many times slower.

This Q and A with Matt Young of Fusion-io on "Making Flash Fast", says it well:

Q: How does your ioMemory technology differ from Solid State Disks? And how does it compare performance wise?

A: Solid State Disks or SSDs are used to store data with the intention of constant use – similar to that of a hard drive. These SSDs generally use disk-based protocols that introduce unnecessary latency into the system. Fusion’s ioMemory technology differs in that it doesn’t act as a hard drive. It performs as an extension of the memory hierarchy for servers. This means that they provide a tighter integration with host systems and applications, helping you to work more productively.

Fusion-io products offer the industry’s lowest latencies, which maximise performance and scalability, while delivering enterprise reliability.

and

Q: Can you provide some typical I/O performance figures for ioMemory compared to DRAM, and solid state disk?

A: With some generalisation, the order of memory, fastest first is as follows,

DRAM with 100-300 nanosecond access,

ioMemory with 15 microsecond access,

NAND appliances with around 500 microsecond access and

then SSD’s with around 1ms [1000 microsecond] access. [66 times slower...]

There are of course a number of factors that need to be considered in these times such as payload size, load, etc, however, in simple terms with all things equal and a well-designed product, latency is ultimately affected by the distance data must travel to get to where it is useful.
So, the closer your technology resides in relation to the CPU the better the response time.

That’s why that even though two products may use the same NAND chips and be connected on the PCI Express bus, you see markedly different latency characteristics.
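The access times quoted in that answer can be turned into ratios with a few lines of Python. This is only a sketch of the arithmetic: the figures are the ones quoted above, the DRAM entry assumes the upper end of the 100-300 nanosecond range, and the names `LATENCY_NS` and `slowdown` are made up for illustration.

```python
# Access latencies quoted in the answer above, in nanoseconds.
# Assumption: DRAM uses the upper end of the quoted 100-300 ns range.
LATENCY_NS = {
    "DRAM": 300,
    "ioMemory": 15_000,
    "NAND appliance": 500_000,
    "SSD": 1_000_000,
}


def slowdown(tier: str, baseline: str = "ioMemory") -> float:
    """Return how many times slower `tier` is than `baseline`."""
    return LATENCY_NS[tier] / LATENCY_NS[baseline]


for tier, ns in LATENCY_NS.items():
    print(f"{tier:15s} {ns / 1000:8.1f} usec  {slowdown(tier):6.2f}x ioMemory")
```

The SSD row comes out at a little under 67 times the ioMemory figure, which is where the "66 times slower" note comes from.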

finally:

Q: As well as I/O performance, what other attributes of ioMemory are finding favour among customers?

A: One of the things that our customers tell us provides a major benefit in addition to performance is the reliability of ioMemory and the cost savings generated from implementing Fusion-io solutions. Fusion-io products are uniquely reliable enough to be offered by all major OEM manufacturers, including Dell, HP and IBM.
[snip]

Finally, many customers tell us that they save a lot of money on CapEx and OpEx, since ioMemory takes so much less power, cooling and real estate than traditional, scaled-out storage infrastructures.

Declaration of Interest:

I have no shares or other financial interest in Fusion-io or any companies or their competitors mentioned in this piece.
I am not employed now, nor have ever been, by Fusion-io or any of its related/associated entities.
I receive no remuneration for writing these opinions/analyses.

(signed) Steve Jenkin, 31-Jan-2012.

How they got to 1 Billion IO per second.

Is Fusion-io's demonstration of "1 Billion IO operations per second" the same sort of game-changer that the 1987/8 RAID paper by Patterson, Katz and Gibson was?

Within 5 years of that paper, all "Single Large Expensive Disks" (SLEDs) were out of production. Will we see Flash disks in Storage Arrays and low-latency SANs out of production by 2017?

A more interesting "real world" demo by Fusion-io in early 2012 was loading MS-SQL in 16 virtual machines running Windows 2008 under VMware. They achieved a 2.8-fold improvement in throughput with a possible (unstated) 5-10 fold access-time improvement.

Updated 16:00, 31-Jan-2012. A little more interpretation of the demo descriptions and detailed PCI bus analysis.

Fusion used a total of 64 cards in 8 servers running against a "custom load generator", or roughly 16 million IO/sec per card (1 billion / 64 = 15.6 million).
There are two immediate problems:

How did they get the IO results off the servers? Presumably Ethernet and TCP/IP. [No, internal load generator, no external I/O.]
The specs on those cards (2.4TB ioDrive2 Duo) only rate them for 0.93M or 0.86M sequential IO/sec (write, read) with 512 byte blocks, roughly a 17-fold shortfall.
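The size of that gap is easy to check. The sketch below uses only the figures already given (1 billion IO/sec, 64 cards, 8 servers, and the 0.93M write / 0.86M read spec-sheet ratings); the variable names are my own.

```python
# Headline demo figure and hardware counts, as given in the text.
TOTAL_IOPS = 1_000_000_000
CARDS = 64
SERVERS = 8

# Spec-sheet ratings for the 2.4TB ioDrive2 Duo at 512 bytes, as quoted above.
SPEC_WRITE_IOPS = 930_000
SPEC_READ_IOPS = 860_000

per_card = TOTAL_IOPS / CARDS        # IO/sec each card must sustain
per_server = TOTAL_IOPS / SERVERS    # IO/sec each server must sustain

shortfall_write = per_card / SPEC_WRITE_IOPS
shortfall_read = per_card / SPEC_READ_IOPS

print(f"per card:   {per_card / 1e6:.1f}M IO/sec")
print(f"per server: {per_server / 1e6:.0f}M IO/sec")
print(f"shortfall vs spec: {shortfall_write:.1f}x (write), {shortfall_read:.1f}x (read)")
```

Each card has to deliver about 15.6 million IO/sec, somewhere between 16 and 19 times its block-mode rating, which is why the 64-byte operations and the memory-mapped path matter so much.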

The IOs used in the Demo were small, 64 bytes each, and they avoided the Linux block driver sub-system by using their Direct Memory Mapping scheme, "Auto Commit Memory" (ACM), rather than their Read-only Cache, "ioTurbine".
The card spec sheet also quotes throughput of 3GB/sec (24Gbps) for large read and 2.5GB/sec (20Gbps) for large writes (1MB I/O).

There was no information on the read/write ratio of the load, nor if they were random or sequential.
In a piece on Read Caching also speeding up writes by 3-10 fold, Fusion show they are savvy systems designers: they use a 70:30 (R:W) workload as representative of real-world loads.

That nothing was said about the workload suggests it may have been pure read or pure write - whichever was faster with ACM under Linux. If the cards' ACM performance tracks the quoted specs (via a block mode interface), this would be pure sequential write.

The workload must have been 50:50 to allow full utilisation of the single shared PCI bus on each system, otherwise the bus would've saturated.

Also, as this is as much a demonstration of ACM, integrated with the Virtual Memory system to cause "page faults", the transfers to/from the Fusion cards were probably in whole VM pages. The VM page size isn't stated.
In Linux, pages default to 4KB or 8KB, but are configurable to at least 256MB. Again, these are savvy techs with highly competent kernel developers involved, so an optimum page size for the Fusion card architecture, potentially 1-4MB, was presumably chosen for the demo. [Later, 64KB is used for PCI bus calculations.]

The Fusion write-up does not say how they checked the IOs were correctly written. With 153.6TB total storage and 64GB/sec in the test workload, the tests could've run 2,400 seconds (40 min) before filling the cards. Perhaps they read back the contents and compared that to the generated input, though nothing is said. In the best of all worlds, there would've been a real-time read-back check, i.e. a 50:50 R:W workload.
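The fill-time estimate is a single division, sketched below. It assumes decimal units (1TB = 1000GB) and that the full 64GB/sec is written to the cards; `fill_time_seconds` and its `write_fraction` parameter are hypothetical names added to show what a 50:50 mix would do.

```python
CARDS = 64
CARD_CAPACITY_GB = 2400        # 2.4TB ioDrive2 Duo, assuming 1TB = 1000GB
THROUGHPUT_GB_PER_SEC = 64     # total demo throughput across all servers

TOTAL_CAPACITY_GB = CARDS * CARD_CAPACITY_GB   # 153,600GB, i.e. 153.6TB


def fill_time_seconds(write_fraction: float = 1.0) -> float:
    """Seconds until the cards are full if `write_fraction` of the load is writes."""
    return TOTAL_CAPACITY_GB / (THROUGHPUT_GB_PER_SEC * write_fraction)


print(f"pure write: {fill_time_seconds() / 60:.0f} min")
print(f"50:50 R:W:  {fill_time_seconds(0.5) / 60:.0f} min")
```

With a pure write load the cards fill in about 40 minutes; a 50:50 read/write mix would double that.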

The 64GB/sec total I/O throughput gives 8GB/sec, or 64Gbps, per server.
The HP ProLiant DL370 servers used in the tests, according to the detailed specs, only support 9 PCI-e v2 cards; most slots are 4 lane ('x4'), at 4Gbps per lane in each direction. ~~With 8 slots taken by the Fusion-io cards, only one (x16?) slot was available for the network card needed to supplement the 4 1Gbps on-board ethernet ports.~~
~~80-100Gbps ethernet capacity would normally be needed to support 64Gbps of IP traffic.~~

Reading the "datasheet" carefully, including inferring from the diagram which has no external connections between systems and no external load-gen host, an internal load-generator was used, one per host. There may have been some co-ordination between hosts of the load generators, such as partitioning work units. From the datasheet commentary:

Custom load generator that exercises memory-mapped I/O at a rate of approximately 125 million operations per second on each server. Each operation is a 64-byte packet.
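The datasheet figures tie together neatly. As a quick check, using only the numbers quoted (125 million operations per second per server, 64 bytes each, 8 servers) and decimal units:

```python
OPS_PER_SERVER = 125_000_000   # from the datasheet commentary
BYTES_PER_OP = 64
SERVERS = 8

bytes_per_sec_server = OPS_PER_SERVER * BYTES_PER_OP   # bytes/sec per server
gbps_server = bytes_per_sec_server * 8 / 1e9           # gigabits/sec per server

total_ops = OPS_PER_SERVER * SERVERS
total_gb_per_sec = bytes_per_sec_server * SERVERS / 1e9

print(f"per server: {bytes_per_sec_server / 1e9:.0f}GB/sec = {gbps_server:.0f}Gbps")
print(f"all servers: {total_ops / 1e9:.0f} billion ops/sec, {total_gb_per_sec:.0f}GB/sec")
```

That gives 8GB/sec (64Gbps) per server and 64GB/sec in total, matching the headline 1 billion operations per second.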

A little more information is available in a blog entry.

PCI Express v2 is ~4Gbps per lane, or 64Gbps for x16, each direction. ~~Potentially just enough if an on-board ToE was used either with a 8-way 10Gbps card, a dual-port 40Gbps card or a single-port 100Gbps card.~~

~~From this, we can't be sure if the load generator was internal to the Linux servers or external.~~

Even though not used, HP support dual 10Gbps ethernet cards, PCI-e v2 x8, in the DL370 G6, but with a maximum of two per system. This suggests a normal operational limit of the PCI backplane of 20-40Gbps per direction. The aggregate 64Gbps is achievable if split into 32Gbps in each direction.

~~The Fusion cards are "half-length" and are x8 (so will work with x1, x2, and x4 slots as well).~~
From the DL370 specs, the system has 8 available full-length slots:

2 * x16,
1 * x8 and
5 * x4.

~~The other half-length slot is probably x8.~~

The per-server load is 64Gbps, spread amongst 8 cards, or 8Gbps/card, which is only 2 lanes (x2).
The per-card bandwidth would be possible if the single shared PCI bus wasn't saturated.
Per direction, the x4 PCI-e lanes would only support 16Gbps and the x8 32Gbps.
The simple average (5 * 16 + 3 * 32)/8, or 22Gbps, is insufficient for the load.
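The slot arithmetic can be written out as a short sketch. It assumes roughly 4Gbps usable per PCI-e v2 lane per direction, as used throughout this piece, and that an x8 card in an x16 slot still only uses 8 lanes; the names `SLOT_LANES` and `slot_gbps` are illustrative only.

```python
GBPS_PER_LANE = 4     # approximate usable PCI-e v2 rate per lane, per direction
CARD_LANES = 8        # assumption: the Fusion cards are x8, so x16 slots run at x8

# The 8 full-length slots from the DL370 specs: 2 * x16, 1 * x8, 5 * x4.
SLOT_LANES = [16, 16, 8, 4, 4, 4, 4, 4]


def slot_gbps(slot_lanes: int) -> int:
    """Per-direction bandwidth a card gets in a slot with `slot_lanes` lanes."""
    return min(slot_lanes, CARD_LANES) * GBPS_PER_LANE


per_slot = [slot_gbps(lanes) for lanes in SLOT_LANES]
simple_average = sum(per_slot) / len(per_slot)

per_card_load = 64 / 8                        # 64Gbps per server over 8 cards
lanes_needed = per_card_load / GBPS_PER_LANE

print(f"per-slot Gbps:  {per_slot}")
print(f"simple average: {simple_average:.0f}Gbps")
print(f"per-card load:  {per_card_load:.0f}Gbps, needing {lanes_needed:.0f} lanes")
```

Every slot, even the x4 ones, comfortably covers the 8Gbps each card needs; the question is only whether the slots share a single path to memory.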

The two x16 slots and x8 slot could support the maximum transfer rate of the Fusion cards, 32Gbps per direction, or an aggregate of 64Gbps. The cards' spec sheet allows 20-24Gbps large (1MB) transfers per card, which with some load-generator tuning, could've resulted in 60Gbps aggregate from just 3 cards.

If the I/O total load, 64Gbps, is split evenly between cards, each card must process an aggregate 8Gbps, with equal read/write loads, or 4Gbps per direction.

If 64KB pages are read/written, then each server will need to move 64K (65,536) pages per second per direction (32Gbps per direction at 512Kbit per page), or 8,192 pages per second per card.

The x4 slots, with 16Gbps available in each direction (aggregate of 32Gbps), will transfer 64KB in 15.26 usec.
The x8 slots will transfer a 64KB page in half that time, 7.63 usec.

The average 64KB transfer time for the mix of cards (5 * x4, 3 * x8) in the system is:

( (5 * 15.26) + (3 * 7.63)) / 8, or 12.40 usec,

or 80,659 64KB pages per second per direction, about 23% more than the 65,536 required, leaving headroom for other traffic and bus controller overheads.

The required 32Gbps per direction seems feasible.
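The page-transfer figures can be reproduced with the sketch below. Two assumptions are needed to match the numbers in the text: binary units throughout (1Gbps taken as 2^30 bits/sec, 64KB as 65,536 bytes), and each transfer timed at the slot's aggregate (both directions) rate, which appears to be how the figures were derived. The function name `transfer_usec` is my own.

```python
PAGE_BYTES = 64 * 1024    # 64KB page
GIBI = 2 ** 30            # assumption: binary units, 1Gbps = 2^30 bits/sec


def transfer_usec(aggregate_gbps: float) -> float:
    """Microseconds to move one 64KB page at the given aggregate bit rate."""
    bytes_per_sec = aggregate_gbps * GIBI / 8
    return PAGE_BYTES / bytes_per_sec * 1e6


x4_usec = transfer_usec(32)    # x4 slot: 16Gbps each way, 32Gbps aggregate
x8_usec = transfer_usec(64)    # x8 slot: 32Gbps each way, 64Gbps aggregate

# Simple average over the slot mix: 5 * x4 and 3 * x8.
avg_usec = (5 * x4_usec + 3 * x8_usec) / 8
pages_per_sec = 1e6 / avg_usec

# Pages per second needed to carry 32Gbps in one direction.
required_pages_per_sec = 32 * GIBI / 8 / PAGE_BYTES

print(f"x4: {x4_usec:.2f} usec, x8: {x8_usec:.2f} usec, average: {avg_usec:.2f} usec")
print(f"achievable: {pages_per_sec:,.0f} pages/sec")
print(f"required:   {required_pages_per_sec:,.0f} pages/sec")
print(f"margin:     {pages_per_sec / required_pages_per_sec - 1:.0%}")
```

Under those assumptions the x4 time is 15.26 usec, the average is 12.40 usec, and the achievable rate is about 23% above the 65,536 pages per second required.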

~~This either says the DL370 has multiple PCI-e buses, not mentioned in the spec sheet, or something else happened.~~