Storage Elasticity and the Cloud’s Noisy Neighbors

If you’re looking at setting up private cloud services (or public ones, for that matter), you should already have a handle on how to automate the critical functions of creating, managing and migrating virtual servers. That’s because if you haven’t achieved this level of service elasticity, it is going to cost a fortune in admin time whenever your customer needs to rapidly spin up another server, or burst to a more powerful virtual machine to handle workload peaks.

But what about the other two key elements of any IT infrastructure, namely networking and storage? Networking elasticity is addressed to various degrees by SDN, virtual LANs, and of course network optimization technologies, and in any case, it is relatively cheap to over-provision the network element — it should be cheaper than over-provisioning server or storage capacity, anyhow.

Storage is another kettle of fish, however. Sure, automated management platforms can pull blocks from a pool, build them into a logical volume, and allocate that to a new virtual server instance. But can they guarantee performance on multi-tenanted storage? Or are you selling the cloud user a bigger virtual server with a 40Gbit network pipe, yet still expecting them to suck data through a straw?

I was reminded of this recently while talking with Jason Carolan, the CTO at US-based cloud and co-lo hosting provider ViaWest. He says the challenge is compounded by storage’s susceptibility to what he calls the noisy neighbor syndrome — it is all too easy for a cloud application to burst up, consume all the available I/O, and swamp other users allocated to the same array.

In the past ViaWest has solved this problem for some of its enterprise hosting customers simply by buying each one its own storage array. However, while this removes the potential for interruption by noisy neighbors and guarantees a service level, it doesn’t do a lot for storage utilization rates. That is made a whole lot worse by enterprise storage being notoriously lumpy stuff to buy: when the smallest practical purchase is maybe 25TB or 30TB, you really do want your users to share wherever possible. If you can add thin provisioning and data deduplication, shared storage becomes an even better value.
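To make the utilization argument concrete, here is a rough back-of-the-envelope sketch. The tenant sizes are invented purely for illustration, and the 30TB array size simply takes the upper figure mentioned above. It ignores thin provisioning, deduplication, and any headroom you would keep in practice.

```python
import math


def arrays_dedicated(demands_tb, array_tb):
    """One or more arrays per tenant; nothing is shared."""
    return sum(max(1, math.ceil(d / array_tb)) for d in demands_tb)


def arrays_shared(demands_tb, array_tb):
    """All tenants draw from one pool built from as few arrays as will fit."""
    return max(1, math.ceil(sum(demands_tb) / array_tb))


def utilization(demands_tb, arrays, array_tb):
    """Fraction of purchased capacity that tenants actually use."""
    return sum(demands_tb) / (arrays * array_tb)


# Hypothetical tenants (TB used) and a hypothetical minimum array size.
tenants_tb = [4, 7, 10, 2, 4]
ARRAY_TB = 30

dedicated = arrays_dedicated(tenants_tb, ARRAY_TB)
shared = arrays_shared(tenants_tb, ARRAY_TB)
print(dedicated, utilization(tenants_tb, dedicated, ARRAY_TB))
print(shared, utilization(tenants_tb, shared, ARRAY_TB))
```

With these made-up numbers, five dedicated arrays sit mostly empty, while a single shared array holds everyone. That shared array is also exactly where the noisy-neighbor problem comes back.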

Storage QoS

So in a true cloud environment, where users and workloads can come and go at almost a moment’s notice, and where you can’t afford the luxury of dedicated hardware (which breaks the cloud model anyhow), what are the options for guaranteeing storage quality-of-service? Because if you can’t offer that guarantee, people are not going to use your cloud for more than their non-critical, “Hey, it’s good enough” applications, and those are just a small portion of the total.

Over-provisioning is an obvious one, and probably the easiest and quickest to implement. It will work for some, but although disk is cheap, the SAN connectivity, replication, storage management, and all the other infrastructure needed to support a cloud storage deployment most certainly aren’t.

Storage tiering can help on the cost front by moving applications between differently-performing storage types — e.g. SSD, fast SAS and bulk SATA — depending on their needs. However, it can also result in less predictable QoS as it shifts noisy and quiet applications between tiers.

Prioritization can deal with some of the problem. By allocating applications to tiers such as mission-critical, medium, or low you can at least avoid a low-priority application swamping a mission-critical one. The problem for your tenants is they may not know what each tier actually means or who else has which allocations — and it can actually make the noisy-neighbor problem worse if it means there is even less to stop a high priority tenant from turning up the volume.

Rate-limiting is another option: put a hard limit on how much I/O bandwidth each cloud app can use; pay more, and your app gets a higher limit. This can work well for keeping the neighbors a little quieter and protecting the storage from overloading. The caveat is that users may assume they are buying the right to have that much bandwidth consistently, and in a multi-tenanted cloud that is probably not feasible. Beware, too, that rate-limiting could annoy a customer with a bursty application by adding noticeable latency to it.
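One common way to implement a limit like this is a token bucket, sketched below. This is an assumption for illustration, not something any particular array or provider is known to use, and the class name, method name, and numbers are all hypothetical. Each tenant earns I/O credits at a fixed rate and may bank a small burst allowance. Requests beyond that are delayed, which is exactly the added latency a bursty application would notice.

```python
class TokenBucket:
    """Hypothetical per-tenant limiter: `rate` I/Os per second, bursts up to `burst`."""

    def __init__(self, rate, burst, now=0.0):
        self.rate = float(rate)      # credits earned per second
        self.burst = float(burst)    # most credits that can be banked
        self.tokens = float(burst)   # start with a full bucket
        self.last = float(now)       # time of the last refill, in seconds

    def delay_for(self, ops, now):
        """Charge `ops` I/Os at time `now`; return seconds the request must wait."""
        elapsed = max(0.0, now - self.last)
        self.tokens = min(self.burst, self.tokens + elapsed * self.rate)
        self.last = max(self.last, now)
        self.tokens -= ops           # may go negative: that is the queued backlog
        if self.tokens >= 0:
            return 0.0
        return -self.tokens / self.rate


# A tenant paying for 100 I/Os per second with a 50-I/O burst allowance.
bucket = TokenBucket(rate=100, burst=50)
first = bucket.delay_for(50, now=0.0)    # fits within the burst allowance
second = bucket.delay_for(25, now=0.0)   # exceeds it, so the request is delayed
later = bucket.delay_for(10, now=1.0)    # bucket has refilled after a quiet second
print(first, second, later)
```

Paying more would simply mean a larger `rate` or `burst`. Note that the limiter only caps what a tenant can take. It does nothing to guarantee that the array can actually deliver that much to everyone at once.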

Scale-out storage

The last option is “shared nothing” scale-out storage. The scale-out element means that as demand rises you can add more nodes to the storage cluster, with each node adding both storage capacity and storage bandwidth, while shared nothing means there is no single choke point. Typically this involves a global namespace or file system, and examples of scale-out storage are everywhere, from giants such as EMC, IBM, and NetApp to newcomers like Caringo, Coraid, Gridstore, and Xiotech.
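The toy model below illustrates the two properties just described: every node added contributes both capacity and I/O, and data placement is computed from a hash rather than routed through a central controller. It uses rendezvous hashing as one possible placement scheme. That choice, the class names, and the node figures are all assumptions for illustration, not a description of how any vendor named here works.

```python
import hashlib


class Node:
    def __init__(self, name, capacity_tb, iops):
        self.name = name
        self.capacity_tb = capacity_tb
        self.iops = iops


class Cluster:
    """Toy shared-nothing cluster: totals are simply the sum of the nodes."""

    def __init__(self):
        self.nodes = []

    def add_node(self, node):
        self.nodes.append(node)

    @property
    def capacity_tb(self):
        return sum(n.capacity_tb for n in self.nodes)

    @property
    def iops(self):
        return sum(n.iops for n in self.nodes)

    def owner(self, key):
        """Rendezvous hashing: the node with the highest hash score owns the key."""
        def score(node):
            return hashlib.sha256(f"{node.name}:{key}".encode()).digest()
        return max(self.nodes, key=score)


cluster = Cluster()
cluster.add_node(Node("node-1", capacity_tb=30, iops=50_000))
cluster.add_node(Node("node-2", capacity_tb=30, iops=50_000))
print(cluster.capacity_tb, cluster.iops)

cluster.add_node(Node("node-3", capacity_tb=30, iops=50_000))
print(cluster.capacity_tb, cluster.iops)
print(cluster.owner("volume-42").name)
```

Because any client can compute the owner of a key for itself, there is no single lookup service to become a choke point. When a node joins, the only keys that move are the ones the new node now owns.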

There are also several free open source scale-out options, including OpenStack Swift. And increasingly, scale-out storage will be at least partly based on solid-state Flash memory — indeed, Carolan says ViaWest went for an entirely solid-state version, from storage QoS specialist SolidFire.

When it comes down to it, the perceived lack of predictability and performance is a major reason why people are reluctant to move heavyweight enterprise applications into the cloud. Lighter, more latency-tolerant applications such as email and personal productivity tools, yes, but not the mission-critical stuff. Dealing with that perception will be key if you want people to deploy enterprise applications in your cloud.