Remember the old days, when one could look under the hood of a car when the Check Engine light went on? That feels like eons ago, given that cars today basically have a “start button” to fire up the car and get the computer going. The Check Engine light today means a mandatory visit to the service department of your dealer to run detailed computer diagnostics to figure out if the problem is digital or mechanical. As a result, even simple problems come with a time-consuming and often costly resolution.
I find the check engine light analogy a very useful one as I talk to IT leaders about troubleshooting the cloud. The parallels are there: two steps forward sometimes come with one step back. There are many strong business arguments for the cloud, including economies of scale, but positive gains are soon lost when the cloud light pops on and there isn’t any good answer as to why. This problem will be even more acute as we start to migrate real-time communications and collaboration tools to the cloud. Even worse, what if the light doesn’t come on at all and there is a performance issue with a critical application, video session, or a C-level exec’s virtual desktop? Will we really need to bring in experts to diagnose even the simplest of problems? As we continue to integrate things up and down the IT stack and fully automate many of the manual tasks we do today, troubleshooting cloud problems will become more and more difficult. And unfortunately, there’s no actual light: our “check cloud” light will be users complaining about performance issues, leaving the IT department possibly fighting more fires than before. Sure, the “old way” of running a data center was highly expensive, operationally intensive, and wasteful of data center resources, but at least IT had control over the environment.
I hate to be the downer at the cloud party, as everyone is now gaga over the cloud, but unlike most analysts I started my career working in internal IT for a financial services firm. I have seen with my own eyes how IT can shoot itself in the foot if it doesn’t look before it leaps (as I did several times myself).
Sure, there are basic questions about security, data integrity, provisioning and so on, but they pale in comparison to questions about the operational management and processes surrounding performance and availability. And having canvassed cloud providers myself, I can say that the answers are underwhelming when we get into a conversation about what they do when their cloud underperforms, or when it just isn’t there at all. That is because the performance management frameworks they have in place are inadequate for the cloud. Most of the current management tools were built for a different era of computing, and if we’re going to move IT to the cloud, then our industry needs some better cloud management tools.
I find that most cloud providers have some kind of framework management solution with modules or connectors that collect performance information about the different infrastructure silos: server, network, storage, and desktop (in the case of desktop as a service). These modules worked fine when infrastructure was dedicated and fairly static, but they are completely inadequate in an environment where all resources are virtual and shared, and where the relationships that deliver services come together and disband at the drop of a hat. Add to that the concept of workload mobility and automation, where virtual resources are automatically moved from one place to another, and the problem is compounded. Lastly, there is generally no model of management transparency between an enterprise and the cloud provider. Put all of that together and you can see how provider/customer relationships can easily become rocky.
What’s needed today are better visibility tools that work across IT functions rather than within a single silo. A must today is a predictive profiling engine that learns the behavior of the infrastructure down to the individual VMs. Additionally, better contextual information is needed through the coupling of alerts with historical data. For example, is the cloud running slow because of latency or because of a code change?
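To make the idea a bit more concrete, here is a minimal Python sketch of what such a profiling engine might look like. It is an illustration under my own assumptions, not a description of any vendor's product: the class names (`VmProfile`, `Profiler`), the three-standard-deviation threshold, the five-minute change window, and the sample data are all hypothetical. Each VM gets a running latency baseline, and when a reading falls far outside that baseline, the alert is paired with any changes recently recorded against the same VM.

```python
from dataclasses import dataclass
from math import sqrt


@dataclass
class VmProfile:
    """Running latency baseline for one VM (Welford's online algorithm)."""
    count: int = 0
    mean: float = 0.0
    m2: float = 0.0  # running sum of squared deviations from the mean

    def update(self, value):
        self.count += 1
        delta = value - self.mean
        self.mean += delta / self.count
        self.m2 += delta * (value - self.mean)

    def stddev(self):
        return sqrt(self.m2 / (self.count - 1)) if self.count > 1 else 0.0

    def is_anomalous(self, value, threshold=3.0, min_samples=5):
        # Too little history means no opinion yet.
        if self.count < min_samples:
            return False
        sd = self.stddev()
        if sd == 0.0:
            return value != self.mean
        return abs(value - self.mean) / sd > threshold


class Profiler:
    """Learns per-VM behavior and attaches change history to alerts."""

    def __init__(self, change_window=300):
        self.profiles = {}   # vm_id -> VmProfile
        self.changes = []    # (timestamp, vm_id, description)
        self.change_window = change_window  # seconds (assumed value)

    def record_change(self, ts, vm_id, description):
        self.changes.append((ts, vm_id, description))

    def observe(self, ts, vm_id, latency_ms):
        profile = self.profiles.setdefault(vm_id, VmProfile())
        if not profile.is_anomalous(latency_ms):
            profile.update(latency_ms)  # normal reading: keep learning
            return None
        # Anomaly: couple the alert with recent changes on the same VM.
        recent = [d for (t, v, d) in self.changes
                  if v == vm_id and 0 <= ts - t <= self.change_window]
        return {"vm": vm_id, "ts": ts, "latency_ms": latency_ms,
                "baseline_ms": round(profile.mean, 1),
                "recent_changes": recent}


# Hypothetical usage: learn a baseline, log a code change, then see a spike.
profiler = Profiler()
for ts, ms in enumerate([20, 22, 19, 21, 20, 23]):
    profiler.observe(ts, "vm-1", ms)
profiler.record_change(10, "vm-1", "deploy build 42")
alert = profiler.observe(60, "vm-1", 95)
print(alert)
```

If `recent_changes` comes back empty, the slowdown is more likely an infrastructure or latency problem. If it lists a deployment, the code change is the first suspect. A real engine would track many more metrics and use far richer models, but the coupling of alert and history is the point.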
Lastly, cloud providers need to provide better transparency into the cloud. There’s no reason cloud providers can’t give customers a mini dashboard that lets IT see what’s going on, instead of having to go through the provider’s customer service group.
It seems that with every IT trend, we make management the last thing we think about. Being proactive about cloud management will accelerate the adoption of cloud services, which is what we’ve all been waiting for.