Data Center Management Tips:

Meeting performance standards and SLAs in the cloud

By Tom Nolle, Contributor

searchDataCenter.in

Service-level agreements (SLAs) are common in network services, where they measure the parameter set popularly called "QoS," but for cloud computing or platform services it's difficult to find helpful precedents for negotiating one.

At a high level, the issues are the same: you must define criteria to be met and remedies if they are not. The devil is in the details, and to get there it's essential that you begin with the parameters of the application experience at the user level. The business case for cloud computing will be based on some expected range of availability and performance, and that's what the SLA must address.

The 6 critical elements of cloud behavior to be considered in performance SLAs:
How quickly does the cloud allocate resources when an application is first requested? This could be critical for applications that load and run for only a brief period.

How quickly does the cloud allocate new resources to an application if usage of that application increases, or if a currently allocated resource fails or becomes degraded?

What is the difference in application response time across the range of resources that could be allocated to an application? Resources spread out geographically may impact network access.

What is the delay associated with the cloud's virtualization process (the mapping of "logical" resources like URLs to the actual servers), and how reliable and available is this critical component?

Are there SLAs for the application-on-the-cloud as a whole, or are SLAs available for each instance of the machine image or application being run? Users with only "whole-cloud" SLAs may have problems with availability or performance on their specific application image without violating cloud SLAs.

What management tools does the provider offer to monitor each of the elements of the cloud SLA to ensure compliance? Do these work on both a whole-cloud and a per-machine-image basis? Are they accessible even if the "cloud" they manage is down?
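The gap between whole-cloud and per-image SLAs can be shown with a small calculation. The sketch below is only an illustration: the image names, the availability figures, the 99.5% target, and the choice of a simple average as the whole-cloud measure are all assumptions, not values from any provider.

```python
# Hypothetical monthly availability (fraction of time up) for each application image.
image_availability = {
    "image-a": 0.9999,
    "image-b": 0.9999,
    "image-c": 0.9900,
    "image-d": 0.9999,
}

WHOLE_CLOUD_TARGET = 0.995  # assumed SLA target, for illustration only


def whole_cloud_availability(per_image):
    """Average availability across all images (one simple way to aggregate)."""
    return sum(per_image.values()) / len(per_image)


def images_below(per_image, target):
    """Names of images whose own availability misses the target."""
    return sorted(name for name, avail in per_image.items() if avail < target)


aggregate = whole_cloud_availability(image_availability)
print(f"whole-cloud availability: {aggregate:.4f}")
print("whole-cloud SLA met:", aggregate >= WHOLE_CLOUD_TARGET)
print("images below target:", images_below(image_availability, WHOLE_CLOUD_TARGET))
```

With these assumed numbers the aggregate figure passes while one image falls short on its own, which is the situation a user with only a whole-cloud SLA cannot claim against.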

The first point to address in a cloud SLA is that not everything associated with an application experience is part of cloud computing. Cloud performance as measured at the point of application use is the sum of network performance, application performance, and cloud infrastructure performance. The cloud provider can be accountable for the last of these but not the first two, so it's important to understand what the other two factors contribute to overall performance when writing an SLA.

Accessing cloud services over the Internet or other best-effort service will make it very difficult to create a meaningful cloud SLA because the network contributes a completely variable delay, loss, and failure rate. If you want to guarantee transaction/application performance as experienced by the user, you'll need to somehow limit this variable. That may be possible if you can negotiate an SLA with a specific ISP with whom your cloud provider has a direct connection. If you expect to access cloud applications randomly from multiple locations and ISPs, a tight and meaningful application performance metric will be very hard to obtain.

Getting the application's performance variables out of the equation will normally mean running the application using local server and network resources to measure performance under ideal circumstances. The measurement should also note how variations in memory, storage, etc. affect performance, because those same factors may vary in cloud computing services. It's important to duplicate the cloud's IT resources as closely as possible to get a good measurement.

When both application and network performance factors have been handled, the resulting information can be used to set cloud computing performance limits. For example, if an application running locally generates a 1-second transaction response time and the network connection adds a total of a half-second of delay across both directions (not unreasonable for the Internet or VPNs), there is a total delay of 1.5 seconds accumulated. If your operating departments want a 2-second response time guarantee, you can afford to add only another half-second in cloud computing delay.
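The arithmetic above can be written as a small helper. This is a minimal sketch using the figures from the example (1 second of application time, 0.5 second of total network delay, a 2-second target); the function name and its parameters are illustrative assumptions.

```python
def cloud_delay_budget(target_s, app_s, network_s):
    """Seconds left for the cloud after application and network delay are subtracted."""
    budget = target_s - (app_s + network_s)
    if budget < 0:
        # The target is missed before the cloud contributes any delay at all.
        raise ValueError("application and network delay already exceed the target")
    return budget


# Figures from the example: 2-second target, 1-second application, 0.5-second network.
print(cloud_delay_budget(target_s=2.0, app_s=1.0, network_s=0.5))  # 0.5
```

If the helper raises an error, the response time target cannot be met by any cloud SLA, and either the target or the network arrangement has to change first.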

The next critical step is to convert application performance to a set of parameters that can be measured on your cloud provider's infrastructure. This can only be meaningful if you have a specific configuration for your cloud service, so work with your cloud provider (or provider candidates) to devise the best cloud configuration to meet your needs.

This would include whether you use reserved or ad hoc cloud resources, the number of images of your application that would be run at a time, the geography in which they would run, the database used, the system type and memory, etc. This configuration should be tested to ensure that it meets the basic requirements for performance established by the application's users. The configuration exercise will also help define the features of the cloud that have a direct bearing on performance and reliability, such as failover from bad application instances or load balancing among instances.

From this configuration, you must now establish a set of resource usage, availability, and performance metrics based on the management tools/capabilities of the cloud provider. The presumption is that if these metrics are all met, the configuration is performing as you designed it to, and thus the application user objectives are being met.
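One way to picture such a metric set is as a list of limits checked against whatever the provider's management tools report. The sketch below assumes hypothetical metric names and thresholds; a real SLA would use the measurements the provider actually exposes.

```python
from dataclasses import dataclass


@dataclass
class Metric:
    name: str
    limit: float
    higher_is_better: bool

    def is_met(self, observed: float) -> bool:
        """True if the observed value is on the right side of the limit."""
        if self.higher_is_better:
            return observed >= self.limit
        return observed <= self.limit


# Hypothetical SLA metrics and limits, for illustration only.
SLA_METRICS = [
    Metric("availability_pct", 99.9, higher_is_better=True),
    Metric("cloud_delay_s", 0.5, higher_is_better=False),
    Metric("resource_allocation_s", 60.0, higher_is_better=False),
]


def violations(metrics, observed):
    """Names of metrics that are unreported or outside their limit."""
    failed = []
    for m in metrics:
        value = observed.get(m.name)
        if value is None or not m.is_met(value):
            failed.append(m.name)
    return failed


observed = {"availability_pct": 99.95, "cloud_delay_s": 0.62, "resource_allocation_s": 45.0}
print(violations(SLA_METRICS, observed))  # ['cloud_delay_s']
```

Treating an unreported metric as a violation is a deliberate choice here: if the management tools cannot produce a number, compliance cannot be demonstrated.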

Remember that while transaction or application performance is the goal of your cloud SLA, it will likely be your own responsibility to create a performance standard for user experience and then apply that standard to creating cloud computing and network performance objectives.

To apply your cloud SLA effectively, you'll need problem isolation tools to separate issues with the network or your application from those of the cloud. These should be integrated with the cloud management tools available from your provider to build a monitoring portfolio you can use to proactively monitor performance and respond to user complaints.
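At its simplest, problem isolation means comparing the delay measured for each component against the baseline established earlier. The following is a rough sketch under stated assumptions: the component names, the baseline and measured values, and the 0.1-second tolerance are all hypothetical, and it presumes you can measure each component separately.

```python
def isolate(measured, baseline, tolerance_s=0.1):
    """Components whose measured delay exceeds their baseline by more than the tolerance."""
    return sorted(
        name
        for name, value in measured.items()
        if value - baseline[name] > tolerance_s
    )


# Hypothetical per-component delays in seconds.
baseline = {"application": 1.0, "network": 0.5, "cloud": 0.5}
measured = {"application": 1.02, "network": 1.3, "cloud": 0.48}

print(isolate(measured, baseline))  # ['network']
```

In this example the user sees a slow transaction, but the cloud is within its limit and the excess delay sits in the network, so the complaint would not be a cloud SLA matter.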

Tom Nolle is president of CIMI Corporation, a strategic consulting firm specializing in telecommunications and data communications since 1982. He is a member of the IEEE, ACM, Telemanagement Forum, and the IPsphere Forum, and is the publisher of Netwatcher, a journal on advanced telecommunications strategy issues.

22 May 2009
