
Friday, July 15, 2016

Cloud outage audit update: The challenges with uptime

Public cloud vendors improved uptime percentages in 2015, while customers have learned to better handle cloud outages and what works best in their environments. A cloud outage gets plenty of headlines, but the reality of how it impacts customers is much more nuanced.

Users increasingly are finding ways to get around outages that occur with their providers or at least coming to terms with the reality that no public cloud will be close to 100% available without some heavy lifting. Meanwhile, potential customers remain keen to compare vendors' uptime consistency despite the limitations of publicly available data.


CloudHarmony, owned by analyst firm Gartner, is one of the more prominent sites to track uptime among public cloud vendors. It uses a simple, nonperformance-related methodology to track vendor availability across regions by provisioning services with providers in four categories: infrastructure-as-a-service compute and storage, content delivery network and domain name system.

The fairest comparison among the platforms CloudHarmony tracks is uptime percentage, because over a full year, outage totals include scheduled maintenance windows, said Jason Read, CloudHarmony founder. Some vendors also have far more regions than others, meaning a provider such as Amazon could have more outage time collectively, but customers would see better overall availability because the downtime is spread across data centers.

Generally, top vendors saw better uptimes in 2015 compared with 2014. Microsoft garnered the biggest improvement due in large part to not repeating the massive late-year outage that skewed results in 2014. Azure Virtual Machines uptime improved from 99.9339% to 99.9937%, according to CloudHarmony.

"[Cloud vendors] are continually learning from their mistakes, and over time, their services improve," Read said.

Among the biggest public cloud vendors, Amazon had the best uptime percentage for compute at 99.9985% with Elastic Compute Cloud, while CenturyLink had the worst at 99.9647%, according to CloudHarmony. Google Cloud Storage had the best storage uptime at 99.9998%.
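
To get a sense of what those percentages mean in practice, they convert directly into minutes of downtime per year. The short Python sketch below is purely illustrative and is not part of CloudHarmony's methodology; the 99.95% figure is the typical SLA threshold discussed later in this story.

    MINUTES_PER_YEAR = 365 * 24 * 60  # 525,600

    def downtime_minutes_per_year(uptime_pct):
        """Convert an annual uptime percentage into minutes of downtime."""
        return (100 - uptime_pct) / 100 * MINUTES_PER_YEAR

    for pct in (99.9339, 99.9937, 99.9985, 99.95):
        print(f"{pct}% uptime = about {downtime_minutes_per_year(pct):.0f} minutes of downtime per year")

    # 99.9339% (Azure VMs in 2014) -> roughly 347 minutes, almost six hours
    # 99.9937% (Azure VMs in 2015) -> roughly 33 minutes
    # 99.9985% (EC2 in 2015)       -> roughly 8 minutes
    # 99.95%   (a typical SLA)     -> roughly 263 minutes, about 4.4 hours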

Still, Read is quick to note that the CloudHarmony data shouldn't be viewed as a holistic view of cloud uptime, but rather a "measurement of network availability from the outside." Not all services are tracked, so some outages, such as the cascading AWS database as a service event in September, either don't register or show less of an impact than what users observed.

"[The] cloud is a very complex, intricate set of services that should be working together in various ways," Read explained. "It's just impossible to represent availability for all the infinite possible ways you might be using cloud."

Some vendors underreport outages, while others, such as CenturyLink, offer a tremendous amount of data and push transparency even with the smaller outages that customers might not notice, Read said.

Sometimes, outages aren't the fault of the cloud provider. The most severe, frequent and underreported outages are network-related, he added.

Uptime for Google Compute Engine actually fell slightly in 2015, from 99.9859% to 99.9796%, according to CloudHarmony. While Google didn't provide specific figures, company officials claimed they achieved considerable reliability improvements in 2015, according to internal metrics. Tracking cloud uptime is but one measurement of the complexity of network, compute and storage all working properly, said Ben Treynor Sloss, vice president of engineering at Google. Customers could be using 20 instances at any given time, and one of those could have a maintenance issue without impacting the overall performance of the application. If that lagging instance is the one service being tracked, unavailability would be over-reported at the server level and availability underreported at the system level.
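
A toy example makes Sloss' point concrete. Suppose an external probe happens to be pinned to the one instance out of 20 that is under maintenance; the numbers below are hypothetical and only illustrate how the server-level and system-level views diverge.

    TOTAL_INSTANCES = 20
    INSTANCES_UNDER_MAINTENANCE = 1

    # The probe tracks the single instance that happens to be down,
    # so it reports total unavailability for the maintenance window.
    probe_availability = 0.0

    # Behind a load balancer, the other 19 instances keep serving traffic,
    # so users of the overall application may notice nothing at all.
    fleet_capacity = (TOTAL_INSTANCES - INSTANCES_UNDER_MAINTENANCE) / TOTAL_INSTANCES

    print(f"Server-level availability reported by the probe: {probe_availability:.0%}")
    print(f"Share of the fleet still serving requests: {fleet_capacity:.0%}")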

Other issues, such as autoscaling and load balancing, come into play as service demand rises and falls throughout the day, and can considerably impact performance of larger applications, he added.

"It's still better than having no data, but it isn't sufficient to actually tell you what the customer is going to be experiencing," Sloss said.

Becoming savvier about risks, tradeoffs

Most cloud vendors' service-level agreements (SLAs) offer the same 99.95% uptime guarantee. Private data centers aren't immune to outages, but the lack of control and insufficient availability from public clouds deter some IT shops. That's also partly why most public cloud workloads aren't production or mission-critical applications.

CityMD, an urgent care organization with 50 facilities in New York and New Jersey, uses Google for email collaboration and hosts several of its external websites with AWS, but none of its core IT infrastructure is in the public cloud.

"We are almost a 24/7 shop, therefore, the uptime is a large priority for all of us," said Robert Florescu, vp of IT at CityMD. "It was more cost-efficient to place our infrastructure inside the data center."

IT pros are at different maturity stages with the cloud, so there's still some concern about unavailability, said Kyle Hilgendorf, an analyst with Gartner. While most are not putting their most essential assets in the public cloud, they are confident in vendors' track records and are willing to be proactive in their designs.

"They so desire cutting-edge features and services that [public clouds] offer which they can't build into their own data centers they're ready to take a number of the risk," Hilgendorf said.

Very few customers seek higher uptime guarantees than public cloud vendors' typical SLAs, because the benefit would be marginal compared with the additional cost and engineering required, Sloss said.

"When you really have to make tradeoffs -- and also to be fair to Amazon, we're all inside same boat -- with anybody, if you would like go beyond running in one zone, you are going to have to perform additional work," Sloss said.

Still, cloud vendors go to great lengths to avoid disruption. At the most recent re:Invent, the annual Amazon Web Services user conference, Jerry Hunter, vice president of infrastructure, discussed the extent to which Amazon goes to maintain uptime, including steps to improve building design and usage. The company also built its own servers, network gear and data center equipment to reduce potential breakdowns, and improved automation to reduce human errors and increase reliability.

AWS has purpose-built networks in each data center, with two physically separated fiber paths extending from each facility. Amazon has also built its own Internet backbone to improve Web performance and monitoring capabilities.

"We've turned our supply chain from [a] potential liability right into a strategic advantage" using input in the retail side from the business in Amazon fulfillment centers, and learned the best way to modify processes and workflows, Hunter said.

Several years ago, there were considerable discrepancies between vendors, but now all of the big providers have industrialized and automated their processes to close the gaps, said Melanie Posey, an analyst with IDC in Framingham, Mass. Customers also have added more redundancies to redirect users and services to another region if there's an outage in their primary data center.

How the industry is becoming more savvy

IT pros who use Amazon not only are increasingly savvy about ways to protect themselves from downtime, they're also dismissive of those who decry the cloud based on availability.

For example, after an AWS outage last September, IT pros said the most disruption came from people who viewed it as evidence that cloud computing is inherently riskier than on-premises deployments. One user compared it to a single accident on a freeway leading drivers to conclude that highways in general are less safe than surface roads.

"Despite the inconvenience and the many press attention, you would be hard-pressed to seek out corporate customers or consumer owners who are so fed up together with the AWS outage how they would abandon their cloud services," wrote AWS expert and TechTarget contributor Jeff Kaplan soon after the September issues.

Proactive users should take advantage of multiple availability zones and multiple regions, if at all possible, to guard against a natural disaster or a region going down, Hilgendorf said. To take it a step further, the best practice for handling cascading software bugs or expiring certificates is to use multiple vendors.
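
As a rough sketch of what that proactive design can look like, the boto3 example below spreads instances across availability zones in two AWS regions so the loss of a single zone, or an entire region, doesn't take the application with it. The region layout and AMI IDs are hypothetical placeholders, and this is a minimal illustration rather than a prescribed architecture.

    import boto3

    # Hypothetical layout: two regions, two availability zones each.
    DEPLOYMENTS = {
        "us-east-1": ["us-east-1a", "us-east-1b"],
        "us-west-2": ["us-west-2a", "us-west-2b"],
    }
    # Placeholder AMI IDs; a real deployment would use region-specific images.
    AMI_BY_REGION = {"us-east-1": "ami-11111111", "us-west-2": "ami-22222222"}

    def launch_spread(instance_type="t2.micro"):
        """Launch one instance per availability zone in each configured region."""
        launched = []
        for region, zones in DEPLOYMENTS.items():
            ec2 = boto3.resource("ec2", region_name=region)
            for zone in zones:
                instances = ec2.create_instances(
                    ImageId=AMI_BY_REGION[region],
                    InstanceType=instance_type,
                    MinCount=1,
                    MaxCount=1,
                    Placement={"AvailabilityZone": zone},
                )
                launched.extend(instance.id for instance in instances)
        return launched

    if __name__ == "__main__":
        print(launch_spread())

Spreading the instances is only half the job; steering traffic away from a failed zone or region still requires DNS failover or load balancing on top, which is the additional work Sloss alludes to above.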

"All with this comes down to the amount cost are you happy to absorb with the insurance of maybe blocking the low chances of an unavailability event," Hilgendorf said.

A more reactive stance involves management and monitoring with tools like CloudWatch, though this helps more with the unavailability of a specific application than with that of the provider, Hilgendorf explained.
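
As a rough illustration of that reactive stance, the boto3 sketch below creates a CloudWatch alarm on a load balancer's unhealthy-host count; the load balancer name and SNS topic ARN are hypothetical placeholders, and the thresholds are arbitrary.

    import boto3

    cloudwatch = boto3.client("cloudwatch", region_name="us-east-1")

    cloudwatch.put_metric_alarm(
        AlarmName="app-unhealthy-hosts",
        Namespace="AWS/ELB",
        MetricName="UnHealthyHostCount",
        Dimensions=[{"Name": "LoadBalancerName", "Value": "my-app-elb"}],  # placeholder name
        Statistic="Average",
        Period=60,              # check the metric every minute
        EvaluationPeriods=3,    # three bad minutes in a row trigger the alarm
        Threshold=0,
        ComparisonOperator="GreaterThanThreshold",
        AlarmActions=["arn:aws:sns:us-east-1:123456789012:ops-alerts"],  # placeholder topic
    )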

"Most from the unavailability events we hear from clients are, 'Oh, we screwed our application up, we misconfigured something,' or, 'We stood a developer go ahead and change a setting, and everything went awry,'" he was quoted saying.

Additionally, there are resources such as CloudTrail for logging API requests and changes to perform root-cause analysis.
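
For example, a minimal boto3 sketch for pulling recent API activity during an incident window might look like the following; the event name and the one-hour window are assumptions chosen for illustration.

    from datetime import datetime, timedelta
    import boto3

    cloudtrail = boto3.client("cloudtrail", region_name="us-east-1")

    # Look back one hour for a configuration-changing call, e.g. a security group edit.
    response = cloudtrail.lookup_events(
        LookupAttributes=[
            {"AttributeKey": "EventName", "AttributeValue": "AuthorizeSecurityGroupIngress"}
        ],
        StartTime=datetime.utcnow() - timedelta(hours=1),
        EndTime=datetime.utcnow(),
    )

    for event in response["Events"]:
        print(event["EventTime"], event["EventName"], event.get("Username", "unknown"))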

Any cloud service can go down at any moment, but given the number of tools available to users now, in the majority of situations the fault probably lies with the customer, Hilgendorf said.