High availability refers to a system or component that is operational without interruption for long periods of time.
High availability is measured as a percentage of uptime, with 100% indicating a service that experiences zero downtime. This would be a system that never fails, which is rare for complex systems. Most services fall somewhere between 99% and 100% uptime. Most cloud vendors offer some type of Service Level Agreement (SLA) around availability. Amazon, Google, and Microsoft set their cloud SLAs at 99.9%. The industry generally recognizes this as very reliable uptime. A step above, 99.99%, or “four nines,” is considered excellent uptime.
But four nines of uptime still allows roughly 52 minutes of downtime per year. Consider how many people rely on web tools to run their lives and businesses. A lot can go wrong in 52 minutes.
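To see where the 52-minute figure comes from, here is a minimal sketch that converts an availability percentage into a yearly downtime budget. The function name `downtime_minutes_per_year` is made up for illustration, and the calculation assumes a 365-day year with no allowance for scheduled maintenance windows.

```python
# Assumption: a 365-day year (525,600 minutes), ignoring leap years.
MINUTES_PER_YEAR = 365 * 24 * 60


def downtime_minutes_per_year(availability_percent: float) -> float:
    """Return the yearly downtime (in minutes) implied by an uptime percentage."""
    if not 0 <= availability_percent <= 100:
        raise ValueError("availability_percent must be between 0 and 100")
    unavailable_fraction = 1 - availability_percent / 100
    return unavailable_fraction * MINUTES_PER_YEAR


if __name__ == "__main__":
    # Compare two nines, three nines, and four nines.
    for target in (99.0, 99.9, 99.99):
        minutes = downtime_minutes_per_year(target)
        print(f"{target}% uptime allows about {minutes:.1f} minutes of downtime per year")
```

Each extra nine cuts the downtime budget by a factor of ten: 99% allows about 3.65 days per year, 99.9% about 8.76 hours, and 99.99% about 52.6 minutes.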
So what is it that makes four nines so hard? What are the best practices for high availability engineering? And why is 100% uptime so difficult?
These are just a few of the questions we’ll aim to answer here in our guide to high availability.
With great complexity comes great responsibility
Imagine a set of concrete steps in your neighborhood. With some simple maintenance and upkeep, and barring some dramatic change to their environment, those stairs will basically last forever. The historical “uptime” of that set of stairs is excellent. We still use stairs today that were constructed thousands of years ago and have rarely, if ever, been “unavailable.”
Now imagine an escalator. You’ve got a product solving the same problem as the stairs (getting people up and down) with some great added features and benefits (the product burns energy, not the user). And you’ve got all the variables the stairs face (regular use, environmental conditions) along with a whole stack of new variables: moving parts, power sources, belts, gears, and lubricants.
It’s easy to see which product will have better uptime. The escalator will need to come offline for regular repair and maintenance. Sometimes it will just plain break down. Eventually the whole system will need to be replaced. The best escalator engineers in the world can’t build an escalator with the same uptime as stairs. It’s not even worth trying. The best way to keep an escalator running as often as possible is to intentionally take it offline for routine maintenance.
You can think of web services the same way. Adding complexity can lead to some great features and benefits, but it reduces the odds of achieving extreme uptime. An average engineer can build a static web page with a few pieces of text that enjoys much higher uptime than Facebook. Even though Facebook has a bunch of engineers and other resources, they’re dealing with way more complexity.
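One rough way to put numbers on this is to treat a service as a chain of components that all have to be up at once. The sketch below rests on simplifying assumptions that are not from this guide: failures are independent, there is no redundancy, and every component happens to be at four nines. The function name `serial_availability` and the component counts are hypothetical.

```python
import math
from typing import Iterable


def serial_availability(component_availabilities: Iterable[float]) -> float:
    """Availability of a system that needs every component up at the same time.

    Assumes independent failures and no redundancy, so the individual
    availabilities (as fractions between 0 and 1) simply multiply.
    """
    return math.prod(component_availabilities)


if __name__ == "__main__":
    # Hypothetical comparison: one four-nines component vs. ten chained together.
    static_page = serial_availability([0.9999])
    complex_service = serial_availability([0.9999] * 10)
    print(f"1 component:   {static_page:.4%}")
    print(f"10 components: {complex_service:.4%}")
```

Under these assumptions, ten four-nines components in a row land at roughly three nines overall. Real systems claw some of that back with redundancy, which is itself more complexity to manage.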
Building for high availability comes down to a series of trade-offs and sacrifices. Complexity vs. simplicity is one of the first decisions someone needs to make when considering building for high availability.