by Sundarram P. V.

Friday, June 06, 2008

Five9s Availability - Is it too much to ask for?

Imagine logging into GMail and suddenly getting an "oil change" page. No way that's going to happen now, right? Nearly four years back I used to see this page quite often, but over time GMail has certainly matured as a product. They are still adding features seamlessly and making releases without any downtime of the service. I look in awe at the many services that manage this, and I sometimes wonder what it takes to achieve 99.999% (Five9s) availability, i.e. a little over 5 minutes of downtime in a year.
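As a quick sanity check on that number, here is a minimal Python sketch that converts an availability percentage into a yearly downtime budget. The function name is made up for illustration, and a plain 365-day year is assumed.

```python
# Yearly downtime budget for a given availability percentage.
# Assumption: a 365-day year (leap years ignored).
MINUTES_PER_YEAR = 365 * 24 * 60  # 525,600 minutes


def downtime_minutes_per_year(availability_percent):
    """Return how many minutes per year the service may be down."""
    return MINUTES_PER_YEAR * (1 - availability_percent / 100)


if __name__ == "__main__":
    for target in (99.0, 99.9, 99.99, 99.999):
        budget = downtime_minutes_per_year(target)
        print(f"{target}% -> {budget:.2f} minutes/year")
```

Each extra nine cuts the budget by a factor of ten, so Five9s leaves about 5.26 minutes for the whole year.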


Message from ABC: OOPS, No donuts for you.
Developer of ABC: The possibility of this is infinitesimally small and guess what, Shit happens!!!
User: Damn!!! It always happens when I am in the middle of something important.
But hey, why go through such an ordeal... Let me try XYZ...

Real donuts for guessing the service ;)

Why?

For starters, if there are a lot of outages, users will lose trust in the service and might start looking at it as a liability. Twitter is a classic example of that. I would hate to see GMail go down while I am using it. In the Web 2.0 world there are a lot of 'me too' services ready to take your place, and in many instances the only differentiator is the speed and availability of your service.
Availability normally becomes a concern only in the later stages of a start-up, since features take precedence early on. But a little common sense applied at the beginning will stave off the availability issues for longer.

Why measure?

It is always better to measure than not to. Measurements let you compare the availability of the service month over month and track the progress made. In extreme cases, one can boast at some scalability and availability conference. ;)
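A rough sketch of what such a measurement could look like in Python, assuming you keep a log of outage start and end times. The function name and the sample outage are hypothetical, and the outages are assumed not to overlap.

```python
from datetime import datetime


def availability_percent(period_start, period_end, outages):
    """Percentage of the period during which the service was up.

    outages: list of (start, end) datetime pairs, assumed non-overlapping.
    """
    total = (period_end - period_start).total_seconds()
    down = 0.0
    for start, end in outages:
        # Count only the part of the outage that falls inside the period.
        clipped_start = max(start, period_start)
        clipped_end = min(end, period_end)
        if clipped_end > clipped_start:
            down += (clipped_end - clipped_start).total_seconds()
    return 100.0 * (total - down) / total


# Hypothetical log: a single 45-minute outage in May 2008.
may = availability_percent(
    datetime(2008, 5, 1),
    datetime(2008, 6, 1),
    [(datetime(2008, 5, 12, 3, 0), datetime(2008, 5, 12, 3, 45))],
)

if __name__ == "__main__":
    print(f"May availability: {may:.4f}%")
```

Running the same function over each month gives the month-over-month comparison. Note that one 45-minute outage already keeps that month just under 99.9%.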

Why Five9s and not any other number?

Roughly five minutes of downtime a year is neither huge nor tiny, though this depends on the type of outage and service. Choose a number that suits your goals. Availability is measured over the service's business hours (24x7 for most Web 2.0 startups). Five9s is quite difficult to achieve, but not impossible. Every year very few services achieve it. Personally, I look at it as a benchmark, as only a select few make it into this league.

The downtimes can be categorized into two kinds, viz. predictable and unpredictable ones.

1. Unknown/Unpredictable ones
a. DB/web server needing a restart (often with Windows-based environments)
b. Hardware failures

It should take at least two points of failure for the whole service to fail completely. It pays to have redundancy, but it's expensive too. RAID, depending on the configuration, will only protect the data; the time required to recover from such a disaster and get back online will still be huge. It's better to have redundant h/w and s/w that is hot swappable. This requires a lot of design and architecture consideration. It's always better to assume that a failure like this will happen and to be prepared to tackle it. It is not financially viable for many early-stage startups to have redundancy for everything, but being prepared for such an eventuality will not hurt the company.
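To put a rough number on what redundancy buys, here is a small Python sketch. It assumes the redundant components fail independently of each other and that the service stays up as long as at least one of them is up, which is optimistic for real setups that share power, network or a datacenter. The function name is made up for illustration.

```python
def combined_availability(single, replicas):
    """Availability of `replicas` identical components in parallel.

    single: availability of one component as a fraction (e.g. 0.99).
    Assumption: failures are independent, and one live component is
    enough to keep the service up.
    """
    # The service is down only when every replica is down at once.
    return 1 - (1 - single) ** replicas


if __name__ == "__main__":
    for n in (1, 2, 3):
        print(f"{n} x 99% components -> {combined_availability(0.99, n) * 100:.4f}%")
```

Under those assumptions, two components that are each available 99% of the time give about 99.99% together, which is why a hot-swappable counterpart matters more than squeezing extra uptime out of a single box.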

c. A spike in traffic, taking everything down with it.
It's difficult to be prepared for a spike; it is a given that there will be at least one spike a year that you can't handle. Having an architecture and design for rainy days will pay off in the long run. Services like Amazon EC2 will certainly help in handling a spike, but it all depends on how prepared you are for one.

d. Some goof in the datacenter restarts the server by accident (I am not making it up; it has happened twice.)
To avoid such a scenario, don't host your server in India unless you have a redundant counterpart elsewhere. Not kidding!!! India is light years behind in providing any serious hosting services. If you do need a server in India for whatever reason, avoid resellers. If you have a larger number of servers, go for co-location.

continued...
