Reflecting on the Netflix, Instagram, and Pinterest Downtime
If you were staying at home last night trying to enjoy a wholesome show or movie on Netflix, or perhaps you were out snapping photos with Instagram that you might later share on Pinterest, then you would have quickly found out that all three services were down for a couple of hours. The cause was electrical storms in Northern Virginia, which took a few “Availability Zones” offline but not the entire facility.
If you have a background in Systems Administration or Networking, then you must be asking yourself how a major company like Netflix or Instagram (owned by Facebook) could have such an outage. In reality these services should never have had an outage that lasted that long, because Amazon Web Services (AWS) promotes steps for High Availability deployments, and because High Availability and Fault-Tolerant deployments are nothing new.
More importantly, why were people on the West Coast of the United States affected by an outage in Northern Virginia when it's known to some that Netflix maintains large deployments in Northern California and Oregon? I, for instance, did not have service, but if Netflix had used Anycast to route traffic to the nearest data center, much like Cloudflare does with their service, then I would have had little or no disruption to my movie-watching experience.
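To make the idea concrete, here is a rough sketch in Python of the decision I would have wanted made on my behalf: send the user to the closest site that is still up. This is only an illustration, not how Netflix or Cloudflare actually route traffic. Real Anycast happens at the BGP layer, not in application code, and the `Region` type, the `pick_region` function, and the latency numbers below are all made up for the example.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Region:
    name: str
    latency_ms: float  # hypothetical latency as seen from the user
    healthy: bool      # hypothetical result of a health check


def pick_region(regions: List[Region]) -> Optional[Region]:
    """Return the lowest-latency healthy region, or None if every region is down."""
    candidates = [r for r in regions if r.healthy]
    if not candidates:
        return None
    return min(candidates, key=lambda r: r.latency_ms)


# A West Coast user during the storm: Northern Virginia is down,
# but both western sites are still healthy. All numbers are invented.
regions = [
    Region("Northern Virginia", 80.0, healthy=False),
    Region("Northern California", 15.0, healthy=True),
    Region("Oregon", 25.0, healthy=True),
]

best = pick_region(regions)
print(best.name if best else "no healthy region")
```

With these made-up numbers the West Coast user lands on Northern California, and an East Coast user would be sent west rather than getting an error page.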
I think these companies need to invest some time into considering what went wrong and how to make sure it does not occur again, or at least into having something in place to make such an outage so minimal that their stock prices are not affected come Monday morning. A good place to start would be looking at technologies like Quagga (BGP Anycast Routing), Ifenslave, HAProxy (High Availability Load Balancing), and perhaps Heartbeat and Pacemaker. (All are available on Ubuntu 12.04 Server LTS, which is on AWS.)
In closing, I applaud Twilio and the other services out there that had competent people working for them to make sure their infrastructure and services were fault-tolerant and to prevent downtime.