Reflecting on Netflix, Instagram, Pinterest Downtime


If you were staying at home last night trying to enjoy a wholesome show or movie on Netflix, or perhaps you were out snapping photos with Instagram that you might later share on Pinterest, then you would have quickly found out that all three services were down for a couple of hours. The cause was electrical storms in Northern Virginia, which took a few “Availability Zones” offline but not the entire facility.

If you have a background in systems administration or networking, then you must be asking yourself how a major company like Netflix or Instagram (owned by Facebook) could have such an outage. In reality these services should never have had an outage that lasted that long: Amazon Web Services (AWS) promotes steps for doing High Availability deployments, and High Availability and fault-tolerant deployments are nothing new.

More importantly, why were people on the West Coast of the United States affected by an outage in Northern Virginia when it's known to some that Netflix maintains large deployments in Northern California and Oregon? I, for instance, did not have service, but if Netflix had used Anycast to route traffic to the nearest data center, much like Cloudflare does with their service, then I would have had little or no disruption to my movie watching experience.

I think these companies need to invest some time into considering what went wrong and how to make sure it does not occur again, or at least that they have something in place to make such an outage so minimal that their stock prices are not affected come Monday morning. A good place to start would be looking at technologies like Quagga (BGP Anycast routing), ifenslave (network interface bonding), HAProxy (High Availability load balancing), and perhaps Heartbeat and Pacemaker. (All are available on Ubuntu 12.04 Server LTS, which is on AWS.)
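To make the idea concrete, here is a minimal sketch of the decision a health-checking load balancer makes when a zone goes dark: prefer a healthy server in the usual zone, otherwise fail over to a healthy server elsewhere. This is only an illustration, not how HAProxy or any of the services above are actually implemented; the names Backend, pick_backend, and the zone labels are all hypothetical.

```python
from dataclasses import dataclass
from typing import Callable, Iterable, Optional


@dataclass(frozen=True)
class Backend:
    name: str  # hypothetical server label
    zone: str  # hypothetical availability zone label


def pick_backend(
    backends: Iterable[Backend],
    is_healthy: Callable[[Backend], bool],
    preferred_zone: str,
) -> Optional[Backend]:
    """Return a healthy backend, preferring preferred_zone, else any other zone."""
    healthy = [b for b in backends if is_healthy(b)]
    # Try the preferred zone first.
    for b in healthy:
        if b.zone == preferred_zone:
            return b
    # Otherwise fail over to the first healthy backend in any zone.
    return healthy[0] if healthy else None


if __name__ == "__main__":
    pool = [Backend("web-1", "zone-a"), Backend("web-2", "zone-b")]
    down_zones = {"zone-a"}  # simulate one zone losing power
    chosen = pick_backend(pool, lambda b: b.zone not in down_zones, "zone-a")
    print(chosen)
```

In a real deployment the health check would be a network probe rather than a set lookup, and the pool would need servers in more than one zone for the fallback to have anywhere to go.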

In closing, I applaud Twilio and the other services out there that had competent people working to make sure their infrastructure and services were fault-tolerant and avoided downtime.

  • Stephan Adig

    Ben,

    the real issue is that enterprises like Netflix or Instagram still rely on cloud services.
    I agree that when you are starting up a company with a very good idea, like Instagram, cloud services can be OK. But once your enterprise is working and you have income, or a company like Facebook behind you, you should move your business to a real datacenter with infrastructure backed by a knowledgeable group of operations people.

    Netflix is even worse.

    So, for me, it’s again a sign that you should not use cloud services when you have the money to invest in bare metal and datacenter space and to provide your own environment.

    Yes, ops people are not that easy to get, but it’s actually much better to invest in human resources with knowledge than in a virtual infrastructure where the company behind it has already shown the world that they have issues with electricity etc.

    So, my advice is: move Instagram and similar services into a real datacenter. Having your own people deal with cooling, electricity and especially the DR scenario is better.
    When your own company screws up, you can have a serious chat with your people; when another company like Amazon screws up, there is no way to deal with them.

    Don’t get me wrong, cloud services can help to establish your business and to handle spikes in demand, but relying on them is not a good idea.

    • http://benjaminkerensa.com/ Benjamin Kerensa

      The issue is that other services and sites that also 100% rely on the Cloud were able to have zero downtime. This is more about lack of best practices for high availability.