How Netflix’s migration to Amazon Web Services is protected by Chaos Monkey

“We are happy to report that in early January of 2016, after seven years of diligent effort, we have finally completed our Cloud migration and shut down the last remaining data center bits used by our streaming service”

Netflix blog post announcing the completion of their migration to AWS

In anticipation of the increased traffic that will follow their global expansion, as laid out in Netflix expansion to 130 Countries as Global Streaming in 2016, Netflix has beefed up their Streaming infrastructure.

They did this via a massive migration to AWS (Amazon Web Services) which, per their announcement in February 2016, was completed in early January. AWS is owned by Amazon, whose CEO is Jeff Bezos and which runs a rival streaming service called Amazon Prime Video. So in essence, Netflix is sleeping in the enemy’s back yard.

This move might seem counter-intuitive, given that Amazon Prime Video is a major competitor of Netflix. However, this is just business, and it actually stems from a decision Netflix made seven (7) years ago to choose AWS as their Cloud Hosting Service, as it’s reliable, robust and scalable.

So although Netflix is totally dependent on AWS, the contractual agreement benefits Amazon too: as Netflix expands, Amazon makes money both from the global adoption and from the monthly rental Netflix pays for Cloud Storage.

But how exactly will Netflix ensure that they can survive a total failure of AWS’s cloud? After all, aren’t they essentially putting all their eggs in one basket?

Netflix’s Chaos Monkey – Simulating disaster to prevent the unthinkable

First things first, not all of Netflix’s eggs are in one basket, as Netflix considered this problem over the seven (7) years it took to make the move to Amazon. They learned quite a bit from a streaming failure on Christmas Eve in 2012. It is good to note that back then, Netflix was operating in just one AWS region.

Their current streaming service runs on AWS, which divides the world into twelve (12) regions. Each region contains sub-domains called Availability Zones, and each Availability Zone is made up of one or more Data Centers.

So if an entire region goes down, they can redirect traffic to Data Centers in the other available regions. To quote Netflix’s VP of Cloud and platform engineering, Yury Izrailevsky: “We can instantaneously redirect the traffic to the other available ones. It’s not that uncommon for us to fail over across regions for various reasons”.

The Data Centers and Availability Zones that Netflix relies on are located in the following regions:

  • Northern Virginia
  • Oregon
  • Dublin, Ireland
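
The cross-region failover described above can be pictured with a small Python sketch. It is only an illustration under stated assumptions: the function name `redistribute`, the proportional-split rule and the traffic shares are all hypothetical, and nothing here reflects how Netflix actually routes traffic.

```python
def redistribute(traffic, failed_region):
    """Return new traffic shares after moving a failed region's share to the rest.

    The failed region's share is split across the surviving regions in
    proportion to the traffic each one already carries (an assumed policy).
    """
    survivors = {r: s for r, s in traffic.items() if r != failed_region}
    if not survivors:
        raise RuntimeError("no surviving region can absorb the traffic")
    lost = traffic.get(failed_region, 0.0)
    carried = sum(survivors.values())
    if carried == 0:
        # Nobody was carrying traffic yet, so split the lost share evenly.
        return {r: lost / len(survivors) for r in survivors}
    return {r: s + lost * s / carried for r, s in survivors.items()}


# Illustrative shares only; the real split between regions is not given in this article.
traffic = {"Northern Virginia": 0.5, "Oregon": 0.3, "Dublin": 0.2}
after_failover = redistribute(traffic, "Northern Virginia")
```

In this sketch the two surviving regions absorb the failed region's share between them, so the total traffic served stays the same.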

Using a simulation tool called Chaos Monkey, they randomly disable virtual machines in their production environment to check that the service survives the loss. They also have scaled-up versions of Chaos Monkey that simulate larger failures:

  • Chaos Gorilla – disables an entire Amazon availability zone
  • Chaos Kong – simulates an outage affecting an entire Amazon region and shifts workloads to other regions
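
To make the three levels of failure concrete, here is a toy Python model of them. It is a sketch for illustration only and is not the real Chaos Monkey code: the `Instance` class, the fleet layout (two zones per region, three instances per zone) and all function names are assumptions, and the Chaos Kong step only disables a region without modelling the workload shift.

```python
import random
from dataclasses import dataclass


@dataclass
class Instance:
    region: str
    zone: str
    number: int
    alive: bool = True


def build_fleet(regions, zones=("a", "b"), per_zone=3):
    """Create per_zone instances in every zone of every region."""
    return [Instance(r, z, n) for r in regions for z in zones for n in range(per_zone)]


def chaos_monkey(fleet, rng):
    """Disable one randomly chosen live instance and return it."""
    victim = rng.choice([i for i in fleet if i.alive])
    victim.alive = False
    return victim


def chaos_gorilla(fleet, region, zone):
    """Disable every instance in one availability zone."""
    for i in fleet:
        if i.region == region and i.zone == zone:
            i.alive = False


def chaos_kong(fleet, region):
    """Disable every instance in one region."""
    for i in fleet:
        if i.region == region:
            i.alive = False


def serving_regions(fleet):
    """Regions that still have at least one live instance."""
    return sorted({i.region for i in fleet if i.alive})


fleet = build_fleet(["Northern Virginia", "Oregon", "Dublin"])
chaos_monkey(fleet, random.Random(0))   # lose one instance
chaos_gorilla(fleet, "Oregon", "a")     # lose one zone
chaos_kong(fleet, "Dublin")             # lose one region
remaining = serving_regions(fleet)
```

Even after all three drills, `remaining` still lists two regions with live instances, which is the property such exercises are meant to confirm.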

But what if a total failure occurs?

Armageddon Monkey – Google Cloud backup in case of a catastrophic failure

They also use backups, with their data being replicated to Amazon’s S3 (Simple Storage Service). To quote Yury Izrailevsky: “Customer data or production data of any sort, we put it in distributed databases such as Cassandra, where each data element is replicated multiple times in production, and then we generate primary backups of all the data into S3 [Amazon’s Simple Storage Service]. All the logical errors, operator errors, or software bugs, many kinds of corruptions - we would be able to deal with them just from those S3 backups”.

They also keep backups on Google Cloud Storage in case of an Armageddon Monkey scenario, that is, a catastrophic failure across all twelve (12) AWS regions. This may be due to:

  • Natural disaster
  • Self-inflicted failure that somehow takes all of Netflix’s systems down
  • Catastrophic Security Breach

It would take them hours or even a few days to recover from a total failure of AWS, but Netflix claims it can be done. To quote a Netflix spokesperson: “So that’s not the scenario we’re planning for. Rather it’s a catastrophic bug or data corruption that would cause us to wipe the slate clean and start fresh from the latest good back-up. We hope we will never need to rely on Armageddon Monkey in real life, but going through the drill helps us ensure we back up all of our production data, manage dependencies properly, and have a clean, modular architecture; all this puts us in a better position to deal with smaller outages as well”.
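
The idea of starting fresh "from the latest good back-up" can be sketched in a few lines of Python. This is a hypothetical illustration: the `Backup` record, the `verified` flag and the rule of preferring S3 over Google Cloud Storage are assumptions made for the example, not a description of Netflix's actual restore procedure.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Backup:
    store: str       # where the backup lives, e.g. "S3"
    taken_at: int    # when it was taken, e.g. a Unix timestamp
    verified: bool   # True if it passed an integrity check (assumed to exist)


def latest_good_backup(backups, prefer=("S3", "Google Cloud Storage")) -> Optional[Backup]:
    """Return the newest verified backup from the first store that has one."""
    for store in prefer:
        good = [b for b in backups if b.store == store and b.verified]
        if good:
            return max(good, key=lambda b: b.taken_at)
    return None  # nothing usable in any store


backups = [
    Backup("S3", 100, True),
    Backup("S3", 200, False),                  # newer, but corrupted
    Backup("Google Cloud Storage", 150, True),
]
restore_point = latest_good_backup(backups)
fallback_point = latest_good_backup([b for b in backups if b.store != "S3"])
```

Here the corrupted S3 backup is skipped in favour of the older good one, and the Google Cloud Storage copy is only used when no S3 backup is available at all.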

Still, this scenario is highly unlikely, as the regions are isolated from each other and share no servers or virtual machines, which makes it hard for a single failure to take Netflix offline everywhere. Sharing their Open Source Chaos Monkey tools with other Cloud operators also helps them towards their Global Streaming agenda!

Hopefully this scenario never occurs. Netflix has not revealed where it would operate from in the event of such a catastrophic failure of their systems.

Lindsworth is a Radio Frequency and Generator Maintenance Technician who has a knack for writing about his work, which is in the Telecoms Engineering Field. An inspired writer on themes as diverse as Autonomous Ants simulations, Power from Lightning and the current Tablet Wars.