On Clouds and SPOF’s (or the Great AWS Outage of April 2011)

Just a couple of days after I posted about cloud native applications, Amazon raised the bar by having some issues in one of their data center regions. From what I’ve read, these issues primarily affected EBS and RDS. So, one way or another, pretty much everything was affected, since using AWS EC2 without EBS in any form is a little wacky for most applications that exist today. This is because, in the absence of EBS, whatever sits on your EC2 instance’s local storage won’t survive the instance being stopped, terminated, or lost to a hardware failure. Most folks have not yet reached the operational nirvana of fully automated configuration management and application fault tolerance that would make this acceptable for them.

What level of SPOF (Single Point of Failure) are you willing to tolerate? To answer that, I wanted to “scale up” the idea of the SPOF and then bring it back down again. Here we go.

If the earth stops working, so will your web application. (Admittedly, there might be some satellite networks that don’t have this problem... but who cares at that point?)

So, let’s keep going. Each of these is a potential single point of failure.

Earth > Continent > Country > State/Region > City > Neighborhood > Building > Floor > Room > Rack > Server > Server Component

And, at each tier, there are numerous dependencies and contexts that keep your service running at any given time. There are the obvious ones, like the example above: if the earth explodes, the neighborhood is pretty much shot to hell also. But that’s obvious. It gets less obvious when you dig deeper into the data center and see that there are 5 servers, so that’s okay, right? Maybe. Maybe not. Suppose it is something like this:

Dynamic Name Service > Load Balancer > Web Server > Application Server > Database Server

Then those 5 servers/services might all sit in one rack, in one room, in one data center building, in one neighborhood, in one city, in one state, in one country, on one continent, on one planet, and that is looking pretty vulnerable. In the grand scheme of things, the loss of one power supply in one machine could impact the entire planet’s capacity to retrieve whatever is on that DB that is so globally important, like a picture of your kid making a funny face on his 2nd birthday.
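To put rough numbers on that chain, here is a minimal sketch in Python. It assumes each of the five tiers fails independently and is up 99.9% of the time; both the independence and the 99.9% figure are made up for illustration and are not measurements of AWS or anything else. The function names are hypothetical too.

```python
from math import prod


def series_availability(availabilities):
    """Availability of a chain in which every component must be up."""
    return prod(availabilities)


def redundant_availability(availability, copies):
    """Availability of `copies` independent replicas when one is enough."""
    return 1 - (1 - availability) ** copies


# Assumed, illustrative availability for each tier in the chain above.
chain = {
    "dns": 0.999,
    "load_balancer": 0.999,
    "web_server": 0.999,
    "app_server": 0.999,
    "db_server": 0.999,
}

# One of everything: the whole chain is only as good as the product.
single = series_availability(chain.values())

# Two independent copies of every tier.
doubled = series_availability(
    redundant_availability(a, 2) for a in chain.values()
)

print(f"one of each: {single:.6f}")
print(f"two of each: {doubled:.6f}")
```

With one of everything, the chain as a whole is less available than any single piece in it. Doubling each tier only helps if the copies really do fail independently, which they will not if both copies share the same rack or power supply.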

Do you think it is Amazon AWS’s fault if you put that database on one server in one rack in one place with no reasonable SLA and it goes away forever? Not so much. You are accountable and responsible. You made that choice.

Now, how can we change this for the better? We can develop applications that are able to tolerate the loss of a single point of failure at a sufficient granularity (Earth is a bit extreme today), such that our applications keep running when bad things like the AWS outage occur. I call these Cloud Native Applications. They have certain traits that should look a little familiar to cloud folks.

You cannot create a cloud native application doing things the same way you always have before. It simply will not work. The necessary software architecture and systems architecture have changed if you want your application to run on the cloud with no SPOFs.
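One way to make "no SPOFs at a given granularity" concrete is to write down where each replica of a service lives and ask how far down the hierarchy they all share the same failure domain. The sketch below does that in Python. The tier list is a shortened version of the one above, and the replica locations and function name are hypothetical.

```python
# Shortened, illustrative version of the Earth > ... > Server hierarchy.
LEVELS = ("continent", "country", "region", "building", "rack", "server")


def shared_failure_domains(replicas):
    """Return the tiers at which every replica sits in the same place.

    Each replica is a tuple of location names, one per entry in LEVELS.
    Every tier returned is a single point of failure for the service.
    """
    shared = []
    for depth, level in enumerate(LEVELS, start=1):
        # Compare full paths down to this tier, so that "rack-7" in one
        # building is not confused with "rack-7" in another.
        if len({replica[:depth] for replica in replicas}) > 1:
            break
        shared.append(level)
    return shared


# Two database replicas in different racks of the same (made-up) building.
db_replicas = [
    ("north-america", "us", "east", "building-1", "rack-7", "db-a"),
    ("north-america", "us", "east", "building-1", "rack-9", "db-b"),
]

spofs = shared_failure_domains(db_replicas)
print("shared failure domains:", spofs)
```

Here the two replicas survive the loss of a rack, but the building, region, country, and continent are each still a single point of failure. Whether that is acceptable is exactly the choice you are accountable for.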

Just needed to get that off my chest. Some related links for good reading:

http://blog.basho.com/2011/04/21/Amazons-outage-proves-riaks-vision/

http://www.thestoragearchitect.com/2011/04/22/so-your-aws-based-application-is-down-dont-blame-amazon/

http://highscalability.com/blog/2011/4/22/stuff-the-internet-says-on-scalability-for-april-22-2011.html

http://www.infoq.com/news/2011/04/amazon-ec2-outage

And if you’re REALLY keen to write some CNAs (contact me), read...

http://www.infoq.com/presentations/Actor-Thinking
http://www.infoq.com/presentations/1000-Year-old-Design-Patterns

Update on 2011-04-25 02:39 by Kent Langley

I follow George Reese on Twitter and just ran across a tweet about this post. I thought it was worth noting:

The AWS Outage: The Cloud's Shining Moment
http://broadcast.oreilly.com/2011/04/the-aws-outage-the-clouds-shining-moment.html

While I do not necessarily agree with everything posted there I do like George's way of thinking. I would say that he said it all in the last sentence.

"These kinds of failures don't expose the weaknesses of the cloud—they expose why the cloud is so important."

Update on 2011-04-26 21:03 by Kent Langley

And another good write-up, IMHO.

http://stu.mp/2011/04/the-cloud-is-not-a-silver-bullet.html

Key point?

"A lot more effort needs to go into documenting proper cloud architecture. The cloud has changed the game. There are new tools and, as a result, new ways of building systems in the cloud. Case studies, diagrams, approved tools, etc. should all be highlighted, documented, and preached about accordingly."

He says that "The Cloud is not a silver bullet..." I certainly agree. But it is a gold mine of opportunity if you choose to avail yourself of its strengths and deploy cloud native applications, as Netflix, Amazon themselves, and SmugMug appear to have done with fine results.
