I once had a football coach tell me to “hit the biggest guy” on the field. When I pressed him on this somewhat scary assignment, he reasoned, “It’s typically the bigger guys that are the most fragile.”
It took me a while to understand this concept, but it serves as a nice metaphor for the business I am in now. Look at most Web infrastructures: the bigger they get, the more complexity is introduced, and that complexity makes the whole thing a bit less stable.
Ironic, isn’t it?
Adding more redundancy and real-time availability can make the environment somewhat less stable.
Today’s advanced websites take advantage of innovative solutions for serving more data to more people in more regions: technologies like cloud storage, elastic computing servers, memory caching techniques and content delivery networks. And don’t even get me started on creating multithreaded nodes with a Cassandra ring.
All that said, a fault at any one of these spots, however minute, can create a ripple effect that eventually brings down a server. Case in point: Amazon.com went down yesterday for approximately 30 minutes. Amazon not only has one of the most advanced infrastructures available, it also makes reliability a focus of the core business it offers to customers. Estimates from Forbes.com suggest Amazon lost about $66,000 PER MINUTE!
So if Amazon can go down, pretty much anyone is at risk.
Luckily, there are advanced monitoring and performance solutions available to help customers replicate and understand where things might be going wrong. Being able to simulate varying load from a number of different locations is key to pinpointing issues. Mixing this with tools that identify when users are experiencing slower response times gives IT and network admins the forensic information they need to fix a problem. I tell our customers that 90% of the “effort” is finding the problem. Fixing it is the easy part.
Moral of the story: just like hitting the “big guy” on the football field, make sure to performance test your websites and nodes under a variety of situations.
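To make the idea of testing under varying load a little more concrete, here is a minimal sketch in Python, using only the standard library. It ramps up the number of concurrent requests and reports how response times shift at each level. The function names (`fetch`, `run_load_step`, `ramp`), the load levels and the example URL are all assumptions made for illustration; this is not any particular product's method, and it runs from a single machine rather than from multiple locations.

```python
import math
import statistics
import time
import urllib.request
from concurrent.futures import ThreadPoolExecutor


def fetch(url, timeout=10.0):
    """Issue one GET request and read the full response body."""
    with urllib.request.urlopen(url, timeout=timeout) as response:
        response.read()


def timed_call(request_fn):
    """Return (elapsed_seconds, succeeded) for a single call to request_fn."""
    start = time.perf_counter()
    try:
        request_fn()
        ok = True
    except Exception:
        # Any failure (timeout, HTTP error, refused connection) counts as an error.
        ok = False
    return time.perf_counter() - start, ok


def run_load_step(request_fn, concurrency, total_requests):
    """Make total_requests calls spread over `concurrency` worker threads."""
    with ThreadPoolExecutor(max_workers=concurrency) as pool:
        results = list(
            pool.map(lambda _: timed_call(request_fn), range(total_requests))
        )
    latencies = sorted(elapsed for elapsed, _ in results)
    # Nearest-rank 95th percentile of the observed latencies.
    p95_index = max(0, math.ceil(0.95 * len(latencies)) - 1)
    return {
        "concurrency": concurrency,
        "requests": total_requests,
        "errors": sum(1 for _, ok in results if not ok),
        "median_s": statistics.median(latencies),
        "p95_s": latencies[p95_index],
    }


def ramp(request_fn, levels=(1, 5, 10, 25), requests_per_level=50):
    """Run one load step per concurrency level, lowest level first."""
    return [run_load_step(request_fn, c, requests_per_level) for c in levels]


# Example usage (hypothetical URL; only point this at a site you own):
# for row in ramp(lambda: fetch("https://staging.example.com/")):
#     print(row)
```

If the median and 95th-percentile times climb sharply, or errors start to appear, as the concurrency level rises, that level is a reasonable place to start looking for the fragile spot.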