AWS Server Firefighting Pro Tip 3 - AWS Metrics And Dashboards

When the servers are down, the customers are seeing 500s, and your team is losing its mind, what is the first thing you do to diagnose the problem in your massive AWS infrastructure?
Spreadsheets Whisper, Graphs Scream:
An author I recently started following, Dan Martel, introduced me to the saying “Spreadsheets Whisper, Graphs Scream,” and I couldn’t agree more. While the logs tell a story, combing through them can be time consuming, and it is difficult to discern a pattern.
Each graph tells a story, a series of events. AWS CloudWatch metrics are no different.
When a client comes to me with a problem, the CloudWatch Dashboards are often the first place I look.
At the beginning of most Office Hour Advisory sessions, I start by proactively reviewing the bills and the CloudWatch Dashboards to make sure everything is in order. Oftentimes I catch discrepancies before my clients even notice them.
Establish A Baseline:
I start by establishing a baseline. For example, if I see that one of my client’s services generally gets 100 million requests per hour during the busy part of the day and closer to 30 million at night, that gives me a good baseline for what normal traffic looks like.
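If you would rather compute a baseline than eyeball it, here is a minimal sketch of one way to do it. It assumes you have already exported hourly request totals (for example, from a CloudWatch metric) as (timestamp, count) pairs. The function name hourly_baseline and the sample numbers are hypothetical, made up for illustration, and the median per hour of day is just one reasonable choice of "normal".

```python
from collections import defaultdict
from datetime import datetime, timedelta
from statistics import median


def hourly_baseline(datapoints):
    """Return {hour_of_day: median request count} from (timestamp, count) pairs."""
    by_hour = defaultdict(list)
    for ts, count in datapoints:
        by_hour[ts.hour].append(count)
    # The median is less sensitive to a single bad day than the mean.
    return {hour: median(counts) for hour, counts in sorted(by_hour.items())}


# Hypothetical sample data: three days of hourly totals,
# 100 million per hour from 09:00 to 17:59 and 30 million otherwise.
start = datetime(2024, 1, 1)
sample = [
    (start + timedelta(hours=h), 100_000_000 if 9 <= (h % 24) < 18 else 30_000_000)
    for h in range(72)
]

baseline = hourly_baseline(sample)
print("noon baseline:", baseline[12])
print("3 a.m. baseline:", baseline[3])
```

Grouping by hour of day keeps the busy daytime traffic from being compared against the quiet overnight traffic.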
Look For Outliers:
From there I look for outliers. Continuing the example above, if I were to see a spike of 500 million requests in an hour, I would flag that as a potential issue.
I have used this to pinpoint the exact minute someone in my client’s org flipped a switch on something that cost them more money, slowed down their website, or just straight up broke something.
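The same outlier check can be roughed out in code. This is only a sketch under stated assumptions: the baseline dictionary, the traffic numbers, the find_outliers name, and the 3x threshold are all hypothetical values chosen to mirror the example above, not a rule I apply to every client.

```python
from datetime import datetime, timedelta


def find_outliers(datapoints, baseline, factor=3.0):
    """Return (timestamp, count) pairs whose count exceeds factor x the baseline for that hour."""
    flagged = []
    for ts, count in datapoints:
        expected = baseline.get(ts.hour)
        # Skip hours with no baseline rather than guessing.
        if expected and count > factor * expected:
            flagged.append((ts, count))
    return flagged


# Hypothetical baseline: 100 million requests per hour by day, 30 million at night.
baseline = {h: (100_000_000 if 9 <= h < 18 else 30_000_000) for h in range(24)}

# Hypothetical day of traffic that matches the baseline, plus one simulated spike.
start = datetime(2024, 1, 2)
traffic = [(start + timedelta(hours=h), baseline[h]) for h in range(24)]
traffic[14] = (traffic[14][0], 500_000_000)  # simulated spike at 14:00

spikes = find_outliers(traffic, baseline)
for ts, count in spikes:
    print(ts.isoformat(), count)
```

Run against finer-grained data, such as one-minute totals, the same comparison narrows the spike down to the minute it started.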
Question For You: What CloudWatch Metrics do you use to sniff out server fires?