On the afternoon of 15 September 2022 (UTC), the zoom.us online collaboration platform suffered an outage that affected users globally.
One of our customers identified the problem domain immediately. Their service desk was ready to respond proactively to users, and within a minute or two an informative service status update was posted to the corporate internet page to spread awareness.
Just think how much time this would have saved on calls, emails or answering incidents. It didn't fix zoom.us, but it did rule out the company's own infrastructure and network. It gave users an opportunity to use an alternative collaboration platform and inform those attending meetings that they were on top of things. The customer actually knew before zoom.us's own status page was updated.
How did they know? Well, luckily for them, we had recently deployed Kentik's comprehensive network observability solution, which includes a very cool feature called "State of the Internet". This feature is built into every subscription and monitors your critical SaaS services like zoom.us, cloud services like AWS and Azure, and DNS services like Google or OpenDNS.
State of the Internet monitors all of the above services at one-minute intervals using HTTP, TCP, BGP and DNS synthetics, combined with NetFlow data. A mesh runs behind every subscription on the customer's behalf, performing continuous health monitoring with machine learning and anomaly detection built in. None of your subscription units are spent monitoring those services, so your units stay focused on your network, your sites and your applications.
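For the techies, here is roughly what a single HTTP synthetic test looks like. This is a minimal sketch and not Kentik's implementation: the names `CheckResult`, `http_check` and `run_forever` are made up for illustration, and it only covers the HTTP part, using the Python standard library.

```python
import time
import urllib.error
import urllib.request
from dataclasses import dataclass
from typing import Callable, Optional


@dataclass
class CheckResult:
    url: str
    status: Optional[int]  # HTTP status code, or None if no response arrived
    latency_ms: float
    error: Optional[str] = None


def http_check(url: str, timeout: float = 5.0,
               opener: Callable = urllib.request.urlopen) -> CheckResult:
    """Run one HTTP test and record the status code and latency."""
    start = time.monotonic()
    status, error = None, None
    try:
        with opener(url, timeout=timeout) as resp:
            status = resp.status
    except urllib.error.HTTPError as exc:
        # 4xx/5xx responses are raised as exceptions; the code is still a result.
        status = exc.code
    except OSError as exc:
        # Covers URLError, timeouts and socket errors: no HTTP response at all.
        error = str(exc)
    latency_ms = (time.monotonic() - start) * 1000
    return CheckResult(url, status, latency_ms, error)


def run_forever(url: str, interval_s: float = 60.0) -> None:
    """Repeat the check at a fixed interval (one minute by default)."""
    while True:
        print(http_check(url))
        time.sleep(interval_s)
```

Calling `run_forever("https://zoom.us")` would print one result per minute. A real deployment runs tests like this from many locations at once and adds the TCP, BGP and DNS checks alongside.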
There was a small tremor before the bigger outage that followed. Some connections were impacted and some were fine (perhaps half of a data centre was healthy and half was not). Kentik picked this up as an early warning indicator, and alerts were sent to the customer's monitoring team. A few minutes later 100% availability resumed, but the tremor had primed the monitoring team and Zoom was placed on the watch list.
Earlier tremor at 14:30 UTC:
Later, at 15:00 UTC, every location reported HTTP 502 errors for around 25 minutes. A 502 (Bad Gateway) is returned by a server acting as a gateway or proxy when it gets an invalid response from the server behind it, so the requests were reaching Zoom and failing within the zoom.us infrastructure. The path view and DNS checks were fine, BGP was unchanged, and connectivity tests from the customer's own network reported the same HTTP 502 status codes. Problem domain identified.
Once zoom.us fixed their issues, the Kentik alert closure was sent. The IT teams tested a quick meeting to confirm, checked the status page and provided an update to their business. Meetings scheduled for 16:00 UTC would go ahead as expected. There was no need to check from every office, because Kentik already had 25+ locations confirming reachability.
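The reasoning used to pin down the problem domain can be sketched in a few lines of Python. This is an illustration only and not Kentik's detection logic: `LocationResult`, `classify` and the labels it returns are hypothetical names, and the rules are a simplification of the checks described above.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class LocationResult:
    location: str
    dns_ok: bool                # DNS resolution succeeded
    path_ok: bool               # network path to the service looked healthy
    http_status: Optional[int]  # None if no HTTP response was received


def classify(results: List[LocationResult], bgp_changed: bool) -> str:
    """Guess the problem domain from one round of test results."""
    if not results:
        return "unknown"
    if bgp_changed:
        return "routing"
    if not all(r.dns_ok for r in results):
        return "dns"
    if not all(r.path_ok for r in results):
        return "network-path"
    # DNS, path and BGP are clean, so any 5xx is coming from the provider.
    server_errors = [r for r in results
                     if r.http_status is not None and 500 <= r.http_status <= 599]
    if len(server_errors) == len(results):
        return "provider"          # every location fails: full outage
    if server_errors:
        return "provider-partial"  # some locations fail: the "tremor" case
    if all(r.http_status is not None and r.http_status < 400 for r in results):
        return "healthy"
    return "unknown"
```

Fed with results like those from 15:00 UTC (DNS and path fine, BGP unchanged, a 502 from every location), `classify` returns `"provider"`. A mix of 200s and 502s, as in the earlier tremor, returns `"provider-partial"`.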
Check out the Kentik pricing and features page for a full list of features. We think no other vendor comes close to the level of goodies that come out of the box with Kentik.
The Kentik UI menu, which explains the above for you techies:
If you would like to know more about how we can help your organisation identify the problem domain, just like we did for this customer, fire us over an email. We can get you up and running in a few hours and provide a single source of truth for your network.