A world-leading logistics company urgently needed more insight into what was really happening inside a legacy application and its downstream services.
We have all been through it: too many places to look, and too many teams to involve, when gathering evidence during mission-critical incidents. It can be painstaking, disrupts the business and hurts scorecards - no one looks good at the end of hour-long calls!
Value from Visibility Platforms
“There are too many tools and opinions, no one has access to everything, and it takes far too long to find anything.” - Application Owner
Once we had assessed the existing tooling and understood the challenges, the problem became immediately clear, and a deeper look at the roles and use cases surrounding this application was required. Many teams were involved, they all used different tools, and the whole end-to-end process of monitoring this application was disjointed. Some of the findings:
- Legacy infrastructure tools were in place and doing a great job, but could not see beyond the servers
- Tech teams had attempted to extract reports or write scripts over time, but these were not maintained
- APM solutions could not instrument the technology (C and C++) to extract traces
- Everything was in different places and no one team had access to everything: infrastructure, logs, database reporting, synthetics
- The customer found the issues in the end, but lacked that first step to triage effectively, taking far too much time to gather evidence
- There was no end-to-end view delivering a technical or business dependency map
Our first challenge was to map the available telemetry to form relationships, and to assess whether the data we needed to bring together was portable enough to move into a new central store - without disrupting any existing flows or adding tooling overhead.
Using Elastic, we began ingesting the first sets of log data from the front-line servers. Right away, this gave us a sound view of load distribution and throughput per server and per instance. The assessment data we had collected earlier pointed us to the message queues that related to critical business services. Within just a few days, we had progressed from having an idea of system performance and load distribution to seeing each business process and how it was performing in the same way - per server, per instance, per service.
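The per-server, per-instance view described above can be sketched in a few lines. This is a minimal illustration, not the actual pipeline: the field names (`server`, `instance`, `queue`) and sample records are hypothetical stand-ins for the parsed log events shipped into Elastic.

```python
from collections import Counter

def throughput_by_server(events):
    """Count log events per (server, instance) pair - a rough
    throughput and load-distribution view of the fleet."""
    return Counter((e["server"], e["instance"]) for e in events)

# Hypothetical parsed log records, for illustration only
events = [
    {"server": "app01", "instance": "mq-writer", "queue": "ORDERS.IN"},
    {"server": "app01", "instance": "mq-writer", "queue": "ORDERS.IN"},
    {"server": "app02", "instance": "mq-reader", "queue": "ORDERS.OUT"},
]

counts = throughput_by_server(events)
print(counts[("app01", "mq-writer")])  # 2
```

In practice this aggregation happens inside Elasticsearch (a terms aggregation on server and instance fields) rather than in client code; the sketch only shows the shape of the question being asked of the data.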
The next phase was about understanding what else was happening externally on remote dependent services. Our first port of call was the central event management system, where we enabled ingestion to understand the outside event weather conditions. Some fine-tuning was needed to filter out the many unrelated events, but we managed to correlate anything meaningful and reach around 75% suppression. Machine learning tuned itself and learned what "normal" actually meant, so whatever we deployed would continue to teach itself as we moved forward. The goal here was to correlate outside events with inside performance and deliver an indication of root cause when deviation is detected - saving hours of troubleshooting later on (#1 use case).
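To make the suppression idea concrete, here is a minimal sketch of time-window deduplication: duplicate events from the same source within a window are collapsed, and the suppression rate is measured. The event shape and field names are assumptions for illustration - the real work was done by the event management ingestion and Elastic's machine learning, not hand-rolled code like this.

```python
from datetime import datetime, timedelta

def suppress_and_correlate(events, window=timedelta(minutes=5)):
    """Collapse duplicate events (same source + message) that arrive
    within a time window, keeping the first occurrence.
    Returns the surviving events and the suppression rate achieved."""
    last_seen = {}
    kept = []
    for e in sorted(events, key=lambda e: e["ts"]):
        key = (e["source"], e["message"])
        prev = last_seen.get(key)
        if prev is None or e["ts"] - prev > window:
            kept.append(e)  # first sighting, or repeat outside the window
        last_seen[key] = e["ts"]
    rate = 1 - len(kept) / len(events) if events else 0.0
    return kept, rate

# Hypothetical event stream: two duplicates land inside the window
t0 = datetime(2021, 1, 1, 9, 0)
stream = [
    {"ts": t0, "source": "mq", "message": "queue depth high"},
    {"ts": t0 + timedelta(minutes=1), "source": "mq", "message": "queue depth high"},
    {"ts": t0 + timedelta(minutes=2), "source": "db", "message": "session spike"},
    {"ts": t0 + timedelta(minutes=3), "source": "mq", "message": "queue depth high"},
]

kept, rate = suppress_and_correlate(stream)
print(len(kept), rate)  # 2 0.5
```

The ML-backed version goes further: instead of a fixed window, the baseline of "normal" event volume is learned per source, so the suppression threshold adapts over time.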
Within four weeks, we had collected all our telemetry in one unified datastore and started to align our deployment of Elastic to the inner workings of the application. To fully understand the logic of each log line, we sat alongside the analysts who had nursed the application for many years - an important task; getting to know the habits and the triage that takes place allowed us to cut time and get to root cause quicker.
We delivered a vendor- and team-agnostic single pane of glass that could be consumed by everyone who needed access. Previously, this critical application had relied on developers and architects with deep knowledge to keep the wheels moving every time something went wrong.
We delivered transparency and actionable alerting that gave the operations team the all-important early warning signals they needed to act preventatively. There was no hiding place and no one else to ask for help; each insight we developed was tailored to the lens of each use case.
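The essence of an early warning signal is deviation from a learned baseline. The sketch below is a deliberately crude stand-in for the ML-driven "normal" the platform learns - a simple n-sigma check against recent samples, with all numbers invented for illustration:

```python
from statistics import mean, stdev

def early_warning(samples, latest, n_sigma=3.0):
    """Flag a metric reading that deviates more than n_sigma standard
    deviations from its recent baseline.
    A crude stand-in for a learned model of 'normal'."""
    mu = mean(samples)
    sigma = stdev(samples)
    return abs(latest - mu) > n_sigma * sigma

# Hypothetical recent throughput samples for one business service
baseline = [100, 102, 98, 101, 99]

print(early_warning(baseline, 150))  # True  - well outside normal
print(early_warning(baseline, 103))  # False - within normal variation
```

The real alerting was built on Elastic's anomaly detection rather than a static threshold, which is what let each signal be tailored to the lens of its use case instead of one global rule.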
The project was delivered within 90 days. During this time we learnt more about the role of the application, and how critical outages could have been avoided - teams had simply been looking at the wrong things, causing delay.
Environment: Linux, Oracle and SQL Databases, IBM MQ Queues, ServiceNow Event Management, Zabbix Infrastructure Monitoring, Azure Monitor for Cloud.
Solution: Elasticsearch, Filebeat, Metricbeat, Kibana, Logstash, Grafana Front Ends.