Reducing Log Ingest by 90%

Greg O'Reilly
Feb 16, 2022
6 min read

Updated: Dec 6, 2022

We are excited to announce that we have successfully completed the project with the customer who contacted us via LinkedIn in late 2021. We were able to show them how to reduce log ingestion costs and regain control of their data. This customer had a complex set of logging platforms, which they had built over time for various reasons.

Our team was able to work with the customer's technical teams to identify opportunities for cost savings, as well as implement changes that would give the customer more control over their data. We believe this successful project is a testament to our commitment to providing quality solutions and helping customers reach their goals.

OLD Tooling Landscape

The review process went smoothly and nothing unexpected was found. We had three different platforms and data in other people's clouds as well as on-premise systems. Each tool represented a different technical domain, making it straightforward to understand.

Datadog provides observability in production, allowing businesses to ingest agent and third-party logs, metrics and trace telemetry. However, with increased on-boarding demand, the costs of ingesting, retaining and re-hydrating web, database and API logs have grown, which led to short retention periods. Fortunately, when users see what Datadog has to offer they tend to want it immediately.

Elastic serves non-production cloud and on-prem logs and was becoming unmanageable. With Elastic being used by many developers and load testing teams, it was difficult to forecast capacity, and we experienced low retention and performance issues. Furthermore, there was no single team that owned or managed this environment, leading to a lack of control.

Splunk is a great tool for security log collection and integration with cyber tools. However, compliance and security posture meant logging everything was becoming the norm, leading to increasing costs to ingest and store data for longer periods of time.

Note: Our team love the capabilities of Splunk, Datadog, and Elastic. However, the scope of this engagement only covered logs, controlling costs and delivering control. It was not a feature comparison.

NEW Tooling Landscape

It became evident early on that introducing a data pipeline between the log sources and the three existing tools would be beneficial. While this doesn't alter the level of intelligence being fed to the current platforms, having a pipeline allows us to control and regulate the flow, as well as identify meaningful signals among the noise. The advantages of using a Logiq.ai data pipeline don't stop here; we will provide more insight into its numerous benefits later on.

Stage ONE - Setup Forwarders

Setting up the environment meant we first needed to pre-configure a few things so that the logs being pumped directly from Log Sources to Logging Solutions are 'inline'. This just means the Logiq.ai platform will forward everything as is. Within the tool, we add forwarders, which are the vendor APIs to push logs to. As you can see below, we added Elastic, Datadog, Splunk On-Prem and Splunk Cloud (API credentials + location/index).

** Instastore is the out-of-the-box storage within Logiq.ai, backed by storage that the customer provides (bring your own storage). Anything that passes from source to forwarder in the above diagram will be stored here indefinitely. This crucial component means that a raw copy of all data being passed from agents/beats/hubs to the corresponding vendor's cloud is now in the customer's cloud too. While Datadog or Elastic roll over their data based on retention rules, Logiq.ai never rolls over.

Stage TWO - Configure / Re-point Sources

Each log source will require minor configuration changes to redirect the output towards Logiq.ai. For example, Filebeat will require an HTTP output plugin to be configured. We approached this server by server, cloud by cloud, and baselined normal hourly/daily ingest volumes. Although this project was focused on three vendors, there are hundreds of other integrations available.

Stage THREE - Apply Mapping

Once a forwarder is configured and in place, and the agents/beats are primed ready to be restarted, a mapping is required to act as the middleman between source and forwarder. The mapping applies filtering rules and allows you to apply enhanced decision making as to which log lines are forwarded and which are not. Nothing is dropped: lines that are not forwarded are still stored in Logiq.ai. When we first introduce a mapping, we apply no rules other than forward everything, and then use this to compare log lines in Logiq.ai vs the vendor platform.

IMPORTANT: When making changes to Splunk, Datadog, Elastic Beats or event hubs, an agent restart is required. No data is lost during this process unless logs roll over at the source while the change is being made. Logs still exist on the source and any gap in ingestion is taken care of using watermark/timestamp features (standard practice). We tested all configuration on non-critical agents before applying anything in production. This helps clarify what changes are required, how to automate them, and that the plumbing is in place.
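To illustrate the watermark/timestamp idea in general terms, here is a minimal sketch. It is not how Splunk, Datadog or Beats agents implement it internally; the function name `lines_after_watermark` and the in-memory list of lines are assumptions made purely for illustration. The point is that an agent which remembers the timestamp of the last line it shipped can pick up from there after a restart, as long as the lines still exist on the source.

```python
from datetime import datetime
from typing import List, Tuple

# Hypothetical representation of a log line: (timestamp, raw text).
LogLine = Tuple[datetime, str]


def lines_after_watermark(
    lines: List[LogLine], watermark: datetime
) -> Tuple[List[LogLine], datetime]:
    """Return the lines newer than the watermark (oldest first) and the new watermark.

    If nothing is newer, the watermark is returned unchanged.
    """
    pending = sorted(
        (line for line in lines if line[0] > watermark),
        key=lambda line: line[0],
    )
    new_watermark = pending[-1][0] if pending else watermark
    return pending, new_watermark
```

Calling the function a second time with the returned watermark yields nothing new, which is what prevents duplicates after the restart.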

The benefits

We ensured that 100% of expected log lines were present at source, in Logiq.ai and in the vendor platforms (forwarders) by allowing the changes to run for 7-10 days. This gave us operational confidence that the pipeline did not introduce any difference, as the logging vendors would never know a pipeline was in the middle.
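A check like this can be reduced to comparing line counts per source at each stage. The sketch below assumes you have already exported those counts into plain dictionaries keyed by source name; the function name `find_mismatches` and the dictionary shape are hypothetical and not part of any vendor API.

```python
from typing import Dict, Tuple


def find_mismatches(
    source: Dict[str, int],
    pipeline: Dict[str, int],
    vendor: Dict[str, int],
) -> Dict[str, Tuple[int, int, int]]:
    """Return the sources whose line counts differ between the three stages.

    Each value is a (source, pipeline, vendor) tuple of counts.
    A source missing from a stage is treated as a count of 0.
    """
    mismatches = {}
    for name in sorted(set(source) | set(pipeline) | set(vendor)):
        counts = (source.get(name, 0), pipeline.get(name, 0), vendor.get(name, 0))
        if len(set(counts)) != 1:
            mismatches[name] = counts
    return mismatches
```

An empty result for every day of the observation window is the outcome you are looking for before any filtering rules are introduced.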

We recommend walking through each source one by one to tailor the default mappings. This will enable us to decide which lines should be forwarded and which should stay behind. To do this, we can use basic log level options or more granular text filters.

The image below was a result of applying a simple mapping to NOT FORWARD any log lines that contain the attribute loglevel=debug or loglevel=none. This was applied for Elastic in non-production environments (which, combined, are larger than production), where we observed that 46% of log lines matched the criteria. As a result, Elastic ingested 46% fewer log line events while Logiq.ai stored 100% of the data (including debug and none), so nothing is lost. If we need to replay that data later, we can.
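The logic of that mapping can be sketched in a few lines. This is a toy model, not Logiq.ai's configuration syntax: the `Pipeline` class, the `HELD_BACK_LEVELS` set and the dictionary-shaped events are all assumptions for illustration. It shows the two properties that matter: every event is stored, and only events outside the held-back levels are forwarded.

```python
from dataclasses import dataclass, field
from typing import Dict, List

# Hypothetical log levels that the mapping holds back from the vendor platform.
HELD_BACK_LEVELS = {"debug", "none"}


@dataclass
class Pipeline:
    """Toy stand-in for a pipeline with one store and one forwarder."""

    stored: List[Dict] = field(default_factory=list)     # raw copy of every event
    forwarded: List[Dict] = field(default_factory=list)  # events sent on to the vendor

    def ingest(self, event: Dict) -> None:
        # Always keep the raw event, so nothing is lost.
        self.stored.append(event)
        level = str(event.get("loglevel", "")).lower()
        # Forward only events whose level is not held back.
        if level not in HELD_BACK_LEVELS:
            self.forwarded.append(event)

    def reduction_percent(self) -> float:
        """Percentage of stored events that were not forwarded."""
        if not self.stored:
            return 0.0
        held_back = len(self.stored) - len(self.forwarded)
        return 100.0 * held_back / len(self.stored)
```

Because the stored copy is complete, the held-back events remain available to replay to a forwarder later.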

We implemented more mappings to Splunk (syslog, firewall, endpoint, cloud NSG) and Datadog (Azure cloud, Event hubs, Datadog Agent web + API + DB logs) to reduce the Events Per Second (EPS). As a result of these changes, the EPS dropped by around 90%.

One of the major issues with agent-to-platform solutions is that you can't check if data has been collected until it's already ingested - an issue that was very evident in our non-production environments. Developers and testers often forgot to turn off high volume log settings after periods of investigation. In some cases, we even had debug levels of logging enabled in production - but it went unnoticed.

Thanks to the use of Logiq.ai, our customer has seen reduced ingest volumes for all three key logging platforms. This has enabled them to gain better control over their costs and retain all log lines within Logiq.ai. Moreover, it has also allowed them to become more compliant with data retention requirements and report an improved user experience using each of the tools.

It doesn't stop here - as promised, some more hidden benefits

Now that the basic use cases have been answered, we are starting to explore some of the other brilliant features within Logiq.ai:

> Rewrite Rules

Rewrite sensitive data in log lines, for example first.last@domain.com to foo.bar@foo.com. This has been used when providing data extracts from Production for forensic teams.
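As a rough illustration of what such a rule does, the sketch below replaces anything that looks like an email address with a fixed placeholder. It uses Python's standard `re` module rather than Logiq.ai's own rule syntax, and the pattern is a deliberately simple assumption that will not cover every valid address.

```python
import re

# Simple, assumed pattern for email-like strings; not a full RFC 5322 matcher.
EMAIL_PATTERN = re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}")

# Placeholder taken from the example in the text above.
PLACEHOLDER = "foo.bar@foo.com"


def rewrite_emails(line: str) -> str:
    """Return the log line with every email-like string replaced by the placeholder."""
    return EMAIL_PATTERN.sub(PLACEHOLDER, line)
```

Lines without an email address pass through unchanged, so the same rule can be applied to every line in an extract.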

> Data Exchange / Intelligence Sharing

Forward data from Datadog to Splunk (login failures, DNS failures, brute force attempts, HTTP methods).

> Ingesting Metrics

Now starting to onboard metrics that would previously have been sent directly to Datadog.

> Widen On-boarding

With more available capacity, the customer is now looking to broaden on-boarding to more environments and business units where capacity was previously the main blocker. Each new service presents a smaller logging cost/footprint.

> Unified Reporting

Using tools like PowerBI or Tableau, they have started to use the Logiq.ai API to report on whole datasets in just one solution. They no longer need to maintain separate connections to each logging platform. Furthermore, Logiq.ai Instastore is easy to query and uses open data formatting to pull large extracts. Each query is executed in milliseconds and does not end up in a queue somewhere. Whether you want 5 lines or 5M lines, the response is linear via multi-dimensional indexing. It really is a game changer: faster execution means more time for analysis, and you can slice and dice until the cows come home.

> Parallel running critical systems

Now that Logiq.ai is in place, you have the ability to replay ANY data to ANY forwarder ANY time you need to, at no extra cost on Logiq.ai licensing. Where the customer may previously have been nervous about swapping out a legacy SIEM solution for modern technology (risk, cost, capabilities), they now have the luxury of comparing in real time. The same applies for Datadog vs Elastic vs Splunk vs something else.

Thank you for reading. If you require more information or would like to discuss how we can help your organisation take back control of your data and costs, please send an email to hello@visibilityplatforms.com. We look forward to hearing from you!
