
Reducing Log Ingest by 90%

Updated: Feb 24

In late 2021 we kick-started a project with a customer who had reached out to us via LinkedIn with an interest in reducing log ingestion costs.


Like many others, this customer had a combination of popular logging platforms in place that had been built and put together over time. There was no single reason why: different departments, skills, tooling budgets and business needs had all influenced their position.


They finally allowed us to engage with their technical teams and asked us to prove that we could reduce spend on tools and show them how to get back control of their data. Challenge accepted.

 

OLD Tooling Landscape


Nothing unexpected was found. They had three different platforms, with data in other people's clouds and on-prem. Each tool represented a technical domain and was straightforward to understand.


Datadog - Observability in production. This tool is used to ingest agent and third-party logs along with other metric and trace telemetry. Pain Point - Increased on-boarding demand has meant increasing costs to ingest, retain and re-hydrate. Web, database and API logs were growing. Low retention. Not really a pain point, but everyone wants it once they see it.


Elastic - Non-production cloud and on-prem logs. All types. Pain Point - Lacked control. Elastic was being used by many developers and load testing teams. Capacity was difficult to forecast, and the environment experienced low retention and sometimes low performance. No one team owned or managed this environment.


Splunk - Security log collection and integration with cyber tools. Pain Point - Compliance and security posture had meant that logging everything became the norm. No choice but to accept growing costs to ingest and store for longer.


Note: all three tools above are absolute powerhouses in their own domains, and our team loves each of them. This POC was not about comparing features or picking holes in the customer's capabilities, but about bringing order and control to the environment so that the customer could forecast costs. Although Logiq.ai can support logs and metrics, the scope for this engagement covered logs only.

 

NEW Tooling Landscape


It is important to understand that there is no change to the level of intelligence data being delivered to the existing platforms. Adding a pipeline gives you control of the flow and the power to distinguish signal from noise. In fact, it's a very basic feature associated with data pipelines, but we will explain later how Logiq.ai adds many other unexpected benefits.

Stage ONE - Setup Forwarders

Setting up the environment meant we first needed to pre-configure a few things so that the logs being pumped directly from Log Sources to Logging Solutions are "inline". This just means the Logiq.ai platform will forward everything as is. Within the tool, we add forwarders, which are the vendor APIs to push logs to. As you can see below, we added Elastic, Datadog, Splunk On-Prem and Splunk Cloud (API credentials + location/index).

** Instastore is the out-of-the-box storage layer within Logiq.ai, backed by storage that the customer provides (bring your own storage). Anything that passes from source to forwarder in the above diagram will be stored here indefinitely. This crucial component means that a raw copy of all data being passed from agents/beats/hubs to the corresponding vendor's cloud is now in the customer's cloud too. While Datadog or Elastic roll over their data based on retention rules, Logiq.ai never rolls over.


Stage TWO - Configure / Re-point Sources

Each log source will require minor configuration changes to redirect its output towards Logiq.ai. For example, filebeat will require a http-output-plugin configured. We approached this server by server and cloud by cloud, and baselined normal hourly/daily ingest volumes as we went. Although this project was focused on three vendors, there are hundreds of other integrations available.
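As a rough illustration of the baselining step, the sketch below counts events per hour from a list of timestamps. It is not how Logiq.ai or any vendor computes volumes; the function name `hourly_baseline` and the sample timestamps are made up for the example, and it assumes timestamps are available as ISO-8601 strings.

```python
from collections import Counter
from datetime import datetime


def hourly_baseline(timestamps):
    """Count events per hour bucket from ISO-8601 timestamp strings."""
    counts = Counter()
    for ts in timestamps:
        # Truncate each timestamp to the start of its hour.
        hour = datetime.fromisoformat(ts).replace(minute=0, second=0, microsecond=0)
        counts[hour.isoformat()] += 1
    return dict(counts)


# Hypothetical sample data, for illustration only.
sample = [
    "2021-11-01T09:05:12",
    "2021-11-01T09:47:03",
    "2021-11-01T10:01:00",
]
baseline = hourly_baseline(sample)
```

Keeping a baseline like this per source gives a "before" figure to compare against once mappings start holding lines back.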


Stage THREE - Apply Mapping

Once a forwarder is configured and in place, and the agents/beats are primed and ready to be restarted, a mapping is required to act as the middleman between source and forwarder. The mapping holds the filtering rules and lets you apply enhanced decision making as to which log lines are forwarded and which are not. Nothing is dropped. When we first introduce a mapping, we apply no rules other than "forward everything", and then use this to compare log lines in Logiq.ai against the vendor platform.
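To make the idea of a mapping concrete, here is a minimal conceptual sketch. It does not use the Logiq.ai API; the `Mapping` class, its fields and the in-memory `store` and `outbound` containers are hypothetical stand-ins. It shows the two properties described above: every line is stored, and with no rules configured every line is also forwarded.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List


@dataclass
class Mapping:
    """Hypothetical stand-in for a source-to-forwarder mapping."""
    forwarder: str
    # Each rule returns True when a line should be held back (stored only).
    hold_back_rules: List[Callable[[str], bool]] = field(default_factory=list)

    def route(self, line: str, store: List[str], outbound: Dict[str, List[str]]) -> None:
        store.append(line)  # every line is retained, regardless of rules
        if not any(rule(line) for rule in self.hold_back_rules):
            outbound.setdefault(self.forwarder, []).append(line)


# Default mapping: no rules, so everything is forwarded as is.
lines = ["loglevel=info started", "loglevel=debug cache miss"]
store, outbound = [], {}
default_mapping = Mapping(forwarder="elastic")
for line in lines:
    default_mapping.route(line, store, outbound)
```

Starting with an empty rule list mirrors the "forward everything" first step, so the store and the vendor platform can be compared like for like before any filtering is introduced.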

IMPORTANT When making changes to Splunk, Datadog, Elastic Beats or event hubs, no data is lost during this process unless logs roll over in the time between the change and the restart. An agent restart is required. Logs still exist on the source, and any gap in ingestion is taken care of using watermark/timestamp features (standard practice). We tested all configuration on non-critical agents before applying anything in production. This helps clarify what changes are required and how to automate them, and confirms that the plumbing is in place.
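The watermark idea can be sketched as follows. This is a simplified illustration, not the mechanism any specific agent uses; real agents track their own read position in their own way. The function name `resume_from_watermark` and the record layout (timestamp string, line) are assumptions made for the example.

```python
from datetime import datetime


def resume_from_watermark(records, watermark):
    """Return records newer than the watermark, plus the updated watermark.

    records: list of (iso_timestamp, line) tuples still present on the source.
    watermark: ISO-8601 timestamp of the last record that was shipped.
    """
    cutoff = datetime.fromisoformat(watermark)
    pending = [(ts, line) for ts, line in records if datetime.fromisoformat(ts) > cutoff]
    # If nothing is pending, the watermark stays where it was.
    new_watermark = max(
        (ts for ts, _ in pending), key=datetime.fromisoformat, default=watermark
    )
    return pending, new_watermark


# Hypothetical records left on a source while its agent was restarted.
records = [
    ("2021-11-01T09:00:00", "a"),
    ("2021-11-01T09:05:00", "b"),
    ("2021-11-01T09:10:00", "c"),
]
pending, watermark = resume_from_watermark(records, "2021-11-01T09:00:00")
```

Because the source still holds the lines, anything written during the restart window is picked up on the next pass, as long as the file has not rolled over in the meantime.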

 

The benefits

Before making changes we had to confirm that 100% of expected log lines were present at the source, in Logiq.ai and in the vendor platforms (forwarders). We allowed this to run for 7-10 days to give operational confidence that the pipeline introduces zero difference. (Remember, those logging vendors will never know a pipeline is in the middle!)
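One simple way to think about this check is as a multiset comparison between the three locations. The sketch below is illustrative only and assumes the lines from each location have already been exported into Python lists; the function names are hypothetical.

```python
from collections import Counter


def missing_lines(source, destination):
    """Lines (with multiplicity) present at the source but absent from the destination."""
    return Counter(source) - Counter(destination)


def pipeline_is_transparent(source, stored, forwarded):
    """True when every source line appears in both the store and the vendor platform."""
    return not missing_lines(source, stored) and not missing_lines(source, forwarded)


# Hypothetical exports from the source, the pipeline store and a vendor platform.
source = ["a", "b", "b", "c"]
stored = ["a", "b", "b", "c"]
forwarded = ["c", "b", "a", "b"]
ok = pipeline_is_transparent(source, stored, forwarded)
```

Using counts rather than plain sets means a duplicated line that arrives only once still shows up as a difference.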


Next, the default mappings that we applied require tailoring and fine tuning with the customer. They tell us which lines to forward and which lines to keep behind (not lose or drop). This is best achieved by walking through source by source. Basic options can be things like specific log levels, or we can be more granular and apply line-by-line text filters.


The image below was the result of applying a simple mapping to NOT FORWARD any log lines that contain the attribute loglevel=debug or loglevel=none. This was applied for Elastic in the non-production environments (which together are larger than production), where we observed that 46% of log lines matched the criteria. As a result, Elastic ingested 46% fewer log line events, while Logiq.ai stored 100% of the data (including debug and none), so nothing is lost. If we need to replay that data, we can do so later.
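The rule itself is simple enough to sketch in a few lines. The example below is an illustration rather than Logiq.ai's rule syntax: the `HOLD_BACK` pattern, the function names and the sample lines are all hypothetical. It holds back any line whose loglevel is debug or none, keeps every line in the store, and reports the resulting reduction.

```python
import re

# Matches lines carrying loglevel=debug or loglevel=none (either value).
HOLD_BACK = re.compile(r"\bloglevel=(debug|none)\b", re.IGNORECASE)


def split_lines(lines):
    """Return (forwarded, stored): matching lines are stored but not forwarded."""
    forwarded = [line for line in lines if not HOLD_BACK.search(line)]
    stored = list(lines)  # the store always keeps everything
    return forwarded, stored


def reduction_percent(total, forwarded):
    """Percentage of lines that were not forwarded."""
    return 100.0 * (total - forwarded) / total if total else 0.0


# Hypothetical sample lines, for illustration only.
sample = [
    "ts=1 loglevel=info msg=started",
    "ts=2 loglevel=debug msg=cache-miss",
    "ts=3 loglevel=none msg=heartbeat",
    "ts=4 loglevel=error msg=timeout",
]
forwarded, stored = split_lines(sample)
saving = reduction_percent(len(stored), len(forwarded))
```

On this made-up sample half the lines are held back; in the engagement the observed figure for non-production Elastic was 46%.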


As we applied more and more mappings over time to both Splunk (syslog, firewall, endpoint, cloud NSG) and Datadog (Azure cloud, Event Hubs, Datadog Agent web + API + DB logs), the amount of EPS (Events Per Second) started to drop overall by ~90% (it fluctuates by hour of day). The problem with agent-to-platform solutions is that you don't know what you've collected until it's there, by which point you have already paid to ingest it. In the case of the Elastic non-production environments, volumes were very sporadic: developers and testers enabled high-volume log settings for periods of investigation and forgot to turn them off. In some cases, we found logging set to debug level in production because it had simply been overlooked. Logiq.ai gives you that layer of protection and enforces sensible rules.


We continue to apply further mappings and get more acquainted with priority content across the environment. The customer has retained every single log line within Logiq.ai and has seen reduced ingest volumes for all three key logging platforms. They are in a much better position in terms of cost control, are more compliant in terms of data retention (or will be) and report an improved user experience with each of the tools.



 

It doesn't stop here - as promised, some more hidden benefits

Now that the basic use cases have been addressed, we are starting to explore some of the other brilliant features within Logiq.ai:


> Rewrite Rules

Rewrite sensitive data in log lines, for example first.last@domain.com to foo.bar@foo.com. This has been used when providing data extracts from Production for forensic teams.
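The effect of such a rewrite rule can be approximated with a regular expression. This is a sketch of the idea, not Logiq.ai's rule format; the `EMAIL` pattern is deliberately simple and will not cover every valid address, and `mask_emails` is a hypothetical helper name.

```python
import re

# Simplified email pattern, good enough for illustration but not RFC-complete.
EMAIL = re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}")


def mask_emails(line, replacement="foo.bar@foo.com"):
    """Replace every email address in a log line with a fixed placeholder."""
    return EMAIL.sub(replacement, line)


masked = mask_emails("login failed for first.last@domain.com from 10.0.0.1")
```

A fixed placeholder keeps the line shape intact for downstream parsers while removing the personal identifier from the extract.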


> Data Exchange / Intelligence Sharing

Forward data from Datadog to Splunk (login failures, DNS failures, brute force attempts, HTTP methods).


> Ingesting Metrics

We are now starting to onboard metrics that would previously have been sent directly to Datadog.


> Widen On-boarding

With more available capacity, the customer is now looking to broaden on-boarding to more environments and business units where capacity was previously the main blocker. Each new service presents a smaller logging cost/footprint.


> Unified Reporting

Using tools like PowerBI or Tableau, they have started to use the Logiq.ai API to report on whole datasets in just one solution. They no longer need to maintain separate connections to each logging platform. Furthermore, Logiq.ai Instastore is easy to query and uses open data formatting to pull large extracts. Each query is executed in milliseconds and does not end up in a queue somewhere. Whether you want 5 lines or 5M lines, the response is linear via multi-dimensional indexing. It really is a game changer: faster execution means more time for analysis, and you can slice and dice until the cows come home.


> Parallel running critical systems

Now that Logiq.ai is in place, you have the ability to replay ANY data to ANY forwarder ANY time you need to, at no extra cost on Logiq.ai licensing. Where the customer may previously have been nervous about swapping out a legacy SIEM solution for modern technology (risk, cost, capabilities), they now have the luxury of comparing in real time. The same applies to Datadog vs Elastic vs Splunk vs something else.



Thank you for reading. If you require more information on the above or would like to discuss how we can help your organisation take back control of your data and costs, fire over an email to info@visibilityplatforms.com