Coding Station

Data journeys made shorter with pipelines

Over the last couple of weeks we have been working with a customer that had some brilliant reporting capabilities, in fact some of the best visuals we have seen. But as the popularity of reporting grew, the need to push more data in one end and out the other was starting to overwhelm the team and push up infrastructure costs in Azure. There were huge amounts of duplication, on-boarding took too long, and frequent batches of data moving from one end to the other, whilst being reshaped and squeezed in between, were making it unmanageable.

They reached out to see how we could help after reading some of our other posts on data pipelines. The first step was to map out the sources and a typical journey, then understand how many gates needed to be opened in order to reach the destination.

We came up with the diagram below (Fig1) to represent the ecosystem in its current form.

Fig1 - diagram of the customer's existing environment

In the current design, there are 10 different silos (gates), each responsible for its own small part of the journey from source to screen. We explained that introducing a data pipeline solution would not alter the visual output much (they have this nailed!), but the journey times and the number of gates to pass through could be streamlined. Note: back-end data from the applications in this diagram was not part of the data lake solution or the reporting in Databricks, which meant they were missing that particular vantage point.

In the same session with the customer, having explained again how a data pipeline improves control and reduces complexity, we came up with a design that would potentially remove NiFi, Data Factory, Event Hub and the entire data lake. That's four of the 10 silos removed. Within three seconds of looking at our proposed approach, they thought we were on drugs and could not see how anything could flow from left to right. We were deadly serious; we actually experience the same reaction from every customer when we propose such dramatic changes.

The pipeline is all four of those silos in one (and more). In the diagram below we have proposed that it replaces much of the top layer, since those components cost a fair bit and are the most complex to maintain. By allowing the pipeline to control ingestion of data, filter and route it, then store and forward it to Databricks or any other platform in a less complicated fashion, we solved the first problem. Once data is ingested, they would have the power to select a destination within Databricks without having to jump through any gates at all, i.e. one set of data can easily be replicated in any workspace without any complex configuration.

Fig2 - diagram of proposed design
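
To make the ingest, filter, route and forward idea a little more concrete, here is a minimal sketch in Python. It is not the customer's configuration and not any product's API: the `Route` and `Pipeline` classes, the route names and the destination names (`workspace-a`, `workspace-b`, `archive`) are all hypothetical, and plain in-memory lists stand in for real destinations such as Databricks workspaces.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List

Event = Dict[str, object]


@dataclass
class Route:
    """A hypothetical routing rule: a filter plus the destinations it feeds."""
    name: str
    matches: Callable[[Event], bool]
    destinations: List[str]  # an empty list means "drop the event"


@dataclass
class Pipeline:
    """Illustrative single ingestion point; routes are checked in order."""
    routes: List[Route]
    # In-memory stand-ins for real destinations (assumption for this sketch).
    delivered: Dict[str, List[Event]] = field(default_factory=dict)

    def ingest(self, event: Event) -> List[str]:
        """Forward the event to every destination of the first matching route."""
        for route in self.routes:
            if route.matches(event):
                for dest in route.destinations:
                    self.delivered.setdefault(dest, []).append(event)
                return list(route.destinations)
        return []  # no route matched, so nothing is forwarded


pipeline = Pipeline(routes=[
    # Filter: discard noisy debug events before they cost anything downstream.
    Route("drop-debug", lambda e: e.get("level") == "debug", []),
    # Route: one set of data replicated to two workspaces.
    Route("sales", lambda e: e.get("source") == "sales", ["workspace-a", "workspace-b"]),
    # Catch-all: everything else is stored in an archive destination.
    Route("default", lambda e: True, ["archive"]),
])

sent_debug = pipeline.ingest({"source": "sales", "level": "debug", "msg": "noise"})
sent_sales = pipeline.ingest({"source": "sales", "level": "info", "msg": "order placed"})
sent_other = pipeline.ingest({"source": "hr", "level": "info", "msg": "new starter"})
```

The point of the sketch is the shape rather than the detail: there is one place where data comes in, and sending the same data to another workspace is a matter of adding a name to a list instead of opening another gate.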

The icing on the cake was that if a pipeline were introduced, we could include the missing application telemetry in the mix very easily. This was the second challenge they mentioned, and also the easiest to fix.

Adding application logs and metrics (Prometheus) would allow the customer to reduce the retention and disk size on local disk for those components and push unfiltered data into a unified store that is visible from end-user reporting tools if they want. They could either go direct to the UI, or use Grafana or one of the other end-user reporting tools.

The POC is well underway; we have plumbed the pipeline in and are already proving we were not high on drugs at the time. More data sources such as Microsoft 365, Syslog, firewalls and endpoints are some of the things we are looking to include in an extended scope. One step at a time, eh.

Thanks for reading. If you want more information on the above, just do what this customer did: mailto:

