At some point, running Databricks stops being the hard part. Understanding it becomes the problem.
A warehouse that has been stable for weeks starts queueing. Costs increase without any clear change in usage. Dashboards slow down, but no one touched the pipelines behind them. You open system tables, pull query history, compare time ranges, and still end up with an answer that feels incomplete.
You can see what is happening. You just can’t explain it cleanly.
That gap between visibility and explanation is what led us to build dCat.
Databricks already exposes detailed system tables. You have query history, execution metrics, usage, and billing inputs. The problem isn’t access.
The problem is how much work it takes to turn that data into something coherent.
A dashboard refreshing every few minutes generates hundreds of queries per day. A job processing partitions produces a long stream of nearly identical queries. These show up as separate entries even though they represent the same logical operation. Until you group them, the output looks like noise.
At the same time, attribution is inconsistent. Queries might come from dashboards, jobs, notebooks, or external tools, and the identity attached to them may be a service principal or shared account that doesn’t immediately map to a team or owner.
So even basic questions require reconstruction: what is driving spend, which patterns changed between two periods, and who owns the workload behind them. None of these answers exist directly. You have to derive them.
The first thing we changed was how queries are treated.
Instead of analyzing each query individually, we normalize them into templates by removing literal values such as dates and identifiers. Queries that share the same structure collapse into a single pattern.
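As a rough illustration, this kind of normalization can be approximated with a few regex substitutions. This is a minimal sketch under simplified assumptions; the function name and rules are illustrative, not dCat's actual implementation:

```python
import re

def normalize_query(sql: str) -> str:
    """Collapse a concrete query into a structural template by
    replacing literal values with placeholders."""
    template = sql
    # Replace quoted string literals (dates, labels, identifiers).
    template = re.sub(r"'[^']*'", "?", template)
    # Replace bare numeric literals.
    template = re.sub(r"\b\d+(\.\d+)?\b", "?", template)
    # Normalize whitespace so formatting differences do not split templates.
    template = re.sub(r"\s+", " ", template).strip()
    return template

# Two executions of the "same" dashboard query collapse into one template:
a = normalize_query("SELECT * FROM sales WHERE day = '2024-01-01' AND region_id = 7")
b = normalize_query("SELECT * FROM sales WHERE day = '2024-01-02' AND region_id = 12")
```

With this grouping, `a` and `b` are the same pattern, so the dashboard's hundreds of daily refreshes count as one unit of behavior rather than hundreds of unrelated entries.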
This is where the system starts to make sense.
Instead of thousands of entries, you now see a manageable set of patterns. You can measure how often each one runs, how long it takes, and how that changes over time. A dashboard that used to look like noise becomes a single pattern with clear behavior. A recurring job becomes something you can track as one unit instead of a stream of unrelated executions.
This also makes comparison possible. When you look at two time windows, you are no longer comparing individual queries, but shifts in patterns.
Once queries are grouped, the next step is understanding where they come from.
A shared warehouse is rarely used by a single workload. Dashboards, scheduled jobs, notebooks, and ad hoc queries all contribute, and they behave differently. Without separating them, everything gets mixed together.
To make this usable, we map query templates to the sources that issue them (dashboards, jobs, notebooks, external tools) and to the users and teams behind them.
This requires an additional identity layer, because system tables do not always provide clean mappings. Service principals, connectors, and shared accounts need to be translated into something readable and stable.
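Conceptually, that identity layer can be as simple as a maintained lookup from raw principals to readable owners. The sketch below is hypothetical; every name in it is made up for illustration and none of it reflects dCat's internal schema:

```python
# Hypothetical mapping from raw principals (service principals, connectors,
# shared accounts) to a readable (team, source type) pair.
PRINCIPAL_OWNERS: dict[str, tuple[str, str]] = {
    "sp-analytics-prod": ("Analytics", "BI dashboards"),
    "svc-etl": ("Data Engineering", "scheduled jobs"),
}

def resolve_owner(principal: str) -> tuple[str, str]:
    # Fall back to an explicit "unmapped" bucket instead of dropping the
    # query, so unattributed activity stays visible and can be triaged.
    return PRINCIPAL_OWNERS.get(principal, ("unmapped", "unknown"))
```

Keeping an explicit "unmapped" bucket matters: activity you cannot attribute yet should show up as a gap to close, not disappear from the analysis.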
Once that is in place, the analysis changes. Instead of seeing that a query pattern increased, you can see that it is tied to a specific dashboard or job, and that it is driven by a specific group of users. That is what allows teams to act instead of guess.
Performance issues are often treated as a single problem, but they are not.
A query can be slow because it is waiting, because it is doing too much work, or because it is expensive to plan. Treating all of that as “slow” leads to generic fixes that don’t always help.
We break query time into queue time, execution time, and compilation time.
Each of these points to a different cause. Increased queue time usually means contention or capacity limits. Increased execution time often means inefficient queries or larger scans. Compilation time can indicate complexity or planning overhead.
This distinction matters because it changes the response. Scaling a warehouse can reduce queueing but will not fix inefficient queries. Optimizing a query will not help if the warehouse is saturated.
Without separating these signals, it is easy to apply the wrong fix.
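As a minimal illustration of separating these signals, the sketch below classifies which phase dominates a query's wall time. The function name is illustrative and not part of Databricks or dCat:

```python
def dominant_phase(queue_ms: int, exec_ms: int, compile_ms: int) -> str:
    """Return the phase that dominates a query's wall time.

    Each phase points to a different fix:
      queue       -> contention or capacity limits (consider scaling)
      execution   -> inefficient queries or larger scans (optimize the query)
      compilation -> query complexity or planning overhead (simplify the plan)
    """
    phases = {"queue": queue_ms, "execution": exec_ms, "compilation": compile_ms}
    return max(phases, key=phases.get)
```

For example, a query spending 4 seconds queued but under a second executing points at the warehouse, not the SQL.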
Cost is usually the trigger for investigation, but it is also the least actionable by default.
Databricks provides usage data and pricing inputs, but it does not directly assign cost to queries in a way that supports prioritization. To make this usable, we estimate cost distribution across query templates based on how much time they consume within a given period.
The logic is simple. For each time window, we calculate total warehouse cost and distribute it across queries proportionally to their execution and compilation time. The result is not billing-grade precise, but it is accurate enough to show where cost concentrates and which patterns matter most.
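That allocation logic fits in a few lines. This is a sketch of the proportional approach described above, deliberately approximate rather than billing-grade; the names and numbers are illustrative:

```python
def allocate_cost(total_cost: float, template_times: dict[str, float]) -> dict[str, float]:
    """Distribute a window's total warehouse cost across query templates,
    proportionally to the time each template consumed in that window."""
    total_time = sum(template_times.values())
    if total_time == 0:
        # No recorded work in the window: nothing to attribute.
        return {t: 0.0 for t in template_times}
    return {t: total_cost * time / total_time for t, time in template_times.items()}

# If the warehouse cost 100 units in this window and the dashboard template
# consumed 600 of the 1000 recorded seconds, it is attributed 60 units.
costs = allocate_cost(100.0, {"dashboard_refresh": 600.0, "etl_job": 300.0, "adhoc": 100.0})
```

The point is not precision but concentration: even a rough proportional split makes it obvious which handful of patterns carries most of the spend.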
That is enough to answer the question teams actually care about: what is driving spend.
The most useful capability turned out to be the simplest conceptually.
Instead of looking at a single period, we compare two: a stable window and a problematic one.
This makes change visible in a structured way. For each query template, you can see how execution count and runtime shifted, whether the pattern is new, and whether its impact increased or decreased. You can also see how activity changed across users and sources.
This removes the ambiguity that usually surrounds these situations. Instead of debating whether something changed, you can point to exactly what did.
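A simplified sketch of that two-window comparison, assuming per-template execution counts have already been aggregated (names are illustrative):

```python
def compare_windows(baseline: dict[str, int], problem: dict[str, int]) -> dict[str, dict]:
    """Compare per-template execution counts between a stable window
    and a problematic one, flagging new patterns and count shifts."""
    report = {}
    for template in set(baseline) | set(problem):
        before = baseline.get(template, 0)
        after = problem.get(template, 0)
        report[template] = {
            "before": before,
            "after": after,
            "delta": after - before,          # positive = pattern grew
            "new": template not in baseline,  # appeared only in the problem window
        }
    return report

report = compare_windows(
    baseline={"dashboard": 500, "etl": 40},
    problem={"dashboard": 2100, "etl": 40, "backfill": 300},
)
```

In this toy example, the report shows the dashboard pattern quadrupling and a backfill pattern that did not exist in the stable window, while the ETL job is unchanged. That is the shape of answer that ends the "did anything change?" debate.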
Finding issues is not the hard part. Deciding what matters is.
Some queries are slow but rare. Some are fast but run constantly. The ones that matter are the ones that combine frequency and cost.
By ranking query templates based on how often they run and how much time they consume, priorities become clear. Patterns that sit at the intersection of high frequency and high cost are the ones worth fixing first.
This is also where common issues show up clearly. Missing partition filters, queries scanning too much data, or unnecessary column selection often look minor in isolation but become expensive when repeated at scale. In some cases, fixing these reduced scan volume by an order of magnitude, with direct impact on both runtime and cost.
The key is that these are not just observations. They form a concrete list of actions.
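The ranking step can be sketched by scoring each template on total time consumed, i.e. run count times average runtime. The scoring rule and all names here are illustrative, not dCat's exact formula:

```python
def rank_templates(stats: list[dict]) -> list[dict]:
    """Order query templates by total time consumed (runs x avg runtime),
    which surfaces high-frequency, high-cost patterns first."""
    for s in stats:
        s["total_time_s"] = s["runs"] * s["avg_runtime_s"]
    return sorted(stats, key=lambda s: s["total_time_s"], reverse=True)

ranked = rank_templates([
    {"template": "rare_but_slow", "runs": 3, "avg_runtime_s": 120.0},
    {"template": "fast_but_constant", "runs": 5000, "avg_runtime_s": 0.4},
])
```

Note how the ordering inverts intuition: the query that is fast per run but executes thousands of times outranks the occasionally slow one, which is exactly the "frequency times cost" intersection described above.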
Once everything is structured around query patterns, attribution, and comparable time windows, investigations stop being open-ended.
You are no longer scanning query history hoping to notice something unusual. You are comparing patterns, seeing exactly which ones changed, and tying those changes back to specific workloads and users. From there, the next step becomes clear, whether that means fixing a query, adjusting a workload, or scaling capacity for the right reason.
That is the difference dCat is meant to make. Not more visibility, but a shorter path from change to explanation to action.
We’re walking through this exact approach in a live session, using a real warehouse scenario where costs increased and performance degraded while the team initially believed nothing had changed.
Instead of staying at the concept level, we'll show how the investigation actually unfolds, step by step, on that scenario.
If you’ve ever had to explain a Databricks cost spike or a slowdown and found yourself digging through system tables longer than expected, this will feel familiar.
Join us on April 29 at 19:00 CET to see the full workflow in action.