At some point, running Databricks stops being the hard part. Understanding it becomes the problem.
A warehouse that has been stable for weeks starts queueing. Costs increase without any clear change in usage. Dashboards slow down, but no one touched the pipelines behind them. You open system tables, pull query history, compare time ranges, and still end up with an answer that feels incomplete.
You can see what is happening. You just can’t explain it cleanly.
That gap between visibility and explanation is what led us to build dCat.
Databricks already exposes detailed system tables. You have query history, execution metrics, usage, and billing inputs. The problem isn’t access.
The problem is how much work it takes to turn that data into something coherent.
A dashboard refreshing every few minutes generates hundreds of queries per day. A job processing partitions produces a long stream of nearly identical queries. These show up as separate entries even though they represent the same logical operation. Until you group them, the output looks like noise.
At the same time, attribution is inconsistent. Queries might come from dashboards, jobs, notebooks, or external tools, and the identity attached to them may be a service principal or shared account that doesn’t immediately map to a team or owner.
So even basic questions require reconstruction: what is driving spend, which patterns changed between two periods, and who owns the workload behind them. None of these answers exist directly. You have to derive them.
The first thing we changed was how queries are treated.
Instead of analyzing each query individually, we normalize them into templates by removing literal values such as dates and identifiers. Queries that share the same structure collapse into a single pattern.
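As a rough illustration, this kind of normalization can be approximated with a few regex substitutions. This is a minimal sketch under simplified assumptions; the function name and rules are illustrative, not dCat's actual implementation:

```python
import re

def normalize_query(sql: str) -> str:
    """Collapse a concrete query into a structural template by
    replacing literal values with placeholders."""
    template = sql
    # Replace quoted string literals (dates, labels, identifiers).
    template = re.sub(r"'[^']*'", "?", template)
    # Replace bare numeric literals.
    template = re.sub(r"\b\d+(\.\d+)?\b", "?", template)
    # Normalize whitespace so formatting differences do not split templates.
    template = re.sub(r"\s+", " ", template).strip()
    return template

# Two executions of the "same" dashboard query collapse into one template:
a = normalize_query("SELECT * FROM sales WHERE day = '2024-01-01' AND region_id = 7")
b = normalize_query("SELECT * FROM sales WHERE day = '2024-01-02' AND region_id = 12")
```

With this grouping, `a` and `b` are the same pattern, so the dashboard's hundreds of daily refreshes count as one unit of behavior rather than hundreds of unrelated entries.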
This is where the system starts to make sense.
Instead of thousands of entries, you now see a manageable set of patterns. You can measure how often each one runs, how long it takes, and how that changes over time. A dashboard that used to look like noise becomes a single pattern with clear behavior. A recurring job becomes something you can track as one unit instead of a stream of unrelated executions.
This also makes comparison possible. When you look at two time windows, you are no longer comparing individual queries, but shifts in patterns.
Once queries are grouped, the next step is understanding where they come from.
A shared warehouse is rarely used by a single workload. Dashboards, scheduled jobs, notebooks, and ad hoc queries all contribute, and they behave differently. Without separating them, everything gets mixed together.
To make this usable, we map query templates to the sources that issue them (dashboards, jobs, notebooks, external tools) and to the users and teams behind them.
This requires an additional identity layer, because system tables do not always provide clean mappings. Service principals, connectors, and shared accounts need to be translated into something readable and stable.
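Conceptually, that identity layer can be as simple as a maintained lookup from raw principals to readable owners. The sketch below is hypothetical; every name in it is made up for illustration and none of it reflects dCat's internal schema:

```python
# Hypothetical mapping from raw principals (service principals, connectors,
# shared accounts) to a readable (team, source type) pair.
PRINCIPAL_OWNERS: dict[str, tuple[str, str]] = {
    "sp-analytics-prod": ("Analytics", "BI dashboards"),
    "svc-etl": ("Data Engineering", "scheduled jobs"),
}

def resolve_owner(principal: str) -> tuple[str, str]:
    # Fall back to an explicit "unmapped" bucket instead of dropping the
    # query, so unattributed activity stays visible and can be triaged.
    return PRINCIPAL_OWNERS.get(principal, ("unmapped", "unknown"))
```

Keeping an explicit "unmapped" bucket matters: activity you cannot attribute yet should show up as a gap to close, not disappear from the analysis.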
Once that is in place, the analysis changes. Instead of seeing that a query pattern increased, you can see that it is tied to a specific dashboard or job, and that it is driven by a specific group of users. That is what allows teams to act instead of guess.
Performance issues are often treated as a single problem, but they are not.
A query can be slow because it is waiting, because it is doing too much work, or because it is expensive to plan. Treating all of that as “slow” leads to generic fixes that don’t always help.
We break query time into queue time, execution time, and compilation time.
Each of these points to a different cause. Increased queue time usually means contention or capacity limits. Increased execution time often means inefficient queries or larger scans. Compilation time can indicate complexity or planning overhead.
This distinction matters because it changes the response. Scaling a warehouse can reduce queueing but will not fix inefficient queries. Optimizing a query will not help if the warehouse is saturated.
Without separating these signals, it is easy to apply the wrong fix.
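As a minimal illustration of separating these signals, the sketch below classifies which phase dominates a query's wall time. The function name is illustrative and not part of Databricks or dCat:

```python
def dominant_phase(queue_ms: int, exec_ms: int, compile_ms: int) -> str:
    """Return the phase that dominates a query's wall time.

    Each phase points to a different fix:
      queue       -> contention or capacity limits (consider scaling)
      execution   -> inefficient queries or larger scans (optimize the query)
      compilation -> query complexity or planning overhead (simplify the plan)
    """
    phases = {"queue": queue_ms, "execution": exec_ms, "compilation": compile_ms}
    return max(phases, key=phases.get)
```

For example, a query spending 4 seconds queued but under a second executing points at the warehouse, not the SQL.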
Cost is usually the trigger for investigation, but it is also the least actionable by default.
Databricks provides usage data and pricing inputs, but it does not directly assign cost to queries in a way that supports prioritization. To make this usable, we estimate cost distribution across query templates based on how much time they consume within a given period.
The logic is simple. For each time window, we calculate total warehouse cost and distribute it across queries proportionally to their execution and compilation time. The result is not billing-grade precise, but it is accurate enough to show where cost concentrates and which patterns matter most.
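That allocation logic fits in a few lines. This is a sketch of the proportional approach described above, deliberately approximate rather than billing-grade; the names and numbers are illustrative:

```python
def allocate_cost(total_cost: float, template_times: dict[str, float]) -> dict[str, float]:
    """Distribute a window's total warehouse cost across query templates,
    proportionally to the time each template consumed in that window."""
    total_time = sum(template_times.values())
    if total_time == 0:
        # No recorded work in the window: nothing to attribute.
        return {t: 0.0 for t in template_times}
    return {t: total_cost * time / total_time for t, time in template_times.items()}

# If the warehouse cost 100 units in this window and the dashboard template
# consumed 600 of the 1000 recorded seconds, it is attributed 60 units.
costs = allocate_cost(100.0, {"dashboard_refresh": 600.0, "etl_job": 300.0, "adhoc": 100.0})
```

The point is not precision but concentration: even a rough proportional split makes it obvious which handful of patterns carries most of the spend.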
That is enough to answer the question teams actually care about: what is driving spend.
The most useful capability turned out to be the simplest conceptually.
Instead of looking at a single period, we compare two: a stable window and a problematic one.
This makes change visible in a structured way. For each query template, you can see how execution count and runtime shifted, whether the pattern is new, and whether its impact increased or decreased. You can also see how activity changed across users and sources.
This removes the ambiguity that usually surrounds these situations. Instead of debating whether something changed, you can point to exactly what did.
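A simplified sketch of that two-window comparison, assuming per-template execution counts have already been aggregated (names are illustrative):

```python
def compare_windows(baseline: dict[str, int], problem: dict[str, int]) -> dict[str, dict]:
    """Compare per-template execution counts between a stable window
    and a problematic one, flagging new patterns and count shifts."""
    report = {}
    for template in set(baseline) | set(problem):
        before = baseline.get(template, 0)
        after = problem.get(template, 0)
        report[template] = {
            "before": before,
            "after": after,
            "delta": after - before,          # positive = pattern grew
            "new": template not in baseline,  # appeared only in the problem window
        }
    return report

report = compare_windows(
    baseline={"dashboard": 500, "etl": 40},
    problem={"dashboard": 2100, "etl": 40, "backfill": 300},
)
```

In this toy example, the report shows the dashboard pattern quadrupling and a backfill pattern that did not exist in the stable window, while the ETL job is unchanged. That is the shape of answer that ends the "did anything change?" debate.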
Finding issues is not the hard part. Deciding what matters is.
Some queries are slow but rare. Some are fast but run constantly. The ones that matter are the ones that combine frequency and cost.
By ranking query templates based on how often they run and how much time they consume, priorities become clear. Patterns that sit at the intersection of high frequency and high cost are the ones worth fixing first.
This is also where common issues show up clearly. Missing partition filters, queries scanning too much data, or unnecessary column selection often look minor in isolation but become expensive when repeated at scale. In some cases, fixing these reduced scan volume by an order of magnitude, with direct impact on both runtime and cost.
The key is that these are not just observations. They form a concrete list of actions.
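The ranking step can be sketched by scoring each template on total time consumed, i.e. run count times average runtime. The scoring rule and all names here are illustrative, not dCat's exact formula:

```python
def rank_templates(stats: list[dict]) -> list[dict]:
    """Order query templates by total time consumed (runs x avg runtime),
    which surfaces high-frequency, high-cost patterns first."""
    for s in stats:
        s["total_time_s"] = s["runs"] * s["avg_runtime_s"]
    return sorted(stats, key=lambda s: s["total_time_s"], reverse=True)

ranked = rank_templates([
    {"template": "rare_but_slow", "runs": 3, "avg_runtime_s": 120.0},
    {"template": "fast_but_constant", "runs": 5000, "avg_runtime_s": 0.4},
])
```

Note how the ordering inverts intuition: the query that is fast per run but executes thousands of times outranks the occasionally slow one, which is exactly the "frequency times cost" intersection described above.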
Once everything is structured around query patterns, attribution, and comparable time windows, investigations stop being open-ended.
You are no longer scanning query history hoping to notice something unusual. You are comparing patterns, seeing exactly which ones changed, and tying those changes back to specific workloads and users. From there, the next step becomes clear, whether that means fixing a query, adjusting a workload, or scaling capacity for the right reason.
That is the difference dCat is meant to make. Not more visibility, but a shorter path from change to explanation to action.
We’re walking through this exact approach in a live session, using a real warehouse scenario where costs increased and performance degraded while the team initially believed nothing had changed.
Instead of staying at the concept level, we'll show how the investigation actually unfolds, step by step, on that scenario.
If you’ve ever had to explain a Databricks cost spike or a slowdown and found yourself digging through system tables longer than expected, this will feel familiar.
Join us on April 29 at 19:00 CET to see the full workflow in action.