From Cost Spikes to Root Cause in Minutes: Building Clarity on Top of Databricks with dCat
When a data platform team relies heavily on Databricks SQL Warehouses, cost spikes and performance regressions are almost inevitable.
But understanding why they happen and what to fix remains one of the hardest problems.
Our client, a data-driven organization with growing Databricks usage, had strong observability tools in place. Yet when costs increased or performance dropped, their team still struggled to identify root causes quickly: every investigation was manual and took several hours.
They had the data, but couldn't reliably turn it into answers, decisions, and concrete actions.
In a short engagement, our SmartCat team helped them transform raw Databricks signals into a structured, repeatable system for cost and performance forensics using dCat.
Here’s how we approached it.
Problem
Despite having access to Databricks System Tables and dashboards, the client faced several challenges:
Lack of clear root-cause analysis – When costs spiked, teams had to manually investigate query logs, usage tables, and dashboards without a clear path to answers.
No attribution across users and workloads – It was difficult to determine which users, jobs, or dashboards were responsible for increased spend or degraded performance.
Fragmented analysis workflows – Each investigation required stitching together multiple tools, queries, and manual comparisons across time periods.
No prioritization of optimization efforts – There was no structured way to decide what to fix first.
Over-reliance on raw observability – Databricks exposed detailed metrics, but lacked an opinionated layer that translated them into actionable decisions.
The team needed a way to move from data → explanation → action, without reinventing the process every time.
Solution
We introduced dCat, an opinionated analytics layer built on top of Databricks System Tables, designed to make cost and performance issues explainable and actionable.
We started with a focused diagnostic phase, analyzing a specific incident window and comparing it against a baseline period.
From there, we built a structured approach to answering three key questions:
What changed?
Who or what caused it?
What should be fixed first?
Our approach included the following steps; an illustrative SQL sketch for each appears after the list:
Time-window comparison (“what changed?”) – We implemented a baseline vs incident analysis to identify which query patterns, users, and workloads changed in execution count and runtime.
Query fingerprinting into templates – Instead of analyzing raw SQL queries, we grouped them into repeatable query templates, making patterns visible and comparable at scale.
Attribution layer (who caused it?) – We mapped queries to users and sources (jobs, dashboards, notebooks, ad-hoc usage), creating clear accountability across the platform.
Cost showback modeling – We estimated warehouse spend and allocated it across query templates and users based on resource usage, enabling prioritization and ownership.
Optimization backlog generation (what to fix first?) – We created a ranked list of high-impact query patterns based on frequency, duration, and cost contribution.
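To make the fingerprinting step concrete, here is a minimal sketch against system.query.history. The column names follow the documented System Tables schema but can vary by release, and the query_templates view name and literal-stripping regexes are illustrative, not dCat's actual implementation.

```sql
-- Illustrative fingerprinting: collapse literals so that, e.g.,
-- "WHERE id = 42" and "WHERE id = 7" share one template.
CREATE OR REPLACE VIEW query_templates AS
SELECT
  regexp_replace(
    regexp_replace(statement_text, "'[^']*'", '<str>'),  -- string literals (ignores escaped quotes)
    '\\b[0-9]+\\b', '<num>')                             -- numeric literals
    AS query_template,
  executed_by,
  query_source,
  compute.warehouse_id            AS warehouse_id,
  start_time,
  total_duration_ms,
  total_task_duration_ms
FROM system.query.history;
```

Production fingerprinting usually also normalizes whitespace and IN-list lengths, but literal stripping alone already groups most repeated workloads.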
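With templates in place, the "what changed?" question becomes a join between two windows. The sketch below compares a baseline week against an incident week over the query_templates view from the previous sketch; the dates are placeholders for the client's real periods.

```sql
-- Baseline vs incident: which templates grew in runs or runtime?
WITH baseline AS (
  SELECT query_template,
         count(*)                        AS runs,
         sum(total_duration_ms) / 1000.0 AS total_s
  FROM query_templates
  WHERE start_time >= '2024-05-01' AND start_time < '2024-05-08'
  GROUP BY query_template
),
incident AS (
  SELECT query_template,
         count(*)                        AS runs,
         sum(total_duration_ms) / 1000.0 AS total_s
  FROM query_templates
  WHERE start_time >= '2024-05-08' AND start_time < '2024-05-15'
  GROUP BY query_template
)
SELECT
  coalesce(i.query_template, b.query_template)    AS query_template,
  coalesce(i.runs, 0)    - coalesce(b.runs, 0)    AS run_delta,
  coalesce(i.total_s, 0) - coalesce(b.total_s, 0) AS runtime_delta_s
FROM incident i
FULL OUTER JOIN baseline b
  ON i.query_template = b.query_template
ORDER BY runtime_delta_s DESC
LIMIT 20;
```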
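Attribution falls out of the query_source struct recorded with each statement. The field names below match the documented schema, but are worth verifying in your workspace.

```sql
-- Who and what runs each template: user plus originating source.
SELECT
  executed_by,
  CASE
    WHEN query_source.job_info.job_id IS NOT NULL THEN 'job'
    WHEN query_source.dashboard_id    IS NOT NULL THEN 'dashboard'
    WHEN query_source.notebook_id     IS NOT NULL THEN 'notebook'
    ELSE 'ad-hoc'
  END                             AS source_type,
  query_template,
  count(*)                        AS runs,
  sum(total_duration_ms) / 1000.0 AS total_s
FROM query_templates
GROUP BY 1, 2, 3
ORDER BY total_s DESC;
```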
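For showback, billed DBUs live in system.billing.usage at warehouse granularity, not per query, so the sketch below allocates each warehouse's daily DBUs across templates in proportion to task time. The allocation key is an assumption for illustration; dCat's actual model may weight differently.

```sql
-- Spread daily warehouse DBUs over templates by share of task time.
WITH warehouse_dbus AS (
  SELECT usage_metadata.warehouse_id AS warehouse_id,
         usage_date,
         sum(usage_quantity)         AS dbus
  FROM system.billing.usage
  WHERE usage_metadata.warehouse_id IS NOT NULL
  GROUP BY 1, 2
),
template_share AS (
  SELECT warehouse_id,
         date(start_time)            AS usage_date,
         query_template,
         sum(total_task_duration_ms) AS task_ms
  FROM query_templates
  GROUP BY 1, 2, 3
),
with_day_total AS (
  SELECT *,
         sum(task_ms) OVER (PARTITION BY warehouse_id, usage_date) AS day_ms
  FROM template_share
)
SELECT
  s.query_template,
  sum(w.dbus * s.task_ms / s.day_ms) AS est_dbus  -- proportional allocation
FROM with_day_total s
JOIN warehouse_dbus w
  ON s.warehouse_id = w.warehouse_id AND s.usage_date = w.usage_date
GROUP BY s.query_template
ORDER BY est_dbus DESC;
```

Multiplying est_dbus by the SKU rate from system.billing.list_prices turns this into a dollar showback.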
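Finally, the backlog is a ranking over the same view. The impact score below (total runtime, i.e. frequency × average duration) is one reasonable choice; a cost-weighted score from the showback model works the same way.

```sql
-- Optimization backlog: the templates worth fixing first.
SELECT
  query_template,
  count(*)                        AS runs,
  avg(total_duration_ms) / 1000.0 AS avg_s,
  sum(total_duration_ms) / 1000.0 AS impact_s  -- runs × avg duration
FROM query_templates
GROUP BY query_template
ORDER BY impact_s DESC
LIMIT 20;
```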
By the end of the engagement, the client had a structured, repeatable workflow for investigating and optimizing Databricks usage.
Results
The engagement produced a clear and actionable system for cost and performance management:
Immediate visibility into cost and performance drivers – Teams could quickly identify which query patterns and users were responsible for changes.
Structured “what changed?” investigation workflow – Incident analysis became repeatable, consistent, and significantly faster.
Clear prioritization of optimization efforts – High-impact query templates were ranked based on frequency and cost, enabling focused improvements.
Improved accountability across teams – Attribution made it possible to connect platform usage to specific users and workloads.
Foundation for continuous optimization – The system could be extended into a production-grade setup with pre-aggregated tables and dashboards, as sketched below.
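As one hypothetical example of that production-grade extension, a nightly rollup like the one below lets dashboards read a small pre-aggregated table instead of scanning raw query history; it reuses the query_templates view from the Solution sketches, and the table name and grain are illustrative.

```sql
-- Nightly pre-aggregation over the fingerprinted history.
CREATE OR REPLACE TABLE daily_template_stats AS
SELECT
  date(start_time)                AS run_date,
  query_template,
  executed_by,
  count(*)                        AS runs,
  sum(total_duration_ms) / 1000.0 AS total_s
FROM query_templates
GROUP BY 1, 2, 3;
```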
Most importantly, the client moved from reactive firefighting to proactive performance and cost optimization.
SmartTip
Don’t stop at observability.
Raw metrics are only the starting point – real value comes from building a layer that explains changes and prioritizes actions.
SmartFact
Databricks already provides detailed system-level observability, but without query fingerprinting and attribution, teams are often left analyzing individual queries instead of identifying patterns.
Turning queries into reusable templates is what makes large-scale optimization possible.
About the Client
Our client is a data-driven organization with a growing reliance on Databricks SQL Warehouses for analytics and business intelligence.
As usage scaled, Databricks became a significant cost driver, making performance optimization and cost visibility critical to their data platform strategy. As a Databricks partner, SmartCat brought in-depth knowledge of the platform.
Their goal: gain clear insight into cost drivers and establish a repeatable system for managing performance and spend.
Technologies Used
dCat leverages native Databricks capabilities and a lightweight analytics layer to transform raw system data into actionable insights.
Databricks SQL Warehouses
Databricks System Tables (usage, query history, billing)
SQL-based transformations and modeling
Query fingerprinting and workload attribution logic
Pre-aggregated tables and dashboards for performance and scalability
If your team is struggling to understand cost spikes, performance regressions, or simply doesn’t know where to start with optimization, dCat gives you a clear, structured way forward.
Register for our webinar and see how to turn Databricks observability into actionable decisions using dCat.