January 30, 2024
Stop Paying Snowflake for Failing Workloads
A step-by-step guide for identifying and managing failing workloads, and their associated costs.
Sahil Singla
Co-founder of Baselit
Understanding Failing Workloads in Snowflake
Failing workloads in Snowflake, defined as queries that either run into errors or fail to complete within a set timeframe, can become a significant cost factor, especially in large organizations with multiple teams. A simple failing workload that runs for 30 minutes daily on a Small warehouse can incur costs of up to $1,200 annually: a Small warehouse consumes 2 credits per hour, so 30 minutes a day adds up to roughly 365 credits a year.
While Snowflake does a great job of ensuring that these failing workloads do not compromise data integrity, they still consume credits. This blog explores why failed workloads are worth analyzing, how to identify them, and strategies for resolving them effectively.
Why Focus on Failing Workloads?
Addressing consistently failing workloads is a critical early step in optimizing Snowflake costs. In this environment, a task either adds full value upon successful completion or none at all. Consider an example: a query spends 60 seconds on a task before failing. Even though any work done in that time is rolled back, the cost for those 60 seconds is still incurred. This is why early detection and rectification of failing workloads matters.
How to Identify Failed Queries?
To identify failed queries, we will use the query history view, SNOWFLAKE.ACCOUNT_USAGE.QUERY_HISTORY. We’ll limit our analysis to the last 30 days of Snowflake usage; you can adjust this window based on your needs. When you run the following query, you get a table of failed queries with relevant details.
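Below is a minimal sketch of what this query can look like, assuming your role can read the ACCOUNT_USAGE schema; the column selection is illustrative rather than the original post’s exact SQL:

```sql
-- Failed queries from the last 30 days, sorted by raw execution time.
SELECT
    query_id,
    query_text,
    user_name,
    warehouse_name,
    warehouse_size,
    error_code,
    error_message,
    execution_time / 1000 AS execution_time_s,  -- EXECUTION_TIME is in milliseconds
    start_time
FROM snowflake.account_usage.query_history
WHERE execution_status = 'FAIL'
  AND start_time >= DATEADD('day', -30, CURRENT_TIMESTAMP())
ORDER BY execution_time DESC;
```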
Note: EXECUTION_STATUS can also take the value ‘INCIDENT’, but we’ll skip it in this analysis, as incidents are quite rare compared to ‘FAIL’.
We have sorted our results in descending order of execution time because it is a good proxy for the cost incurred. However, it doesn’t paint the complete picture: each increment in warehouse size doubles the spending rate in Snowflake. Therefore, in addition to execution time, we need to take into account the size of the warehouse on which the failed query ran.
Here’s how we will proceed:
Step 1: Create a CTE that assigns the credit units associated with each distinct warehouse size.
Step 2: Calculate the adjusted execution time by multiplying EXECUTION_TIME by the corresponding credit units of the warehouse. This adjustment standardizes costs as if all queries had run on warehouses of the same size.
Now, let’s calculate the results.
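Here’s a sketch of that calculation. The credit units assume Snowflake’s standard per-hour rates, which double with each warehouse size; extend the mapping if you use 5XL/6XL or Snowpark-optimized warehouses:

```sql
-- Step 1: map each warehouse size to its credits-per-hour rate.
WITH warehouse_credits AS (
    SELECT * FROM VALUES
        ('X-Small', 1), ('Small', 2), ('Medium', 4), ('Large', 8),
        ('X-Large', 16), ('2X-Large', 32), ('3X-Large', 64), ('4X-Large', 128)
        AS t (warehouse_size, credit_units)
)
-- Step 2: scale each failed query's execution time by those credit units.
SELECT
    qh.query_text,
    qh.user_name,
    qh.warehouse_name,
    qh.warehouse_size,
    qh.execution_time / 1000 * wc.credit_units AS adjusted_execution_time_s
FROM snowflake.account_usage.query_history qh
JOIN warehouse_credits wc
  ON qh.warehouse_size = wc.warehouse_size
WHERE qh.execution_status = 'FAIL'
  AND qh.start_time >= DATEADD('day', -30, CURRENT_TIMESTAMP())
ORDER BY adjusted_execution_time_s DESC;
```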
When you run the above query, you should get a table of failed queries ranked by their size-adjusted execution time.
Identifying Recurring Workloads that Fail
Let’s pause and take a step back. Our goal is to reduce costs. We’ve identified the costliest queries that have failed in the last 30 days. But if it was just a one-time runaway query, there’s not much that we can do about it. So, it makes sense to spend our time only on queries that might fail again in the future.
To do this, we’ll have to adjust our code slightly. We want to spot queries that run frequently, check how often they fail, and see when they last failed or succeeded. This information will give us practical remediation steps in the next section.
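A sketch of that rollup is below. The column names match the descriptions that follow; the 365/30 factor annualizes the 30-day window, and the $3-per-credit price is an assumption you should replace with your contracted rate:

```sql
WITH warehouse_credits AS (
    SELECT * FROM VALUES
        ('X-Small', 1), ('Small', 2), ('Medium', 4), ('Large', 8),
        ('X-Large', 16), ('2X-Large', 32), ('3X-Large', 64), ('4X-Large', 128)
        AS t (warehouse_size, credit_units)
),
runs AS (
    SELECT qh.query_text,
           qh.user_name,
           qh.execution_status,
           qh.start_time,
           -- hours of runtime, scaled by the warehouse's credits per hour
           qh.execution_time / 1000 / 3600 * wc.credit_units AS effective_hours
    FROM snowflake.account_usage.query_history qh
    JOIN warehouse_credits wc ON qh.warehouse_size = wc.warehouse_size
    WHERE qh.start_time >= DATEADD('day', -30, CURRENT_TIMESTAMP())
)
SELECT
    query_text,
    user_name,
    COUNT_IF(execution_status = 'FAIL')    AS total_fails,
    COUNT_IF(execution_status = 'SUCCESS') AS total_success,
    ROUND(100 * total_fails / NULLIF(total_fails + total_success, 0), 1) AS percentage_fails,
    MAX(IFF(execution_status = 'FAIL', start_time, NULL))     AS last_fail_timestamp,
    DATEDIFF('day', last_fail_timestamp, CURRENT_TIMESTAMP()) AS days_since_last_fail,
    MAX(IFF(execution_status = 'SUCCESS', start_time, NULL))  AS last_success_timestamp,
    SUM(IFF(execution_status = 'FAIL', effective_hours, 0))   AS effective_execution_time,
    ROUND(effective_execution_time * 365 / 30 * 3)            AS effective_annual_cost  -- assumes ~$3/credit
FROM runs
GROUP BY query_text, user_name;
```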
Let’s understand each column in the resulting table.
QUERY_TEXT: Exact text of the query.
USER_NAME: User who ran the query.
TOTAL_FAILS: Total number of times a particular query has failed in the last 30 days.
TOTAL_SUCCESS: Total number of times a particular query has succeeded in the last 30 days.
PERCENTAGE_FAILS: Percentage of failures against total runs in the last 30 days.
LAST_FAIL_TIMESTAMP: Timestamp when the query last failed.
DAYS_SINCE_LAST_FAIL: Number of days from today since the query last failed.
LAST_SUCCESS_TIMESTAMP: Timestamp when the query last succeeded.
EFFECTIVE_EXECUTION_TIME: Total execution time of failed runs, each multiplied by the credit units of the warehouse it ran on.
EFFECTIVE_ANNUAL_COST: A rough estimate of the total cost a failing workload would incur over a year if run independently of other workloads.
Filtering out the Noise
We have applied several filters to target the right workloads; a concrete version is sketched in code after this list. You can adjust these numbers as appropriate.
LAST_SUCCESS_TIMESTAMP can be null or older than 10 days.
PERCENTAGE_FAILS should be at least 50%.
TOTAL_FAILS should be at least 3.
DAYS_SINCE_LAST_FAIL can be at most 10 days.
EFFECTIVE_ANNUAL_COST should be at least $200.
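Expressed concretely, assuming the rollup from the previous query is wrapped in a CTE (the name failing_workloads is ours, for illustration):

```sql
-- Thresholds from the list above; tune them for your account.
SELECT *
FROM failing_workloads
WHERE (last_success_timestamp IS NULL
       OR last_success_timestamp < DATEADD('day', -10, CURRENT_TIMESTAMP()))
  AND percentage_fails >= 50
  AND total_fails >= 3
  AND days_since_last_fail <= 10
  AND effective_annual_cost >= 200
ORDER BY effective_annual_cost DESC;
```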
With these filters, we have tried to cover scenarios where there might be a good reason to not worry about certain failing workloads, such as:
A particular workload was failing, but the user moved it to a larger warehouse and it now succeeds. This means the user has actively looked into it and remediated it.
A workload was failing 30 days ago but hasn’t been run since. We also need to filter out these workloads. The frequency of such workloads is low enough that their contribution to costs would be insignificant.
Analyzing the Results
Here’s what we have so far: a list of all queries that failed at least 3 times in the last month (including at least once in the last 10 days), and did not succeed even once in the last 10 days. Moreover, they failed at least 50% of the time, and cost more than $200 annually. We can now be confident that these are prime targets for us to address.
We now know the USER_NAME of the user who ran the query. The assumption here is that the same user fires the query each time, more often than not through a tool like dbt, Airflow, or Looker.
As account admins, usually the only way to remediate this is to identify the team that owns a particular query and notify them. Here are two ways to identify the right team:
For organizations that allocate different warehouses to different teams, WAREHOUSE_NAME is sufficient to identify the team responsible for a workload.
If warehouses are shared across teams, identifying the owner can be tricky. Copy parts of the QUERY_TEXT, search for it on GitHub, and identify the team responsible for it. You can also use the git blame command to identify the author more granularly.
Proactive communication and collaboration with the teams involved are key to successful remediation of failing workloads.