January 30, 2024

Stop Paying Snowflake for Failing Workloads

A step-by-step guide for identifying and managing failing workloads, and their associated costs.

Sahil Singla

Co-founder of Baselit

Understanding Failing Workloads in Snowflake

Failing workloads in Snowflake, defined as queries that either error out or fail to complete within a set timeframe, can become a significant cost factor, especially in large organizations with many teams. A single failing workload that runs for 30 minutes daily on a Small warehouse can incur costs of up to $1,200 annually.
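Here is the back-of-envelope arithmetic behind that annual figure, as a short Python sketch. The $3/credit compute price is an assumption; actual pricing varies by Snowflake edition and cloud region.

```python
# Back-of-envelope: a query failing daily after 30 minutes on a Small
# warehouse. The $3/credit compute price is an assumption; actual pricing
# varies by Snowflake edition and cloud region.
CREDITS_PER_HOUR = 2     # Small warehouse, STANDARD type
PRICE_PER_CREDIT = 3.0   # USD, assumed
HOURS_PER_DAY = 0.5      # 30 minutes of runtime daily

annual_cost = CREDITS_PER_HOUR * HOURS_PER_DAY * PRICE_PER_CREDIT * 365
print(f"${annual_cost:,.0f} per year")  # prints: $1,095 per year
```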

While Snowflake does a great job of ensuring these failing workloads do not compromise data integrity, they still consume credits. This blog explores why failed workloads are worth analyzing, how to identify them, and strategies for effective resolution.

Why Focus on Failing Workloads?

Addressing consistently failing workloads is a critical step in the early stages of Snowflake optimization. In this environment, a task either adds full value upon successful completion or none at all. Consider the following example: Query 3 in the diagram below spends 60 seconds on a task before failing. Although any work done in that time is rolled back, the cost of those 60 seconds is still incurred. This illustrates the importance of early detection and rectification of failing workloads.

How to Identify Failed Queries?

To identify the failed queries, we will use the Query History table, SNOWFLAKE.ACCOUNT_USAGE.QUERY_HISTORY. We’ll limit our analysis to the last 30 days of Snowflake usage; you can adjust this window based on your needs. When you run the following query, you get a table of failed queries with relevant details.

-- Sorted in descending order of execution time
SELECT QUERY_TEXT, 
       EXECUTION_STATUS, 
       WAREHOUSE_SIZE,
       ERROR_CODE, 
       ERROR_MESSAGE, 
       EXECUTION_TIME, 
       USER_NAME
FROM SNOWFLAKE.ACCOUNT_USAGE.QUERY_HISTORY
WHERE EXECUTION_STATUS = 'FAIL'
AND START_TIME BETWEEN DATEADD('day', -30, CURRENT_DATE()) AND CURRENT_DATE()
ORDER BY EXECUTION_TIME DESC

Note: EXECUTION_STATUS can also take the value 'INCIDENT', but we’ll skip it in this analysis, as incidents are quite rare compared to 'FAIL'.

We have sorted our results in descending order of execution time because it is a good proxy for the cost incurred. However, it doesn't paint the complete picture: each increment in warehouse size doubles the spending rate in Snowflake. Therefore, in addition to execution time, we need to take into account the warehouse size on which the failed query ran.

Here’s how we will proceed:

Step 1: Create a CTE that assigns the credit units associated with each distinct warehouse size.

Step 2: Calculate the adjusted execution time by multiplying EXECUTION_TIME by the warehouse’s credit unit. This standardizes the comparison, as if every query had run on an X-Small warehouse.

Now, let’s calculate the results.

-- Step 1: Create a CTE for the WAREHOUSE_CREDITS_MAP.
WITH WAREHOUSE_CREDITS_MAP AS (
    SELECT * FROM (
        VALUES
            ('X-Small', 'STANDARD', 1),
            ('Small', 'STANDARD', 2),
            ('Medium', 'STANDARD', 4),
            ('Large', 'STANDARD', 8),
            ('X-Large', 'STANDARD', 16),
            ('2X-Large', 'STANDARD', 32),
            ('3X-Large', 'STANDARD', 64),
            ('4X-Large', 'STANDARD', 128),
            ('5X-Large', 'STANDARD', 256),
            ('6X-Large', 'STANDARD', 512),
            ('Medium', 'SNOWPARK-OPTIMIZED', 6),
            ('Large', 'SNOWPARK-OPTIMIZED', 12),
            ('X-Large', 'SNOWPARK-OPTIMIZED', 24),
            ('2X-Large', 'SNOWPARK-OPTIMIZED', 48),
            ('3X-Large', 'SNOWPARK-OPTIMIZED', 96),
            ('4X-Large', 'SNOWPARK-OPTIMIZED', 192),
            ('5X-Large', 'SNOWPARK-OPTIMIZED', 384),
            ('6X-Large', 'SNOWPARK-OPTIMIZED', 768)
    ) AS t (WAREHOUSE_SIZE, WAREHOUSE_TYPE, CREDITS_PER_HOUR)
)
-- Step 2: Multiply the credits based on the warehouse_size to calculate
-- the effective execution time.
SELECT
    qh.QUERY_TEXT, 
    qh.EXECUTION_STATUS, 
    qh.WAREHOUSE_SIZE,
    qh.WAREHOUSE_TYPE,
    qh.ERROR_CODE, 
    qh.ERROR_MESSAGE, 
    qh.EXECUTION_TIME,
    qh.EXECUTION_TIME * wcm.CREDITS_PER_HOUR AS EFFECTIVE_EXECUTION_TIME,
    qh.USER_NAME
FROM
    SNOWFLAKE.ACCOUNT_USAGE.QUERY_HISTORY qh
JOIN
    WAREHOUSE_CREDITS_MAP wcm
ON
    wcm.WAREHOUSE_SIZE = qh.WAREHOUSE_SIZE
    AND wcm.WAREHOUSE_TYPE = qh.WAREHOUSE_TYPE
WHERE
    qh.EXECUTION_STATUS = 'FAIL'
    AND qh.START_TIME BETWEEN DATEADD('day', -30, CURRENT_DATE()) AND CURRENT_DATE()
ORDER BY EFFECTIVE_EXECUTION_TIME DESC

When you run the above query, you should get something that looks like the table below.

Identifying Recurring Workloads that Fail

Let’s pause and take a step back. Our goal is to reduce costs. We’ve identified the costliest queries that have failed in the last 30 days. But if it was just a one-time runaway query, there’s not much that we can do about it. So, it makes sense to spend our time only on queries that might fail again in the future.

To do this, we’ll adjust our code slightly. We want to spot queries that run frequently, check how often they fail, and see when they last failed or succeeded. This information will inform practical remediation steps in the next section.

-- Create a CTE for the WAREHOUSE_CREDITS_MAP
WITH WAREHOUSE_CREDITS_MAP AS (
    SELECT * FROM (
        VALUES
            ('X-Small', 'STANDARD', 1),
            ('Small', 'STANDARD', 2),
            ('Medium', 'STANDARD', 4),
            ('Large', 'STANDARD', 8),
            ('X-Large', 'STANDARD', 16),
            ('2X-Large', 'STANDARD', 32),
            ('3X-Large', 'STANDARD', 64),
            ('4X-Large', 'STANDARD', 128),
            ('5X-Large', 'STANDARD', 256),
            ('6X-Large', 'STANDARD', 512),
            ('Medium', 'SNOWPARK-OPTIMIZED', 6),
            ('Large', 'SNOWPARK-OPTIMIZED', 12),
            ('X-Large', 'SNOWPARK-OPTIMIZED', 24),
            ('2X-Large', 'SNOWPARK-OPTIMIZED', 48),
            ('3X-Large', 'SNOWPARK-OPTIMIZED', 96),
            ('4X-Large', 'SNOWPARK-OPTIMIZED', 192),
            ('5X-Large', 'SNOWPARK-OPTIMIZED', 384),
            ('6X-Large', 'SNOWPARK-OPTIMIZED', 768)
    ) AS t (WAREHOUSE_SIZE, WAREHOUSE_TYPE, CREDITS_PER_HOUR)
)
-- EXECUTION_TIME is in milliseconds
-- We have assumed a compute price of $3/credit.
SELECT
    ANY_VALUE(qh.QUERY_TEXT) AS QUERY_TEXT,
    qh.QUERY_HASH,
    ANY_VALUE(qh.USER_NAME) AS USER_NAME,
    COUNT(CASE WHEN qh.EXECUTION_STATUS = 'FAIL' THEN 1 END) AS TOTAL_FAILS,
    COUNT(CASE WHEN qh.EXECUTION_STATUS = 'SUCCESS' THEN 1 END) AS TOTAL_SUCCESS,
    -- Calculating percentage of failed queries
    100.0 * COUNT(CASE WHEN qh.EXECUTION_STATUS = 'FAIL' THEN 1 END) / NULLIF(COUNT(CASE WHEN qh.EXECUTION_STATUS IN ('FAIL', 'SUCCESS') THEN 1 END), 0) AS PERCENTAGE_FAILS,
    MAX(CASE WHEN qh.EXECUTION_STATUS = 'FAIL' THEN qh.START_TIME END) AS LAST_FAIL_TIMESTAMP,
    DATEDIFF(day, LAST_FAIL_TIMESTAMP, CURRENT_DATE()) AS DAYS_SINCE_LAST_FAIL,
    MAX(CASE WHEN qh.EXECUTION_STATUS = 'SUCCESS' THEN qh.START_TIME END) AS LAST_SUCCESS_TIMESTAMP,
    SUM(qh.EXECUTION_TIME * wcm.CREDITS_PER_HOUR) AS EFFECTIVE_EXECUTION_TIME,
    -- ms -> credit-hours, x $3/credit, x12 to annualize the 30-day window
    EFFECTIVE_EXECUTION_TIME / (3600 * 1000) * 3 * 12 AS EFFECTIVE_ANNUAL_COST
FROM
    SNOWFLAKE.ACCOUNT_USAGE.QUERY_HISTORY qh
JOIN
    WAREHOUSE_CREDITS_MAP wcm
ON
    wcm.WAREHOUSE_SIZE = qh.WAREHOUSE_SIZE AND wcm.WAREHOUSE_TYPE = qh.WAREHOUSE_TYPE
WHERE
    qh.START_TIME BETWEEN DATEADD('day', -30, CURRENT_DATE()) AND CURRENT_DATE()
GROUP BY
    qh.QUERY_HASH
HAVING
    (LAST_SUCCESS_TIMESTAMP IS NULL OR LAST_SUCCESS_TIMESTAMP <= DATEADD('day', -10, CURRENT_DATE()))
    AND DAYS_SINCE_LAST_FAIL <= 10
    AND PERCENTAGE_FAILS >= 50
    AND TOTAL_FAILS >= 3
    AND EFFECTIVE_ANNUAL_COST >= 200
ORDER BY
    EFFECTIVE_ANNUAL_COST DESC

Let’s understand each column in the resulting table.

  • QUERY_TEXT : Exact text of the query.

  • QUERY_HASH : Hash of the query text, used to group repeated runs of the same query.

  • USER_NAME : User who ran the query.

  • TOTAL_FAILS : Total number of times a particular query has failed in the last 30 days.

  • TOTAL_SUCCESS : Total number of times a particular query has succeeded in the last 30 days.

  • PERCENTAGE_FAILS : Percentage of failure against total runs in the last 30 days.

  • LAST_FAIL_TIMESTAMP : Timestamp when the query last failed.

  • DAYS_SINCE_LAST_FAIL : Number of days from today since the query last failed.

  • LAST_SUCCESS_TIMESTAMP : Timestamp when the query last succeeded.

  • EFFECTIVE_EXECUTION_TIME : Total execution time, weighted by the credits-per-hour of the warehouse each run executed on.

  • EFFECTIVE_ANNUAL_COST : A rough estimate of the total cost a failed workload would incur in the year if run independently of other workloads.
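The cost arithmetic can be sanity-checked outside the warehouse. Here is a minimal Python sketch of the same calculation; the example run counts and warehouse size are made up, while the $3/credit price and the x12 annualization mirror the query’s assumptions.

```python
# Mirrors the EFFECTIVE_ANNUAL_COST arithmetic from the query above.
# EXECUTION_TIME is in milliseconds; $3/credit and the x12 annualization
# of the 30-day window follow the query's assumptions.
PRICE_PER_CREDIT = 3.0
MS_PER_HOUR = 3600 * 1000

def effective_annual_cost(runs):
    """runs: (execution_time_ms, credits_per_hour) pairs over a 30-day window."""
    effective_ms = sum(ms * credits for ms, credits in runs)
    credits_consumed = effective_ms / MS_PER_HOUR
    return credits_consumed * PRICE_PER_CREDIT * 12  # annualize 30 days -> 1 year

# Hypothetical example: 10 failures of 30 minutes each on a Medium warehouse
print(effective_annual_cost([(30 * 60 * 1000, 4)] * 10))  # prints: 720.0
```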

Filtering out the Noise

We have applied several filters to target the right workloads. You can adjust these numbers as appropriate.

  1. LAST_SUCCESS_TIMESTAMP can be null or older than 10 days.

  2. PERCENTAGE_FAILS should be at least 50%.

  3. TOTAL_FAILS should be at least 3.

  4. DAYS_SINCE_LAST_FAIL can be a maximum of 10 days.

  5. EFFECTIVE_ANNUAL_COST should be at least $200.

With these filters, we have tried to cover scenarios where there might be a good reason to not worry about certain failing workloads, such as:

  1. A particular workload was failing, but the user moved it to a larger warehouse and it now succeeds. This means the user has actively looked into it and remediated it.

  2. A workload was failing 30 days ago but hasn’t been run since. We filter these out too; such workloads run infrequently enough that their contribution to costs is insignificant.
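For experimentation, the HAVING thresholds can be mirrored in plain Python, which makes them easy to tweak. The dictionary keys below are assumed to match the query’s column aliases.

```python
from datetime import datetime, timedelta

# Sketch of the HAVING filters in plain Python. Keys are assumed to
# mirror the column aliases produced by the query above.
def is_actionable(row, today=None):
    """row: dict of aggregate columns for one QUERY_HASH over the last 30 days."""
    today = today or datetime.now()
    succeeded = row["LAST_SUCCESS_TIMESTAMP"]
    # Filter 1: never succeeded, or last success is older than 10 days
    stale_success = succeeded is None or succeeded <= today - timedelta(days=10)
    return (stale_success
            and row["DAYS_SINCE_LAST_FAIL"] <= 10       # failed recently
            and row["PERCENTAGE_FAILS"] >= 50           # fails most of the time
            and row["TOTAL_FAILS"] >= 3                 # recurring, not one-off
            and row["EFFECTIVE_ANNUAL_COST"] >= 200)    # worth the effort
```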

Analyzing the Results

Here’s what we have so far: a list of all queries that failed at least 3 times in the last month (including at least once in the last 10 days), and did not succeed even once in the last 10 days. Moreover, they failed at least 50% of the time, and cost more than $200 annually. We can now be confident that these are prime targets for us to address.

We now know the USER_NAME of the user who ran the query. The assumption here is that the same user fires the query each time, more often than not via a tool like dbt, Airflow, or Looker.

As an Account Admin, usually the only way to remediate this is to identify the team that ran a particular query and notify them. Here are two ways to identify the right team:

  1. For organizations that allocate different warehouses to different teams, WAREHOUSE_NAME is sufficient to identify the team responsible for a workload.

  2. If the warehouses are shared across teams, identifying the owner can be tricky. Copy parts of the QUERY_TEXT, search for it on GitHub, and identify the team responsible. You can also use the git blame command to pin down the individual author.
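For the GitHub route, here is a rough sketch of the search from a local repo checkout. The fragment orders_daily_rollup and the paths are hypothetical placeholders; substitute a distinctive substring of your own QUERY_TEXT.

```shell
# Search a repo checkout for a distinctive fragment of the failing
# QUERY_TEXT. "orders_daily_rollup" is a hypothetical placeholder.
FRAGMENT='orders_daily_rollup'
grep -rl --include='*.sql' "$FRAGMENT" . || echo "no match under $(pwd)"
# Inside a git checkout, follow up with:
#   git log -S "$FRAGMENT" --oneline    # commits that added/removed the fragment
#   git blame path/to/file.sql          # last author of each line
```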

Proactive communication and collaboration with the teams involved are key to successful remediation of failing workloads.

Want to hear the latest on Snowflake optimization?

Sign up for our newsletter that we send out (roughly) every two weeks.


Start saving today.

Don't let Snowflake costs drag your team down.

© Merse, Inc. dba Baselit. All rights reserved.
