January 24, 2024
How to Cut Storage Costs in Snowflake
A comprehensive guide for optimizing Snowflake storage.
Sahil Singla
Co-founder of Baselit
In this blog, I’ll discuss some important ways in which you can start optimizing your Snowflake storage. We’ll start by looking at the key concepts of storage in Snowflake, including the types of storage and their organization, and examine the lifecycle of micro-partitions. Understanding these elements is fundamental to managing your data storage costs. Then, we’ll look at some SQL queries and Python scripts that will help you identify and reduce under-utilized storage, especially in large tables.
Snowflake’s Storage Architecture
Types of Storage
Snowflake uses two primary types of storage: stage storage and table storage. Stage storage temporarily holds data from external sources, such as S3, Azure Blob Storage, and GCS, before it is loaded into Snowflake. Table storage, on the other hand, holds the data stored in Snowflake tables themselves. Snowflake offers several types of tables (a short example follows the list below):
Permanent Tables: These tables are created by users and are intended for long-term storage in Snowflake.
Temporary Tables: These tables exist only for the duration of the session in which they are created.
Transient Tables: Similar to permanent tables in persistence, transient tables do not include fail-safe and offer limited time travel capabilities, either 0 or 1 day.
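As a quick illustration, the table type is fixed at creation time; the table and column names here are hypothetical:

-- Hypothetical tables showing how each table type is declared.
CREATE TABLE orders (id INT, amount NUMBER(10, 2));            -- permanent: time travel plus a 7-day fail-safe
CREATE TEMPORARY TABLE session_scratch (id INT);               -- dropped automatically when the session ends
CREATE TRANSIENT TABLE staging_events (id INT, payload VARIANT)
    DATA_RETENTION_TIME_IN_DAYS = 1;                           -- at most 1 day of time travel, no fail-safe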
Storage Organization
To understand the costs associated with table storage, it’s essential to know how Snowflake organizes its storage. A table in Snowflake is composed of numerous micro-partitions. Each micro-partition contains about 16 MB of compressed data, which can be equivalent to approximately 50 to 500 MB of uncompressed data. These micro-partitions are structured using a PAX or hybrid columnar format, providing an efficient way of storing data. The figure below is a visual representation of how data is stored in each micro-partition.
In Snowflake, column data is typically compressed to save space. Each micro-partition file also has a header at the top that stores the offsets of each column’s data, which enables efficient query processing by letting the system skip columns it doesn’t need.
Each micro-partition in Snowflake is immutable, meaning that any update within a micro-partition necessitates the creation of a new micro-partition. This new partition replicates all existing data from the original, with the updated entry included. Consequently, the total active bytes associated with the table are updated to include this new micro-partition. Meanwhile, the older version of the partition enters a ‘time travel’ state.
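A small, hypothetical example of this behaviour, reusing the orders table from above: an update rewrites the affected micro-partition, and the prior version stays reachable through time travel.

-- Rewrites the micro-partition(s) containing the matching row; the older
-- partition version moves into the time travel state.
UPDATE orders SET amount = 0 WHERE id = 42;

-- Read the table as it looked five minutes ago, before the update.
SELECT * FROM orders AT(OFFSET => -60 * 5) WHERE id = 42;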
Storage Lifecycle
Snowflake incorporates mechanisms known as ‘time travel’ and ‘fail-safe.’ Time travel allows data to be retrieved for a specified duration for recovery purposes. When the time travel period ends, the fail-safe period begins, which extends the availability of the data for recovery. Once the fail-safe period expires, the data is permanently deleted and no longer incurs charges. The duration of both periods depends on the table type and the Snowflake edition: permanent tables get up to 1 day of time travel on the Standard edition (up to 90 days on Enterprise and above) followed by a 7-day fail-safe period, while transient and temporary tables get at most 1 day of time travel and no fail-safe.
Each micro-partition in Snowflake undergoes a lifecycle that starts as part of a table’s active storage. It then progresses through the stages of time travel and fail-safe, before being permanently deleted. This systematic approach enables Snowflake to efficiently restore tables by simply referencing and reactivating the relevant older micro-partitions. It’s important to understand that Snowflake imposes storage charges for micro-partitions throughout their entire lifecycle, which includes the stages of active storage, time travel, and fail-safe.
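To see how a table’s bytes are spread across these stages, the ACCOUNT_USAGE.TABLE_STORAGE_METRICS view (used again later in this post) breaks them out per table. A minimal example, with a hypothetical database name:

-- Bytes per lifecycle stage for each table in a (hypothetical) database.
SELECT table_schema, table_name, active_bytes, time_travel_bytes, failsafe_bytes
FROM snowflake.account_usage.table_storage_metrics
WHERE table_catalog = 'MY_DB'
  AND deleted = FALSE;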
Cloning
Snowflake offers a cloning capability for tables, which creates a new table that references all the micro-partitions of an existing base table. This process, known as ‘zero copy cloning,’ doesn’t duplicate the micro-partitions; the new table merely points to the existing ones. Importantly, these micro-partitions are not charged twice: storage costs are incurred only once, as part of the base table, and the partitions drop out of the base table’s storage calculations only once no clone references them any longer.
Furthermore, if the base table is deleted and the fail-safe period for its micro-partitions has passed, these partitions are still tracked under ‘retained for clone bytes.’ This accounting continues as long as at least one clone table uses these micro-partitions in its active storage, or until their own fail-safe period ends.
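A minimal sketch of cloning and of where these retained bytes surface (table names hypothetical):

-- The clone initially references the same micro-partitions as the base table;
-- nothing is copied and nothing is charged twice.
CREATE TABLE orders_clone CLONE orders;

-- Micro-partitions kept only because a clone still references them
-- show up in RETAINED_FOR_CLONE_BYTES.
SELECT table_name, active_bytes, retained_for_clone_bytes, clone_group_id
FROM snowflake.account_usage.table_storage_metrics
WHERE retained_for_clone_bytes > 0;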
Optimizing Storage Costs in Snowflake
Snowflake’s storage costs are calculated based on the bytes stored in micro-partitions, encompassing active storage, time travel, fail-safe, and bytes retained for cloning. The cost calculation involves taking hourly snapshots of data usage, averaging these over a month, and then applying the storage provider’s rates, which are typically around $23 per terabyte (TB) per month.
Query for Calculating Storage Cost
To determine the specific storage cost contribution of a particular table, the following SQL query can be used:
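A sketch along those lines, based on the ACCOUNT_USAGE.TABLE_STORAGE_METRICS view; the $23 per TB rate is an assumption and should be replaced with your contracted rate:

-- Approximate monthly storage cost per table, assuming ~$23 per TB per month.
-- Filter on table_catalog / table_schema / table_name to look at a single table.
SELECT
    table_catalog,
    table_schema,
    table_name,
    active_bytes,
    time_travel_bytes,
    failsafe_bytes,
    retained_for_clone_bytes,
    (active_bytes + time_travel_bytes + failsafe_bytes + retained_for_clone_bytes)
        / POWER(1024, 4) * 23 AS estimated_monthly_cost_usd
FROM snowflake.account_usage.table_storage_metrics
WHERE deleted = FALSE
ORDER BY estimated_monthly_cost_usd DESC;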
Identifying and Reducing Unused Storage
Finding Unused Tables: A key strategy for reducing storage costs is identifying unused tables, i.e. tables that haven’t been queried recently. This can be accomplished using Snowflake’s ‘access_history’ view, which provides metadata on table access; note that this view is not available in the Standard edition of Snowflake. The SQL query examines table access, modification, and creation details, along with the associated costs, over the past 30 days, integrating data from several Snowflake views to calculate per-table cost and usage.
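A simplified sketch of that approach, joining ACCESS_HISTORY with TABLE_STORAGE_METRICS; it only checks read access, and the 30-day window and $23 per TB rate are assumptions:

-- Tables with no recorded read access in the last 30 days, with estimated monthly cost.
-- Requires Enterprise edition or above for ACCESS_HISTORY.
WITH recent_access AS (
    SELECT DISTINCT obj.value:"objectId"::NUMBER AS table_id
    FROM snowflake.account_usage.access_history ah,
         LATERAL FLATTEN(input => ah.base_objects_accessed) obj
    WHERE ah.query_start_time >= DATEADD(day, -30, CURRENT_TIMESTAMP())
      AND obj.value:"objectDomain"::STRING = 'Table'
)
SELECT
    m.table_catalog,
    m.table_schema,
    m.table_name,
    m.table_created,
    (m.active_bytes + m.time_travel_bytes + m.failsafe_bytes + m.retained_for_clone_bytes)
        / POWER(1024, 4) * 23 AS estimated_monthly_cost_usd
FROM snowflake.account_usage.table_storage_metrics m
LEFT JOIN recent_access a
       ON a.table_id = m.id
WHERE m.deleted = FALSE
  AND a.table_id IS NULL      -- no reads recorded in the window
ORDER BY estimated_monthly_cost_usd DESC;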
Managing Large Tables: When dealing with large, infrequently accessed tables, such as clickstream data tables, it becomes crucial to eliminate portions of the data that are no longer needed. The Python script below, executable via Snowpark, analyzes partition usage in these large tables to identify underutilized sections. It works by examining the most recent queries against a specified table (limited to a 14-day window due to the availability of query operator stats) and calculating the percentage of partitions scanned by each query, which offers insight into partition utilization.
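A condensed Snowpark sketch of this analysis; the table name, lookback window, and query cap are placeholders, and it assumes the running role can read ACCOUNT_USAGE.QUERY_HISTORY and call GET_QUERY_OPERATOR_STATS for the queries involved:

# Sketch: estimate what fraction of a table's micro-partitions recent queries
# actually scan, using per-operator pruning statistics.
import json

import numpy as np
from snowflake.snowpark import Session

TABLE_NAME = "MY_DB.MY_SCHEMA.CLICKSTREAM_EVENTS"   # placeholder
LOOKBACK_DAYS = 14    # operator stats are only available for ~14 days
MAX_QUERIES = 100     # cap on the number of recent queries analyzed


def partition_scan_percentiles(session: Session, table_name: str) -> None:
    short_name = table_name.split(".")[-1].upper()
    # Recent successful queries whose text mentions the table.
    queries = session.sql(f"""
        SELECT query_id
        FROM snowflake.account_usage.query_history
        WHERE start_time >= DATEADD(day, -{LOOKBACK_DAYS}, CURRENT_TIMESTAMP())
          AND execution_status = 'SUCCESS'
          AND CONTAINS(UPPER(query_text), '{short_name}')
        ORDER BY start_time DESC
        LIMIT {MAX_QUERIES}
    """).collect()

    scan_pcts = []
    for row in queries:
        # Per-operator stats for one query; TableScan operators expose pruning
        # info (partitions_scanned / partitions_total). A stricter version would
        # also match operator_attributes:"table_name" against the target table.
        qid = row["QUERY_ID"]
        ops = session.sql(
            "SELECT operator_type, operator_statistics "
            f"FROM TABLE(GET_QUERY_OPERATOR_STATS('{qid}'))"
        ).collect()
        for op in ops:
            if op["OPERATOR_TYPE"] != "TableScan":
                continue
            pruning = json.loads(op["OPERATOR_STATISTICS"]).get("pruning", {})
            total = pruning.get("partitions_total")
            if total:
                scan_pcts.append(100.0 * pruning.get("partitions_scanned", 0) / total)

    if not scan_pcts:
        print("No TableScan statistics found in the lookback window.")
        return
    for p in (10, 50, 90, 99, 100):
        print(f"p{p}: {np.percentile(scan_pcts, p):.1f}% of partitions scanned")

Calling partition_scan_percentiles(session, TABLE_NAME) prints the p10 through p100 values of the per-query scan percentage, which feed directly into the discussion below.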
The script’s output includes percentile values (p10, p50, p90, p99, p100) which help in understanding data access patterns. A low p99 value in a high-cost table suggests that a significant portion of the table is rarely accessed. Further investigation into the query operator stats, particularly focusing on the ‘Filter’ operator type, can provide deeper insights into whether specific data partitions (like old date ranges) are not queried.
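For instance, the operator-level attributes for a single query can be pulled directly; the query ID below is a placeholder:

-- Inspect scan and filter operators for one query to see which predicates
-- (for example, old date-range filters) were applied and how well they pruned.
SELECT operator_id, operator_type, operator_attributes, operator_statistics
FROM TABLE(GET_QUERY_OPERATOR_STATS('<query_id>'))
WHERE operator_type IN ('TableScan', 'Filter');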
Tables with a high percentage of unaccessed data, especially over extended periods, are prime candidates for archiving or deleting old partitions, which helps reduce storage costs.