January 24, 2024
How to Cut Storage Costs in Snowflake
A comprehensive guide for optimizing Snowflake storage.
Sahil Singla
Co-founder of Baselit
In this blog, I’ll discuss some important ways in which you can start optimizing your Snowflake storage. We’ll start by looking at the key concepts of storage in Snowflake, including the types of storage and their organization, and examine the lifecycle of micro-partitions. Understanding these elements is fundamental to managing your data storage costs. Then, we’ll look at some SQL queries and Python scripts that will help you identify and reduce under-utilized storage, especially in large tables.
Snowflake’s Storage Architecture
Types of Storage
Snowflake uses two primary types of storage: stage storage and table storage. Stage storage temporarily holds data from external sources, such as S3, Azure Blob Storage, and GCS, before it is loaded into Snowflake. Table storage, on the other hand, holds the data stored in Snowflake tables themselves. Snowflake offers several types of tables (a short example follows the list below):
Permanent Tables: These tables are created by users and are intended for long-term storage in Snowflake.
Temporary Tables: These tables exist only for the duration of the session in which they are created.
Transient Tables: Similar to permanent tables in persistence, transient tables do not include fail-safe and offer limited time travel capabilities, either 0 or 1 day.
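As a quick illustration, the table type is fixed at creation time; the table and column names here are hypothetical:

-- Hypothetical tables showing how each table type is declared.
CREATE TABLE orders (id INT, amount NUMBER(10, 2));            -- permanent: time travel plus a 7-day fail-safe
CREATE TEMPORARY TABLE session_scratch (id INT);               -- dropped automatically when the session ends
CREATE TRANSIENT TABLE staging_events (id INT, payload VARIANT)
    DATA_RETENTION_TIME_IN_DAYS = 1;                           -- at most 1 day of time travel, no fail-safe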
Storage Organization
To understand the costs associated with table storage, it’s essential to know how Snowflake organizes its storage. A table in Snowflake is composed of numerous micro-partitions. Each micro-partition contains about 16 MB of compressed data, which can be equivalent to approximately 50 to 500 MB of uncompressed data. These micro-partitions are structured using a PAX or hybrid columnar format, providing an efficient way of storing data. The figure below is a visual representation of how data is stored in each micro-partition.
In Snowflake, column data is typically compressed to save space. Each micro-partition file also has a header at the top that stores the offsets of each column’s data, which enables efficient query processing by letting the system skip columns it doesn’t need.
Each micro-partition in Snowflake is immutable, meaning that any update within a micro-partition necessitates the creation of a new micro-partition. This new partition replicates all existing data from the original, with the updated entry included. Consequently, the total active bytes associated with the table are updated to include this new micro-partition. Meanwhile, the older version of the partition enters a ‘time travel’ state.
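A small, hypothetical example of this behaviour, reusing the orders table from above: an update rewrites the affected micro-partition, and the prior version stays reachable through time travel.

-- Rewrites the micro-partition(s) containing the matching row; the older
-- partition version moves into the time travel state.
UPDATE orders SET amount = 0 WHERE id = 42;

-- Read the table as it looked five minutes ago, before the update.
SELECT * FROM orders AT(OFFSET => -60 * 5) WHERE id = 42;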
Storage Lifecycle
Snowflake incorporates mechanisms known as ‘time travel’ and ‘fail-safe.’ Time travel allows data to be retrieved for a specified duration for recovery purposes. When the time travel period ends, the fail-safe period begins, which extends the availability of the data for recovery. Once the fail-safe period expires, the data is permanently deleted and no longer incurs charges. The duration of both periods depends on the table type and the Snowflake edition: permanent tables get up to 1 day of time travel on the Standard edition (up to 90 days on Enterprise and above) followed by a 7-day fail-safe period, while transient and temporary tables get at most 1 day of time travel and no fail-safe.
Each micro-partition in Snowflake undergoes a lifecycle that starts as part of a table’s active storage. It then progresses through the stages of time travel and fail-safe, before being permanently deleted. This systematic approach enables Snowflake to efficiently restore tables by simply referencing and reactivating the relevant older micro-partitions. It’s important to understand that Snowflake imposes storage charges for micro-partitions throughout their entire lifecycle, which includes the stages of active storage, time travel, and fail-safe.
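To see how a table’s bytes are spread across these stages, the ACCOUNT_USAGE.TABLE_STORAGE_METRICS view (used again later in this post) breaks them out per table. A minimal example, with a hypothetical database name:

-- Bytes per lifecycle stage for each table in a (hypothetical) database.
SELECT table_schema, table_name, active_bytes, time_travel_bytes, failsafe_bytes
FROM snowflake.account_usage.table_storage_metrics
WHERE table_catalog = 'MY_DB'
  AND deleted = FALSE;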
Cloning
Snowflake offers a cloning capability for tables, which creates a new table that references all the micro-partitions of an existing base table. This process, known as ‘zero copy cloning,’ doesn’t duplicate the micro-partitions; the new table merely points to the existing ones. Importantly, these micro-partitions are not charged twice: storage costs are incurred only once, as part of the base table, and the partitions drop out of the base table’s storage calculations only once no clone references them any longer.
Furthermore, if the base table is deleted and the fail-safe period for its micro-partitions has passed, these partitions are still tracked under ‘retained for clone bytes.’ This accounting continues as long as at least one clone table uses these micro-partitions in its active storage, or until their own fail-safe period ends.
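A minimal sketch of cloning and of where these retained bytes surface (table names hypothetical):

-- The clone initially references the same micro-partitions as the base table;
-- nothing is copied and nothing is charged twice.
CREATE TABLE orders_clone CLONE orders;

-- Micro-partitions kept only because a clone still references them
-- show up in RETAINED_FOR_CLONE_BYTES.
SELECT table_name, active_bytes, retained_for_clone_bytes, clone_group_id
FROM snowflake.account_usage.table_storage_metrics
WHERE retained_for_clone_bytes > 0;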
Optimizing Storage Costs in Snowflake
Snowflake’s storage costs are calculated based on the bytes stored in micro-partitions, encompassing active storage, time travel, fail-safe, and bytes retained for cloning. The cost calculation involves taking hourly snapshots of data usage, averaging these over a month, and then applying the storage provider’s rates, which are typically around $23 per terabyte (TB) per month.
Query for Calculating Storage Cost
To determine the specific storage cost contribution of a particular table, the following SQL query can be used:
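A sketch along those lines, based on the ACCOUNT_USAGE.TABLE_STORAGE_METRICS view; the $23 per TB rate is an assumption and should be replaced with your contracted rate:

-- Approximate monthly storage cost per table, assuming ~$23 per TB per month.
-- Filter on table_catalog / table_schema / table_name to look at a single table.
SELECT
    table_catalog,
    table_schema,
    table_name,
    active_bytes,
    time_travel_bytes,
    failsafe_bytes,
    retained_for_clone_bytes,
    (active_bytes + time_travel_bytes + failsafe_bytes + retained_for_clone_bytes)
        / POWER(1024, 4) * 23 AS estimated_monthly_cost_usd
FROM snowflake.account_usage.table_storage_metrics
WHERE deleted = FALSE
ORDER BY estimated_monthly_cost_usd DESC;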
Identifying and Reducing Unused Storage
Finding Unused Tables: A key strategy for reducing storage costs is identifying unused tables, i.e. tables that haven’t been queried recently. This can be accomplished using Snowflake’s ‘access_history’ view, which provides metadata on table access; note that this view is not available in the Standard edition of Snowflake. The SQL query examines table access, modification, and creation details, along with the associated costs, over the past 30 days, integrating data from several Snowflake views to calculate per-table cost and usage.
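A simplified sketch of that approach, joining ACCESS_HISTORY with TABLE_STORAGE_METRICS; it only checks read access, and the 30-day window and $23 per TB rate are assumptions:

-- Tables with no recorded read access in the last 30 days, with estimated monthly cost.
-- Requires Enterprise edition or above for ACCESS_HISTORY.
WITH recent_access AS (
    SELECT DISTINCT obj.value:"objectId"::NUMBER AS table_id
    FROM snowflake.account_usage.access_history ah,
         LATERAL FLATTEN(input => ah.base_objects_accessed) obj
    WHERE ah.query_start_time >= DATEADD(day, -30, CURRENT_TIMESTAMP())
      AND obj.value:"objectDomain"::STRING = 'Table'
)
SELECT
    m.table_catalog,
    m.table_schema,
    m.table_name,
    m.table_created,
    (m.active_bytes + m.time_travel_bytes + m.failsafe_bytes + m.retained_for_clone_bytes)
        / POWER(1024, 4) * 23 AS estimated_monthly_cost_usd
FROM snowflake.account_usage.table_storage_metrics m
LEFT JOIN recent_access a
       ON a.table_id = m.id
WHERE m.deleted = FALSE
  AND a.table_id IS NULL      -- no reads recorded in the window
ORDER BY estimated_monthly_cost_usd DESC;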
Managing Large Tables: When dealing with large, infrequently accessed tables, such as clickstream data tables, it becomes crucial to eliminate portions of the data that are no longer needed. The Python script below, executable via Snowpark, analyzes partition usage in these large tables to identify underutilized sections. It works by examining the most recent queries against a specified table (limited to a 14-day window due to the availability of query operator stats) and calculating the percentage of partitions scanned by each query, which offers insight into partition utilization.
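A condensed Snowpark sketch of this analysis; the table name, lookback window, and query cap are placeholders, and it assumes the running role can read ACCOUNT_USAGE.QUERY_HISTORY and call GET_QUERY_OPERATOR_STATS for the queries involved:

# Sketch: estimate what fraction of a table's micro-partitions recent queries
# actually scan, using per-operator pruning statistics.
import json

import numpy as np
from snowflake.snowpark import Session

TABLE_NAME = "MY_DB.MY_SCHEMA.CLICKSTREAM_EVENTS"   # placeholder
LOOKBACK_DAYS = 14    # operator stats are only available for ~14 days
MAX_QUERIES = 100     # cap on the number of recent queries analyzed


def partition_scan_percentiles(session: Session, table_name: str) -> None:
    short_name = table_name.split(".")[-1].upper()
    # Recent successful queries whose text mentions the table.
    queries = session.sql(f"""
        SELECT query_id
        FROM snowflake.account_usage.query_history
        WHERE start_time >= DATEADD(day, -{LOOKBACK_DAYS}, CURRENT_TIMESTAMP())
          AND execution_status = 'SUCCESS'
          AND CONTAINS(UPPER(query_text), '{short_name}')
        ORDER BY start_time DESC
        LIMIT {MAX_QUERIES}
    """).collect()

    scan_pcts = []
    for row in queries:
        # Per-operator stats for one query; TableScan operators expose pruning
        # info (partitions_scanned / partitions_total). A stricter version would
        # also match operator_attributes:"table_name" against the target table.
        qid = row["QUERY_ID"]
        ops = session.sql(
            "SELECT operator_type, operator_statistics "
            f"FROM TABLE(GET_QUERY_OPERATOR_STATS('{qid}'))"
        ).collect()
        for op in ops:
            if op["OPERATOR_TYPE"] != "TableScan":
                continue
            pruning = json.loads(op["OPERATOR_STATISTICS"]).get("pruning", {})
            total = pruning.get("partitions_total")
            if total:
                scan_pcts.append(100.0 * pruning.get("partitions_scanned", 0) / total)

    if not scan_pcts:
        print("No TableScan statistics found in the lookback window.")
        return
    for p in (10, 50, 90, 99, 100):
        print(f"p{p}: {np.percentile(scan_pcts, p):.1f}% of partitions scanned")

Calling partition_scan_percentiles(session, TABLE_NAME) prints the p10 through p100 values of the per-query scan percentage, which feed directly into the discussion below.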
The script’s output includes percentile values (p10, p50, p90, p99, p100) which help in understanding data access patterns. A low p99 value in a high-cost table suggests that a significant portion of the table is rarely accessed. Further investigation into the query operator stats, particularly focusing on the ‘Filter’ operator type, can provide deeper insights into whether specific data partitions (like old date ranges) are not queried.
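For instance, the operator-level attributes for a single query can be pulled directly; the query ID below is a placeholder:

-- Inspect scan and filter operators for one query to see which predicates
-- (for example, old date-range filters) were applied and how well they pruned.
SELECT operator_id, operator_type, operator_attributes, operator_statistics
FROM TABLE(GET_QUERY_OPERATOR_STATS('<query_id>'))
WHERE operator_type IN ('TableScan', 'Filter');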
Tables with a high percentage of unaccessed data, especially over extended periods, are prime candidates for archiving or deleting old partitions, which helps reduce storage costs.