In this blog, let's look at what micro-partitioning is in Snowflake, how it improves query performance, and the various benefits it offers.
Snowflake Table Structure
All data in Snowflake is stored in database tables, logically structured as collections of columns and rows.
Micro-partitions and data clustering are two of the principal concepts utilized in Snowflake physical table structures.
What is a Micro-partition?
A micro-partition is a reasonably sized block of cloud storage where data is stored. Each micro-partition contains between 50 MB and 500 MB of uncompressed data (the actual size in Snowflake is smaller, because data is always stored compressed).
Snowflake is columnar-based and horizontally partitioned, meaning all the values of a given row are stored in the same micro-partition. Each block represents a set of rows from the table and is stored in a compressed columnar format.
Snowflake also tracks and remembers metadata about all rows stored in a micro-partition, including:
- The range of values for each of the columns in the micro-partition.
- The number of distinct values.
- Additional properties used for both optimization and efficient query processing.
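You can get a sense of this metadata through Snowflake's clustering functions. The snippet below is illustrative — the `sales` table and `sale_date` column are hypothetical names:

```sql
-- SYSTEM$CLUSTERING_INFORMATION reports how well a table's
-- micro-partitions are clustered on the given column(s), based on
-- the per-partition metadata Snowflake maintains.
SELECT SYSTEM$CLUSTERING_INFORMATION('sales', '(sale_date)');
```

The result is a JSON document with statistics such as the total partition count and the average depth of overlapping partitions for the column.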
Note: In Snowflake, all data in tables is automatically divided into micro-partitions. Tables are transparently partitioned using the ordering of the data as it is inserted/loaded.
Let's assume you have a single fact table with one year of historical sales data and a date column. As you load daily transaction data, Snowflake partitions it in the same order the data arrives. For simplicity, the diagram below shows one micro-partition per day, but in practice a day's data may span N micro-partitions.
When the user queries a date, Snowflake knows exactly which micro-partitions hold that data (based on the micro-partition metadata). It then scans only the micro-partitions that contain the data.
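A query against such a table might look like the sketch below (table and column names are illustrative):

```sql
-- Snowflake consults the min/max metadata for sale_date in each
-- micro-partition and scans only those whose range covers the
-- filter value; all other partitions are pruned without being read.
SELECT SUM(amount)
FROM   sales
WHERE  sale_date = '2023-06-15';
```

In the query profile, the "Partitions scanned" figure for such a query is typically far smaller than "Partitions total".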
Statistics are collected for each block independently on block creation, which is handled by the Snowflake engine transparently to the users.
Tables are automatically divided into micro-partitions; we can't partition tables on our own. There is a workaround, however: you can sort the table data by a field such as transaction date. This allows Snowflake to use the min/max statistics of micro-partitions to prune the ones that do not contain the relevant dates for queries that filter on date. These are called clustered tables. Snowflake maintains the data clustering for you transparently, but of course for a fee covering the compute and storage resources needed to achieve this.
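Defining a clustering key is done with standard Snowflake DDL. A minimal sketch, assuming a hypothetical `sales` table clustered on `sale_date`:

```sql
-- Add a clustering key to an existing table; Snowflake's automatic
-- clustering service then reorganizes micro-partitions in the
-- background (incurring compute and storage cost).
ALTER TABLE sales CLUSTER BY (sale_date);

-- Or declare the clustering key at creation time:
CREATE TABLE sales_clustered (
    sale_date DATE,
    customer  VARCHAR,
    amount    NUMBER(10, 2)
)
CLUSTER BY (sale_date);
```

Clustering keys pay off mainly on large tables that are frequently filtered or joined on the clustered column.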
Benefits of Micro-Partitioning in Snowflake
- Micro-partitions are small, which enables extremely efficient DML and fine-grained pruning for faster queries.
- Micro-partitions can overlap in their range of values, which, combined with their uniformly small size, helps prevent skew.
- Columns are stored independently within micro-partitions, often referred to as columnar storage.
- This enables efficient scanning of individual columns; only the columns referenced by a query are scanned.
- Columns are compressed individually within micro-partitions. Snowflake automatically determines the most efficient compression algorithm for the columns in each micro-partition.
How do micro-partitions improve query performance?
Organizing data into micro-partitions has a significant impact on pruning performance. The micro-partition metadata collected transparently by Snowflake enables precise pruning of micro-partitions at query run time, including on columns containing semi-structured data. Query performance can be further improved by clustering the micro-partitions.
Consider sample data that is frequently filtered in the query WHERE clause by DATE. Since the data is loaded by DATE, rows with the same date naturally cluster into the same micro-partitions. Because Snowflake holds the minimum and maximum value of every column in each micro-partition, it can skip over micro-partitions that don't match the query criteria. This improves query performance, since a selective filter can allow the vast majority of the data to be skipped entirely.
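You can observe pruning before running a query by using EXPLAIN. A sketch, again with hypothetical table and column names:

```sql
-- The EXPLAIN output reports partitionsTotal and partitionsAssigned,
-- i.e. how many micro-partitions exist versus how many survive
-- pruning for this date-range filter.
EXPLAIN
SELECT COUNT(*)
FROM   sales
WHERE  sale_date BETWEEN '2023-06-01' AND '2023-06-30';
```

A well-clustered table shows a small partitionsAssigned relative to partitionsTotal for selective filters.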
How do DML Operations work in micro-partitions?
Snowflake is designed for analytic workloads which consist of large reads, bulk inserts and updates. Tables are transparently partitioned using the ordering of the data as it is inserted/loaded.
Any change to a micro-partition file performed via a DML operation can only be made by replacing it with a different file that includes the changes. Existing files are never updated in place, because cloud object stores such as Amazon S3 and Azure Blob Storage treat stored objects as immutable. DML operations add or remove partition files, and partition files are always written in full. File additions and removals are tracked in the metadata (in the global key-value store), in a form that allows the set of files belonging to a specific table version to be computed very efficiently.
In the case of an initial insert, partitions are created based on the order in which the data is inserted/loaded.
When a record is inserted, Snowflake adds a new partition file.
In this case, when we insert a record with value “Steve”, Snowflake adds a new partition file P3.
When a delete or update operation is performed, Snowflake removes the affected partition file and replaces it with a new file containing the changes.
In this case, we delete all the records which have a name as “Andrews”. Now Snowflake deletes the entire partition file P1 and replaces it with a new partition file P4 including the changes.
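The two scenarios above can be sketched as plain DML; the `customers` table and its rows are the hypothetical ones from the example:

```sql
-- An insert never modifies an existing micro-partition file; it
-- writes a new one (P3 in the example above).
INSERT INTO customers (name) VALUES ('Steve');

-- A delete rewrites every affected micro-partition in full: the old
-- file (P1) is dropped from the table's metadata and a new file (P4)
-- without the deleted rows is written.
DELETE FROM customers WHERE name = 'Andrews';
```

This copy-on-write behavior is also what makes features like Time Travel possible, since older file versions remain available for the retention period.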
In conclusion, micro-partitioning is central to tuning Snowflake for performance, and there are best practices that can be applied to maximize query performance and throughput.