Snowflake Architecture includes Caching at various levels to speed the Queries and reduce the machine load. We will now discuss on different caching techniques present in Snowflake that will help in Efficient Performance Tuning and Maximizing the System Performance.
The Overall Architecture of Snowflake and the caching of data at each level is as follows
Let’s have a deep understanding of how Caching is done in Snowflake with a few examples. We will use pre-built benchmark tables that come along with Snowflake Datawarehouse. The dataset contains details about Parts and its Attributes. It contains 200 million rows that constitute to 5GB of data.
The Virtual Warehouse of size XS with Single Cluster is used. The Execution time might vary depending on the size of the virtual Warehouse and the number of clusters.
Once the Virtual Warehouse is ON, the Part Manager and Part Brand is selected along with the total Retail Price
Since the Virtual Warehouse is just turned on the Local disk cache is empty. The data will be fetched from Remote disk S3 storage, processed and then cached in Local disk SSD.
The query execution plan is as follows
Total execution time is 4.2s, 0% of data is scanned from Cache and 87% of the time is taken in Remote Disk IO
The same Query is executed with an additional column p_name. The expectation is to select the needed columns from the Local disk Cache and the new column from the Remote Disk S3.
The Query Execution plan is as follows,
This time 21.38% of data is scanned from local disk cache and only 4% of the time is spent in Remote Disk IO. For better understanding, the same query or the subset of the query can be rerun with result cache disabled in order to fetch the whole data from Local disk cache. By default, the result cache will be enabled. This can be disabled using the following command -and the same query will be rerun again.
100% of the result set is scanned from the Local Disk cache. The cache is Volatile and it is dropped once the Virtual Warehouse is suspended, this might result in slower initial performance once the warehouse is resumed. Better Performance can be achieved by segmenting the Query workload by grouping similar Users in the same Virtual Warehouse thus the query used by one user can be used by the other user. The size of the local disk cache depends on the number of servers in the warehouse. Larger the Warehouse, increased number of servers hence larger the size of Cache. Hence the awareness of losing the performance benefits of cache should be there before reducing the size or suspending the Warehouse.
The same query is now executed with the result cache enabled. The Query Execution plan is as follows.
The Result Cache is independent of the Virtual Warehouses hence any query executed by any user in the account is available in the Result cache provided the SQL query is the same. Many dashboard applications involve the re-execution of the same SQL across the various screens where data can be fetched from result cache for fast retrieval. Each query executed is retained for 24 hours and the time is reset to 31 days if the query is re-executed. Since the cache occurs in the service layer there is no cost associated with it (no storage or computation cost). Retrieval optimization and Post-processing query results can be achieved using Result cache for optimized performance.
The Result cache of the query will automatically be used only if it meets few conditions. The conditions can be found in this blog https://community.snowflake.com/s/article/Understanding-Result-Caching.
Result cache can be explicitly used while building a complex query using the following command
As discussed above the Unique features present in Snowflake caching at different levels is as follows,
- No Configuration for setting up the cache is involved
- Cache available from all the warehouse
- Result cache persists for 24 hours
- Infinite space for storage Cache(S3)
- Faster retrieval
Hence correct utilization of the cache depending upon the scenario will make the system Cost-efficient in terms of computing and storage with Optimized performance