Snowflake was built for the cloud and supports an unlimited number of Virtual Warehouses – effectively independently sized compute clusters that share access to a common data store.
This Elastic Parallel Processing means it’s possible to run complex data science operations, ELT loading and Business Intelligence queries against the same data without contention for resources.
Snowflake was designed for efficiency, so there are almost no performance tuning options for users to consider. Still, a few best practices help you get the most performance in a cost-efficient way – we discuss them below.
Data loading recommendations
Split files before loading: Given Snowflake's multi-cluster, multi-threaded architecture, split your data into multiple small files rather than one large file so that all nodes in the cluster are put to use. Loading a single large file keeps only one node busy while the others sit idle, even on a larger warehouse. Follow the same practice for data unloading.
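As a sketch, a single COPY INTO over a stage path containing many split files (Snowflake's guideline is roughly 100–250 MB compressed per file) lets every thread on every node load in parallel. The stage, table, and file-name pattern below are hypothetical:

```sql
-- Load many split files in one statement; each file can be picked up
-- by a different thread/node of the warehouse.
COPY INTO sales_raw
  FROM @my_stage/sales/
  FILE_FORMAT = (TYPE = CSV FIELD_OPTIONALLY_ENCLOSED_BY = '"')
  PATTERN = '.*sales_part_.*[.]csv[.]gz';
```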
Use a separate large warehouse for ingesting large files; this dedicates full horsepower to data loading. The warehouse can be suspended (or set to auto-suspend) once loading completes so you only pay for the load itself.
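A minimal sketch of such a dedicated loading warehouse (the name is hypothetical); AUTO_SUSPEND stops billing shortly after the warehouse goes idle, so there is no need to remember to turn it off manually:

```sql
CREATE WAREHOUSE load_wh
  WAREHOUSE_SIZE      = 'LARGE'
  AUTO_SUSPEND        = 60      -- suspend after 60 s idle
  AUTO_RESUME         = TRUE
  INITIALLY_SUSPENDED = TRUE;

USE WAREHOUSE load_wh;
-- ... run COPY INTO statements here ...
ALTER WAREHOUSE load_wh SUSPEND;  -- optional: suspend immediately
```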
Data querying recommendations
Dedicated warehouse for querying: Snowflake automatically caches data in the virtual warehouse's local disk cache, so place users querying the same data on the same virtual warehouse. This maximizes the chance that data retrieved into the cache by one user is reused by others. Note that suspending a warehouse erases this cache.
The Result Cache is maintained by the Cloud Services layer; any query executed by any user on the account can be served from it, provided the SQL text is the same and the underlying data has not changed. Results are retained for 24 hours.
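To illustrate (table and predicate are hypothetical): running the identical statement twice lets the second execution return from the result cache without consuming warehouse compute, and the cache can be disabled per session when benchmarking:

```sql
-- Second execution is served from the result cache, provided the SQL
-- text matches exactly and the underlying data has not changed.
SELECT COUNT(*) FROM orders WHERE status = 'SHIPPED';
SELECT COUNT(*) FROM orders WHERE status = 'SHIPPED';

-- For benchmarking, bypass the result cache for this session:
ALTER SESSION SET USE_CACHED_RESULT = FALSE;
```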
Snowflake's Query Profile feature helps analyze queries, including those generated by BI tools. If you have multiple BI tools with common users, giving each BI tool its own dedicated warehouse makes it easy to tell which tool generated a query, since there is no out-of-the-box way to identify the origin.
Consider scaling out (multi-cluster) the warehouse used for BI analytics to cater to concurrent users.
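A sketch of a multi-cluster warehouse for BI concurrency (warehouse name is hypothetical; multi-cluster warehouses require Enterprise Edition or above). Scaling out adds clusters under concurrent load rather than making individual queries faster:

```sql
CREATE WAREHOUSE bi_wh
  WAREHOUSE_SIZE    = 'MEDIUM'
  MIN_CLUSTER_COUNT = 1
  MAX_CLUSTER_COUNT = 4        -- scale out for concurrency, not up for speed
  SCALING_POLICY    = 'STANDARD'
  AUTO_SUSPEND      = 300
  AUTO_RESUME       = TRUE;
```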
Storing semi-structured data in a VARIANT column: For data that is mostly regular and uses only native types (strings and integers), the storage requirements and query performance for operations on relational data and on data in a VARIANT column are very similar. Load the data set into a VARIANT column in a table, then use the FLATTEN function to extract the objects and keys you plan to query into a separate table.
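A sketch of that pattern with hypothetical table and key names – JSON lands in a VARIANT column, and LATERAL FLATTEN extracts the nested array elements into a relational table:

```sql
CREATE TABLE raw_events (v VARIANT);
-- load JSON via: COPY INTO raw_events FROM @stage FILE_FORMAT = (TYPE = JSON)

CREATE TABLE event_items AS
SELECT
  r.v:event_id::STRING AS event_id,
  f.value:sku::STRING  AS sku,
  f.value:qty::INTEGER AS qty
FROM raw_events r,
     LATERAL FLATTEN(INPUT => r.v:items) f;  -- one row per array element
```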
Date/Time Data Types for Columns: When defining columns to contain dates or timestamps, choose a date or timestamp data type rather than a character data type. Snowflake stores DATE and TIMESTAMP data more efficiently than VARCHAR, resulting in better query performance.
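For example (hypothetical table), prefer the native temporal types in the DDL and convert incoming strings at load time:

```sql
CREATE TABLE shipments (
  shipment_id  NUMBER,
  shipped_on   DATE,           -- not VARCHAR(10)
  delivered_at TIMESTAMP_NTZ   -- not VARCHAR(30)
);

-- Convert string inputs safely; TRY_TO_* returns NULL instead of erroring.
SELECT TRY_TO_DATE('2021-06-30')                   AS d,
       TRY_TO_TIMESTAMP('2021-06-30 10:15:00')     AS ts;
```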
Set a clustering key for larger data sets: Specifying a clustering key is not necessary for most tables; Snowflake performs automatic tuning via its optimization engine and micro-partitioning. Set clustering keys for data sets larger than 1 TB, and when the Query Profile indicates that a significant percentage of the total duration is spent scanning.
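As a sketch with hypothetical table and column names – define the key on the columns most often used in filters and joins, then check clustering health with the system function:

```sql
ALTER TABLE big_fact CLUSTER BY (event_date, region);

-- Inspect how well the table is clustered on those columns:
SELECT SYSTEM$CLUSTERING_INFORMATION('big_fact', '(event_date, region)');
```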
Use transient tables as needed: Snowflake supports the creation of transient tables. Snowflake does not preserve a Fail-safe history for these tables, which can result in a measurable reduction of your Snowflake storage costs. Consider using transient tables for short-lived data such as POCs and staging tables.
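A minimal sketch (table names hypothetical): transient tables have no Fail-safe period and at most one day of Time Travel, which is what reduces the storage cost:

```sql
-- Retention set to 0 disables Time Travel entirely for this table.
CREATE TRANSIENT TABLE stg_orders
  DATA_RETENTION_TIME_IN_DAYS = 0
AS
SELECT * FROM raw_orders;
```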
Create and maintain a dashboard for Snowflake usage: Snowflake provides an out-of-the-box audit of cost incurred for the account and other usage metrics, presented as visualizations under the "Account" section. This information is only available to the Snowflake account administrator and others who have been granted access to query it.
Create and maintain a live dashboard for the developers and users who consume Snowflake so they can better manage their own usage. The ready-made usage dashboards available for Looker and Tableau make this easy to implement.
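Such dashboards are typically built on the ACCOUNT_USAGE share. A sketch of the kind of query behind them (the views are real; note ACCOUNT_USAGE data can lag by a few hours and requires granted access):

```sql
-- Credits consumed per warehouse over the last 30 days.
SELECT warehouse_name,
       SUM(credits_used) AS credits_30d
FROM snowflake.account_usage.warehouse_metering_history
WHERE start_time >= DATEADD('day', -30, CURRENT_TIMESTAMP())
GROUP BY warehouse_name
ORDER BY credits_30d DESC;
```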
Set SEARCH_PATH: setting SEARCH_PATH for the session can shave milliseconds off your queries, since unqualified object names are resolved directly to the right database and schema.
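A sketch with a hypothetical schema: SEARCH_PATH is a session parameter set as a comma-separated string, and `$current`/`$public` keep the default resolution order ahead of your addition:

```sql
ALTER SESSION SET SEARCH_PATH = '$current, $public, analytics.reporting';

SELECT * FROM daily_sales;  -- resolved via the search path, no qualification needed
```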
Snowflake’s architecture loads, stores and manages data without additional configuration; the points discussed above are a few levers for using Snowflake in an optimized way. Follow the Snowflake documentation for more information on fine-tuning your instance.