Data warehouse primarily acts as the core of any Business Intelligence Model. When we want to set up a Data warehouse, we focus on its architecture, how the data is loaded to the system, storage aspects, modelling, authorization and reporting. This blog takes us through these concepts with respect to Snowflake, the data warehouse built for the cloud.
Snowflake architecture has the hybrid of both shared-disk and shared-nothing architecture so that it provides users with the best from both. Snowflake consists of centralized storage, atop of which virtual warehouses are built to interact its storage. Scale-up for performance and scale-out for concurrency with no downtime. Snowflake is built on SQL and it is priced based on the usage which consists of two items – cost of storage and cost of computing resources consumed.
The architecture of snowflake comprises of 3 layers – Database storage, Query Processing, Cloud Services.
- Database storage: The loaded data is organized as micro partitions, an optimised, compressed and columnar format in cloud storage. The actual underlying file system is backed by S3 in Snowflake’s account, all data is encrypted, compressed, and distributed to optimize performance.
- Query Processing: The processing of data is done in this layer which is done by the Virtual Warehouses where each one is an independent compute cluster that doesn’t share resources with other virtual warehouses. Thus, each warehouse doesn’t impact the performance of other warehouses. These can be used to load data or run queries and can do both tasks concurrently and can be scaled according to the requirement and can be paused when not used.
- Cloud Services: The activities like sessions, authentication, SQL compilation, encryption, etc across snowflake is co-ordinated by the services available in this layer.
Data Loading in Snowflake
Though snowflake provides Bulk and Continuous loading for ingesting data, the best solution depends on the volume and frequency of data load. Data from flat files can be loaded using Web UI and bulk data can be loaded by various methods. The bulk data loading can be automated in snowflake using snow pipe. The data from the source is loaded into stage files which are then loaded into tables.
- Bulk Loading
Enables loading batches of data from any external cloud or the local system to an internal cloud location before the data is loaded into the table using the COPY command. Few transformations supported by Snowflake while loading data are column reordering, column omission, casts, truncating text strings.
- Continuous loading
Small volumes of data can be loaded using a method of continuous loading and makes them available incrementally for analysis. It is done using snow pipes.
The data can be loaded from the local system, cloud services like AWS S3, Google cloud storage, Microsoft Azure. The file formats that are supported are delimited files (CSV, TSV etc), JSON, Avro, ORC, Parquet, XML. Loading of structured and unstructured data can be done on the same table.
Data Storage in Snowflake
The compressed and optimised data is stored in a columnar format by Snowflake in cloud storage. Continuous Data Protection (CDP) which includes time travel and fail-safe is a feature available to all accounts at no additional cost. It is designed to provide long term protection for your data. Unless mentioned at the time of creation, the tables in snowflake are created as permanent data. Snowflake stores these tables by dividing the rows into multiple micro partitions. The rows in the table are grouped and mapped into separate micro-partitions and are stored in columnar fashion. Any changes or updates in the data will affect the micro partition to reflect the new data.
- Temporary tables: Temporary files exist only within the user session and specified duration of session. They have no fail-safe option and has time travel retention period of 0-1 day.
- Transient tables: These tables possess the characteristics of both permanent and temporary tables. They possess no Fail-safe and have a retention period of 0-1 day like the temporary tables but are not associated with any sessions, unlike temporary tables.
- Staged File Storage : The files containing the data to be loaded are stored in stages, both internal and external stages are supported by Snowflake. It offers cloning feature that can be useful for creating instant backups that do not incur any additional costs.
Snowflake supports creating views, procedures and functions using SQL codes from the tables stored.
Different views can be created based on the requirement that can grant specific access to the users using CREATE VIEW statement. There are two types of view – on materialized view which are usually referred to as views and materialized views. The results of the non-materialized views are not stored whereas the results of the materialized views are stored more like a table. The secure view allows the data owner to grant access to other Snowflake users for a specific subset of data which provides cell-level security in multi-tenant situations.
- Stored Procedures
- Functions (User Defined Functions)
Authorization and Authentication
Since Snowflake is a data warehouse for cloud, it uses internet protocols for network communication and security. Snowflake authorization includes OAuth, Federated Authentication, Multi-factor Authentication.
For validation and authorization, Snowflake provisions OAuth which is an Open standard Protocol that allows clients authorized access to Snowflake without storing or sharing user login credentials which are known as delegated authorization. Snowflake aids OAuth to clients using integrations, a snowflake object that provides an interface between Snowflake and third-party services.
- Federated Authorization
Users can connect to Snowflake using Single-sign-on which is enabled by Federated Authentication. In a federated environment, authentication and access are separated using external entities which provide independent authentication.
- Multi-factor Authentication
Multi factored authentication can provide increased login security for users connecting to Snowflake which is provided as an integrated feature powered by Duo Security Service which is managed by Snowflake which requires installation of Duo Mobile Application.
Reporting in snowflake is primarily to view the results for a query that has been executed. In snowflake the worksheet offers the user an interface for creating SQL Queries and DML, DDL operations to view the results only in tabular format, no other visualizations are available. Snowflake allows to create multiple worksheets with its own separate session and save it in a library of saved worksheets for later use. Saved worksheets are not accessible outside the snowflake web interface and cannot be shared with other users. Your current interface role determines the default role for worksheets that you open, but the worksheets are not tied to the interface role. Each worksheet has its own role that can be set independently.
Snowflake with its exceptional architecture along with its concurrency and simplicity can be used for Data warehousing. Snowflake has all the DW features under one roof and with its exceptional cost, we can leverage it for our purposes.