In the Snowflake as a Data Lake blog, we saw why data lakes matter, the technical challenges they pose, and how Snowflake can serve as a data lake solution. We also touched on how a data lake can be implemented in Snowflake. In this part of the blog, we will see how Snowflake stacks up against other options in the market, such as Amazon S3 and Delta Lake.
We are considering the following factors for comparison:
- Continuous data integration
- Consumption and exposure of data
- SQL interface
- Sharing of data across accounts
- Compression of data
- Native stack (better integration)
- Supported data formats
- How each data lake solution handles updates
| Factors | Snowflake | Amazon S3 | Delta Lake |
| --- | --- | --- | --- |
| Continuous data integration | Built in, via STREAMS. | Achieved with additional tools such as AWS Glue, Athena, or Spark. | Achieved using ETL tools. |
| Consuming / exposing data | JDBC, ODBC, .NET, and Go drivers; Node.js, Python, Spark, and Kafka connectors. Snowflake also provides Java and Python APIs that simplify working with its REST API. | REST API, SOAP API (deprecated), JDBC and ODBC drivers; connectors for JavaScript, Python, PHP, .NET, Ruby, Java, C++, and Node.js. | Delta ACID API for consuming and the Delta JDBC connector for exposing data. |
| SQL interface | Built in (worksheets). | Requires Athena or Presto (additional cost). | Apache Spark SQL, Azure SQL Data Warehouse/DB. |
| Sharing of data across accounts | Actual data is never copied; the consumer account receives read-only access via a simple "share" command, incurring compute cost but no storage cost. | Cross-account access can be achieved using Amazon QuickSight, which incurs additional cost. | Achieved using Azure Data Share, which is snapshot-based; you pay for the operation that moves a dataset from source to destination plus the resources used to move the data. |
| Compression (data storage) | Data is stored in a columnar format and compressed automatically, at roughly a 4:1 ratio. | Must be done manually, for example on EC2 machines. | All data is loaded in Apache Parquet format to leverage its efficient compression. |
| Native stack (better integration) | Snowflake partner tools provide better integration than other tools. | Amazon stack (Amazon S3 for storage, Amazon Redshift for warehousing, Amazon Athena for querying, Amazon RDS for databases, AWS Data Pipeline for orchestration, etc.). | Microsoft stack (Blob Storage, Azure Databricks for data preparation, Azure Synapse Analytics for warehousing, Azure SQL DB for databases, Azure DevOps, Power BI for reporting, etc.). |
| Supported formats | Structured and semi-structured data (JSON, Avro, ORC, Parquet, and XML). | Structured, semi-structured, and unstructured data. | Structured, semi-structured, and unstructured data. |
| Data with updates | Updates the specific rows that match a condition with new values. | An S3 object cannot be partially added to, removed from, or modified; you must read the object, make the changes, and write the entire object back to S3. | Specific values in the data can be updated where a condition matches. |
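The last row is worth dwelling on. Snowflake and Delta support an in-place `UPDATE ... WHERE` on specific rows, whereas an S3 object is immutable: changing one record means reading the whole object, modifying it, and writing the whole object back. A minimal Python sketch of the two patterns, using in-memory structures as stand-ins for a table and an object store (purely illustrative; no Snowflake or AWS calls are made):

```python
import json

# --- Row-level update (Snowflake / Delta style) ---------------------
# The "table" is a list of row dicts; only the matching rows change.
def update_where(table, predicate, new_values):
    """Update rows in place where predicate(row) is True; return count."""
    updated = 0
    for row in table:
        if predicate(row):
            row.update(new_values)
            updated += 1
    return updated

# --- Whole-object rewrite (S3 style) --------------------------------
# The "bucket" maps keys to serialized blobs; changing one record means
# reading, deserializing, editing, and rewriting the entire blob.
def rewrite_object(bucket, key, predicate, new_values):
    """Read the whole object, modify matching records, write it all back."""
    records = json.loads(bucket[key])      # "GET" the entire object
    for rec in records:
        if predicate(rec):
            rec.update(new_values)
    bucket[key] = json.dumps(records)      # "PUT" the entire object back

if __name__ == "__main__":
    table = [{"id": 1, "status": "new"}, {"id": 2, "status": "new"}]
    update_where(table, lambda r: r["id"] == 2, {"status": "shipped"})

    bucket = {"orders.json": json.dumps(table)}
    rewrite_object(bucket, "orders.json",
                   lambda r: r["id"] == 1, {"status": "shipped"})
```

Both calls produce the same logical result, but the second always rereads and rewrites every record in the object, which is why update-heavy workloads are awkward on raw S3.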
With faster analytics, a simple service, support for diverse data across multiple cloud platforms, and the ability to scale up as required, Snowflake is one of the most cost-effective solutions on the market.
Snowflake offers a single, integrated service across the three major clouds. You can have data stored in Microsoft Azure, Amazon S3, or Google Cloud and still integrate all of it inside Snowflake. If you later decide to move data across cloud vendors, Snowflake continues to work efficiently.
Built entirely on ANSI SQL, Snowflake makes it effortless to run a data lake with a full SQL environment. Complete resource isolation and control let virtual warehouses run queries against the same data independently, without affecting one another. Automatic metadata management and history allow Snowflake to deliver faster analytics with built-in control and governance for fast data flow.
Hence, with Snowflake we can ingest batch or streaming data, build materialized views and external tables, and deliver insights and business results much faster. Most importantly, rescaling a cluster requires no manual intervention. And because compute cost and storage cost are separated, overall cost stays low, making Snowflake a top contender for data lakes in the market.
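The streaming ingestion mentioned above leans on the STREAMS feature from the comparison table: a Snowflake stream tracks an offset into a table's change history, so each read returns only the rows changed since the last read. A rough Python simulation of that offset idea, with an append-only list standing in for the change history (hypothetical class names; real streams are created in SQL with `CREATE STREAM`):

```python
class ChangeLog:
    """Append-only log of row changes; a stand-in for a table's change history."""
    def __init__(self):
        self.entries = []

    def append(self, row):
        self.entries.append(row)

class Stream:
    """Tracks an offset into a ChangeLog, loosely like a Snowflake stream.

    Consuming the stream returns only the changes made since the last
    consume, then advances the offset. (In Snowflake, the offset advances
    when the stream is read inside a DML transaction.)
    """
    def __init__(self, log):
        self.log = log
        self.offset = len(log.entries)   # a new stream starts "empty"

    def consume(self):
        delta = self.log.entries[self.offset:]
        self.offset = len(self.log.entries)
        return delta

if __name__ == "__main__":
    log = ChangeLog()
    stream = Stream(log)
    log.append({"id": 1})
    log.append({"id": 2})
    new_rows = stream.consume()   # both rows appended since creation
    nothing = stream.consume()    # empty: no changes since last consume
```

This offset bookkeeping is what lets downstream pipelines process only the delta on each run instead of rescanning the whole table.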