Data is everywhere. Nowadays, a business analyst gets the same data
This blog discusses whether Snowflake has built-in data governance capabilities or must depend on partners to provide them.
Let us start by understanding what challenges we have with effective data governance today:
- Data Silo – The organization’s data is spread across its source systems, data lakes, and data marts resulting in a data silo. This makes it difficult for analysts and business users to find the information they need to make their decisions.
- Data Complexity – The data has grown in volume and comes from different systems, in various formats and refresh rates. Unifying all of these is challenging.
- Data Veracity – Data is not certified, so its authenticity is always in question.
- Lack of lineage – There is no record of which users/applications have access to which data sets, or of the source of the data in the warehouses.
- Untapped data – The organizations are spread globally across regions. Analysts do not have complete information on what data is available for them to use.
- Data privacy – Access policies, ownership, and sensitivity are not well defined.
Let us next try and understand if Snowflake can solve these data governance issues.
Snowflake has a centralized repository where data is persisted and is accessible by all the compute nodes. Its storage is scalable, inexpensive, and automatically compressed, partitioned, and clustered. Processing large volumes of data is possible thanks to massively parallel processing (MPP) and automatically scalable compute.
Zero-copy cloning is a convenient way to snapshot data without physically copying it. This is useful for testing or creating backups.
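As a minimal sketch of what a zero-copy clone looks like in practice, the Python helper below only builds the SQL text for a `CREATE TABLE ... CLONE` statement. The function name and the table names are hypothetical and used purely for illustration; the resulting string would still need to be run through whichever Snowflake client you use.

```python
def clone_table_sql(source: str, target: str) -> str:
    """Return a CREATE TABLE ... CLONE statement for two table names."""
    # Identifiers are interpolated as-is, so pass trusted names only.
    return f"CREATE TABLE {target} CLONE {source};"


# Hypothetical names: snapshot a production table for a test run.
print(clone_table_sql("sales.public.orders", "sales.public.orders_test"))
```

The clone shares the underlying storage with the source until one of them changes, which is why it suits test copies and quick backups.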
Hence with Snowflake, you can bring in data from all your sources into a single data platform, and thus governance becomes much more manageable as all data exists in one place.
Snowflake has a plethora of drivers, connectors, and supported programming languages and utilities that help ingest data with ease. It can handle streams, batch jobs, and continuous data integration. Tasks let you schedule a SQL statement or stored procedure to run at specified intervals. Snowpipe handles continuous data ingestion into Snowflake, and the Kafka connector ingests data streams from Kafka topics.
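To make the scheduling and continuous-ingestion pieces more concrete, here is a hedged sketch in Python that builds the SQL for a task and a pipe. All object names (task, warehouse, procedure, pipe, table, stage) are hypothetical, and the pipe assumes JSON files arriving on an external stage that already has cloud event notifications configured for `AUTO_INGEST`.

```python
def create_task_sql(task: str, warehouse: str, minutes: int, procedure: str) -> str:
    """Build a task that calls a stored procedure every `minutes` minutes."""
    return (
        f"CREATE TASK {task}\n"
        f"  WAREHOUSE = {warehouse}\n"
        f"  SCHEDULE = '{int(minutes)} MINUTE'\n"
        f"AS\n"
        f"  CALL {procedure}();"
    )


def create_pipe_sql(pipe: str, table: str, stage: str) -> str:
    """Build a Snowpipe definition that copies staged JSON files into a table."""
    return (
        f"CREATE PIPE {pipe} AUTO_INGEST = TRUE AS\n"
        f"  COPY INTO {table} FROM @{stage}\n"
        f"  FILE_FORMAT = (TYPE = 'JSON');"
    )


# Hypothetical object names, for illustration only.
print(create_task_sql("refresh_orders", "etl_wh", 60, "load_orders"))
print(create_pipe_sql("orders_pipe", "raw_orders", "orders_stage"))
```

Note that a newly created task is suspended by default, so it has to be resumed with `ALTER TASK ... RESUME` before it starts running on schedule.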
It supports structured and semi-structured data, and support for unstructured data is currently in private preview. Semi-structured data can be queried directly, without first flattening it into tables.
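Querying semi-structured data in place can be sketched as follows. The helper builds a `SELECT` that reads nested fields from a VARIANT column using Snowflake's colon path notation and casts them with `::`. The table, column, and field names are hypothetical, and the JSON layout is assumed for the sake of the example.

```python
def variant_select_sql(table: str, column: str, fields: dict) -> str:
    """Build a SELECT over a VARIANT column.

    `fields` maps an output alias to a (path, sql_type) pair.
    """
    projections = [
        f"{column}:{path}::{sql_type} AS {alias}"
        for alias, (path, sql_type) in fields.items()
    ]
    return "SELECT\n  " + ",\n  ".join(projections) + f"\nFROM {table};"


# Hypothetical table and JSON layout, for illustration only.
print(variant_select_sql("raw_events", "payload", {
    "customer_name": ("customer.name", "STRING"),
    "purchase_total": ("purchase.total", "NUMBER(10,2)"),
}))
```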
Lack of lineage
Snowflake does not have an out-of-the-box solution to track lineage. As a best practice, tables sourced into Snowflake are organized into a separate schema per source. dbt, a transformation tool widely used with Snowflake, has built-in functionality to generate project documentation. It gives a clear picture of the dependencies of each table or view and shows the entire lineage as a graph. It also records column information such as name, type, and description.
When it comes to keeping track of access, a user with the ACCOUNTADMIN role can view all logins into the system in the LOGIN_HISTORY view. Statistics about all queries executed in Snowflake are available in the QUERY_HISTORY view.
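A minimal sketch of how this access history might be pulled is shown below. It assumes the data is read from the shared `SNOWFLAKE.ACCOUNT_USAGE` schema and that a look-back window in days is enough for the audit; the function names and the selected columns are illustrative choices, not a prescribed report.

```python
def recent_logins_sql(days: int) -> str:
    """Build a query listing login attempts from the last `days` days."""
    return (
        "SELECT user_name, event_timestamp, client_ip, is_success\n"
        "FROM snowflake.account_usage.login_history\n"
        f"WHERE event_timestamp >= DATEADD(day, -{int(days)}, CURRENT_TIMESTAMP())\n"
        "ORDER BY event_timestamp DESC;"
    )


def recent_queries_sql(days: int) -> str:
    """Build a query listing executed statements from the last `days` days."""
    return (
        "SELECT user_name, role_name, query_text, start_time, total_elapsed_time\n"
        "FROM snowflake.account_usage.query_history\n"
        f"WHERE start_time >= DATEADD(day, -{int(days)}, CURRENT_TIMESTAMP())\n"
        "ORDER BY start_time DESC;"
    )


print(recent_logins_sql(7))
print(recent_queries_sql(1))
```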
A data catalog is an inventory of all data sets available for use by the end-users. Cataloging data lets business users know what data is available in the system.
Snowflake supports cataloging via a private Data Exchange. It is your own data hub to securely collaborate around data between a selected group of members you invite. It enables providers to publish data that consumers can then discover. You can also have personalized listings, where users request data that can then be shared after a review.
Note: Data Exchange is currently in preview, but the data published to your Data Exchange can be used in production because it is made accessible through Secure Data Sharing
Snowflake follows role-based access control (RBAC), where privileges are granted to roles and roles are granted to users. Ownership of an object is set at the role level: when a user creates a table, the OWNERSHIP privilege is granted to the role the user was using at the time.
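The role-based model can be illustrated with a small sketch that generates the grants for a read-only role. The role, database, schema, and user names are hypothetical, and the set of privileges is just one plausible choice for an analyst who only needs to query tables.

```python
def read_only_role_sql(role: str, database: str, schema: str, user: str) -> list:
    """Return the statements that set up a read-only role and assign it to a user."""
    qualified = f"{database}.{schema}"
    return [
        f"CREATE ROLE IF NOT EXISTS {role};",
        # Privileges go to the role, never directly to the user.
        f"GRANT USAGE ON DATABASE {database} TO ROLE {role};",
        f"GRANT USAGE ON SCHEMA {qualified} TO ROLE {role};",
        f"GRANT SELECT ON ALL TABLES IN SCHEMA {qualified} TO ROLE {role};",
        # The user gains access only by being granted the role.
        f"GRANT ROLE {role} TO USER {user};",
    ]


# Hypothetical names, for illustration only.
for statement in read_only_role_sql("analyst", "sales", "public", "jane"):
    print(statement)
```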
Snowflake supports both column-level and row-level access policies. Masking policies are schema-level objects that are applied to columns and mask data at query time. A row access policy can be set on a specific table or view and determines which rows are visible in a query result.
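Both kinds of policy can be sketched as generated SQL. In the example below, the policy, table, column, and role names are hypothetical, and the rules themselves (one role sees unmasked values; one role sees all rows while everyone else sees a single region) are simplified assumptions chosen only to show the shape of the statements.

```python
def masking_policy_sql(policy: str, table: str, column: str, allowed_role: str) -> list:
    """Create a string masking policy and attach it to one column."""
    return [
        f"CREATE MASKING POLICY {policy} AS (val STRING) RETURNS STRING ->\n"
        f"  CASE WHEN CURRENT_ROLE() = '{allowed_role}' THEN val ELSE '*****' END;",
        f"ALTER TABLE {table} MODIFY COLUMN {column} SET MASKING POLICY {policy};",
    ]


def row_access_policy_sql(policy: str, table: str, column: str,
                          allowed_value: str, admin_role: str) -> list:
    """Create a row access policy on one column and attach it to a table."""
    return [
        f"CREATE ROW ACCESS POLICY {policy} AS ({column} STRING) RETURNS BOOLEAN ->\n"
        f"  CURRENT_ROLE() = '{admin_role}' OR {column} = '{allowed_value}';",
        f"ALTER TABLE {table} ADD ROW ACCESS POLICY {policy} ON ({column});",
    ]


# Hypothetical names, for illustration only.
for statement in masking_policy_sql("email_mask", "customers", "email", "PII_READER"):
    print(statement)
for statement in row_access_policy_sql("region_policy", "customers", "region",
                                       "EMEA", "GLOBAL_ANALYST"):
    print(statement)
```

Because the policy is evaluated at query time, the stored data is unchanged; only what each role sees in the result differs.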
Thus, we can see that Snowflake is still climbing the ladder toward self-sufficiency in data governance. Data profiling, data lineage, collaboration, and quality monitoring are some of the features required for better data governance that it does not yet fully provide. But worry not! Snowflake has a list of certified partners who can help fill these gaps.
In the upcoming blog, we will take a sneak peek at Axon, the data governance tool from Informatica. Learn what its capabilities are and whether it will help govern data in your organization.
Learn more about Visual BI’s Snowflake offerings here.