Snowflake, the data warehouse built for the cloud, has been making a stronger name for itself in the BI industry day by day. Along with cutting-edge features such as Zero-Copy Cloning, Time Travel and its dynamic caching mechanism, Snowflake now offers a few more productive utilities: Streams, for Change Data Capture, and Tasks, a built-in scheduling mechanism for our jobs.
For any data warehouse, Change Data Capture (CDC for short) is very important for large tables, since data volumes can be humongous, especially in transactional systems. Imagine millions of records coming into our system daily: every day there could be new transactions, updates to previous transactions, or even soft/hard deletes of a few transactions. A CDC mechanism is essential because we can't load the data fully each time (an initial load every run) — it would be costly in both time and money. Snowflake tackles this problem with the concept of Streams. A Stream object keeps track of all the DML changes made to a table, along with metadata about each change, so that we can use it to extract only the changed data.
A Stream takes an initial snapshot of the rows in the source table by recording a point in time as the current version of the table. Any DML change committed after that point is tracked, and the Stream exposes it through additional metadata columns describing each change. A Stream by itself doesn't store the table's data; it adds three metadata columns (METADATA$ROW_ID, METADATA$ISUPDATE and METADATA$ACTION), and with these we can detect whether a row was inserted, updated or deleted. These three columns accompany every row the Stream returns, and for each Stream object we are charged only for the small amount of storage these tracking columns require.
How do Streams work?
Let's assume we have a table with ID, Name and Salary columns containing the following data:
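As a sketch of the setup, the table could be created and populated like this. The table name MEMBERS and the specific sample rows (other than Hari, who appears later in this article) are assumptions for illustration:

```sql
-- Illustrative setup: table name MEMBERS and most sample values are assumptions
CREATE OR REPLACE TABLE members (
    id     INT,
    name   VARCHAR,
    salary NUMBER
);

INSERT INTO members (id, name, salary) VALUES
    (1, 'Ram',  10000),
    (2, 'Hari', 20000),
    (3, 'Sita', 30000);
```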
Now I create a stream on top of this table, and this is how my stream looks:
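Creating the stream is a one-line statement; querying it then returns the tracked changes plus the three METADATA$ columns (this sketch assumes the MEMBERS table above):

```sql
-- Create a stream on the source table; MEMBER_CHECK is the stream name used in this article
CREATE OR REPLACE STREAM member_check ON TABLE members;

-- Inspect the stream: the METADATA$ACTION, METADATA$ISUPDATE and
-- METADATA$ROW_ID columns appear alongside the table's own columns
SELECT * FROM member_check;
```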
Note: If we change a row in the source table before its earlier insert has been consumed from the Stream, the change is still reported as an insert, not an update, because a Stream returns the net change since the last consumption.
As we can see, three new columns have been introduced, and since this is the initial load everything appears as an insert. Now I consume this MEMBER_CHECK Stream object and insert its rows into another table.
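Consuming a stream means using it as the source of a DML statement, which advances its offset. A minimal sketch, assuming a target table named MEMBERS_TARGET (the name is an assumption):

```sql
-- Target table name is an assumption for this example
CREATE TABLE IF NOT EXISTS members_target LIKE members;

-- Reading the stream inside a DML statement consumes it and advances its offset
INSERT INTO members_target
SELECT id, name, salary
FROM member_check
WHERE metadata$action = 'INSERT';
```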
After I have consumed it, my Stream object is empty.
Now we go and make a change in the source table:
We increase the Salary of all the members by 500, and then check the table:
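The update and the resulting stream contents can be sketched as follows (assuming the MEMBERS table and MEMBER_CHECK stream from the earlier examples):

```sql
-- Increase every member's salary by 500
UPDATE members SET salary = salary + 500;

-- The stream now shows each updated row twice:
--   the old version with METADATA$ACTION = 'DELETE', and
--   the new version with METADATA$ACTION = 'INSERT',
-- both carrying METADATA$ISUPDATE = TRUE
SELECT * FROM member_check;
```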
So now we can see that for each updated row, the Stream object shows the new value as an INSERT with ISUPDATE set to TRUE, and the previous value as a DELETE. We can now consume this Stream object and make the corresponding changes in the target table.
Similarly, when we delete a row or insert a new one, the Stream object tracks those changes as shown below (we have deleted Hari and added a new entry, Mon):
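A sketch of that delete and insert (Mon's id and salary are assumptions; Hari and Mon are from the example above):

```sql
-- Delete one member and add a new one, as in the example
DELETE FROM members WHERE name = 'Hari';
INSERT INTO members (id, name, salary) VALUES (4, 'Mon', 25000);  -- id/salary assumed

-- The deleted row appears with METADATA$ACTION = 'DELETE' and METADATA$ISUPDATE = FALSE;
-- the new row appears with METADATA$ACTION = 'INSERT' and METADATA$ISUPDATE = FALSE
SELECT * FROM member_check;
```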
This feature proves very helpful when there are millions of records being transacted every day but you only want to apply the changed ones. Instead of doing a full load, we can combine Streams with the MERGE INTO command to update the target.
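One common pattern (a sketch under the assumptions of the earlier examples, with MEMBERS_TARGET as an assumed target name) is to merge the stream into the target, skipping the old image of each update:

```sql
MERGE INTO members_target t
USING (
    -- Skip the DELETE half of each update pair; keep real deletes and all inserts
    SELECT * FROM member_check
    WHERE NOT (metadata$action = 'DELETE' AND metadata$isupdate)
) s
ON t.id = s.id
WHEN MATCHED AND s.metadata$action = 'DELETE' THEN
    DELETE
WHEN MATCHED AND s.metadata$action = 'INSERT' THEN
    UPDATE SET t.name = s.name, t.salary = s.salary
WHEN NOT MATCHED AND s.metadata$action = 'INSERT' THEN
    INSERT (id, name, salary) VALUES (s.id, s.name, s.salary);
```

Running the MERGE consumes the stream, so the next run sees only changes made after this one.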
For any data warehouse, the frequency of data loads is important, as it defines how close to real time the data we view is. Snowflake offers continuous loading of data through Snowpipe, but once the data is loaded we might need additional transformations on top of it before it is used in reports or visualization tools. For simple transformations, we can use Snowflake Stored Procedures, User-Defined Functions, etc., and schedule them so that whenever data changes in the table (which we can detect with the help of Streams) a Task is automatically triggered to run our transformation process.
We can define conditions on when a task should be triggered, such as a Boolean expression: if it evaluates to true, the task runs. We can also define the schedule frequency — whether the task should run every few minutes, every few hours, on a particular day, or on a particular date of the month.
For the same example above, we can create a task that automatically inserts any new data entered into our table every 5 minutes. This way, whenever new data arrives, it automatically appears in the target.
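Such a task could be sketched as below, using the Stream from the earlier examples; the task and warehouse names are assumptions:

```sql
-- Task and warehouse names are assumptions for this sketch
CREATE OR REPLACE TASK load_member_changes
    WAREHOUSE = transform_wh
    SCHEDULE  = '5 MINUTE'
    -- Only fire when the stream actually has unconsumed changes
    WHEN SYSTEM$STREAM_HAS_DATA('MEMBER_CHECK')
AS
    INSERT INTO members_target
    SELECT id, name, salary
    FROM member_check
    WHERE metadata$action = 'INSERT';

-- Tasks are created in a suspended state and must be resumed to start running
ALTER TASK load_member_changes RESUME;
```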