Calgary is a list count-based data retrieval engine which uses indexing methodology to read the data faster. To make it simple you can say as each value of a column is assigned to an index value and maintained as a data structure and values of each column are stored in a separate directory. When you want to read the data, the engine reads the required structure and joins the columns based on the index and creates the output.
While working with big data files of volume over 100 million records, the time taken to process them will be a major concern. You will not have all day to generate a report for the decision-makers. That’s where the Calgary database plays an important role to improve the performance. This data engine format is designed in such a way to handle 100 – 300 million records with greater performance than any other file format. To the max, it can hold 2 Billion(2^31) records.
You can access this feature with the below Calgary tools available in the Alteryx tool palette.
Calgary Loader tool is used to create a Calgary database from any type of Input file.
Calgary Input Tool is used to read the Calgary database file.
Calgary join Tool is used to join an input file with a Calgary database.
Calgary Cross Count Tool is used to aggregate data across multiple Calgary database fields to find the count per record group.
Calgary Cross Count Append Tool is used to read the input file and append the count of records that join to the Calgary database.
The file format of Calgary database is “CYDB”. With the .cydb file, you will also find files created with the extensions .cyix, which are the index files of each field in the data. Every time you create a .cydb file, the index files are generated for every field in the data.
Here, the XML file stores the details about each field. Let us see a small example of performance improvement on using Calgary data base. We have two files of CSV format and CYDB format with same data with 16.5 million records that have the sales data. We need to combine the cost of each item to sale data to find profit.
So, we are going to join these two files using normal join and the Calgary join using the workflow as shown below. While using the Calgary join, you need not add another input tool to rad the Calgary file. You can directly select the file in the Calgary join tool configuration window.
For joining these two files using the normal join, it takes like 38.5 seconds. If we are using the Calgary database feature to join these files it takes only 7 seconds.
You can see that we are getting a performance that is 5 times better by using CYDB instead of CSV file. Let us see more about the usability of Calgary database in the upcoming blog.