The previous article explained the purpose of VORA and its core features. This article discusses VORA’s new engines: the time series, graph, and document store engines.
Time Series Engine
VORA’s time series engine handles timestamped data such as sensor or transaction-based data (e.g. IoT, web log, and clickstream data). Processing time series data in a traditional database can be difficult because of the volume and velocity of the data. Such data calls for suitable compression techniques (Fig. 1), partitioning schemes, and granularization support. With a suitable compression technique, only the data points that represent the trend, or that deviate by more than a given error percentage, need to be recorded, which saves storage space compared with recording every data point.
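To make the idea of error-bounded compression concrete, here is a minimal Python sketch of a simple dead-band filter: a point is kept only when it deviates from the last kept value by more than a given percentage. This is an illustration of the general idea, not VORA’s actual compression algorithm; the function name `compress`, the parameter `error_pct`, and the sample readings are all assumptions made for this example.

```python
from typing import List, Tuple

Point = Tuple[int, float]  # (timestamp, value); hypothetical representation


def compress(points: List[Point], error_pct: float) -> List[Point]:
    """Keep a point only if it differs from the last kept value
    by more than error_pct percent (a simple dead-band filter)."""
    if not points:
        return []
    kept = [points[0]]  # always keep the first observation
    for ts, value in points[1:]:
        last_value = kept[-1][1]
        # Use the last kept value as the baseline; avoid dividing by zero.
        baseline = abs(last_value) if last_value != 0 else 1.0
        deviation_pct = abs(value - last_value) / baseline * 100.0
        if deviation_pct > error_pct:
            kept.append((ts, value))
    return kept


# Made-up sensor readings, for illustration only.
readings = [(0, 100.0), (1, 100.5), (2, 101.0), (3, 110.0), (4, 110.2), (5, 95.0)]
compressed = compress(readings, error_pct=5.0)
print(compressed)  # three of the six points survive
```

A real engine would typically use a more sophisticated scheme that bounds the reconstruction error over whole segments, but the storage trade-off is the same: a larger error tolerance keeps fewer points.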
Here is an example use case that demonstrates aggregation and column functions on sales data used in Walmart’s sales forecasting. In this dataset, the sales of different departments and other measures such as temperature, fuel price, CPI (Consumer Price Index), and unemployment rate were recorded weekly. The dataset was first stored in HDFS and then loaded into VORA’s time series engine (Fig. 2).
Once the data is loaded, analysis can be done by simply calling built-in functions. Queries can be executed in the VORA tools, the Spark shell, or Zeppelin. Table 1 shows the function commands and the outputs in Zeppelin.
|Function|Description|
|---|---|
|Trend|Trend function fits a linear regression line for the data points between the specified period and returns the slope of the regression. Positive value indicates the increasing sales trend.|
|Median|Aggregation functions (sum, average, median, mode, min, max, count) provide the summary for each column.|
|Histogram|Creates a histogram of the selected column with the specified number of bins.|
|Granulize|Modifies intervals between timestamps and returns the new series (e.g. from 7 days to 1 month).|
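For readers who want to see what these functions compute, the following Python sketch reproduces the trend, granulize, and histogram calculations on a small made-up weekly series. It does not use VORA’s SQL syntax; the function names `trend_slope`, `granulize_monthly`, and `histogram`, as well as the sample values, are assumptions for illustration only.

```python
from collections import defaultdict
from datetime import date, timedelta
from typing import Dict, List, Tuple


def trend_slope(values: List[float]) -> float:
    """Least-squares slope of the values against their index 0, 1, 2, ..."""
    n = len(values)
    if n < 2:
        raise ValueError("need at least two points to fit a line")
    mean_x = (n - 1) / 2
    mean_y = sum(values) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in enumerate(values))
    var = sum((x - mean_x) ** 2 for x in range(n))
    return cov / var


def granulize_monthly(series: List[Tuple[date, float]]) -> Dict[Tuple[int, int], float]:
    """Re-bucket a finer-grained series into (year, month) totals."""
    buckets: Dict[Tuple[int, int], float] = defaultdict(float)
    for day, value in series:
        buckets[(day.year, day.month)] += value
    return dict(buckets)


def histogram(values: List[float], bins: int) -> List[int]:
    """Count values in equal-width bins between min and max."""
    lo, hi = min(values), max(values)
    width = (hi - lo) / bins or 1.0  # guard against a constant series
    counts = [0] * bins
    for v in values:
        idx = min(int((v - lo) / width), bins - 1)  # max value goes in last bin
        counts[idx] += 1
    return counts


# Made-up weekly sales figures, one observation every 7 days.
start = date(2010, 2, 5)
sales = [10.0, 12.0, 14.0, 16.0, 18.0, 20.0]
weekly = [(start + timedelta(weeks=i), v) for i, v in enumerate(sales)]

slope = trend_slope(sales)            # positive slope: sales are increasing
monthly = granulize_monthly(weekly)   # weekly values summed per month
counts = histogram(sales, bins=2)
```

Here the granulization sums the weekly values per month; depending on the measure, an average or another aggregate may be the more appropriate choice.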
Graph Engine
For a large, highly linked data set, storing the data in a graph engine can make queries more efficient. Graph models handle one-to-many relationships more simply by converting foreign keys in an RDBMS into relations between nodes, which reduces the complexity of joining tables. Also, like the document store engine, the graph engine is schema-free: data can be inserted without declaring its data types first, which makes it suitable for adding new data with different structures in real time.
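The conversion of foreign keys into relations can be sketched in a few lines of Python. The rows, the `director_id` foreign key, and the helper `rows_to_edges` below are hypothetical and only illustrate the idea; they are not VORA’s import mechanism.

```python
from collections import defaultdict
from typing import Dict, List, Tuple

Node = Tuple[str, int]  # (node type, id); hypothetical node identifier

# Hypothetical relational rows: each movie row has a foreign key to a director.
movies = [
    {"id": 10, "title": "Movie X", "director_id": 1},
    {"id": 11, "title": "Movie Y", "director_id": 1},
    {"id": 12, "title": "Movie Z", "director_id": 2},
]


def rows_to_edges(rows: List[dict], fk: str, parent: str, child: str) -> Dict[Node, List[Node]]:
    """Turn each foreign-key reference into an edge parent -> child."""
    edges: Dict[Node, List[Node]] = defaultdict(list)
    for row in rows:
        edges[(parent, row[fk])].append((child, row["id"]))
    return dict(edges)


graph = rows_to_edges(movies, fk="director_id", parent="director", child="movie")

# The one-to-many lookup is now a neighbor lookup instead of a table join.
directed_by_1 = graph[("director", 1)]
```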
To insert a dataset into VORA’s graph engine, the data should be in a JSG file. JSG is a line-based JSON format that consists of the following elements:
Common graph functions are available, such as calculating the shortest path between two nodes (directed or undirected), finding node degrees (in, out, sum), finding connected components (strong, weak), and graph pattern matching. Example SQL commands and their outputs using the movie data are available here. Figure 5 summarizes how the VORA graph engine differs from the HANA graph engine.
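To illustrate what these graph functions compute, here is a small Python sketch of shortest path (breadth-first search on an unweighted graph), node degrees, and weakly connected components. It is a generic illustration, not VORA’s implementation or SQL syntax; the edge list and all function names are made up for this example.

```python
from collections import defaultdict, deque
from typing import Dict, List, Optional, Tuple

Edge = Tuple[str, str]

# Made-up directed edges.
edges: List[Edge] = [("a", "b"), ("b", "c"), ("a", "c"), ("d", "e")]


def build_adjacency(edges: List[Edge], directed: bool = True) -> Dict[str, List[str]]:
    adj: Dict[str, List[str]] = defaultdict(list)
    for src, dst in edges:
        adj[src].append(dst)
        adj.setdefault(dst, [])  # make sure sink nodes are present too
        if not directed:
            adj[dst].append(src)
    return adj


def shortest_path(adj: Dict[str, List[str]], start: str, goal: str) -> Optional[List[str]]:
    """Breadth-first search; returns one shortest path or None if unreachable."""
    parents: Dict[str, Optional[str]] = {start: None}
    queue = deque([start])
    while queue:
        node = queue.popleft()
        if node == goal:
            path = []
            cur: Optional[str] = node
            while cur is not None:  # walk back to the start
                path.append(cur)
                cur = parents[cur]
            return path[::-1]
        for nxt in adj.get(node, []):
            if nxt not in parents:
                parents[nxt] = node
                queue.append(nxt)
    return None


def degrees(edges: List[Edge]) -> Tuple[Dict[str, int], Dict[str, int]]:
    """Return (in-degree, out-degree) counts per node."""
    in_deg: Dict[str, int] = defaultdict(int)
    out_deg: Dict[str, int] = defaultdict(int)
    for src, dst in edges:
        out_deg[src] += 1
        in_deg[dst] += 1
    return in_deg, out_deg


def weak_components(edges: List[Edge]) -> List[List[str]]:
    """Connected components when edge direction is ignored."""
    adj = build_adjacency(edges, directed=False)
    seen, components = set(), []
    for node in sorted(adj):
        if node in seen:
            continue
        seen.add(node)
        component, queue = [], deque([node])
        while queue:
            cur = queue.popleft()
            component.append(cur)
            for nxt in adj[cur]:
                if nxt not in seen:
                    seen.add(nxt)
                    queue.append(nxt)
        components.append(sorted(component))
    return components


directed_adj = build_adjacency(edges, directed=True)
undirected_adj = build_adjacency(edges, directed=False)
in_deg, out_deg = degrees(edges)
```

Note how the directed and undirected variants differ: in this sample there is a directed path from `a` to `c` but none from `c` to `a`, while the undirected graph connects them in both directions.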
Document Store Engine
Just like MongoDB or CouchDB, VORA’s document store engine accepts document-oriented data. Document-oriented data sets are semi-structured, meaning that documents (analogous to rows in an RDBMS) in the same collection (analogous to a table in an RDBMS) can have different fields. In a relational database, all rows in a table must have the same structure, and the data type of each attribute must be defined before the data is imported. With the document store engine, however, no predefined table schema is needed to import data. Figure 6 shows an example of two documents that can be stored in one collection. These two documents do not have the same fields or data types. Adding these two records to one table is not possible in an RDBMS because of the enforced table schema.
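The schema-free behavior can be sketched in plain Python, using a list of dictionaries as a stand-in for a collection. The two documents below are hypothetical (they are not the ones shown in Figure 6), and the helper `find` is an assumption for illustration, not a VORA API.

```python
from typing import Any, Callable, Dict, List

Document = Dict[str, Any]

# A collection is just a list of documents; no schema is declared up front.
collection: List[Document] = []

# Two hypothetical documents with different fields and different types
# for the same field ("age" is an int in one and a string in the other).
collection.append({"name": "Alice", "age": 30, "address": {"city": "Berlin"}})
collection.append({"name": "Bob", "age": "unknown", "emails": ["bob@example.com"]})


def find(docs: List[Document], field: str, predicate: Callable[[Any], bool]) -> List[Document]:
    """Return documents that contain the field and satisfy the predicate.
    Documents lacking the field are skipped rather than raising an error."""
    return [d for d in docs if field in d and predicate(d[field])]


# Only documents that actually have an integer age can match this filter.
adults = find(collection, "age", lambda v: isinstance(v, int) and v >= 18)

# The set of fields across the collection is the union of all document fields.
all_fields = sorted({key for doc in collection for key in doc})
```

In a relational table, the second insert would be rejected (or would need a schema change first), whereas here both documents sit side by side and queries simply ignore documents that lack the requested field.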
The query syntax is the same as the syntax used in the relational engine, except that queries need to be wrapped in back quotes (` `). Tables in the document store engine can be joined with tables in other engines (Fig. 7).
Additionally, VORA has a disk engine, which stores data on disk instead of in memory. Tables saved in VORA’s disk engine can be accessed directly from SAP HANA using a HANA Wire Connector. VORA’s connections to other SAP products will be covered separately in another blog post.
To summarize, VORA provides diverse engines in one platform where you can combine various data sources. This reduces the complexity of setting up individual tools and makes it easy to bring processed data into SAP HANA.
In the next article in this series, machine learning applications using SAP PAL and SparkML will be compared in detail, along with how VORA can help this process.