Our previous blog posts in the VORA series showed how VORA could process different types of data (streaming, semi-structured, unstructured) and combine them with enterprise data to perform a sentiment analysis. In this post, data mining application is presented by comparing SAP HANA Predictive Analytics Library (PAL) and Spark’s Machine learning library (Spark ML). Data mining is part of the knowledge discovery process and more companies are utilizing data mining to ensure data-driven decision making. It is especially useful for the tasks that can be automated but hard to do with rule-based decision making.
Big data mining however, is often considered challenging because of its size and the feature dimension. It also requires the distributed/parallel computing environment and algorithms that are specially designed to handle various data types. To meet these needs, there are many open source big data mining tools available. Upon checking each tool’s algorithms and data type coverages, a benchmark data set is typically used to get an idea of speed and accuracy of the candidate tools. For VORA, SparkML can be used to apply advanced machine learning algorithms and for SAP HANA, SAP PAL offers in-database machine learning applications. Hence, side by side comparison for both tools are presented by following the data mining procedures (Fig. 1). The focus is to walk through step-by-step procedures of how to apply ML libraries in both platforms, NOT on how to build an optimal prediction model.
Step 1: Data Loading
To load tables from HANA to VORA, either VORA tools (GUI) or a SQL command can be used by providing database schema, host and instance details. Since VORA takes the table schema from HANA, there is no need to define the table schema again. Also, as VORA uses Spark as a processing engine, when tables are registered to Spark, data gets automatically partitioned or divided to the user assigned partition number. For example, if there are 100 executors in the Spark cluster, up to 100 data partitions can be given to optimize the parallel processing capabilities. Regarding the data loading time, reading HANA tables in VORA is instant because VORA creates virtual tables of HANA tables instead of transferring data physically (Fig. 2).
The benchmark data set used in this post consists of flight arrival and departure details for all commercial flights within the USA. This data set is commonly used to build a classification model to predict whether a flight will be delayed or not. The target variable, ‘ArrDelay’, indicates how early or late the plane arrived in minutes (Fig. 3). With this classification model, airline companies can simulate flight scheduling by inputting parameters such as time of the day, day of week and time of year to fly to minimize delays or to investigate the main causes for the arrival delays.
Splitting Data into Training and Testing Subsets
Datasets need to be divided into train and test datasets to prevent the ML model from overfitting. The train dataset is used to build the model and test set is used to test the model’s performance. PAL offers 2 partition methods; random partition and stratified partition. Stratified partition scheme is useful for the imbalanced dataset because it divides the dataset by rebalancing the proportions of the target variables. SparkML also provides random partition but it does not have stratified partition (as of Version 2.1.0). For this demo, 1 million flight records were randomly selected as a training set and 100K records were used as a test dataset.
Step 2: Train a Model on Training Set
- Data Cleaning and Feature Processing
Prior to training the model, datasets normally go through pre-processing and feature engineering needs to be performed. These procedures typically take the most effort and highly affect the model’s performance. In this demo however, these two tasks were applied to minimum. For PAL, no pre-processing was needed. For SparkML, categorical features had to be converted to numeric types because most algorithms in SparkML take only numeric vectors as input. To do this transformation, Stringindexer and VectorAssembler functions were used to convert categorical variables to numeric and combine them to one feature vector. Additionally, Binarizer was used to convert the continuous values in ‘ArrDelay’ column to the binary output, ‘0’ or ‘1’, based on the used-defined threshold (e.g.: “0” if the delay was less than 15 minutes, “1” otherwise).
Though PAL offers some of the feature engineering functions, compared to SparkML, available options are limited (Fig 4). With feature transforming/extracting/selecting functions in SparkML, various data types can be handled including text-based data.
- Model Building
To use PAL in HANA, ML procedures were scripted in SQL. There are three steps to train a model; procedure set up -> data table set up -> model parameter set up. First, procedures are created by setting the data types of the input table and the desired output tables. Then, the ML algorithm (Random forest in this case) can be called by specifying the input table, and the model parameters such as maximum tree depth, tree numbers and the number of threads. After building a model, output tables with metrics like variable importance or confusion matrix can also be checked. If there are multiple-nodes HANA is connected and the data is distributed into that cluster, ‘PARALLEL_BY_PARAMETER_PARTITIONS’ function can be added to the call function to run parallel execution. For SparkML, the data set was partitioned into 4 subsets. The number of trees and tree depth parameters were set to 100 and 10 respectively for both models. System specifications are as follows:
For HANA Sandbox
- Number of cores: 14
- Total system memory: 120 GiB
For Spark ML
- Number of cores: 4
- Total system memory: 8 GiB
(AWS EC2 instance: r3.4xlarge)
- Number of cores: 16
- Total system memory: 122 GiB
Total time took to build a model was 444 seconds for PAL and 651 seconds for SparkML. With EC2 instance, it took only 249 seconds.
- Parameter Optimization
PAL currently offers parameter selection for limited algorithms; logistic regression, naïve bayes, support vector machine and random forest. Whereas SparkML offers auto-parameter tuning (Grid search) that users can set the grid for the range of parameters and the number of cross validation folds. This task normally takes extensive execution time since it requires (cross validation number * number of grid search sets) more computations. For instance, if there are 4 sets in the parameter map with 3-fold cross validation, 12 full random forest models must be built before it can conclude which parameter set is optimal. Although grid search is expansive, the existing model can be improved significantly with the tuned parameters.
Step 3: Make Predictions of the test Dataset
From Step 2, a random forest model was built. Now we can use the model to predict the test dataset. For PAL, scripts for the model prediction is as lengthy as scripts used for model building. The entire step (procedure, data, parameter set up) needs to be defined again. On the other hand, testing model requires just 1 line of code in SparkML because of it uses the ‘pipeline’ (Fig. 5). SparkML Pipeline is like a work flow, which stores a sequence of tasks that a raw dataset needs to go through to be used in the ML algorithm. Since the pipeline was already defined while training the model, the same pipeline can be used for the test set, transforming the test dataset ready to be plugged into the prediction model.
Step 4: Compare Predicted and True Label
For the evaluation metric of a classification model, a confusion matrix can be used to calculate the precision and recall ratio. Both models showed a pretty low recall rate. SparkML showed a better precision rate but they should be optimized if they were to be deployed in production. Especially, the model in PAL showed a AUC value (Area under the ROC curve) of 0.53, indicating the model performs very poorly for differentiating target variables.
RF Model in SAP PAL
|predicted||TP 1340||FP 625|
|NOT predicted||FN 20277||TN 77758|
- Precision = 1340/(1340+625) = 68%
- Recall = 1340/(1340+20277) = 6.2%
- AUC = 0.53
RF Model in SparkML
|predicted||TP 1155||FP 363|
|NOT predicted||FN 20462||TN 78020|
- Precision = 1155/(1155+363) = 76%
- Recall = 1155/(1155+20462) = 5.3%
- AUC = 0.71
Regarding variable importance, PAL and SparkML both chose ‘Departure time’ as the most deciding factor for the flight delay then origin and distance. When the data mining is completed, results in SprakML can either be stored as VORA tables or written back to HANA. Using HANA SDA (Smart Data Access) is another way to access VORA tables without moving the data physically (Fig. 6).
To summarize, here is comparison table. SparkML outperformed PAL in the run time and the accuracy of the model. More algorithms and ways to optimize the current models will be covered in the next post. Stay tuned!
|PAL in HANA (2.0)||SparkML (2.1.0)|
|Programming Languages||SQL||Scala, R, Python|
|Run time||444 seconds||249 seconds|
|Auto-tuning (grid search), Optimization||Available|
|Data mining tasks||PAL in HANA (2.0)||SparkML (2.1.0)|
|Frequent Pattern Mining||X||X|
Want to know more? Click here to get in touch.