Our previous blog posts in the VORA series showed how VORA could process different types of data (streaming, semi-structured, unstructured) and combine them with enterprise data to perform a sentiment analysis. In this post, a data mining application is presented by comparing the SAP HANA Predictive Analysis Library (PAL) and Spark's machine learning library (Spark ML). Data mining is part of the knowledge discovery process, and more companies are utilizing it to ensure data-driven decision making. It is especially useful for tasks that can be automated but are hard to handle with rule-based decision making.

Big data mining, however, is often considered challenging because of data volume and feature dimensionality. It also requires a distributed/parallel computing environment and algorithms that are specially designed to handle various data types. To meet these needs, there are many open source big data mining tools available. After checking each tool's algorithm and data type coverage, a benchmark data set is typically used to gauge the speed and accuracy of the candidate tools. For VORA, Spark ML can be used to apply advanced machine learning algorithms, and for SAP HANA, PAL offers in-database machine learning applications. Hence, a side-by-side comparison of both tools is presented, following the data mining procedure (Fig. 1). The focus is on walking through the step-by-step procedure of applying ML libraries on both platforms, NOT on how to build an optimal prediction model.

SAP HANA VORA - Machine Learning in Enterprise Computing - 1

Figure 1: Machine Learning Process in Enterprise Computing

Step 1: Data Loading

To load tables from HANA into VORA, either the VORA tools (GUI) or a SQL command can be used by providing the database schema, host, and instance details. Since VORA takes the table schema from HANA, there is no need to define the table schema again. Also, as VORA uses Spark as its processing engine, when tables are registered with Spark the data is automatically partitioned into the user-assigned number of partitions. For example, if there are 100 executors in the Spark cluster, up to 100 data partitions can be created to make full use of the parallel processing capabilities. As for data loading time, reading HANA tables in VORA is nearly instant because VORA creates virtual tables over the HANA tables instead of physically transferring the data (Fig. 2).

The benchmark data set used in this post consists of flight arrival and departure details for all commercial flights within the USA. This data set is commonly used to build a classification model that predicts whether a flight will be delayed or not. The target variable, 'ArrDelay', indicates how early or late the plane arrived, in minutes (Fig. 3). With this classification model, airline companies can simulate flight scheduling by inputting parameters such as time of day, day of week, and time of year, either to minimize delays or to investigate the main causes of arrival delays.

SAP HANA VORA - Machine Learning in Enterprise Computing - 2.1

Figure 2: Creating Tables in VORA from SAP HANA

SAP HANA VORA - Machine Learning in Enterprise Computing - 3

Figure 3: Benchmark Dataset – Historical Flight Records

Splitting Data into Training and Testing Subsets

Datasets need to be divided into training and test subsets to prevent the ML model from overfitting. The training set is used to build the model, and the test set is used to evaluate the model's performance. PAL offers two partition methods: random partition and stratified partition. The stratified partition scheme is useful for imbalanced datasets because it divides the dataset while preserving the proportions of the target variable in each subset. Spark ML also provides random partitioning, but it does not have stratified partitioning (as of version 2.1.0). For this demo, 1 million flight records were randomly selected as the training set and 100K records were used as the test set.
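The difference between the two partition schemes can be sketched in plain Python (an illustrative sketch only; PAL's actual implementation runs in-database, and the toy labels below are made up):

```python
import random
from collections import defaultdict

def random_split(rows, train_frac=0.9, seed=42):
    """Shuffle and cut: class proportions may drift between the subsets."""
    rng = random.Random(seed)
    shuffled = rows[:]
    rng.shuffle(shuffled)
    cut = int(len(shuffled) * train_frac)
    return shuffled[:cut], shuffled[cut:]

def stratified_split(rows, label, train_frac=0.9, seed=42):
    """Split each class separately so both subsets keep the label ratio."""
    rng = random.Random(seed)
    by_class = defaultdict(list)
    for row in rows:
        by_class[row[label]].append(row)
    train, test = [], []
    for members in by_class.values():
        rng.shuffle(members)
        cut = int(len(members) * train_frac)
        train.extend(members[:cut])
        test.extend(members[cut:])
    return train, test

# Imbalanced toy data: 90% on-time ("0"), 10% delayed ("1")
rows = [{"label": "0"} for _ in range(900)] + [{"label": "1"} for _ in range(100)]
train, test = stratified_split(rows, "label")
# The 9:1 class ratio is preserved in both subsets.
```

With a heavily imbalanced target such as flight delays, a purely random split can leave the rare class under-represented in one subset, which is why the stratified option in PAL is worth having.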

Step 2: Train a Model on Training Set

  • Data Cleaning and Feature Processing

Prior to training the model, datasets normally go through pre-processing, and feature engineering needs to be performed. These procedures typically take the most effort and strongly affect the model's performance. In this demo, however, these two tasks were kept to a minimum. For PAL, no pre-processing was needed. For Spark ML, categorical features had to be converted to numeric types because most algorithms in Spark ML accept only numeric vectors as input. To do this transformation, the StringIndexer and VectorAssembler functions were used to convert categorical variables to numeric ones and combine them into one feature vector. Additionally, Binarizer was used to convert the continuous values in the 'ArrDelay' column to a binary output, '0' or '1', based on a user-defined threshold (e.g., '0' if the delay was less than 15 minutes, '1' otherwise).
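For intuition, here is what these transformations do conceptually, sketched in plain Python rather than the Spark ML API (the real StringIndexer assigns indices by descending category frequency; the airport codes below are illustrative):

```python
def string_index(values):
    """StringIndexer idea: map each category to a numeric index,
    most frequent category first (index 0.0)."""
    freq = {}
    for v in values:
        freq[v] = freq.get(v, 0) + 1
    ordered = sorted(freq, key=lambda v: -freq[v])
    mapping = {v: float(i) for i, v in enumerate(ordered)}
    return [mapping[v] for v in values], mapping

def binarize(delays, threshold=15.0):
    """Binarizer idea: 1.0 if ArrDelay exceeds the threshold, else 0.0."""
    return [1.0 if d > threshold else 0.0 for d in delays]

origins, mapping = string_index(["ORD", "SFO", "ORD", "JFK"])
labels = binarize([3.0, 42.0, -5.0, 20.0])
# VectorAssembler then concatenates such numeric columns into one feature vector.
```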

Though PAL offers some feature engineering functions, the available options are limited compared to Spark ML (Fig. 4). With the feature transforming, extracting, and selecting functions in Spark ML, various data types can be handled, including text-based data.

SAP HANA VORA - Machine Learning in Enterprise Computing - 4

Figure 4: Available Feature Engineering Functions in PAL (left), SparkML (right)

  • Model Building

To use PAL in HANA, the ML procedures were scripted in SQL. There are three steps to train a model: procedure setup -> data table setup -> model parameter setup. First, procedures are created by defining the data types of the input table and the desired output tables. Then the ML algorithm (random forest in this case) can be called by specifying the input table and the model parameters, such as maximum tree depth, number of trees, and number of threads. After building a model, output tables with metrics like variable importance or the confusion matrix can also be checked. If HANA is connected to multiple nodes and the data is distributed across that cluster, the 'PARALLEL_BY_PARAMETER_PARTITIONS' function can be added to the call to run the execution in parallel. For Spark ML, the data set was partitioned into 4 subsets. The number of trees and the tree depth were set to 100 and 10 respectively for both models. System specifications are as follows:

System Specifications:

For HANA Sandbox

  • Number of cores: 14
  • Total system memory: 120 GiB

For Spark ML
(Hortonworks Sandbox)

  • Number of cores: 4
  • Total system memory: 8 GiB

(AWS EC2 instance: r3.4xlarge)

  • Number of cores: 16
  • Total system memory: 122 GiB

The total time taken to build the model was 444 seconds for PAL and 651 seconds for Spark ML on the Hortonworks sandbox. On the EC2 instance, it took only 249 seconds.

  • Parameter Optimization

PAL currently offers parameter selection only for a limited set of algorithms: logistic regression, naïve Bayes, support vector machines, and random forests. Spark ML, on the other hand, offers auto parameter tuning (grid search), where users set a grid of parameter ranges and the number of cross-validation folds. This task normally takes extensive execution time since it requires (number of cross-validation folds * number of grid search sets) additional model fits. For instance, if there are 4 sets in the parameter map with 3-fold cross-validation, 12 full random forest models must be built before it can be concluded which parameter set is optimal. Although grid search is expensive, the existing model can be improved significantly with the tuned parameters.
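The cost arithmetic above can be checked with a small sketch (plain Python; the parameter names simply mirror the random forest settings used in this demo and are illustrative):

```python
def num_model_fits(param_grid, num_folds):
    """Each parameter combination in the grid is fit once per
    cross-validation fold, so total fits = combinations * folds."""
    num_combos = 1
    for values in param_grid.values():
        num_combos *= len(values)
    return num_combos * num_folds

grid = {"numTrees": [50, 100], "maxDepth": [5, 10]}  # 2 x 2 = 4 combinations
fits = num_model_fits(grid, num_folds=3)  # 4 sets * 3 folds = 12 full models
```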

Step 3: Make Predictions on the Test Dataset

From Step 2, a random forest model was built. Now we can use the model to make predictions on the test dataset. For PAL, the scripts for model prediction are as lengthy as the scripts used for model building: the entire setup (procedure, data, parameters) needs to be defined again. On the other hand, testing the model requires just one line of code in Spark ML because of its 'pipeline' (Fig. 5). A Spark ML Pipeline is like a workflow that stores the sequence of tasks a raw dataset needs to go through before it can be used by the ML algorithm. Since the pipeline was already defined while training the model, the same pipeline can be applied to the test set, transforming the test dataset so it is ready to be plugged into the prediction model.
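The pipeline idea can be reduced to a few lines of plain Python (a conceptual sketch, not the Spark ML API): the fitted pipeline remembers every transformation stage along with the trained model, so scoring new data really is a single call.

```python
class MiniPipeline:
    """Conceptual sketch: a chain of row transforms ending in a predictor."""
    def __init__(self, stages, predictor):
        self.stages = stages        # e.g. indexing, assembling, binarizing
        self.predictor = predictor  # e.g. a trained random forest

    def transform(self, rows):
        for stage in self.stages:
            rows = [stage(row) for row in rows]
        return [self.predictor(row) for row in rows]

# Hypothetical stages: index the origin airport, then predict from the index.
index_origin = lambda row: {**row, "origin_ix": {"ORD": 0.0, "SFO": 1.0}[row["origin"]]}
stump = lambda row: 1.0 if row["origin_ix"] > 0.5 else 0.0  # toy stand-in model
pipe = MiniPipeline([index_origin], stump)
predictions = pipe.transform([{"origin": "ORD"}, {"origin": "SFO"}])
```

Because the test rows pass through exactly the same stages as the training rows, there is no risk of the two datasets being prepared inconsistently, which is the main point of the pipeline abstraction.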

SAP HANA VORA - Machine Learning in Enterprise Computing - 5

Figure 5: Amount of Scripts Needed for PAL and SparkML

Step 4: Compare Predicted and True Labels

As evaluation metrics for a classification model, a confusion matrix can be used to calculate precision and recall. Both models showed a fairly low recall rate. Spark ML showed a better precision rate, but both models would need to be optimized before being deployed in production. In particular, the model in PAL showed an AUC (area under the ROC curve) of 0.53, indicating the model performs very poorly at differentiating the target classes.

RF Model in SAP PAL

                 delayed      NOT delayed
predicted        TP 1340      FP 625
NOT predicted    FN 20277     TN 77758
  • Precision = 1340/(1340+625) = 68%
  • Recall = 1340/(1340+20277) = 6.2%
  • AUC = 0.53

RF Model in SparkML

                 delayed      NOT delayed
predicted        TP 1155      FP 363
NOT predicted    FN 20462     TN 78020
  • Precision = 1155/(1155+363) = 76%
  • Recall = 1155/(1155+20462) = 5.3%
  • AUC = 0.71
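The precision and recall figures above can be reproduced directly from the confusion-matrix counts:

```python
def precision_recall(tp, fp, fn):
    """Precision = TP/(TP+FP); recall = TP/(TP+FN)."""
    return tp / (tp + fp), tp / (tp + fn)

pal_p, pal_r = precision_recall(tp=1340, fp=625, fn=20277)
spark_p, spark_r = precision_recall(tp=1155, fp=363, fn=20462)
print(f"PAL:      precision {pal_p:.0%}, recall {pal_r:.1%}")   # 68%, 6.2%
print(f"Spark ML: precision {spark_p:.0%}, recall {spark_r:.1%}")  # 76%, 5.3%
```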

Regarding variable importance, PAL and Spark ML both chose departure time as the most decisive factor for flight delay, followed by origin and distance. When the data mining is completed, results in Spark ML can either be stored as VORA tables or written back to HANA. Using HANA SDA (Smart Data Access) is another way to access VORA tables without physically moving the data (Fig. 6).

SAP HANA VORA - Machine Learning in Enterprise Computing - 6

Figure 6: VORA SDA Architecture

To summarize, here is a comparison table. Spark ML outperformed PAL in both run time and model accuracy. More algorithms and ways to optimize the current models will be covered in the next post. Stay tuned!

                                         PAL in HANA (2.0)    SparkML (2.1.0)
Programming Languages                    SQL                  Scala, R, Python
License                                  Commercial           Open source
Run time                                 444 seconds          249 seconds
AUC                                      0.53                 0.71
Auto-tuning (grid search), Optimization                       Available


Data mining tasks          PAL in HANA (2.0)    SparkML (2.1.0)
Data pre-processing        X                    X
Basic Statistics           X                    X
Clustering                 X                    X
Classification             X                    X
Regression                 X                    X
Frequent Pattern Mining    X                    X
Time-Series Algorithms     X                    X
Collaborative Filtering                         X
Dimensionality Reduction   X                    X
Anomaly detection          X                    X
Streaming model            X                    X
Text analysis              X                    X
Deep learning
