In the previous blog, we showed Data Visualization in KNIME. In this blog, we will see how to implement Supervised Learning techniques in KNIME.
Linear Regression Predictive Model
Regression analysis is used to find the relationship between a dependent variable and one or more independent variables. The first method we are going to use is Linear Regression. The dependent variable must be continuous; in our case it is mpg. The independent variables we are taking are wt and hp.
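For reference, the fitted model has the form mpg = b0 + b1*wt + b2*hp, where b0 is the intercept and b1, b2 are the coefficients estimated from the training data.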
Load the file mileage_new.csv using File Reader node
Go to Manipulation -> Row -> Transform, add Partitioning and create data partition (Relative %: 70, select Draw randomly, set the random seed to 101)
Go to Analytics -> Mining -> Linear/Polynomial Regression and add Linear Regression Learner
Connect Output1 of Partitioning to Input of Linear Regression Learner
Select Target Column: mpg and Include: wt, hp
Add Regression Predictor
Connect Output1 of Linear Regression Learner to Input1 of Regression Predictor and Output2 of Partitioning to Input2 of Regression Predictor
By default, it will create a new column called Prediction (mpg)
Go to Analytics -> Mining -> Scoring and add Numeric Scorer
Right-click on Numeric Scorer node and select View: Statistics
Linear Regression workflow
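For readers who want to cross-check the workflow outside KNIME, here is a minimal Python sketch of the same analysis using pandas and scikit-learn. It assumes mileage_new.csv contains the columns mpg, wt, and hp; note that scikit-learn's random split will not reproduce the exact partition KNIME draws with seed 101.

```python
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error, r2_score

# Load the same data set used by the File Reader node
df = pd.read_csv("mileage_new.csv")

# 70/30 split, analogous to the Partitioning node (the seed will not match KNIME's)
train, test = train_test_split(df, train_size=0.7, random_state=101)

# Fit mpg ~ wt + hp, analogous to the Linear Regression Learner
model = LinearRegression().fit(train[["wt", "hp"]], train["mpg"])

# Score the test partition, analogous to Regression Predictor + Numeric Scorer
pred = model.predict(test[["wt", "hp"]])
print("R^2: ", r2_score(test["mpg"], pred))
print("RMSE:", mean_squared_error(test["mpg"], pred) ** 0.5)
```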
Logistic Regression Predictive Model
Logistic Regression is a supervised machine learning method used for classification problems. The dependent variable is categorical. In the case of binary logistic regression, the dependent variable is dichotomous (having two factors). So we will be using binary logistic regression for the Orings3.xls data, where our dependent variable is Condition (0 -> no failure, 1 -> at least one O-ring failure occurred) and the independent variable is Temp (launch temperature in degrees F).
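In this setup, the model estimates P(Condition = 1) = 1 / (1 + e^-(b0 + b1*Temp)), i.e. the probability of at least one O-ring failure as a function of launch temperature.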
Use Excel Reader (XLS) node to load the file Orings3.xls
Go to Manipulation -> Row -> Other, add Rule Engine, and in the expression box add these two lines:
$Condition$ = 0 => "0"
$Condition$ = 1 => "1"
In Configure, select Append Column and name the new column Condition_factor
Go to Manipulation -> Row -> Transform, add Partitioning and create data partition (Relative %: 70, select Draw randomly, set the random seed to 101)
Go to Analytics -> Mining -> Logistic regression and add Logistic Regression Learner node
Connect Output1 of Partitioning to the Input of Logistic Regression Learner
In Configure, select Target column: Condition_factor, Reference category: 0, Select solver: Iterative reweighted least squares, and Include: Temp
Right-click on Logistic Regression Learner and select Coefficients and Statistics to see the model summary
Add Logistic Regression Predictor
Connect Output1 of Logistic Regression Learner to Input1 of Logistic Regression Predictor and Output1 (train data) of Partitioning to Input2 of Logistic Regression Predictor
In Configure, Check the option: Append columns with predicted probabilities
Go to Analytics -> Mining -> Scoring, add Scorer, and connect it to the Logistic Regression Predictor
In Configure, select First column: Condition_factor and Second column: Prediction (Condition_factor)
Right-click on Scorer node and select View: Confusion Matrix
Go to Views -> JavaScript, add ROC Curve, and connect it to the Logistic Regression Predictor
In Configure, select Class column: Condition_factor, Positive class: 1, and Include: P(Condition_factor=1)
Right-click on ROC Curve node and select Interactive View: ROC Curve
Similarly, repeat the predictor and scoring steps above for the test data, connecting Output2 of Partitioning instead
Binary Logistic Regression workflow
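As a similar cross-check in code, here is a minimal Python sketch of the binary logistic regression, assuming Orings3.xls contains the columns Condition and Temp (reading .xls files with pandas requires the xlrd package). With such a small data set, the results will depend heavily on the split.

```python
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import confusion_matrix, roc_auc_score

# Load the same data set used by the Excel Reader (XLS) node
df = pd.read_excel("Orings3.xls")

# 70/30 split, analogous to the Partitioning node (the seed will not match KNIME's)
train, test = train_test_split(df, train_size=0.7, random_state=101)

# Fit Condition ~ Temp, analogous to the Logistic Regression Learner
model = LogisticRegression().fit(train[["Temp"]], train["Condition"])

# Predicted probabilities and classes, analogous to the Logistic Regression Predictor
prob = model.predict_proba(test[["Temp"]])[:, 1]  # P(Condition = 1)
pred = model.predict(test[["Temp"]])              # default 0.5 cut-off

# Confusion matrix (Scorer) and area under the ROC curve (ROC Curve node)
print(confusion_matrix(test["Condition"], pred))
print("AUC:", roc_auc_score(test["Condition"], prob))
```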
The interesting thing to note is that we have worked on imbalanced data. Although we got high accuracy (~95%), the model is unable to predict even one of the failed rocket launches correctly.
Advantages of using KNIME
It has over 100 processing nodes for data I/O, preprocessing and cleansing, modeling, analysis, and data mining, as well as various interactive views such as scatter plots, parallel coordinates, and others
Integration with R, Python, and Java
Connects to various databases like Oracle, MySQL, PostgreSQL, MS Access, Microsoft SQL Server, etc.
Partial execution of nodes inside a workflow
Has in-built ready-to-run examples
Requires little to no coding experience
Disadvantages of using KNIME
As we saw, by default KNIME takes the cut-off value as 0.5 in logistic regression, so we have to add a few extra nodes (for example, Math Formula) or use the R integration in order to implement a logistic regression model with a different cut-off value (to trade off sensitivity against specificity).
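As a rough illustration of the idea (in Python rather than with extra KNIME nodes), a custom cut-off can simply be applied to the predicted probabilities; the 0.3 threshold below is a hypothetical value, not a recommendation.

```python
from sklearn.metrics import confusion_matrix

def classify_with_cutoff(probabilities, labels, cutoff=0.3):
    """Turn P(class = 1) into class labels using a custom cut-off instead of 0.5."""
    predictions = (probabilities >= cutoff).astype(int)
    return confusion_matrix(labels, predictions)

# Usage with the variables from the logistic regression sketch above:
# print(classify_with_cutoff(prob, test["Condition"], cutoff=0.3))
```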
So we have learned how to implement some of the machine learning models in KNIME. Share your views on how you have found these blogs useful in your learning or work experience.