Supervised Learning using KNIME

In the previous blog, we showed Data Visualization in KNIME. In this blog, we will see how to implement supervised learning techniques in KNIME.

 

Linear Regression Predictive Model

 

Regression analysis is used to find the relationship between a dependent variable and one or more independent variables. One of the methods we are going to use is Linear Regression, where the dependent variable must be continuous; in our case it is mpg. The independent variables we are taking are wt and hp.

 

  1. Load the file mileage_new.csv using File Reader node
  2. Go to Manipulation -> Row -> Transform, add Partitioning and create data partition (Relative %: 70 and select Draw randomly, set the random seed to 101)
  3. Go to Analytics -> Mining -> Linear/Polynomial Regression and add Linear Regression Learner
  4. Connect Output1 of Partitioning to Input of Linear Regression Learner
  5. Select Target Column: mpg and Include: wt, hp
  6. Add Regression Predictor
  7. Connect Output1 of Linear Regression Learner to Input1 of Regression Predictor and Output2 of Partitioning to Input2 of Regression Predictor
  8. By default, it will create a new column called Prediction (mpg)
  9. Go to Analytics -> Mining -> Scoring, add Numeric Scorer and connect the Output of Regression Predictor to its Input (Reference column: mpg, Predicted column: Prediction (mpg))
  10. Right-click on Numeric Scorer node and select View: Statistics
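The steps above can be sketched as a rough Python equivalent using scikit-learn (this is not a KNIME feature, just the same pipeline in code; the data frame below is a synthetic stand-in, since mileage_new.csv is not reproduced here):

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression
from sklearn.metrics import r2_score, mean_squared_error

# Hypothetical stand-in for mileage_new.csv (columns mpg, wt, hp as in the blog)
rng = np.random.default_rng(101)
wt = rng.uniform(1.5, 5.5, 100)
hp = rng.uniform(50, 330, 100)
mpg = 37 - 3.8 * wt - 0.03 * hp + rng.normal(0, 2, 100)
df = pd.DataFrame({"mpg": mpg, "wt": wt, "hp": hp})

# 70/30 partition with a fixed seed, mirroring the Partitioning node
train, test = train_test_split(df, train_size=0.7, random_state=101)

# Learner: fit mpg ~ wt + hp
model = LinearRegression().fit(train[["wt", "hp"]], train["mpg"])

# Predictor: the equivalent of the appended Prediction (mpg) column
pred = model.predict(test[["wt", "hp"]])

# Numeric Scorer: R^2 and RMSE on the test partition
print("R^2 :", r2_score(test["mpg"], pred))
print("RMSE:", mean_squared_error(test["mpg"], pred) ** 0.5)
```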


Linear Regression workflow

 

Logistic Regression Predictive Model

 

Logistic Regression is a supervised machine learning method used for classification problems. The dependent variable is categorical. In the case of binary logistic regression, the dependent variable is a dichotomous variable (having two factors). So we will be using binary logistic regression for the Orings3.xls data, where our dependent variable is Condition (0 -> No failure, 1-> At least one O-ring failure had occurred) and the independent variable is Temp (launch temperature in degrees F).

 

  1. Use the Excel Reader (XLS) node to load the file Orings3.xls
  2. Go to Manipulation -> Row -> Other and add Rule Engine; in the expression box add these two lines:
     $Condition$ = 0 => "0"
     $Condition$ = 1 => "1"
  3. In Configure, select Append Column and name the new variable Condition_factor
  4. Go to Manipulation -> Row -> Transform, add Partitioning and create a data partition (Relative %: 70, select Draw randomly, set the random seed to 101)
  5. Go to Analytics -> Mining -> Logistic Regression and add the Logistic Regression Learner node
  6. Connect Output1 of Partitioning to the Input of Logistic Regression Learner
  7. In Configure, select Target column: Condition_factor, Reference category: 0, Solver: Iteratively reweighted least squares, Include: Temp
  8. Right-click on Logistic Regression Learner and select Coefficients and Statistics to see the model summary
  9. Add Logistic Regression Predictor
  10. Connect Output1 of Logistic Regression Learner to Input1 of Logistic Regression Predictor and Output1 (train data) of Partitioning to Input2 of Logistic Regression Predictor
  11. In Configure, check the option Append columns with predicted probabilities
  12. Go to Analytics -> Mining -> Scoring and add Scorer, connected to the Logistic Regression Predictor
  13. In Configure, select First column: Condition_factor and Second column: Prediction (Condition_factor)
  14. Right-click on the Scorer node and select View: Confusion Matrix
  15. Go to Views -> JavaScript and add ROC Curve, connected to the Logistic Regression Predictor
  16. In Configure, select Class column: Condition_factor, Positive class: 1, Include: P(Condition_factor=1)
  17. Right-click on the ROC Curve node and select Interactive View: ROC Curve
  18. Similarly, repeat steps 9-17 for the test data, connecting Output2 of Partitioning to Input2 of a second Logistic Regression Predictor
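This workflow, too, can be sketched as a rough Python equivalent with scikit-learn. The data below is synthetic and purely illustrative (the real blog uses Orings3.xls with columns Condition and Temp), and scikit-learn's default solver (lbfgs) stands in for KNIME's iteratively reweighted least squares:

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import confusion_matrix, roc_auc_score

# Hypothetical stand-in for Orings3.xls: failures more likely at low temperatures
rng = np.random.default_rng(101)
temp = rng.uniform(50, 85, 200)
p_fail = 1 / (1 + np.exp(0.25 * (temp - 62)))   # higher risk when cold
cond = (rng.random(200) < p_fail).astype(int)
df = pd.DataFrame({"Temp": temp, "Condition": cond})

# 70/30 partition with a fixed seed, mirroring the Partitioning node
train, test = train_test_split(df, train_size=0.7, random_state=101)

# Learner: fit Condition ~ Temp
clf = LogisticRegression().fit(train[["Temp"]], train["Condition"])

# Predictor: class labels plus the appended probability P(Condition=1)
proba = clf.predict_proba(test[["Temp"]])[:, 1]
pred = clf.predict(test[["Temp"]])

# Scorer ~ confusion matrix; ROC Curve node ~ ROC AUC
print(confusion_matrix(test["Condition"], pred))
print("AUC:", roc_auc_score(test["Condition"], proba))
```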


Binary Logistic Regression workflow

 

The interesting thing to note is that we have worked on imbalanced data: the model achieves high accuracy (~95%), yet fails to predict even one of the failed rocket launches correctly.
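This pitfall is easy to see with a tiny numeric sketch: on rare-failure data, a "classifier" that always predicts the majority class also scores ~95% accuracy while having zero recall on failures. (The numbers below are illustrative, not from the O-ring data.)

```python
import numpy as np
from sklearn.metrics import accuracy_score, recall_score

# 100 launches, only 5 failures (class 1) -- illustrative counts
y_true = np.array([1] * 5 + [0] * 95)
y_majority = np.zeros(100, dtype=int)   # always predict "no failure"

print(accuracy_score(y_true, y_majority))  # 0.95 -- looks great
print(recall_score(y_true, y_majority))    # 0.0  -- misses every failure
```

This is why the confusion matrix and ROC curve in the workflow above matter more than the headline accuracy figure.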

 

Advantages of using KNIME

  1. It has over 100 processing nodes for data I/O, preprocessing and cleansing, modeling, analysis and data mining, as well as various interactive views such as scatter plots and parallel coordinates
  2. Integration with R, Python, and Java
  3. Connect to various databases like Oracle, MySQL, PostgreSQL, MS-Access, Microsoft SQL Server, etc.
  4. Partial execution of nodes inside a workflow
  5. Has in-built ready-to-run examples
  6. Requires little to no coding experience

Disadvantages of using KNIME

  1. As we saw, KNIME uses 0.5 as the default cut-off value in logistic regression, so we have to add extra nodes (e.g., Math Formula) or use R integration to implement a logistic regression model with a different cut-off value (trading off sensitivity against specificity).
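The Math Formula workaround amounts to thresholding the appended probability column yourself. A minimal sketch (the probability values below are made up; the column corresponds to P(Condition_factor=1) from the predictor node):

```python
import numpy as np

# Illustrative appended probabilities from the predictor node
p_fail = np.array([0.10, 0.35, 0.48, 0.62, 0.91])

# Default behaviour: cut-off 0.5
pred_default = (p_fail >= 0.5).astype(int)

# A lower cut-off trades specificity for sensitivity: more launches
# get flagged as likely failures
cutoff = 0.3
pred_custom = (p_fail >= cutoff).astype(int)

print(pred_default)  # [0 0 0 1 1]
print(pred_custom)   # [0 1 1 1 1]
```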

 

So we have learned how to implement some machine learning models in KNIME. Share your views on how you have found these blogs useful in your learning or working experience.
