Supervised Learning: A Glance at the Powerful Classification Algorithms

azam sayeed
8 min read · Dec 14, 2019

Classification is the process of grouping things according to similar features they share.

Types of Classification Algorithms
1. Logistic Regression
2. Decision Tree
3. Random Forest
4. K-Nearest Neighbours (KNN)
5. Naive Bayes
6. Support Vector Machine (SVM)
7. Ensemble Methods: Bagging and Random Forest
8. XGBoost
1. Logistic Regression

Logistic regression is used when the dependent variable (target) is categorical.

More details in my other post:
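As a quick, minimal sketch (my own illustration, not the walkthrough from that post), a logistic regression classifier can be fit with scikit-learn on its built-in breast cancer dataset like this:

from sklearn.datasets import load_breast_cancer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

# Binary target: malignant vs. benign
X, y = load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)

model = LogisticRegression(max_iter=5000)  # raise max_iter so the solver converges
model.fit(X_train, y_train)
print(model.score(X_test, y_test))         # accuracy on the held-out test set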

2. Decision Tree

  • A graphical representation of all possible solutions to a decision, based on certain conditions
  • Can be easily explained and conceptualized
  • The Gini impurity of a leaf node is zero

Key Terminologies:

  1. Root Node — Represents the entire population or sample, which gets divided into two or more homogeneous sets.
  2. Leaf Node — A node that cannot be split any further (Gini index = 0 at a leaf node)
  3. Splitting — Dividing the root node/sub-node into different parts based on some condition.
  4. Subtree — A branch formed as an intermediate step while splitting the tree
  5. Pruning — Removing unwanted branches from the tree
  6. Parent/Child node — The root node is the parent node, and all nodes branching from it are called child nodes

Terminology for Finding the Best Split

  1. Entropy — Measures randomness in the data; it is a metric for impurity. Computing it is the first step in building a decision tree.
  2. Information Gain — The decrease in entropy after a dataset is split on an attribute. Constructing a decision tree is all about finding the attribute that returns the highest information gain.
  3. Gini Index — Another measure of impurity (or purity) used in building a decision tree.
  4. Reduction in Variance — The criterion used for continuous target variables (regression trees). The split with the lower variance is selected as the split criterion.
Formula of Entropy
E(S) = -P(Yes) log2 P(Yes) - P(No) log2 P(No)
When P(Yes) = P(No) = 0.5 (equal number of yes and no):
E(S) = -0.5 log2 0.5 - 0.5 log2 0.5 = 0.5 + 0.5 = 1
When P(Yes) = 1 (only yes in the sample space):
E(S) = -1 log2 1 = 0
Similarly, when P(No) = 1 (only no in the sample space):
E(S) = -1 log2 1 = 0
Information Gain = Entropy(S) - [(Weighted Avg) x Entropy(each feature)], where S is the total collection
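To make the formulas concrete, here is a small sketch (my own illustration; the yes/no counts are hypothetical) that computes entropy and the information gain of a binary split:

import math

def entropy(p_yes, p_no):
    # E(S) = -P(Yes) log2 P(Yes) - P(No) log2 P(No), treating 0*log2(0) as 0
    return -sum(p * math.log2(p) for p in (p_yes, p_no) if p > 0)

print(entropy(0.5, 0.5))   # 1.0 -> maximum impurity
print(entropy(1.0, 0.0))   # 0.0 -> pure node

# Information gain for a parent node with 9 yes / 5 no,
# split into children of 8 samples (6 yes / 2 no) and 6 samples (3 yes / 3 no)
parent = entropy(9/14, 5/14)
left = entropy(6/8, 2/8)
right = entropy(3/6, 3/6)
gain = parent - (8/14 * left + 6/14 * right)
print(gain)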

Video explaining how entropy and information gain are used for splitting in a decision tree:

Demo:
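The original demo was embedded separately; a minimal sketch of what it might look like with scikit-learn's DecisionTreeClassifier on the built-in Iris dataset (hyperparameters are illustrative):

from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=1)

tree = DecisionTreeClassifier(criterion="entropy", max_depth=3)  # split using information gain
tree.fit(X_train, y_train)
print(tree.score(X_test, y_test))  # accuracy on the test set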

The Important Concept of the Confusion Matrix

A confusion matrix is a table that is often used to describe the performance of a classification model (or “classifier”) on a set of test data for which the true values are known.

  • Summarizes the counts of correct and incorrect predictions, grouped by class

Example: Patient having Disease or Not.

  • true positives (TP): These are cases in which we predicted yes (they have the disease), and they do have the disease.
  • true negatives (TN): We predicted no, and they don’t have the disease.
  • false positives (FP): We predicted yes, but they don’t have the disease. (Also known as a “Type I error.”)
  • false negatives (FN): We predicted no, but they do have the disease. (Also known as a “Type II error.”)
from sklearn.metrics import confusion_matrix

expected = [1, 1, 0, 1, 0, 1, 1, 0, 0, 0]
predicted = [1, 0, 0, 1, 0, 0, 1, 0, 1, 1]

results = confusion_matrix(expected, predicted)
print(results)
# 6 predictions are correct (3 true negatives + 3 true positives)
# and 4 are incorrect (2 false positives + 2 false negatives)
o/p:
[[3 2]
 [2 3]]
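For a binary problem the four cells can be unpacked directly; scikit-learn orders the matrix with actual classes on the rows and predicted classes on the columns (0 first, then 1), so flattening it gives TN, FP, FN, TP:

tn, fp, fn, tp = results.ravel()
print(tn, fp, fn, tp)               # 3 2 2 3
print((tp + tn) / len(expected))    # accuracy = 0.6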

3. Random Forest

  • Builds multiple decision trees and merges them to get a more accurate and stable prediction. For the final prediction it aggregates the outputs of all the decision trees (majority vote for classification, average for regression)
  • Corrects the decision tree's tendency to overfit the training set. Uses the bagging method — building multiple decision trees on random subsets of the dataset
  • Called Random because each decision tree in the forest considers a random subset of features when forming its splits and has access to only a random subset of the training data; that is why it is robust

4. KNN

  • Uses existing data to classify new data points based on similarity measures. Used in search and recommendation applications: on Flipkart, for example, purchasing a shirt triggers recommendations from related categories such as pants.
  1. Select K = the number of nearest neighbours
Euclidean distance formula
  2. Suppose we introduce a new data point (the star) with K = 3. Using the smallest distances, it finds 2 orange points and 1 blue point as its three closest neighbours, so it is classified as orange.
  3. For K = 6, if the majority of the six nearest neighbours are blue, the point is classified as blue.
  • Lazy learner — the work happens at prediction time; there is no real training phase.

The steps of the algorithm (a minimal scikit-learn sketch follows this list):
  1. Handle the dataset
  2. Similarity — calculate the distance between two data instances
  3. Neighbors — locate the K most similar data instances
  4. Response — generate a prediction from the set of neighbouring instances
  5. Accuracy — summarize the accuracy of the predictions
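These steps map almost one-to-one onto scikit-learn's KNeighborsClassifier; a minimal sketch on the Iris dataset (my example, not the original demo):

from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=0)

knn = KNeighborsClassifier(n_neighbors=3, metric="euclidean")  # K = 3, Euclidean distance
knn.fit(X_train, y_train)          # the lazy learner only stores the training data here
print(knn.score(X_test, y_test))   # accuracy of the predictions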

5. Naive Bayes Classifier

  • A probabilistic machine learning method based on Bayes' theorem
  • Assumption — the presence of a particular feature in a class is unrelated to the presence of any other feature

Ex: an orange has features such as colour, shape and size. Naive Bayes assumes that each feature contributes independently to the probability that the fruit is an orange.

Conditional probability: the probability of a second event (event B) given that a first event (event A) has already happened

Ex: the probability of a customer buying baby pants given that they have already added milk to their cart.

Ex:

Suppose you have a jar containing 6 marbles — 3 black and 3 white. What is the probability of drawing a black marble on the second draw, given that the first marble drawn was black?

P(A) = getting a black marble on the first draw

P(B) = getting a black marble on the second draw

P(A) = 3/6 = 0.5

P(B|A) = 2/5 (after removing one black marble, 2 of the remaining 5 are black)

P(A and B) = P(A) x P(B|A) = 1/2 x 2/5 = 1/5

Check: P(B|A) = P(A ∩ B)/P(A) = 0.2/0.5 = 0.4

  • Bayes' theorem shows the relation between a conditional probability and its reverse form

If the conditional probability is P(A|B), then the reverse probability P(B|A) can be obtained using Bayes' theorem.

Ex:
Event A: the patient has lung disease. Past data says 10% of patients had lung disease, so P(A) = 0.1
Event B: the patient smokes. 5% of patients are smokers, so P(B) = 0.05
Among the patients having lung disease, 7% are smokers, so P(B|A) = 0.07
By Bayes' theorem: P(A|B) = P(B|A) x P(A) / P(B) = (0.07 x 0.1) / 0.05 = 0.14

Example: a training dataset of weather conditions and the corresponding target variable 'Play' (suggesting the possibility of playing). We need to classify whether players will play or not based on the weather conditions.

Example: predicting heart disease using a decision tree and Naive Bayes

dataset: https://www.dropbox.com/s/ek6ukmz0zof7ed9/heart1.csv?dl=0
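A minimal sketch of such a demo, assuming the CSV has been downloaded locally as heart1.csv and contains a binary label column named 'target' (the column name is an assumption; adjust it to match the actual file):

import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import GaussianNB
from sklearn.tree import DecisionTreeClassifier

df = pd.read_csv("heart1.csv")     # downloaded from the Dropbox link above
X = df.drop("target", axis=1)      # assumed label column name
y = df["target"]
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=7)

for model in (DecisionTreeClassifier(max_depth=4), GaussianNB()):
    model.fit(X_train, y_train)
    print(type(model).__name__, model.score(X_test, y_test))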

6. Support Vector Machine

  • Support vector machine is a supervised machine learning algorithm that classifies data based on its features
  • SVM separates the data using hyperplanes. There are infinitely many ways of drawing a hyperplane that separates the data; to select the best fit we use support vectors
  • Support vectors are the data points closest to the hyperplane, from either class
  • The optimal hyperplane is the one with the maximum distance to the support vectors
  • The distance between the support vectors, measured across the hyperplane, is known as the margin

Suppose we introduce new data. We first create candidate hyperplanes based on the support vectors, then choose the hyperplane with the largest margin as the best fit. In the example below, margin 1 is the better fit.

For non-linear data

  • SVM uses a kernel function to transform 2D non-linear data into a higher dimension where it becomes separable
  • Kernel functions — polynomial, Gaussian, Gaussian radial basis function (RBF) and Laplace RBF kernels

Demo: using the cancer dataset

Dataset: the built-in breast cancer sample dataset in sklearn.datasets
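A minimal sketch of the demo using that built-in dataset and an RBF-kernel SVC (my reconstruction, not the original notebook; scaling the features first usually helps an SVM):

from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

X, y = load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)

svm = make_pipeline(StandardScaler(), SVC(kernel="rbf", C=1.0))  # scale, then RBF-kernel SVM
svm.fit(X_train, y_train)
print(svm.score(X_test, y_test))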

7. Ensemble Methods: Bagging and Random Forest

This is an ensemble method: several base models are combined to produce one optimal predictive model.

Instead of building one model on the entire dataset, the ensemble approach divides the dataset into n training sets, builds a model on each training set, and then aggregates the outputs of the individual models to make a decision.

The technique used in Random Forest is Bagging

Sampling with Replacement

From the n records in the data, say [x, y, z, a, b, c], we draw n' samples with replacement to fill D1, where n' < n and the samples are picked at random with repetition allowed; for example, D1 could contain [a, a, x, y] and D2 could contain [b, a, x, x]. We create m such datasets, each with n' samples, and build one decision tree on each dataset. To predict for a new record R, we obtain one prediction from each tree built from the bags and then classify based on the highest number of votes (the aggregated result).
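A small sketch of the idea (my illustration; the sample size and number of estimators are arbitrary): first a hand-rolled bootstrap sample, then scikit-learn's BaggingClassifier, whose default base estimator is a decision tree:

import numpy as np
from sklearn.datasets import load_iris
from sklearn.ensemble import BaggingClassifier

X, y = load_iris(return_X_y=True)

# Sampling with replacement: draw n' indices (n' < n), repetition allowed
rng = np.random.default_rng(0)
idx = rng.choice(len(X), size=100, replace=True)
D1_X, D1_y = X[idx], y[idx]   # one bootstrapped dataset D1

# BaggingClassifier repeats this internally and aggregates the trees' votes
bag = BaggingClassifier(n_estimators=50)  # default base estimator is a decision tree
bag.fit(X, y)
print(bag.predict(X[:3]))     # majority vote across the 50 trees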

Random Forest

Random forest is very similar to bagging; the same process is used to create the datasets. The only difference is in how the models (decision trees) are built.

Suppose the dataset has 5 independent variables (columns). In bagging, all the independent variables are considered when building each decision tree. In a random forest, not all 5 independent variables are available at each node split in the decision tree; instead a random subset of them is provided, so that each tree is unique.

Demo: Iris dataset
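A minimal sketch of what the Iris demo might look like with RandomForestClassifier (hyperparameters are illustrative):

from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=1)

# max_features="sqrt" gives each split a random subset of the features
rf = RandomForestClassifier(n_estimators=100, max_features="sqrt")
rf.fit(X_train, y_train)
print(rf.score(X_test, y_test))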

8. XGBoost

Implementation: on the Iris dataset
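A minimal sketch, assuming the xgboost package is installed (pip install xgboost); the proper explanation is deferred to the follow-up post:

from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from xgboost import XGBClassifier

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=2)

xgb = XGBClassifier(n_estimators=100, max_depth=3)  # gradient-boosted decision trees
xgb.fit(X_train, y_train)
print(xgb.score(X_test, y_test))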

An introduction to this topic will follow in a new post, to avoid TooMuchInfo :)
