Supervised Learning: A Glance at the Powerful Classification Algorithms

azam sayeed
8 min read · Dec 14, 2019

Classification is the process of grouping things according to similar features they share.

Types of Classification Algorithms
1. Logistic Regression
2. Decision Tree
3. Random Forest
4. K-Nearest Neighbours (KNN)
5. Naive Bayes
6. Support Vector Machine (SVM)
7. Ensemble Methods: Bagging and Random Forest
8. XGBoost
1. Logistic Regression

Logistic regression is used when the dependent variable (target) is categorical.

More details in my other post:
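As a quick, minimal sketch (my own illustration, not the walkthrough from that post), a logistic regression classifier can be fit with scikit-learn on its built-in breast cancer dataset like this:

from sklearn.datasets import load_breast_cancer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

# Binary target: malignant vs. benign
X, y = load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)

model = LogisticRegression(max_iter=5000)  # raise max_iter so the solver converges
model.fit(X_train, y_train)
print(model.score(X_test, y_test))         # accuracy on the held-out test set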

2. Decision Tree

  • A graphical representation of all possible solutions to a decision, based on certain conditions
  • Can be easily explained and conceptualized
  • The Gini impurity of a leaf node is zero

Key Terminologies:

  1. Root Node — Represents the entire population or sample, which gets divided into two or more homogeneous sets.
  2. Leaf Node — A node that cannot be split any further (Gini index = 0 at a leaf node)
  3. Splitting — Dividing the root node/sub-node into different parts based on some condition.
  4. Subtree — A branch formed as an intermediate step while splitting the tree
  5. Pruning — Removing unwanted branches from the tree
  6. Parent/Child node — The root node is the parent node, and all nodes branching from it are called child nodes

Terminology for Finding the Best Split

  1. Entropy — Measures randomness in the data; it is a metric for impurity. Computing it is the first step in building a decision tree.
  2. Information Gain — The decrease in entropy after a dataset is split on an attribute. Constructing a decision tree is all about finding the attribute that returns the highest information gain.
  3. Gini Index — Another measure of impurity (or purity) used in building a decision tree.
  4. Reduction in Variance — The criterion used for continuous target variables (regression trees). The split with the lower variance is selected as the split criterion.
Formula of Entropy
E(S) = -P(Yes) log2 P(Yes) - P(No) log2 P(No)
When P(Yes) = P(No) = 0.5 (equal number of yes and no):
E(S) = -0.5 log2 0.5 - 0.5 log2 0.5 = 0.5 + 0.5 = 1
When P(Yes) = 1 (only yes in the sample space):
E(S) = -1 log2 1 = 0
Similarly, when P(No) = 1 (only no in the sample space):
E(S) = -1 log2 1 = 0
Information Gain = Entropy(S) - [(Weighted Avg) x Entropy(each feature)], where S is the total collection
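To make the formulas concrete, here is a small sketch (my own illustration; the yes/no counts are hypothetical) that computes entropy and the information gain of a binary split:

import math

def entropy(p_yes, p_no):
    # E(S) = -P(Yes) log2 P(Yes) - P(No) log2 P(No), treating 0*log2(0) as 0
    return -sum(p * math.log2(p) for p in (p_yes, p_no) if p > 0)

print(entropy(0.5, 0.5))   # 1.0 -> maximum impurity
print(entropy(1.0, 0.0))   # 0.0 -> pure node

# Information gain for a parent node with 9 yes / 5 no,
# split into children of 8 samples (6 yes / 2 no) and 6 samples (3 yes / 3 no)
parent = entropy(9/14, 5/14)
left = entropy(6/8, 2/8)
right = entropy(3/6, 3/6)
gain = parent - (8/14 * left + 6/14 * right)
print(gain)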

Video explaining how entropy and information gain are used for splitting in a decision tree:

Demo:
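The original demo was embedded separately; a minimal sketch of what it might look like with scikit-learn's DecisionTreeClassifier on the built-in Iris dataset (hyperparameters are illustrative):

from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=1)

tree = DecisionTreeClassifier(criterion="entropy", max_depth=3)  # split using information gain
tree.fit(X_train, y_train)
print(tree.score(X_test, y_test))  # accuracy on the test set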

The Important Concept of the Confusion Matrix

A confusion matrix is a table that is often used to describe the performance of a classification model (or “classifier”) on a set of test data for which the true values are known.

  • Summarizes the counts of correct and incorrect predictions, grouped by class

Example: Patient having Disease or Not.

  • true positives (TP): These are cases in which we predicted yes (they have the disease), and they do have the disease.
  • true negatives (TN): We predicted no, and they don’t have the disease.
  • false positives (FP): We predicted yes, but they don’t have the disease. (Also known as a “Type I error.”)
  • false negatives (FN): We predicted no, but they do have the disease. (Also known as a “Type II error.”)
from sklearn.metrics import confusion_matrix

expected = [1, 1, 0, 1, 0, 1, 1, 0, 0, 0]
predicted = [1, 0, 0, 1, 0, 0, 1, 0, 1, 1]

results = confusion_matrix(expected, predicted)
print(results)
# 6 predictions are correct (3 true negatives + 3 true positives)
# and 4 are incorrect (2 false positives + 2 false negatives)
o/p:
[[3 2]
 [2 3]]
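For a binary problem the four cells can be unpacked directly; scikit-learn orders the matrix with actual classes on the rows and predicted classes on the columns (0 first, then 1), so flattening it gives TN, FP, FN, TP:

tn, fp, fn, tp = results.ravel()
print(tn, fp, fn, tp)               # 3 2 2 3
print((tp + tn) / len(expected))    # accuracy = 0.6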

3. Random Forest

  • Builds multiple decision trees and merges them to get a more accurate and stable prediction. For the final prediction it aggregates the outputs of all the decision trees (majority vote for classification, average for regression)
  • Corrects the decision tree's tendency to overfit the training set. Uses the bagging method — building multiple decision trees on random subsets of the dataset
  • Called Random because each decision tree in the forest considers a random subset of features when forming its splits and has access to only a random subset of the training data; that is why it is robust

4. KNN

  • Uses existing data to classify new data points based on similarity measures. Used in search and recommendation applications: on Flipkart, for example, purchasing a shirt triggers recommendations from related categories such as pants.
  1. Select K = the number of nearest neighbours
Euclidean distance formula
  2. Suppose we introduce a new data point (the star) with K = 3. Using the smallest distances, it finds 2 orange points and 1 blue point as its three closest neighbours, so it is classified as orange.
  3. For K = 6, if the majority of the six nearest neighbours are blue, the point is classified as blue.
  • Lazy learner — the work happens at prediction time; there is no real training phase.

The steps of the algorithm (a minimal scikit-learn sketch follows this list):
  1. Handle the dataset
  2. Similarity — calculate the distance between two data instances
  3. Neighbors — locate the K most similar data instances
  4. Response — generate a prediction from the set of neighbouring instances
  5. Accuracy — summarize the accuracy of the predictions
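These steps map almost one-to-one onto scikit-learn's KNeighborsClassifier; a minimal sketch on the Iris dataset (my example, not the original demo):

from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=0)

knn = KNeighborsClassifier(n_neighbors=3, metric="euclidean")  # K = 3, Euclidean distance
knn.fit(X_train, y_train)          # the lazy learner only stores the training data here
print(knn.score(X_test, y_test))   # accuracy of the predictions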

5. Naive Bayes Classifier

  • A probabilistic machine learning method based on Bayes' theorem
  • Assumption — the presence of a particular feature in a class is unrelated to the presence of any other feature

Ex: an orange has features such as colour, shape and size. Naive Bayes assumes that each feature contributes independently to the probability that the fruit is an orange.

Conditional probability: the probability of a second event (event B) given that a first event (event A) has already happened

Ex: the probability of a customer buying baby pants given that they have already added milk to their cart.

Ex:

Suppose you have a jar containing 6 marbles — 3 black and 3 white. What is the probability of drawing a black marble on the second draw, given that the first marble drawn was black?

P(A) = getting a black marble on the first draw

P(B) = getting a black marble on the second draw

P(A) = 3/6 = 0.5

P(B|A) = 2/5 (after removing one black marble, 2 of the remaining 5 are black)

P(A and B) = P(A) x P(B|A) = 1/2 x 2/5 = 1/5

Check: P(B|A) = P(A ∩ B)/P(A) = 0.2/0.5 = 0.4

  • Bayes' theorem shows the relation between a conditional probability and its reverse form

If the conditional probability is P(A|B), then the reverse probability P(B|A) can be obtained using Bayes' theorem.

Ex:
Event A: the patient has lung disease. Past data says 10% of patients had lung disease, so P(A) = 0.1
Event B: the patient smokes. 5% of patients are smokers, so P(B) = 0.05
Among the patients having lung disease, 7% are smokers, so P(B|A) = 0.07
By Bayes' theorem: P(A|B) = P(B|A) x P(A) / P(B) = (0.07 x 0.1) / 0.05 = 0.14

Example: a training dataset of weather conditions and the corresponding target variable 'Play' (suggesting the possibility of playing). We need to classify whether players will play or not based on the weather conditions.

Example: predicting heart disease using a decision tree and Naive Bayes

dataset: https://www.dropbox.com/s/ek6ukmz0zof7ed9/heart1.csv?dl=0
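A minimal sketch of such a demo, assuming the CSV has been downloaded locally as heart1.csv and contains a binary label column named 'target' (the column name is an assumption; adjust it to match the actual file):

import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import GaussianNB
from sklearn.tree import DecisionTreeClassifier

df = pd.read_csv("heart1.csv")     # downloaded from the Dropbox link above
X = df.drop("target", axis=1)      # assumed label column name
y = df["target"]
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=7)

for model in (DecisionTreeClassifier(max_depth=4), GaussianNB()):
    model.fit(X_train, y_train)
    print(type(model).__name__, model.score(X_test, y_test))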

6. Support Vector Machine

  • Support vector machine is a supervised machine learning algorithm that classifies data based on its features
  • SVM separates the data using hyperplanes. There are infinitely many ways of drawing a hyperplane that separates the data; to select the best fit we use support vectors
  • Support vectors are the data points closest to the hyperplane, from either class
  • The optimal hyperplane is the one with the maximum distance to the support vectors
  • The distance between the support vectors, measured across the hyperplane, is known as the margin

Suppose we introduce new data. We first create candidate hyperplanes based on the support vectors, then choose the hyperplane with the largest margin as the best fit. In the example below, margin 1 is the better fit.

For non-linear data

  • SVM uses a kernel function to transform 2D non-linear data into a higher dimension where it becomes separable
  • Kernel functions — polynomial, Gaussian, Gaussian radial basis function (RBF) and Laplace RBF kernels

Demo: using the cancer dataset

Dataset: the built-in breast cancer sample dataset in sklearn.datasets
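A minimal sketch of the demo using that built-in dataset and an RBF-kernel SVC (my reconstruction, not the original notebook; scaling the features first usually helps an SVM):

from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

X, y = load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)

svm = make_pipeline(StandardScaler(), SVC(kernel="rbf", C=1.0))  # scale, then RBF-kernel SVM
svm.fit(X_train, y_train)
print(svm.score(X_test, y_test))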

7. Ensemble Methods: Bagging and Random Forest

This is an ensemble method: several base models are combined to produce one optimal predictive model.

Instead of building one model on the entire dataset, the ensemble approach divides the dataset into n training sets, builds a model on each training set, and then aggregates the outputs of the individual models to make a decision.

The technique used in Random Forest is Bagging

Sampling with Replacement

From the n records in the data, say [x, y, z, a, b, c], we draw n' samples with replacement to fill D1, where n' < n and the samples are picked at random with repetition allowed; for example, D1 could contain [a, a, x, y] and D2 could contain [b, a, x, x]. We create m such datasets, each with n' samples, and build one decision tree on each dataset. To predict for a new record R, we obtain one prediction from each tree built from the bags and then classify based on the highest number of votes (the aggregated result).
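A small sketch of the idea (my illustration; the sample size and number of estimators are arbitrary): first a hand-rolled bootstrap sample, then scikit-learn's BaggingClassifier, whose default base estimator is a decision tree:

import numpy as np
from sklearn.datasets import load_iris
from sklearn.ensemble import BaggingClassifier

X, y = load_iris(return_X_y=True)

# Sampling with replacement: draw n' indices (n' < n), repetition allowed
rng = np.random.default_rng(0)
idx = rng.choice(len(X), size=100, replace=True)
D1_X, D1_y = X[idx], y[idx]   # one bootstrapped dataset D1

# BaggingClassifier repeats this internally and aggregates the trees' votes
bag = BaggingClassifier(n_estimators=50)  # default base estimator is a decision tree
bag.fit(X, y)
print(bag.predict(X[:3]))     # majority vote across the 50 trees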

Random Forest

Random forest is very similar to bagging; the same process is used to create the datasets. The only difference is in how the models (decision trees) are built.

Suppose the dataset has 5 independent variables (columns). In bagging, all the independent variables are considered when building each decision tree. In a random forest, not all 5 independent variables are available at each node split in the decision tree; instead a random subset of them is provided, so that each tree is unique.

Demo: Iris dataset
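A minimal sketch of what the Iris demo might look like with RandomForestClassifier (hyperparameters are illustrative):

from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=1)

# max_features="sqrt" gives each split a random subset of the features
rf = RandomForestClassifier(n_estimators=100, max_features="sqrt")
rf.fit(X_train, y_train)
print(rf.score(X_test, y_test))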

8. XGBoost

Implementation: on the Iris dataset
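A minimal sketch, assuming the xgboost package is installed (pip install xgboost); the proper explanation is deferred to the follow-up post:

from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from xgboost import XGBClassifier

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=2)

xgb = XGBClassifier(n_estimators=100, max_depth=3)  # gradient-boosted decision trees
xgb.fit(X_train, y_train)
print(xgb.score(X_test, y_test))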

An introduction to this topic will follow in a new post, to avoid TooMuchInfo :)
