Supervised Learning: A Glance at Powerful Classification Algorithms
Classification is the process of assigning things to groups (classes) according to the features they share. In supervised learning, the model learns this grouping from labeled examples.
Types of Classification Algorithms
1. Logistic Regression
2. Decision Tree
3. Random Forest
4. K Nearest Neighbour
5. Naive Bayes
1. Logistic Regression
- Used when the dependent variable (target) is categorical.
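As a rough illustration of how a categorical target can come out of a regression-style model, here is a minimal sketch. The `weight` and `bias` values are hand-picked assumptions, not learned from any data, and `predict_class` is a hypothetical helper name.

```python
import math

def sigmoid(z):
    """Map any real-valued score to a probability in (0, 1)."""
    return 1.0 / (1.0 + math.exp(-z))

def predict_class(x, weight, bias, threshold=0.5):
    """Return 1 if the modelled probability reaches the threshold, else 0."""
    probability = sigmoid(weight * x + bias)
    return 1 if probability >= threshold else 0

# Hypothetical, hand-picked parameters (a real model would learn these)
weight, bias = 2.0, -4.0
labels = [predict_class(x, weight, bias) for x in [0.0, 1.0, 3.0, 5.0]]
print(labels)  # [0, 0, 1, 1]
```

The sigmoid squeezes the linear score into a probability, and the threshold turns that probability into one of two categories.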
2. Decision Tree
- A graphical representation of all possible outcomes of a decision, reached by testing a sequence of conditions
- Can be easily explained and conceptualized
- Root Node: Represents the entire population or sample, which is then divided into two or more homogeneous sets.
- Leaf Node: A node that is not split any further. (A perfectly pure leaf has a Gini index of 0.)
- Splitting: Dividing the root node or a sub-node into different parts based on some condition.
- Subtree: A sub-section of the tree produced by an intermediate split.
- Pruning: Removing unwanted branches from the tree.
- Parent/Child node: A node that is split into sub-nodes is the parent of those sub-nodes, and the sub-nodes are its children. The root node is the parent of the nodes that branch directly from it.
Terminology for Choosing the Best Split
- Entropy: Measures the randomness (impurity) of the data. Computing it is the first step in building a decision tree with information gain.
- Information Gain: The decrease in entropy after a dataset is split on an attribute. Constructing a decision tree is all about finding, at each node, the attribute that returns the highest information gain.
- Gini Index: Another measure of impurity (or purity) used when building a decision tree.
- Reduction in Variance: Used for continuous target variables (regression problems). The split that yields the lowest variance in the resulting subsets is selected.
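To make the Gini index concrete, here is a small sketch using the common definition 1 - sum(p_k^2) over the class proportions p_k. The function name and the toy label lists are assumptions made up for illustration.

```python
from collections import Counter

def gini_impurity(labels):
    """Gini impurity: 1 - sum(p_k^2) over the classes present in labels."""
    total = len(labels)
    counts = Counter(labels)
    return 1.0 - sum((n / total) ** 2 for n in counts.values())

# Hypothetical toy nodes
pure_node = ["yes", "yes", "yes", "yes"]
mixed_node = ["yes", "yes", "no", "no"]

print(gini_impurity(pure_node))   # 0.0
print(gini_impurity(mixed_node))  # 0.5
```

A node holding a single class scores 0, while an evenly mixed two-class node scores 0.5, the maximum for two classes.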
Formula of Entropy
E(S) = -P(Yes) log2 P(Yes) - P(No) log2 P(No)
When P(Yes) = P(No) = 0.5 (equal numbers of Yes and No):
E(S) = -0.5 log2 0.5 - 0.5 log2 0.5
E(S) = -0.5 (log2 0.5 + log2 0.5) = -0.5 x (-2) = 1
When P(Yes) = 1 (only Yes in the sample space), the P(No) term is taken as 0:
E(S) = -1 log2 1 = 0
Similarly, when P(No) = 1 (only No in the sample space):
E(S) = -1 log2 1 = 0
Information Gain = Entropy(S) - [weighted average of the entropy of each subset produced by the split], where S is the total collection
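The entropy and information gain formulas can be checked with a few lines of Python. This is only a sketch: the function names and the toy Yes/No splits are assumptions, not taken from any particular library.

```python
import math
from collections import Counter

def entropy(labels):
    """Entropy in bits: sum of -p * log2(p) over the classes present."""
    total = len(labels)
    return sum(-(n / total) * math.log2(n / total) for n in Counter(labels).values())

def information_gain(parent, subsets):
    """Parent entropy minus the size-weighted average entropy of the subsets."""
    weighted = sum(len(s) / len(parent) * entropy(s) for s in subsets)
    return entropy(parent) - weighted

# Hypothetical collection S with equal numbers of Yes and No
parent = ["Yes"] * 5 + ["No"] * 5

# A split that separates the classes perfectly
perfect_split = [["Yes"] * 5, ["No"] * 5]
# A split whose subsets are as mixed as the parent
useless_split = [["Yes", "No"], ["Yes"] * 4 + ["No"] * 4]

print(entropy(parent))                          # 1.0
print(information_gain(parent, perfect_split))  # 1.0
print(information_gain(parent, useless_split))  # 0.0
```

The attribute whose split gives the largest gain would be chosen for the node.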
The important concept of Confusion Matrix
A confusion matrix is a table that is often used to describe the performance of a classification model (or “classifier”) on a set of test data for which the true values are known.
- Summarizes the counts of correct and incorrect predictions, grouped by class
Example: Patient having Disease or Not.
- true positives (TP): These are cases in which we predicted yes (they have the disease), and they do have the disease.
- true negatives (TN): We predicted no, and they don’t have the disease.
- false positives (FP): We predicted yes, but they don’t have the disease. (Also known as a “Type I error.”)
- false negatives (FN): We predicted no, but they do have the disease. (Also known as a “Type II error.”)
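from sklearn.metrics import confusion_matrix
# Ground-truth labels and the model's predictions (1 = disease, 0 = no disease)
expected = [1,1,0,1,0,1,1,0,0,0]
predicted = [1,0,0,1,0,0,1,0,1,1]
# Rows are actual classes, columns are predicted classes: [[TN, FP], [FN, TP]]
results = confusion_matrix(expected, predicted)
print(results)
# o/p: [[3 2]
#       [2 3]]
# The system predicts 6 times correctly (3 TN + 3 TP) and 4 times incorrectly (2 FP + 2 FN)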
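The four counts can also be tallied by hand and turned into summary scores. The sketch below reuses the same `expected` and `predicted` lists; accuracy, precision, and recall are standard metrics added here for illustration, not something the example above computes.

```python
expected = [1, 1, 0, 1, 0, 1, 1, 0, 0, 0]
predicted = [1, 0, 0, 1, 0, 0, 1, 0, 1, 1]

pairs = list(zip(expected, predicted))
tp = sum(1 for e, p in pairs if e == 1 and p == 1)  # predicted yes, actually yes
tn = sum(1 for e, p in pairs if e == 0 and p == 0)  # predicted no, actually no
fp = sum(1 for e, p in pairs if e == 0 and p == 1)  # Type I error
fn = sum(1 for e, p in pairs if e == 1 and p == 0)  # Type II error

accuracy = (tp + tn) / len(pairs)   # share of all predictions that are correct
precision = tp / (tp + fp)          # share of predicted positives that are real
recall = tp / (tp + fn)             # share of real positives that were found

print(tp, tn, fp, fn)               # 3 3 2 2
print(accuracy, precision, recall)  # 0.6 0.6 0.6
```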
3. Random Forest
- Builds multiple decision trees and merges them to get a more accurate and stable prediction. The final prediction aggregates all the trees: a majority vote for classification, an average for regression.
- Corrects a single decision tree's tendency to overfit the training set. Uses the bagging method: building multiple decision trees, each on a random sample of the dataset.
- Called "random" because each decision tree in the forest considers a random subset of features when forming questions and has access to only a random subset of the training data. That is why it is robust.
4. K Nearest Neighbour
- Stores the available data and classifies new data points based on similarity measures. Used in search and recommendation applications, e.g. on Flipkart, if you purchase a shirt, it recommends products from other categories such as pants.
- Select K, the number of nearest neighbors to consult.
- Suppose we introduce a new data point (star) and K=3. Using the smallest distances, it finds 2 yellow points and 1 blue point as its three closest neighbors, so the star is classified as yellow.
- For K=6, similarly, if most of the six closest neighbors are blue, the star is classified as blue.
- Lazy learner: there is no explicit training phase; the work is deferred to prediction time.
- Handle Dataset: Load the data and split it into training and test sets.
- Similarity: Calculate the distance between two data instances.
- Neighbors: Locate the K most similar data instances.
- Response: Generate a response (prediction) from a set of data instances.
- Accuracy: Summarize the accuracy of the predictions.
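The similarity, neighbors, response, and accuracy steps can be sketched from scratch as follows. The toy points, the colour labels, and all function names are hypothetical, and Euclidean distance is just one possible similarity measure.

```python
import math
from collections import Counter

def euclidean_distance(a, b):
    """Similarity step: straight-line distance between two points."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def get_neighbors(training_set, query_point, k):
    """Neighbors step: the k training rows closest to the query point."""
    ranked = sorted(training_set, key=lambda row: euclidean_distance(row[0], query_point))
    return ranked[:k]

def predict(training_set, query_point, k):
    """Response step: majority vote among the k nearest labels."""
    neighbors = get_neighbors(training_set, query_point, k)
    votes = Counter(label for _, label in neighbors)
    return votes.most_common(1)[0][0]

def accuracy(true_labels, predicted_labels):
    """Accuracy step: fraction of predictions that match the true labels."""
    correct = sum(1 for t, p in zip(true_labels, predicted_labels) if t == p)
    return correct / len(true_labels)

# Hypothetical toy data: (features, label)
training_set = [
    ((1.0, 1.0), "yellow"), ((1.5, 2.0), "yellow"), ((2.0, 1.5), "yellow"),
    ((6.0, 6.0), "blue"), ((6.5, 7.0), "blue"), ((7.0, 6.5), "blue"),
]
test_points = [(1.2, 1.4), (6.8, 6.2)]
test_labels = ["yellow", "blue"]

predictions = [predict(training_set, p, k=3) for p in test_points]
print(predictions, accuracy(test_labels, predictions))
```

Note that nothing is "trained" here: every prediction scans the stored data, which is what makes KNN a lazy learner.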
5. Naive Bayes Classifier
- A probabilistic method of machine learning, based on Bayes' theorem
- Assumption: the presence of a particular feature in a class is unrelated to the presence of any other feature.
Ex: An orange has the features color, shape, and size. Naive Bayes assumes that each feature contributes independently to the probability that the fruit is an orange.
Conditional probability: the probability of a second event (event B) given that a first event (event A) has already happened.
Ex: the probability of a customer buying baby pants given that they have already added milk to their cart.
Suppose you have a jar containing 6 marbles: 3 black and 3 white. Two marbles are drawn without replacement. What is the probability of getting a black marble on the second draw given that the first one was black too?
A = getting a black marble on the first draw
B = getting a black marble on the second draw
P(A) = 3/6 = 1/2
P(A and B) = (3/6) x (2/5) = 1/5 (after one black marble is removed, 2 of the remaining 5 are black)
P(B|A) = P(A ∩ B)/P(A) = 0.2/0.5 = 0.4
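The marble answer can be double-checked by brute force. This sketch assumes the two marbles are drawn without replacement and simply enumerates every ordered pair of draws.

```python
from fractions import Fraction
from itertools import permutations

marbles = ["black"] * 3 + ["white"] * 3

# Every ordered pair of distinct marbles (first draw, second draw): 6 x 5 = 30
draws = list(permutations(range(len(marbles)), 2))

# Event A: first marble is black; event A and B: both marbles are black
a = [d for d in draws if marbles[d[0]] == "black"]
a_and_b = [d for d in a if marbles[d[1]] == "black"]

p_a = Fraction(len(a), len(draws))
p_a_and_b = Fraction(len(a_and_b), len(draws))
p_b_given_a = p_a_and_b / p_a

print(p_a, p_a_and_b, p_b_given_a)  # 1/2 1/5 2/5
```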
- Bayes' theorem shows the relation between a conditional probability and its reverse form.
If the conditional probability is P(A|B), then the reverse probability is P(B|A), and Bayes' theorem relates them: P(A|B) = P(B|A) x P(A) / P(B)
Ex:
Event A: the patient has lung disease. Past data says 10% of patients had lung disease, so P(A) = 0.1
Event B: the patient smokes. 5% of patients are smokers, so P(B) = 0.05
You know that among the patients having lung disease, 7% are smokers, so P(B|A) = 0.07
So by Bayes' theorem, P(A|B) = (0.07 x 0.1)/0.05 = 0.14
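The lung disease example is a one-line calculation once Bayes' theorem is written as a function. `bayes_posterior` is a hypothetical helper name; the three input probabilities are the ones given in the example.

```python
def bayes_posterior(p_b_given_a, p_a, p_b):
    """Bayes' theorem: P(A|B) = P(B|A) * P(A) / P(B)."""
    return p_b_given_a * p_a / p_b

# A = patient has lung disease, B = patient smokes
p_disease_given_smoker = bayes_posterior(p_b_given_a=0.07, p_a=0.1, p_b=0.05)
print(round(p_disease_given_smoker, 2))  # 0.14
```

So under these numbers, a smoker has a 14% probability of having lung disease, compared with 10% for patients overall.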
Example: training data set of weather and corresponding target variable ‘Play’ (suggesting possibilities of playing). Now, we need to classify whether players will play or not based on weather conditions.
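A weather example like this can be sketched in a few lines. The rows below are made-up toy data (not the dataset from any article), the second feature `windy` is an assumption added so that the independence assumption has something to multiply, and no smoothing is applied, so an unseen feature value zeroes out a class.

```python
from fractions import Fraction

# Hypothetical toy rows: (outlook, windy, play)
rows = [
    ("Sunny", False, "Yes"), ("Sunny", True, "No"),
    ("Overcast", False, "Yes"), ("Overcast", True, "Yes"),
    ("Rainy", False, "Yes"), ("Rainy", True, "No"),
    ("Sunny", False, "Yes"), ("Rainy", True, "No"),
]

def class_scores(rows, query):
    """Score each class as P(class) times the product of P(feature | class)."""
    scores = {}
    for label in {row[-1] for row in rows}:
        subset = [row for row in rows if row[-1] == label]
        score = Fraction(len(subset), len(rows))  # prior P(class)
        for i, value in enumerate(query):
            matches = sum(1 for row in subset if row[i] == value)
            score *= Fraction(matches, len(subset))  # naive independence
        scores[label] = score
    return scores

def predict(rows, query):
    """Pick the class with the highest score."""
    scores = class_scores(rows, query)
    return max(scores, key=scores.get)

print(predict(rows, ("Sunny", False)))  # Yes
print(predict(rows, ("Rainy", True)))   # No
```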
Example: Predicting Heart diseases based on Decision Tree and Naive Bayes theorem
6. Support Vector Machine
- A support vector machine (SVM) is a supervised machine learning algorithm that classifies data based on its features.
- SVM separates data using hyperplanes. There are infinitely many hyperplanes that could separate the data; to select the best fit we use support vectors.
- Support vectors are the data points of each class that lie nearest to the hyperplane.
- The optimal hyperplane is the one with the maximum distance to the support vectors.
- The distance between the support vectors of the two classes is known as the margin.
Suppose we introduce new data. We first create candidate hyperplanes based on the support vectors, then choose the hyperplane with the largest margin as the best fit. In the example below, margin1 is the better fit.
For non-linear Data
- SVM uses a kernel function to transform non-linear 2D data into a higher-dimensional space, where it can be separated by a hyperplane
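The idea of lifting data to a higher dimension can be shown with a toy example. This sketch uses an explicit, hand-picked feature map (`lift`, a hypothetical name) on two concentric rings. A real kernel SVM does not build the new coordinates explicitly but works with inner products, so treat this only as intuition.

```python
import math

def lift(point):
    """Hypothetical feature map: append x^2 + y^2 as a third coordinate."""
    x, y = point
    return (x, y, x * x + y * y)

# Two concentric rings: no straight line in 2D can separate them
angles = [i * math.pi / 4 for i in range(8)]
inner = [(math.cos(a), math.sin(a)) for a in angles]          # radius 1
outer = [(3 * math.cos(a), 3 * math.sin(a)) for a in angles]  # radius 3

# In 3D the flat plane z = 4 separates the two rings
threshold = 4.0
inner_below = all(lift(p)[2] < threshold for p in inner)
outer_above = all(lift(p)[2] > threshold for p in outer)

print(inner_below, outer_above)  # True True
```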
Demo: using Cancer dataset
Dataset: built-in sample dataset in the sklearn.datasets package
Notebook on nbviewer
Features: [‘mean radius’ ‘mean texture’ ‘mean perimeter’ ‘mean area’ ‘mean smoothness’ ‘mean compactness’ ‘mean…
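A minimal version of such a demo might look like the sketch below. It assumes scikit-learn's built-in breast cancer dataset (`load_breast_cancer`) is the one meant here; the train/test split, the feature scaling, and the RBF kernel are choices made for illustration and may differ from the linked notebook.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

data = load_breast_cancer()

# Hold out 30% of the rows for testing (split parameters are assumptions)
X_train, X_test, y_train, y_test = train_test_split(
    data.data, data.target, test_size=0.3, random_state=42, stratify=data.target
)

# Scale the features first, since SVMs are sensitive to feature ranges
model = make_pipeline(StandardScaler(), SVC(kernel="rbf"))
model.fit(X_train, y_train)

test_accuracy = model.score(X_test, y_test)
print(f"Test accuracy: {test_accuracy:.3f}")
```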
7. Random Forest
This is an ensemble method: it combines several base models in order to produce one optimal predictive model.
Instead of building one model for the entire dataset, the ensemble approach divides the dataset into n training sets, builds a model on each training set, and then aggregates the outputs of the individual models to reach a decision.
The technique used in Random Forest is bagging.
Sampling with Replacement
From the n records in the data, e.g. [x,y,z,a,b,c], we pick n' samples to fill D1 such that n' < n. The n' samples are selected randomly with repetition, e.g. D1 could contain [a,a,x,y], D2 could contain [b,a,x,x], etc.
Create m such datasets, each with the same number n' of samples, and build a decision tree on each dataset.
Suppose we need to predict for a new record R: we get one prediction from each tree created from the bags, then classify based on the highest number of votes (the aggregated result).
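Sampling with replacement and vote aggregation can be sketched without any ML library. The helper names are hypothetical, the seed is fixed only to make the run repeatable, and the three votes stand in for the predictions of three trees.

```python
import random
from collections import Counter

def bootstrap_sample(records, sample_size, rng):
    """Draw sample_size records at random, with replacement (repeats allowed)."""
    return [rng.choice(records) for _ in range(sample_size)]

def majority_vote(predictions):
    """Aggregate the per-tree predictions by picking the most common one."""
    return Counter(predictions).most_common(1)[0][0]

rng = random.Random(0)
records = ["x", "y", "z", "a", "b", "c"]

# Three bags of four records each; a record may appear more than once in a bag
bags = [bootstrap_sample(records, 4, rng) for _ in range(3)]
print(bags)

# Hypothetical predictions from three trees for one new record
print(majority_vote(["cat", "dog", "cat"]))  # cat
```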
Random forest is very similar to bagging: the same process is used to create the datasets, and the only difference is in how the models (decision trees) are built.
Suppose the dataset has 5 independent variables (columns). In bagging, all the independent variables are considered when creating each decision tree. In a random forest, only a random subset of the 5 independent variables is offered at each node split, so that each of the trees is different.
Demo: Iris Data set
Implementation: on Iris Data set
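A minimal Iris run with scikit-learn could look like the sketch below. The split, the number of trees, and `max_features="sqrt"` (the random feature subset tried at each split) are assumptions chosen for illustration and may not match the linked notebook.

```python
from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

iris = load_iris()

# Hold out 30% of the flowers for testing (split parameters are assumptions)
X_train, X_test, y_train, y_test = train_test_split(
    iris.data, iris.target, test_size=0.3, random_state=0, stratify=iris.target
)

# 100 trees, each grown on a bootstrap sample, with a random subset of
# features considered at every node split
forest = RandomForestClassifier(n_estimators=100, max_features="sqrt", random_state=0)
forest.fit(X_train, y_train)

forest_accuracy = forest.score(X_test, y_test)
print(f"Test accuracy: {forest_accuracy:.3f}")
```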
Notebook on nbviewer
XGBClassifier(base_score=0.5, booster='gbtree', colsample_bylevel=1, colsample_bynode=1, colsample_bytree=1, gamma=0…
Introduction to this topic will be given in a new post to avoid ToMuchInfo :)