# Regression Models

Regression is a technique that models the relationship between a dependent variable y and an independent variable x.

Example: y (inches of rain) varies according to x (new cars sold).

- If you suspect a relationship between two quantities, regression helps confirm it and measure its strength.

Main types:

- Linear Regression: continuous output, solves regression problems, fits a straight line
- Logistic Regression: categorical output, solves classification problems, fits an S-curve

## Linear Regression

- Fits the line equation y = mx + c
- Simple linear regression finds the relationship between two continuous variables: one independent (x) and one dependent (y)
- Good for problems that ask for the value of y at a given x, such as predicting house size (y) for a given budget (x), but not suited for classification problems like deciding whether the house is in a good locality

Video explaining step-by-step linear regression:

Least squares: the method used to fit the regression line

https://www.youtube.com/watch?v=JvS2triCgOY

Steps: calculate the mean of x and the mean of y; the regression line always passes through the point (mean of x, mean of y).

Then find m and c by substituting the mean point into y = mx + c, draw the best-fit line, and compute the R² value.

R² value: goodness of fit, reflected in the rotation of the regression line (tells whether, and by how much, the dependent variable is explained by the independent variable): https://www.youtube.com/watch?v=w2FKXOa0HGA&t=192s

Standard Error of Estimate: https://www.youtube.com/watch?v=r-txC-dpI-E
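The least-squares fit, R² value, and standard error of estimate described above can be sketched together; the small data set here is invented purely for illustration:

```python
import numpy as np

# Hypothetical sample data for illustration
x = np.array([1, 2, 3, 4, 5], dtype=float)
y = np.array([2.0, 4.1, 5.9, 8.2, 9.8])

# Least-squares slope m; the line passes through (mean x, mean y)
m = np.sum((x - x.mean()) * (y - y.mean())) / np.sum((x - x.mean()) ** 2)
c = y.mean() - m * x.mean()

y_pred = m * x + c
ss_res = np.sum((y - y_pred) ** 2)    # residual sum of squares
ss_tot = np.sum((y - y.mean()) ** 2)  # total sum of squares

r2 = 1 - ss_res / ss_tot              # goodness of fit (close to 1 = good)

# Standard error of estimate: typical spread of points around the line
n = len(x)
see = np.sqrt(ss_res / (n - 2))

print(m, c, r2, see)
```

An R² near 1 means the line explains almost all the variation in y; the standard error of estimate gives that spread in the same units as y.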

Basic linear regression in Python: plotting and the y = b0 + b1*x coefficients

```python
import numpy as np
import matplotlib.pyplot as plt

def estimate_coefficient(x, y):
    n = np.size(x)
    mean_x, mean_y = np.mean(x), np.mean(y)

    # Sums of squares about the means
    SS_xy = np.sum(y * x) - n * mean_y * mean_x
    SS_xx = np.sum(x * x) - n * mean_x * mean_x

    b1 = SS_xy / SS_xx           # slope
    b0 = mean_y - b1 * mean_x    # intercept
    return (b0, b1)

def plot_regression_line(x, y, b):
    plt.scatter(x, y, color='m', marker='o')
    y_pred = b[0] + b[1] * x
    plt.plot(x, y_pred, color='g')
    plt.xlabel('Size')
    plt.ylabel('Cost')
    plt.show()

x = np.array([1, 2, 3, 4, 5, 6, 7, 8, 9, 10])
y = np.array([300, 350, 500, 700, 800, 850, 900, 900, 1000, 1200])

b = estimate_coefficient(x, y)
print("estimated coefficients:\nb0 = {}\nb1 = {}".format(b[0], b[1]))
plot_regression_line(x, y, b)
```

## Logistic Regression

- A statistical classification model
- Deals with categorical dependent variables
- The output can be binary or multinomial (multiple distinct classes)
- Takes both continuous and discrete input data
- Gives the outcome as a probability, which helps in classifying
- Works well with large datasets

## Step by Step calculation of Logistic Regression

Example: a basic spam email classifier

1. Define the variables

Independent variable: count of spam words

Examples of spam words: Lottery, Winner, Crores, Free, etc.

Dependent variable: label, Spam (1) or Not Spam (0)

2. Plot the labeled data

3. Draw the regression line

Steps for creating the sigmoid curve of best fit:

- Convert probability, on a scale of 0 to 1, to log(odds), which ranges from -∞ to +∞

Probability = favourable events / total events

Odds = favourable events / unfavourable events

log(odds) = the logit function

log(odds ratio) = log(odds for case 1 / odds for case 2)
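The distinction between probability, odds, and log(odds) can be checked with a quick worked example (the counts below are made up for illustration):

```python
import math

# Hypothetical: 8 spam emails out of 10 total
favourable, total = 8, 10
unfavourable = total - favourable

probability = favourable / total   # favourable / total events
odds = favourable / unfavourable   # favourable / unfavourable events
log_odds = math.log(odds)          # the logit

# Same logit obtained directly from probability: log(p / (1 - p))
assert math.isclose(log_odds, math.log(probability / (1 - probability)))

print(probability, odds, round(log_odds, 3))
```

Note that odds are not bounded by 1: a probability of 0.8 corresponds to odds of 4.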

For our spam classifier, converting the probability scale (0 to 1) to the log(odds) scale:

To turn probabilities between 0 and 1 into a more meaningful y-axis, we convert them to log(odds) using the logit formula log(p / (1 - p)). The line passing through zero corresponds to p = 0.5.

For an email that is certainly spam: log(odds) = log(P(spam) / (1 - P(spam))) = log(1 / (1 - 1)) = log(1/0) -> +∞

For an email that is certainly not spam: log(odds) = log(P(spam) / (1 - P(spam))) = log(0 / (1 - 0)) = log(0/1) -> -∞

- The sigmoid curve helps in classification problems: as in the figure, the value Ye is predicted from a given Xe, and since Ye > 0.5 the point is classified as label 1.
- The sigmoid can take any real value as input and maps it to a value between 0 and 1.
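The mapping from any real value to the (0, 1) range, and the 0.5 decision threshold, can be sketched directly:

```python
import numpy as np

def sigmoid(z):
    """Map any real-valued input into the (0, 1) range."""
    return 1.0 / (1.0 + np.exp(-z))

# log(odds) of 0 corresponds to probability 0.5, the decision boundary
print(sigmoid(0.0))         # 0.5
print(sigmoid(5.0) > 0.5)   # classified as label 1
print(sigmoid(-5.0) < 0.5)  # classified as label 0
```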

4. Find the best fit using MLE (maximum likelihood estimation)
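A minimal version of this spam classifier can be sketched with scikit-learn's `LogisticRegression`, whose solver performs the likelihood maximization internally; the spam-word counts and labels below are invented for illustration:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

# Hypothetical training data: x = count of spam words per email,
# y = label, Spam (1) or Not Spam (0)
x = np.array([0, 1, 1, 2, 3, 4, 5, 6, 7, 8]).reshape(-1, 1)
y = np.array([0, 0, 0, 0, 1, 1, 1, 1, 1, 1])

model = LogisticRegression()
model.fit(x, y)

# predict_proba gives [P(not spam), P(spam)]; predict applies the 0.5 cutoff
print(model.predict_proba([[6]]))
print(model.predict([[0]]))  # low spam-word count -> not spam
print(model.predict([[8]]))  # high spam-word count -> spam
```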

## Regression Models using the sklearn Package

Using the Boston dataset: another "hello world" program for a linear regression problem.

Dataset: