Supervised Learning: Linear Regression, Logistic Regression


Regression

Linear Regression

Linear regression is perhaps one of the most well-known and well-understood algorithms in statistics and machine learning.

Linear regression was developed in the field of statistics and is studied as a model for understanding the relationship between input and output numerical variables, but has been borrowed by machine learning. It is both a statistical algorithm and a machine learning algorithm.

Linear regression is a linear model, i.e. a model that assumes a linear relationship between the input variables ($x$) and the single output variable ($y$). More specifically, it assumes that $y$ can be calculated from a linear combination of the input variables ($x$).

When there is a single input variable ($x$), the method is referred to as simple linear regression. When there are multiple input variables, literature from statistics often refers to the method as multiple linear regression.

Different techniques can be used to prepare or train the linear regression equation from data, the most common of which is called Ordinary Least Squares. It is common to therefore refer to a model prepared this way as Ordinary Least Squares Linear Regression or just Least Squares Regression.

Linear Regression Model

The linear regression model is simple. The representation is a linear equation that combines a specific set of input values ($x$), the solution to which is the predicted output for that set of input values ($y$). As such, both the input values ($x$) and the output value ($y$) are numeric.

For example, in a simple regression problem (a single x and a single y), the form of the model would be:

$y = \beta_0 + \beta_1*x$

In higher dimensions when we have more than one input ($x$), the line is called a plane or a hyper-plane. The representation therefore is the form of the equation and the specific values used for the coefficients.
It is common to talk about the complexity of a regression model like linear regression. This refers to the number of coefficients used in the model.

Learning a linear regression model means estimating the values of the coefficients used in the representation with the data that we have available.

Take note of Ordinary Least Squares because it is the most common method used in general. Also take note of Gradient Descent as it is the most common technique taught in machine learning classes.

1. Simple Linear Regression

With simple linear regression when we have a single input, we can use statistics to estimate the coefficients.

This requires that you calculate statistical properties from the data such as means, standard deviations, correlations and covariance. All of the data must be available to traverse and calculate statistics.
The key point in Simple Linear Regression is that the dependent variable must be a continuous/real value. The independent variable, however, can be continuous or categorical.

The simple linear regression algorithm has two main objectives:
Model the relationship between two variables, such as the relationship between income and expenditure, or experience and salary.
Forecast new observations, such as forecasting weather according to temperature, or a company's revenue according to its investments in a year.

Simple Linear Regression involves the model
$Y=\beta_0+\beta_1 X$

The ordinary least squares estimates of $\beta_0$ and $\beta_1$ minimize the sum of squared errors

$SSE(\beta_0,\beta_1)=\sum_{i=1}^{n}(Y_i-\hat{Y_i})^2$, where $\hat{Y}$ is the estimated value of $Y$
Taking the derivatives of $SSE(\beta_0,\beta_1)$ with respect to $\beta_0$ and $\beta_1$, equating them to zero, and solving for $\beta_0$ and $\beta_1$ yields the optimal values

$\frac{\partial }{\partial \beta_0}SSE(\beta_0,\beta_1)=-2\sum_{i=1}^n(y_i-\beta_0-\beta_1x_i)=0$
$\frac{\partial }{\partial \beta_1}SSE(\beta_0,\beta_1)=-2\sum_{i=1}^nx_i(y_i-\beta_0-\beta_1x_i)=0$
These give the normal equations
$n\beta_0+\beta_1\sum_{i=1}^n x_i=\sum_{i=1}^n y_i$
$\beta_0 \sum_{i=1}^n x_i+\beta_1\sum_{i=1}^n x_i^2=\sum_{i=1}^n x_iy_i$
Solving these for $\beta_0$ and $\beta_1$ gives the least squares estimates.

The least squares estimates of $\beta_0$ and $\beta_1$ are
$\hat{\beta_1}=\frac{{}\sum_{i=1}^n(X_i-\bar{X})(Y_i-\bar{Y})}{\sum_{i=1}^{n}(X_i-\bar{X})^2}=\frac{Cov(X,Y)}{Var(X)}$

$\hat{\beta_0}=\bar{Y}-\hat{\beta_1}\bar{X}$

Example (university question):
Determine the regression equation by finding the regression slope coefficient and the intercept value using the following data.
$\begin{vmatrix}
x& 55 & 60 & 65 & 70& 80 \\
y & 52 & 54 & 56 & 58 & 62
\end{vmatrix}$


Here $\bar{x}=66$, $\bar{y}=56.4$, $\sum(x_i-\bar{x})(y_i-\bar{y})=148$ and $\sum(x_i-\bar{x})^2=370$, so $\hat{\beta_1}=148/370=0.4$ and $\hat{\beta_0}=56.4-0.4 \times 66=30$.
So the regression equation is $y=30+0.4x$

Python code:
import numpy as np
import matplotlib.pyplot as plt

def estimate_coef(x, y):
    # number of observations/points
    n = np.size(x)

    # mean of x and y vector
    m_x = np.mean(x)
    m_y = np.mean(y)
    print("mean x",m_x)
    print("mean y",m_y)
    # calculating cross-deviation and deviation about x
   
    SS_xy=np.sum((x-m_x)*(y-m_y))
    SS_xx=np.sum((x-m_x)*(x-m_x))
    print("covar xy",SS_xy)
    print("var x",SS_xx)
    # calculating regression coefficients
    b_1 = SS_xy / SS_xx
    b_0 = m_y - b_1*m_x

    return (b_0, b_1)

def plot_regression_line(x, y, b):
    # plotting the actual points as scatter plot
    plt.scatter(x, y, color = "m",marker = "o", s = 30)

    # predicted response vector
    y_pred = b[0] + b[1]*x

    # plotting the regression line
    plt.plot(x, y_pred, color = "g")

    # putting labels
    plt.xlabel('x')
    plt.ylabel('y')

    # function to show plot
    plt.show()

# observations / data
x = np.array([55, 60, 65, 70,80])
y = np.array([52,54,56,58,62])

# estimating coefficients
b = estimate_coef(x, y)
print("Estimated coefficients:\nb_0 = {} \
\nb_1 = {}".format(b[0], b[1]))

# plotting regression line
plot_regression_line(x, y, b)

output:
mean x 66.0
mean y 56.4
covar xy 148.0
var x 370.0
Estimated coefficients:
b_0 = 29.999999999999996 
b_1 = 0.4

The values of x and their corresponding values of y are shown in the table below (university question).
Find the least squares regression line $y=ax+b$.
Estimate the value of y when x=10.
$\begin{vmatrix}
x & 0 & 1 & 2 & 3 & 4 \\
y & 2 & 3 & 5 & 4 & 6
\end{vmatrix}$
Estimated coefficients: $b_0 = 2.2$, $b_1 = 0.9$
Regression line: $y=0.9x+2.2$
Estimated value of y at x=10: $y=0.9 \times 10+2.2=11.2$

Example with Python code (on a different dataset):
import numpy as np
import matplotlib.pyplot as plt

def estimate_coef(x, y):
    # number of observations/points
    n = np.size(x)

    # mean of x and y vector
    m_x = np.mean(x)
    m_y = np.mean(y)

    # calculating cross-deviation and deviation about x
    SS_xy=np.sum((x-m_x)*(y-m_y))
    SS_xx=np.sum((x-m_x)*(x-m_x))

    # calculating regression coefficients
    b_1 = SS_xy / SS_xx
    b_0 = m_y - b_1*m_x

    return (b_0, b_1)

def plot_regression_line(x, y, b):
    # plotting the actual points as scatter plot
    plt.scatter(x, y, color = "m",marker = "o", s = 30)

    # predicted response vector
    y_pred = b[0] + b[1]*x

    # plotting the regression line
    plt.plot(x, y_pred, color = "g")

    # putting labels
    plt.xlabel('x')
    plt.ylabel('y')

    # function to show plot
    plt.show()

# observations / data
x = np.array([0, 1, 2, 3, 4, 5, 6, 7, 8, 9])
y = np.array([1, 3, 2, 5, 7, 8, 8, 9, 10, 12])

# estimating coefficients
b = estimate_coef(x, y)
print("Estimated coefficients:\nb_0 = {} \
\nb_1 = {}".format(b[0], b[1]))

# plotting regression line
plot_regression_line(x, y, b)

Estimated coefficients: 
b_0 = 1.2363636363636363
b_1 = 1.1696969696969697
Use the following data to construct a linear regression model for the auto insurance premium as a function of driving experience (university question).
$\begin{vmatrix}
\text{Driving Exp} & 5 & 2 & 12 & 9 & 15 & 6 & 25 & 16 \\
\text{Monthly Premium} & 64 & 87 & 50 & 71 & 44 & 56 & 42 & 60
\end{vmatrix}$


$y=\beta_0+\beta_1 x$
$y=76.66-1.55x$

Real-time example with a Python library
Let's consider a data file houseprices.csv

   area   price
0  2600  550000
1  3000  565000
2  3200  610000
3  3600  680000
4  4000  725000
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
from sklearn import linear_model
df=pd.read_csv('houseprices.csv')
print(df)
reg=linear_model.LinearRegression()
reg.fit(df[['area']],df.price)
print("beta1=",reg.coef_)
print("beta0=",reg.intercept_)
inputarea=np.array([[3300]])
print("sample prediction with area 3300",reg.predict(inputarea))

%matplotlib inline
plt.scatter(df.area,df.price,color='red')
plt.xlabel('area in sqft')
plt.ylabel('prices in us$')
plt.plot(df.area,reg.predict(df[['area']]),color='blue')
plt.show()

outputs
   area   price
0  2600  550000
1  3000  565000
2  3200  610000
3  3600  680000
4  4000  725000
beta1= [135.78767123] 
beta0= 180616.43835616432 
sample prediction with area 3300 [628715.75342466]


Predict the price of a 1000 square feet house using the regression model generated from the following data.
Square feet    Price (Lakhs)
500            5
900            10
1200           13
1500           18
2000           25
2500           32
2700           35

Let's find the regression line $y=w_0+w_1x$
where $x$ = square feet and $y$ = price in lakhs
$\bar{x}=1614.2857$
$\bar{y}=19.71$
$Var(x)=4048571$
$Cov(x,y)=55828.57$
$w_1=Cov(x,y)/Var(x)=0.01379$
$w_0=\bar{y}-w_1\bar{x}=-2.55$
So the regression line is $y=-2.55+0.01379x$
The prediction for $x=1000$ is $y=-2.55+0.01379 \times 1000=11.24$ lakhs

The predicted point is shown in red in the plot.
Note: the code below uses the scipy.stats linregress method (on a different sample dataset).
 
import matplotlib.pyplot as plt
from scipy import stats

x = [5,7,8,7,2,17,2,9,4,11,12,9,6]
y = [99,86,87,88,111,86,103,87,94,78,77,85,86]

slope, intercept, r, p, std_err = stats.linregress(x, y)

def myfunc(x):
  return slope * x + intercept

mymodel = list(map(myfunc, x))

plt.scatter(x, y)
plt.plot(x, mymodel)
plt.show()




2. Multiple Linear Regression (Multivariate Linear Regression)

In Simple Linear Regression, a single independent/predictor variable (X) is used to model the response variable (Y). But there may be various cases in which the response variable is affected by more than one predictor variable; for such cases, the Multiple Linear Regression algorithm is used.

When we have more than one input we can use Ordinary Least Squares to estimate the values of the coefficients.

The Ordinary Least Squares procedure seeks to minimize the sum of the squared residuals. This means that given a regression line through the data we calculate the distance from each data point to the regression line, square it, and sum all of the squared errors together. This is the quantity that ordinary least squares seeks to minimize.

This approach treats the data as a matrix and uses linear algebra operations to estimate the optimal values for the coefficients. It means that all of the data must be available and you must have enough memory to fit the data and perform matrix operations.

Consider a dataset with $p$ features(or independent variables) and one response(or dependent variable).
Also, the dataset contains $n$ rows/observations.
We define:
$X$ (feature matrix) = a matrix of size $n \times p$ where $x_{ij}$ denotes the value of the $j$th feature for the $i$th observation.
$\begin{bmatrix}x_{11} & \cdots & x_{1p}\\
x_{21} & \cdots & x_{2p} \\
\vdots & \ddots & \vdots\\
x_{n1}& \cdots & x_{np}
\end{bmatrix}$

$y$ (response vector) = a vector of size $n$ where $y_{i}$ denotes the value of the response for the $i$th observation.
$\begin{bmatrix}
y_1\\
y_2\\
\vdots\\
y_n\\
\end{bmatrix}$
The regression line for $p$ features is represented as
$y_i=\beta_0 + \beta_1 x_{i1}+ \beta_2 x_{i2}+ \cdots + \beta_p x_{ip}$

We can generalize our linear model a little bit more by representing feature matrix $X$ as:
$\begin{bmatrix}1 & x_{11} & \cdots & x_{1p}\\
1 & x_{21} & \cdots & x_{2p} \\
\vdots & \vdots & \ddots & \vdots\\
1 & x_{n1} & \cdots & x_{np}
\end{bmatrix}$
So now, the linear model can be expressed in terms of matrices as:
$y=X \beta $
where $\beta=$
$\begin{bmatrix}
\beta_0\\
\beta_1\\
\vdots\\
\beta_p\\
\end{bmatrix}$
Now, we determine an estimate of $\beta$ using the Least Squares method.
As already explained, the Least Squares method determines $\beta$ so that the total residual error is minimized:
$\hat{\beta}=(X^TX)^{-1}X^Ty$
Knowing the least squares estimate $\hat{\beta}$, the multiple linear regression model can now be estimated as:
$\hat{y}=X\hat{\beta}$
where $\hat{y}$ is the estimated response vector.

Example
Fit a multiple linear regression model to the following data
$\begin{vmatrix}
X1 & 1 & 1 & 2 & 0 \\
X2 & 1 & 2 & 2 & 1 \\
Y & 3.25 & 6.5 & 3.5 & 5
\end{vmatrix}$

$\hat{\beta}=(X^TX)^{-1}X^Ty$
$X= \begin{bmatrix}
1 &1 &1 \\
1&1 &2 \\
1 & 2 & 2 \\
1 & 0 & 1 \\
\end{bmatrix}$
$X^TX=\begin{bmatrix}4 & 4 & 6 \\
4 & 6 & 7 \\
6 & 7 & 10 \\
\end{bmatrix}$
$(X^TX)^{-1}=\begin{bmatrix}2.75 & 0.5 & -2 \\
0.5 & 1 & -1 \\
-2 & -1 & 2 \\
\end{bmatrix}$
$X^Ty=\begin{bmatrix}18.25 \\
16.75 \\
28.25 \\
\end{bmatrix}$
$\hat{\beta}=\begin{bmatrix} 2.0625 \\
-2.375 \\
3.25 \\
\end{bmatrix}$
Python Code
import numpy as np

# Given data
x1 = np.array([1, 1, 2, 0])
x2 = np.array([1, 2, 2, 1])
y = np.array([3.25, 6.5, 3.5, 5])

# Number of observations
m = len(y)

# Construct the X matrix with a column of ones for the intercept
X = np.column_stack((np.ones(m), x1, x2))

# Compute the coefficients using the normal equation
beta = np.linalg.inv(X.T @ X) @ X.T @ y

# Extract the intercept and coefficients
intercept = beta[0]
coefficients = beta[1:]

# Make predictions
y_pred = X @ beta

# Model evaluation
mse = np.mean((y - y_pred) ** 2)
r2 = 1 - (np.sum((y - y_pred) ** 2) / np.sum((y - np.mean(y)) ** 2))

print(intercept, coefficients, mse, r2, y_pred)
2.062500000000007 
[-2.375 3.25 ] 
0.09765625000000007 
0.9425287356321839 
[2.9375 6.1875 3.8125 5.3125]

Using sklearn library
import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression

# Given data
x1 = [1, 1, 2, 0]
x2 = [1, 2, 2, 1]
y = [3.25, 6.5, 3.5, 5]

# Create a DataFrame
data = {
    'x1': x1,
    'x2': x2,
    'y': y
}

df = pd.DataFrame(data)

# Independent variables (x1, x2)
X = df[['x1', 'x2']]

# Dependent variable (y)
y = df['y']

# Initialize the linear regression model
model = LinearRegression()

# Fit the model
model.fit(X, y)

# Coefficients
intercept = model.intercept_
coefficients = model.coef_

print(f"Intercept: {intercept}")
print(f"Coefficients: {coefficients}")

Intercept: 2.0625000000000004 
Coefficients: [-2.375 3.25 ]

Example with Python code
Let's consider a csv file houseprices.csv with the following attributes:
area  bedrooms  age   price
2600       3.0   20  550000
3000       4.0   15  565000
3200       NaN   18  610000
3600       3.0   30  680000
4000       5.0    8  725000
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import math
from sklearn import linear_model
df=pd.read_csv('houseprices.csv')
#do the data preprocessing fill NaN with median value
mv_bedrooms=math.floor(df.bedrooms.median())
df.bedrooms=df.bedrooms.fillna(mv_bedrooms)
print(df)

#create a linear regression model and fitting the model
reg=linear_model.LinearRegression() 
reg.fit(df[['area','bedrooms','age']],df.price) 
print("Coefficents",reg.coef_) 
print("Intercept",reg.intercept_) 
print("prediction using values 5000,4,10",reg.predict([[5000,4,10]]))

outputs
   area  bedrooms  age   price
0  2600       3.0   20  550000
1  3000       4.0   15  565000
2  3200       3.0   18  610000
3  3600       3.0   30  680000
4  4000       5.0    8  725000
Coefficients [ 143.625 -6762.5 337.5 ]
Intercept 173112.5
prediction using values 5000,4,10 [867562.5]

3. Gradient Descent
Gradient descent (GD) is an iterative first-order optimization algorithm used to find a local minimum/maximum of a given function. This method is commonly used in machine learning (ML) and deep learning (DL) to minimize a cost/loss function (e.g. in linear regression).

The gradient is the slope of a curve at a given point in a specified direction. In the case of a univariate function, it is simply the first derivative at a selected point. In the case of a multivariate function, it is the vector of derivatives along each main direction (variable axis). Because we are interested only in the slope along one axis and we don't care about the others, these derivatives are called partial derivatives.

Gradient descent algorithm does not work for all functions. There are two specific requirements. A function has to be:
differentiable
convex


First, what does it mean for a function to be differentiable? If a function is differentiable, it has a derivative at each point in its domain; not all functions meet this criterion.

(Figures: examples of differentiable and non-differentiable functions.)

Next requirement: the function has to be convex. For a univariate function, this means that the line segment connecting two of the function's points lies on or above its curve (it does not cross it). If it does cross the curve, the function has a local minimum which is not a global one.
Another way to check mathematically whether a univariate function is convex is to calculate the second derivative and check that its value is always greater than or equal to zero.

When there are one or more inputs you can use a process of optimizing the values of the coefficients by iteratively minimizing the error of the model on your training data.

This operation is called Gradient Descent and works by starting with random values for each coefficient. The sum of the squared errors is calculated for each pair of input and output values. A learning rate is used as a scale factor, and the coefficients are updated in the direction of minimizing the error. The process is repeated until a minimum sum of squared error is achieved or no further improvement is possible.

When using this method, you must select a learning rate (alpha) parameter that determines the size of the improvement step to take on each iteration of the procedure.

Gradient descent is often taught using a linear regression model because it is relatively straightforward to understand. In practice, it is useful when you have a very large data set either in the number of rows or the number of columns that may not fit into memory.

Gradient descent is an optimization algorithm used to find the values of parameters (coefficients) of a function (f) that minimizes a cost function (cost).

Gradient descent is best used when the parameters cannot be calculated analytically (e.g. using linear algebra) and must be searched for by an optimization algorithm.

For the linear regression problem the coefficients can be optimized by evaluating the cost function which is the sum of squared error

Gradient Descent Procedure

The procedure starts off with initial values for the coefficient or coefficients for the function. These could be 0.0 or a small random value.

$coefficient$ = start with a small random value

The cost of the coefficients is evaluated by plugging them into the function and calculating the cost.

$cost = evaluate(f(coefficient))$

The derivative of the cost is calculated. The derivative is a concept from calculus and refers to the slope of the function at a given point. We need to know the slope so that we know the direction (sign) to move the coefficient values in order to get a lower cost on the next iteration.

$delta = derivative(cost)$

Now that we know from the derivative which direction is downhill, we can now update the coefficient values. A learning rate parameter (alpha) must be specified that controls how much the coefficients can change on each update.

$coefficient = coefficient - (alpha * delta)$

This process is repeated until the cost of the coefficients (cost) is 0.0 or close enough to zero to be good enough.
You can see how simple gradient descent is. It does require you to know the gradient of your cost function or the function you are optimizing, but besides that, it's very straightforward.

The Gradient Descent algorithm iteratively calculates the next point using the gradient at the current position, scales it (by the learning rate) and subtracts the obtained value from the current position (makes a step). It subtracts the value because we want to minimize the function (to maximize it we would add). This process can be written as:
$p_{n+1}=p_n-\alpha \nabla f(p_n)$

There's an important parameter $\alpha$ which scales the gradient and thus controls the step size. In machine learning, it is called the learning rate and has a strong influence on performance. The smaller the learning rate, the longer GD takes to converge, and it may reach the maximum number of iterations before reaching the optimum point. If the learning rate is too big, the algorithm may not converge to the optimal point (it may jump around) or may even diverge completely.

In summary, the Gradient Descent method's steps are:
choose a starting point (initialization)
calculate the gradient at this point
make a scaled step in the opposite direction to the gradient (objective: minimise)
repeat steps 2 and 3 until a stopping criterion is met (e.g., the maximum number of iterations is reached or the step size falls below a tolerance), as in the sketch below.
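
To make the procedure concrete, here is a minimal sketch of batch gradient descent fitting a simple linear regression; the data, learning rate, and iteration count are illustrative assumptions, not values from the text.

import numpy as np

# illustrative data roughly following y = 2x + 1 (assumed for this sketch)
rng = np.random.default_rng(0)
x = np.linspace(0, 10, 50)
y = 2 * x + 1 + rng.normal(0, 1, size=x.shape)

b0, b1 = 0.0, 0.0   # starting point (initialization)
alpha = 0.01        # learning rate
n = len(x)

for _ in range(5000):
    error = (b0 + b1 * x) - y                  # prediction error on all samples
    grad_b0 = (2.0 / n) * np.sum(error)        # derivative of MSE w.r.t. b0
    grad_b1 = (2.0 / n) * np.sum(error * x)    # derivative of MSE w.r.t. b1
    b0 -= alpha * grad_b0                      # step opposite to the gradient
    b1 -= alpha * grad_b1

print(b0, b1)   # should approach roughly 1 and 2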





Batch Gradient Descent for Machine Learning

The cost function involves evaluating the coefficients in the machine learning model by calculating a prediction for the model for each training instance in the data set and comparing the predictions to the actual output values and calculating a sum or average error (such as the Sum of Squared Residuals or SSR in the case of linear regression).

From the cost function a derivative can be calculated for each coefficient so that it can be updated using exactly the update equation described above.

The cost is calculated for a machine learning algorithm over the entire training data set for each iteration of the gradient descent algorithm. One iteration of the algorithm is called one batch and this form of gradient descent is referred to as batch gradient descent.

Batch gradient descent is the most common form of gradient descent described in machine learning.

Stochastic Gradient Descent for Machine Learning

Gradient descent can be slow to run on very large datasets.

Because one iteration of the gradient descent algorithm requires a prediction for each instance in the training dataset, it can take a long time when you have many millions of instances.

In situations when you have large amounts of data, you can use a variation of gradient descent called stochastic gradient descent.

In this variation, the gradient descent procedure described above is run but the update to the coefficients is performed for each training instance, rather than at the end of the batch of instances.

The first step of the procedure requires that the order of the training dataset is randomized. This is to mix up the order in which updates are made to the coefficients. Because the coefficients are updated after every training instance, the updates will be noisy, jumping all over the place, and so will the corresponding cost function. By mixing up the order of the updates to the coefficients, this random walk is harnessed and kept from getting distracted or stuck.

The update procedure for the coefficients is the same as that above, except the cost is not summed over all training patterns, but instead calculated for one training pattern.

The learning can be much faster with stochastic gradient descent for very large training datasets and often you only need a small number of passes through the dataset to reach a good or good enough set of coefficients.
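
As a contrast with the batch sketch above, here is a minimal sketch of stochastic gradient descent on the same kind of data; the data and learning rate are again illustrative assumptions.

import numpy as np

# illustrative data roughly following y = 2x + 1 (assumed for this sketch)
rng = np.random.default_rng(1)
x = np.linspace(0, 10, 200)
y = 2 * x + 1 + rng.normal(0, 1, size=x.shape)

b0, b1 = 0.0, 0.0
alpha = 0.001   # learning rate

for epoch in range(50):
    for i in rng.permutation(len(x)):          # randomize the order each pass
        error = (b0 + b1 * x[i]) - y[i]
        # update the coefficients immediately from this single instance
        b0 -= alpha * error
        b1 -= alpha * error * x[i]

print(b0, b1)   # should approach roughly 1 and 2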

Note:
Optimization is a big part of machine learning.
Gradient descent is a simple optimization procedure that you can use with many machine learning algorithms.
Batch gradient descent refers to calculating the derivative from all training data before calculating an update.
Stochastic gradient descent refers to calculating the derivative from each training data instance and calculating the update immediately.

4. Regularization

There are extensions of the training of the linear model called regularization methods. These seek both to minimize the sum of the squared error of the model on the training data (using ordinary least squares) and to reduce the complexity of the model (such as the number or absolute size of the coefficients in the model).

Two popular examples of regularization procedures for linear regression are:
Lasso Regression: where Ordinary Least Squares is modified to also minimize the absolute sum of the coefficients (called L1 regularization).
Ridge Regression: where Ordinary Least Squares is modified to also minimize the sum of the squared coefficients (called L2 regularization).

These methods are effective to use when there is collinearity in your input values and ordinary least squares would overfit the training data.

Lasso Regression

Linear regression refers to a model that assumes a linear relationship between input variables and the target variable.

With a single input variable, this relationship is a line, and with higher dimensions, this relationship can be thought of as a hyperplane that connects the input variables to the target variable. The coefficients of the model are found via an optimization process that seeks to minimize the sum squared error between the predictions ($\hat{y}$) and the expected target values ($y$).
$loss = \sum_{i=1}^{n} (y_i - \hat{y}_i)^2$

A problem with linear regression is that the estimated coefficients of the model can become large, making the model sensitive to inputs and possibly unstable. This is particularly true for problems with few observations (samples) or with fewer samples ($n$) than input predictors ($p$) (so-called $p \gg n$ problems).

One approach to address the stability of regression models is to change the loss function to include additional costs for a model that has large coefficients. Linear regression models that use these modified loss functions during training are referred to collectively as penalized linear regression.

A popular penalty is to penalize a model based on the sum of the absolute coefficient values. This is called the L1 penalty. An L1 penalty minimizes the size of all coefficients and allows some coefficients to be minimized to the value zero, which removes the predictor from the model.
$L1_{penalty} = \sum_{j=1}^{p} |\beta_j|$

An L1 penalty minimizes the size of all coefficients and allows any coefficient to go to the value of zero, effectively removing input features from the model.

This acts as a type of automatic feature selection.

This penalty can be added to the cost function for linear regression and is referred to as Least Absolute Shrinkage And Selection Operator regularization (LASSO).

A hyperparameter called “lambda” is used that controls the weighting of the penalty in the loss function. A default value of 1.0 will give full weighting to the penalty; a value of 0 excludes the penalty. Very small values of $\lambda$, such as 1e-3 or smaller, are common.

$lasso_{loss} = loss + (\lambda * L1_{penalty})$

Ridge Regression 

Ridge Regression is a popular type of regularized linear regression that includes an L2 penalty. This has the effect of shrinking the coefficients for those input variables that do not contribute much to the prediction task.

One popular penalty is to penalize a model based on the sum of the squared coefficient values (beta). This is called an L2 penalty.
$L2_{penalty} = \sum_{j=1}^{p} \beta_j^2$

An L2 penalty minimizes the size of all coefficients, but it does not drive any coefficient to exactly zero, so no input feature is removed from the model.

This penalty can be added to the cost function for linear regression and is referred to as Tikhonov regularization (after the author), or Ridge Regression more generally.

The effect of this penalty is that the parameter estimates are only allowed to become large if there is a proportional reduction in SSE. In effect, this method shrinks the estimates towards 0 as the lambda penalty becomes large (these techniques are sometimes called “shrinkage methods”).

A hyperparameter called “lambda” is used that controls the weighting of the penalty in the loss function. A default value of 1.0 will fully weight the penalty; a value of 0 excludes the penalty. Very small values of $\lambda$, such as 1e-3 or smaller, are common.
$ridge_{loss} = loss + (\lambda * L2_{penalty})$

Python code for Lasso and Ridge Regression:
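
Below is a minimal sketch using scikit-learn's Lasso and Ridge estimators; note that scikit-learn exposes the penalty weighting discussed above (lambda) as the alpha parameter, and the data is borrowed from the earlier house-price example purely for illustration.

import numpy as np
from sklearn.linear_model import Lasso, Ridge

# illustrative data (areas and prices from the earlier example)
X = np.array([[2600], [3000], [3200], [3600], [4000]])
y = np.array([550000, 565000, 610000, 680000, 725000])

lasso = Lasso(alpha=1.0)   # L1 penalty; alpha plays the role of lambda
lasso.fit(X, y)
print("lasso:", lasso.coef_, lasso.intercept_)

ridge = Ridge(alpha=1.0)   # L2 penalty
ridge.fit(X, y)
print("ridge:", ridge.coef_, ridge.intercept_)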

Making Predictions with Linear Regression

Given the representation is a linear equation, making predictions is as simple as solving the equation for a specific set of inputs.

Let’s make this concrete with an example. Imagine we are predicting weight (y) from height (x). Our linear regression model representation for this problem would be:

$weight =\beta_0 +\beta_1 * height$

Where $\beta_0$ is the bias coefficient and $\beta_1$ is the coefficient for the height column. We use a learning technique to find a good set of coefficient values. Once found, we can plug in different height values to predict the weight.

For example, let's use $\beta_0 = 0.1$ and $\beta_1 = 0.5$. Let's plug them in and calculate the weight (in kilograms) for a person with a height of 182 centimeters.

$weight = 0.1 + 0.5 * 182$
$weight = 91.1$
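The same calculation as a quick check in code:

# predicting weight from height with the example coefficients
b0, b1 = 0.1, 0.5
height = 182
weight = b0 + b1 * height
print(weight)   # 91.1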
You can see that the above equation could be plotted as a line in two dimensions. The $\beta_0$ (the intercept) is our starting point regardless of what height we have.

Preparing Data For Linear Regression

Linear regression has been studied at great length, and there is a lot of literature on how your data must be structured to make the best use of the model.

As such, there is a lot of sophistication when talking about these requirements and expectations, which can be intimidating. In practice, you can use these rules more as rules of thumb when using Ordinary Least Squares Regression, the most common implementation of linear regression.

Try different preparations of your data using these heuristics and see what works best for your problem.

Linear Assumption. Linear regression assumes that the relationship between your input and output is linear. It does not support anything else. This may be obvious, but it is good to remember when you have a lot of attributes. You may need to transform data to make the relationship linear (e.g. log transform for an exponential relationship).
Remove Noise. Linear regression assumes that your input and output variables are not noisy. Consider using data cleaning operations that let you better expose and clarify the signal in your data. This is most important for the output variable and you want to remove outliers in the output variable (y) if possible.
Remove Collinearity. Linear regression will over-fit your data when you have highly correlated input variables. Consider calculating pairwise correlations for your input data and removing the most correlated.
Gaussian Distributions. Linear regression will make more reliable predictions if your input and output variables have a Gaussian distribution. You may get some benefit using transforms (e.g. log or Box-Cox) on your variables to make their distribution more Gaussian looking.
Rescale Inputs: Linear regression will often make more reliable predictions if you rescale input variables using standardization or normalization, as in the sketch after this list.
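
As a small illustration of the rescaling heuristic, here is a minimal sketch using scikit-learn's StandardScaler; the area values are borrowed from the earlier house-price example purely for illustration.

import numpy as np
from sklearn.preprocessing import StandardScaler

# illustrative input values (areas from the earlier example)
X = np.array([[2600.0], [3000.0], [3200.0], [3600.0], [4000.0]])
scaler = StandardScaler()
X_scaled = scaler.fit_transform(X)   # zero mean, unit variance per column
print(X_scaled.ravel())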

See the Wikipedia article on Linear Regression for an excellent list of the assumptions made by the model. There’s also a great list of assumptions on the Ordinary Least Squares Wikipedia article.


Overfitting and Underfitting

The cause of poor performance in machine learning is either overfitting or underfitting the data.

Supervised machine learning is best understood as approximating a target function (f) that maps input variables $(X)$ to an output variable $(Y).$

$Y = f(X)$

This characterization describes the range of classification and prediction problems and the machine algorithms that can be used to address them.

An important consideration in learning the target function from the training data is how well the model generalizes to new data. Generalization is important because the data we collect is only a sample, it is incomplete and noisy.

Training a deep neural network that can generalize well to new data is a challenging problem.

A model with too little capacity cannot learn the problem, whereas a model with too much capacity can learn it too well and overfit the training dataset. Both cases result in a model that does not generalize well.

A modern approach to reducing generalization error is to use a larger model trained with regularization that keeps the weights of the model small. These techniques not only reduce overfitting, but they can also lead to faster optimization of the model and better overall performance.

Generalization in Machine Learning

In machine learning we describe the learning of the target function from training data as inductive learning.

Induction refers to learning general concepts from specific examples which is exactly the problem that supervised machine learning problems aim to solve. This is different from deduction that is the other way around and seeks to learn specific concepts from general rules.

Generalization refers to how well the concepts learned by a machine learning model apply to specific examples not seen by the model when it was learning.

The central challenge in machine learning is that we must perform well on new, previously unseen inputs — not just those on which our model was trained. The ability to perform well on previously unobserved inputs is called generalization.

The goal of a good machine learning model is to generalize well from the training data to any data from the problem domain. This allows us to make predictions in the future on data the model has never seen.

There is a terminology used in machine learning when we talk about how well a machine learning model learns and generalizes to new data, namely overfitting and underfitting.

Overfitting and underfitting are the two biggest causes for poor performance of machine learning algorithms.

We require that the model learn from known examples and generalize from those known examples to new examples in the future. We use methods like a train/test split or k-fold cross-validation only to estimate the ability of the model to generalize to new data.

Learning and also generalizing to new cases is hard.

Too little learning and the model will perform poorly on the training dataset and on new data. The model will underfit the problem. Too much learning and the model will perform well on the training dataset and poorly on new data, the model will overfit the problem. In both cases, the model has not generalized.
Underfit Model. A model that fails to sufficiently learn the problem and performs poorly on the training dataset and on a holdout sample.
Overfit Model. A model that learns the training dataset too well, performing well on the training dataset but poorly on a holdout sample.
Good Fit Model. A model that suitably learns the training dataset and generalizes well to the holdout dataset.

A model fit can be considered in the context of the bias-variance trade-off.

An underfit model has high bias and low variance. Regardless of the specific samples in the training data, it cannot learn the problem. An overfit model has low bias and high variance. The model learns the training data too well and performance varies widely with new unseen examples or even statistical noise added to examples in the training dataset.

Statistical Fit

In statistics, a fit refers to how well you approximate a target function.

This is good terminology to use in machine learning, because supervised machine learning algorithms seek to approximate the unknown underlying mapping function for the output variables given the input variables.

Statistics often describe the goodness of fit which refers to measures used to estimate how well the approximation of the function matches the target function.

Some of these methods are useful in machine learning (e.g. calculating the residual errors), but some of these techniques assume we know the form of the target function we are approximating, which is not the case in machine learning.

If we knew the form of the target function, we would use it directly to make predictions, rather than trying to learn an approximation from samples of noisy training data.

Overfitting in Machine Learning

Overfitting refers to a model that models the training data too well.

Overfitting happens when a model learns the detail and noise in the training data to the extent that it negatively impacts the performance of the model on new data. This means that the noise or random fluctuations in the training data is picked up and learned as concepts by the model. The problem is that these concepts do not apply to new data and negatively impact the models ability to generalize.

Overfitting is more likely with nonparametric and nonlinear models that have more flexibility when learning a target function. As such, many nonparametric machine learning algorithms also include parameters or techniques to limit and constrain how much detail the model learns.

For example, decision trees are a nonparametric machine learning algorithm that is very flexible and is subject to overfitting training data. This problem can be addressed by pruning a tree after it has learned in order to remove some of the detail it has picked up.

Underfitting in Machine Learning

Underfitting refers to a model that can neither model the training data nor generalize to new data.

An underfit machine learning model is not a suitable model and will be obvious as it will have poor performance on the training data.

Underfitting is often not discussed as it is easy to detect given a good performance metric. The remedy is to move on and try alternate machine learning algorithms. Nevertheless, it does provide a good contrast to the problem of overfitting.

A Good Fit in Machine Learning

Ideally, you want to select a model at the sweet spot between underfitting and overfitting.

This is the goal, but is very difficult to do in practice.

To understand this goal, we can look at the performance of a machine learning algorithm over time as it learns the training data. We can plot both the skill on the training data and the skill on a test dataset we have held back from the training process.

Over time, as the algorithm learns, the error for the model on the training data goes down and so does the error on the test dataset. If we train for too long, the performance on the training dataset may continue to decrease because the model is overfitting and learning the irrelevant detail and noise in the training dataset. At the same time the error for the test set starts to rise again as the model’s ability to generalize decreases.

The sweet spot is the point just before the error on the test dataset starts to increase where the model has good skill on both the training dataset and the unseen test dataset.

You can perform this experiment with your favorite machine learning algorithms. This is often not a useful technique in practice, because by choosing the stopping point for training using the skill on the test dataset, the test set is no longer “unseen” or a standalone objective measure. Some knowledge (a lot of useful knowledge) about that data has leaked into the training procedure.

There are two additional techniques you can use to help find the sweet spot in practice: resampling methods and a validation dataset.

How To Limit Overfitting

Both overfitting and underfitting can lead to poor model performance. But by far the most common problem in applied machine learning is overfitting.

Overfitting is such a problem because the evaluation of machine learning algorithms on training data is different from the evaluation we actually care the most about, namely how well the algorithm performs on unseen data.

There are two important techniques that you can use when evaluating machine learning algorithms to limit overfitting:
Use a resampling technique to estimate model accuracy.
Hold back a validation dataset.

The most popular resampling technique is k-fold cross validation. It allows you to train and test your model k times on different subsets of the training data and build up an estimate of the performance of a machine learning model on unseen data.

A validation dataset is simply a subset of your training data that you hold back from your machine learning algorithms until the very end of your project. After you have selected and tuned your machine learning algorithms on your training dataset you can evaluate the learned models on the validation dataset to get a final objective idea of how the models might perform on unseen data.

Using cross validation is a gold standard in applied machine learning for estimating model accuracy on unseen data. If you have the data, using a validation dataset is also an excellent practice.
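
A minimal sketch of k-fold cross validation with scikit-learn; the synthetic dataset is an assumption made purely for illustration.

from sklearn.datasets import make_regression
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import cross_val_score

# synthetic regression data (illustrative assumption)
X, y = make_regression(n_samples=100, n_features=3, noise=10, random_state=0)

model = LinearRegression()
scores = cross_val_score(model, X, y, cv=5, scoring='r2')   # 5-fold cross validation
print(scores)
print("mean R^2:", scores.mean())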

What is R-squared?

R-squared, also called the coefficient of determination ($R^2$), is a measure that provides information about the goodness of fit of a regression model. In simple terms, it is a statistical measure of how well the fitted regression line matches the actual data, i.e. how much of the variation in the actual values is captured by the predictions of the regression model.

What is the significance of R squared?
R-squared values range from 0 to 1, usually expressed as a percentage from 0% to 100%.
This value of R squared tells you how well the data fits the line you've drawn.
The higher the model's R-squared value, the better the regression line fits the data.

Note: R-squared values very close to 1 may indicate overfitting of the model and should be examined with care.
If the value is close to 0, then the model is not a good fit.
As a rough rule of thumb, a good model should have an R-squared greater than 0.8.

When to Use R Squared

Linear regression is a powerful tool for predicting future events, and the r-squared statistic measures how accurate your predictions are. But when should you use r-squared?
Both independent and dependent variables must be continuous.
When the independent and dependent variables have a linear relationship (positive or negative) between them.

How to calculate R squared in linear regression?
The total deviation of the actual values from the mean is calculated as the sum of the squared distances of the individual points from the mean. This is called the total sum of squares (SST), also referred to as the total error. Mathematically, SST is expressed as:
$SST=\sum_{i=1}^n (y_i-\bar{y})^2$
where $\bar{y}$ represents the average(mean) and $y_i$ represents the actual value.

The total variation of the actual values from the regression line is expressed as the sum of the squared distances between the actual values and the predicted values from the regression line, known as the residual sum of squares (SSR). SSR is sometimes called the unexplained or residual error. Mathematically, SSR is expressed as:
$SSR=\sum_{i=1}^n (y_i-\hat{y}_i)^2$
$\hat{y_i}$ represents the predicted value and $y_i$ represents the actual value.

The formula to calculate $R^2$ is
$R^2=1-\frac{SSR}{SST}$
where:
$SSR$ is the sum of squared residuals (i.e., the sum of squared errors)
$SST$ is the total sum of squares (i.e., the sum of squared deviations from the mean)
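
As a check, the R-squared of the multiple regression example above can be computed directly from these formulas:

import numpy as np

# actual and predicted values from the multiple regression example above
y = np.array([3.25, 6.5, 3.5, 5.0])
y_pred = np.array([2.9375, 6.1875, 3.8125, 5.3125])

SST = np.sum((y - np.mean(y)) ** 2)   # total sum of squares
SSR = np.sum((y - y_pred) ** 2)       # residual sum of squares
r2 = 1 - SSR / SST
print(r2)   # ~0.9425, matching the earlier output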

Limitations of Using R Squared
R-squared can be misleading if you're not careful. For example, if there is a lot of noise in your data and the model starts fitting that noise, the R-squared value on the training data can be inflated, making it look like the model predicts the data better than it really does.
R-squared values don’t help in case we have categorical variables.

*********************************************************************************
Logistic Regression
Logistic Regression is one of the basic and popular algorithms to solve a classification problem. It is named ‘Logistic Regression’ because its underlying technique is quite the same as Linear Regression. The term “Logistic” is taken from the Logit function that is used in this method of classification.

The logistic function, also called the sigmoid function, was developed by statisticians to describe properties of population growth in ecology, rising quickly and maxing out at the carrying capacity of the environment. It's an S-shaped curve that can take any real-valued number and map it into a value between 0 and 1, but never exactly at those limits.
$1 / (1 + e^{-value})$

We identify the problem as a classification problem when the independent variables are continuous in nature and the dependent variable is categorical, i.e. in classes like a positive class and a negative class. Real-life examples of classification would be categorizing mail as spam or not spam, categorizing a tumor as malignant or benign, and categorizing a transaction as fraudulent or genuine. All these problems' answers are in categorical form, i.e. Yes or No, and that is why they are two-class classification problems.
Sometimes we come across problems with more than 2 classes; these are still classification problems, known as multi-class classification problems.

Suppose we have data of tumor size vs. its malignancy. As it is a classification problem, if we plot it, we can see that all the values lie at 0 or 1. If we fit the best-found regression line and assume a threshold of 0.5, the line can do a pretty reasonable job: we can decide the point on the x-axis where all the values to its left are considered the negative class and all the values to its right the positive class.
But what if there is an outlier in the data? Things would get pretty messy. For example, with a 0.5 threshold:


If we fit the best-found regression line, it still won't be enough to decide any point by which we can differentiate the classes. It will put some positive class examples into the negative class. The green dotted line (decision boundary) divides malignant tumors from benign tumors, but the line should have been at the yellow line, which clearly divides the positive and negative examples. So just a single outlier disturbs the whole linear regression prediction. And that is where logistic regression comes into the picture.

From this example, it can be inferred that linear regression is not suitable for classification problems: linear regression is unbounded. This brings logistic regression into the picture, whose output values strictly range from 0 to 1.

The data is fit by a linear model, whose output is then acted upon by a logistic function to predict the target categorical dependent variable. This justifies the name 'logistic regression'.

As discussed earlier, to deal with outliers, Logistic Regression uses the Sigmoid function.

An explanation of logistic regression can begin with an explanation of the standard logistic function. The logistic function is a sigmoid function, which takes any real value and maps it to a value between zero and one. It is defined as

$\sigma (t)=\frac{e^t}{e^t+1}=\frac{1}{1+e^{-t}}$
Let’s consider $t$ as a linear function in a univariate regression model.
$t=\beta_0+\beta_1 x$
So the Logistic Equation will become
$\sigma (x)=\frac{1}{1+e^{-(\beta_0+\beta_1x)}}$
Now, when the logistic regression model comes across an outlier, it will take care of it.




Types of Logistic Regression

1. Binary Logistic Regression

The categorical response has only two 2 possible outcomes. Example: Spam or Not

2. Multinomial Logistic Regression

Three or more categories without ordering. Example: Predicting which food is preferred more (Veg, Non-Veg, Vegan)

3. Ordinal Logistic Regression

Three or more categories with ordering. Example: Movie rating from 1 to 5

Decision Boundary

To predict which class a data point belongs to, a threshold can be set. Based upon this threshold, the estimated probability is classified into classes.

Say, if the predicted value $\geq 0.5$, then classify the email as spam, else as not spam.

The decision boundary can be linear or non-linear; the polynomial order of the features can be increased to obtain a complex decision boundary.
For standard logistic regression with linear features, the decision boundary is linear.

The fundamental application of logistic regression is to determine a decision boundary for a binary classification problem. Although the baseline is to identify a binary decision boundary, the approach can be very well applied for scenarios with multiple classification classes or multi-class classification.
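
A tiny sketch of applying such a threshold to predicted probabilities; the probability values here are made up for illustration.

import numpy as np

# hypothetical predicted probabilities from a logistic model
probs = np.array([0.12, 0.55, 0.90, 0.48])
threshold = 0.5
labels = (probs >= threshold).astype(int)   # 1 = positive class, 0 = negative class
print(labels)   # [0 1 1 0]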

Cost Function ( Cross Entropy Loss Function)
Linear regression uses least squared error as its loss function, which gives a convex cost surface, so optimization can be completed by finding its global minimum. However, that is no longer an option for logistic regression. Since the hypothesis has changed, least squared error calculated with the sigmoid function applied to the raw model output results in a non-convex cost surface with local minima.

if $h_\beta(x)=sigmoid(\beta_0+\beta_1x)$
$cost(h_\beta(x),y_{actual})=-log(h_{\beta}(x))$ if  $y=1$
$cost(h_\beta(x),y_{actual})= -log(1-h_{\beta}(x))$ if  $y=0$

Intuitively, we want to assign more punishment when predicting 1 while the actual is 0 and when predict 0 while the actual is 1. The loss function of logistic regression is doing this exactly which is called Logistic Loss. See as below. If y = 1, looking at the plot below on left(blue), when prediction = 1, the cost = 0, when prediction = 0, the learning algorithm is punished by a very large cost. Similarly, if y = 0, the plot on right shows(red), predicting 0 has no punishment but predicting 1 has a large value of cost.


Another advantage of this loss function is that although we looked at $y = 1$ and $y = 0$ separately, it can be written as one single formula, which brings convenience for calculation.
The cost function of the model is then the average over all training data samples:
$J(\beta)=-\frac{1}{m} \sum_{i=1}^m \left[ y^{(i)}\log(h_\beta(x^{(i)}))+(1-y^{(i)})\log(1-h_\beta(x^{(i)}))\right]$
where $m$ is the number of samples.

Python Code Plotting the Cost function

import numpy as np
import matplotlib.pyplot as plt
 
'''
Hypothesis Function - Sigmoid function
'''
def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))
 
'''
yHat represents the predicted value / probability value calculated as output of hypothesis / sigmoid function
 
y represents the actual label
'''
def cross_entropy_loss(yHat, y):
    if y == 1:
      return -np.log(yHat)
    else:
      return -np.log(1 - yHat)
#
# Calculate sample values for Z
#
z = np.arange(-10, 10, 0.1)
#
# Calculate the hypothesis value / probability value
#
h_z = sigmoid(z)
#
# Value of cost function when y = 1
# -log(h(x))
#
cost_1 = cross_entropy_loss(h_z, 1)
#
# Value of cost function when y = 0
# -log(1 - h(x))
#
cost_0 = cross_entropy_loss(h_z, 0)
#
# Plot the cross entropy loss
#
fig, ax = plt.subplots(figsize=(8,6))
plt.plot(h_z, cost_1, label='J(w) if y=1')
plt.plot(h_z, cost_0, label='J(w) if y=0')
plt.xlabel('$\phi$(z)')
plt.ylabel('J(w)')
plt.legend(loc='best')
plt.tight_layout()
plt.show()
Python code: x represents tumor size, y represents whether the tumor is malignant (1) or not (0)
import numpy
from sklearn import linear_model

#Reshaped for Logistic function.
X = numpy.array([3.78, 2.44, 2.09, 0.14, 1.72, 1.65, 4.92, 4.37, 4.96, 4.52, 3.69, 5.88]).reshape(-1,1)
y = numpy.array([0, 0, 0, 0, 0, 0, 1, 1, 1, 1, 1, 1])

logr = linear_model.LogisticRegression()
logr.fit(X,y)

#predict if tumor is cancerous where the size is 3.46mm:
predicted = logr.predict(numpy.array([3.46]).reshape(-1,1))
print(predicted) 
 
This will return the probability that each tumor is cancerous 
import numpy
from sklearn import linear_model

X = numpy.array([3.78, 2.44, 2.09, 0.14, 1.72, 1.65, 4.92, 4.37, 4.96, 4.52, 3.69, 5.88]).reshape(-1,1)
y = numpy.array([0, 0, 0, 0, 0, 0, 1, 1, 1, 1, 1, 1])

logr = linear_model.LogisticRegression()
logr.fit(X,y)

def logit2prob(logr, X):
  log_odds = logr.coef_ * X + logr.intercept_
  odds = numpy.exp(log_odds)
  probability = odds / (1 + odds)
  return(probability)

print(logit2prob(logr, X)) 
 

Multinomial Classification problem

Logistic regression is a classification algorithm.

It is intended for datasets that have numerical input variables and a categorical target variable that has two values or classes. Problems of this type are referred to as binary classification problems.

Logistic regression is designed for two-class problems, modeling the target using a binomial probability distribution function. The class labels are mapped to 1 for the positive class or outcome and 0 for the negative class or outcome. The fit model predicts the probability that an example belongs to class 1.

By default, logistic regression cannot be used for classification tasks that have more than two class labels, so-called multi-class classification.

Instead, it requires modification to support multi-class classification problems.

One popular approach for adapting logistic regression to multi-class classification problems is to split the multi-class classification problem into multiple binary classification problems and fit a standard logistic regression model on each subproblem. Techniques of this type include one-vs-rest and one-vs-one wrapper models.

An alternate approach involves changing the logistic regression model to support the prediction of multiple class labels directly. Specifically, to predict the probability that an input example belongs to each known class label.

The probability distribution that defines multi-class probabilities is called a multinomial probability distribution. A logistic regression model that is adapted to learn and predict a multinomial probability distribution is referred to as Multinomial Logistic Regression. Similarly, we might refer to default or standard logistic regression as Binomial Logistic Regression.

Binomial Logistic Regression: Standard logistic regression that predicts a binomial probability (i.e. for two classes) for each input example.

Multinomial Logistic Regression: Modified version of logistic regression that predicts a multinomial probability (i.e. more than two classes) for each input example.

Using the iris dataset:
from sklearn import datasets
from sklearn.linear_model import LogisticRegression

iris = datasets.load_iris()

X = iris['data']
y = iris['target']

logit = LogisticRegression(max_iter = 10000)

print(logit.fit(X,y))

print(logit.score(X,y)) 




Advantages of Logistic Regression

The advantages of the logistic regression are as follows:

1. Logistic Regression is very easy to understand.
2. It requires less training time.
3. It performs well for simple datasets as well as when the data set is linearly separable.
4. It doesn't make any assumptions about the distributions of classes in feature space.
5. A Logistic Regression model is less likely to be over-fitted, but it can overfit in high-dimensional datasets. To avoid over-fitting in these scenarios, one may consider regularization.

Disadvantages of Logistic Regression

The disadvantages of the logistic regression are as follows:

1. Sometimes a lot of feature engineering is required.
2. If the independent features are correlated with each other, it may affect the performance of the classifier.
3. It is quite sensitive to noise and overfitting.
4. Logistic Regression should not be used if the number of observations is fewer than the number of features; otherwise, it may lead to overfitting.
5. Non-linear problems can't be solved with Logistic Regression because it has a linear decision surface, and in real-world scenarios linearly separable data is rarely found.


Classification Performance

Binary classification has four possible types of results:

True positives: correctly predicted positives (ones)
True negatives: correctly predicted negatives (zeros)
False positives: incorrectly predicted positives (ones)
False negatives: incorrectly predicted negatives (zeros)

You usually evaluate the performance of your classifier by comparing the actual and predicted outputs and counting the correct and incorrect predictions.

The most straightforward indicator of classification accuracy is the ratio of the number of correct predictions to the total number of predictions (or observations) (score). Other indicators of binary classifiers include the following:
The positive predictive value is the ratio of the number of true positives to the sum of the numbers of true and false positives (precision).
The negative predictive value is the ratio of the number of true negatives to the sum of the numbers of true and false negatives.
The sensitivity (also known as recall or true positive rate) is the ratio of the number of true positives to the number of actual positives.
The specificity (or true negative rate) is the ratio of the number of true negatives to the number of actual negatives.
Accuracy is the ratio of correct results to all the results of a test.
F1-score is one of the most important evaluation metrics in machine learning. It elegantly sums up the predictive performance of a model by combining two otherwise competing metrics: precision and recall.
Confusion Matrix: a confusion matrix is a matrix used to define, visualize, and summarize the performance of a classification algorithm.

Example:

Definitions:
Patient: positive for disease (sick)
Healthy: negative for disease
True positive (TP)= the number of cases correctly identified as patient
False positive (FP) = the number of cases incorrectly identified as patient
True negative (TN) = the number of cases correctly identified as healthy
False negative (FN) = the number of cases incorrectly identified as healthy


Confusion Matrix
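With these definitions the confusion matrix takes the form (rows: test result, columns: actual condition):

$\begin{vmatrix}
 & Patient & Healthy \\
Test\ positive & TP & FP \\
Test\ negative & FN & TN \\
\end{vmatrix}$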
Positive predictive value (Precision):
Positive predictive value is the proportion of cases giving positive test results who are already patients . It is the ratio of patients truly diagnosed as positive to all those who had positive test results (including healthy subjects who were incorrectly diagnosed as patient). This characteristic can predict how likely it is for someone to truly be patient, in case of a positive test result.

Positive predictive value=$\frac{TP}{TP+FP}$

Precision attempts to answer the following question:
What proportion of positive identifications was actually correct?

Negative predictive value:
Negative predictive value is the proportion of the cases giving negative test results who are already healthy. It is the ratio of subjects truly diagnosed as negative to all those who had negative test results (including patients who were incorrectly diagnosed as healthy). This characteristic can predict how likely it is for someone to truly be healthy, in case of a negative test result.

Negative predictive value=$\frac{TN}{TN+FN}$

Sensitivity (Recall):
The proportion of people with the disease who tested positive compared to the number of all the people with the disease, regardless of their test result.

To calculate sensitivity, we'll need the number of true positive cases (TP) and the number of false negative cases (FN).

$Sensitivity = \frac{TP} {(TP + FN)}$

Recall attempts to answer the following question:
What proportion of actual positives was identified correctly?

Specificity:
The proportion of healthy people that tested negative compared to the total number of people without the disease, no matter their test result.

To calculate specificity, we'll need the number of true negative cases (TN) and the number of false positive cases (FP).

$Specificity = \frac{TN }{ (FP + TN)}$

Accuracy

Accuracy measures how many cases, out of all the classes (positive and negative), we have predicted correctly; it should be as high as possible.
$Accuracy = \frac{(TP + TN)} {(TP + TN + FP + FN)}$

F1-Score

Precision measures the extent of error caused by False Positives (FPs) whereas recall measures the extent of error caused by False Negatives (FNs). Therefore, to decide which metric to use, we should assess the relative impact of these two types of errors on our use-case. Thus, the key question we should be asking is:

“Which type of error — FPs or FNs — is more undesirable for our use-case?”

It is also possible that for your use-case, you assess that the errors caused by FPs and FNs are (almost) equally undesirable. Hence, you may wish for a model to have as few FPs and FNs as possible. Put differently, you would want to maximize both precision and recall. In practice, it is not possible to maximize both precision and recall at the same time because of the trade-off between precision and recall.

Increasing precision will decrease recall, and vice versa.

So, given pairs of precision and recall values for different models, how would you compare them and decide which is best? The answer is the F1-score.

By definition, F1-score is the harmonic mean of precision and recall. It combines precision and recall into a single number using the following formula:

$F1-score= \frac{2}{ \frac{1}{Precision} + \frac{1}{Recall}}$

This formula can also be equivalently written as,
$F1-score=2 \times \frac{Precision \times Recall}{ Precision + Recall }$

Notice that F1-score takes both precision and recall into account, which also means it accounts for both FPs and FNs. The higher the precision and recall, the higher the F1-score. F1-score ranges between 0 and 1. The closer it is to 1, the better the model.
Suppose we have trained three different models for cancer prediction, and each model has different precision and recall values.

(Figure: precision and recall values of Models A, B and C. Courtesy: Towards Data Science.)
If we assess that errors caused by FPs  are more undesirable, then we will select a model based on precision and choose Model C.
If we assess that errors caused by FNs  are more undesirable, then we will select a model based on recall and choose Model B.
However, if we assess that both types of errors are undesirable, then we will select a model based on F1-score and choose Model A.

So, the takeaway here is that the model you select depends greatly on the evaluation metric you choose, which in turn depends on the relative impacts of errors of FPs and FNs in your use-case.

The most suitable indicator depends on the problem of interest.
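To make the formulas above concrete, here is a minimal illustrative helper (a sketch, not from any library) that computes the metrics from the four counts of the confusion matrix:

def classification_metrics(tp, fp, fn, tn):
    # precision: of all predicted positives, how many are truly positive
    precision = tp / (tp + fp)
    # recall (sensitivity): of all actual positives, how many were found
    recall = tp / (tp + fn)
    # specificity: of all actual negatives, how many were found
    specificity = tn / (tn + fp)
    accuracy = (tp + tn) / (tp + tn + fp + fn)
    # F1 is the harmonic mean of precision and recall
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, specificity, accuracy, f1

# counts from the man/woman example worked out below
print(classification_metrics(3, 1, 3, 3))   # (0.75, 0.5, 0.75, 0.6, 0.6)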

Support
Support is the number of actual occurrences of the class in the specified dataset. Imbalanced support in the training data may indicate structural weaknesses in the reported scores of the classifier and could indicate the need for stratified sampling or rebalancing. Support doesn’t change between models but instead diagnoses the evaluation process.

Example (university question):
Suppose 10000 patients get tested for flu; out of them, 9000 are actually healthy and 1000 are actually sick. For the sick people, a test was positive for 620 and negative for 380. For healthy people, the same test was positive for 180 and negative for 8820. Construct a confusion matrix for the data and compute the accuracy, precision and recall for the data
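Solution sketch: taking "sick" as the positive class, TP = 620, FN = 380, FP = 180, TN = 8820, giving the confusion matrix (rows: test result, columns: actual condition)

$\begin{vmatrix}
 & Sick & Healthy \\
Positive & 620\ (TP) & 180\ (FP) \\
Negative & 380\ (FN) & 8820\ (TN) \\
\end{vmatrix}$

$Accuracy=\frac{620+8820}{10000}=0.944$
$Precision=\frac{620}{620+180}=0.775$
$Recall=\frac{620}{620+380}=0.62$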
Example: 
Consider a two-class classification problem of predicting whether a photograph contains a man or a woman. Suppose we have a test dataset of 10 records with expected outcomes and a set of predictions from our classification algorithm. Compute the confusion matrix, accuracy, precision, recall, sensitivity and specificity on the following data.
$\begin{vmatrix}
No & Actual & Predicted \\
1 & man & woman \\
2& man & man\\
3&woman & woman \\
4& man & man\\
5&man&woman\\
6&woman & woman \\
7&woman&man \\
8&man&man\\
9&man&woman\\
10&woman & woman\\
\end{vmatrix}$

 

 

Confusion Matrix (rows: predicted values, columns: actual values)

$\begin{vmatrix}
 & Man & Woman & Total \\
Man & 3\ (TP) & 1\ (FP) & 4 \\
Woman & 3\ (FN) & 3\ (TN) & 6 \\
Total & 6 & 4 & 10 \\
\end{vmatrix}$


$Accuracy=\frac{TP+TN}{TP+TN+FP+FN}=\frac{6}{10}=0.6=60 \%$
$Precision(P)=\frac{TP}{TP+FP}=\frac{3}{4}=0.75$
$Precision(N)=\frac{TN}{TN+FN}=\frac{3}{6}=0.5$
$Recall(P)=Sensitivity=\frac{TP}{TP+FN}=\frac{3}{6}=0.5$
$Recall(N)=Specificity= \frac{TN }{ (FP + TN)}=\frac{3}{4}=0.75$
$F1-Score=2 \times \frac{Precision \times Recall}{Precision+Recall}=2 \times \frac{0.75 \times 0.5}{0.75+0.5}=0.6$
Python Code( Confusion Matrix)
from sklearn.metrics import confusion_matrix
from sklearn.metrics import classification_report
# actual values
actual = [1,1,0,1,1,0,0,1,1,0]
# predicted values
predicted = [0,1,0,1,0,0,1,1,0,0]
# confusion matrix; labels=[1,0] lists the positive class first
matrix = confusion_matrix(actual,predicted, labels=[1,0])
# transpose so that rows correspond to predicted values, matching the table above
print('Confusion matrix : \n',matrix.T)

# outcome values order in sklearn
tp, fn, fp, tn = confusion_matrix(actual,predicted,labels=[1,0]).reshape(-1)
print('TP=%-5d FP=%-5d\nFN=%-5d TN=%-5d:\n'% (tp, fp, fn, tn))

# classification report for precision, recall f1-score and accuracy
matrix = classification_report(actual,predicted,labels=[1,0])
print('Classification report : \n',matrix)
output
Confusion matrix : 
 [[3 1]
 [3 3]]
TP=3     FP=1    
FN=3     TN=3    

Classification report : 
               precision    recall  f1-score   support

           1       0.75      0.50      0.60         6
           0       0.50      0.75      0.60         4

    accuracy                           0.60        10
   macro avg       0.62      0.62      0.60        10
weighted avg       0.65      0.60      0.60        10
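In the report, support is the number of actual occurrences of each class (6 ones and 4 zeros here), macro avg is the unweighted mean of the per-class scores, and weighted avg weights each class's score by its support; for example, the weighted precision is 0.75 × 0.6 + 0.50 × 0.4 = 0.65.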

Example with Python code

Consider a CSV data file with columns age and insurance (1 = yes, 0 = no).
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import classification_report, confusion_matrix

df=pd.read_csv('insurance.csv')
print(df.head())

X_train,X_test,y_train,y_test=train_test_split(df[['age']],df.insurance,test_size=0.1)
print(X_test)
model=LogisticRegression()
model.fit(X_train,y_train)
print("classes",model.classes_)
print("beta 1",model.coef_)
print("beta 0",model.intercept_)
print("predictions\n",model.predict(X_test))
print("p(x)....1-p(x)\n",model.predict_proba(X_test))
print("train score\n",model.score(X_train,y_train))
print("test score\n",model.score(X_test,y_test))
print("Confusion Matrix\n",confusion_matrix(y_test, model.predict(X_test)))
print("classification_report",classification_report(y_test,model.predict(X_test)))

outputs
    age  insurance
0   22          0
1   25          0
2   47          1
3   52          0
4   46          1
   age
3   52
4   46
8   62
classes [0 1]
beta 1 [[0.15807358]]
beta 0 [-5.89830245]
predictions
 [1 1 1]
p(x)....1-p(x)
 [[0.08935598 0.91064402]
 [0.2021223  0.7978777 ]
 [0.01979641 0.98020359]]
train score
 0.9166666666666666
test score
 0.6666666666666666
Confusion Matrix
 [[0 1]
 [0 2]]
classification_report               
                precision    recall  f1-score   support

           0       0.00      0.00      0.00         1
           1       0.67      1.00      0.80         2

    accuracy                           0.67         3
   macro avg       0.33      0.50      0.40         3
weighted avg       0.44      0.67      0.53         3
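Note that with these fitted coefficients the model predicts insurance = 1 exactly when $\beta_0+\beta_1 x>0$, i.e. when $p(x)>0.5$; here that threshold age is $-\beta_0/\beta_1=5.898/0.158 \approx 37.3$ years. (Because train_test_split is called without a random_state, the split and therefore the fitted values change from run to run.)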


Example with Python code (mean-centered feature)

import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import classification_report, confusion_matrix
import seaborn as sb

df=pd.read_csv('insurance.csv')
print(df.head())

X_train,X_test,y_train,y_test=train_test_split(df[['age']],df.insurance,test_size=0.1)
print(X_test)
model=LogisticRegression()
# center the ages around the training mean
mxtrain=np.mean(X_train)
msxtrain=X_train-mxtrain
# the test data must be shifted by the same training mean
msxtest=X_test-mxtrain
model.fit(msxtrain,y_train)
print("classes",model.classes_)
print("beta 1",model.coef_)
print("beta 0",model.intercept_)
beta0=model.intercept_
beta1=model.coef_
print("predictions\n",model.predict(msxtest))
print("p(x)....1-p(x)\n",model.predict_proba(msxtest))
print("train score\n",model.score(msxtrain,y_train))
print("test score\n",model.score(msxtest,y_test))
print("Confusion Matrix\n",confusion_matrix(y_test, model.predict(msxtest)))
print("classification_report\n",classification_report(y_test,model.predict(msxtest)))
# plot the linear logit y and the sigmoid curve z over the sorted centered ages
xs=np.sort(msxtrain.values,axis=0)
y=beta0+beta1*xs
z=1/(1.0+np.exp(-y))
plt.scatter(msxtrain,y_train)
plt.plot(xs,y)
plt.plot(xs,z)
# visualize the confusion matrix; with imshow, cm[i,j] is drawn at x=j, y=i
cm=confusion_matrix(y_test, model.predict(msxtest))
fig,ax=plt.subplots()
ax.xaxis.set(ticks=(0, 1), ticklabels=('Predicted 0s', 'Predicted 1s'))
ax.yaxis.set(ticks=(0, 1), ticklabels=('Actual 0s', 'Actual 1s'))
ax.text(0, 0, cm[0,0], ha='center', va='center', color='red')
ax.text(1, 0, cm[0,1], ha='center', va='center', color='red')
ax.text(0, 1, cm[1,0], ha='center', va='center', color='red')
ax.text(1, 1, cm[1,1], ha='center', va='center', color='red')
plt.imshow(cm)
plt.show()

# you can also use the Seaborn package for plotting; try
# sb.heatmap(cm,annot=True)
output

Because train_test_split is called without a fixed random_state, the rows selected for testing, and hence the fitted coefficients, probabilities and scores, change from run to run; the printout follows the same format as in the previous example. If the centering step is skipped for the test set (i.e. the raw ages are passed to predict), the predicted probabilities and scores become meaningless, since the model was trained on centered ages.


User Database – This dataset contains information about users from a company’s database. It contains information about UserID, Gender, Age, EstimatedSalary, and Purchased. We are using this dataset for predicting whether a user will purchase the company’s newly launched product or not.

Let us build the Logistic Regression model, predicting whether a user will purchase the product or not.

import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
dataset = pd.read_csv("User_Data.csv")
# input: Age and EstimatedSalary (columns 2 and 3)
x = dataset.iloc[:, [2, 3]].values
# output: Purchased (column 4)
y = dataset.iloc[:, 4].values
from sklearn.model_selection import train_test_split
  
xtrain, xtest, ytrain, ytest = train_test_split(x, y, test_size=0.25, random_state=0)
from sklearn.preprocessing import StandardScaler

# scale the features using statistics learned from the training set only
sc_x = StandardScaler()
xtrain = sc_x.fit_transform(xtrain)
xtest = sc_x.transform(xtest)

from sklearn.linear_model import LogisticRegression
  
classifier = LogisticRegression(random_state = 0)
classifier.fit(xtrain, ytrain)

y_pred = classifier.predict(xtest)

from sklearn.metrics import confusion_matrix
  
cm = confusion_matrix(ytest, y_pred)
print ("Confusion Matrix : \n", cm)

from sklearn.metrics import accuracy_score
  
print ("Accuracy : ", accuracy_score(ytest, y_pred))

Output
Confusion Matrix : 
 [[4 0]
 [0 1]]
Accuracy :  1.0

from matplotlib.colors import ListedColormap

X_set, y_set = xtest, ytest
# build a dense grid covering the (scaled) feature space
X1, X2 = np.meshgrid(np.arange(start=X_set[:, 0].min() - 1,
                               stop=X_set[:, 0].max() + 1, step=0.01),
                     np.arange(start=X_set[:, 1].min() - 1,
                               stop=X_set[:, 1].max() + 1, step=0.01))
# color each grid point by its predicted class to show the decision regions
plt.contourf(X1, X2,
             classifier.predict(np.array([X1.ravel(), X2.ravel()]).T).reshape(X1.shape),
             alpha=0.75, cmap=ListedColormap(('red', 'green')))

plt.xlim(X1.min(), X1.max())
plt.ylim(X2.min(), X2.max())

# overlay the actual test points, colored by their true class
for i, j in enumerate(np.unique(y_set)):
    plt.scatter(X_set[y_set == j, 0], X_set[y_set == j, 1],
                c=ListedColormap(('red', 'green'))(i), label=j)
plt.title('Classifier (Test set)')
plt.xlabel('Age')
plt.ylabel('Estimated Salary')
plt.legend()
plt.show()
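The filled contour plot shades the plane by the class the classifier predicts at each point; since logistic regression has a linear decision surface, the boundary between the red and green regions is a straight line in the scaled age/salary plane.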


