Cross Validation

In this section we discuss how to evaluate a classifier $M$ using some performance measure $\theta$. Typically, the input dataset $D$ is randomly split into a disjoint training set and testing set. The training set is used to learn the model $M$, and the testing set is used to evaluate the measure $\theta$. However, how confident can we be about the classification performance? The results may be an artifact of the random split; for example, by chance the testing set may contain points that are particularly easy (or hard) to classify, leading to good (or poor) classifier performance. As such, a fixed, predefined partitioning of the dataset is not a good strategy for evaluating classifiers. Also note that, in general, $D$ is itself a $d$-dimensional multivariate random sample drawn from the true (unknown) joint probability density function $f(x)$ that represents the population of interest. Ideally, we would like to know the expected value $E[\theta]$ of the performance measure over all possible testing sets drawn from $f$. However, because $f$ is unknown, we have to estimate $E[\theta]$ from $D$.
Cross-validation and resampling are two common approaches to compute the expected value and variance of a given performance measure; we discuss these methods in the following sections.

Cross-Validation in Machine Learning

There is always a need to validate the stability of your machine learning model. You just can't fit the model to your training data and hope it will work accurately on real data it has never seen before. You need some kind of assurance that your model has captured most of the patterns in the data correctly and is not picking up too much of the noise; in other words, that it is low on both bias and variance.

Validation

This process of deciding whether the numerical results quantifying hypothesized relationships between variables are acceptable as descriptions of the data is known as validation. Generally, an error estimate for the model is made after training, better known as evaluation of residuals. In this process, a numerical estimate of the difference between the predicted and original responses is computed, also called the training error. However, this only tells us how well our model does on the data used to train it, and it is possible that the model is underfitting or overfitting that data. The problem with this evaluation technique is therefore that it gives no indication of how well the learner will generalize to an independent/unseen data set. The process of getting this idea about our model is known as cross validation.

Holdout Method

A basic remedy for this involves removing a part of the training data and using it to get predictions from the model trained on the rest of the data. The error estimate then tells us how our model is doing on unseen data, the validation set. This is a simple kind of cross-validation technique, also known as the holdout method. Although this method has little computational overhead and is better than traditional validation, it still suffers from high variance: it is not certain which data points will end up in the validation set, and the result might be entirely different for different splits.
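A minimal sketch of the holdout method, assuming scikit-learn is available; the dataset, the decision-tree classifier, and the 80/20 split are illustrative choices:

```python
# Holdout-method sketch: split D into disjoint training and testing sets,
# train on one part and measure performance (theta = accuracy) on the other.
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import accuracy_score

X, y = load_iris(return_X_y=True)

# Hold out 20% of the data as the validation (testing) set.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42)

model = DecisionTreeClassifier().fit(X_train, y_train)
theta = accuracy_score(y_test, model.predict(X_test))
print(f"holdout accuracy = {theta:.3f}")
```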

K-Fold Cross Validation

As there is never enough data to train your model, removing a part of it for validation poses a problem of underfitting. By reducing the training data, we risk losing important patterns/trends in the data set, which in turn increases the error induced by bias. So what we require is a method that provides ample data for training the model and also leaves ample data for validation. K-fold cross validation does exactly that.

In K-fold cross validation, the data is divided into $K$ subsets. The holdout method is then repeated $K$ times, such that each time one of the $K$ subsets is used as the test/validation set and the other $K-1$ subsets are put together to form the training set. The error estimates are averaged over all $K$ trials to get the overall effectiveness of the model. As can be seen, every data point gets to be in the validation set exactly once, and gets to be in the training set $K-1$ times. This significantly reduces bias, as we are using most of the data for fitting, and also significantly reduces variance, as every point is eventually used for validation. Interchanging the training and test sets also adds to the effectiveness of this method. As a general rule, and based on empirical evidence, $K = 5$ or $10$ is preferred, but nothing is fixed and it can take any value.
Cross-validation divides the dataset $D$ into $K$ equal-sized parts, called folds, namely $D_1, D_2, \ldots, D_K$. Each fold $D_i$ is, in turn, treated as the testing set, with the remaining folds comprising the training set $D \setminus D_i = \cup_{j \ne i}D_j$. After training the model $M_i$ on $D \setminus D_i$, we assess its performance on the testing set $D_i$ to obtain the $i$-th estimate $\theta_i$.

The expected value of the performance measure can then be estimated as
$\hat{\mu}_\theta=E(\theta)= \frac{1}{K}\sum_{i=1}^K \theta_i$
and its variance as
$\hat{\sigma}_\theta^2= \frac{1}{K}\sum_{i=1}^K (\theta_i - \hat{\mu}_\theta)^2$

Algorithm 22.2 shows the pseudo-code for K-fold cross-validation. After randomly shuffling the dataset $D$, we partition it into $K$ equal folds (except possibly the last one). Next, each fold $D_i$ is used as the testing set on which we assess the performance $\theta_i$ of the classifier $M_i$ trained on $D \setminus D_i$. The estimated mean and variance of $\theta$ can then be reported. Note that K-fold cross-validation can be repeated multiple times; the initial random shuffling ensures that the folds are different each time. Usually $K$ is chosen to be 5 or 10. The special case, when $K = n$, is called leave-one-out cross-validation, where the testing set comprises a single point and the remaining data is used for training.
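A minimal sketch of this procedure in Python, assuming accuracy as the performance measure $\theta$; the dataset and the decision-tree classifier are illustrative choices, not part of the algorithm:

```python
# K-fold cross-validation sketch: estimate the mean and variance of the
# performance measure theta (here accuracy) over K folds.
import numpy as np
from sklearn.datasets import load_iris
from sklearn.model_selection import KFold
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import accuracy_score

X, y = load_iris(return_X_y=True)
K = 5

thetas = []
kf = KFold(n_splits=K, shuffle=True, random_state=0)   # shuffle, then K folds
for train_idx, test_idx in kf.split(X):
    M_i = DecisionTreeClassifier().fit(X[train_idx], y[train_idx])   # train on D \ D_i
    thetas.append(accuracy_score(y[test_idx], M_i.predict(X[test_idx])))  # test on D_i

mu_hat = np.mean(thetas)    # estimate of E[theta]
var_hat = np.var(thetas)    # (1/K) * sum (theta_i - mu_hat)^2
print(f"mean = {mu_hat:.3f}, variance = {var_hat:.5f}")
```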


Bootstrap Resampling
Another approach to estimate the expected performance of a classifier is the bootstrap resampling method. Instead of partitioning the input dataset $D$ into disjoint folds, the bootstrap method draws $K$ random samples of size $n$ with replacement from $D$. Each sample $D_i$ is thus the same size as $D$, and typically contains several repeated points. Consider the probability that a point $x_j \in D$ is not selected for the $i$th bootstrap sample $D_i$. Due to sampling with replacement, the probability that a given point is selected in any one draw is $p = \frac{1}{n}$, and thus the probability that it is not selected is

$q=1-p=(1-\frac{1}{n})$

Because $D_i$ has $n$ points, the probability that $x_j$ is not selected even after $n$ tries is given as

$P(x_j \notin D_i)=q^n=\left(1-\frac{1}{n}\right)^n \simeq e^{-1}=0.368$
On the other hand, the probability that $x_j \in D_i$ is given as
$P(x_j \in D_i ) = 1−P(x_j \notin D_i ) = 1−0.368= 0.632$

This means that each bootstrap sample contains approximately 63.2% of the points from $D$. The bootstrap samples can be used to evaluate the classifier by training it on each of the samples $D_i$ and then using the full input dataset $D$ as the testing set, as shown in Algorithm 22.3. The expected value and variance of the performance measure $\theta$ can be obtained using the equations given earlier. However, it should be borne in mind that the estimates will be somewhat optimistic owing to the fairly large overlap (63.2%) between the training and testing datasets. The cross-validation approach does not suffer from this limitation because it keeps the training and testing sets disjoint.
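The sketch below illustrates both points: it empirically checks that a single bootstrap sample covers roughly 63.2% of $D$, and it evaluates a classifier trained on each bootstrap sample $D_i$ against the full dataset $D$. The classifier, dataset, and value of $K$ are illustrative choices:

```python
# Bootstrap-resampling evaluation sketch: K bootstrap samples of size n,
# train on each sample D_i, test on the full dataset D.
import numpy as np
from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import accuracy_score

X, y = load_iris(return_X_y=True)
n = len(X)
K = 50
rng = np.random.default_rng(0)

thetas, coverage = [], []
for _ in range(K):
    idx = rng.integers(0, n, size=n)            # sample n indices with replacement
    coverage.append(len(np.unique(idx)) / n)    # fraction of distinct points in D_i
    M_i = DecisionTreeClassifier().fit(X[idx], y[idx])
    thetas.append(accuracy_score(y, M_i.predict(X)))   # test on the full dataset D

print(f"avg fraction of D covered per sample = {np.mean(coverage):.3f}  (~0.632)")
print(f"mean accuracy = {np.mean(thetas):.3f}, variance = {np.var(thetas):.5f}")
```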

Bootstrap Sampling in Machine Learning
In statistics, bootstrap sampling is a method that involves drawing samples repeatedly, with replacement, from a data source in order to estimate a population parameter.

Hence, when we have to estimate a parameter of a large population, we can take the help of Bootstrap Sampling.

Bootstrap sampling is used in a machine learning ensemble algorithm called bootstrap aggregating (also called bagging). It helps in avoiding overfitting and improves the stability of machine learning algorithms.
In bagging, a certain number of equally sized subsets of a dataset are extracted with replacement. A machine learning algorithm is then applied to each of these subsets, and their outputs are combined (ensembled) to produce the final prediction, as sketched below.
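A minimal usage sketch, assuming scikit-learn's `BaggingClassifier` (whose default base learner is a decision tree); the number of estimators and the dataset are illustrative choices:

```python
# Bagging usage sketch: several trees trained on bootstrap samples, with
# their predictions combined by voting (handled internally by BaggingClassifier).
from sklearn.datasets import load_iris
from sklearn.ensemble import BaggingClassifier
from sklearn.model_selection import cross_val_score

X, y = load_iris(return_X_y=True)
bag = BaggingClassifier(n_estimators=25, random_state=0)   # default base learner: decision tree
print(cross_val_score(bag, X, y, cv=5).mean())
```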

Bias and Variance

Intuitively, the bias of a classifier refers to the systematic deviation of its predicted decision boundary from the true decision boundary, whereas the variance of a classifier refers to the deviation among the learned decision boundaries over different training sets.

In general, bias indicates whether the model $M$ is correct or incorrect. It also reflects our assumptions about the domain in terms of the decision boundary. For example, if the decision boundary is nonlinear and we use a linear classifier, then it is likely to have high bias, that is, it will be consistently incorrect over different training sets. On the other hand, a nonlinear (or more complex) classifier is more likely to capture the correct decision boundary, and is thus likely to have low bias. Nevertheless, this does not necessarily mean that a complex classifier will be a better one, since we also have to consider the variance term, which measures the inconsistency of the classifier decisions. A complex classifier induces a more complex decision boundary and thus may be prone to overfitting, that is, it may try to model all the small nuances in the training data, and thus may be susceptible to small changes in the training set, which may result in high variance.

Bias Variance Trade Off

In general, the expected loss can be attributed to high bias or high variance, with typically a trade-off between these two terms. Ideally, we seek a balance between these opposing trends, that is, we prefer a classifier with an acceptable bias (reflecting domain or dataset specific assumptions) and as low a variance as possible.

If the algorithm is too simple (e.g., a hypothesis given by a linear equation), then it is likely to be in a high-bias, low-variance condition and is thus error-prone. A model with high bias pays very little attention to the training data and oversimplifies the problem; it leads to high error on both the training and test data.

If the algorithm fits too complex a model (e.g., a hypothesis given by a high-degree equation), then it is likely to be in a high-variance, low-bias condition, and new entries will not be predicted well. A model with high variance pays a lot of attention to the training data and does not generalize to data it hasn't seen before. As a result, such models perform very well on training data but have high error rates on test data.

Well, there is something between both of these conditions, known as the trade-off, or the bias-variance trade-off.
Let the variable we are trying to predict be $Y$ and the other covariates be $X$. We assume there is a relationship between the two such that

$Y=f(X) + e$

where $e$ is the error term, normally distributed with a mean of 0.
We will make a model $\hat{f}(X)$ of $f(X)$ using linear regression or any other modeling technique.

So the expected squared error at a point $x$ is

$Err(x)=E\left[ (Y- \hat{f}(x))^2 \right]$

The $Err(x)$ can be further decomposed as

$Err(x)=\left(E[\hat{f}(x)]-f(x)\right)^2 + E\left[ \left(\hat{f}(x)-E[\hat{f}(x)]\right)^2\right]+\sigma_e^2$

$Err(x)=\text{Bias}^2+\text{Variance}+\text{Irreducible Error}$

Irreducible error is the error that can't be reduced by creating good models. It is a measure of the amount of noise in our data. Here it is important to understand that no matter how good we make our model, our data will have a certain amount of noise, or irreducible error, that cannot be removed.

To build a good model, we need to find a good balance between bias and variance such that it minimizes the total error.
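This decomposition can be checked numerically. Below is a small simulation sketch; the true function $f(x)=\sin x$, the noise level, and the polynomial degrees are illustrative assumptions. It repeatedly draws training sets, fits models of different complexity, and estimates the bias² and variance of the prediction at a fixed test point; low-degree fits typically show higher bias and lower variance, and high-degree fits the opposite.

```python
# Bias-variance sketch: estimate bias^2 and variance of polynomial fits of
# different degree at a fixed test point x0, over many random training sets.
import numpy as np

rng = np.random.default_rng(0)
f = np.sin                      # true (normally unknown) function
sigma_e = 0.3                   # noise standard deviation
x0, n, trials = 2.0, 30, 500    # test point, training-set size, number of training sets

for degree in (1, 3, 9):        # simple -> complex hypotheses
    preds = []
    for _ in range(trials):
        X = rng.uniform(0, 2 * np.pi, n)
        Y = f(X) + rng.normal(0, sigma_e, n)
        coeffs = np.polyfit(X, Y, degree)        # fit a degree-d polynomial
        preds.append(np.polyval(coeffs, x0))     # prediction f_hat(x0)
    preds = np.array(preds)
    bias2 = (preds.mean() - f(x0)) ** 2          # (E[f_hat(x0)] - f(x0))^2
    variance = preds.var()                       # E[(f_hat(x0) - E[f_hat(x0)])^2]
    print(f"degree {degree}: bias^2 = {bias2:.4f}, variance = {variance:.4f}")
```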



An optimal balance of bias and variance would never overfit or underfit the model.

Therefore understanding bias and variance is critical for understanding the behavior of prediction models.

Ensemble Classifiers
A classifier is called unstable if small perturbations in the training set result in large changes in the prediction or decision boundary. High-variance classifiers are inherently unstable, since they tend to overfit the data. On the other hand, high-bias methods typically underfit the data, and usually have low variance. In either case, the aim of learning is to reduce classification error by reducing the variance or the bias, ideally both. Ensemble methods create a combined classifier using the output of multiple base classifiers, which are trained on different data subsets. Depending on how the training sets are selected, and on the stability of the base classifiers, ensemble classifiers can help reduce the variance and the bias, leading to better overall performance.

Bagging
Bagging, which stands for Bootstrap Aggregation, is an ensemble classification method that employs multiple bootstrap samples (with replacement) from the input training data $D$ to create slightly different training sets $D_i$, $i = 1,2,\ldots,K$. Different base classifiers $M_i$ are learned, with $M_i$ trained on $D_i$. Given any test point $x$, it is first classified using each of the $K$ base classifiers $M_i$. Let the number of classifiers that predict the class of $x$ as $c_j$ be given as

$v_j(x)=\left|\{M_i \mid M_i(x) = c_j,\ i = 1,\ldots,K \}\right|$

The combined classifier, denoted $M^K$, predicts the class of a test point $x$ by majority voting among the $k$ classes:

$M^K(x)=\arg\max_{c_j}\{ v_j(x) \mid j=1,\ldots,k\}$

For binary classification, assuming that the classes are given as $\{+1,−1\}$, the combined classifier $M^K$ can be expressed more simply as

$M^K(x)=\text{sign} \left( \sum_{i=1}^{K} M_i(x) \right)$

Bagging can help reduce the variance, especially if the base classifiers are unstable, due to the averaging effect of majority voting. It does not, in general, have much effect on the bias.
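To make the voting rule concrete, here is a small sketch of bagging implemented directly from the formulas above; the base classifier, dataset, and value of $K$ are illustrative choices:

```python
# Manual bagging sketch: K bootstrap samples, K base classifiers, majority vote.
import numpy as np
from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)
n, K = len(X), 15
rng = np.random.default_rng(0)

# Train M_1, ..., M_K on bootstrap samples D_1, ..., D_K.
models = []
for _ in range(K):
    idx = rng.integers(0, n, size=n)
    models.append(DecisionTreeClassifier().fit(X[idx], y[idx]))

def predict(x):
    """Combined classifier M^K: majority vote v_j(x) over the K base classifiers."""
    votes = [m.predict(x.reshape(1, -1))[0] for m in models]
    values, counts = np.unique(votes, return_counts=True)
    return values[np.argmax(counts)]

print(predict(X[0]), y[0])
```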

Boosting
Boosting is another ensemble technique that trains the base classifiers on different samples. However, the main idea is to carefully select the samples to boost the performance on hard-to-classify instances. Starting from an initial training sample $D_1$, we train the base classifier $M_1$ and obtain its training error rate. To construct the next sample $D_2$, we select the misclassified instances with higher probability, and after training $M_2$, we obtain its training error rate. To construct $D_3$, those instances that are hard to classify by $M_1$ or $M_2$ have a higher probability of being selected. This process is repeated for $K$ iterations. Thus, unlike bagging, which uses independent random samples from the input dataset, boosting employs weighted or biased samples to construct the different training sets, with the current sample depending on the previous ones. Finally, the combined classifier is obtained via weighted voting over the output of the $K$ base classifiers $M_1,M_2,\ldots,M_K$.

Boosting is most beneficial when the base classifiers are weak, that is, have an error rate that is only slightly less than that of a random classifier. The idea is that whereas $M_1$ may not be particularly good on all test instances, by design $M_2$ may help classify some cases where $M_1$ fails, $M_3$ may help classify instances where $M_1$ and $M_2$ fail, and so on. Thus, boosting has more of a bias-reducing effect. Each of the weak learners is likely to have high bias (it is only slightly better than random guessing), but the final combined classifier can have much lower bias, since different weak learners learn to classify instances in different regions of the input space. Several variants of boosting can be obtained based on how the instance weights are computed for sampling, how the base classifiers are combined, and so on.

Adaptive Boosting: AdaBoost
Let $D$ be the input training set, comprising $n$ points $x_i \in R^d$. The boosting process will be repeated $K$ times. Let $t$ denote the iteration and let $\alpha_t$ denote the weight for the $t$th classifier $M_t$. Let $w_i^t$ denote the weight for $x_i$, with $w^t = (w_1^t,w_2^t,\ldots,w_n^t)^T$ being the weight vector over all the points for the $t$th iteration.


In fact, $w^t$ is a probability vector, whose elements sum to one. Initially all points have equal weights, that is,
$w^0=\left( \frac{1}{n},\frac{1}{n},\ldots,\frac{1}{n}\right)^T = \frac{1}{n}1$
where $1 \in R^n$ is the $n$-dimensional vector of all $1$’s.

The pseudo-code for AdaBoost is shown in the algorithm above. During iteration $t$, the training sample $D_t$ is obtained via weighted resampling using the distribution $w^{t-1}$, that is, we draw a sample of size $n$ with replacement, such that the $i$th point is chosen according to its probability $w_i^{t-1}$.
Next, we train the classifier $M_t$ using $D_t$, and compute its weighted error rate $\epsilon_t$ on the entire input dataset $D$:

$\epsilon_t=\sum_{i=1}^n w_i^{t-1}\cdot I(M_t(x_i) \ne y_i)$

where $I$ is an indicator function that is 1 when its argument is true, that is, when $M_t$ misclassifies $x_i$, and is 0 otherwise. The weight for the $t$th classifier is then set as

$\alpha_t=\ln \left(\frac{1-\epsilon_t}{\epsilon_t} \right)$

and the weight for each point $x_i \in D$ is updated based on whether the point is misclassified or not

$w_i^t=w_i^{t-1}\cdot \exp\{\alpha_t \cdot I(M_t(x_i) \ne y_i)\}$

Thus, if the predicted class matches the true class, that is, if $M_t(x_i)=y_i$, then $I(M_t(x_i) \ne y_i) = 0$, and the weight for point $x_i$ remains unchanged. On the other hand, if the point is misclassified, that is, $M_t(x_i) \ne y_i$, then we have $I(M_t(x_i) \ne y_i) = 1$, and

$w_i^t=w_i^{t-1}\cdot \exp\{\alpha_t\}=w_i^{t-1}\cdot\frac{1-\epsilon_t}{\epsilon_t}=w_i^{t-1}\left(\frac{1}{\epsilon_t}-1 \right)$

We can observe that if the error rate $\epsilon_t$ is small, then there is a greater weight increment for $x_i$. The intuition is that a point that is misclassified by a good classifier (with a low error rate) should be more likely to be selected for the next training dataset. On the other hand, if the error rate of the base classifier is close to 0.5, then there is only a small change in the weight, since a bad classifier (with a high error rate) is expected to misclassify many instances. Note that for a binary class problem, an error rate of 0.5 corresponds to a random classifier, that is, one that makes a random guess. Thus, we require that a base classifier has an error rate at least slightly better than random guessing, that is, $\epsilon_t < 0.5$. If the error rate $\epsilon_t \ge 0.5$, then the boosting method discards the classifier, and returns to line 5 to try another data sample. Alternatively, one can simply invert the predictions for binary classification. It is worth emphasizing that for a multi-class problem (with $k >2$), the requirement that $\epsilon_t <0.5$ is a significantly stronger requirement than for the binary ($k = 2$) class problem, because in the multiclass case a random classifier is expected to have an error rate of $\frac{k-1}{k}$.
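For a concrete feel of the weight update: if $\epsilon_t = 0.1$, then $\alpha_t = \ln(0.9/0.1) \approx 2.2$ and each misclassified point has its weight multiplied by $e^{\alpha_t} = 9$, whereas if $\epsilon_t = 0.45$, the multiplier is only $0.55/0.45 \approx 1.22$.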
Note also that if the error rate of the base classifier $\epsilon_t = 0$, then we can stop the boosting iterations.

Once the point weights have been updated, we re-normalize the weights so that $w^t$ is a probability vector

$w^t=\frac{w^t}{1^Tw^t}=\frac{1}{\sum_{j=1}^n w_j^t}(w_1^t,w_2^t,\ldots,w_n^t)^T$

Combined Classifier 

Given the set of boosted classifiers $M_1,M_2,\ldots,M_K$, along with their weights $\alpha_1,\alpha_2,\ldots,\alpha_K$, the class for a test case $x$ is obtained via weighted majority voting. Let $v_j(x)$ denote the weighted vote for class $c_j$ over the $K$ classifiers, given as

$v_j(x)=\sum_{t=1}^K \alpha_t \cdot I(M_t(x)=c_j)$

Because $I(M_t (x) = c_j )$ is 1 only when $M_t (x) = c_j$ , the variable $v_j (x)$ simply obtains the tally for class $c_j$ among the $K$ base classifiers, taking into account the classifier weights.
The combined classifier, denoted $M^K$, then predicts the class for $x$ as follows:

$M^K(x)=\arg\max_{c_j} \{v_j(x) \mid j=1,\ldots,k\}$

In the case of binary classification, with classes $\{+1,−1\}$, the combined classifier $M^K$ can be expressed more simply as

$M^K(x)=\text{sign} \left( \sum_{t=1}^{K} \alpha_t M_t(x)\right)$
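Putting the pieces together, the following is a compact sketch of the AdaBoost procedure described above, assuming decision stumps as the weak learners; the dataset and the value of $K$ are illustrative choices:

```python
# AdaBoost sketch: weighted resampling, alpha_t = ln((1 - eps_t) / eps_t),
# multiplicative point-weight updates, and a weighted (sign) vote at the end.
import numpy as np
from sklearn.datasets import load_breast_cancer
from sklearn.tree import DecisionTreeClassifier

X, y = load_breast_cancer(return_X_y=True)
y = np.where(y == 1, 1, -1)                  # binary labels in {+1, -1}
n, K = len(X), 20
rng = np.random.default_rng(0)

w = np.full(n, 1.0 / n)                      # w^0: uniform point weights
models, alphas = [], []
for t in range(K):
    idx = rng.choice(n, size=n, replace=True, p=w)        # weighted resampling -> D_t
    M_t = DecisionTreeClassifier(max_depth=1).fit(X[idx], y[idx])   # weak learner (stump)
    miss = (M_t.predict(X) != y)                          # misclassified points in D
    eps_t = np.sum(w[miss])                               # weighted error rate
    if eps_t >= 0.5:                                      # no better than random: discard
        continue
    if eps_t == 0:                                        # perfect on D: keep it and stop
        models.append(M_t); alphas.append(1.0)
        break
    alpha_t = np.log((1 - eps_t) / eps_t)                 # classifier weight
    models.append(M_t); alphas.append(alpha_t)
    w = w * np.exp(alpha_t * miss)                        # boost misclassified weights
    w = w / w.sum()                                       # re-normalize to a probability vector

def predict(X_new):
    """Combined classifier M^K(x) = sign(sum_t alpha_t * M_t(x))."""
    scores = sum(a * m.predict(X_new) for a, m in zip(alphas, models))
    return np.sign(scores)

print("training accuracy:", np.mean(predict(X) == y))
```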
