Tech Notes

My notes on Statistics, Big Data, Cloud Computing, Cyber Security

Model Accuracy

— For Regression Setting

Measuring Quality of Fit

For regression, the mean squared error (MSE) is the most commonly used measure. The MSE is small if the predicted responses are very close to the true responses. The MSE computed on the training data is the training MSE; what we really care about is the MSE computed on previously unseen test data, called the test MSE. So the way to go is to evaluate the test MSE and select the learning method for which the test MSE is smallest.
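As a quick reference, here is a minimal sketch of the MSE calculation in Python, assuming NumPy arrays and a fitted model with a predict method (the variable names are placeholders, not from these notes):

```python
import numpy as np

def mse(y_true, y_pred):
    """Mean squared error: the average of the squared prediction errors."""
    y_true = np.asarray(y_true)
    y_pred = np.asarray(y_pred)
    return np.mean((y_true - y_pred) ** 2)

# Training MSE is computed on the data the model was fit to;
# test MSE uses observations the model has never seen.
# (model, X_train, y_train, X_test, y_test are placeholder names.)
# train_mse = mse(y_train, model.predict(X_train))
# test_mse  = mse(y_test,  model.predict(X_test))
```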

If no test observations are available, it is tempting to simply pick the method with the lowest training MSE, but there is no guarantee that the method with the lowest training MSE will also have the lowest test MSE.

When the training MSE is low but the test MSE is high, the model has overfit the training data.

[Figure: three estimates of f (left) and training/test MSE versus model flexibility (right)]

Left: three estimates of f are shown: the linear regression line (orange curve) and two smoothing spline fits (blue and green curves). Right: training MSE (grey curve), test MSE (red curve), and minimum possible test MSE over all methods (dashed line). Squares represent the training and test MSEs for the three fits shown in the left-hand panel. The blue spline comes closest to the minimum test MSE, so the model to choose is the one with the lowest test MSE, not the lowest training MSE.

The Bias-Variance Trade-Off

In order to reduce the expected test MSE, we need to select a statistical learning method that simultaneously achieves low variance and low bias. For a given test point x0, the expected test MSE can be decomposed into Var(f^(x0)) + [Bias(f^(x0))]^2 + Var(ε), so neither term can be ignored.

Variance refers to the amount by which f^ would change if we estimated it using a different training data set. Ideally the estimate of f should not vary too much between training sets. However, if a method has high variance, then small changes in the training data can result in large changes in f^.

The flexible green curve below has high variance because moving or adding a single data point can change it considerably, whereas the relatively inflexible orange curve has low variance.

[Figure: a flexible spline fit (green) and a linear fit (orange) to the same data]

Bias refers to the error that is introduced by approximating a real-life problem, which may be extremely complicated, by a much simpler model. For example, linear regression assumes that there is a linear relationship between Y and X1,X2,…,Xp. It is unlikely that any real-life problem truly has such a simple linear relationship, and so performing linear regression will undoubtedly result in some bias in the estimate of f.

[Figure: squared bias, variance, Var(ε), and test MSE as model flexibility varies]

Sample plot showing the trade-off: squared bias (blue curve), variance (orange curve), Var(ε) (dashed line), and test MSE (red curve). The relationship between bias, variance, and test set MSE is referred to as the bias-variance trade-off. Good test set performance of a statistical learning method requires low variance as well as low squared bias.
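To make the trade-off concrete, here is a small simulation sketch (the true function, noise level, and polynomial degrees are my own illustrative choices, not the book's figure): we repeatedly draw training sets, fit an inflexible and a flexible model, and estimate the squared bias and variance of the prediction at a single test point.

```python
import numpy as np

rng = np.random.default_rng(0)

def f(x):                      # true (in practice unknown) regression function
    return np.sin(2 * x)

sigma = 0.3                    # noise standard deviation, Var(eps) = sigma**2
x0 = 1.0                       # test point at which we study the estimate
n, n_sims = 50, 2000           # training-set size and number of simulated training sets

for degree in (1, 5):          # inflexible (linear) vs. more flexible polynomial fit
    preds = np.empty(n_sims)
    for s in range(n_sims):
        x = rng.uniform(0, np.pi, n)
        y = f(x) + rng.normal(0, sigma, n)
        coef = np.polyfit(x, y, degree)   # least-squares polynomial fit
        preds[s] = np.polyval(coef, x0)   # this training set's prediction f^(x0)
    bias_sq = (preds.mean() - f(x0)) ** 2
    variance = preds.var()
    print(f"degree {degree}: bias^2={bias_sq:.4f}  variance={variance:.4f}  "
          f"expected test MSE ~ {bias_sq + variance + sigma**2:.4f}")
```

The linear fit typically shows higher squared bias and lower variance; the flexible fit shows the reverse, which is exactly the trade-off in the figure.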

— For Classification Setting

The most common approach for quantifying the accuracy of our estimate f^ is the training error rate: the proportion of mistakes made when we apply f^ to the training observations, i.e. the fraction of training observations for which the predicted class label differs from the true label. We are most interested, however, in the error rate that results from applying the classifier to test observations that were not used in training; a good classifier is one for which this test error is smallest.
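In other words, the error rate is just the fraction of observations the classifier labels incorrectly; a minimal sketch in Python (array names are placeholders, not from these notes):

```python
import numpy as np

def error_rate(y_true, y_pred):
    """Misclassification rate: proportion of labels the classifier gets wrong."""
    return np.mean(np.asarray(y_true) != np.asarray(y_pred))
```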

Read this post for more on the Bayes classifier.

Generally, the performance of a classifier is measured by its misclassification error rate. The Bayes classifier assigns each observation to its most likely class given its predictor values; the test error rate it produces, called the Bayes error rate, is the lowest achievable by any classifier, so it serves as a benchmark.

[Figure: simulated two-class data with the Bayes decision boundary]

The example above shows a simulated data set consisting of 100 observations in each of two groups, indicated in blue and in orange. The purple dashed line represents the Bayes decision boundary. The orange background grid indicates the region in which a test observation will be assigned to the orange class, and the blue background grid indicates the region in which it will be assigned to the blue class.
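The notes do not include the simulation details, but the idea can be sketched with a one-dimensional toy problem of my own choosing: when the class densities and priors are known, the Bayes classifier assigns each observation to the class with the larger posterior, and its error rate is the lowest achievable test error rate.

```python
import numpy as np
from scipy.stats import norm

rng = np.random.default_rng(1)

# Assumed setup (illustrative, not from the notes): one predictor and two
# equally likely classes with known Gaussian class-conditional densities.
mu_blue, mu_orange, sigma = -1.0, 1.0, 1.0
prior_blue = prior_orange = 0.5

def bayes_classify(x):
    """Assign each x to the class with the larger posterior (prior * density)."""
    post_blue = prior_blue * norm.pdf(x, mu_blue, sigma)
    post_orange = prior_orange * norm.pdf(x, mu_orange, sigma)
    return np.where(post_blue > post_orange, "blue", "orange")

# Draw test data from the true model and estimate the Bayes error rate,
# the lowest test error any classifier can achieve on this problem.
n = 5000
labels = rng.choice(["blue", "orange"], size=n, p=[prior_blue, prior_orange])
x = np.where(labels == "blue",
             rng.normal(mu_blue, sigma, n),
             rng.normal(mu_orange, sigma, n))
print("estimated Bayes error rate:", np.mean(bayes_classify(x) != labels))
```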

Similar in spirit to the Bayes classifier is the KNN (K-nearest neighbors) classifier. An object is classified by a majority vote of its neighbors: it is assigned to the class most common among its K nearest neighbors in the training data (K is a positive integer, typically small). If K = 1, the object is simply assigned to the class of its single nearest neighbor.

[Figure: KNN classification of a test observation with K = 3]

The above example, using K = 3, illustrates a simple situation with six blue observations and six orange observations. A test observation at which a predicted class label is desired is shown as a black cross. The three closest points to the test observation are identified, and the test observation is predicted to belong to the most commonly occurring class among them, in this case blue.
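A minimal from-scratch sketch of this majority-vote rule (the toy data and function name are illustrative, not taken from the figure):

```python
import numpy as np
from collections import Counter

def knn_predict(X_train, y_train, x_new, k=3):
    """Classify x_new by majority vote among its k nearest training points."""
    dists = np.linalg.norm(X_train - x_new, axis=1)   # Euclidean distances
    nearest = np.argsort(dists)[:k]                   # indices of the k closest points
    return Counter(y_train[nearest]).most_common(1)[0][0]

# Toy data: six blue and six orange points in two dimensions.
X_train = np.array([[1, 1], [1, 2], [2, 1], [2, 2], [1, 3], [3, 1],
                    [5, 5], [5, 6], [6, 5], [6, 6], [5, 7], [7, 5]], dtype=float)
y_train = np.array(["blue"] * 6 + ["orange"] * 6)

print(knn_predict(X_train, y_train, np.array([2.0, 2.5]), k=3))   # -> blue
print(knn_predict(X_train, y_train, np.array([5.5, 5.5]), k=3))   # -> orange
```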

Comparison of Bayes Classifier and KNN Classifier

[Figure: KNN decision boundary (K = 10) versus the Bayes decision boundary]

The black curve indicates the KNN decision boundary on the data from the figure above, using K = 10. The Bayes decision boundary is shown as a purple dashed line. The KNN and Bayes decision boundaries are very similar.

[Figure: KNN training and test error rates versus 1/K]

The KNN training error rate (blue, 200 observations) and test error rate (orange, 5,000 observations) on the data from the figure above, as the level of flexibility (assessed using 1/K) increases, or equivalently as the number of neighbors K decreases. The black dashed line indicates the Bayes error rate.
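A rough sketch of how such error curves could be produced, using scikit-learn's KNeighborsClassifier on stand-in simulated data (the data set, split sizes, and K values are my own assumptions, not the book's simulation):

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

# Stand-in two-class data; the book uses its own simulated data set.
X, y = make_classification(n_samples=5200, n_features=2, n_informative=2,
                           n_redundant=0, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, train_size=200, test_size=5000, random_state=0)

# Training and test error rates for several values of K (flexibility = 1/K).
for k in (1, 5, 10, 50, 100):
    knn = KNeighborsClassifier(n_neighbors=k).fit(X_train, y_train)
    train_err = 1 - knn.score(X_train, y_train)
    test_err = 1 - knn.score(X_test, y_test)
    print(f"K={k:3d}  1/K={1/k:.3f}  train error={train_err:.3f}  test error={test_err:.3f}")
```

As K shrinks (flexibility grows), the training error keeps falling while the test error follows the usual U-shape, which is what the figure illustrates.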

Disclaimer : These are my study notes – online – instead of on paper so that others can benefit. In the process, I have used some pictures / content from other original authors. All sources / original content publishers are listed below and they deserve credit for their work. No copyright violation intended.

References for these notes:

The study material for the MOOC “Statistical Learning” at Stanford Online

The ebook “Introduction to Statistical Learning”

http://en.wikipedia.org/wiki/K-nearest_neighbors_algorithm
