Tech Notes

My notes on Statistics, Big Data, Cloud Computing, Cyber Security

Prediction Study Design

Steps in building a prediction
1. Find the right data
2. Define your error rate
3. Split data into:

  • Training
  • Testing
  • Validation (Optional)

4. On the training set pick features
5. On the training set pick prediction function
6. On the training set cross-validate
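7. If there is no validation set – apply the model once (1x) to the test set
8. If there is a validation set – apply the model to the test set and refine
9. If there is a validation set – apply the refined model once (1x) to the validation set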
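Step 6 (cross-validating on the training set) can be sketched roughly as below. This is only a minimal illustration in plain Python, not a reference implementation: the names `k_fold_indices`, `cross_validated_error`, `fit_majority` and `misclassification_rate` are made up for these notes, and the "prediction function" is a toy that always predicts the most common label.

```python
import random

def k_fold_indices(n_rows, k, seed=0):
    """Yield (train_idx, holdout_idx) pairs for k-fold cross-validation."""
    indices = list(range(n_rows))
    random.Random(seed).shuffle(indices)
    # Fold i takes every k-th shuffled index, so fold sizes differ by at most 1.
    folds = [indices[i::k] for i in range(k)]
    for i, holdout in enumerate(folds):
        train = [idx for j, fold in enumerate(folds) if j != i for idx in fold]
        yield train, holdout

def misclassification_rate(truth, predictions):
    """Error rate (step 2): fraction of predictions that are wrong."""
    wrong = sum(1 for t, p in zip(truth, predictions) if t != p)
    return wrong / len(truth)

def fit_majority(rows, labels):
    """Toy prediction function: always predict the most common training label."""
    majority = max(set(labels), key=labels.count)
    return lambda row: majority

def cross_validated_error(rows, labels, fit, error_rate, k=5):
    """Average held-out error over k folds of the TRAINING set only."""
    fold_errors = []
    for train_idx, holdout_idx in k_fold_indices(len(rows), k):
        model = fit([rows[i] for i in train_idx], [labels[i] for i in train_idx])
        predictions = [model(rows[i]) for i in holdout_idx]
        truth = [labels[i] for i in holdout_idx]
        fold_errors.append(error_rate(truth, predictions))
    return sum(fold_errors) / len(fold_errors)

# Hypothetical training set: 20 rows, 15 labelled 0 and 5 labelled 1.
rows = list(range(20))
labels = [0] * 15 + [1] * 5
cv_error = cross_validated_error(rows, labels, fit_majority, misclassification_rate, k=5)
print(cv_error)  # 0.25 here: the majority label is always 0, so every 1 is missed
```

The test (and validation) sets are deliberately not touched here; cross-validation only re-uses the training rows.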

Type I and Type II Errors – Error rates

True positive = correctly identified
False positive = incorrectly identified
True negative = correctly rejected
False negative = incorrectly rejected

In general: positive = identified and negative = rejected.
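To make the four outcomes concrete, here is a small sketch that counts them from a list of actual labels and a list of predictions. The function name `confusion_counts` and the example data are invented for illustration; `True` stands for "positive".

```python
def confusion_counts(actual, predicted):
    """Count TP, FP, TN, FN for boolean labels (True = positive)."""
    counts = {"TP": 0, "FP": 0, "TN": 0, "FN": 0}
    for truth, guess in zip(actual, predicted):
        if guess and truth:
            counts["TP"] += 1      # correctly identified
        elif guess and not truth:
            counts["FP"] += 1      # incorrectly identified
        elif not guess and not truth:
            counts["TN"] += 1      # correctly rejected
        else:
            counts["FN"] += 1      # incorrectly rejected
    return counts

# Made-up example: 4 actual positives, 4 actual negatives.
actual    = [True, True,  True, False, False, False, False, True]
predicted = [True, False, True, False, True,  False, False, True]
counts = confusion_counts(actual, predicted)
print(counts)  # {'TP': 3, 'FP': 1, 'TN': 3, 'FN': 1}
```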


Sensitivity relates to the test’s ability to identify positive results.

E.g. the sensitivity of a test is the proportion of people known to have the disease who test positive for it. In other words, it is the probability of a positive test given that the patient is ill.

Specificity relates to the test’s ability to identify negative results.

E.g. the specificity of a test is the proportion of patients known NOT to have the disease who test negative for it. In other words, it is the probability of a negative test given that the patient is well.

Error rate – A test with a high sensitivity has a low type II error rate (few false negatives). Similarly, a test with a high specificity has a low type I error rate (few false positives).
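The relationship between sensitivity, specificity and the two error rates can be written out as a short calculation. This is a sketch with made-up counts (100 ill patients, 1000 well patients); the function name `error_rates` is just for these notes.

```python
def error_rates(tp, fp, tn, fn):
    """Sensitivity, specificity and type I / type II error rates from counts."""
    return {
        # P(positive test | ill)
        "sensitivity": tp / (tp + fn),
        # P(negative test | well)
        "specificity": tn / (tn + fp),
        # Type I error rate = false positive rate = 1 - specificity
        "type_I_error_rate": fp / (fp + tn),
        # Type II error rate = false negative rate = 1 - sensitivity
        "type_II_error_rate": fn / (fn + tp),
    }

# Hypothetical screening test: 90 of 100 ill patients test positive,
# 950 of 1000 well patients test negative.
rates = error_rates(tp=90, fp=50, tn=950, fn=10)
for name, value in rates.items():
    print(f"{name}: {value:.2f}")
```

With these counts the sensitivity is 0.90 and the specificity is 0.95, so the type II error rate is 0.10 and the type I error rate is 0.05.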

The phrase “gold standard” comes from the time when a currency could be exchanged for gold, guaranteeing its value. Thus any “gold standard” item is a benchmark of the highest quality… although what counts as the highest quality is a matter of consensus or opinion.

Data Sets

Normally to perform supervised learning you need two types of data sets:

  1. The dataset (your “gold standard”) in which you have the input data together with the correct/expected output. This dataset is usually prepared either by humans or by collecting data in a semi-automated way. It is important that you have the expected output for every data row here, because supervised learning needs it.
  2. The data you are going to apply your model to. In many cases this is the data for which you are interested in the output of your model, and thus you don’t have any “expected” output here yet.

While performing machine learning you do the following:

  1. Training phase: you present your data from your “gold standard” and train your model, by pairing the input with expected output.
  2. Validation/Test phase: you evaluate the model on “gold standard” data it was not trained on, in order to estimate how well your model has been trained (this depends on the size of your data, the value you would like to predict, the input, etc.) and to estimate model properties (mean error for numeric predictors, classification errors for classifiers, recall and precision for IR models, etc.).
  3. Application phase: now you apply your freshly developed model to the real-world data and get the results. Since you normally don’t have any reference value in this type of data (otherwise, why would you need your model?), you can only speculate about the quality of your model output using the results of your validation phase.
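Two of the model properties mentioned in the validation/test phase can be sketched as follows. Note the assumptions: "mean error" is interpreted here as mean absolute error (other definitions exist), precision and recall are computed on sets of retrieved vs. relevant items, and the function names and data are invented for illustration.

```python
def mean_absolute_error(truth, predictions):
    """Mean absolute error for a numeric predictor (one possible 'mean error')."""
    return sum(abs(t - p) for t, p in zip(truth, predictions)) / len(truth)

def precision_recall(relevant, retrieved):
    """Precision and recall for an IR-style model.

    precision = fraction of retrieved items that are relevant
    recall    = fraction of relevant items that were retrieved
    """
    relevant, retrieved = set(relevant), set(retrieved)
    hits = len(relevant & retrieved)
    precision = hits / len(retrieved) if retrieved else 0.0
    recall = hits / len(relevant) if relevant else 0.0
    return precision, recall

# Made-up held-out data with known reference values.
mae = mean_absolute_error([3.0, 5.0, 2.0], [2.5, 5.0, 4.0])
precision, recall = precision_recall(relevant={"a", "b", "c", "d"},
                                     retrieved={"a", "b", "x"})
print(round(mae, 3), round(precision, 3), round(recall, 3))
```

Here two of the three retrieved items are relevant (precision 2/3) and two of the four relevant items were found (recall 1/2).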

The validation phase is often split into two parts:

  1. In the first part you just look at your models and select the best performing approach using the validation data (=validation)
  2. Then you estimate the accuracy of the selected approach (=test).

Hence the common separation into 50/25/25: 50% training, 25% validation and 25% test.
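A rough sketch of this two-part procedure, assuming a 50/25/25 split and a plain misclassification rate as the error measure. Everything named here (`three_way_split`, `select_and_assess`, the two toy candidate approaches and the data) is hypothetical and only meant to show where each subset is used.

```python
import random

def three_way_split(rows, fractions=(0.50, 0.25, 0.25), seed=0):
    """Shuffle the rows and split them into training / validation / test."""
    shuffled = list(rows)
    random.Random(seed).shuffle(shuffled)
    n_train = int(len(shuffled) * fractions[0])
    n_valid = int(len(shuffled) * fractions[1])
    train = shuffled[:n_train]
    valid = shuffled[n_train:n_train + n_valid]
    test = shuffled[n_train + n_valid:]   # whatever remains
    return train, valid, test

def error_rate(model, rows):
    """Fraction of (input, expected output) pairs the model gets wrong."""
    return sum(1 for x, y in rows if model(x) != y) / len(rows)

def select_and_assess(fitters, rows):
    """Fit every candidate, pick the best on validation, score it once on test."""
    train, valid, test = three_way_split(rows)
    models = {name: fit(train) for name, fit in fitters.items()}
    # Validation: choose the approach with the lowest validation error.
    best = min(models, key=lambda name: error_rate(models[name], valid))
    # Test: estimate the accuracy of the selected approach only.
    return best, error_rate(models[best], test)

def fit_mean_threshold(train):
    """Toy approach: predict True when x is at least the mean training input."""
    mean_x = sum(x for x, _ in train) / len(train)
    return lambda x: x >= mean_x

def fit_always_true(train):
    """Toy baseline: ignore the input and always predict True."""
    return lambda x: True

# Made-up "gold standard": the expected output is True when x >= 50.
gold = [(x, x >= 50) for x in range(100)]
best_name, test_error = select_and_assess(
    {"mean_threshold": fit_mean_threshold, "always_true": fit_always_true}, gold
)
print(best_name, test_error)
```

The important design point is that the test rows play no part in choosing between the candidates, so the final error estimate is not biased by the selection.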

If you don’t need to choose an appropriate model from several rival approaches, you can simply re-partition your data so that you have only a training set and a test set, without performing the validation step on your trained model. In that case I personally partition them 70/30.

Most supervised data mining algorithms follow these three steps:

  1. The training set is used to build the model. This contains a set of data that has preclassified target and predictor variables.
  2. Typically a hold-out dataset, or test set, is used to evaluate how well the model does with data outside the training set. The test set also contains preclassified results, but these are withheld while the test data is run through the model and are only compared against the model results at the end. The model is then adjusted to minimize error on the test set.
  3. Another hold-out dataset, or validation set, is used to evaluate the adjusted model from step 2: again, the validation data is run through the adjusted model and the results are compared with the withheld preclassified data.

Example of a predictive study design – Netflix Movie Recommendation Prediction System.

Disclaimer: These are my study notes, kept online instead of on paper so that others can benefit. In the process I have used some pictures/content from other original authors. All sources/original content publishers are listed below and they deserve credit for their work. No copyright violation intended.

References for these notes:

The study material for the MOOC “Data Analysis” at

A discussion thread on
