Tech Notes

My notes on Statistics, Big Data, Cloud Computing, Cyber Security

Tag Archives: test set

Decision Trees and Prediction with trees

A decision tree is a kind of flowchart: a graphical representation of the process for making a decision or a series of decisions.
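To make the flowchart idea concrete, here is a minimal sketch in Python. The tree, its questions, and the field names (`raining`, `temperature_c`) are made up for illustration and do not come from the original post. Internal nodes ask a yes/no question and leaves hold the decision.

```python
# Hypothetical tree: internal nodes ask a yes/no question, leaves hold a decision.
tree = {
    "question": lambda case: case["raining"],
    "yes": {"decision": "take umbrella"},
    "no": {
        "question": lambda case: case["temperature_c"] < 10,
        "yes": {"decision": "wear coat"},
        "no": {"decision": "go as you are"},
    },
}

def decide(node, case):
    """Walk from the root to a leaf, following the answer at each question."""
    while "decision" not in node:
        node = node["yes"] if node["question"](case) else node["no"]
    return node["decision"]

print(decide(tree, {"raining": False, "temperature_c": 5}))  # prints "wear coat"
```

A tree learned from data has the same shape; the difference is that the questions and thresholds are chosen by an algorithm rather than written by hand.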

Cross Validation

Cross-validation is primarily a way of measuring the predictive performance of a statistical model. Every statistician knows that the model fit statistics are not a good guide to how well a model will predict: high R^2 does not necessarily mean a good model. It is easy to over-fit the data by including too many degrees of freedom and so inflate R^2 and other fit statistics. For example, in a simple polynomial regression I can just keep adding higher order terms and so get better and better fits to the data. But the predictions from the model on new data will usually get worse as higher order terms are added.
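The polynomial example can be tried out with a small sketch. Everything here is an assumption made for illustration: the synthetic quadratic data, the noise level, the 5 interleaved folds, and the helper names `fit_poly`, `mse`, and `k_fold_mse`. It uses only the Python standard library and fits by solving the normal equations, which is fine for low degrees but not a numerically robust method in general.

```python
import random

def fit_poly(xs, ys, degree):
    """Least-squares polynomial fit via the normal equations; returns c[0..degree]."""
    n = degree + 1
    # Augmented system (A^T A | A^T y), where A is the Vandermonde matrix of xs.
    m = [[sum(x ** (i + j) for x in xs) for j in range(n)]
         + [sum(y * x ** i for x, y in zip(xs, ys))] for i in range(n)]
    # Gaussian elimination with partial pivoting.
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            f = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= f * m[col][c]
    # Back substitution.
    coef = [0.0] * n
    for i in reversed(range(n)):
        s = m[i][n] - sum(m[i][j] * coef[j] for j in range(i + 1, n))
        coef[i] = s / m[i][i]
    return coef

def mse(coef, xs, ys):
    """Mean squared error of the polynomial with coefficients coef on (xs, ys)."""
    preds = [sum(c * x ** i for i, c in enumerate(coef)) for x in xs]
    return sum((p - y) ** 2 for p, y in zip(preds, ys)) / len(xs)

def k_fold_mse(xs, ys, degree, k=5):
    """Average held-out MSE over k folds; fold i holds out points i, i+k, i+2k, ..."""
    errors = []
    for fold in range(k):
        held_out = set(range(fold, len(xs), k))
        train_x = [x for i, x in enumerate(xs) if i not in held_out]
        train_y = [y for i, y in enumerate(ys) if i not in held_out]
        test_x = [x for i, x in enumerate(xs) if i in held_out]
        test_y = [y for i, y in enumerate(ys) if i in held_out]
        coef = fit_poly(train_x, train_y, degree)
        errors.append(mse(coef, test_x, test_y))
    return sum(errors) / k

# Synthetic data (assumed): a quadratic signal plus Gaussian noise.
random.seed(0)
xs = [random.uniform(-1, 1) for _ in range(30)]
ys = [1 + 2 * x - 3 * x ** 2 + random.gauss(0, 0.3) for x in xs]

print("degree  train_mse  cv_mse")
for degree in range(1, 7):
    train_error = mse(fit_poly(xs, ys, degree), xs, ys)
    print(degree, round(train_error, 4), round(k_fold_mse(xs, ys, degree), 4))
```

The training error can only go down as the degree increases, because each higher-degree model contains the lower-degree ones. The cross-validated error has no such guarantee, which is why it is the more useful guide to predictive performance.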

Prediction Study Design

Steps in building a prediction
1. Find the right data
2. Define your error rate
3. Split data into:

  • Training
  • Testing
  • Validation (Optional)

4. On the training set pick features
5. On the training set pick prediction function
6. On the training set cross-validate
7. If there is no validation set, apply the model once (1x) to the test set
8. If there is a validation set, apply the model to the test set and refine
9. If there is a validation set, apply the model once (1x) to the validation set
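
Step 3 above can be sketched as follows. The 60/20/20 proportions, the fixed seed, and the function name `split_data` are assumptions chosen for illustration; the original notes do not prescribe split sizes.

```python
import random

def split_data(rows, train_frac=0.6, test_frac=0.2, seed=0):
    """Shuffle rows and split them into training, testing, and validation lists."""
    shuffled = list(rows)
    random.Random(seed).shuffle(shuffled)  # seeded so the split is reproducible
    n_train = round(len(shuffled) * train_frac)
    n_test = round(len(shuffled) * test_frac)
    train = shuffled[:n_train]
    test = shuffled[n_train:n_train + n_test]
    validation = shuffled[n_train + n_test:]  # whatever is left over
    return train, test, validation

train, test, validation = split_data(range(100))
print(len(train), len(test), len(validation))  # prints "60 20 20"
```

Feature selection, choice of prediction function, and cross-validation (steps 4 to 6) would then use only `train`, with the held-out sets touched as described in steps 7 to 9.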