Tech Notes

My notes on Statistics, Big Data, Cloud Computing, Cyber Security

Tag Archives: train­ing set

Cross Validation

Cross-​​validation is pri­mar­ily a way of mea­sur­ing the pre­dic­tive per­for­mance of a sta­tis­ti­cal model. Every sta­tis­ti­cian knows that the model fit sta­tis­tics are not a good guide to how well a model will pre­dict: high R^2 does not nec­es­sar­ily mean a good model. It is easy to over-​​fit the data by includ­ing too many degrees of free­dom and so inflate R^2 and other fit sta­tis­tics. For exam­ple, in a sim­ple poly­no­mial regres­sion I can just keep adding higher order terms and so get bet­ter and bet­ter fits to the data. But the pre­dic­tions from the model on new data will usu­ally get worse as higher order terms are added. Read more of this post