
Supervised machine learning: Overfitting and underfitting

When performing regression or classification, a procedure is used to fit a mathematical formula to the properties of available training data.  In general, the training itself consists of adjusting 'volume knobs', the internal variables of the formula, so that increasingly better outputs are generated in response to inputs.

It may seem that the best strategy is to let training continue until it cannot improve any more.  In fact, training can be too successful.  After all, the real test of the effectiveness of an input-output machine is how well it does on novel observations that were not part of the training set (but that, of course, measure the same phenomena as the training set).  These new inputs are often called a validation set, or out-of-sample data, meaning that they were not seen during training.  The purpose of training a machine is not simply to store and retrieve the training data, which is trivial, but to make good guesses about the output for new input observations. This is called generalization, and it is a critical property of any cognitive system.
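The train/validation idea can be sketched in a few lines of code.  This is an illustrative example, not from the original post: the data (a noisy straight line), the hold-out scheme (every fourth point), and the model are all assumptions chosen for simplicity.

```python
# Minimal sketch of measuring generalization with held-out data.
# All data and model choices here are illustrative assumptions.
import numpy as np

rng = np.random.default_rng(0)
x = np.linspace(0, 1, 40)
y = 2.0 * x + 1.0 + rng.normal(0, 0.1, size=x.shape)  # noisy line

# Hold out every fourth point as out-of-sample (validation) data.
val_mask = np.arange(x.size) % 4 == 0
x_train, y_train = x[~val_mask], y[~val_mask]
x_val, y_val = x[val_mask], y[val_mask]

# Adjust the 'volume knobs' (slope and intercept) on training data only.
slope, intercept = np.polyfit(x_train, y_train, deg=1)

def mse(xs, ys):
    """Mean squared error of the fitted line on the given points."""
    return float(np.mean((np.polyval([slope, intercept], xs) - ys) ** 2))

train_error = mse(x_train, y_train)
val_error = mse(x_val, y_val)
# Good generalization: validation error stays comparable to training error.
```

Because the validation points were never shown to the fitting procedure, a low `val_error` is evidence that the machine has captured the underlying relationship rather than memorized the training points.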

What if we gathered an entirely new training set, and started learning again from scratch?  How similar would the second trained machine be to the first one?  Can the input-output relationships in the second data set be predicted by a machine trained on the first training set?

Training that has gone too far is called overfitting.  It results in a machine so committed to being right about one limited data set that it performs poorly on new data.  The opposite is underfitting, which involves making commitments that ignore good information available across many possible training sets.
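Both failure modes can be seen in a single toy experiment.  The sketch below is an illustrative assumption (quadratic ground truth, polynomial models of varying degree), not the author's own example: a degree-0 model underfits by ignoring the curvature, while a very high-degree model typically chases the training noise.

```python
# Illustrative sketch: underfitting vs. overfitting on the same split.
# Ground truth, noise level, and model degrees are assumptions.
import numpy as np

rng = np.random.default_rng(1)
x = np.linspace(-1, 1, 30)
y = x**2 + rng.normal(0, 0.05, size=x.shape)   # quadratic ground truth

val_mask = np.arange(x.size) % 3 == 0          # hold out a third
x_tr, y_tr = x[~val_mask], y[~val_mask]
x_va, y_va = x[val_mask], y[val_mask]

def val_mse(degree):
    """Fit a polynomial on training data; score it on validation data."""
    coeffs = np.polyfit(x_tr, y_tr, degree)
    return float(np.mean((np.polyval(coeffs, x_va) - y_va) ** 2))

errors = {d: val_mse(d) for d in (0, 2, 15)}
# Degree 0 underfits (a flat line ignores the curvature); degree 15 has
# enough 'volume knobs' to chase the training noise; degree 2 matches
# the structure that persists across data sets.
```

Note that the training error alone would rank these models backwards: the degree-15 fit hugs the training points most closely, which is exactly the commitment to one limited data set described above.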

In the statistics literature, the struggle between overfitting and underfitting has been called the bias-variance dilemma.  If you wish, you may take it as a homework assignment to look up the term cross-validation as a solution to this dilemma.
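For readers who would rather peek at the answer, here is a minimal k-fold cross-validation sketch, again under illustrative assumptions (the same quadratic toy data, polynomial models, five folds): each candidate model is scored on data it did not see, averaged over several train/validation splits.

```python
# Minimal k-fold cross-validation sketch; data and models are
# illustrative assumptions, not a prescription.
import numpy as np

rng = np.random.default_rng(2)
x = np.linspace(-1, 1, 30)
y = x**2 + rng.normal(0, 0.05, size=x.shape)

def cv_error(degree, k=5):
    """Average held-out error of a degree-`degree` fit over k folds."""
    folds = np.array_split(rng.permutation(x.size), k)
    errs = []
    for fold in folds:
        held_out = np.zeros(x.size, dtype=bool)
        held_out[fold] = True
        coeffs = np.polyfit(x[~held_out], y[~held_out], degree)
        errs.append(np.mean((np.polyval(coeffs, x[held_out]) - y[held_out]) ** 2))
    return float(np.mean(errs))

# Choose the model whose *cross-validated* error is lowest, rather than
# the one that fits its own training data best.
best_degree = min((0, 1, 2, 3), key=cv_error)
```

Because every point serves as validation data in exactly one fold, the averaged score penalizes both kinds of failure: an underfit model is wrong everywhere, and an overfit model is wrong on whichever fold it did not memorize.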

The author's affiliation with The MITRE Corporation is provided for identification purposes only, and is not intended to convey or imply MITRE's concurrence with, or support for, the positions, opinions, or viewpoints expressed by the author.
