
Supervised machine learning: Classification


A previous post introduced supervised machine learning as a way to infer a numerical 'answer' to a numerical 'question', by training on a set of correct input-output pairs supplied by a user.  Here we examine the special case where the answers are categorizations or classifications of inputs.

Perhaps we'd like to predict whether bananas are at a desirable level of ripeness using measurements of their color.  Figure A shows a distribution of fictitious red and green color measurements, as from an RGB color camera, for a population of bananas.  Those judged 'edible' by humans are shown with small circles (look closely: they tend toward yellow, in the upper right corner), and those judged 'inedible' are shown with large circles (mostly very green ones and very brown ones).

To categorize new bananas, a mathematical classifier is used to cut the red-green color plane up into two regions, 'edible' and 'inedible'.  Any new color point can then be classified based on which region it falls into.  One type of linear classifier tries to fit normal distributions to the two labeled sets of points, then draws a straight line between the two (Figure B).  A quadratic classifier does the same thing, but has the extra freedom to draw a curve instead of a straight line (Figure C), and so tends to do a better job on data like this.
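The post does not name the classifiers behind Figures B and C.  One common pair that matches the description is Gaussian discriminant analysis, and the sketch below assumes that reading.  Each class gets a fitted normal distribution, and a point is assigned to the class with the higher log-density.  Sharing one pooled covariance between the classes gives a straight-line boundary, while letting each class keep its own covariance gives a curved one.  The data, the cluster centers, and every function name here are made up for illustration and are not the values behind the figures.

```python
import math
import random

random.seed(0)

def cluster(center, spread, n):
    """Draw n fictitious (red, green) points around a center."""
    return [(random.gauss(center[0], spread), random.gauss(center[1], spread))
            for _ in range(n)]

# Made-up readings on a 0-1 scale (assumption, not the post's data).
edible = cluster((0.8, 0.8), 0.08, 100)            # yellowish
inedible = (cluster((0.3, 0.8), 0.10, 50)          # very green
            + cluster((0.5, 0.3), 0.10, 50))       # very brown

def mean2(points):
    n = len(points)
    return (sum(p[0] for p in points) / n, sum(p[1] for p in points) / n)

def cov2(points, mu):
    """Sample covariance of 2D points, returned as (sxx, sxy, syy)."""
    n = len(points)
    sxx = sum((p[0] - mu[0]) ** 2 for p in points) / (n - 1)
    syy = sum((p[1] - mu[1]) ** 2 for p in points) / (n - 1)
    sxy = sum((p[0] - mu[0]) * (p[1] - mu[1]) for p in points) / (n - 1)
    return (sxx, sxy, syy)

def log_gauss(p, mu, cov):
    """Log-density of a 2D normal distribution at point p."""
    sxx, sxy, syy = cov
    det = sxx * syy - sxy * sxy
    dx, dy = p[0] - mu[0], p[1] - mu[1]
    # Squared Mahalanobis distance via the closed-form 2x2 inverse.
    maha = (syy * dx * dx - 2 * sxy * dx * dy + sxx * dy * dy) / det
    return -0.5 * maha - 0.5 * math.log(det) - math.log(2 * math.pi)

def fit(edible_pts, inedible_pts, shared_covariance):
    """Return a classify(point) function for the two labeled sets."""
    mu_e, mu_i = mean2(edible_pts), mean2(inedible_pts)
    cov_e, cov_i = cov2(edible_pts, mu_e), cov2(inedible_pts, mu_i)
    ne, ni = len(edible_pts), len(inedible_pts)
    if shared_covariance:
        # Linear case: both classes use one pooled covariance.
        pooled = tuple(((ne - 1) * a + (ni - 1) * b) / (ne + ni - 2)
                       for a, b in zip(cov_e, cov_i))
        cov_e = cov_i = pooled
    prior_e = ne / (ne + ni)

    def classify(p):
        score_e = log_gauss(p, mu_e, cov_e) + math.log(prior_e)
        score_i = log_gauss(p, mu_i, cov_i) + math.log(1 - prior_e)
        return 'edible' if score_e > score_i else 'inedible'

    return classify

def accuracy(classify):
    """Fraction of the training points that are labeled correctly."""
    hits = sum(classify(p) == 'edible' for p in edible)
    hits += sum(classify(p) == 'inedible' for p in inedible)
    return hits / (len(edible) + len(inedible))

classify_linear = fit(edible, inedible, shared_covariance=True)
classify_quadratic = fit(edible, inedible, shared_covariance=False)
linear_accuracy = accuracy(classify_linear)
quadratic_accuracy = accuracy(classify_quadratic)
print('linear training accuracy:   ', linear_accuracy)
print('quadratic training accuracy:', quadratic_accuracy)
```

Calling `classify_quadratic((0.8, 0.8))` classifies a single new color reading.  The inedible points are deliberately drawn from two separate clusters, so a single normal distribution is a poor description of them, much as in the banana data discussed here.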

Both classifiers make plenty of mistakes.  The groups of edible and inedible data points are not close to being normally distributed, so the assumption used by the classifiers is already wrong.  Also, there is spatial overlap between the two classes in the training data, meaning that no simple linear or quadratic classifier could correctly classify them all.

If we added more degrees of freedom to the classifier, we could fit a very complicated boundary that would give the right answers on all of the training data.  However, that complicated boundary might be bad at predicting human judgment on bananas that weren't in the training data, a problem known as overfitting.
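A small experiment can make this trade-off concrete.  The sketch below swaps in a nearest-neighbor classifier, which is not a method from this post, as a stand-in for a classifier with a great many degrees of freedom: it simply memorizes every training point.  The overlapping color data is again made up, and the names are hypothetical.

```python
import random

random.seed(1)

def sample(n):
    """Fictitious (red, green) readings from two heavily overlapping classes."""
    data = []
    for _ in range(n):
        if random.random() < 0.5:
            point = (random.gauss(0.7, 0.15), random.gauss(0.7, 0.15))
            data.append((point, 'edible'))
        else:
            point = (random.gauss(0.5, 0.15), random.gauss(0.6, 0.15))
            data.append((point, 'inedible'))
    return data

train = sample(200)      # bananas the classifier gets to see
held_out = sample(200)   # new bananas it has never seen

def nearest_neighbor(point):
    """Label a point with the label of the closest training point."""
    def squared_distance(item):
        (red, green), _label = item
        return (red - point[0]) ** 2 + (green - point[1]) ** 2
    return min(train, key=squared_distance)[1]

def accuracy(data):
    """Fraction of points in data that the classifier labels correctly."""
    return sum(nearest_neighbor(p) == label for p, label in data) / len(data)

train_accuracy = accuracy(train)
held_out_accuracy = accuracy(held_out)
print('training accuracy:', train_accuracy)
print('held-out accuracy:', held_out_accuracy)
```

Every training point is its own nearest neighbor, so the training accuracy is perfect.  The held-out accuracy is lower, because the memorized boundary follows the accidents of one particular sample rather than the underlying overlap between the classes.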

The author's affiliation with The MITRE Corporation is provided for identification purposes only, and is not intended to convey or imply MITRE's concurrence with, or support for, the positions, opinions, or viewpoints expressed by the author.
