Bias and variance are, to me, the most important concepts in machine learning. Despite this, not many have a well-formed intuition about what they are. Why is this the case?
Bias and variance generally get discussed in the context of error sources. Consider modeling a relationship with diminishing returns, like house value versus square footage. A linear model will be good in some ranges and poor in others: low values are well approximated linearly, so if the model generalizes to other values in the “normal range” of house values, the bias of the model will help. But if we try to predict values for really large houses, the bias of the model may hurt.
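As a toy sketch of that failure mode (made-up numbers, assuming a square-root-shaped “true” relationship between square footage and value):

```python
import math

# Hypothetical diminishing-returns relationship: value grows like sqrt(sqft).
sqft = [800, 1000, 1200, 1500, 1800, 2200, 2600, 3000]
value = [150 * math.sqrt(s) for s in sqft]   # stand-in "true" relationship

# Ordinary least squares fit of a straight line, value ~ a + b * sqft.
n = len(sqft)
mx, my = sum(sqft) / n, sum(value) / n
b = (sum((x - mx) * (y - my) for x, y in zip(sqft, value))
     / sum((x - mx) ** 2 for x in sqft))
a = my - b * mx

def linear_model(s):
    return a + b * s

# Inside the "normal range" the linear bias is close enough...
err_normal = abs(linear_model(1600) - 150 * math.sqrt(1600))
# ...but for a very large house the bias hurts badly.
err_huge = abs(linear_model(10000) - 150 * math.sqrt(10000))
print(round(err_normal), round(err_huge))  # extrapolation error dwarfs in-range error
```

The fitted line tracks the curve where the training data lives and overshoots drastically once the concavity of the true relationship matters.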
If we instead fit a complex model, like a quintic function, it will introduce unnecessary perturbations in the fitted curve that result in extreme under- and over-estimations. Adding a few data samples may completely change the shape of the curve. This indicates that the final model is subject to change as it fits individual data points, possibly covering a large area of the feature space, and that its output may fluctuate wildly given small changes in the training data. This is then used as the high variance example.
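That instability is easy to demonstrate. Here's a sketch using exact degree-5 interpolation (Lagrange form, so no fitting library is needed) through six made-up points: nudge a single training value by 0.1 and watch the curve move by far more than that near the edge of the data.

```python
def lagrange(xs, ys, x):
    """Evaluate the unique degree-(n-1) polynomial through (xs, ys) at x."""
    total = 0.0
    for i, (xi, yi) in enumerate(zip(xs, ys)):
        term = yi
        for j, xj in enumerate(xs):
            if j != i:
                term *= (x - xj) / (xi - xj)
        total += term
    return total

xs = [0, 1, 2, 3, 4, 5]
ys = [0.0, 1.0, 1.4, 1.7, 1.9, 2.0]   # roughly diminishing returns
ys_nudged = list(ys)
ys_nudged[2] += 0.1                    # tiny change to one training sample

x_test = 6  # just beyond the training range
delta = abs(lagrange(xs, ys, x_test) - lagrange(xs, ys_nudged, x_test))
print(delta)  # ~1.5: fifteen times the 0.1 nudge
```

A model flexible enough to pass through every point amplifies any wiggle in the training data, which is the variance half of the tradeoff.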
It’s normal practice to follow this up with a cross-validation example where we can look at training error versus some indication of generalization error (hold-out validation, averaged k-fold cross-validation, whatever) and then show that there’s a sweet spot, maybe quadratic, where we can approximate the diminishing returns shape of the “true relationship” and then we select that model. The victim of our example comes away thinking that bias and variance are just another way of stating the impact of degrees of freedom on training versus generalization error.
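A sketch of that sweet spot, swapping the polynomial-degree knob for a k-nearest-neighbor one (small k = high variance, large k = high bias) so it fits in a few lines of plain Python; the data and noise level are invented:

```python
import math
import random

random.seed(0)
# Synthetic diminishing-returns data with noise.
xs = [random.uniform(1, 100) for _ in range(60)]
data = [(x, math.sqrt(x) + random.gauss(0, 0.3)) for x in xs]
train, valid = data[:40], data[40:]  # simple hold-out split

def knn_predict(k, x):
    """k-nearest-neighbor regression on the training set."""
    nearest = sorted(train, key=lambda p: abs(p[0] - x))[:k]
    return sum(y for _, y in nearest) / k

def mse(points, k):
    return sum((knn_predict(k, x) - y) ** 2 for x, y in points) / len(points)

for k in (1, 3, 5, 15, 40):
    print(k, round(mse(train, k), 3), round(mse(valid, k), 3))
# k = 1 memorizes the training set (train error exactly 0) yet typically
# validates worse; very large k underfits both; some middle k is the sweet spot.
```

The same U-shaped validation curve shows up whatever the complexity knob is: polynomial degree, tree depth, or k here.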
But the bias/variance tradeoff is much more central and permeates all aspects of model design, learning, information theoretic formulations and statistical formulations of machine learning, etc. So, here is my attempt to flesh out some more on what this tradeoff implies for machine learning models.
Another example for bias and variance
Bias is when you (think you) know where to look. Say I’m modeling pixels of moderate resolution satellite images as a mixture of their components. I’m assuming it’s a linear combination because I understand it’s somewhat like 1 pixel being made up of materials I could then zoom in on at 25 pixels (5x5), then adding them up and averaging them. It’s a good bias. What does that get me?
The search space I have to search over is dramatically reduced.
The model I selected is likely to generalize.
The features I use are likely to map on to some real world value I can interpret.
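A toy version of that linear-mixing assumption, with invented reflectance values: one coarse pixel is the average of the 5x5 patch of fine pixels “inside” it, which is exactly a fraction-weighted sum of each material's reflectance.

```python
# Made-up single-band reflectances for three materials (the "endmembers").
reflectance = {"vegetation": 0.4, "soil": 0.25, "water": 0.05}

# A 5x5 fine-resolution patch: 15 vegetation, 7 soil, 3 water subpixels.
patch = ["vegetation"] * 15 + ["soil"] * 7 + ["water"] * 3

# Zoom out: the coarse pixel is the plain average of its 25 subpixels.
coarse_pixel = sum(reflectance[m] for m in patch) / 25

# The linear-mixing model: weight each endmember by its area fraction.
fractions = {m: patch.count(m) / 25 for m in reflectance}
mixed = sum(fractions[m] * reflectance[m] for m in reflectance)

assert abs(coarse_pixel - mixed) < 1e-12  # the two views agree exactly
print(round(coarse_pixel, 3))  # 0.316
```

The bias buys me exactly what the list above says: instead of searching over arbitrary functions of the pixel, I only have to search over area fractions, and those fractions mean something physical.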
Alternatively, I could say that it’s any polynomial fit, or possibly a nearest neighbor model. Or jagged decision tree boundaries: if this much green, it’s half plant; this much more green, and 3/4 of the pixel is covered in vegetation, etc. I can throw all of these models together and average them out. What does that get me?
The search space I have to search over is now enormous.
The individual models I train are not likely to generalize.
The features I use in isolation may be nonsensical.
I may fit a lot of incidental noise in the image.
And so on. Though if I use multiple models, there’s one possible way out: I can average all these things together, and if their errors are uncorrelated, they might give me something that will generalize, because all their “different opinions” were averaged out. This is not unlike how Eratosthenes estimated the circumference of the world. He assumed the distance from Alexandria to Syene was 5000 stadia, and that the Earth was a perfect sphere. Neither of these assumptions is correct, but their impacts on the total measurement were not correlated (correlation would have magnified the error), and so he came within 66 km of the true value given our current best measure of the Earth’s circumference.
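A quick sketch of why averaging uncorrelated errors helps, with synthetic numbers rather than a real ensemble: each “model” below is just the true value plus its own independent noise, and averaging 25 of them shrinks the typical error by roughly a factor of sqrt(25) = 5.

```python
import random

random.seed(42)
TRUE_VALUE = 10.0

def noisy_estimate():
    # Each "model" carries its own uncorrelated error.
    return TRUE_VALUE + random.gauss(0, 2.0)

trials = 2000
# Typical error of a single model...
single_err = sum(abs(noisy_estimate() - TRUE_VALUE) for _ in range(trials)) / trials
# ...versus the typical error of an average over 25 uncorrelated models.
avg_err = sum(
    abs(sum(noisy_estimate() for _ in range(25)) / 25 - TRUE_VALUE)
    for _ in range(trials)
) / trials
print(round(single_err, 2), round(avg_err, 2))  # averaging wins by ~5x
```

The catch is the assumption baked into `noisy_estimate`: if the models shared a bias (correlated errors), averaging would preserve it rather than wash it out, which is exactly the Eratosthenes point.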
What does a bad bias do? Say I assume the relationship between water and plant growth is linear. Then the more water, the better, right? Well, eventually a terrestrial plant will die when it finds itself in a lake. The way a bad bias hurts a model is fairly obvious, so I won’t belabor that point.
A human example
This is about AI after all, and we think of humans as intelligent because we are usually talking about things humans are doing when we use the word “intelligent.” So the words “human” and “intelligent” are near neighbors to each other in that giant language vector space in our heads. If bias and variance are good ways to talk about intelligence, then how can we think about bias and variance in humans?
One super-simplistic way is to approach human biology as bias. For example, the types and configuration of cones in my retina are a source of bias: I will only be identifying features in colors I can see. Again, a reduction in search space due to bias. I can learn any arbitrary human language I grow up with, and whether I grow up with English, Swahili, or Urdu, the sounds I make with my mouth and the meaning I infer from sounds will vary drastically. So brain plasticity, here in terms of language acquisition, is a source of variance in human behavior.
Of course it’s more complex. As certain pathways close, I may never really be able to hear or make a sound like a native speaker. So my learning history can become a source of bias later on.
Bias reduces search space
This is very important. Why does it take deep nets millions of examples, and humans only a few, to learn certain types of images? The nets aren’t as biased as humans are! How could you improve nets in this way? Find a better bias for them. How to use fewer examples is an important problem in machine learning, and here’s step 1 in figuring out how to solve it: break the problem into pieces we can understand, starting by framing it in terms of bias and variance. Here’s a process flow:
Problem — the model doesn’t work with 5000 examples.
Solution part A — it’s not biased enough.
Solution part B — I can bias it with changes to the architecture.
Solution part C — I can bias it with a different learning rate.
Solution done — now it trains.
This is super-simplistic, and you may not need to say the words bias or variance to think through it, but if you can cast new problems in terms of bias and variance, you can start expanding the search space for solutions more effectively. It’s a problem-solving technique that’s worth being well practiced in. Also, my discussion of it is purposely nested here: understanding bias and variance is a way of biasing yourself to limit your own search space as you solve machine learning problems.
Bias all the way down
When I decide which models to select, I’m biasing my model selection. The models themselves are biased to look for certain patterns or relations in the data, and the hyperparameters we set for them introduce biases of their own. When we combine models and want them to be uncorrelated, one way to achieve this is to select models with different biases.
Bias underlies deep versus shallow net performance
Why are deep nets more effective than shallow nets? They’re more biased! And it’s a better bias! They assume the data is made up of hierarchies of relations. In exchange for this bias, they can exponentially reduce the size of the space they search over, which is good, because that space also happens to grow exponentially (the curse of dimensionality).
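One hedged way to make that exponential gap concrete is parity, a classic hierarchically-structured function: a tree of two-input XOR gates computes the parity of n bits with only n - 1 gates, while a flat sum-of-products table has to enumerate all 2^(n-1) odd-parity input patterns.

```python
from itertools import product

def deep_parity(bits):
    """Repeatedly XOR adjacent pairs: a tree built from n - 1 two-input gates."""
    while len(bits) > 1:
        bits = [bits[i] ^ bits[i + 1] for i in range(0, len(bits) - 1, 2)] + (
            bits[-1:] if len(bits) % 2 else []
        )
    return bits[0]

n = 16
deep_gates = n - 1
# The flat representation must list every odd-parity pattern explicitly.
flat_terms = sum(1 for bits in product([0, 1], repeat=n) if sum(bits) % 2 == 1)

print(deep_gates, flat_terms)  # 15 vs 32768
```

This isn’t a neural net, but it’s the same shape of argument: assuming the target is composed hierarchically lets the deep representation stay linear in n while the flat one explodes.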
Bias underlies information transfer
It’s a huge source of bias to think that learning to see fuzzy things helps you predict what things would be like to touch. To think that learning certain visual features would let you caption images. To think that you could take the early convolutional layers from an ImageNet model and use them to classify houses in satellite images, or different types of fish. But it works: it’s a good bias, and it’s because computer vision is less one damn thing after another and more one damn thing over and over.
Finding the right bias(es) is behind the human/AI gap
Why are bias and variance so important? They give us a framework for understanding why humans are so far ahead of AI in some measures and not in others. They give us a way to formally account for differences in learning even when performance is good. Here’s an example:
In facial recognition, you commonly have to search over a huge space of faces. But a baby doesn’t need to see 200 million faces to learn to recognize people. Why? Humans are more biased! We instantly scan several facial keypoints with our fovea and aren’t even aware of the saccades that make this possible. The convnet has to search over every possible orientation of different faces and learn which features will matter; a baby already knows. A human is biased by biology, so the deep net has to look at 200 million faces to get close to or match human performance.
How do we close the gap for regions we care about? Well it’s a huge space and will take a lot of years of effort, but one way to narrow that search space down is to bias ourselves in terms of bias.
How humans got their bias
It’s worth noting that the 200-million-faces-versus-baby comparison isn’t entirely fair. Why? You can look at two different scales of human process: the brain/cognitive scale, and the evolutionary/genetic scale. How can a baby see only a few faces and already know how to encode them? Well, how many faces have been seen, over how many years of evolution, by the humans who passed their genes on? Maybe looking at 200 million faces in a month isn’t so bad for a new convolutional neural net.
Biological, evolutionary processes in the wild have a huge space to search over and are a good example of a higher-variance model. Humans can learn to do all kinds of things they are biased to do, but they’re biased in a particular way by a process that can produce things that aren’t human at all. Evolution is biased by the structure of DNA, by the physical and chemical components of the material that has to be ingested, and by the things that emerge from that, like the need to operate on a tight energy budget. But since it has produced a superset of learners compared with humans, could possibly produce some other unknown set of learners, and, to step back one level, can produce learning at all, I think we can safely say it’s comparatively high variance.
I hope this random walk through bias and variance has shown the degree to which bias and variance are about much more than overfitting and error sources. It’s much like transitioning from thinking about derivatives as slopes to thinking of them as first-order variation. Or when you first hear that you can frame a scientific model as a way of compressing reality. To me these kinds of connections are important because they enable a range of creativity in problem solving. If you encounter a new problem and can cast it in terms that let you map it onto several other related domains, you’re much more likely to find a solution.