The “Bias-Variance trade-off” is one of the fundamental concepts in machine learning. It says that two sources of error (or loss), bias and variance, trade off against each other when we evaluate the generalization capacity of an ML algorithm.

To put it more simply, it can be seen as the problem of finding the optimal point between over-fitting and under-fitting.

First, we are going to define what bias and variance are, then see how these two are related.

**Bias, Variance and Noise**

The following derivation is taken from Lecture 12 of the CS4780 class at Cornell University, fall semester 2018.

Let us assume that we want to perform a regression task with the training data $D=\set{(x_1, y_1),(x_2,y_2),\dots,(x_n,y_n)}$.

We have a model trained using $D$, called $h_D$.

Now let’s define some notation as follows.

- $h_D(x)$: The model’s output when $x$ is fed into the model trained on $D$. Because $D$ is itself a random sample, these outputs follow a distribution whose expected value is $\bar{h}(x)$.
- $y$: The real label. Remember that in real cases, the label for a given $x$ is not fixed but follows some distribution.
- $\bar{h}(x)$: The expected output of the model, $\bar{h}(x)=E_D[h_D(x)]$, that is, the prediction averaged over all possible training sets.
- $\bar{y}(x)$: The expected label value, $\bar{y}(x)=E_{y \mid x}[y]$. We usually try to make the ML model predict $\bar{y}(x)$ when $x$ is provided.
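The definitions above can be made concrete with a small simulation. The sketch below is only an illustration under assumptions that are not in the lecture: the true expected label is $\bar{y}(x)=x^2$, the noise is Gaussian, and the model is a least-squares line. It estimates $\bar{h}(x)$ at one test input by averaging the predictions of many models, each trained on a freshly drawn $D$.

```python
import random
import statistics

def y_bar(x):
    # Assumed ground truth for this sketch: y_bar(x) = E[y | x] = x^2
    return x * x

def sample_dataset(rng, n=20, noise_sd=0.3):
    # Draw one training set D of n pairs (x_i, y_i)
    xs = [rng.uniform(-1.0, 1.0) for _ in range(n)]
    ys = [y_bar(x) + rng.gauss(0.0, noise_sd) for x in xs]
    return xs, ys

def fit_line(xs, ys):
    # Closed-form least-squares line; returns the trained model h_D
    mx, my = statistics.fmean(xs), statistics.fmean(ys)
    sxx = sum((x - mx) ** 2 for x in xs)
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = my - slope * mx
    return lambda x: slope * x + intercept

rng = random.Random(0)
x_test = 0.5

# h_D(x_test) for 2000 independent training sets D
preds = [fit_line(*sample_dataset(rng))(x_test) for _ in range(2000)]

h_bar = statistics.fmean(preds)   # Monte Carlo estimate of h_bar(x_test)
y_bar_at_x = y_bar(x_test)        # known exactly here, by construction
print(f"h_bar({x_test}) ~ {h_bar:.3f}, y_bar({x_test}) = {y_bar_at_x:.3f}")
```

Each entry of `preds` is one draw of $h_D(x)$, and their mean approximates $\bar{h}(x)$. Since a straight line cannot represent $x^2$, the gap between `h_bar` and `y_bar_at_x` is not expected to vanish, however many training sets are averaged.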

Before going into the derivation, note that $D$ is the training data and $(x,y)$ is a pair from the test data.

We usually use MSE (Mean Squared Error) for regression tasks, so let us define the expected test error as below.

$\text{Error}$

$=E_{x,y,D}[(h_D(x)-y)^2]$

$=E_{x,y,D}[[(h_D(x)-\bar{h}(x))+(\bar{h}(x)-y)]^2]$

$=E_{x,D}[(h_D(x)-\bar{h}(x))^2]+E_{x,y}[(\bar{h}(x)-y)^2]+2E_{x,y,D}[(h_D(x)-\bar{h}(x))(\bar{h}(x)-y)]$

Since $D$ is independent of the test pair $(x,y)$, and $\bar{h}(x)-y$ does not depend on $D$, we can take the expectation over $D$ first, which makes the third term $0$.

$E_{x,y,D}[(h_D(x)-\bar{h}(x))(\bar{h}(x)-y)]$

$=E_{x,y}[E_D[h_D(x)-\bar{h}(x)]\left(\bar{h}(x)-y\right)]$

$=E_{x,y}[(E_D[h_D(x)]-\bar{h}(x))(\bar{h}(x)-y)]$

$=E_{x,y}[(\bar{h}(x)-\bar{h}(x))(\bar{h}(x)-y)]$

$=E_{x,y}[0]$

$=0$

Next, we decompose the second term in the same way.

$E_{x,y}[(\bar{h}(x)-y)^2]$

$=E_{x,y}[[(\bar{h}(x)-\bar{y}(x))+(\bar{y}(x)-y)]^2]$

$=E_{x,y}[(\bar{y}(x)-y)^2]+E_x[(\bar{h}(x)-\bar{y}(x))^2]+2E_{x,y}[(\bar{h}(x)-\bar{y}(x))(\bar{y}(x)-y)]$

The third term of the above equation becomes $0$ by the following derivation.

$E_{x,y}[(\bar{h}(x)-\bar{y}(x))(\bar{y}(x)-y)]$

$=E_x[E_{y \mid x}[\bar{y}(x)-y]\left(\bar{h}(x)-\bar{y}(x)\right)]$

$=E_x[(\bar{y}(x)-E_{y \mid x}[y])(\bar{h}(x)-\bar{y}(x))]$

$=E_x[(\bar{y}(x)-\bar{y}(x))(\bar{h}(x)-\bar{y}(x))]$

$=E_x[0]$

$=0$

Therefore, the final form of the test error is as follows.

$\text{Error}=E_{x,D}[(h_D(x)-\bar{h}(x))^2]+E_{x,y}[(\bar{y}(x)-y)^2]+E_x[(\bar{h}(x)-\bar{y}(x))^2]$
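The decomposition can also be checked numerically. The following is a minimal Monte Carlo sketch, not part of the lecture, under assumed settings: $\bar{y}(x)=x^2$, Gaussian noise, a least-squares line as the model, and a fixed grid of test inputs standing in for $E_x$. It estimates the total error and the three terms separately and compares them.

```python
import random
import statistics

def y_bar(x):
    # Assumed ground truth for this sketch: E[y | x] = x^2
    return x * x

def fit_line(xs, ys):
    # Closed-form least-squares line; returns (slope, intercept)
    mx, my = statistics.fmean(xs), statistics.fmean(ys)
    sxx = sum((x - mx) ** 2 for x in xs)
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return slope, my - slope * mx

rng = random.Random(0)
noise_sd = 0.3
grid = [-1.0 + 0.05 * j for j in range(41)]   # test inputs x
n_sets, n_train = 500, 20

preds = []        # preds[d][j] = h_D(x_j) for the d-th training set
sq_errors = []    # samples of (h_D(x) - y)^2
noise_terms = []  # samples of (y_bar(x) - y)^2
for _ in range(n_sets):
    xs = [rng.uniform(-1.0, 1.0) for _ in range(n_train)]
    ys = [y_bar(x) + rng.gauss(0.0, noise_sd) for x in xs]
    slope, intercept = fit_line(xs, ys)
    row = [slope * x + intercept for x in grid]
    preds.append(row)
    for x, h in zip(grid, row):
        y = y_bar(x) + rng.gauss(0.0, noise_sd)  # fresh test label
        sq_errors.append((h - y) ** 2)
        noise_terms.append((y_bar(x) - y) ** 2)

# h_bar(x_j): prediction averaged over all training sets
h_bar = [statistics.fmean(col) for col in zip(*preds)]

total = statistics.fmean(sq_errors)
variance = statistics.fmean(
    [(h - hb) ** 2 for row in preds for h, hb in zip(row, h_bar)])
noise = statistics.fmean(noise_terms)
bias_sq = statistics.fmean(
    [(hb - y_bar(x)) ** 2 for x, hb in zip(grid, h_bar)])

print(f"total error            = {total:.4f}")
print(f"variance+noise+bias^2  = {variance + noise + bias_sq:.4f}")
```

The two printed numbers should agree up to Monte Carlo error, because the cross terms only vanish in expectation and not exactly on a finite sample.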

To sum up, the first term is **variance**, the second is **noise**, and the third is **bias** (technically, the square of the bias).

Let’s check each of them.

- Variance: Literally, this is the variance of the trained model’s predictions over different training sets. As the expression shows, it measures how far the predictions $h_D(x)$ scatter around their mean $\bar{h}(x)$.
- Bias: This comes from the difference between the model’s expected prediction $\bar{h}(x)$ and the expected label $\bar{y}(x)$ we want to predict. If that difference is large, we say that the model is “biased”. That is, bias is the inherent error of the model.
- Noise: This is the error from the data itself. If the real values are dispersed around the expected value $\bar{y}(x)$, we say that the data has “noise”. In other words, noise is the inherent error of the data.

In conclusion, the test error is the combination of these three errors.
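As a small worked example of the decomposition, consider a hypothetical setting that is not part of the lecture: the labels do not depend on $x$, so $y=\mu+\epsilon$ with $E[\epsilon]=0$ and $\text{Var}(\epsilon)=\sigma^2$, the $n$ training labels are drawn independently, and the model simply predicts their mean, $h_D(x)=\frac{1}{n}\sum_{i=1}^{n}y_i$. Under these assumptions each term can be computed by hand.

$\bar{h}(x)=E_D[h_D(x)]=\mu, \quad \bar{y}(x)=\mu$

$\text{Variance}=E_{x,D}[(h_D(x)-\mu)^2]=\frac{\sigma^2}{n}$

$\text{Noise}=E_{x,y}[(\mu-y)^2]=\sigma^2$

$\text{Bias}^2=E_x[(\mu-\mu)^2]=0$

$\text{Error}=\frac{\sigma^2}{n}+\sigma^2+0=\sigma^2\left(1+\frac{1}{n}\right)$

In this toy case more training data shrinks the variance, while the noise term $\sigma^2$ stays no matter how large $n$ becomes.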

**Bias-Variance trade-off**

Now, let’s talk about the trade-off relation.

What’s obvious is that noise cannot be reduced, since it already exists in the data and there is no way to handle it.

But what about bias and variance?

First, bias indicates how much the model is “underfit”.

A strongly biased model cannot easily be tuned to follow the actual distribution of the real values, because this error is inherent in the model from the beginning.

The way to reduce bias is similar to the way to overcome under-fitting: give the model enough capacity and train it sufficiently.

Second, variance indicates how much the model is “overfit”.

Over-fitting makes the model too sensitive to the training data, which leads to predictions that are far off on the actual test data.

In other words, the model captures too many features from the data and tries too hard to fit every single data point.

So each prediction can land far from the generalized expected value.

As we can see, the relation between bias and variance mirrors that between under-fitting and over-fitting.

Then we are also able to notice why these two are in a trade-off.

Since we cannot do anything about noise, what remains is bias and variance, and in practice trying to lower one of them tends to make the other higher.

This also means that if we try to fix either over-fitting or under-fitting, we become more likely to run into the other.

This is usually visualized by plotting bias, variance and the total test error against model complexity.

The model complexity is related to how much the model is overfit (or underfit).

If the model is too complex, it is trained to fit too many features, and this leads to over-fitting.

On the other hand, low complexity makes the model insensitive to the data and keeps it from fitting the data sufficiently, which is under-fitting.

The Bias-Variance trade-off concludes that we have to find the optimal complexity that minimizes the total test error while keeping both bias and variance as low as possible.
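The effect of model complexity can be sketched with a short experiment. The code below is an illustration under assumptions that are not in the original derivation: the expected label is $\bar{y}(x)=\sin(\pi x)$, the noise is Gaussian, the training inputs are a fixed grid (only the labels are redrawn for each $D$), and complexity is the degree of a least-squares polynomial. The helper names `fit_poly` and `predict` are made up for this sketch.

```python
import math
import random
import statistics

def fit_poly(xs, ys, degree):
    # Least-squares polynomial fit via the normal equations
    # (adequate for the small degrees used here).
    m = degree + 1
    # Augmented matrix [A | b]: A[i][j] = sum x^(i+j), b[i] = sum y * x^i
    aug = [[sum(x ** (i + j) for x in xs) for j in range(m)]
           + [sum(y * x ** i for x, y in zip(xs, ys))] for i in range(m)]
    for col in range(m):  # Gaussian elimination with partial pivoting
        pivot = max(range(col, m), key=lambda r: abs(aug[r][col]))
        aug[col], aug[pivot] = aug[pivot], aug[col]
        for r in range(col + 1, m):
            factor = aug[r][col] / aug[col][col]
            for c in range(col, m + 1):
                aug[r][c] -= factor * aug[col][c]
    coef = [0.0] * m
    for i in reversed(range(m)):  # back substitution
        s = sum(aug[i][j] * coef[j] for j in range(i + 1, m))
        coef[i] = (aug[i][m] - s) / aug[i][i]
    return coef

def predict(coef, x):
    return sum(c * x ** k for k, c in enumerate(coef))

def y_bar(x):
    # Assumed ground truth for this sketch
    return math.sin(math.pi * x)

rng = random.Random(0)
noise_sd = 0.3
train_x = [-1.0 + 2.0 * i / 19 for i in range(20)]  # fixed training inputs
grid = [-1.0 + 0.05 * j for j in range(41)]         # test inputs
n_sets = 200

results = {}  # degree -> (bias^2, variance)
for degree in (0, 1, 3, 6):
    preds = []
    for _ in range(n_sets):
        ys = [y_bar(x) + rng.gauss(0.0, noise_sd) for x in train_x]
        coef = fit_poly(train_x, ys, degree)
        preds.append([predict(coef, x) for x in grid])
    h_bar = [statistics.fmean(col) for col in zip(*preds)]
    variance = statistics.fmean(
        [(h - hb) ** 2 for row in preds for h, hb in zip(row, h_bar)])
    bias_sq = statistics.fmean(
        [(hb - y_bar(x)) ** 2 for x, hb in zip(grid, h_bar)])
    results[degree] = (bias_sq, variance)
    print(f"degree={degree}: bias^2={bias_sq:.4f}, variance={variance:.4f}")
```

In this setup the lowest degree should show the largest squared bias and the highest degree the largest variance, so the sum of the two is smallest somewhere in between. The noise term is left out because it is the same for every degree.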

In conclusion, the Bias-Variance trade-off is a slightly more mathematical and precise description of over-fitting, under-fitting and the relation between them.

It is a simple, but very important concept worth knowing for all machine learning tasks.

I had only known this roughly before, but I was able to understand it more deeply this time.