Regression models, both simple and multivariate, are the backbone of many types of machine learning. Using the structure you specify, these tools create equations that fit the modeled dataset as closely as possible. Regression algorithms find the optimal equation by minimizing the error between the values predicted by the model and the data provided.
That said, no regression model will ever be perfect (and if your model appears to be nearly perfect, I recommend checking for overfitting). There will always be a difference between the values predicted by a regression model and the actual data, and these differences change drastically as you modify the structure of the model. This is where residuals come into play.
How do you find the residuals?
The residual for a specific data point is the difference between the value predicted by the regression and the observed value for that data point. Calculating the residuals provides a valuable clue as to how well your model fits the dataset: a poorly fitted regression model will produce very large residuals for some data points, indicating that the model is not capturing a trend in the data, while a well-fitting model will produce small residuals for all data points.
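As a minimal numeric illustration (these values are made up for the example, not taken from the article's dataset): if a model predicts 4.2 at a point where the observed value is 4.0, the residual is 0.2.

```python
# Hypothetical values, purely for illustration
predicted = 4.2   # value the model calculates at some point
observed = 4.0    # value actually measured at that point

# Residual as defined here: predicted minus observed
residual = predicted - observed
print(round(residual, 2))  # 0.2
```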
Let’s talk about how to calculate residuals.
Creating a Sample Dataset
In order to calculate residuals, we first need a dataset for the example. We can create a fairly trivial dataset using Python's pandas, NumPy, and scikit-learn packages. You can use the following code to create a dataset that is basically y = x with a little noise added at each point.
```python
import pandas as pd
import numpy as np

data = pd.DataFrame(index = range(0, 10))
data['Independent'] = data.index
np.random.seed(0)
bias = 0
stdev = 15
data['Noise'] = np.random.normal(bias, stdev, size = len(data.index))
data['Dependent'] = data['Independent'] * data['Noise']/100 + data['Independent']
```
This code performs the following steps:
- Imports the pandas and NumPy packages you will need for the analysis
- Creates a pandas dataframe with 10 Independent values, represented by the index range 0 through 9
- Calculates a random error percentage (Noise) for each point using a normal distribution with a standard deviation of 15%
- Calculates Dependent data equal to the Independent data plus the error introduced by the Noise
We can now use this dataframe as a sample dataset.
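To sanity-check the result, you can rebuild the dataset and print its first few rows (the exact values depend on the fixed random seed):

```python
import numpy as np
import pandas as pd

# Rebuild the sample dataset from the code above
data = pd.DataFrame(index = range(0, 10))
data['Independent'] = data.index
np.random.seed(0)
data['Noise'] = np.random.normal(0, 15, size = len(data.index))
data['Dependent'] = data['Independent'] * data['Noise']/100 + data['Independent']

print(data.head())
```

Note that the first Dependent value is exactly zero, because the noise is multiplied by the Independent value, which is 0 at the first point.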
Implementing a Linear Model
The Independent variable is our x data and the Dependent variable is our y. Now we need a model that predicts y as a function of x. We can do this using the linear regression model from scikit-learn with the following code.
```python
from sklearn.linear_model import LinearRegression

model = LinearRegression()
model.fit(np.array(data['Independent']).reshape((-1, 1)), data['Dependent'])
data['Calculated'] = model.predict(np.array(data['Independent']).reshape((-1, 1)))
```
This code works as follows:
- Imports the scikit-learn LinearRegression model to use in the analysis
- Creates an instance of LinearRegression, which will become our regression model
- Fits the model using the Independent and Dependent variables in our dataset
- Adds a new Calculated column to our dataframe, storing the dependent values as predicted by the model
If the model matched the dataset perfectly, the values in the Calculated column would match the values in the Dependent column. We can plot the data to see whether or not this is the case.
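Short of plotting, you can also check the fit numerically. Here is a minimal sketch that rebuilds the dataset and model from above and prints the largest gap between calculated and observed values (a perfect fit would give zero):

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression

# Rebuild the sample dataset
data = pd.DataFrame(index = range(0, 10))
data['Independent'] = data.index
np.random.seed(0)
data['Noise'] = np.random.normal(0, 15, size = len(data.index))
data['Dependent'] = data['Independent'] * data['Noise']/100 + data['Independent']

# Fit the first-order model and predict
model = LinearRegression()
X = np.array(data['Independent']).reshape((-1, 1))
model.fit(X, data['Dependent'])
data['Calculated'] = model.predict(X)

# If the fit were perfect, this maximum gap would be zero
print((data['Calculated'] - data['Dependent']).abs().max())
```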
Calculating the Model Residuals
We could have seen this coming, because we used a first-order linear regression model to fit a dataset with known noise. In other words, we know the model would have fit y = x perfectly, but the variation we added to each data point made each y a little different from the corresponding x. Instead of perfection, we see gaps between the regression line and the data points. These gaps are called residuals. See the following plot, which highlights the residual for the point at x = 4.
To calculate the residuals, find the difference between the calculated value of the dependent variable and the observed value of the dependent variable. In other words, we need the difference between the Calculated and Dependent columns in our dataframe. We can do this with the following code:
```python
data['Residual'] = data['Calculated'] - data['Dependent']
```
We can now plot the residuals to see how they vary across the dataset. Here is an example of plotted output:
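The article's plot isn't reproduced here, but a minimal sketch of such a residual plot using matplotlib (which the original code doesn't show) could look like this:

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend for scripted use
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression

# Rebuild the dataset, model, and residuals from above
data = pd.DataFrame(index = range(0, 10))
data['Independent'] = data.index
np.random.seed(0)
data['Noise'] = np.random.normal(0, 15, size = len(data.index))
data['Dependent'] = data['Independent'] * data['Noise']/100 + data['Independent']

model = LinearRegression()
X = np.array(data['Independent']).reshape((-1, 1))
model.fit(X, data['Dependent'])
data['Residual'] = model.predict(X) - data['Dependent']

# Scatter the residuals against x, with a zero line for reference
plt.scatter(data['Independent'], data['Residual'])
plt.axhline(0, color='gray', linestyle='--')
plt.xlabel('X Data')
plt.ylabel('Residual')
plt.savefig('residuals.png')
```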
Notice how some of the residuals are greater than zero and some are less than zero. This will always be the case! Because linear regression drives the total error between the data and the model to zero, the result must contain errors less than zero to balance the errors greater than zero.
You can also see that some of the errors are bigger than others. Several of the residuals have an absolute value in the range of 0.25 to 0.5, while others fall in the range of 0.75 to 1. These are the signs of a model that fits the data well. If there were a considerable difference, such as a single point or a cluster of points with a much larger residual, you would know your model has a problem. For example, if the residual at x = 4 had been -5, that would be a clear sign of a problem. Note that a residual that large would probably indicate an outlier in the dataset, and you should consider removing the point using interquartile range (IQR) methods.
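One common IQR rule flags points whose residual falls more than 1.5 IQRs outside the quartiles. This is a standard heuristic rather than code from the article, and the residual values below are made up to include one obvious outlier:

```python
import numpy as np

# Hypothetical residuals, with one obvious outlier at the end
residuals = np.array([0.3, -0.4, 0.25, -0.3, 0.5, -0.45, 0.35, -5.0])

# Quartiles and the interquartile range
q1, q3 = np.percentile(residuals, [25, 75])
iqr = q3 - q1
low, high = q1 - 1.5 * iqr, q3 + 1.5 * iqr

# Boolean mask: True for points inside the fences, False for outliers
keep = (residuals >= low) & (residuals <= high)
print(residuals[~keep])
```

With these values, only the -5.0 point is flagged; the rest of the residuals stay inside the fences.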
Identifying a Poor Model Fit
To show how residuals can reveal a poor model fit, consider a second dataset. To create the new dataset, I made two changes. The modified lines of code are as follows:
```python
data = pd.DataFrame(index = range(0, 100))
data['Dependent'] = data['Independent'] * data['Noise']/100 + data['Independent']**2
```
The first change increased the index length of the dataframe to 100, creating a dataset with 100 points instead of the previous 10. The second change made the Dependent variable a function of the square of the Independent variable, creating a parabolic dataset. Performing the same linear regression as before (without changing a single character of the code) and plotting the data yields the following:
Since this is just an example to demonstrate the point, we can tell at a glance that the regression does not fit the data well. There is an obvious curve in the data, but the regression is a single straight line. The regression underpredicts at the low and high ends, and overpredicts in the middle. We also know this will be a bad fit because it is a first-order linear regression on a parabolic dataset.
That said, this visualization effectively shows how looking at residuals can reveal a model with a bad fit. Consider the following plot, which I generated using exactly the same code as the previous residual plot.
Can you see the trend in the residuals? The residuals are strongly negative when the X Data is either low or high, indicating that the model underpredicts the data at those points. The residuals are positive when the X Data falls near the midpoint, indicating that the model overpredicts the data in that range. Clearly, the model has the wrong shape, and since the residual curve shows only one inflection point, we can reasonably guess that we need to increase the order of the model by one (to two).
If we repeat the process using second-order regression, we get the following residual plot.
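The article doesn't show its second-order code, but one common way to sketch it is with scikit-learn's PolynomialFeatures, which expands x into [x, x²] so the same LinearRegression can fit a parabola:

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression
from sklearn.preprocessing import PolynomialFeatures

# Rebuild the parabolic dataset from the modified code above
data = pd.DataFrame(index = range(0, 100))
data['Independent'] = data.index
np.random.seed(0)
data['Noise'] = np.random.normal(0, 15, size = len(data.index))
data['Dependent'] = data['Independent'] * data['Noise']/100 + data['Independent']**2

# Expand x into [x, x^2] and fit a linear model on those features
X = np.array(data['Independent']).reshape((-1, 1))
X2 = PolynomialFeatures(degree=2, include_bias=False).fit_transform(X)
model = LinearRegression()
model.fit(X2, data['Dependent'])
data['Residual'] = model.predict(X2) - data['Dependent']
```

Because the underlying data is x² plus noise, the fitted coefficient on the x² feature should come out close to 1.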
The only discernible trend here is that the residuals grow as the X Data increases. Since the Dependent data includes noise that is a function of the X Data, we expect that to happen. What we don't see are very large residuals or indications that the dataset has a different shape. This means we now have a model that fits the dataset well.
And with that, you are ready to start evaluating the performance of your machine learning models by calculating and plotting residuals!