How to Find Residuals in Regression Analysis


Regression models, both simple and multivariate, are the backbone of many types of machine learning. Using the structure you specify, these tools create equations that match the modeled dataset as closely as possible. Regression algorithms find the optimal equation by minimizing the error between the values predicted by the model and the data provided.

That said, no regression model will ever be perfect (and if your model appears to be near perfect, I recommend checking for overfitting). There will always be a difference between the values predicted by a regression model and the actual data, and these differences change as you modify the structure of the model. This is where residuals come into play.

How do you find the residuals?

The residual for a specific data point is the difference between the value predicted by the regression and the observed value for that data point. Calculating the residuals provides a valuable clue as to how well your model fits the dataset. To find them, we take the difference between the value the model calculates for the dependent variable and the observed value of the dependent variable.

A poorly fitting regression model will produce very large residuals for some data points, indicating that the model is not capturing a trend in the dataset. A well-fitting regression model will produce small residuals for all data points.
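As a quick numeric illustration of the definition (the values here are made up, not taken from the dataset we build below):

# Hypothetical point: the model predicts 4.2, the data shows 4.0
predicted = 4.2
observed = 4.0

# Following this article's convention: residual = predicted - observed
residual = predicted - observed
print(residual)  # roughly 0.2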

Let’s talk about how to calculate residuals.

Creating a Sample Dataset

In order to calculate residuals, we first need a dataset to work with. We can create a fairly trivial dataset using Python's pandas, NumPy, and scikit-learn packages. You can use the following code to create a dataset that is basically y = x with a little noise added at each point.

import pandas as pd
import numpy as np

# Ten rows; the index doubles as the independent variable
data = pd.DataFrame(index = range(0, 10))
data['Independent'] = data.index

# Seed the generator so the "random" noise is reproducible
np.random.seed(0)
bias = 0
stdev = 15

# Noise is a percentage drawn from a normal distribution
data['Noise'] = np.random.normal(bias, stdev, size = len(data.index))
data['Dependent'] = data['Independent'] * data['Noise']/100 + data['Independent']

This code performs the following steps:

  • Imports the pandas and NumPy packages you will need for the analysis
  • Creates a pandas dataframe with 10 Independent values, spanning 0 through 9
  • Calculates a random error percentage (Noise) for each point using a normal distribution with a standard deviation of 15%
  • Calculates the Dependent data, which is equal to the Independent data plus the error introduced by the Noise

We can now use this dataframe as a sample dataset.
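If you want to sanity-check the dataset before fitting anything, printing the first few rows is enough (the exact Noise and Dependent values depend on the random seed):

# Peek at the generated columns: Independent, Noise, Dependent
print(data.head())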

Want more data science tutorials? We've got you covered: How to Use Float in Python (With Sample Code!)

Implementing a Linear Model

The Independent variable is our x data and the Dependent variable is our y. Now we need a model that predicts y as a function of x. We can do this using the linear regression model from scikit-learn with the following code.

from sklearn.linear_model import LinearRegression

model = LinearRegression()

# scikit-learn expects a 2D feature array, hence the reshape
model.fit(np.array(data['Independent']).reshape((-1, 1)), data['Dependent'])
data['Calculated'] = model.predict(np.array(data['Independent']).reshape((-1, 1)))

This code works as follows:

  • Imports the scikit-learn LinearRegression model to use in the analysis
  • Creates an instance of LinearRegression, which becomes our regression model
  • Fits the model using the Independent and Dependent variables in our dataset
  • Adds a new column to our dataframe storing the dependent values as predicted by our model (Calculated)

If the model matched the dataset perfectly, the values in the Calculated column would match the values in the Dependent column. We can plot the data to see whether this is the case.
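The article's plotting code isn't shown; a minimal sketch with matplotlib (an assumption on my part, any plotting library will do) could look like this:

import matplotlib.pyplot as plt

# Observed data as points, model predictions as a line
plt.scatter(data['Independent'], data['Dependent'], label='Data points')
plt.plot(data['Independent'], data['Calculated'], color='red', label='Regression line')
plt.xlabel('X Data')
plt.ylabel('Y Data')
plt.legend()
plt.show()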

…Nope.

Calculating the Model Residuals

We could have seen this coming, because we used a first-order linear regression model to fit a dataset with known noise. In other words, we know the model would have fit y = x perfectly, but the variation we added at each data point made each y a little different from its corresponding x. Instead of perfection, we see gaps between the regression line and the data points. These gaps are called residuals. See the following graph, which highlights the residual for the point at x = 4.

[Figure: regression line and data points, with the residual at x = 4 highlighted]
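To pull out that single residual programmatically (using the dataframe built above, whose index doubles as the Independent value):

# Residual at x = 4, following the predicted-minus-observed convention
point_residual = data.loc[4, 'Calculated'] - data.loc[4, 'Dependent']
print(point_residual)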

To calculate the residuals, find the difference between the calculated value of the dependent variable and the observed value of the dependent variable. In other words, we need to take the difference between the Calculated and Dependent columns in our dataframe. We can do this with the following code:

# Residual = value predicted by the model minus the observed value
data['Residual'] = data['Calculated'] - data['Dependent']

We can now plot the residuals to see how they vary across the dataset.
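A minimal sketch of that plot (again assuming matplotlib):

import matplotlib.pyplot as plt

# Scatter the residuals against the independent variable,
# with a reference line at zero
plt.scatter(data['Independent'], data['Residual'])
plt.axhline(0, color='gray', linewidth=1)
plt.xlabel('X Data')
plt.ylabel('Residual')
plt.show()

Here is an example of the plotted output: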

[Figure: residuals plotted against X Data, scattered above and below zero]

Notice how some of the residuals are greater than zero and some are less than zero. This will always be the case: because a least squares fit with an intercept drives the sum of the residuals to zero, negative residuals must offset the positive ones.
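You can check this property directly on the dataframe; the printed sum should be zero up to floating-point error:

# OLS with an intercept term forces the residuals to sum to ~0
print(data['Residual'].sum())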

You can also see that some of the errors are bigger than others. Several of the residuals fall in the range of 0.25 to 0.5, while others have an absolute value between 0.75 and 1. This is the kind of pattern you want to see in a well-fitting model. If there is a considerable difference, such as a single point or a cluster of points with much larger residuals, you know your model has a problem. For example, if the residual at x = 4 had been -5, that would be a clear sign of trouble. A residual that large would probably indicate an outlier in the dataset, and you should consider removing the point using the interquartile range (IQR) method, sketched below.
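As a sketch of what that IQR check could look like on the residuals (the 1.5 multiplier is the conventional cutoff, not something derived from this dataset):

# Quartiles and interquartile range of the residuals
q1 = data['Residual'].quantile(0.25)
q3 = data['Residual'].quantile(0.75)
iqr = q3 - q1

# Flag points whose residual falls outside 1.5 * IQR of the quartiles
outliers = data[(data['Residual'] < q1 - 1.5 * iqr) |
                (data['Residual'] > q3 + 1.5 * iqr)]
print(outliers)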

Want to know more about the IQR? Head right this way: How to Find Outliers With IQR Using Python

Identifying a Poor Model Fit

To demonstrate how residuals can reveal a poor model fit, consider a second dataset. To create it, I made two changes to the code. The modified lines are as follows:

data = pd.DataFrame(index = range(0, 100))   # 100 points instead of 10

data['Dependent'] = data['Independent'] * data['Noise']/100 + data['Independent']**2   # parabolic shape

The first change increases the length of the dataframe index to 100, creating a dataset with 100 points instead of the previous 10. The second change makes the Dependent variable a function of the square of the Independent variable, creating a parabolic dataset. Performing the same linear regression as before (not a single letter of the regression code changed) and plotting the data gives the figure that follows.
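For reproducibility, here is the complete second example collected in one sketch; it is identical to the earlier code apart from the two modified lines:

import pandas as pd
import numpy as np
from sklearn.linear_model import LinearRegression

data = pd.DataFrame(index = range(0, 100))   # 100 points instead of 10
data['Independent'] = data.index

np.random.seed(0)
data['Noise'] = np.random.normal(0, 15, size = len(data.index))
data['Dependent'] = data['Independent'] * data['Noise']/100 + data['Independent']**2  # parabolic

x = np.array(data['Independent']).reshape((-1, 1))
model = LinearRegression()
model.fit(x, data['Dependent'])
data['Calculated'] = model.predict(x)
data['Residual'] = data['Calculated'] - data['Dependent']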

[Figure: straight regression line fitted to the parabolic dataset]

Since this is just an example to demonstrate the point, we can already tell that the regression does not fit the data well. There is an obvious curve in the data, but the regression is a single straight line: it underpredicts at the low and high ends and overpredicts in the middle. We also know it will be a bad fit because it is a first-order linear regression on a parabolic dataset.

That said, this visualization effectively shows how residuals can reveal a poorly fitting model. Consider the following plot, which I generated using exactly the same code as the previous residual plot.

[Figure: residual plot for the linear fit to the parabolic dataset]

Can you see the trend in the residuals? The residuals are strongly negative when the X Data is low and when it is high, indicating that the model underpredicts the data at those points. The residuals are positive when the X Data falls around the midpoint, indicating that the model overpredicts in that range. Clearly, the model has the wrong shape, and since the residual curve shows a single arc, we can reasonably guess that we need to increase the order of the model by one (to second order).

If we repeat the process using a second-order regression, we get the residual plot shown after the sketch below.
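The article doesn't show how the second-order fit was produced; one common approach with scikit-learn is PolynomialFeatures, which is an assumption on my part rather than necessarily the author's method. This continues from the dataframe built above:

import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.preprocessing import PolynomialFeatures

# Expand x into [x, x^2] so the linear model can fit a parabola
poly = PolynomialFeatures(degree=2, include_bias=False)
x_poly = poly.fit_transform(np.array(data['Independent']).reshape((-1, 1)))

model2 = LinearRegression()
model2.fit(x_poly, data['Dependent'])
data['Calculated'] = model2.predict(x_poly)
data['Residual'] = data['Calculated'] - data['Dependent']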

[Figure: residual plot for the second-order regression]

The only discernible trend here is that the residuals grow as the X Data increases. Since the noise in the Dependent data scales with the X Data, we expect that to happen. What we don't see are very large residuals or signs that the dataset has a different shape from the model. This means we now have a model that fits the dataset well.

And with that, you are ready to start evaluating the performance of your machine learning models by calculating and plotting residuals!
