Principal component regression (PCR) is a technique for analyzing multiple regression data that suffer from multicollinearity. PCR is derived from principal component analysis (PCA): the predictors are first transformed with PCA, and the regression is then fitted on a subset of the resulting principal components. By accepting a small amount of bias, PCR reduces the variance of the regression estimates, which can make them more reliable. In this article, we will focus on implementing PCR on a regression problem. Here are the topics covered.
- About PCA Architecture
- About Multicollinearity in Regression
- About Principal Components Regression (PCR)
- Implementing PCR in Python
Let’s start with the PCA architecture from which PCR is derived.
About PCA Architecture
Principal component analysis (PCA) identifies the main characteristics of the data by reducing the dimensionality of the feature space. In other words, it is a tool that condenses the features of the data into a small number of principal components, which are then passed to the learner. PCA has three main ingredients that help reduce dimensionality:
- The covariance matrix measures how strongly the variables are associated with each other.
- The eigenvectors are the directions in which the data is dispersed.
- The eigenvalues are the relative importance of those directions.
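The three ingredients above can be sketched with plain NumPy. This is a minimal illustration on synthetic data, not the article's dataset; the variable names and the choice of three features are assumptions made for the example.

```python
import numpy as np

# Synthetic data: 200 samples, 3 features, two of them strongly correlated
rng = np.random.default_rng(0)
x1 = rng.normal(size=200)
x2 = x1 + 0.1 * rng.normal(size=200)   # nearly a copy of x1
x3 = rng.normal(size=200)              # independent feature
X = np.column_stack([x1, x2, x3])

# 1. Center the data and compute the covariance matrix
X_centered = X - X.mean(axis=0)
cov = np.cov(X_centered, rowvar=False)

# 2. Eigendecomposition (eigh is meant for symmetric matrices, ascending order)
eigenvalues, eigenvectors = np.linalg.eigh(cov)

# 3. Sort the directions by decreasing eigenvalue, i.e. by importance
order = np.argsort(eigenvalues)[::-1]
eigenvalues = eigenvalues[order]
eigenvectors = eigenvectors[:, order]

# Share of the total variance carried by each direction
explained_ratio = eigenvalues / eigenvalues.sum()

# Project onto the top two directions: 3 features become 2 components
X_reduced = X_centered @ eigenvectors[:, :2]

print("explained variance ratio:", np.round(explained_ratio, 4))
print("reduced shape:", X_reduced.shape)
```

Because two of the three synthetic features are nearly identical, most of the variance should concentrate in the first two directions, which is why dropping the last one loses little information.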
About Multicollinearity in Regression
Multicollinearity in regression means that the independent variables are strongly correlated with one another. It needs to be addressed because it causes the following problems:
- Difficulty in understanding the importance of individual features to the learner.
- Instability in coefficient estimation.
- Overfitting of the learner.
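The instability in coefficient estimation can be seen in a small experiment. The sketch below is an illustration on synthetic data: the helper `fit_ols`, the noise levels, and the seeds are all assumptions chosen for the demonstration. Two nearly identical features are used to fit the same relationship many times with fresh noise.

```python
import numpy as np

def fit_ols(X, y):
    """Return least-squares coefficients (no intercept, to keep the sketch short)."""
    coef, *_ = np.linalg.lstsq(X, y, rcond=None)
    return coef

rng = np.random.default_rng(1)
n = 100
x1 = rng.normal(size=n)
x2 = x1 + 0.01 * rng.normal(size=n)   # almost an exact copy of x1
X = np.column_stack([x1, x2])

corr = np.corrcoef(x1, x2)[0, 1]      # pairwise correlation of the two features
cond = np.linalg.cond(X)              # large values signal near-collinearity

# Refit y = x1 + x2 + noise on 50 independent noise draws
coefs = np.array([
    fit_ols(X, x1 + x2 + np.random.default_rng(seed).normal(size=n))
    for seed in range(100, 150)
])

spread_individual = coefs.std(axis=0)   # how much each coefficient varies
spread_sum = coefs.sum(axis=1).std()    # how much their sum varies

print(f"correlation={corr:.4f}, condition number={cond:.1f}")
print("std of each coefficient:", spread_individual)
print("std of coefficient sum: ", spread_sum)
```

The individual coefficients swing widely from one fit to the next, while their sum stays close to 2. The data pin down the combined effect of the two features but not how to divide it between them.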
Multicollinearity in regression can be handled using principal component regression (PCR). Let’s see how PCR controls multicollinearity.
About Principal Components Regression (PCR)
The principal component regression (PCR) algorithm is an approach to dealing with multicollinearity in a data set. Although multiple linear regression can fit the training set well, it normally suffers from high variance when the predictors are collinear. PCR therefore adds a small bias to the model: it aims to keep accuracy high while significantly reducing variance. This is achieved by applying PCA to the features before training and regressing on the leading components only. Let’s try to learn PCR by implementing it on data.
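For readers who prefer a formula, the procedure can be written compactly. The notation below is an assumption introduced for this sketch and does not come from the article: X is the standardized predictor matrix with full column rank, y is the centered response, k is the number of retained components, and the noise has constant variance σ².

```latex
% X: n x p matrix of standardized predictors, y: centered response (n x 1)
% Eigendecomposition of the (unnormalized) covariance of the predictors
X^\top X = V \Lambda V^\top, \qquad
\Lambda = \operatorname{diag}(\lambda_1 \ge \lambda_2 \ge \dots \ge \lambda_p)

% Scores on the first k principal components (V_k = first k columns of V)
Z_k = X V_k

% Ordinary least squares on the scores
\hat{\gamma}_k = (Z_k^\top Z_k)^{-1} Z_k^\top y

% Coefficients expressed in terms of the original predictors
\hat{\beta}_{\mathrm{PCR}} = V_k \hat{\gamma}_k

% Variance comparison: PCR drops the terms with small \lambda_j
\operatorname{Var}(\hat{\beta}_{\mathrm{OLS}})
  = \sigma^2 \sum_{j=1}^{p} \frac{1}{\lambda_j} v_j v_j^\top,
\qquad
\operatorname{Var}(\hat{\beta}_{\mathrm{PCR}})
  = \sigma^2 \sum_{j=1}^{k} \frac{1}{\lambda_j} v_j v_j^\top
```

Multicollinearity shows up as very small eigenvalues, and those are the terms that inflate the ordinary least squares variance. Truncating the sum at k removes them, at the cost of some bias whenever the discarded directions carry signal.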
Implementing PCR in Python
The objective of this regression model is to predict the player’s salary based on different characteristics.
import numpy as np
import pandas as pd
Reading the dataset
salary = pd.read_csv("Hitters.csv")
df = salary.copy()
df.dropna(inplace=True)
df.shape, salary.shape
((263, 20), (322, 20))
This dataset is taken from Kaggle; the links are given in the references. The original dataframe is copied to a second dataframe for preprocessing so that the original remains unchanged.
Then all rows with missing values are removed, and finally the number of rows and columns of the original and copied dataframes is checked. There are three categorical columns (League, Division, and NewLeague) and the rest are continuous features; the categorical features need to be encoded for later use.
Encoding the categorical features:
df = pd.get_dummies(df, columns=['League', 'Division', 'NewLeague'])
df.head()
The data is encoded using the pandas get_dummies function. The three original features used to create the dummies are replaced with a total of six new columns, as shown in the output of df.head() above. Since the regression problem is to predict the salary of the players, the dependent variable is “Salary”.
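To make the effect of the encoding concrete, here is a small sketch on a toy dataframe. The rows and category labels are made up for illustration and are not taken from Hitters.csv; only the column names mirror the article.

```python
import pandas as pd

# Toy frame with three two-level categorical columns (values are illustrative)
toy = pd.DataFrame({
    "Hits": [81, 130, 141],
    "League": ["A", "N", "A"],
    "Division": ["W", "E", "W"],
    "NewLeague": ["A", "N", "N"],
})
categorical = ["League", "Division", "NewLeague"]

# Each two-level column is replaced by two indicator columns
encoded = pd.get_dummies(toy, columns=categorical)
print(list(encoded.columns))

# drop_first=True keeps a single indicator per column instead of two
encoded_compact = pd.get_dummies(toy, columns=categorical, drop_first=True)
print(list(encoded_compact.columns))
```

With two indicators per category, each pair always sums to one, so the pairs are perfectly collinear by construction. PCA absorbs that redundancy, but `drop_first=True` is an option when the encoded columns feed directly into a linear model.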
There are now a total of 23 columns in this dataframe, of which only a few are important for analysis and prediction. The feature space therefore needs to be reduced, and this is done with PCA.
Application of PCA
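from sklearn.decomposition import PCA
from sklearn.preprocessing import scale

# Separate the target from the features (this step is assumed; the original does not show it)
y = df["Salary"]
X = df.drop(columns=["Salary"])

pca = PCA()
X_red = pca.fit_transform(scale(X))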
PCA is imported from the sklearn library and instantiated as pca for convenience. It is fitted on the scaled independent variables for dimensionality reduction. The cumulative percentage of variance in the features that is explained as each principal component is added can be computed as follows.
np.cumsum(np.round(pca.explained_variance_ratio_, decimals = 4)*100)[0:5]
The output is interpreted as follows:
- The first principal component explains 32.62% of the variance in the independent variables.
- The first two principal components together explain 53.02%.
- Similarly, adding further components raises the cumulative total to 68.82%, 78.07%, 85.39%, and 89.31%.
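One way to turn these cumulative percentages into a component count is to pick the smallest number of components that reaches a chosen threshold. The sketch below uses the figures from the list above. The helper name `components_for_threshold` and the 85% cut-off are assumptions for illustration; the article does not state a threshold.

```python
import numpy as np

def components_for_threshold(cumulative_pct, threshold_pct):
    """Smallest number of components whose cumulative variance reaches the threshold."""
    cumulative_pct = np.asarray(cumulative_pct, dtype=float)
    if threshold_pct > cumulative_pct[-1]:
        raise ValueError("threshold is not reached by the supplied components")
    # searchsorted returns the first position whose value is >= threshold
    return int(np.searchsorted(cumulative_pct, threshold_pct, side="left")) + 1

# Cumulative percentages reported in the list above
cumulative = [32.62, 53.02, 68.82, 78.07, 85.39, 89.31]

# Variance added by each individual component
increments = np.diff(cumulative, prepend=0.0)

k = components_for_threshold(cumulative, 85.0)   # 85% is an assumed cut-off
print("per-component gain:", np.round(increments, 2))
print("components needed:", k)
```

With an 85% threshold this selects five components. The per-component gains also show diminishing returns: the sixth component adds under four percentage points.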
Based on these figures, we use five principal components, which together capture about 85% of the variance in the features, in the regression learner. Now that the number of components is decided, the data set is split in a 70:30 ratio into a training set and a test set.
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.30, random_state=42)
Let’s build the final model using five principal components in the linear regression model.
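from sklearn.linear_model import LinearRegression

# Fit PCA on the scaled training features, then project the scaled test features
# Note: scale(X_test) standardizes the test set with its own mean and variance
X_red_train = pca.fit_transform(scale(X_train))
X_red_test = pca.transform(scale(X_test))[:, 0:5]

# Regress the target on the first five principal components
lm = LinearRegression()
pcr = lm.fit(X_red_train[:, 0:5], y_train)
y_pred = pcr.predict(X_red_test)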
The final model is trained on the training set and then used to predict salaries for the test set. Let’s check how well the model performed.
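The evaluation step is not shown in the article, so here is a minimal sketch of how the root mean squared error could be computed. The helper name `rmse` and the example numbers are assumptions; with the variables from the previous step the call would be `rmse(y_test, y_pred)`.

```python
import numpy as np

def rmse(y_true, y_pred):
    """Root mean squared error between observed and predicted values."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    return float(np.sqrt(np.mean((y_true - y_pred) ** 2)))

# Tiny illustrative check with made-up numbers (not from the Hitters data)
example = rmse([100, 200, 300], [110, 190, 330])
print(round(example, 3))
```

RMSE is expressed in the same units as the target, so it should be read against the scale of the Salary column.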
The root mean squared error (RMSE) is around 398. This can be improved by further tuning the model; I’ll leave that up to you.
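One possible direction for that tuning is to treat the number of components as a hyperparameter and choose it by cross-validation. The sketch below is an assumption about how this could be set up with scikit-learn, not the article's method. It uses synthetic data from `make_regression` as a stand-in for the Hitters features, and it wraps scaling, PCA, and regression in a `Pipeline` so that the scaler and PCA are fitted on the training folds only.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.decomposition import PCA
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for the article's data (sizes chosen for illustration)
X, y = make_regression(n_samples=263, n_features=19, n_informative=8,
                       noise=20.0, random_state=42)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.30, random_state=42)

# Scaling, PCA, and regression chained as a single estimator
pcr = Pipeline([
    ("scale", StandardScaler()),
    ("pca", PCA()),
    ("ols", LinearRegression()),
])

# Try every possible number of components with 5-fold cross-validation
search = GridSearchCV(
    pcr,
    param_grid={"pca__n_components": list(range(1, X.shape[1] + 1))},
    scoring="neg_root_mean_squared_error",
    cv=5,
)
search.fit(X_train, y_train)

best_k = search.best_params_["pca__n_components"]
test_rmse = float(np.sqrt(np.mean((y_test - search.predict(X_test)) ** 2)))
print("best number of components:", best_k)
print("test RMSE:", round(test_rmse, 2))
```

Selecting the component count by prediction error rather than by explained variance alone can matter, because the directions with the most variance in the features are not necessarily the most useful for predicting the target.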
Applying principal component analysis before regression can reduce multicollinearity and contribute to better prediction with fewer features in less time. PCR also reduces the risk of the learner overfitting.