How to preprocess data in Python

In this article, we will prepare a machine learning model to predict who survived the Titanic. To do that, we first need to clean our data. I will show you how to apply preprocessing techniques to the Titanic dataset.

To start, you will need:

  • Python
  • Numpy
  • Pandas
  • The Titanic dataset (from Kaggle)

What is data preprocessing and why do we need it?

For machine learning algorithms to work, raw data must be converted into a clean data set — in other words, the dataset must be made numeric. To do this, we encode all categorical labels as binary-valued columns. Missing values, or NaNs (not a number), are another common problem: you must either remove the rows that contain them or fill them in with averaged or interpolated values.

Note: Kaggle provides two sets of data: training data and test data. Both data sets must end up with the same columns for the model to produce accurate results.

How to Preprocess Data in Python Step by Step

  1. Load data into Pandas.
  2. Delete columns that are not useful.
  3. Remove rows with missing values.
  4. Create dummy variables.
  5. Take care of missing data.
  6. Convert the data frame to NumPy.
  7. Divide the dataset into training data and test data.

1. Load data into pandas

To work on the data, you can either load the CSV in Excel or in Pandas. For the purposes of this tutorial, we will load CSV data into Pandas.

import pandas as pd

df = pd.read_csv('train.csv')

Let’s take a look at the data format below:

>>> df.info()

Int64Index: 891 entries, 0 to 890
Data columns (total 12 columns):
PassengerId 891 non-null int64
Survived 891 non-null int64
Pclass 891 non-null int64
Name 891 non-null object
Sex 891 non-null object
Age 714 non-null float64
SibSp 891 non-null int64
Parch 891 non-null int64
Ticket 891 non-null object
Fare 891 non-null float64
Cabin 204 non-null object
Embarked 889 non-null object

If you look closely at the summary above, there are 891 rows in total, but Age shows only 714 non-null values (meaning some ages are missing), Embarked is missing two rows, and Cabin is missing a great deal. Object data types are not numeric, so we need to find a way to encode them as numeric values.
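Rather than eyeballing `df.info()`, you can count the missing values per column directly with `isnull().sum()`. A minimal sketch (using a tiny made-up frame, since `train.csv` is not assumed to be on disk):

```python
import pandas as pd
import numpy as np

# Tiny stand-in for the Titanic frame, with deliberate gaps.
df = pd.DataFrame({
    'Age': [22.0, np.nan, 26.0, np.nan],
    'Cabin': [np.nan, 'C85', np.nan, np.nan],
    'Embarked': ['S', 'C', np.nan, 'S'],
})

# One number per column: how many NaNs it contains.
missing = df.isnull().sum()
print(missing)
```

On the real data this prints 177 for Age, 687 for Cabin and 2 for Embarked — the same gaps `df.info()` hints at.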


2. Delete columns that are not useful

Let’s try removing some of the columns that won’t contribute much to our machine learning model. We will start with Name, Ticket and Cabin.

cols = ['Name', 'Ticket', 'Cabin']
df = df.drop(cols, axis=1)

We have removed three columns:

>>> df.info()
PassengerId 891 non-null int64
Survived 891 non-null int64
Pclass 891 non-null int64
Sex 891 non-null object
Age 714 non-null float64
SibSp 891 non-null int64
Parch 891 non-null int64
Fare 891 non-null float64
Embarked 889 non-null object

3. Delete rows with missing values

Then we can remove all rows from the data that have missing values ​​(NaN). Here’s how:

>>> df = df.dropna()

>>> df.info()
Int64Index: 712 entries, 0 to 890
Data columns (total 9 columns):
PassengerId 712 non-null int64
Survived 712 non-null int64
Pclass 712 non-null int64
Sex 712 non-null object
Age 712 non-null float64
SibSp 712 non-null int64
Parch 712 non-null int64
Fare 712 non-null float64
Embarked 712 non-null object

The problem with deleting rows

After removing the rows with missing values, the dataset shrinks from 891 rows to 712, which means we are wasting data. Machine learning models need data to train on and perform well, so let’s preserve as much of it as possible. More on that below.
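You can quantify exactly how much data `dropna()` costs you before committing to it. A short illustrative sketch (again on a toy frame, not the real CSV):

```python
import pandas as pd
import numpy as np

# Any row with even one NaN is lost wholesale by dropna().
df = pd.DataFrame({
    'Age': [22.0, np.nan, 26.0, 35.0],
    'Fare': [7.25, 71.28, np.nan, 8.05],
})

rows_before = len(df)
rows_after = len(df.dropna())
lost = rows_before - rows_after
print(f"dropped {lost} of {rows_before} rows")
```

On the Titanic training data this check reports 179 of 891 rows dropped — about 20 percent of the dataset.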


4. Creating dummy variables

Instead of wasting our data, let’s work from the original 891-row data frame again and convert Pclass, Sex and Embarked to dummy columns in Pandas, dropping the originals after the conversion.

dummies = []
cols = ['Pclass', 'Sex', 'Embarked']
for col in cols:
   dummies.append(pd.get_dummies(df[col]))

Next, concatenate the dummy frames into one:

titanic_dummies = pd.concat(dummies, axis=1)

This produces eight new columns in total; three of them, named 1, 2 and 3, represent the passenger class.

Finally, we concatenate the dummy columns to the original data frame, column-wise:

df = pd.concat((df,titanic_dummies), axis=1)

Now that we have converted the Pclass, Sex and Embarked values to columns, we remove the redundant original columns from the data frame.

df = df.drop(['Pclass', 'Sex', 'Embarked'], axis=1)

Let’s take a look at the new data frame:

>>> df.info()
PassengerId 891 non-null int64
Survived 891 non-null int64
Age 714 non-null float64
SibSp 891 non-null int64
Parch 891 non-null int64
Fare 891 non-null float64
1 891 non-null float64
2 891 non-null float64
3 891 non-null float64
female 891 non-null float64
male 891 non-null float64
C 891 non-null float64
Q 891 non-null float64
S 891 non-null float64
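One refinement worth knowing: the bare column names 1, 2 and 3 above are easy to confuse. `pd.get_dummies` accepts a `prefix` argument (and a `columns` argument, so the loop and concat can collapse into one call) that produces self-describing names instead. A sketch on a toy frame:

```python
import pandas as pd

df = pd.DataFrame({'Pclass': [1, 3, 2], 'Sex': ['male', 'female', 'male']})

# prefix= yields names like 'Pclass_1' instead of the ambiguous bare '1';
# columns= encodes and drops the originals in a single step.
dummies = pd.get_dummies(df, columns=['Pclass', 'Sex'], prefix=['Pclass', 'Sex'])
print(list(dummies.columns))
```

This prints `['Pclass_1', 'Pclass_2', 'Pclass_3', 'Sex_female', 'Sex_male']`, which keeps later inspection of the feature matrix much more readable.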



5. Take care of missing data

Everything is clean now except Age, which has a lot of missing values. We could fill them with a median, or interpolate. Pandas has an interpolate() function that replaces all missing NaNs with interpolated values.

df['Age'] = df['Age'].interpolate()

Now let’s look at the data columns. Note that Age is now complete, filled with the interpolated values.

>>> df.info()
Data columns (total 14 columns):
PassengerId 891 non-null int64
Survived 891 non-null int64
Age 891 non-null float64
SibSp 891 non-null int64
Parch 891 non-null int64
Fare 891 non-null float64
1 891 non-null float64
2 891 non-null float64
3 891 non-null float64
female 891 non-null float64
male 891 non-null float64
C 891 non-null float64
Q 891 non-null float64
S 891 non-null float64
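If interpolation feels arbitrary for this data (the row order has no real meaning for age), filling with the column median is a common alternative, as mentioned above. A minimal sketch on a toy frame:

```python
import pandas as pd
import numpy as np

df = pd.DataFrame({'Age': [22.0, np.nan, 26.0, 38.0]})

# Replace every missing age with the median of the known ages.
df['Age'] = df['Age'].fillna(df['Age'].median())
print(df['Age'].tolist())
```

Here the median of the known ages (22, 26, 38) is 26, so the NaN becomes 26.0. Either approach leaves Age with 891 non-null values.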


6. Convert data frame to NumPy

Now that we’ve converted all the data to numeric values, it’s time to prepare it for the machine learning models. This is where scikit-learn and NumPy come in:

X = the input set, with 14 attributes

y = the output we want to predict, in this case Survived

Now we convert our dataframe from Pandas to NumPy and assign the input and output:

X = df.values
y = df['Survived'].values

X still contains the Survived values, which should not be there, so we delete that column from the NumPy array. With PassengerId first, Survived is the column at index 1.
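The deletion can be done with `np.delete`. A sketch on a toy frame with the same column order as the tutorial (PassengerId first, Survived second):

```python
import numpy as np
import pandas as pd

# Toy frame mirroring the tutorial's column order.
df = pd.DataFrame({
    'PassengerId': [1, 2, 3],
    'Survived': [0, 1, 1],
    'Age': [22.0, 38.0, 26.0],
})

X = df.values
y = df['Survived'].values

# Remove the Survived column (index 1) from the feature matrix.
X = np.delete(X, 1, axis=1)
print(X.shape)
```

An alternative that avoids index arithmetic entirely is `X = df.drop('Survived', axis=1).values`, which reads the intent directly off the column name.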

7. Divide the dataset into training data and test data

Now that we are ready with X and y, let’s split the dataset: we’ll allocate 70 percent to training and 30 percent to testing using scikit-learn’s model_selection.

from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=0)
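You can sanity-check the split proportions directly. A sketch with 100 synthetic samples, so the 70/30 division is easy to verify by eye:

```python
import numpy as np
from sklearn.model_selection import train_test_split

# 100 synthetic samples: the 70/30 split should give 70 train, 30 test.
X = np.arange(200).reshape(100, 2)
y = np.arange(100) % 2

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=0)
print(len(X_train), len(X_test))
```

Fixing `random_state` makes the shuffle reproducible, so the same rows land in the same split on every run — useful when comparing models.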

And that’s it, friends. You can now preprocess data on your own. Go ahead and try it to start building your own models and making predictions.
