How to fill in missing data using Python pandas


Data cleaning takes up a large share of the time spent on data science, and missing data is one of the challenges you will face most often. pandas is a Python data manipulation library that helps you fix missing values in your dataset, among other things.

You can correct missing data by deleting it or filling it with other values. In this article, we will explain and explore the different ways to fill missing data using pandas.

1. Use the fillna() method

The fillna() function scans your dataset and fills every null cell with a specified value. It accepts several optional arguments; note the following:

value: The value you want to insert in place of the missing entries.

method: Lets you fill missing values forwards or backwards. It accepts 'ffill' or 'bfill'. pandas 2.1 and later deprecate this argument in favor of the dedicated ffill() and bfill() methods.

inplace: Accepts a Boolean. If True, fillna() modifies the DataFrame itself. Otherwise, it leaves the original unchanged and returns a new, filled copy.
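
As a rough illustration of the value and inplace arguments, here is a minimal sketch. The Series s and the fill value 0 are made up for the example and are not part of this article's dataset.

```python
import pandas

# A small hypothetical Series with one missing entry
s = pandas.Series([1.0, None, 3.0])

# Without inplace, fillna() returns a new object and leaves s untouched
filled = s.fillna(value=0)

# With inplace=True, the same call changes the object it is called on
t = s.copy()
t.fillna(value=0, inplace=True)

print(s.isna().sum())   # s still has its missing entry
print(filled.tolist())
print(t.tolist())
```

Assigning the returned object to a new name, as with filled above, keeps the original data available in case you want to compare strategies.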

Before you start, make sure to install pandas in your Python virtual environment using pip in your terminal:

pip install pandas

Next, in a Python script, we'll create a practice DataFrame and insert null values (NaN) in a few rows:

import pandas
df = pandas.DataFrame({'A': [0, 3, None, 10, 3, None],
                       'B': [None, None, 7.13, 13.82, 7, 7],
                       'C': [None, "Pandas", None, "Pandas", "Python", "JavaScript"]})
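
Before filling anything, it can help to see how many values are actually missing. The following is a small sketch that rebuilds the practice DataFrame and counts the null entries per column with isna(); the variable name missing_per_column is only illustrative.

```python
import pandas

df = pandas.DataFrame({'A': [0, 3, None, 10, 3, None],
                       'B': [None, None, 7.13, 13.82, 7, 7],
                       'C': [None, "Pandas", None, "Pandas", "Python", "JavaScript"]})

# isna() marks every missing cell as True; sum() counts them per column
missing_per_column = df.isna().sum()
print(missing_per_column)
```

Each of the three columns in this DataFrame has two missing entries.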



Now find out how you can fill in those missing values using the various methods available in pandas.

Fill missing values with mean, median or mode

This method replaces missing values with a computed average. Filling missing data with the mean or median is applicable when the columns involved have integer or float data types.

You can also fill in missing data with the mode, which is the most frequent value. This works for integers and floats too, but it is most useful when the columns in question contain strings.

Here’s how to insert the mean and median into the missing rows of the DataFrame you created earlier:

#To insert the mean value of each numeric column into its missing rows
#(numeric_only=True skips the string column C):
df.fillna(df.mean(numeric_only=True).round(1), inplace=True)
#For median:
df.fillna(df.median(numeric_only=True).round(1), inplace=True)
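
To compare the two strategies side by side, the sketch below starts from a fresh copy of the practice DataFrame and stores each result under its own name instead of filling in place. The names filled_mean and filled_median are illustrative only.

```python
import pandas

df = pandas.DataFrame({'A': [0, 3, None, 10, 3, None],
                       'B': [None, None, 7.13, 13.82, 7, 7],
                       'C': [None, "Pandas", None, "Pandas", "Python", "JavaScript"]})

# One mean per numeric column; the string column C is left out
means = df.mean(numeric_only=True).round(1)
filled_mean = df.fillna(means)

# One median per numeric column
medians = df.median(numeric_only=True).round(1)
filled_median = df.fillna(medians)

print(filled_mean['A'].tolist())    # [0.0, 3.0, 4.0, 10.0, 3.0, 4.0]
print(filled_median['A'].tolist())  # [0.0, 3.0, 3.0, 10.0, 3.0, 3.0]
```

Column C still contains its missing entries after both calls, since neither the mean nor the median is defined for strings.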

Passing the mode to fillna() the way you did for the mean and median above does not fill the entire DataFrame, because df.mode() returns a DataFrame rather than a single value per column. But you can insert it in a specific column instead, for example, column C:

df['C'] = df['C'].fillna(df['C'].mode()[0])

That said, it’s still possible to insert each column’s modal value on its missing rows at once using a for loop:

for i in df.columns:
    df[i] = df[i].fillna(df[i].mode()[0])
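
As an alternative to the loop, and not the approach this article takes, you could take the first row of df.mode(), which holds one modal value per column, and pass it to fillna() in a single call. This is a sketch under the assumption that using the first mode is acceptable when a column has several.

```python
import pandas

df = pandas.DataFrame({'A': [0, 3, None, 10, 3, None],
                       'B': [None, None, 7.13, 13.82, 7, 7],
                       'C': [None, "Pandas", None, "Pandas", "Python", "JavaScript"]})

# df.mode() returns a DataFrame; its first row is a Series keyed by column name
modes = df.mode().iloc[0]

# fillna() matches the Series keys to the column names
filled = df.fillna(modes)

print(filled['A'].tolist())  # [0.0, 3.0, 3.0, 10.0, 3.0, 3.0]
print(filled['C'].tolist())
```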

If you want to be column-specific when inserting the mean, median, or mode:

df.fillna({"A": df['A'].mean(),
           "B": df['B'].median(),
           "C": df['C'].mode()[0]},
          inplace=True)

Fill null rows with values using ffill

This involves specifying the ffill method inside the fillna() function. It fills each missing row with the nearest value above it.

You can also call it forward fill:

#Same result as fillna(method='ffill'), which pandas 2.1 and later deprecate:
df.ffill(inplace=True)
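
One consequence of forward filling is worth seeing on the practice data: a missing value at the very top of a column has nothing above it to copy, so it stays missing. The sketch below uses a fresh copy of the DataFrame and the ffill() method; the name forward is illustrative.

```python
import pandas

df = pandas.DataFrame({'A': [0, 3, None, 10, 3, None],
                       'B': [None, None, 7.13, 13.82, 7, 7],
                       'C': [None, "Pandas", None, "Pandas", "Python", "JavaScript"]})

# Copy each value downward into the gaps that follow it
forward = df.ffill()

print(forward['A'].tolist())      # [0.0, 3.0, 3.0, 10.0, 3.0, 3.0]
print(forward['B'].isna().sum())  # 2, the two leading gaps remain
```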

Fill missing rows with values using bfill

Here you will replace the ffill method mentioned above with bfill. It fills each missing row in the DataFrame with the nearest value below it.

This is called backfilling:

#Same result as fillna(method='bfill'), which pandas 2.1 and later deprecate:
df.bfill(inplace=True)
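
Backfilling has the mirror-image limitation: a gap at the bottom of a column has nothing below it to copy. A common workaround, shown here as a sketch rather than a recommendation, is to chain a forward fill and a backward fill so that each covers the other's edge case.

```python
import pandas

df = pandas.DataFrame({'A': [0, 3, None, 10, 3, None],
                       'B': [None, None, 7.13, 13.82, 7, 7],
                       'C': [None, "Pandas", None, "Pandas", "Python", "JavaScript"]})

# Copy each value upward into the gaps that precede it
backward = df.bfill()
print(backward['A'].isna().sum())  # 1, the trailing gap in A remains

# Forward fill first, then backfill whatever is still missing at the top
both = df.ffill().bfill()
print(both.isna().sum().sum())     # 0
```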

2. The replace() method

You can replace the NaN values in a specific column with the mean, median, mode, or any other value.


See how it works by replacing the null rows in each column with its mean, median, or mode:

import pandas
import numpy #numpy is installed alongside pandas as a dependency
#Replace the null values in column A with the mean:
df['A'] = df['A'].replace([numpy.nan], df['A'].mean())
#Replace the null values in column B with the median:
df['B'] = df['B'].replace([numpy.nan], df['B'].median())
#Use the modal value for column C:
df['C'] = df['C'].replace([numpy.nan], df['C'].mode()[0])
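
For a numeric column, replacing numpy.nan gives the same result as calling fillna() with the same value. The short sketch below checks this on a standalone Series that mirrors column A; the names a, via_replace, and via_fillna are illustrative.

```python
import numpy
import pandas

# Same values as column A of the practice DataFrame
a = pandas.Series([0, 3, None, 10, 3, None])

# Swap every NaN for the column mean using replace()
via_replace = a.replace(numpy.nan, a.mean())

# The fillna() equivalent, for comparison
via_fillna = a.fillna(a.mean())

print(via_replace.tolist())            # [0.0, 3.0, 4.0, 10.0, 3.0, 4.0]
print(via_replace.equals(via_fillna))  # True
```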

3. Fill missing data with interpolate()

The interpolate() function uses existing values in the DataFrame to estimate the missing rows.

Run the following code to see how it works:

#Interpolate backward along the numeric columns (column C holds strings and is left out):
df[['A', 'B']] = df[['A', 'B']].interpolate(method='linear', limit_direction='backward')
#Interpolate forward along the numeric columns:
df[['A', 'B']] = df[['A', 'B']].interpolate(method='linear', limit_direction='forward')
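
To make the estimate concrete, here is a sketch of linear interpolation on the numeric columns of a fresh practice DataFrame. With method='linear', a gap between two known values is filled with a value on the straight line between them, so the gap between 3 and 10 in column A becomes 6.5. The name forward is illustrative.

```python
import pandas

df = pandas.DataFrame({'A': [0, 3, None, 10, 3, None],
                       'B': [None, None, 7.13, 13.82, 7, 7]})

# Estimate each gap from its neighbors, working in the forward direction
forward = df.interpolate(method='linear', limit_direction='forward')

print(forward['A'].tolist())      # [0.0, 3.0, 6.5, 10.0, 3.0, 3.0]
print(forward['B'].isna().sum())  # 2, forward mode leaves the leading gaps
```

In the forward direction, the trailing gap in column A takes the last known value, while the leading gaps in column B have no earlier value to work from and stay missing.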

Handle missing rows carefully

Although we only considered filling in missing data with computed values such as the mean, median, and mode, along with the fill and interpolation methods, other techniques exist for handling missing values. Data scientists, for example, sometimes remove the rows that contain missing values when that is appropriate.

Moreover, it is essential to think critically about your strategy before applying it. Otherwise, you may get misleading analysis or prediction results. Some initial data visualization may help you choose.
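
Removing rows is not covered in detail in this article, but as a brief sketch, the dropna() method handles it. The example below uses the practice DataFrame; the names complete_rows and rows_with_a are illustrative.

```python
import pandas

df = pandas.DataFrame({'A': [0, 3, None, 10, 3, None],
                       'B': [None, None, 7.13, 13.82, 7, 7],
                       'C': [None, "Pandas", None, "Pandas", "Python", "JavaScript"]})

# Keep only the rows that have no missing value in any column
complete_rows = df.dropna()
print(len(complete_rows))  # 2

# Keep the rows where column A is present, whatever the other columns hold
rows_with_a = df.dropna(subset=['A'])
print(len(rows_with_a))    # 4
```

On this small dataset, dropping every incomplete row discards four of the six rows, which shows how costly deletion can be when gaps are spread across columns.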
