7.7 KiB
Handling Missing Values in Pandas
In real life, many datasets arrive with missing data either because it exists and was not collected or it never existed.
In Pandas missing data is represented by two values:
None
: None is simply iskeyword
refer as empty or none.NaN
: Acronym forNot a Number
.
There are several useful functions for detecting, removing, and replacing null values in Pandas DataFrame:
isnull()
notnull()
dropna()
fillna()
replace()
2. Checking for missing values using isnull()
and notnull()
Let's import pandas and our fancy car-sales dataset having some missing values.
import pandas as pd
car_sales_missing_df = pd.read_csv("Datasets/car-sales-missing-data.csv")
print(car_sales_missing_df)
Make Colour Odometer Doors Price
0 Toyota White 150043.0 4.0 $4,000
1 Honda Red 87899.0 4.0 $5,000
2 Toyota Blue NaN 3.0 $7,000
3 BMW Black 11179.0 5.0 $22,000
4 Nissan White 213095.0 4.0 $3,500
5 Toyota Green NaN 4.0 $4,500
6 Honda NaN NaN 4.0 $7,500
7 Honda Blue NaN 4.0 NaN
8 Toyota White 60000.0 NaN NaN
9 NaN White 31600.0 4.0 $9,700
## Using isnull()
print(car_sales_missing_df.isnull())
Make Colour Odometer Doors Price
0 False False False False False
1 False False False False False
2 False False True False False
3 False False False False False
4 False False False False False
5 False False True False False
6 False True True False False
7 False False True False True
8 False False False True True
9 True False False False False
Note here:
True
means forNaN
valuesFalse
means for noNan
values
If we want to find the number of missing values in each column use isnull().sum()
.
print(car_sales_missing_df.isnull().sum())
Make 1
Colour 1
Odometer 4
Doors 1
Price 2
dtype: int64
You can also check presense of null values in a single column.
print(car_sales_missing_df["Odometer"].isnull())
0 False
1 False
2 True
3 False
4 False
5 True
6 True
7 True
8 False
9 False
Name: Odometer, dtype: bool
## using notnull()
print(car_sales_missing_df.notnull())
Make Colour Odometer Doors Price
0 True True True True True
1 True True True True True
2 True True False True True
3 True True True True True
4 True True True True True
5 True True False True True
6 True False False True True
7 True True False True False
8 True True True False False
9 False True True True True
Note here:
True
means noNaN
valuesFalse
means forNaN
values
isnull()
means having null values so it gives boolean True
for NaN values. And notnull()
means having no null values so it gives True
for no NaN value.
2. Filling missing values using fillna()
, replace()
.
## Filling missing values with a single value using `fillna`
print(car_sales_missing_df.fillna(0))
Make Colour Odometer Doors Price
0 Toyota White 150043.0 4.0 $4,000
1 Honda Red 87899.0 4.0 $5,000
2 Toyota Blue 0.0 3.0 $7,000
3 BMW Black 11179.0 5.0 $22,000
4 Nissan White 213095.0 4.0 $3,500
5 Toyota Green 0.0 4.0 $4,500
6 Honda 0 0.0 4.0 $7,500
7 Honda Blue 0.0 4.0 0
8 Toyota White 60000.0 0.0 0
9 0 White 31600.0 4.0 $9,700
## Filling missing values with the previous value using `ffill()`
print(car_sales_missing_df.ffill())
Make Colour Odometer Doors Price
0 Toyota White 150043.0 4.0 $4,000
1 Honda Red 87899.0 4.0 $5,000
2 Toyota Blue 87899.0 3.0 $7,000
3 BMW Black 11179.0 5.0 $22,000
4 Nissan White 213095.0 4.0 $3,500
5 Toyota Green 213095.0 4.0 $4,500
6 Honda Green 213095.0 4.0 $7,500
7 Honda Blue 213095.0 4.0 $7,500
8 Toyota White 60000.0 4.0 $7,500
9 Toyota White 31600.0 4.0 $9,700
## illing null value with the next ones using 'bfill()'
print(car_sales_missing_df.bfill())
Make Colour Odometer Doors Price
0 Toyota White 150043.0 4.0 $4,000
1 Honda Red 87899.0 4.0 $5,000
2 Toyota Blue 11179.0 3.0 $7,000
3 BMW Black 11179.0 5.0 $22,000
4 Nissan White 213095.0 4.0 $3,500
5 Toyota Green 60000.0 4.0 $4,500
6 Honda Blue 60000.0 4.0 $7,500
7 Honda Blue 60000.0 4.0 $9,700
8 Toyota White 60000.0 4.0 $9,700
9 NaN White 31600.0 4.0 $9,700
Filling a null values using replace()
method
Now we are going to replace the all NaN
value in the data frame with -125 value
For this we will also need numpy
import numpy as np
print(car_sales_missing_df.replace(to_replace = np.nan, value = -125))
Make Colour Odometer Doors Price
0 Toyota White 150043.0 4.0 $4,000
1 Honda Red 87899.0 4.0 $5,000
2 Toyota Blue -125.0 3.0 $7,000
3 BMW Black 11179.0 5.0 $22,000
4 Nissan White 213095.0 4.0 $3,500
5 Toyota Green -125.0 4.0 $4,500
6 Honda -125 -125.0 4.0 $7,500
7 Honda Blue -125.0 4.0 -125
8 Toyota White 60000.0 -125.0 -125
9 -125 White 31600.0 4.0 $9,700
3. Dropping missing values using dropna()
In order to drop a null values from a dataframe, we used dropna()
function this function drop Rows/Columns of datasets with Null values in different ways.
Dropping rows with at least 1 null value.
print(car_sales_missing_df.dropna(axis = 0)) ##Now we drop rows with at least one Nan value (Null value)
Make Colour Odometer Doors Price
0 Toyota White 150043.0 4.0 $4,000
1 Honda Red 87899.0 4.0 $5,000
3 BMW Black 11179.0 5.0 $22,000
4 Nissan White 213095.0 4.0 $3,500
Dropping rows if all values in that row are missing.
print(car_sales_missing_df.dropna(how = 'all',axis = 0)) ## If not have leave the row as it is
Make Colour Odometer Doors Price
0 Toyota White 150043.0 4.0 $4,000
1 Honda Red 87899.0 4.0 $5,000
2 Toyota Blue NaN 3.0 $7,000
3 BMW Black 11179.0 5.0 $22,000
4 Nissan White 213095.0 4.0 $3,500
5 Toyota Green NaN 4.0 $4,500
6 Honda NaN NaN 4.0 $7,500
7 Honda Blue NaN 4.0 NaN
8 Toyota White 60000.0 NaN NaN
9 NaN White 31600.0 4.0 $9,700
Dropping columns with at least 1 null value
print(car_sales_missing_df.dropna(axis = 1))
Empty DataFrame
Columns: []
Index: [0, 1, 2, 3, 4, 5, 6, 7, 8, 9]
Now we drop a columns which have at least 1 missing values.
Here the dataset becomes empty after dropna()
because each column as atleast 1 null value so it remove that columns resulting in an empty dataframe.