Handling Missing Values in Pandas

In practice, many datasets arrive with missing data, either because a value exists but was not collected or because it never existed.

In Pandas, missing data is represented by two values:

  • None : A Python keyword that denotes an empty or absent value.
  • NaN : Short for Not a Number, a special floating-point value.
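
As a minimal sketch of how the two markers behave, the snippet below builds two small Series (the names `names` and `odometer` are illustrative, not part of the dataset) and checks that Pandas detects both None and NaN as missing.

```python
import numpy as np
import pandas as pd

# Illustrative Series: one holding text, one holding numbers.
names = pd.Series(["Toyota", None])
odometer = pd.Series([150043.0, None])

# Both kinds of missing marker are detected by isnull().
print(names.isnull().tolist())     # [False, True]
print(odometer.isnull().tolist())  # [False, True]

# In a numeric Series, None is stored as NaN, so the dtype is float64.
print(odometer.dtype)              # float64

# NaN never compares equal to itself, so test with isnull()/pd.isna(), not ==.
print(np.nan == np.nan)                # False
print(pd.isna(np.nan), pd.isna(None))  # True True
```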

There are several useful functions for detecting, removing, and replacing null values in a Pandas DataFrame:

  1. isnull()
  2. notnull()
  3. dropna()
  4. fillna()
  5. replace()

1. Checking for missing values using isnull() and notnull()

Let's import pandas and load our car-sales dataset, which has some missing values.

import pandas as pd

car_sales_missing_df = pd.read_csv("Datasets/car-sales-missing-data.csv")
print(car_sales_missing_df)
     Make Colour  Odometer  Doors    Price
0  Toyota  White  150043.0    4.0   $4,000
1   Honda    Red   87899.0    4.0   $5,000
2  Toyota   Blue       NaN    3.0   $7,000
3     BMW  Black   11179.0    5.0  $22,000
4  Nissan  White  213095.0    4.0   $3,500
5  Toyota  Green       NaN    4.0   $4,500
6   Honda    NaN       NaN    4.0   $7,500
7   Honda   Blue       NaN    4.0      NaN
8  Toyota  White   60000.0    NaN      NaN
9     NaN  White   31600.0    4.0   $9,700
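
If the CSV file is not at hand, the same table can be rebuilt by hand. The sketch below copies the values from the printout above; it assumes the file holds exactly these ten rows and that Price is stored as text.

```python
import numpy as np
import pandas as pd

# Values copied from the printed table; np.nan marks each missing cell.
car_sales_missing_df = pd.DataFrame({
    "Make": ["Toyota", "Honda", "Toyota", "BMW", "Nissan",
             "Toyota", "Honda", "Honda", "Toyota", np.nan],
    "Colour": ["White", "Red", "Blue", "Black", "White",
               "Green", np.nan, "Blue", "White", "White"],
    "Odometer": [150043.0, 87899.0, np.nan, 11179.0, 213095.0,
                 np.nan, np.nan, np.nan, 60000.0, 31600.0],
    "Doors": [4.0, 4.0, 3.0, 5.0, 4.0, 4.0, 4.0, 4.0, np.nan, 4.0],
    "Price": ["$4,000", "$5,000", "$7,000", "$22,000", "$3,500",
              "$4,500", "$7,500", np.nan, np.nan, "$9,700"],
})

print(car_sales_missing_df.shape)  # (10, 5)
```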
## Using isnull()

print(car_sales_missing_df.isnull())
    Make  Colour  Odometer  Doors  Price
0  False   False     False  False  False
1  False   False     False  False  False
2  False   False      True  False  False
3  False   False     False  False  False
4  False   False     False  False  False
5  False   False      True  False  False
6  False    True      True  False  False
7  False   False      True  False   True
8  False   False     False   True   True
9   True   False     False  False  False

Note here:

  • True means the value is missing (NaN)
  • False means the value is not missing

To find the number of missing values in each column, use isnull().sum().

print(car_sales_missing_df.isnull().sum())
Make        1
Colour      1
Odometer    4
Doors       1
Price       2
dtype: int64
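
The same boolean table can be summarised in other ways. The sketch below uses two columns copied from the car-sales table (the short name `df` is only for illustration) to count missing cells overall, count rows with any gap, and compute the fraction missing per column.

```python
import numpy as np
import pandas as pd

# Two columns copied from the car-sales table printed earlier.
df = pd.DataFrame({
    "Odometer": [150043.0, 87899.0, np.nan, 11179.0, 213095.0,
                 np.nan, np.nan, np.nan, 60000.0, 31600.0],
    "Price": ["$4,000", "$5,000", "$7,000", "$22,000", "$3,500",
              "$4,500", "$7,500", np.nan, np.nan, "$9,700"],
})

# Total number of missing cells in the whole frame.
total_missing = int(df.isnull().sum().sum())
print(total_missing)      # 6

# Number of rows that contain at least one missing value.
rows_with_missing = int(df.isnull().any(axis=1).sum())
print(rows_with_missing)  # 5

# Fraction of missing values per column (mean of the True/False mask).
fraction_missing = df.isnull().mean()
print(fraction_missing)
```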

You can also check for the presence of null values in a single column.

print(car_sales_missing_df["Odometer"].isnull())
0    False
1    False
2     True
3    False
4    False
5     True
6     True
7     True
8    False
9    False
Name: Odometer, dtype: bool
## Using notnull()

print(car_sales_missing_df.notnull())
    Make  Colour  Odometer  Doors  Price
0   True    True      True   True   True
1   True    True      True   True   True
2   True    True     False   True   True
3   True    True      True   True   True
4   True    True      True   True   True
5   True    True     False   True   True
6   True   False     False   True   True
7   True    True     False   True  False
8   True    True      True  False  False
9  False    True      True   True   True

Note here:

  • True means the value is not missing
  • False means the value is missing (NaN)

isnull() asks whether a value is null, so it returns True for NaN values. notnull() asks the opposite question, so it returns True for values that are not NaN.
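
Because both functions return boolean masks, they can be used to filter rows. A minimal sketch, using two columns copied from the car-sales table (the name `df` is illustrative):

```python
import numpy as np
import pandas as pd

# Two columns copied from the car-sales table printed earlier.
df = pd.DataFrame({
    "Make": ["Toyota", "Honda", "Toyota", "BMW", "Nissan",
             "Toyota", "Honda", "Honda", "Toyota", np.nan],
    "Odometer": [150043.0, 87899.0, np.nan, 11179.0, 213095.0,
                 np.nan, np.nan, np.nan, 60000.0, 31600.0],
})

# notnull() is the element-wise inverse of isnull().
print(df.notnull().equals(~df.isnull()))  # True

# Keep only the rows where Odometer has a value.
with_odometer = df[df["Odometer"].notnull()]
print(with_odometer.index.tolist())       # [0, 1, 3, 4, 8, 9]
```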

2. Filling missing values using fillna() and replace()

## Filling missing values with a single value using `fillna()`
print(car_sales_missing_df.fillna(0))
     Make Colour  Odometer  Doors    Price
0  Toyota  White  150043.0    4.0   $4,000
1   Honda    Red   87899.0    4.0   $5,000
2  Toyota   Blue       0.0    3.0   $7,000
3     BMW  Black   11179.0    5.0  $22,000
4  Nissan  White  213095.0    4.0   $3,500
5  Toyota  Green       0.0    4.0   $4,500
6   Honda      0       0.0    4.0   $7,500
7   Honda   Blue       0.0    4.0        0
8  Toyota  White   60000.0    0.0        0
9       0  White   31600.0    4.0   $9,700
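
A single fill value rarely suits every column: in the output above, 0 ends up in text columns such as Make and Colour. fillna() also accepts a dictionary that maps column names to fill values. The sketch below is one possible choice, not a recommendation: the placeholder "Unknown", the column mean, and the most common door count are all assumptions made for illustration.

```python
import numpy as np
import pandas as pd

# Three columns copied from the car-sales table printed earlier.
df = pd.DataFrame({
    "Colour": ["White", "Red", "Blue", "Black", "White",
               "Green", np.nan, "Blue", "White", "White"],
    "Odometer": [150043.0, 87899.0, np.nan, 11179.0, 213095.0,
                 np.nan, np.nan, np.nan, 60000.0, 31600.0],
    "Doors": [4.0, 4.0, 3.0, 5.0, 4.0, 4.0, 4.0, 4.0, np.nan, 4.0],
})

# One fill value per column (illustrative choices).
fill_values = {
    "Colour": "Unknown",                # placeholder text for a text column
    "Odometer": df["Odometer"].mean(),  # mean of the observed readings
    "Doors": df["Doors"].mode()[0],     # most frequent door count
}

filled = df.fillna(fill_values)
print(filled)
```

fillna() returns a new DataFrame here, so `df` itself is left unchanged.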
## Filling missing values with the previous value using `ffill()`
print(car_sales_missing_df.ffill())
     Make Colour  Odometer  Doors    Price
0  Toyota  White  150043.0    4.0   $4,000
1   Honda    Red   87899.0    4.0   $5,000
2  Toyota   Blue   87899.0    3.0   $7,000
3     BMW  Black   11179.0    5.0  $22,000
4  Nissan  White  213095.0    4.0   $3,500
5  Toyota  Green  213095.0    4.0   $4,500
6   Honda  Green  213095.0    4.0   $7,500
7   Honda   Blue  213095.0    4.0   $7,500
8  Toyota  White   60000.0    4.0   $7,500
9  Toyota  White   31600.0    4.0   $9,700
## Filling missing values with the next value using `bfill()`
print(car_sales_missing_df.bfill())
     Make Colour  Odometer  Doors    Price
0  Toyota  White  150043.0    4.0   $4,000
1   Honda    Red   87899.0    4.0   $5,000
2  Toyota   Blue   11179.0    3.0   $7,000
3     BMW  Black   11179.0    5.0  $22,000
4  Nissan  White  213095.0    4.0   $3,500
5  Toyota  Green   60000.0    4.0   $4,500
6   Honda   Blue   60000.0    4.0   $7,500
7   Honda   Blue   60000.0    4.0   $9,700
8  Toyota  White   60000.0    4.0   $9,700
9     NaN  White   31600.0    4.0   $9,700
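
Forward and backward filling can carry one value across a long run of gaps, as with the Odometer readings in rows 5 to 7. Both methods take a `limit` argument that caps how many consecutive missing values are filled. A short sketch on the Odometer column (the Series name `odometer` is illustrative):

```python
import numpy as np
import pandas as pd

# Odometer column copied from the car-sales table printed earlier.
odometer = pd.Series([150043.0, 87899.0, np.nan, 11179.0, 213095.0,
                      np.nan, np.nan, np.nan, 60000.0, 31600.0])

# Fill at most one consecutive gap going forward: rows 2 and 5 are filled.
forward = odometer.ffill(limit=1)
print(forward.isnull().tolist().count(True))   # 2 (rows 6 and 7 stay NaN)

# Fill at most one consecutive gap going backward: rows 2 and 7 are filled.
backward = odometer.bfill(limit=1)
print(backward.isnull().tolist().count(True))  # 2 (rows 5 and 6 stay NaN)
```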

Filling null values using the replace() method

Now we are going to replace all NaN values in the DataFrame with -125.

For this we will also need NumPy.

import numpy as np

print(car_sales_missing_df.replace(to_replace = np.nan, value = -125))
     Make Colour  Odometer  Doors    Price
0  Toyota  White  150043.0    4.0   $4,000
1   Honda    Red   87899.0    4.0   $5,000
2  Toyota   Blue    -125.0    3.0   $7,000
3     BMW  Black   11179.0    5.0  $22,000
4  Nissan  White  213095.0    4.0   $3,500
5  Toyota  Green    -125.0    4.0   $4,500
6   Honda   -125    -125.0    4.0   $7,500
7   Honda   Blue    -125.0    4.0     -125
8  Toyota  White   60000.0 -125.0     -125
9    -125  White   31600.0    4.0   $9,700
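
When the value being replaced is NaN, replace() and fillna() should give the same result. The sketch below checks this on the Odometer column alone (the Series name `odometer` is illustrative).

```python
import numpy as np
import pandas as pd

# Odometer column copied from the car-sales table printed earlier.
odometer = pd.Series([150043.0, 87899.0, np.nan, 11179.0, 213095.0,
                      np.nan, np.nan, np.nan, 60000.0, 31600.0])

via_replace = odometer.replace(to_replace=np.nan, value=-125)
via_fillna = odometer.fillna(-125)

# Both approaches put -125.0 in the four missing positions.
print(via_replace.equals(via_fillna))  # True
print((via_replace == -125).sum())     # 4
```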

3. Dropping missing values using dropna()

To drop null values from a DataFrame, we use the dropna() function. It can drop rows or columns that contain null values in different ways.

Dropping rows with at least 1 null value.

print(car_sales_missing_df.dropna(axis = 0))  ## Drop rows with at least one NaN (null) value
     Make Colour  Odometer  Doors    Price
0  Toyota  White  150043.0    4.0   $4,000
1   Honda    Red   87899.0    4.0   $5,000
3     BMW  Black   11179.0    5.0  $22,000
4  Nissan  White  213095.0    4.0   $3,500
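
Dropping every row with any gap removed six of the ten rows here. If only some columns matter, dropna() takes a `subset` argument listing the columns to check. A minimal sketch on two columns copied from the table (the name `df` is illustrative):

```python
import numpy as np
import pandas as pd

# Two columns copied from the car-sales table printed earlier.
df = pd.DataFrame({
    "Odometer": [150043.0, 87899.0, np.nan, 11179.0, 213095.0,
                 np.nan, np.nan, np.nan, 60000.0, 31600.0],
    "Price": ["$4,000", "$5,000", "$7,000", "$22,000", "$3,500",
              "$4,500", "$7,500", np.nan, np.nan, "$9,700"],
})

# Drop a row only when its Odometer value is missing.
# Row 8 is kept even though its Price is NaN.
kept = df.dropna(subset=["Odometer"])
print(kept.index.tolist())  # [0, 1, 3, 4, 8, 9]
```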

Dropping rows if all values in that row are missing.

print(car_sales_missing_df.dropna(how = 'all', axis = 0))  ## Rows with at least one non-null value are left as they are
     Make Colour  Odometer  Doors    Price
0  Toyota  White  150043.0    4.0   $4,000
1   Honda    Red   87899.0    4.0   $5,000
2  Toyota   Blue       NaN    3.0   $7,000
3     BMW  Black   11179.0    5.0  $22,000
4  Nissan  White  213095.0    4.0   $3,500
5  Toyota  Green       NaN    4.0   $4,500
6   Honda    NaN       NaN    4.0   $7,500
7   Honda   Blue       NaN    4.0      NaN
8  Toyota  White   60000.0    NaN      NaN
9     NaN  White   31600.0    4.0   $9,700
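
No row in the car-sales data is completely empty, so the output above is identical to the input. To see how='all' take effect, the sketch below uses a small made-up frame (hypothetical values, not from the dataset) in which one row has no values at all.

```python
import numpy as np
import pandas as pd

# Hypothetical three-row frame: row 1 is entirely missing.
df = pd.DataFrame({
    "Make": ["Toyota", np.nan, "Honda"],
    "Odometer": [150043.0, np.nan, np.nan],
})

# Only the fully empty row is dropped; row 2 still has a Make, so it stays.
result = df.dropna(how="all", axis=0)
print(result.index.tolist())  # [0, 2]
```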

Dropping columns with at least 1 null value

print(car_sales_missing_df.dropna(axis = 1))
Empty DataFrame
Columns: []
Index: [0, 1, 2, 3, 4, 5, 6, 7, 8, 9]

Here we drop every column that has at least one missing value.

The DataFrame becomes empty after dropna() because every column has at least one null value, so all of them are removed.
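
A less drastic option is the `thresh` argument, which keeps a column (or row) only if it has at least that many non-null values. The sketch below uses three columns copied from the table; the threshold of 8 is an arbitrary value chosen for illustration.

```python
import numpy as np
import pandas as pd

# Three columns copied from the car-sales table printed earlier.
df = pd.DataFrame({
    "Make": ["Toyota", "Honda", "Toyota", "BMW", "Nissan",
             "Toyota", "Honda", "Honda", "Toyota", np.nan],
    "Odometer": [150043.0, 87899.0, np.nan, 11179.0, 213095.0,
                 np.nan, np.nan, np.nan, 60000.0, 31600.0],
    "Price": ["$4,000", "$5,000", "$7,000", "$22,000", "$3,500",
              "$4,500", "$7,500", np.nan, np.nan, "$9,700"],
})

# Non-null counts: Make 9, Odometer 6, Price 8.
# Keep only the columns with at least 8 non-null values.
kept_columns = df.dropna(axis=1, thresh=8)
print(kept_columns.columns.tolist())  # ['Make', 'Price']
```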