# Handling Missing Values in Pandas

**Up until now we have worked with complete data, i.e. data with no missing values. In real life, however, missing values are one of the main problems.**

*Many datasets arrive with missing data, either because a value exists but was not collected or because it never existed.*

In Pandas, missing data is represented by two values:

* `None`: a Python keyword that stands for an empty value, i.e. no value at all.
* `NaN`: short for `Not a Number`, a special floating-point value.

**There are several useful functions for detecting, removing, and replacing null values in a Pandas DataFrame:**

1. `isnull()`
2. `notnull()`
3. `dropna()`
4. `fillna()`
5. `replace()`

## 1. Checking for missing values using `isnull()` and `notnull()`

Let's import pandas and our fancy car-sales dataset, which has some missing values.

```python
import pandas as pd
```

```python
car_sales_missing_df = pd.read_csv("https://raw.githubusercontent.com/kRiShNa-429407/learn-python/main/contrib/pandas/Datasets/car-sales-missing-data.csv")
print(car_sales_missing_df)
```

         Make Colour  Odometer  Doors    Price
    0  Toyota  White  150043.0    4.0   $4,000
    1   Honda    Red   87899.0    4.0   $5,000
    2  Toyota   Blue       NaN    3.0   $7,000
    3     BMW  Black   11179.0    5.0  $22,000
    4  Nissan  White  213095.0    4.0   $3,500
    5  Toyota  Green       NaN    4.0   $4,500
    6   Honda    NaN       NaN    4.0   $7,500
    7   Honda   Blue       NaN    4.0      NaN
    8  Toyota  White   60000.0    NaN      NaN
    9     NaN  White   31600.0    4.0   $9,700

```python
# Using isnull()
print(car_sales_missing_df.isnull())
```

        Make  Colour  Odometer  Doors  Price
    0  False   False     False  False  False
    1  False   False     False  False  False
    2  False   False      True  False  False
    3  False   False     False  False  False
    4  False   False     False  False  False
    5  False   False      True  False  False
    6  False    True      True  False  False
    7  False   False      True  False   True
    8  False   False     False   True   True
    9   True   False     False  False  False

Note here:

* `True` means the value is `NaN`
* `False` means the value is not `NaN`

To find the number of missing values in each column, use `isnull().sum()`.

```python
print(car_sales_missing_df.isnull().sum())
```

    Make        1
    Colour      1
    Odometer    4
    Doors       1
    Price       2
    dtype: int64

You can also check for the presence of null values in a single column.

```python
print(car_sales_missing_df["Odometer"].isnull())
```

    0    False
    1    False
    2     True
    3    False
    4    False
    5     True
    6     True
    7     True
    8    False
    9    False
    Name: Odometer, dtype: bool

```python
# Using notnull()
print(car_sales_missing_df.notnull())
```

        Make  Colour  Odometer  Doors  Price
    0   True    True      True   True   True
    1   True    True      True   True   True
    2   True    True     False   True   True
    3   True    True      True   True   True
    4   True    True      True   True   True
    5   True    True     False   True   True
    6   True   False     False   True   True
    7   True    True     False   True  False
    8   True    True      True  False  False
    9  False    True      True   True   True

Note here:

* `True` means the value is not `NaN`
* `False` means the value is `NaN`

#### A little note here:

`isnull()` asks "is this value null?", so it returns `True` for `NaN` values. `notnull()` asks "is this value not null?", so it returns `True` for every value that is not `NaN`.

## 2. Filling missing values using `fillna()`, `replace()`.

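The examples in this section apply one fill rule to the whole DataFrame. As a minimal sketch on a small made-up frame (the `toy_df` name, its columns, and its values are assumptions for illustration, not part of the car-sales dataset), `fillna()` can also take a dict that maps column names to fill values, and by default it returns a new DataFrame instead of changing the original. Using the column mean for `Odometer` is just one possible choice.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data, made up for illustration (not the car-sales dataset).
toy_df = pd.DataFrame({
    "Make": ["Toyota", None, "Honda"],
    "Odometer": [1000.0, np.nan, 3000.0],
})

# Map each column name to the value that should replace its missing entries.
# mean() skips NaN by default, so this is the mean of 1000.0 and 3000.0.
fill_values = {"Make": "Unknown", "Odometer": toy_df["Odometer"].mean()}

# fillna() returns a new DataFrame; toy_df itself is left unchanged.
filled_df = toy_df.fillna(fill_values)

print(filled_df)
print(toy_df.isnull().sum())  # the original still reports its missing values
```

To keep the filled result, assign it to a variable as above; the calls in this section only print the result and do not modify `car_sales_missing_df`.
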
```python
# Filling missing values with a single value using `fillna()`
print(car_sales_missing_df.fillna(0))
```

         Make Colour  Odometer  Doors    Price
    0  Toyota  White  150043.0    4.0   $4,000
    1   Honda    Red   87899.0    4.0   $5,000
    2  Toyota   Blue       0.0    3.0   $7,000
    3     BMW  Black   11179.0    5.0  $22,000
    4  Nissan  White  213095.0    4.0   $3,500
    5  Toyota  Green       0.0    4.0   $4,500
    6   Honda      0       0.0    4.0   $7,500
    7   Honda   Blue       0.0    4.0        0
    8  Toyota  White   60000.0    0.0        0
    9       0  White   31600.0    4.0   $9,700

```python
# Filling missing values with the previous value using `ffill()`
print(car_sales_missing_df.ffill())
```

         Make Colour  Odometer  Doors    Price
    0  Toyota  White  150043.0    4.0   $4,000
    1   Honda    Red   87899.0    4.0   $5,000
    2  Toyota   Blue   87899.0    3.0   $7,000
    3     BMW  Black   11179.0    5.0  $22,000
    4  Nissan  White  213095.0    4.0   $3,500
    5  Toyota  Green  213095.0    4.0   $4,500
    6   Honda  Green  213095.0    4.0   $7,500
    7   Honda   Blue  213095.0    4.0   $7,500
    8  Toyota  White   60000.0    4.0   $7,500
    9  Toyota  White   31600.0    4.0   $9,700

```python
# Filling missing values with the next value using `bfill()`
print(car_sales_missing_df.bfill())
```

         Make Colour  Odometer  Doors    Price
    0  Toyota  White  150043.0    4.0   $4,000
    1   Honda    Red   87899.0    4.0   $5,000
    2  Toyota   Blue   11179.0    3.0   $7,000
    3     BMW  Black   11179.0    5.0  $22,000
    4  Nissan  White  213095.0    4.0   $3,500
    5  Toyota  Green   60000.0    4.0   $4,500
    6   Honda   Blue   60000.0    4.0   $7,500
    7   Honda   Blue   60000.0    4.0   $9,700
    8  Toyota  White   60000.0    4.0   $9,700
    9     NaN  White   31600.0    4.0   $9,700

Note that `Make` in row 9 stays `NaN`: it is the last row, so there is no next value to fill from.

#### Filling null values using the `replace()` method

**Now we are going to replace all the `NaN` values in the DataFrame with the value -125.**

*For this we will also need numpy.*

```python
import numpy as np
```

```python
print(car_sales_missing_df.replace(to_replace=np.nan, value=-125))
```

         Make Colour  Odometer   Doors    Price
    0  Toyota  White  150043.0     4.0   $4,000
    1   Honda    Red   87899.0     4.0   $5,000
    2  Toyota   Blue    -125.0     3.0   $7,000
    3     BMW  Black   11179.0     5.0  $22,000
    4  Nissan  White  213095.0     4.0   $3,500
    5  Toyota  Green    -125.0     4.0   $4,500
    6   Honda   -125    -125.0     4.0   $7,500
    7   Honda   Blue    -125.0     4.0     -125
    8  Toyota  White   60000.0  -125.0     -125
    9    -125  White   31600.0     4.0   $9,700

## 3. Dropping missing values using `dropna()`

**To drop null values from a DataFrame, we use the `dropna()` function. It can drop rows or columns that contain null values in different ways.**

#### Dropping rows with at least 1 null value.

```python
# Drop every row that has at least one NaN (null) value
print(car_sales_missing_df.dropna(axis=0))
```

         Make Colour  Odometer  Doors    Price
    0  Toyota  White  150043.0    4.0   $4,000
    1   Honda    Red   87899.0    4.0   $5,000
    3     BMW  Black   11179.0    5.0  $22,000
    4  Nissan  White  213095.0    4.0   $3,500

#### Dropping rows if all values in that row are missing.

```python
# Drop a row only if every value in it is missing; otherwise leave the row as it is
print(car_sales_missing_df.dropna(how='all', axis=0))
```

         Make Colour  Odometer  Doors    Price
    0  Toyota  White  150043.0    4.0   $4,000
    1   Honda    Red   87899.0    4.0   $5,000
    2  Toyota   Blue       NaN    3.0   $7,000
    3     BMW  Black   11179.0    5.0  $22,000
    4  Nissan  White  213095.0    4.0   $3,500
    5  Toyota  Green       NaN    4.0   $4,500
    6   Honda    NaN       NaN    4.0   $7,500
    7   Honda   Blue       NaN    4.0      NaN
    8  Toyota  White   60000.0    NaN      NaN
    9     NaN  White   31600.0    4.0   $9,700

No row in this dataset is entirely empty, so nothing is dropped.

#### Dropping columns with at least 1 null value

```python
print(car_sales_missing_df.dropna(axis=1))
```

    Empty DataFrame
    Columns: []
    Index: [0, 1, 2, 3, 4, 5, 6, 7, 8, 9]

Here we drop the columns that have at least 1 missing value.

**The DataFrame becomes empty after `dropna()` because every column has at least 1 null value, so all the columns are removed.**
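
Dropping every row or column that contains a null value can be too aggressive, as the empty result above shows. The following is a minimal sketch of two other `dropna()` parameters, `subset` and `thresh`, on a small made-up frame (the `toy_df` name and its values are assumptions for illustration, not taken from the car-sales dataset).

```python
import numpy as np
import pandas as pd

# Hypothetical toy data, made up for illustration.
toy_df = pd.DataFrame({
    "Make": ["Toyota", "Honda", None, "BMW"],
    "Odometer": [1000.0, np.nan, np.nan, 4000.0],
    "Price": [4000.0, 5000.0, np.nan, np.nan],
})

# subset: drop a row only when "Price" is missing;
# nulls in the other columns are ignored.
priced_df = toy_df.dropna(subset=["Price"])

# thresh: keep only the rows that have at least 2 non-null values.
mostly_complete_df = toy_df.dropna(thresh=2)

print(priced_df)
print(mostly_complete_df)
```

With this toy data, `subset=["Price"]` keeps rows 0 and 1, while `thresh=2` keeps rows 0, 1, and 3 and drops only the row in which every value is missing.
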