learn-python/contrib/pandas/handling-missing-values.md

# Handling Missing Values in Pandas

In real life, many datasets arrive with missing data either because it exists and was not collected or it never existed.

In Pandas missing data is represented by two values:

* `None` : None is simply is `keyword` refer as empty or none.
* `NaN` : Acronym for `Not a Number`.

There are several useful functions for detecting, removing, and replacing null values in Pandas DataFrame:

1. `isnull()`
2. `notnull()`
3. `dropna()`
4. `fillna()`
5. `replace()`

## 2. Checking for missing values using `isnull()` and `notnull()`

Let's import pandas and our fancy car-sales dataset having some missing values.

```python
import pandas as pd

car_sales_missing_df = pd.read_csv("Datasets/car-sales-missing-data.csv")
print(car_sales_missing_df)
```

         Make Colour  Odometer  Doors    Price
    0  Toyota  White  150043.0    4.0   $4,000
    1   Honda    Red   87899.0    4.0   $5,000
    2  Toyota   Blue       NaN    3.0   $7,000
    3     BMW  Black   11179.0    5.0  $22,000
    4  Nissan  White  213095.0    4.0   $3,500
    5  Toyota  Green       NaN    4.0   $4,500
    6   Honda    NaN       NaN    4.0   $7,500
    7   Honda   Blue       NaN    4.0      NaN
    8  Toyota  White   60000.0    NaN      NaN
    9     NaN  White   31600.0    4.0   $9,700
    

```python
## Using isnull()

print(car_sales_missing_df.isnull())
```

        Make  Colour  Odometer  Doors  Price
    0  False   False     False  False  False
    1  False   False     False  False  False
    2  False   False      True  False  False
    3  False   False     False  False  False
    4  False   False     False  False  False
    5  False   False      True  False  False
    6  False    True      True  False  False
    7  False   False      True  False   True
    8  False   False     False   True   True
    9   True   False     False  False  False
    

Note here:
* `True` means for `NaN` values
* `False` means for no `Nan` values

If we want to find the number of missing values in each column use `isnull().sum()`.


```python
print(car_sales_missing_df.isnull().sum())
```

    Make        1
    Colour      1
    Odometer    4
    Doors       1
    Price       2
    dtype: int64
    

You can also check presense of null values in a single column.


```python
print(car_sales_missing_df["Odometer"].isnull())
```

    0    False
    1    False
    2     True
    3    False
    4    False
    5     True
    6     True
    7     True
    8    False
    9    False
    Name: Odometer, dtype: bool
    

```python
## using notnull()

print(car_sales_missing_df.notnull())
```

        Make  Colour  Odometer  Doors  Price
    0   True    True      True   True   True
    1   True    True      True   True   True
    2   True    True     False   True   True
    3   True    True      True   True   True
    4   True    True      True   True   True
    5   True    True     False   True   True
    6   True   False     False   True   True
    7   True    True     False   True  False
    8   True    True      True  False  False
    9  False    True      True   True   True
    

Note here:
* `True` means no `NaN` values
* `False` means for `NaN` values

`isnull()` means having null values so it gives boolean `True` for NaN values. And `notnull()` means having no null values so it gives `True` for no NaN value.

## 2. Filling missing values using `fillna()`, `replace()`.


```python
## Filling missing values  with a single value using `fillna`
print(car_sales_missing_df.fillna(0))
```

         Make Colour  Odometer  Doors    Price
    0  Toyota  White  150043.0    4.0   $4,000
    1   Honda    Red   87899.0    4.0   $5,000
    2  Toyota   Blue       0.0    3.0   $7,000
    3     BMW  Black   11179.0    5.0  $22,000
    4  Nissan  White  213095.0    4.0   $3,500
    5  Toyota  Green       0.0    4.0   $4,500
    6   Honda      0       0.0    4.0   $7,500
    7   Honda   Blue       0.0    4.0        0
    8  Toyota  White   60000.0    0.0        0
    9       0  White   31600.0    4.0   $9,700
    

```python
## Filling missing values with the previous value using `ffill()`
print(car_sales_missing_df.ffill())
```

         Make Colour  Odometer  Doors    Price
    0  Toyota  White  150043.0    4.0   $4,000
    1   Honda    Red   87899.0    4.0   $5,000
    2  Toyota   Blue   87899.0    3.0   $7,000
    3     BMW  Black   11179.0    5.0  $22,000
    4  Nissan  White  213095.0    4.0   $3,500
    5  Toyota  Green  213095.0    4.0   $4,500
    6   Honda  Green  213095.0    4.0   $7,500
    7   Honda   Blue  213095.0    4.0   $7,500
    8  Toyota  White   60000.0    4.0   $7,500
    9  Toyota  White   31600.0    4.0   $9,700
    

```python
## illing null value with the next ones  using 'bfill()'
print(car_sales_missing_df.bfill())
```

         Make Colour  Odometer  Doors    Price
    0  Toyota  White  150043.0    4.0   $4,000
    1   Honda    Red   87899.0    4.0   $5,000
    2  Toyota   Blue   11179.0    3.0   $7,000
    3     BMW  Black   11179.0    5.0  $22,000
    4  Nissan  White  213095.0    4.0   $3,500
    5  Toyota  Green   60000.0    4.0   $4,500
    6   Honda   Blue   60000.0    4.0   $7,500
    7   Honda   Blue   60000.0    4.0   $9,700
    8  Toyota  White   60000.0    4.0   $9,700
    9     NaN  White   31600.0    4.0   $9,700
    

#### Filling a null values using `replace()` method

Now we are going to replace the all `NaN` value in the data frame with -125 value

For this we will also need numpy


```python
import numpy as np

print(car_sales_missing_df.replace(to_replace = np.nan, value = -125))
```

         Make Colour  Odometer  Doors    Price
    0  Toyota  White  150043.0    4.0   $4,000
    1   Honda    Red   87899.0    4.0   $5,000
    2  Toyota   Blue    -125.0    3.0   $7,000
    3     BMW  Black   11179.0    5.0  $22,000
    4  Nissan  White  213095.0    4.0   $3,500
    5  Toyota  Green    -125.0    4.0   $4,500
    6   Honda   -125    -125.0    4.0   $7,500
    7   Honda   Blue    -125.0    4.0     -125
    8  Toyota  White   60000.0 -125.0     -125
    9    -125  White   31600.0    4.0   $9,700
    

## 3. Dropping missing values using `dropna()`

In order to drop a null values from a dataframe, we used `dropna()` function this function drop Rows/Columns of datasets with Null values in different ways.

#### Dropping rows with at least 1 null value. 


```python
print(car_sales_missing_df.dropna(axis = 0))  ##Now we drop rows with at least one Nan value (Null value) 
```

         Make Colour  Odometer  Doors    Price
    0  Toyota  White  150043.0    4.0   $4,000
    1   Honda    Red   87899.0    4.0   $5,000
    3     BMW  Black   11179.0    5.0  $22,000
    4  Nissan  White  213095.0    4.0   $3,500
    

#### Dropping rows if all values in that row are missing.


```python
print(car_sales_missing_df.dropna(how = 'all',axis = 0))  ## If not have leave the row as it is
```

         Make Colour  Odometer  Doors    Price
    0  Toyota  White  150043.0    4.0   $4,000
    1   Honda    Red   87899.0    4.0   $5,000
    2  Toyota   Blue       NaN    3.0   $7,000
    3     BMW  Black   11179.0    5.0  $22,000
    4  Nissan  White  213095.0    4.0   $3,500
    5  Toyota  Green       NaN    4.0   $4,500
    6   Honda    NaN       NaN    4.0   $7,500
    7   Honda   Blue       NaN    4.0      NaN
    8  Toyota  White   60000.0    NaN      NaN
    9     NaN  White   31600.0    4.0   $9,700
    

#### Dropping columns with at least 1 null value


```python
print(car_sales_missing_df.dropna(axis = 1))
```

    Empty DataFrame
    Columns: []
    Index: [0, 1, 2, 3, 4, 5, 6, 7, 8, 9]
    

Now we drop a columns which have at least 1 missing values.

Here the dataset becomes empty after `dropna()` because each column as atleast 1 null value so it remove that columns resulting in an empty dataframe.
Add files via upload 2024-05-26 16:18:11 +00:00			`# Handling Missing Values in Pandas`

Update handling-missing-values.md 2024-05-27 02:51:10 +00:00			`In real life, many datasets arrive with missing data either because it exists and was not collected or it never existed.`
Add files via upload 2024-05-26 16:18:11 +00:00
			`In Pandas missing data is represented by two values:`

			* `None` : None is simply is `keyword` refer as empty or none.
			* `NaN` : Acronym for `Not a Number`.

Update handling-missing-values.md 2024-05-27 02:51:10 +00:00			`There are several useful functions for detecting, removing, and replacing null values in Pandas DataFrame:`
Add files via upload 2024-05-26 16:18:11 +00:00
Update handling-missing-values.md 2024-05-27 02:51:10 +00:00			1. `isnull()`
			2. `notnull()`
			3. `dropna()`
			4. `fillna()`
			5. `replace()`
Add files via upload 2024-05-26 16:18:11 +00:00
			## 2. Checking for missing values using `isnull()` and `notnull()`

			`Let's import pandas and our fancy car-sales dataset having some missing values.`

			```python
			`import pandas as pd`

Update handling-missing-values.md 2024-05-27 02:51:10 +00:00			`car_sales_missing_df = pd.read_csv("Datasets/car-sales-missing-data.csv")`
Add files via upload 2024-05-26 16:18:11 +00:00			`print(car_sales_missing_df)`
			```

			`Make Colour Odometer Doors Price`
			`0 Toyota White 150043.0 4.0 $4,000`
			`1 Honda Red 87899.0 4.0 $5,000`
			`2 Toyota Blue NaN 3.0 $7,000`
			`3 BMW Black 11179.0 5.0 $22,000`
			`4 Nissan White 213095.0 4.0 $3,500`
			`5 Toyota Green NaN 4.0 $4,500`
			`6 Honda NaN NaN 4.0 $7,500`
			`7 Honda Blue NaN 4.0 NaN`
			`8 Toyota White 60000.0 NaN NaN`
			`9 NaN White 31600.0 4.0 $9,700`



			```python
			`## Using isnull()`

			`print(car_sales_missing_df.isnull())`
			```

			`Make Colour Odometer Doors Price`
			`0 False False False False False`
			`1 False False False False False`
			`2 False False True False False`
			`3 False False False False False`
			`4 False False False False False`
			`5 False False True False False`
			`6 False True True False False`
			`7 False False True False True`
			`8 False False False True True`
			`9 True False False False False`


			`Note here:`
			* `True` means for `NaN` values
			* `False` means for no `Nan` values

			If we want to find the number of missing values in each column use `isnull().sum()`.


			```python
			`print(car_sales_missing_df.isnull().sum())`
			```

			`Make 1`
			`Colour 1`
			`Odometer 4`
			`Doors 1`
			`Price 2`
			`dtype: int64`


			`You can also check presense of null values in a single column.`


			```python
			`print(car_sales_missing_df["Odometer"].isnull())`
			```

			`0 False`
			`1 False`
			`2 True`
			`3 False`
			`4 False`
			`5 True`
			`6 True`
			`7 True`
			`8 False`
			`9 False`
			`Name: Odometer, dtype: bool`



			```python
			`## using notnull()`

			`print(car_sales_missing_df.notnull())`
			```

			`Make Colour Odometer Doors Price`
			`0 True True True True True`
			`1 True True True True True`
			`2 True True False True True`
			`3 True True True True True`
			`4 True True True True True`
			`5 True True False True True`
			`6 True False False True True`
			`7 True True False True False`
			`8 True True True False False`
			`9 False True True True True`


			`Note here:`
			* `True` means no `NaN` values
			* `False` means for `NaN` values

Update handling-missing-values.md 2024-05-27 02:51:10 +00:00			`isnull()` means having null values so it gives boolean `True` for NaN values. And `notnull()` means having no null values so it gives `True` for no NaN value.
Add files via upload 2024-05-26 16:18:11 +00:00
			## 2. Filling missing values using `fillna()`, `replace()`.


			```python
			## Filling missing values with a single value using `fillna`
			`print(car_sales_missing_df.fillna(0))`
			```

			`Make Colour Odometer Doors Price`
			`0 Toyota White 150043.0 4.0 $4,000`
			`1 Honda Red 87899.0 4.0 $5,000`
			`2 Toyota Blue 0.0 3.0 $7,000`
			`3 BMW Black 11179.0 5.0 $22,000`
			`4 Nissan White 213095.0 4.0 $3,500`
			`5 Toyota Green 0.0 4.0 $4,500`
			`6 Honda 0 0.0 4.0 $7,500`
			`7 Honda Blue 0.0 4.0 0`
			`8 Toyota White 60000.0 0.0 0`
			`9 0 White 31600.0 4.0 $9,700`



			```python
			## Filling missing values with the previous value using `ffill()`
			`print(car_sales_missing_df.ffill())`
			```

			`Make Colour Odometer Doors Price`
			`0 Toyota White 150043.0 4.0 $4,000`
			`1 Honda Red 87899.0 4.0 $5,000`
			`2 Toyota Blue 87899.0 3.0 $7,000`
			`3 BMW Black 11179.0 5.0 $22,000`
			`4 Nissan White 213095.0 4.0 $3,500`
			`5 Toyota Green 213095.0 4.0 $4,500`
			`6 Honda Green 213095.0 4.0 $7,500`
			`7 Honda Blue 213095.0 4.0 $7,500`
			`8 Toyota White 60000.0 4.0 $7,500`
			`9 Toyota White 31600.0 4.0 $9,700`



			```python
			`## illing null value with the next ones using 'bfill()'`
			`print(car_sales_missing_df.bfill())`
			```

			`Make Colour Odometer Doors Price`
			`0 Toyota White 150043.0 4.0 $4,000`
			`1 Honda Red 87899.0 4.0 $5,000`
			`2 Toyota Blue 11179.0 3.0 $7,000`
			`3 BMW Black 11179.0 5.0 $22,000`
			`4 Nissan White 213095.0 4.0 $3,500`
			`5 Toyota Green 60000.0 4.0 $4,500`
			`6 Honda Blue 60000.0 4.0 $7,500`
			`7 Honda Blue 60000.0 4.0 $9,700`
			`8 Toyota White 60000.0 4.0 $9,700`
			`9 NaN White 31600.0 4.0 $9,700`


			#### Filling a null values using `replace()` method

Update handling-missing-values.md 2024-05-27 02:51:10 +00:00			Now we are going to replace the all `NaN` value in the data frame with -125 value
Add files via upload 2024-05-26 16:18:11 +00:00
Update handling-missing-values.md 2024-05-27 02:51:10 +00:00			`For this we will also need numpy`
Add files via upload 2024-05-26 16:18:11 +00:00

			```python
			`import numpy as np`

Update handling-missing-values.md 2024-05-27 02:51:10 +00:00			`print(car_sales_missing_df.replace(to_replace = np.nan, value = -125))`
Add files via upload 2024-05-26 16:18:11 +00:00			```

			`Make Colour Odometer Doors Price`
			`0 Toyota White 150043.0 4.0 $4,000`
			`1 Honda Red 87899.0 4.0 $5,000`
			`2 Toyota Blue -125.0 3.0 $7,000`
			`3 BMW Black 11179.0 5.0 $22,000`
			`4 Nissan White 213095.0 4.0 $3,500`
			`5 Toyota Green -125.0 4.0 $4,500`
			`6 Honda -125 -125.0 4.0 $7,500`
			`7 Honda Blue -125.0 4.0 -125`
			`8 Toyota White 60000.0 -125.0 -125`
			`9 -125 White 31600.0 4.0 $9,700`


			## 3. Dropping missing values using `dropna()`

Update handling-missing-values.md 2024-05-27 02:51:10 +00:00			In order to drop a null values from a dataframe, we used `dropna()` function this function drop Rows/Columns of datasets with Null values in different ways.
Add files via upload 2024-05-26 16:18:11 +00:00
			`#### Dropping rows with at least 1 null value.`


			```python
			`print(car_sales_missing_df.dropna(axis = 0)) ##Now we drop rows with at least one Nan value (Null value)`
			```

			`Make Colour Odometer Doors Price`
			`0 Toyota White 150043.0 4.0 $4,000`
			`1 Honda Red 87899.0 4.0 $5,000`
			`3 BMW Black 11179.0 5.0 $22,000`
			`4 Nissan White 213095.0 4.0 $3,500`


			`#### Dropping rows if all values in that row are missing.`


			```python
			`print(car_sales_missing_df.dropna(how = 'all',axis = 0)) ## If not have leave the row as it is`
			```

			`Make Colour Odometer Doors Price`
			`0 Toyota White 150043.0 4.0 $4,000`
			`1 Honda Red 87899.0 4.0 $5,000`
			`2 Toyota Blue NaN 3.0 $7,000`
			`3 BMW Black 11179.0 5.0 $22,000`
			`4 Nissan White 213095.0 4.0 $3,500`
			`5 Toyota Green NaN 4.0 $4,500`
			`6 Honda NaN NaN 4.0 $7,500`
			`7 Honda Blue NaN 4.0 NaN`
			`8 Toyota White 60000.0 NaN NaN`
			`9 NaN White 31600.0 4.0 $9,700`


			`#### Dropping columns with at least 1 null value`


			```python
			`print(car_sales_missing_df.dropna(axis = 1))`
			```

			`Empty DataFrame`
			`Columns: []`
			`Index: [0, 1, 2, 3, 4, 5, 6, 7, 8, 9]`


			`Now we drop a columns which have at least 1 missing values.`

Update handling-missing-values.md 2024-05-27 02:51:10 +00:00			Here the dataset becomes empty after `dropna()` because each column as atleast 1 null value so it remove that columns resulting in an empty dataframe.