
# Handling Missing Values in Pandas

**Up until now we have been working with complete data, i.e. data without any missing values. In real-world datasets, however, missing values are one of the main problems.**

*Many datasets arrive with missing data, either because the data exists but was not collected or because it never existed.*

In Pandas, missing data is represented by two values:

* `None` : a Python keyword (a singleton object) used to denote an empty or absent value.
* `NaN` : short for `Not a Number`, a special floating-point value.
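
As a quick illustration, pandas reports both values as missing. The sketch below uses a small made-up Series (not the car-sales data) and assumes NumPy is installed alongside pandas.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data: one real value, one None, one NaN
values = pd.Series([1.0, None, np.nan])

# In a numeric Series, None is converted to NaN on construction
print(values)

# Both None and NaN are reported as missing
missing_mask = values.isnull()
print(missing_mask.tolist())
```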

**There are several useful functions for detecting, removing, and replacing null values in a Pandas DataFrame:**

1. isnull()
2. notnull()
3. dropna()
4. fillna()
5. replace()

## 1. Checking for missing values using `isnull()` and `notnull()`

Let's import pandas and load our car-sales dataset, which has some missing values.


```python
import pandas as pd
```


```python
car_sales_missing_df = pd.read_csv("https://raw.githubusercontent.com/kRiShNa-429407/learn-python/main/contrib/pandas/Datasets/car-sales-missing-data.csv")
print(car_sales_missing_df)
```

         Make Colour  Odometer  Doors    Price
    0  Toyota  White  150043.0    4.0   $4,000
    1   Honda    Red   87899.0    4.0   $5,000
    2  Toyota   Blue       NaN    3.0   $7,000
    3     BMW  Black   11179.0    5.0  $22,000
    4  Nissan  White  213095.0    4.0   $3,500
    5  Toyota  Green       NaN    4.0   $4,500
    6   Honda    NaN       NaN    4.0   $7,500
    7   Honda   Blue       NaN    4.0      NaN
    8  Toyota  White   60000.0    NaN      NaN
    9     NaN  White   31600.0    4.0   $9,700
    

```python
## Using isnull()

print(car_sales_missing_df.isnull())
```

        Make  Colour  Odometer  Doors  Price
    0  False   False     False  False  False
    1  False   False     False  False  False
    2  False   False      True  False  False
    3  False   False     False  False  False
    4  False   False     False  False  False
    5  False   False      True  False  False
    6  False    True      True  False  False
    7  False   False      True  False   True
    8  False   False     False   True   True
    9   True   False     False  False  False
    

Note here:
* `True` means the value is missing (`NaN`)
* `False` means the value is not missing

If we want to find the number of missing values in each column, we can use `isnull().sum()`.


```python
print(car_sales_missing_df.isnull().sum())
```

    Make        1
    Colour      1
    Odometer    4
    Doors       1
    Price       2
    dtype: int64
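
The same idea extends to other summaries. The sketch below, on a small hand-built DataFrame (hypothetical values, not the car-sales data), counts missing values overall, per row, and as a fraction of each column.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data for illustration only
toy_df = pd.DataFrame({
    "Make": ["Toyota", None, "BMW", "Honda"],
    "Odometer": [150043.0, np.nan, np.nan, 87899.0],
})

# Total number of missing values in the whole DataFrame
total_missing = int(toy_df.isnull().sum().sum())

# Fraction of missing values per column (mean of a boolean column)
missing_fraction = toy_df.isnull().mean()

# Number of missing values in each row
missing_per_row = toy_df.isnull().sum(axis=1)

print(total_missing)
print(missing_fraction)
print(missing_per_row)
```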
    

You can also check for the presence of null values in a single column.


```python
print(car_sales_missing_df["Odometer"].isnull())
```

    0    False
    1    False
    2     True
    3    False
    4    False
    5     True
    6     True
    7     True
    8    False
    9    False
    Name: Odometer, dtype: bool
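
A boolean Series like this can also serve as a mask to select the rows where a value is missing. A minimal sketch, using a small hand-built DataFrame (the column names mirror the car-sales data, but the values are made up for illustration):

```python
import numpy as np
import pandas as pd

# Hypothetical toy data standing in for the car-sales dataset
toy_df = pd.DataFrame({
    "Make": ["Toyota", "Honda", "BMW"],
    "Odometer": [150043.0, np.nan, 11179.0],
})

# Keep only the rows where Odometer is missing
missing_odometer = toy_df[toy_df["Odometer"].isnull()]
print(missing_odometer)

# Keep only the rows where Odometer is present
has_odometer = toy_df[toy_df["Odometer"].notnull()]
print(has_odometer)
```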
    

```python
## Using notnull()

print(car_sales_missing_df.notnull())
```

        Make  Colour  Odometer  Doors  Price
    0   True    True      True   True   True
    1   True    True      True   True   True
    2   True    True     False   True   True
    3   True    True      True   True   True
    4   True    True      True   True   True
    5   True    True     False   True   True
    6   True   False     False   True   True
    7   True    True     False   True  False
    8   True    True      True  False  False
    9  False    True      True   True   True
    

Note here:
* `True` means the value is not missing
* `False` means the value is missing (`NaN`)

#### A little note here: `isnull()` checks for null values, so it gives `True` wherever a value is `NaN`. `notnull()` checks for non-null values, so it gives `True` wherever a value is not `NaN`.
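
To make the relationship concrete, the sketch below checks on a made-up Series that `notnull()` is the element-wise inverse of `isnull()`. It also uses `isna()` and `notna()`, which pandas provides as aliases of the two methods.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data
toy = pd.Series([4.0, np.nan, 3.0])

# notnull() is the element-wise inverse of isnull()
inverse_matches = bool((toy.notnull() == ~toy.isnull()).all())

# isna() / notna() are aliases of isnull() / notnull()
alias_matches = (
    toy.isna().equals(toy.isnull())
    and toy.notna().equals(toy.notnull())
)

print(inverse_matches, alias_matches)
```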

## 2. Filling missing values using `fillna()` and `replace()`


```python
## Filling missing values  with a single value using `fillna`
print(car_sales_missing_df.fillna(0))
```

         Make Colour  Odometer  Doors    Price
    0  Toyota  White  150043.0    4.0   $4,000
    1   Honda    Red   87899.0    4.0   $5,000
    2  Toyota   Blue       0.0    3.0   $7,000
    3     BMW  Black   11179.0    5.0  $22,000
    4  Nissan  White  213095.0    4.0   $3,500
    5  Toyota  Green       0.0    4.0   $4,500
    6   Honda      0       0.0    4.0   $7,500
    7   Honda   Blue       0.0    4.0        0
    8  Toyota  White   60000.0    0.0        0
    9       0  White   31600.0    4.0   $9,700
    

```python
## Filling missing values with the previous value using `ffill()`
print(car_sales_missing_df.ffill())
```

         Make Colour  Odometer  Doors    Price
    0  Toyota  White  150043.0    4.0   $4,000
    1   Honda    Red   87899.0    4.0   $5,000
    2  Toyota   Blue   87899.0    3.0   $7,000
    3     BMW  Black   11179.0    5.0  $22,000
    4  Nissan  White  213095.0    4.0   $3,500
    5  Toyota  Green  213095.0    4.0   $4,500
    6   Honda  Green  213095.0    4.0   $7,500
    7   Honda   Blue  213095.0    4.0   $7,500
    8  Toyota  White   60000.0    4.0   $7,500
    9  Toyota  White   31600.0    4.0   $9,700
    

```python
## Filling missing values with the next value using `bfill()`
print(car_sales_missing_df.bfill())
```

         Make Colour  Odometer  Doors    Price
    0  Toyota  White  150043.0    4.0   $4,000
    1   Honda    Red   87899.0    4.0   $5,000
    2  Toyota   Blue   11179.0    3.0   $7,000
    3     BMW  Black   11179.0    5.0  $22,000
    4  Nissan  White  213095.0    4.0   $3,500
    5  Toyota  Green   60000.0    4.0   $4,500
    6   Honda   Blue   60000.0    4.0   $7,500
    7   Honda   Blue   60000.0    4.0   $9,700
    8  Toyota  White   60000.0    4.0   $9,700
    9     NaN  White   31600.0    4.0   $9,700
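
Filling every column with the same value is not always appropriate (a `0` in a `Colour` column, for example). As a hedged sketch on made-up data, `fillna()` also accepts a dictionary that maps column names to fill values. Using the column mean for `Odometer` is only an illustrative choice here, not a recommendation from the dataset's author.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data for illustration only
toy_df = pd.DataFrame({
    "Colour": ["White", None, "Blue"],
    "Odometer": [100.0, np.nan, 300.0],
})

# Fill each column with its own replacement value:
# a placeholder string for Colour, the column mean for Odometer
filled = toy_df.fillna({
    "Colour": "Unknown",
    "Odometer": toy_df["Odometer"].mean(),
})
print(filled)
```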
    

#### Filling null values using the `replace()` method

**Now we are going to replace all the `NaN` values in the DataFrame with the value -125.**

*For this we will also need NumPy.*


```python
import numpy as np
```


```python
print(car_sales_missing_df.replace(to_replace=np.nan, value=-125))
```

         Make Colour  Odometer  Doors    Price
    0  Toyota  White  150043.0    4.0   $4,000
    1   Honda    Red   87899.0    4.0   $5,000
    2  Toyota   Blue    -125.0    3.0   $7,000
    3     BMW  Black   11179.0    5.0  $22,000
    4  Nissan  White  213095.0    4.0   $3,500
    5  Toyota  Green    -125.0    4.0   $4,500
    6   Honda   -125    -125.0    4.0   $7,500
    7   Honda   Blue    -125.0    4.0     -125
    8  Toyota  White   60000.0 -125.0     -125
    9    -125  White   31600.0    4.0   $9,700
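
Note that `fillna()`, `ffill()`, `bfill()` and `replace()` return a new DataFrame by default and leave the original untouched, which is why each example above still starts from the same missing values. A small sketch on made-up data shows how to keep the result by assigning it back:

```python
import numpy as np
import pandas as pd

# Hypothetical toy data
toy_df = pd.DataFrame({"Doors": [4.0, np.nan, 5.0]})

# fillna() returns a new DataFrame; toy_df itself is unchanged
filled = toy_df.fillna(4.0)
still_missing = int(toy_df["Doors"].isnull().sum())
print(still_missing)

# Assign the result back to keep the change
toy_df = toy_df.fillna(4.0)
print(toy_df)
```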
    

## 3. Dropping missing values using `dropna()`

**To drop null values from a DataFrame, we use the `dropna()` function. It can drop rows or columns that contain null values in several different ways.**

#### Dropping rows with at least 1 null value. 


```python
print(car_sales_missing_df.dropna(axis=0))  ## Drop rows with at least one NaN (null) value
```

         Make Colour  Odometer  Doors    Price
    0  Toyota  White  150043.0    4.0   $4,000
    1   Honda    Red   87899.0    4.0   $5,000
    3     BMW  Black   11179.0    5.0  $22,000
    4  Nissan  White  213095.0    4.0   $3,500
    

#### Dropping rows if all values in that row are missing.


```python
print(car_sales_missing_df.dropna(how='all', axis=0))  ## Rows with at least one non-null value are left as they are
```

         Make Colour  Odometer  Doors    Price
    0  Toyota  White  150043.0    4.0   $4,000
    1   Honda    Red   87899.0    4.0   $5,000
    2  Toyota   Blue       NaN    3.0   $7,000
    3     BMW  Black   11179.0    5.0  $22,000
    4  Nissan  White  213095.0    4.0   $3,500
    5  Toyota  Green       NaN    4.0   $4,500
    6   Honda    NaN       NaN    4.0   $7,500
    7   Honda   Blue       NaN    4.0      NaN
    8  Toyota  White   60000.0    NaN      NaN
    9     NaN  White   31600.0    4.0   $9,700
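
Nothing is dropped above because no row in the car-sales data is entirely empty. To see `how='all'` take effect, the sketch below uses a made-up DataFrame that does contain a fully missing row and compares it with the default `how='any'`.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data: row 1 is entirely missing, row 2 only partly
toy_df = pd.DataFrame({
    "Make": ["Toyota", None, "Honda"],
    "Odometer": [150043.0, np.nan, np.nan],
})

# how='all' drops a row only if every value in it is missing
all_dropped = toy_df.dropna(how="all", axis=0)
print(all_dropped)

# how='any' (the default) drops a row if any value is missing
any_dropped = toy_df.dropna(how="any", axis=0)
print(any_dropped)
```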
    

#### Dropping columns with at least 1 null value


```python
print(car_sales_missing_df.dropna(axis = 1))
```

    Empty DataFrame
    Columns: []
    Index: [0, 1, 2, 3, 4, 5, 6, 7, 8, 9]
    

With `axis = 1`, `dropna()` drops every column that has at least one missing value.

**Here the DataFrame becomes empty after `dropna()` because every column has at least one null value, so all the columns are removed.**
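
When dropping everything with a null value is too aggressive, `dropna()` has two further parameters that narrow the rule: `subset` restricts the check to certain columns, and `thresh` keeps rows that have at least a given number of non-null values. A minimal sketch on made-up data:

```python
import numpy as np
import pandas as pd

# Hypothetical toy data for illustration only
toy_df = pd.DataFrame({
    "Make": ["Toyota", "Honda", None],
    "Colour": ["White", "Red", None],
    "Odometer": [150043.0, np.nan, 31600.0],
})

# Drop a row only when its Odometer value is missing
by_subset = toy_df.dropna(subset=["Odometer"])
print(by_subset)

# Keep rows that have at least 2 non-null values
by_thresh = toy_df.dropna(thresh=2)
print(by_thresh)
```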