# Handling Missing Values in Pandas

**Up until now we have worked with complete data, i.e. data without any missing values. In real-world data, however, missing values are one of the main problems.**

*Many datasets arrive with missing data, either because the value exists but was not collected or because it never existed.*

In Pandas, missing data is represented by two values:

* `None`: a Python keyword that stands for an empty or absent value.
* `NaN`: short for `Not a Number`, a special floating-point value.
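
To make the difference between the two markers concrete, here is a minimal sketch. The values are made up for illustration and are not part of the car-sales data. It shows that a numeric Series stores `None` as `NaN`, and that `isnull()` treats both markers as missing.

```python
import numpy as np
import pandas as pd

# Made-up values, for illustration only
numeric = pd.Series([1.0, None, np.nan])
text = pd.Series(["red", None, np.nan], dtype=object)

# In a numeric Series, None is converted to NaN, so the dtype is float64
print(numeric.dtype)

# Both None and NaN are reported as missing
print(numeric.isnull().tolist())  # [False, True, True]
print(text.isnull().tolist())     # [False, True, True]
```
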

**Pandas provides several useful functions for detecting, removing, and replacing null values in a DataFrame:**

1. `isnull()`
2. `notnull()`
3. `dropna()`
4. `fillna()`
5. `replace()`

## 1. Checking for missing values using `isnull()` and `notnull()`

Let's import pandas and load a car-sales dataset that has some missing values.

```python
import pandas as pd
```

```python
car_sales_missing_df = pd.read_csv("https://raw.githubusercontent.com/kRiShNa-429407/learn-python/main/contrib/pandas/Datasets/car-sales-missing-data.csv")
print(car_sales_missing_df)
```

         Make Colour  Odometer  Doors    Price
    0  Toyota  White  150043.0    4.0   $4,000
    1   Honda    Red   87899.0    4.0   $5,000
    2  Toyota   Blue       NaN    3.0   $7,000
    3     BMW  Black   11179.0    5.0  $22,000
    4  Nissan  White  213095.0    4.0   $3,500
    5  Toyota  Green       NaN    4.0   $4,500
    6   Honda    NaN       NaN    4.0   $7,500
    7   Honda   Blue       NaN    4.0      NaN
    8  Toyota  White   60000.0    NaN      NaN
    9     NaN  White   31600.0    4.0   $9,700

```python
# Using isnull()

print(car_sales_missing_df.isnull())
```

        Make  Colour  Odometer  Doors  Price
    0  False   False     False  False  False
    1  False   False     False  False  False
    2  False   False      True  False  False
    3  False   False     False  False  False
    4  False   False     False  False  False
    5  False   False      True  False  False
    6  False    True      True  False  False
    7  False   False      True  False   True
    8  False   False     False   True   True
    9   True   False     False  False  False

Note here:

* `True` means the value is missing (`NaN`)
* `False` means the value is present (not `NaN`)

To find the number of missing values in each column, use `isnull().sum()`.

```python
print(car_sales_missing_df.isnull().sum())
```

    Make        1
    Colour      1
    Odometer    4
    Doors       1
    Price       2
    dtype: int64
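
Counting is often followed by two related questions: what share of each column is missing, and which rows are affected. The sketch below uses a small made-up frame (`df` is a stand-in, not the car-sales data) and relies on the fact that `True` counts as 1 when a boolean frame is averaged.

```python
import numpy as np
import pandas as pd

# Small made-up frame standing in for the car-sales data
df = pd.DataFrame({
    "Make": ["Toyota", "Honda", None, "BMW"],
    "Odometer": [150043.0, np.nan, np.nan, 11179.0],
})

# Fraction of missing values per column (mean of the True/False mask)
print(df.isnull().mean())

# Total number of missing cells in the whole frame
print(int(df.isnull().sum().sum()))

# Only the rows that contain at least one missing value
print(df[df.isnull().any(axis=1)])
```
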

You can also check for the presence of null values in a single column.

```python
print(car_sales_missing_df["Odometer"].isnull())
```

    0    False
    1    False
    2     True
    3    False
    4    False
    5     True
    6     True
    7     True
    8    False
    9    False
    Name: Odometer, dtype: bool

```python
# Using notnull()

print(car_sales_missing_df.notnull())
```

        Make  Colour  Odometer  Doors  Price
    0   True    True      True   True   True
    1   True    True      True   True   True
    2   True    True     False   True   True
    3   True    True      True   True   True
    4   True    True      True   True   True
    5   True    True     False   True   True
    6   True   False     False   True   True
    7   True    True     False   True  False
    8   True    True      True  False  False
    9  False    True      True   True   True

Note here:

* `True` means the value is present (not `NaN`)
* `False` means the value is missing (`NaN`)

#### A little note here: `isnull()` asks "is this value null?", so it returns `True` for `NaN` values. `notnull()` asks the opposite, so it returns `True` for values that are not `NaN`.
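
Because the two methods return opposite boolean masks, either one can be used to filter rows. The following is a minimal sketch on a made-up three-row frame (the names `df`, `mask_missing`, and `mask_present` are illustrative only).

```python
import numpy as np
import pandas as pd

# Made-up frame, for illustration only
df = pd.DataFrame({
    "Make": ["Toyota", "Honda", "BMW"],
    "Odometer": [150043.0, np.nan, 11179.0],
})

mask_missing = df["Odometer"].isnull()
mask_present = df["Odometer"].notnull()

# The two masks are exact opposites of each other
print((mask_missing == ~mask_present).all())

# Keep only the rows where Odometer is present
print(df[mask_present])

# isna() and notna() are aliases of isnull() and notnull()
print(df.isna().equals(df.isnull()))
```
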

## 2. Filling missing values using `fillna()` and `replace()`

```python
# Filling missing values with a single value using `fillna()`
print(car_sales_missing_df.fillna(0))
```

         Make Colour  Odometer  Doors    Price
    0  Toyota  White  150043.0    4.0   $4,000
    1   Honda    Red   87899.0    4.0   $5,000
    2  Toyota   Blue       0.0    3.0   $7,000
    3     BMW  Black   11179.0    5.0  $22,000
    4  Nissan  White  213095.0    4.0   $3,500
    5  Toyota  Green       0.0    4.0   $4,500
    6   Honda      0       0.0    4.0   $7,500
    7   Honda   Blue       0.0    4.0        0
    8  Toyota  White   60000.0    0.0        0
    9       0  White   31600.0    4.0   $9,700
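
Filling every column with the same value puts a `0` into text columns such as `Make` and `Colour`, which is rarely meaningful. One possible alternative is to pass a dictionary to `fillna()` so that each column gets its own fill value. The frame and the chosen fill values below are assumptions made for this sketch, not recommendations for the real dataset.

```python
import numpy as np
import pandas as pd

# Made-up frame, for illustration only
df = pd.DataFrame({
    "Make": ["Toyota", None, "BMW"],
    "Odometer": [100.0, np.nan, 300.0],
    "Doors": [4.0, 4.0, np.nan],
})

# One fill value per column, keyed by column name
filled = df.fillna({
    "Make": "Unknown",                  # placeholder label (assumed)
    "Odometer": df["Odometer"].mean(),  # mean of the known readings
    "Doors": 4.0,                       # assumed default
})
print(filled)
```
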

```python
# Filling missing values with the previous value using `ffill()`
print(car_sales_missing_df.ffill())
```

         Make Colour  Odometer  Doors    Price
    0  Toyota  White  150043.0    4.0   $4,000
    1   Honda    Red   87899.0    4.0   $5,000
    2  Toyota   Blue   87899.0    3.0   $7,000
    3     BMW  Black   11179.0    5.0  $22,000
    4  Nissan  White  213095.0    4.0   $3,500
    5  Toyota  Green  213095.0    4.0   $4,500
    6   Honda  Green  213095.0    4.0   $7,500
    7   Honda   Blue  213095.0    4.0   $7,500
    8  Toyota  White   60000.0    4.0   $7,500
    9  Toyota  White   31600.0    4.0   $9,700

```python
# Filling missing values with the next value using `bfill()`
print(car_sales_missing_df.bfill())
```

         Make Colour  Odometer  Doors    Price
    0  Toyota  White  150043.0    4.0   $4,000
    1   Honda    Red   87899.0    4.0   $5,000
    2  Toyota   Blue   11179.0    3.0   $7,000
    3     BMW  Black   11179.0    5.0  $22,000
    4  Nissan  White  213095.0    4.0   $3,500
    5  Toyota  Green   60000.0    4.0   $4,500
    6   Honda   Blue   60000.0    4.0   $7,500
    7   Honda   Blue   60000.0    4.0   $9,700
    8  Toyota  White   60000.0    4.0   $9,700
    9     NaN  White   31600.0    4.0   $9,700
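
Notice that `bfill()` left the `Make` value in row 9 as `NaN`: there is no later row to copy from. The same happens with `ffill()` when the very first value is missing. The sketch below shows both edge cases on a made-up Series, and also the `limit` argument, which caps how many consecutive gaps get filled.

```python
import numpy as np
import pandas as pd

# Made-up Series with gaps at the start, middle, and end
s = pd.Series([np.nan, 1.0, np.nan, np.nan, 4.0, np.nan])

print(s.ffill().tolist())          # first value stays NaN (nothing before it)
print(s.bfill().tolist())          # last value stays NaN (nothing after it)
print(s.ffill().bfill().tolist())  # chaining both fills every gap
print(s.ffill(limit=1).tolist())   # fill at most one consecutive NaN
```
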

#### Filling null values using the `replace()` method

**Now we are going to replace every `NaN` value in the DataFrame with the value -125.**

*For this we will also need NumPy.*

```python
import numpy as np
```

```python
print(car_sales_missing_df.replace(to_replace=np.nan, value=-125))
```

         Make Colour  Odometer  Doors    Price
    0  Toyota  White  150043.0    4.0   $4,000
    1   Honda    Red   87899.0    4.0   $5,000
    2  Toyota   Blue    -125.0    3.0   $7,000
    3     BMW  Black   11179.0    5.0  $22,000
    4  Nissan  White  213095.0    4.0   $3,500
    5  Toyota  Green    -125.0    4.0   $4,500
    6   Honda   -125    -125.0    4.0   $7,500
    7   Honda   Blue    -125.0    4.0     -125
    8  Toyota  White   60000.0 -125.0     -125
    9    -125  White   31600.0    4.0   $9,700

## 3. Dropping missing values using `dropna()`

**To drop null values from a DataFrame, we use the `dropna()` function. It can drop rows or columns that contain null values in several different ways.**

#### Dropping rows with at least 1 null value

```python
# Drop every row that has at least one NaN (null) value
print(car_sales_missing_df.dropna(axis=0))
```

         Make Colour  Odometer  Doors    Price
    0  Toyota  White  150043.0    4.0   $4,000
    1   Honda    Red   87899.0    4.0   $5,000
    3     BMW  Black   11179.0    5.0  $22,000
    4  Nissan  White  213095.0    4.0   $3,500

#### Dropping rows if all values in that row are missing

```python
# Drop a row only if every value in it is NaN; all other rows are left as they are
print(car_sales_missing_df.dropna(how='all', axis=0))
```

         Make Colour  Odometer  Doors    Price
    0  Toyota  White  150043.0    4.0   $4,000
    1   Honda    Red   87899.0    4.0   $5,000
    2  Toyota   Blue       NaN    3.0   $7,000
    3     BMW  Black   11179.0    5.0  $22,000
    4  Nissan  White  213095.0    4.0   $3,500
    5  Toyota  Green       NaN    4.0   $4,500
    6   Honda    NaN       NaN    4.0   $7,500
    7   Honda   Blue       NaN    4.0      NaN
    8  Toyota  White   60000.0    NaN      NaN
    9     NaN  White   31600.0    4.0   $9,700
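
Nothing was dropped above because no row of the car-sales data is completely empty. To see `how='all'` actually remove something, here is a minimal sketch on a made-up frame in which row 1 is entirely missing and row 2 is only partly missing.

```python
import numpy as np
import pandas as pd

# Made-up frame: row 1 is entirely missing, row 2 only partly
df = pd.DataFrame({
    "Make": ["Toyota", None, "Honda"],
    "Odometer": [150043.0, np.nan, np.nan],
})

# how="all" drops a row only when every value in it is missing
print(df.dropna(how="all"))

# how="any" (the default) drops a row as soon as one value is missing
print(df.dropna(how="any"))
```
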

#### Dropping columns with at least 1 null value

```python
print(car_sales_missing_df.dropna(axis=1))
```

    Empty DataFrame
    Columns: []
    Index: [0, 1, 2, 3, 4, 5, 6, 7, 8, 9]

Here we dropped every column that has at least one missing value.

**The DataFrame becomes empty after `dropna(axis=1)` because every column has at least one null value, so all of the columns are removed.**
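
When dropping everything with a single gap is too aggressive, `dropna()` also accepts `thresh` (the minimum number of non-null values needed to keep a row or column) and `subset` (the labels to check). The sketch below uses a made-up frame, and the threshold of 3 is an arbitrary choice for illustration.

```python
import numpy as np
import pandas as pd

# Made-up frame, for illustration only
df = pd.DataFrame({
    "Make": ["Toyota", "Honda", "BMW", None],
    "Odometer": [150043.0, np.nan, np.nan, np.nan],
    "Doors": [4.0, 4.0, 5.0, 4.0],
})

# Keep only the columns that have at least 3 non-null values
print(df.dropna(axis=1, thresh=3).columns.tolist())  # ['Make', 'Doors']

# Drop a row only when the "Make" column is missing
print(df.dropna(subset=["Make"]).index.tolist())     # [0, 1, 2]
```
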