learn-python/contrib/pandas/handling-missing-values.md

265 wiersze
7.7 KiB
Markdown
Czysty Zwykły widok Historia

2024-05-26 16:18:11 +00:00
# Handling Missing Values in Pandas
2024-05-27 02:51:10 +00:00
In real life, many datasets arrive with missing data either because it exists and was not collected or it never existed.
2024-05-26 16:18:11 +00:00
In Pandas missing data is represented by two values:
* `None` : None is simply is `keyword` refer as empty or none.
* `NaN` : Acronym for `Not a Number`.
2024-05-27 02:51:10 +00:00
There are several useful functions for detecting, removing, and replacing null values in Pandas DataFrame:
2024-05-26 16:18:11 +00:00
2024-05-27 02:51:10 +00:00
1. `isnull()`
2. `notnull()`
3. `dropna()`
4. `fillna()`
5. `replace()`
2024-05-26 16:18:11 +00:00
## 2. Checking for missing values using `isnull()` and `notnull()`
Let's import pandas and our fancy car-sales dataset having some missing values.
```python
import pandas as pd
2024-05-27 02:51:10 +00:00
car_sales_missing_df = pd.read_csv("Datasets/car-sales-missing-data.csv")
2024-05-26 16:18:11 +00:00
print(car_sales_missing_df)
```
Make Colour Odometer Doors Price
0 Toyota White 150043.0 4.0 $4,000
1 Honda Red 87899.0 4.0 $5,000
2 Toyota Blue NaN 3.0 $7,000
3 BMW Black 11179.0 5.0 $22,000
4 Nissan White 213095.0 4.0 $3,500
5 Toyota Green NaN 4.0 $4,500
6 Honda NaN NaN 4.0 $7,500
7 Honda Blue NaN 4.0 NaN
8 Toyota White 60000.0 NaN NaN
9 NaN White 31600.0 4.0 $9,700
```python
## Using isnull()
print(car_sales_missing_df.isnull())
```
Make Colour Odometer Doors Price
0 False False False False False
1 False False False False False
2 False False True False False
3 False False False False False
4 False False False False False
5 False False True False False
6 False True True False False
7 False False True False True
8 False False False True True
9 True False False False False
Note here:
* `True` means for `NaN` values
* `False` means for no `Nan` values
If we want to find the number of missing values in each column use `isnull().sum()`.
```python
print(car_sales_missing_df.isnull().sum())
```
Make 1
Colour 1
Odometer 4
Doors 1
Price 2
dtype: int64
You can also check presense of null values in a single column.
```python
print(car_sales_missing_df["Odometer"].isnull())
```
0 False
1 False
2 True
3 False
4 False
5 True
6 True
7 True
8 False
9 False
Name: Odometer, dtype: bool
```python
## using notnull()
print(car_sales_missing_df.notnull())
```
Make Colour Odometer Doors Price
0 True True True True True
1 True True True True True
2 True True False True True
3 True True True True True
4 True True True True True
5 True True False True True
6 True False False True True
7 True True False True False
8 True True True False False
9 False True True True True
Note here:
* `True` means no `NaN` values
* `False` means for `NaN` values
2024-05-27 02:51:10 +00:00
`isnull()` means having null values so it gives boolean `True` for NaN values. And `notnull()` means having no null values so it gives `True` for no NaN value.
2024-05-26 16:18:11 +00:00
## 2. Filling missing values using `fillna()`, `replace()`.
```python
## Filling missing values with a single value using `fillna`
print(car_sales_missing_df.fillna(0))
```
Make Colour Odometer Doors Price
0 Toyota White 150043.0 4.0 $4,000
1 Honda Red 87899.0 4.0 $5,000
2 Toyota Blue 0.0 3.0 $7,000
3 BMW Black 11179.0 5.0 $22,000
4 Nissan White 213095.0 4.0 $3,500
5 Toyota Green 0.0 4.0 $4,500
6 Honda 0 0.0 4.0 $7,500
7 Honda Blue 0.0 4.0 0
8 Toyota White 60000.0 0.0 0
9 0 White 31600.0 4.0 $9,700
```python
## Filling missing values with the previous value using `ffill()`
print(car_sales_missing_df.ffill())
```
Make Colour Odometer Doors Price
0 Toyota White 150043.0 4.0 $4,000
1 Honda Red 87899.0 4.0 $5,000
2 Toyota Blue 87899.0 3.0 $7,000
3 BMW Black 11179.0 5.0 $22,000
4 Nissan White 213095.0 4.0 $3,500
5 Toyota Green 213095.0 4.0 $4,500
6 Honda Green 213095.0 4.0 $7,500
7 Honda Blue 213095.0 4.0 $7,500
8 Toyota White 60000.0 4.0 $7,500
9 Toyota White 31600.0 4.0 $9,700
```python
## illing null value with the next ones using 'bfill()'
print(car_sales_missing_df.bfill())
```
Make Colour Odometer Doors Price
0 Toyota White 150043.0 4.0 $4,000
1 Honda Red 87899.0 4.0 $5,000
2 Toyota Blue 11179.0 3.0 $7,000
3 BMW Black 11179.0 5.0 $22,000
4 Nissan White 213095.0 4.0 $3,500
5 Toyota Green 60000.0 4.0 $4,500
6 Honda Blue 60000.0 4.0 $7,500
7 Honda Blue 60000.0 4.0 $9,700
8 Toyota White 60000.0 4.0 $9,700
9 NaN White 31600.0 4.0 $9,700
#### Filling a null values using `replace()` method
2024-05-27 02:51:10 +00:00
Now we are going to replace the all `NaN` value in the data frame with -125 value
2024-05-26 16:18:11 +00:00
2024-05-27 02:51:10 +00:00
For this we will also need numpy
2024-05-26 16:18:11 +00:00
```python
import numpy as np
2024-05-27 02:51:10 +00:00
print(car_sales_missing_df.replace(to_replace = np.nan, value = -125))
2024-05-26 16:18:11 +00:00
```
Make Colour Odometer Doors Price
0 Toyota White 150043.0 4.0 $4,000
1 Honda Red 87899.0 4.0 $5,000
2 Toyota Blue -125.0 3.0 $7,000
3 BMW Black 11179.0 5.0 $22,000
4 Nissan White 213095.0 4.0 $3,500
5 Toyota Green -125.0 4.0 $4,500
6 Honda -125 -125.0 4.0 $7,500
7 Honda Blue -125.0 4.0 -125
8 Toyota White 60000.0 -125.0 -125
9 -125 White 31600.0 4.0 $9,700
## 3. Dropping missing values using `dropna()`
2024-05-27 02:51:10 +00:00
In order to drop a null values from a dataframe, we used `dropna()` function this function drop Rows/Columns of datasets with Null values in different ways.
2024-05-26 16:18:11 +00:00
#### Dropping rows with at least 1 null value.
```python
print(car_sales_missing_df.dropna(axis = 0)) ##Now we drop rows with at least one Nan value (Null value)
```
Make Colour Odometer Doors Price
0 Toyota White 150043.0 4.0 $4,000
1 Honda Red 87899.0 4.0 $5,000
3 BMW Black 11179.0 5.0 $22,000
4 Nissan White 213095.0 4.0 $3,500
#### Dropping rows if all values in that row are missing.
```python
print(car_sales_missing_df.dropna(how = 'all',axis = 0)) ## If not have leave the row as it is
```
Make Colour Odometer Doors Price
0 Toyota White 150043.0 4.0 $4,000
1 Honda Red 87899.0 4.0 $5,000
2 Toyota Blue NaN 3.0 $7,000
3 BMW Black 11179.0 5.0 $22,000
4 Nissan White 213095.0 4.0 $3,500
5 Toyota Green NaN 4.0 $4,500
6 Honda NaN NaN 4.0 $7,500
7 Honda Blue NaN 4.0 NaN
8 Toyota White 60000.0 NaN NaN
9 NaN White 31600.0 4.0 $9,700
#### Dropping columns with at least 1 null value
```python
print(car_sales_missing_df.dropna(axis = 1))
```
Empty DataFrame
Columns: []
Index: [0, 1, 2, 3, 4, 5, 6, 7, 8, 9]
Now we drop a columns which have at least 1 missing values.
2024-05-27 02:51:10 +00:00
Here the dataset becomes empty after `dropna()` because each column as atleast 1 null value so it remove that columns resulting in an empty dataframe.