What is a Missing Value?¶
In data analysis, a missing value refers to the absence of valid data in certain rows or columns of a dataset. For example, a student’s math score not being filled in a report card, or a company’s sales data for a quarter being lost. In pandas, missing values are typically represented by NaN (Not a Number), though None might also be encountered. However, NaN is the more standard representation for numerical missing values.
How to Check for Missing Values?¶
Before handling missing values, the first step is to identify where they exist in the data. pandas provides two simple and practical methods:
1. isnull(): Mark Missing Values¶
df.isnull() returns a boolean DataFrame where True indicates a missing value and False indicates valid data.
import pandas as pd
import numpy as np
# Create a sample DataFrame with missing values
data = {
'Name': ['Xiaoming', 'Xiaohong', 'Xiaogang', 'Xiaoli', 'Xiaoqiang'],
'Age': [20, np.nan, 22, 21, np.nan],
'Score': [85, 92, np.nan, 78, 88]
}
df = pd.DataFrame(data)
# Check if any values are missing
print(df.isnull())
Output:
Name Age Score
0 False False False
1 False True False
2 False False True
3 False False False
4 False True False
2. isnull().sum(): Count Missing Values¶
df.isnull().sum() quickly counts the total number of missing values per column, making it easy to identify problematic areas.
# Count missing values per column
print(df.isnull().sum())
Output:
Name 0
Age 2
Score 1
dtype: int64
3. info(): Overview of Missing Values¶
df.info() provides information about the data type and the count of non-null values per column, allowing a visual assessment of missing value distribution.
df.info()
Output:
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 5 entries, 0 to 4
Data columns (total 3 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 Name 5 non-null object
1 Age 3 non-null float64
2 Score 4 non-null float64
dtypes: float64(2), object(1)
memory usage: 240.0+ bytes
Methods to Handle Missing Values¶
The core idea for handling missing values is: Either remove the missing values or fill them with reasonable values.
Method 1: Dropping Missing Values (dropna)¶
When the number of missing values is small and does not affect the overall data, rows or columns containing missing values can be directly removed.
1. Dropping Rows with Missing Values (Default)¶
df.dropna(axis=0, how='any')
- axis=0: Drop rows (default); axis=1: Drop columns.
- how='any': Drop the row/column if there is at least one missing value; how='all': Drop only if all values are missing.
# Drop all rows with missing values
df_drop_rows = df.dropna(axis=0)
print(df_drop_rows)
Output (only rows without missing values remain):
Name Age Score
0 Xiaoming 20.0 85.0
3 Xiaoli 21.0 78.0
2. Dropping Columns with Missing Values¶
# Drop all columns with missing values
df_drop_cols = df.dropna(axis=1)
print(df_drop_cols)
Output (only columns without missing values remain):
Name
0 Xiaoming
1 Xiaohong
2 Xiaogang
3 Xiaoli
4 Xiaoqiang
Method 2: Filling Missing Values (fillna)¶
When there are many missing values or deleting them would lose critical information, filling missing values is more common. pandas’ fillna() supports multiple filling methods.
1. Filling with a Fixed Value¶
Directly fill all missing values with a fixed value (e.g., 0, -1).
# Fill all missing values with 0
df_fill_0 = df.fillna(0)
print(df_fill_0)
Output:
Name Age Score
0 Xiaoming 20.0 85.0
1 Xiaohong 0.0 92.0
2 Xiaogang 22.0 0.0
3 Xiaoli 21.0 78.0
4 Xiaoqiang 0.0 88.0
2. Filling with Statistical Measures (Mean/Median/Mode)¶
Numerical data is suitable for filling with statistical measures:
- Mean: Suitable for symmetrically distributed data (e.g., normal distribution).
- Median: Suitable for skewed data (robust to extreme values).
- Mode: Suitable for categorical data (most frequent value).
# Fill missing values in the Age column with its mean
df['Age'] = df['Age'].fillna(df['Age'].mean())
print(df['Age'])
Output (the mean age is (20 + 22 + 21)/3 = 21):
0 20.0
1 21.0
2 22.0
3 21.0
4 21.0
Name: Age, dtype: float64
3. Forward/Backward Filling (ffill/bfill)¶
Suitable for time-series data or ordered data (e.g., sales data ordered by time).
- ffill(): Fill missing values with the previous non-missing value.
- bfill(): Fill missing values with the next non-missing value.
# Simulate time-series data
df_time = pd.DataFrame({'value': [1, np.nan, 3, np.nan, 5]})
# Forward fill
df_forward = df_time.fillna(method='ffill')
print("Forward Filled Result:")
print(df_forward)
# Backward fill
df_backward = df_time.fillna(method='bfill')
print("\nBackward Filled Result:")
print(df_backward)
Output:
Forward Filled Result:
value
0 1.0
1 1.0
2 3.0
3 3.0
4 5.0
Backward Filled Result:
value
0 1.0
1 3.0
2 3.0
3 5.0
4 5.0
Comprehensive Practical Case¶
Suppose we have e-commerce order data and need to handle missing values.
Data Preparation¶
# Order data
orders = {
'OrderID': [101, 102, 103, 104, 105],
'CustomerName': ['A', 'B', 'C', 'D', 'E'],
'Amount': [150, np.nan, 200, np.nan, 80],
'PaymentMethod': ['WeChat', 'Alipay', np.nan, 'WeChat', 'Alipay']
}
df_orders = pd.DataFrame(orders)
Step 1: Check for Missing Values¶
print(df_orders.isnull().sum())
Output:
OrderID 0
CustomerName 0
Amount 2
PaymentMethod 1
dtype: int64
Step 2: Handle Missing Values¶
- Amount column: Numerical type, fill with the mean.
- PaymentMethod column: Categorical type, fill with the mode (most frequent value).
# Fill missing values in the Amount column (mean)
df_orders['Amount'] = df_orders['Amount'].fillna(df_orders['Amount'].mean())
# Fill missing values in the PaymentMethod column (mode)
df_orders['PaymentMethod'] = df_orders['PaymentMethod'].fillna(df_orders['PaymentMethod'].mode()[0])
print(df_orders)
Output:
OrderID CustomerName Amount PaymentMethod
0 101 A 150.000000 WeChat
1 102 B 143.333333 Alipay # Mean = (150 + 200 + 80)/3 ≈ 143.33
2 103 C 200.000000 Alipay # Mode is Alipay (appears 2 times)
3 104 D 143.333333 WeChat
4 105 E 80.000000 Alipay
Summary¶
The core steps for handling missing values:
1. Check for missing values: Use isnull() and isnull().sum() to locate issues.
2. Choose a handling strategy:
- Few missing values (<5%) and unimportant → Drop (using dropna()).
- Many missing values or critical data → Fill (using fillna()), prefer statistical measures (mean/median) or fixed values.
3. Verify results: Recheck the data after processing to ensure missing values have been properly handled.
There is no “one-size-fits-all” method for handling missing values; it requires flexible selection based on data characteristics and business needs to avoid data bias from improper handling.