What is a Missing Value?

In data analysis, a missing value refers to the absence of valid data in certain rows or columns of a dataset. For example, a student’s math score not being filled in a report card, or a company’s sales data for a quarter being lost. In pandas, missing values are typically represented by NaN (Not a Number), though None might also be encountered. However, NaN is the more standard representation for numerical missing values.

How to Check for Missing Values?

Before handling missing values, the first step is to identify where they exist in the data. pandas provides two simple and practical methods:

1. isnull(): Mark Missing Values

df.isnull() returns a boolean DataFrame where True indicates a missing value and False indicates valid data.

import pandas as pd
import numpy as np

# Create a sample DataFrame with missing values
data = {
    'Name': ['Xiaoming', 'Xiaohong', 'Xiaogang', 'Xiaoli', 'Xiaoqiang'],
    'Age': [20, np.nan, 22, 21, np.nan],
    'Score': [85, 92, np.nan, 78, 88]
}
df = pd.DataFrame(data)

# Check if any values are missing
print(df.isnull())

Output:

    Name    Age  Score
0  False  False  False
1  False   True  False
2  False  False   True
3  False  False  False
4  False   True  False

2. isnull().sum(): Count Missing Values

df.isnull().sum() quickly counts the total number of missing values per column, making it easy to identify problematic areas.

# Count missing values per column
print(df.isnull().sum())

Output:

Name     0
Age      2
Score    1
dtype: int64

3. info(): Overview of Missing Values

df.info() provides information about the data type and the count of non-null values per column, allowing a visual assessment of missing value distribution.

df.info()

Output:

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 5 entries, 0 to 4
Data columns (total 3 columns):
 #   Column  Non-Null Count  Dtype  
---  ------  --------------  -----  
 0   Name      5 non-null      object 
 1   Age       3 non-null      float64
 2   Score     4 non-null      float64
dtypes: float64(2), object(1)
memory usage: 240.0+ bytes

Methods to Handle Missing Values

The core idea for handling missing values is: Either remove the missing values or fill them with reasonable values.

Method 1: Dropping Missing Values (dropna)

When the number of missing values is small and does not affect the overall data, rows or columns containing missing values can be directly removed.

1. Dropping Rows with Missing Values (Default)

df.dropna(axis=0, how='any')
- axis=0: Drop rows (default); axis=1: Drop columns.
- how='any': Drop the row/column if there is at least one missing value; how='all': Drop only if all values are missing.

# Drop all rows with missing values
df_drop_rows = df.dropna(axis=0)
print(df_drop_rows)

Output (only rows without missing values remain):

    Name   Age  Score
0  Xiaoming  20.0   85.0
3  Xiaoli  21.0   78.0

2. Dropping Columns with Missing Values

# Drop all columns with missing values
df_drop_cols = df.dropna(axis=1)
print(df_drop_cols)

Output (only columns without missing values remain):

    Name
0  Xiaoming
1  Xiaohong
2  Xiaogang
3  Xiaoli
4  Xiaoqiang

Method 2: Filling Missing Values (fillna)

When there are many missing values or deleting them would lose critical information, filling missing values is more common. pandas’ fillna() supports multiple filling methods.

1. Filling with a Fixed Value

Directly fill all missing values with a fixed value (e.g., 0, -1).

# Fill all missing values with 0
df_fill_0 = df.fillna(0)
print(df_fill_0)

Output:

    Name    Age  Score
0  Xiaoming  20.0   85.0
1  Xiaohong   0.0   92.0
2  Xiaogang  22.0    0.0
3  Xiaoli  21.0   78.0
4  Xiaoqiang   0.0   88.0

2. Filling with Statistical Measures (Mean/Median/Mode)

Numerical data is suitable for filling with statistical measures:
- Mean: Suitable for symmetrically distributed data (e.g., normal distribution).
- Median: Suitable for skewed data (robust to extreme values).
- Mode: Suitable for categorical data (most frequent value).

# Fill missing values in the Age column with its mean
df['Age'] = df['Age'].fillna(df['Age'].mean())
print(df['Age'])

Output (the mean age is (20 + 22 + 21)/3 = 21):

0    20.0
1    21.0
2    22.0
3    21.0
4    21.0
Name: Age, dtype: float64

3. Forward/Backward Filling (ffill/bfill)

Suitable for time-series data or ordered data (e.g., sales data ordered by time).
- ffill(): Fill missing values with the previous non-missing value.
- bfill(): Fill missing values with the next non-missing value.

# Simulate time-series data
df_time = pd.DataFrame({'value': [1, np.nan, 3, np.nan, 5]})

# Forward fill
df_forward = df_time.fillna(method='ffill')
print("Forward Filled Result:")
print(df_forward)

# Backward fill
df_backward = df_time.fillna(method='bfill')
print("\nBackward Filled Result:")
print(df_backward)

Output:

Forward Filled Result:
   value
0    1.0
1    1.0
2    3.0
3    3.0
4    5.0

Backward Filled Result:
   value
0    1.0
1    3.0
2    3.0
3    5.0
4    5.0

Comprehensive Practical Case

Suppose we have e-commerce order data and need to handle missing values.

Data Preparation

# Order data
orders = {
    'OrderID': [101, 102, 103, 104, 105],
    'CustomerName': ['A', 'B', 'C', 'D', 'E'],
    'Amount': [150, np.nan, 200, np.nan, 80],
    'PaymentMethod': ['WeChat', 'Alipay', np.nan, 'WeChat', 'Alipay']
}
df_orders = pd.DataFrame(orders)

Step 1: Check for Missing Values

print(df_orders.isnull().sum())

Output:

OrderID       0
CustomerName  0
Amount        2
PaymentMethod 1
dtype: int64

Step 2: Handle Missing Values

  • Amount column: Numerical type, fill with the mean.
  • PaymentMethod column: Categorical type, fill with the mode (most frequent value).
# Fill missing values in the Amount column (mean)
df_orders['Amount'] = df_orders['Amount'].fillna(df_orders['Amount'].mean())

# Fill missing values in the PaymentMethod column (mode)
df_orders['PaymentMethod'] = df_orders['PaymentMethod'].fillna(df_orders['PaymentMethod'].mode()[0])

print(df_orders)

Output:

   OrderID CustomerName     Amount PaymentMethod
0      101            A  150.000000       WeChat
1      102            B  143.333333       Alipay  # Mean = (150 + 200 + 80)/3 ≈ 143.33
2      103            C  200.000000       Alipay  # Mode is Alipay (appears 2 times)
3      104            D  143.333333       WeChat
4      105            E   80.000000       Alipay

Summary

The core steps for handling missing values:
1. Check for missing values: Use isnull() and isnull().sum() to locate issues.
2. Choose a handling strategy:
- Few missing values (<5%) and unimportant → Drop (using dropna()).
- Many missing values or critical data → Fill (using fillna()), prefer statistical measures (mean/median) or fixed values.
3. Verify results: Recheck the data after processing to ensure missing values have been properly handled.

There is no “one-size-fits-all” method for handling missing values; it requires flexible selection based on data characteristics and business needs to avoid data bias from improper handling.

Xiaoye