In data analysis, data cleaning is like food preparation before cooking: if the ingredients are not fresh (the data is problematic), the finished dish (the analysis result) will inevitably suffer. As the most powerful data processing library in Python, pandas is a veritable “Swiss Army knife” for data cleaning, helping us tackle all kinds of data issues efficiently. This article walks beginners through the core techniques of data cleaning in the simplest way possible, so you can bid farewell to “dirty data” troubles!
1. Installation and Preparation: Get Your Tools Ready
First, ensure pandas is installed on your computer (if not, open the command line and run pip install pandas). Then, import the pandas library in your Python script or Jupyter Notebook:
import pandas as pd # Conventionally using 'pd' as an alias
2. Data Import: View Data from the “Source”
The first step in data cleaning is to import data into pandas. The most common method is importing CSV files using pd.read_csv():
# If you have a CSV file named "sales_data.csv", read it directly
df = pd.read_csv("sales_data.csv")
# If no file is available, create a sample DataFrame for testing
data = {
"Name": ["Xiaoming", "Xiaohong", "Xiaogang", "Xiaoming", "Xiaofang", "Xiaoli", "Xiaoqiang"],
"Age": [25, None, 30, 25, 22, 28, 29], # None represents missing values
"BirthDate": ["1998-01-15", "1999-03-20", "1995-05-10", "1998-01-15", "2000-08-22", "2000-11-05", "1995-12-18"],
"Income": [8000, None, 12000, 8000, 5000, None, 15000] # Income has missing values
}
df = pd.DataFrame(data)
Tip: After importing, use df.head() to quickly view the first few rows and df.info() to check data types and missing values:
print("First 5 rows of data:")
print(df.head())
print("\nBasic information of data:")
print(df.info())
Running this will show you the overall data structure (including missing values, duplicate rows, incorrect data types, etc.), which serves as the “target” for subsequent cleaning.
3. Cleaning Step 1: Handle Missing Values (What if Data is “Empty”?)
Missing values are the most common issue in data cleaning, represented by NaN or None in pandas. There are two main handling methods: deletion or imputation.
1. Identify Missing Values
Use df.isnull() to check where empty values are and df.isnull().sum() to count the number of missing values per column:
print("Number of missing values per column:")
print(df.isnull().sum())
The output will show: 1 missing value in the “Age” column and 2 missing values in the “Income” column.
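Counting is useful, but it also helps to inspect the incomplete rows themselves. A small sketch (using a trimmed, invented version of the sample data) shows how a boolean mask from isnull().any(axis=1) pulls out every row with at least one missing value:

```python
import pandas as pd

# Trimmed sample data mirroring the article's DataFrame
df = pd.DataFrame({
    "Name": ["Xiaoming", "Xiaohong", "Xiaogang"],
    "Age": [25, None, 30],
    "Income": [8000, None, 12000],
})

# Boolean mask: True for rows where any column is missing
missing_mask = df.isnull().any(axis=1)

# Slice out only the incomplete rows for inspection
incomplete_rows = df[missing_mask]
print(incomplete_rows)
```

Here only Xiaohong's row is returned, since it is the only one containing missing values.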
2. Delete Missing Values (Simple but Use with Caution)
If missing values are rare and do not affect the overall data, use dropna() to delete rows or columns with missing values:
# Delete all rows containing missing values (default axis=0, i.e., rows)
df_drop = df.dropna()
# To delete columns (e.g., if a column is completely missing values), use axis=1
# df_drop = df.dropna(axis=1)
print("Number of rows after deleting missing values:", df_drop.shape[0]) # Originally 7 rows, now 5 (Xiaohong's and Xiaoli's rows both contain missing values)
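Deleting every incomplete row is often too aggressive. dropna() also accepts subset (only consider certain columns) and thresh (keep rows with at least N non-missing values) for finer control; a quick sketch with invented three-row data:

```python
import pandas as pd

df = pd.DataFrame({
    "Name": ["A", "B", "C"],
    "Age": [25, None, 30],
    "Income": [8000, None, None],
})

# Drop a row only when a specific column is missing
by_age = df.dropna(subset=["Age"])  # drops row B (missing Age)

# Keep rows that have at least 2 non-missing values
by_thresh = df.dropna(thresh=2)     # drops row B (only "Name" is filled)
```

Row C survives both calls: its Age is present, and it still has 2 non-missing values.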
3. Impute Missing Values (More Common Approach)
Impute missing values based on business logic, such as using mean, median, or directly filling with “Unknown”:
# Fill numeric columns (Age and Income) with mean
age_mean = df["Age"].mean() # Calculate average age (159/6 = 26.5, with 6 valid data points)
df["Age"] = df["Age"].fillna(age_mean)
# Use median for Income (more robust against outliers)
income_median = df["Income"].median() # Median of valid incomes: 8000, 12000, 8000, 5000, 15000 → 8000
df["Income"] = df["Income"].fillna(income_median)
print("Filled Age and Income:")
print(df[["Age", "Income"]])
Tip: For columns with fixed rules (e.g., “Gender” with missing values), directly fill with “Unknown”: df["Gender"] = "Unknown".
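fillna() can also take a dict mapping column names to fill values, which applies a different rule to each column in a single call; a minimal sketch with invented data:

```python
import pandas as pd

df = pd.DataFrame({
    "Age": [25, None, 30],
    "Income": [8000, None, 12000],
    "Gender": ["M", None, "F"],
})

# One fillna call, a different rule per column
df = df.fillna({
    "Age": df["Age"].mean(),          # numeric: mean (27.5)
    "Income": df["Income"].median(),  # numeric: median (10000)
    "Gender": "Unknown",              # categorical: placeholder
})
print(df.isnull().sum().sum())  # 0 — no missing values left
```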
4. Cleaning Step 2: Handle Duplicate Values (What if Data is “Repeated”?)
Duplicate rows can distort analysis results (e.g., Xiaoming’s income being counted twice). pandas uses duplicated() to identify duplicates and drop_duplicates() to remove them:
# Identify duplicate rows (considered duplicates if all columns are identical)
print("Are there duplicate rows?", df.duplicated().any()) # Output: True (Xiaoming's two rows are duplicates)
# Remove duplicate rows (keep the first occurrence by default, keep='first')
df_clean = df.drop_duplicates()
print("Number of rows after removing duplicates:", df_clean.shape[0]) # Originally 7 rows → 6 rows (Xiaoming's duplicate row removed)
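By default every column must match for a row to count as a duplicate, but drop_duplicates() also takes subset and keep arguments, e.g. to keep only each person's most recent record. A sketch with invented data:

```python
import pandas as pd

df = pd.DataFrame({
    "Name": ["Xiaoming", "Xiaohong", "Xiaoming"],
    "Income": [8000, 9000, 8500],
})

# Treat rows as duplicates when "Name" matches; keep the last one seen
latest = df.drop_duplicates(subset=["Name"], keep="last")
```

Xiaoming's earlier 8000 record is dropped and the later 8500 record is kept.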
5. Cleaning Step 3: Handle Outliers (What if Data is “Out of Bounds”?)
Outliers are values that clearly deviate from the normal range (e.g., age 200, income 1,000,000). First, identify them via descriptive statistics, then handle:
1. Find Outliers with Descriptive Statistics
Use df.describe() to check statistical information of numeric columns, e.g., max/min values in the “Income” column:
print("Statistical information of numeric columns:")
print(df_clean.describe())
If an outlier (e.g., extremely high income) is found, filter it using logical conditions:
# Assume income > 20000 is considered an outlier (business judgment)
df_clean = df_clean[df_clean["Income"] <= 20000].copy() # all current incomes are within range, so no rows are dropped here; .copy() avoids a SettingWithCopyWarning when modifying df_clean later
Advanced Method: For unclear ranges, use boxplots (df_clean.boxplot(column='Income')), but this requires matplotlib. Beginners can start with logical filtering.
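Besides boxplots, the interquartile-range (IQR) rule is a common statistical way to flag outliers when no business threshold is known: values outside Q1 − 1.5·IQR to Q3 + 1.5·IQR are suspicious. A minimal sketch on a standalone income series (the 1,000,000 value is an invented extreme for illustration):

```python
import pandas as pd

income = pd.Series([8000, 12000, 8000, 5000, 15000, 1_000_000])

# IQR rule: flag values outside [Q1 - 1.5*IQR, Q3 + 1.5*IQR]
q1, q3 = income.quantile(0.25), income.quantile(0.75)
iqr = q3 - q1
lower, upper = q1 - 1.5 * iqr, q3 + 1.5 * iqr
outliers = income[(income < lower) | (income > upper)]
print(outliers.tolist())  # only the 1,000,000 entry is flagged
```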
6. Cleaning Step 4: Unify Data Types (What if Formats Are “Messy”?)
Incorrect data types can cause analysis errors (e.g., date columns as strings, unable to sort). Use astype() or pd.to_datetime() to convert types:
# 1. Convert "BirthDate" to datetime type (for easy extraction of year/month)
df_clean["BirthDate"] = pd.to_datetime(df_clean["BirthDate"])
# 2. Convert "Age" to integer (e.g., 26.5 → 26)
df_clean["Age"] = df_clean["Age"].astype(int)
print("Data types after conversion:")
print(df_clean.dtypes)
Now the “BirthDate” column is in date format, and “Age” is integer, making data types more standardized.
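The payoff of the datetime conversion is the .dt accessor, which lets you pull out the year, month, and other components directly; a small sketch with two invented dates:

```python
import pandas as pd

df = pd.DataFrame({"BirthDate": ["1998-01-15", "2000-08-22"]})
df["BirthDate"] = pd.to_datetime(df["BirthDate"])

# The .dt accessor only works on datetime columns; strings would raise an error
df["BirthYear"] = df["BirthDate"].dt.year
df["BirthMonth"] = df["BirthDate"].dt.month
```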
7. Summary: Beginner’s Data Cleaning Workflow
- Import Data: use pd.read_csv() or manually create a sample DataFrame;
- Preliminary Check: df.head() (view the data), df.info() (check types), df.isnull().sum() (check missing values);
- Handle Missing Values: delete with dropna() or impute with the mean/median via fillna();
- Handle Duplicates: use duplicated() to identify and drop_duplicates() to remove;
- Handle Outliers: use descriptive stats or logical conditions to filter;
- Convert Data Types: use astype() / pd.to_datetime() to ensure numeric and date types are correct.
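The whole workflow above can be chained into one short script; a minimal end-to-end sketch reusing a shortened version of this article's sample data:

```python
import pandas as pd

df = pd.DataFrame({
    "Name": ["Xiaoming", "Xiaohong", "Xiaogang", "Xiaoming"],
    "Age": [25, None, 30, 25],
    "BirthDate": ["1998-01-15", "1999-03-20", "1995-05-10", "1998-01-15"],
    "Income": [8000, None, 12000, 8000],
})

df["Age"] = df["Age"].fillna(df["Age"].mean())            # 1. impute missing values
df["Income"] = df["Income"].fillna(df["Income"].median())
df = df.drop_duplicates()                                 # 2. remove duplicate rows
df = df[df["Income"] <= 20000]                            # 3. filter outliers
df["BirthDate"] = pd.to_datetime(df["BirthDate"])         # 4. unify data types
df["Age"] = df["Age"].astype(int)
print(df.shape[0])  # 3 clean rows remain
```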
Most Important Tip: Practice hands-on! Copy the code into Jupyter Notebook and test with your own data—you’ll master it quickly. Data cleaning has no fixed formula; with more practice, you’ll develop a “sharp eye” for identifying issues!
Upcoming Preview: How does pandas efficiently clean large datasets? Follow me for advanced data cleaning in the next issue!