Pandas is a powerful tool for handling tabular data in Python, and descriptive statistics are the foundation of data analysis. This article walks through five of the most commonly used pandas statistical functions, so that even beginners can pick up basic analysis skills quickly.
1. sum(): Quick Summation, Total Amount Statistics
Purpose: Calculate the sum of all values in a column (or row).
Scenario: For example, adding up every student's score in one subject to get the class total, or adding up all sales for a period.
Example:
Suppose we have a student score sheet with math, Chinese, and English scores:
import pandas as pd
# Create data
data = {
    "Student ID": [1, 2, 3, 4, 5],
    "Math": [85, 92, 78, 90, 88],
    "Chinese": [76, 88, 95, 80, 79],
    "English": [90, 85, 82, 93, 87]
}
df = pd.DataFrame(data)
# Calculate total math score (single column sum)
math_total = df["Math"].sum()
print("Total Math Score:", math_total) # Output: Total Math Score: 433 (85+92+78+90+88=433)
# Calculate the total for each subject (column-wise sum)
# Student ID is numeric too, so a plain df.sum() would also add up the IDs; select the subject columns instead
all_subjects_total = df[["Math", "Chinese", "English"]].sum()
print(all_subjects_total) # Output: Math 433, Chinese 418, English 437
Notes:
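- sum() skips missing values (NaN) by default, so the result is the sum of the values that are present. Pass skipna=False if a missing value should make the result NaN instead.
- To sum rows (e.g., total score per student), use axis=1 on the subject columns so that Student ID is not added in: df[["Math", "Chinese", "English"]].sum(axis=1).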
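Both notes can be illustrated with a minimal sketch. The small `scores` table and its missing value are hypothetical and exist only for this illustration; they are not part of the score sheet above.

```python
import numpy as np
import pandas as pd

# Hypothetical data: student 2 has no Chinese score recorded
scores = pd.DataFrame({
    "Student ID": [1, 2, 3],
    "Math": [85, 92, 78],
    "Chinese": [76, np.nan, 95],
})

# NaN is skipped by default, so only 76 and 95 are added
chinese_total = scores["Chinese"].sum()
print(chinese_total)  # 171.0

# With skipna=False, a single NaN makes the whole result NaN
chinese_total_strict = scores["Chinese"].sum(skipna=False)
print(chinese_total_strict)  # nan

# Row-wise totals: select the subject columns so Student ID is not added in
per_student_total = scores[["Math", "Chinese"]].sum(axis=1)
print(per_student_total.tolist())  # [161.0, 92.0, 173.0]
```

Note that student 2's row total is 92.0 rather than NaN, because the missing Chinese score is skipped in the row-wise sum as well.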
2. mean(): Calculate Average, Reflect Central Tendency
Purpose: Compute the average of a column (or row), i.e., the sum divided by the number of values.
Scenario: For example, calculating the average score of a subject or the class average.
Example:
Using the score sheet above to calculate the average math score:
# Math average score
math_avg = df["Math"].mean()
print("Math Average Score:", math_avg) # Output: Math Average Score: 86.6 (433/5=86.6)
# Average scores for all subjects (select the subject columns so Student ID is not averaged)
all_subjects_avg = df[["Math", "Chinese", "English"]].mean()
print(all_subjects_avg) # Output: Math 86.6, Chinese 83.6, English 87.4
Notes:
- Like sum(), it skips missing values by default. The average is then taken over the non-missing values only, so the divisor is the count of values present, not the number of rows.
- Averages are sensitive to extreme values (e.g., outliers like perfect scores or 0s can skew results). Use the median for comparison when extreme values exist.
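The effect of missing values on the average can be seen in a short sketch. The `math_scores` series below is made-up data, chosen so the arithmetic is easy to follow.

```python
import numpy as np
import pandas as pd

# Hypothetical math scores; one student was absent
math_scores = pd.Series([80, 90, np.nan, 100])

# mean() divides by the number of non-missing values (3), not the length (4)
avg_present = math_scores.mean()
print(avg_present)  # 90.0

# count() reports how many values actually went into the mean
n_present = math_scores.count()
print(n_present)  # 3

# Treating the absence as a score of 0 gives a very different answer
avg_zero_filled = math_scores.fillna(0).mean()
print(avg_zero_filled)  # 67.5
```

Whether a missing value should be skipped or filled in depends on what it means in your data, so it is worth checking count() before trusting an average.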
3. median(): Calculate Median, Robust to Extreme Values
Purpose: Find the middle value after sorting the data, reflecting the “middle level” of the dataset.
Scenario: When data contains extreme values (e.g., perfect scores or 0s), the median is more reliable than the average.
Example:
Suppose a sixth student joins with an extremely low math score of 50. Every column needs the same number of rows, so this example also gives Student 6 illustrative Chinese and English scores of 75 and 68:
# Rebuild the data with Student 6 added (all columns must have the same length)
data = {
    "Student ID": [1, 2, 3, 4, 5, 6],
    "Math": [85, 92, 78, 90, 88, 50],  # 50 is the extreme low score
    "Chinese": [76, 88, 95, 80, 79, 75],
    "English": [90, 85, 82, 93, 87, 68]
}
df = pd.DataFrame(data)
# Average (pulled down by the extreme value)
math_avg = df["Math"].mean() # (433+50)/6 = 483/6 = 80.5
print("Math Average (with extreme value):", math_avg) # Output: 80.5
# Median (barely affected by the extreme value)
math_median = df["Math"].median() # Sorted: 50,78,85,88,90,92 → (85+88)/2 = 86.5
print("Math Median (with extreme value):", math_median) # Output: 86.5
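Conclusion: The single low score pulls the mean down from 86.6 to 80.5, while the median only moves from 88 to 86.5. The median therefore better reflects the “true middle level” of most data points, especially when the data is unevenly distributed.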
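To see what “the middle value after sorting” means in practice, here is a small sketch that computes the median by hand and compares it with pandas. The helper name `manual_median` is hypothetical and only meant for illustration; in real code you would simply call median().

```python
import pandas as pd

def manual_median(values):
    """Median by sorting: the middle value, or the mean of the two middle values."""
    ordered = sorted(values)
    n = len(ordered)
    mid = n // 2
    if n % 2 == 1:
        # Odd count: a single middle value exists
        return float(ordered[mid])
    # Even count: average the two values around the middle
    return (ordered[mid - 1] + ordered[mid]) / 2

five_scores = [85, 92, 78, 90, 88]
six_scores = five_scores + [50]  # add the extreme low score

print(manual_median(five_scores))  # 88.0
print(manual_median(six_scores))   # 86.5

# Both results agree with pandas
print(pd.Series(five_scores).median())  # 88.0
print(pd.Series(six_scores).median())   # 86.5
```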
4. max() and min(): Find Maximum/Minimum Values
Purpose: Return the maximum or minimum value in a column (or row), respectively.
Scenario: For example, identifying the highest/lowest scores or the range of an indicator.
Example:
# Highest math score
math_max = df["Math"].max()
print("Highest Math Score:", math_max) # Output: 92
# Lowest math score
math_min = df["Math"].min()
print("Lowest Math Score:", math_min) # Output: 50
# Max/min across all numeric columns
df_max = df.max() # Max per column
df_min = df.min() # Min per column
print("Highest scores per subject:", df_max[["Math", "Chinese", "English"]]) # Output: Math 92, Chinese 95, English 93
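The scenario above also mentions the range of an indicator, and it is often useful to know which row holds the extreme value. The following sketch shows one way to do both, using a small stand-alone `scores` table (an assumption for this illustration) and the pandas method idxmax().

```python
import pandas as pd

scores = pd.DataFrame({
    "Student ID": [1, 2, 3, 4, 5],
    "Math": [85, 92, 78, 90, 88],
})

# Range of the math scores: highest minus lowest
math_range = scores["Math"].max() - scores["Math"].min()
print(math_range)  # 14

# idxmax() returns the index label of the row holding the maximum
top_row = scores["Math"].idxmax()
top_student = scores.loc[top_row, "Student ID"]
print(top_student)  # 2
```

idxmin() works the same way for the lowest score.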
5. describe(): Comprehensive Descriptive Statistics
Purpose: Conduct a “full-body check” of your data, outputting count, mean, std, min/max, and percentiles (25%, 50%, 75%).
Scenario: Quickly understanding the overall distribution, range, and dispersion of data.
Example:
# Statistical summary for the three subject columns
df_describe = df[["Math", "Chinese", "English"]].describe()
print(df_describe)
Sample Output (Student 6 is assumed to have scored 75 in Chinese and 68 in English):
            Math    Chinese    English
count   6.000000   6.000000   6.000000
mean   80.500000  82.166667  84.166667
std    15.719415   7.782459   8.795832
min    50.000000  75.000000  68.000000
25%    79.750000  76.750000  82.750000
50%    86.500000  79.500000  86.000000
75%    89.500000  86.000000  89.250000
max    92.000000  95.000000  93.000000
Key Indicator Explanations:
- std (Standard Deviation): Larger values mean the scores are more spread out (e.g., Math std ≈ 15.7 vs. English std ≈ 8.8, meaning math scores vary more).
- 50%: The median, i.e., the same value that median() returns.
- 25%/75% (Quartiles): The middle 50% of the data lies between these two values.
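The quartiles and the standard deviation can also be computed directly, which helps when checking a describe() result. The sketch below uses the six math scores from this article; the variable names are illustrative.

```python
import pandas as pd

math_scores = pd.Series([85, 92, 78, 90, 88, 50])

# quantile() uses linear interpolation between neighbouring values by default
q1 = math_scores.quantile(0.25)
q3 = math_scores.quantile(0.75)
iqr = q3 - q1  # width of the interval holding the middle 50% of the data
print(q1, q3, iqr)  # 79.75 89.5 9.75

# std() uses the sample formula by default (divides by n - 1, not n)
squared_dev = (math_scores - math_scores.mean()) ** 2
manual_std = (squared_dev.sum() / (len(math_scores) - 1)) ** 0.5
print(round(manual_std, 2), round(math_scores.std(), 2))  # 15.72 15.72
```

The interquartile range (q3 - q1) is a measure of spread that, like the median, is not dominated by a single extreme score.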
Summary: 5 Functions for Basic to Advanced Analysis
These five groups of functions (sum, mean, median, max/min, and describe) are the “foundational skills” of data analysis. They let you answer critical questions:
- What is the total? (sum)
- What is the average level? (mean)
- What is the middle level, and how spread out is the data? (median, plus std from describe)
- What are the extreme values? (max/min)
- What does the overall distribution look like? (describe)
For advanced analysis, explore grouping statistics (groupby) and standard deviation (std), and gradually enhance your data analysis capabilities!
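As a small taste of grouping statistics, the sketch below applies several of the functions from this article per group. The `Class` column and its values are hypothetical; they are not part of the score sheet used earlier.

```python
import pandas as pd

# Hypothetical data: five students split across two classes
grouped_demo = pd.DataFrame({
    "Class": ["A", "A", "B", "B", "B"],
    "Math": [85, 92, 78, 90, 88],
})

# One row per class, one column per statistic
class_stats = grouped_demo.groupby("Class")["Math"].agg(["mean", "max", "std"])
print(class_stats)
```

Each statistic is computed separately within class A and class B, so the same questions (average level, best score, spread) can be answered per group.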
Practice: Try applying these methods to your own score data, or generate random data (for example with NumPy) and load it into a DataFrame to practice using these functions!
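One possible way to create practice data is sketched below. It assumes NumPy's random generator for the numbers; the seed, the score range of 40 to 100, and the table size of 30 students are arbitrary choices for illustration.

```python
import numpy as np
import pandas as pd

# Fixed seed so the practice data is the same on every run
rng = np.random.default_rng(seed=42)

# 30 students, 3 subjects, integer scores from 40 to 100 inclusive
practice = pd.DataFrame(
    rng.integers(low=40, high=101, size=(30, 3)),
    columns=["Math", "Chinese", "English"],
)

print(practice.describe())          # overall summary per subject
print(practice.sum(axis=1).head())  # total score for the first five students
```

From here you can try sum(), mean(), median(), max(), and min() on individual columns and compare the results with the describe() summary.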