Pandas Data Statistics: 5 Common Functions to Quickly Master Basic Analysis

Pandas is a powerful tool for handling tabular data in Python, and statistical analysis is the foundation of data analysis. Today, let's explore 5 of the most commonly used pandas statistical functions. They cover the basics of everyday analysis, and even beginners can get started easily!

1. `sum()`: Quick Summation, Total Amount Statistics

Purpose: Calculate the sum of all values in a column (or row).
Scenario: For example, adding up a class's scores in one subject, or totaling a column of sales figures.

Example:
Suppose we have a student score sheet with math, Chinese, and English scores:

import pandas as pd

# Create data
data = {
    "Student ID": [1, 2, 3, 4, 5],
    "Math": [85, 92, 78, 90, 88],
    "Chinese": [76, 88, 95, 80, 79],
    "English": [90, 85, 82, 93, 87]
}
df = pd.DataFrame(data)

# Calculate total math score (single column sum)
math_total = df["Math"].sum()
print("Total Math Score:", math_total)  # Output: Total Math Score: 433 (85+92+78+90+88=433)

# Calculate the sum of every numeric column
all_subjects_total = df.sum()
print(all_subjects_total)  # Student ID is numeric too, so it is summed as well: Student ID 15, Math 433, Chinese 418, English 437

# To total only the subjects, select those columns first
print(df[["Math", "Chinese", "English"]].sum())

Notes:
- sum() skips missing values (NaN) by default (skipna=True), so empty cells are left out of the total rather than turning the result into NaN.
- To sum rows (e.g., total score per student), use axis=1: df.sum(axis=1).
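
The two notes above can be sketched as follows. This is only a minimal illustration: the `scores` frame and its missing English score are made up for the example and are not part of the score sheet above.

```python
import numpy as np
import pandas as pd

# Hypothetical data: the third student has no English score recorded
scores = pd.DataFrame({
    "Math": [85, 92, 78],
    "Chinese": [76, 88, 95],
    "English": [90, 85, np.nan],
})

# Column sum: the NaN is skipped, so only 90 + 85 are added
english_total = scores["English"].sum()              # 175.0

# Row sum (axis=1): one total per student; the third total leaves out English
per_student = scores.sum(axis=1)                     # 251.0, 265.0, 173.0

# skipna=False surfaces the missing value instead of skipping it
strict_total = scores["English"].sum(skipna=False)   # nan
```

If a silently incomplete total would be misleading in your data, `skipna=False` is one way to make the gap visible.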

2. `mean()`: Calculate Average, Reflect Central Tendency

Purpose: Compute the average value of a column (or row) (total sum divided by the number of values).
Scenario: For example, calculating the average score of a subject or the class average.

Example:
Using the score sheet above to calculate the average math score:

# Math average score
math_avg = df["Math"].mean()
print("Math Average Score:", math_avg)  # Output: Math Average Score: 86.6 (433/5=86.6)

# Average of every numeric column
all_subjects_avg = df.mean()
print(all_subjects_avg)  # Output: Student ID 3.0, Math 86.6, Chinese 83.6, English 87.4 (Student ID is numeric, so it is averaged too)

Notes:
- Like sum(), mean() skips missing values by default. It divides by the number of non-missing values only, so check how many values are missing before relying on the result.
- Averages are sensitive to extreme values (e.g., outliers like perfect scores or 0s can skew results). Use the median for comparison when extreme values exist.
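
To make the first note concrete, here is a small sketch of how a missing value changes what mean() divides by. The `math` Series is hypothetical data invented for this illustration.

```python
import numpy as np
import pandas as pd

# Hypothetical column with one missing score
math = pd.Series([80, 90, np.nan, 100], name="Math")

# NaN is skipped: (80 + 90 + 100) / 3, not / 4
avg_skipping_nan = math.mean()             # 90.0

# Treating the missing score as 0 gives a very different answer: 270 / 4
avg_filled_zero = math.fillna(0).mean()    # 67.5

# count() reports how many values the mean was actually based on
n_used = math.count()                      # 3
```

Whether a missing score should be skipped or counted as 0 depends on what the gap means in your data, so it is worth deciding explicitly.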

3. `median()`: Calculate Median, Robust to Extreme Values

Purpose: Find the middle value after sorting the data, reflecting the “middle level” of the dataset.
Scenario: When data contains extreme values (e.g., perfect scores or 0s), the median is more reliable than the average.

Example:
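Suppose we add an extremely low score to the math column (e.g., a sixth student with a math score of 50):

# Add Student 6 with a very low math score. Every column needs a sixth value,
# otherwise pd.DataFrame raises a ValueError because the lengths differ.
# Assumption: Student 6's Chinese (75) and English (68) scores are chosen to
# match the column means shown in the describe() output in section 5.
data["Student ID"] = [1, 2, 3, 4, 5, 6]
data["Math"] = [85, 92, 78, 90, 88, 50]
data["Chinese"] = [76, 88, 95, 80, 79, 75]
data["English"] = [90, 85, 82, 93, 87, 68]
df = pd.DataFrame(data)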

# Average (affected by extreme value)
math_avg = df["Math"].mean()  # (433+50)/6 = 483/6 = 80.5
print("Math Average (with extreme value):", math_avg)  # Output: 80.5

# Median (barely moved by the extreme value: it was 88 before Student 6 was added)
math_median = df["Math"].median()  # Sorted: 50,78,85,88,90,92 → (85+88)/2 = 86.5
print("Math Median (with extreme value):", math_median)  # Output: 86.5

Conclusion: The median better reflects the “true middle level” of most data points, especially when the data is unevenly distributed.
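
One way to see this is to compare how far each statistic moves when the low score is added. The sketch below uses the same math scores as the example; the variable names are just for illustration.

```python
import pandas as pd

math_before = pd.Series([85, 92, 78, 90, 88])      # original five scores
math_after = pd.Series([85, 92, 78, 90, 88, 50])   # with the low score added

# The mean drops from 86.6 to 80.5, a shift of about 6.1 points
mean_shift = math_before.mean() - math_after.mean()

# The median drops from 88.0 to 86.5, a shift of only 1.5 points
median_shift = math_before.median() - math_after.median()
```

A single outlier moved the mean roughly four times as far as the median.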

4. `max()` and `min()`: Find Maximum/Minimum Values

Purpose: Return the maximum or minimum value in a column (or row), respectively.
Scenario: For example, identifying the highest/lowest scores or the range of an indicator.

Example:

# Highest math score
math_max = df["Math"].max()
print("Highest Math Score:", math_max)  # Output: 92

# Lowest math score
math_min = df["Math"].min()
print("Lowest Math Score:", math_min)  # Output: 50

# Max/min across all numeric columns
df_max = df.max()  # Max per column
df_min = df.min()  # Min per column
print("Highest scores per subject:", df_max[["Math", "Chinese", "English"]])  # Output: Math 92, Chinese 95, English 93
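
The scenario above also mentions the range of an indicator and identifying who holds the extreme scores. A minimal sketch of both follows, using the six math scores from the example. `idxmax()` and `idxmin()` are related pandas methods that go slightly beyond the two functions covered in this section.

```python
import pandas as pd

scores = pd.DataFrame({
    "Student ID": [1, 2, 3, 4, 5, 6],
    "Math": [85, 92, 78, 90, 88, 50],
})

# Range = highest score minus lowest score
math_range = scores["Math"].max() - scores["Math"].min()   # 92 - 50 = 42

# idxmax()/idxmin() return the row label of the extreme value
top_row = scores["Math"].idxmax()    # row label 1
low_row = scores["Math"].idxmin()    # row label 5

# Look up which student each row belongs to
top_student = scores.loc[top_row, "Student ID"]   # 2
low_student = scores.loc[low_row, "Student ID"]   # 6
```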

5. `describe()`: Comprehensive Descriptive Statistics

Purpose: Conduct a “full-body check” of your data, outputting count, mean, std, min/max, and percentiles (25%, 50%, 75%).
Scenario: Quickly understanding the overall distribution, range, and dispersion of data.

Example:

# Statistical summary for the three subject columns
df_describe = df[["Math", "Chinese", "English"]].describe()
print(df_describe)

Sample Output (rounded to two decimals; Student 6 scored 50 in Math, 75 in Chinese, and 68 in English):

        Math  Chinese  English
count   6.00     6.00     6.00
mean   80.50    82.17    84.17
std    15.72     7.78     8.80
min    50.00    75.00    68.00
25%    79.75    76.75    82.75
50%    86.50    79.50    86.00
75%    89.50    86.00    89.25
max    92.00    95.00    93.00

Key Indicator Explanations:
- std (Standard Deviation): Larger values mean the data is more spread out (e.g., English std≈8.80 vs. Math std≈15.72, meaning math scores vary more).
- 25%/75% (Quartiles): The middle 50% of the data lies between these two values.
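
The std and quartile rows can also be computed on their own, which helps show where the numbers come from. The sketch below uses the six math scores from this article. `quantile()` with its default linear interpolation and `std()` with the sample formula (ddof=1) are, as far as I know, the same defaults that describe() relies on.

```python
import pandas as pd

math = pd.Series([85, 92, 78, 90, 88, 50])

# Quartiles: sorted scores are 50, 78, 85, 88, 90, 92
q1 = math.quantile(0.25)   # 79.75 (a quarter of the way from 78 to 85)
q3 = math.quantile(0.75)   # 89.5  (three quarters of the way from 88 to 90)

# Interquartile range: the width of the middle 50% of the data
iqr = q3 - q1              # 9.75

# Sample standard deviation (divides by n - 1), about 15.72
std = math.std()
```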

Summary: 5 Functions for Basic to Advanced Analysis

These 5 function groups (sum, mean, median, max/min, describe) are the “foundational skills” of data analysis. They let you answer critical questions:
- What is the total? (sum)
- What is the average level? (mean)
- What is the middle level? (median)
- What are the extreme values? (max/min)
- How spread out is the data overall? (describe, via std and the quartiles)

For advanced analysis, explore grouping statistics (groupby) and standard deviation (std), and gradually enhance your data analysis capabilities!
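
As a first taste of grouping statistics, here is a minimal sketch. The `Class` column and its values are hypothetical and are not part of the score sheet used in this article.

```python
import pandas as pd

# Hypothetical data: four students split across two classes
scores = pd.DataFrame({
    "Class": ["A", "A", "B", "B"],
    "Math": [85, 92, 78, 90],
})

# Average math score per class: A = (85 + 92) / 2, B = (78 + 90) / 2
class_avg = scores.groupby("Class")["Math"].mean()   # A 88.5, B 84.0

# Spread of math scores within each class
class_std = scores.groupby("Class")["Math"].std()
```

The same pattern works with sum(), median(), max(), min(), and describe().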

Practice: Try applying these methods to your own score data, or generate random data (for example with NumPy) to practice using these functions!
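
If you do not have data at hand, one possible way to build a practice table is sketched below. The size (30 students), the score range (40 to 100), and the seed are arbitrary choices for illustration.

```python
import numpy as np
import pandas as pd

# Seeded generator so the practice data is the same on every run
rng = np.random.default_rng(seed=0)

# 30 students x 3 subjects, integer scores from 40 to 100 inclusive
practice = pd.DataFrame(
    rng.integers(40, 101, size=(30, 3)),
    columns=["Math", "Chinese", "English"],
)

# Try the functions from this article on the generated data
totals = practice.sum()
averages = practice.mean()
summary = practice.describe()
```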
