Why Learn pandas?

In data processing, Python’s pandas library acts as a “data butler” that helps us read, clean, and analyze data. CSV (Comma-Separated Values) is one of the most common data formats, and many kinds of data (such as tables exported from Excel and statistical reports) can be saved as CSV. Learning to read CSV files with pandas is a natural first step in data analysis!

Step 1: Install pandas

If pandas is not installed on your computer, install it from the command line first. Open a terminal (Command Prompt on Windows, Terminal on Mac/Linux) and enter:

pip install pandas

Tip: If you use Anaconda, pandas is pre-installed and you can skip this step. Other Jupyter Notebook setups often include it as well, but not always.

Step 2: Import the pandas Library

To use pandas, you need to import it in your code. By convention, it is given the short alias pd to keep later calls brief:

import pandas as pd

This line of code is like opening the “data butler’s” toolbox, and you can then use pd to call various tools.
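To confirm that the installation and import both worked, one quick check (a minimal sketch) is to print the installed version. The number you see depends on your environment:

```python
import pandas as pd

# __version__ holds the installed pandas version as a string
print(pd.__version__)
```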

Step 3: Read a CSV File

Suppose we have a simple CSV file named student.csv with the following content (you can create a similar file or follow along with the example):

姓名,学号,语文,数学,英语
张三,001,85,92,78
李四,002,90,88,95
王五,003,76,80,82

This file records the scores of three students, with the columns 姓名 (Name), 学号 (Student ID), 语文 (Chinese), 数学 (Math), and 英语 (English).
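If you would rather not create the file by hand, the following sketch writes the same sample data to student.csv in the current working directory. It assumes you are allowed to write there and that no file of that name needs to be kept:

```python
from pathlib import Path

# The sample data shown above, one student per line
csv_text = (
    "姓名,学号,语文,数学,英语\n"
    "张三,001,85,92,78\n"
    "李四,002,90,88,95\n"
    "王五,003,76,80,82\n"
)

# Write the file as UTF-8 into the current working directory
Path("student.csv").write_text(csv_text, encoding="utf-8")
```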

Basic Reading: The Most Common read_csv Function

Reading a CSV file with pandas only requires one line of code:

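# Read the CSV file (assuming the file is in the current directory).
# dtype keeps the 学号 (Student ID) column as text so leading zeros such as 001 survive;
# without it, pandas would parse that column as the integers 1, 2, 3.
df = pd.read_csv('student.csv', dtype={'学号': str})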

Here, df is an abbreviation for “DataFrame”, which pandas uses to represent tabular data (like a worksheet in Excel). After executing this line, df will contain all the data from the CSV file.
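One thing to watch for: read_csv infers a type for each column, so an ID such as 001 is parsed as the integer 1 unless you ask for text. The sketch below uses an in-memory copy of the sample data (io.StringIO stands in for the file) to compare the two behaviours:

```python
import io

import pandas as pd

csv_text = (
    "姓名,学号,语文,数学,英语\n"
    "张三,001,85,92,78\n"
    "李四,002,90,88,95\n"
    "王五,003,76,80,82\n"
)

# Default type inference: the ID column is parsed as integers
inferred = pd.read_csv(io.StringIO(csv_text))
print(inferred["学号"].tolist())  # [1, 2, 3]

# Reading the column as text keeps the leading zeros
as_text = pd.read_csv(io.StringIO(csv_text), dtype={"学号": str})
print(as_text["学号"].tolist())  # ['001', '002', '003']
```

Reading identifiers as text is usually the safer choice, since they are labels rather than quantities.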

Step 4: View and Understand the Data

After reading the file, don’t rush to analyze it! First, check what the data looks like and if there are any issues. pandas provides several simple and practical tools:

1. View the First Few Rows: head() and tail()

  • df.head(): Displays the first 5 rows by default, giving a quick preview of the data format.
  • df.tail(3): Displays the last 3 rows (both methods accept the number of rows in parentheses).

# Display the first 5 rows (our file has only 3, so all of them are shown)
print(df.head())

# Display the last 3 rows
print(df.tail(3))

The output should match the CSV file content, showing column names and corresponding data.
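Alongside head() and tail(), the shape and columns attributes give a quick sense of the table's size and layout. A small sketch, again using an in-memory copy of the sample data in place of the file:

```python
import io

import pandas as pd

csv_text = (
    "姓名,学号,语文,数学,英语\n"
    "张三,001,85,92,78\n"
    "李四,002,90,88,95\n"
    "王五,003,76,80,82\n"
)
df = pd.read_csv(io.StringIO(csv_text), dtype={"学号": str})

print(df.shape)             # (number of rows, number of columns) -> (3, 5)
print(df.columns.tolist())  # the column names as a list
print(df.head(2))           # first 2 rows only
print(df.tail(1))           # last row only
```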

2. Check Basic Data Information: info()

info() helps you check data types (e.g., numeric, string) and whether there are missing values:

df.info()

Sample output:

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 3 entries, 0 to 2
Data columns (total 5 columns):
 #   Column  Non-Null Count  Dtype 
---  ------  --------------  ----- 
 0   姓名     3 non-null      object
 1   学号     3 non-null      object
 2   语文     3 non-null      int64 
 3   数学     3 non-null      int64 
 4   英语     3 non-null      int64 
dtypes: int64(3), object(2)
memory usage: 248.0+ bytes

  • RangeIndex: The row index and row count (3 entries, labeled 0 to 2).
  • Non-Null Count: Number of non-missing values per column (all 3 here, so there are no missing values).
  • Dtype: Data type of each column (Name and Student ID are strings, shown as object; the scores are integers, int64).
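To see what a missing value looks like in practice, the sketch below reads a modified copy of the sample data in which one Math score has been left blank. The gap is an assumption made for illustration and is not in the original file. isna().sum() counts missing values per column, and the affected column becomes float64 because an integer column cannot hold NaN:

```python
import io

import pandas as pd

# Modified sample: 王五's Math score is deliberately left empty
csv_with_gap = (
    "姓名,学号,语文,数学,英语\n"
    "张三,001,85,92,78\n"
    "李四,002,90,88,95\n"
    "王五,003,76,,82\n"
)
df_gap = pd.read_csv(io.StringIO(csv_with_gap), dtype={"学号": str})

missing = df_gap.isna().sum()  # missing values per column
print(missing)
print(df_gap.dtypes)           # 数学 is now float64 instead of int64
```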

3. Summarize Numeric Data: describe()

If there are numeric columns (e.g., Chinese, Math), describe() quickly calculates summary statistics such as count, mean, standard deviation, min, max, and quartiles:

print(df.describe())

Sample output (only for numeric columns):

              语文         数学         英语
count   3.000000   3.000000   3.000000
mean   83.666667  86.666667  85.000000
std     7.094599   6.110101   8.888194
min    76.000000  80.000000  78.000000
25%    80.500000  84.000000  80.000000
50%    85.000000  88.000000  82.000000
75%    87.500000  90.000000  88.500000
max    90.000000  92.000000  95.000000
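Each row of the describe() output can also be computed on its own, which is a handy way to check a single figure. A minimal sketch on the three score columns, using an in-memory copy of the sample data:

```python
import io

import pandas as pd

csv_text = (
    "姓名,学号,语文,数学,英语\n"
    "张三,001,85,92,78\n"
    "李四,002,90,88,95\n"
    "王五,003,76,80,82\n"
)
df = pd.read_csv(io.StringIO(csv_text), dtype={"学号": str})
scores = df[["语文", "数学", "英语"]]

print(scores.mean())          # column means
print(scores.std())           # sample standard deviation (divides by n - 1)
print(scores.quantile(0.25))  # 25th percentile, linearly interpolated
```

For example, the Math mean is (92 + 88 + 80) / 3 ≈ 86.67.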

Step 5: Handle Common Issues (Advanced)

If your CSV file has quirks (e.g., garbled Chinese characters, non-comma separators, no header row), pandas can handle them flexibly:

1. Garbled Chinese Characters: Specify the Encoding

If Chinese characters come out garbled, or reading fails with a UnicodeDecodeError, the file's encoding probably differs from the one read_csv assumes (UTF-8 by default). Specify the file's actual encoding with the encoding parameter:

# Common encodings: utf-8 (the default), gbk (common for files saved on Chinese-language Windows)
df = pd.read_csv('student.csv', encoding='gbk')
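When you do not know the encoding in advance, one possible approach is to try a short list of candidates in order. The helper read_csv_with_fallback below is a hypothetical function written for this tutorial and is not part of pandas. The demo writes a small GBK-encoded file to a temporary directory so the sketch is self-contained:

```python
import tempfile
from pathlib import Path

import pandas as pd


def read_csv_with_fallback(path, encodings=("utf-8", "gbk")):
    """Hypothetical helper: try each encoding in turn, return the first DataFrame that decodes."""
    last_error = None
    for enc in encodings:
        try:
            return pd.read_csv(path, encoding=enc)
        except UnicodeDecodeError as err:
            last_error = err  # remember the failure and try the next encoding
    raise last_error


# Demo: create a GBK-encoded file, then read it back without naming the encoding
with tempfile.TemporaryDirectory() as tmp:
    gbk_path = Path(tmp) / "student_gbk.csv"
    gbk_path.write_text("姓名,学号\n张三,001\n", encoding="gbk")
    df_gbk = read_csv_with_fallback(gbk_path)

print(df_gbk.columns.tolist())  # ['姓名', '学号']
```

The order of candidates matters. UTF-8 is tried first because invalid UTF-8 fails loudly, whereas GBK will decode many byte sequences without error and could silently produce wrong characters.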

2. Non-Comma Separators: Use the sep Parameter

Although CSV typically uses commas, some files may use tabs (\t) or other symbols. For example, to read a tab-separated file:

# Read a tab-separated file
df = pd.read_csv('data.tsv', sep='\t')
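A wrong separator usually does not raise an error; the whole line simply ends up in a single column. The sketch below uses small made-up in-memory samples (the column names name and score are illustrative) to show correct and incorrect settings:

```python
import io

import pandas as pd

tsv_text = "name\tscore\nA\t85\nB\t90\n"
semi_text = "name;score\nA;85\nB;90\n"

# Matching separators give two proper columns
tab_df = pd.read_csv(io.StringIO(tsv_text), sep="\t")
semi_df = pd.read_csv(io.StringIO(semi_text), sep=";")
print(tab_df.columns.tolist())   # ['name', 'score']
print(semi_df.columns.tolist())  # ['name', 'score']

# Default comma separator on semicolon data: everything lands in one column
wrong_df = pd.read_csv(io.StringIO(semi_text))
print(wrong_df.shape)  # (2, 1)
```

If df.shape reports a single column for a file you expected to be a table, the separator is a good first thing to check.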

3. No Headers: Custom Column Names

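If the CSV file has no header row, read_csv will by default still treat the first line as column names, so the first record is lost from the data. Pass header=None to keep the first line as data (columns are then numbered 0, 1, 2…), and use the names parameter to supply your own column names: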

# Assume the file has no header; customize column names
df = pd.read_csv('student.csv', header=None, names=['姓名', '学号', '语文', '数学', '英语'])
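The three behaviours can be compared side by side. This sketch uses an in-memory copy of the sample data with the header line removed:

```python
import io

import pandas as pd

no_header = (
    "张三,001,85,92,78\n"
    "李四,002,90,88,95\n"
    "王五,003,76,80,82\n"
)

# Default: the first student is consumed as the header, leaving 2 data rows
wrong = pd.read_csv(io.StringIO(no_header))
print(len(wrong))  # 2

# header=None: all 3 rows are data, columns are numbered automatically
numbered = pd.read_csv(io.StringIO(no_header), header=None)
print(numbered.columns.tolist())  # [0, 1, 2, 3, 4]

# header=None plus names: all 3 rows are data, with readable column names
named = pd.read_csv(
    io.StringIO(no_header),
    header=None,
    names=["姓名", "学号", "语文", "数学", "英语"],
)
print(len(named))  # 3
```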

Summary

Through this tutorial, you’ve learned:

  1. Install and Import pandas: pip install pandas + import pandas as pd
  2. Read CSV Files: pd.read_csv('filename.csv')
  3. View Data: head()/tail() to preview, info() to check types and missing values, describe() to summarize numeric data
  4. Handle Special Formats: Solutions for encoding, separators, and missing headers

Reading CSV files with pandas is just the beginning of data processing. You can further learn data filtering, cleaning, merging, and other operations. Now, try practicing with your own CSV files to get familiar with these steps!

Xiaoye