Why Learn pandas?
In the field of data processing, Python’s pandas library acts as a “data butler” that helps us easily read, clean, and analyze data. CSV (Comma-Separated Values) is one of the most common data formats, and many data files (such as Excel-exported tables and statistical reports) can be saved in CSV format. Learning to use pandas to read CSV files is the first step in data analysis!
Step 1: Install pandas
If pandas is not installed on your computer, install it from the command line first. Open a terminal (Command Prompt on Windows, Terminal on macOS/Linux) and enter:
pip install pandas
Tip: If you use Anaconda (including the Jupyter Notebook that ships with it), pandas is usually pre-installed and you can skip this step.
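To confirm that the installation worked, one quick check is to import the library and print its version. This is only a minimal sketch; the version number printed will differ from machine to machine, and the import line itself is explained in the next step.

```python
import pandas as pd

# Print the installed pandas version.
# An ImportError on the line above means pandas is not installed in this environment.
print(pd.__version__)
```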
Step 2: Import the pandas Library
To use pandas, you need to import it in your code. By convention, pandas is imported under the short alias pd, which keeps later calls concise:
import pandas as pd
This line of code is like opening the “data butler’s” toolbox, and you can then use pd to call various tools.
Step 3: Read a CSV File
Suppose we have a simple CSV file named student.csv with the following content (you can create a similar file or follow along with the example):
姓名,学号,语文,数学,英语
张三,001,85,92,78
李四,002,90,88,95
王五,003,76,80,82
This file records the score information of 3 students with columns: Name, Student ID, Chinese, Math, and English.
Basic Reading: The Most Common read_csv Function
Reading a CSV file with pandas only requires one line of code:
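# Read the CSV file (assuming the file is in the current directory)
# dtype keeps 学号 (Student ID) as text so that leading zeros such as 001 are preserved
df = pd.read_csv('student.csv', dtype={'学号': str})
Here, df is short for “DataFrame”, the structure pandas uses to represent tabular data (like a worksheet in Excel). After this line runs, df contains all the data from the CSV file. The dtype argument is optional, but without it pandas infers Student ID as an integer column and 001 becomes 1.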
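If you do not have student.csv on disk yet, the sketch below reproduces the read with an in-memory copy of the sample data. Using io.StringIO as a stand-in for the file is an assumption made for illustration only. It also compares a default read with one that treats the Student ID column as text, since pandas infers 001 as the integer 1 unless told otherwise.

```python
import io

import pandas as pd

# In-memory stand-in for student.csv (same content as the sample file).
csv_text = """姓名,学号,语文,数学,英语
张三,001,85,92,78
李四,002,90,88,95
王五,003,76,80,82
"""

# Default read: 学号 is inferred as an integer column, so leading zeros are lost.
df_default = pd.read_csv(io.StringIO(csv_text))

# Reading 学号 as text keeps the IDs exactly as written in the file.
df = pd.read_csv(io.StringIO(csv_text), dtype={"学号": str})

print(df_default["学号"].tolist())  # [1, 2, 3]
print(df["学号"].tolist())          # ['001', '002', '003']
```

With a real file, replace io.StringIO(csv_text) with the file path, for example 'student.csv'.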
Step 4: View and Understand the Data
After reading the file, don’t rush to analyze it! First, check what the data looks like and if there are any issues. pandas provides several simple and practical tools:
1. View the First or Last Few Rows: head() and tail()
- df.head(): Displays the first 5 rows by default, giving a quick preview of the data format.
- df.tail(3): Displays the last 3 rows (you can specify the number of rows in parentheses).
# Display the first 5 rows
print(df.head())
# Display the last 3 rows
print(df.tail(3))
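Because the sample file has only 3 rows, both calls print the whole table: the column names followed by all three records, preceded by the row index (0, 1, 2) that pandas adds automatically.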
2. Check Basic Data Information: info()
info() helps you check data types (e.g., numeric, string) and whether there are missing values:
df.info()
Sample output:
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 3 entries, 0 to 2
Data columns (total 5 columns):
 #   Column  Non-Null Count  Dtype
---  ------  --------------  -----
 0   姓名      3 non-null      object
 1   学号      3 non-null      object
 2   语文      3 non-null      int64
 3   数学      3 non-null      int64
 4   英语      3 non-null      int64
dtypes: int64(3), object(2)
memory usage: 248.0+ bytes
- RangeIndex: Indicates the number of rows (0-2, total 3 rows).
- Non-Null Count: Number of non-missing values per column (all 3 here, indicating no missing values).
- Dtype: The data type of each column (Name and Student ID are strings, shown as object; the scores are integers, shown as int64).
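The same facts that info() prints can also be read programmatically, which is handy when you want to act on them in code. The following is a minimal sketch that rebuilds the sample data in memory (an assumption for self-containment) and uses isna(), shape, and dtypes:

```python
import io

import pandas as pd

# In-memory copy of the sample data, with Student ID read as text.
csv_text = """姓名,学号,语文,数学,英语
张三,001,85,92,78
李四,002,90,88,95
王五,003,76,80,82
"""
df = pd.read_csv(io.StringIO(csv_text), dtype={"学号": str})

# Number of missing values in each column (all zeros for this sample).
missing = df.isna().sum()
print(missing)

# (number of rows, number of columns)
print(df.shape)

# Data type of each column, as a Series indexed by column name.
print(df.dtypes)
```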
3. Summarize Numeric Data: describe()
If there are numeric columns (e.g., Chinese, Math), describe() quickly calculates statistics like min, max, and mean:
df.describe()
Sample output (only for numeric columns):
              语文         数学         英语
count   3.000000   3.000000   3.000000
mean   83.666667  86.666667  85.000000
std     7.094599   6.110101   8.888194
min    76.000000  80.000000  78.000000
25%    80.500000  84.000000  80.000000
50%    85.000000  88.000000  82.000000
75%    87.500000  90.000000  88.500000
max    90.000000  92.000000  95.000000
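To see where these numbers come from, the individual statistics can be computed one at a time. The sketch below does this for the Chinese scores only; note that std() uses the sample standard deviation (dividing by n - 1), and the quartiles are linearly interpolated between neighboring values.

```python
import pandas as pd

# Chinese scores of the three students in the sample file.
chinese = pd.Series([85, 90, 76], name="语文")

print(chinese.mean())          # arithmetic mean: 251 / 3
print(chinese.std())           # sample standard deviation (ddof=1)
print(chinese.quantile(0.25))  # 25th percentile, halfway between 76 and 85
```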
Step 5: Handle Common Issues (Advanced)
If your CSV file deviates from the defaults (for example, garbled Chinese characters, a separator other than a comma, or no header row), read_csv has a parameter for each case:
1. Garbled Chinese Text: Specify the Encoding
If Chinese characters come out garbled, or the read fails with a UnicodeDecodeError, the file's encoding probably does not match the one read_csv assumes (utf-8 by default). Specify the file's actual encoding with the encoding parameter:
# Common encodings: utf-8 (default), gbk (commonly used on Chinese Windows)
df = pd.read_csv('student.csv', encoding='gbk')
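When you do not know a file's encoding in advance, one possible approach is to try a short list of candidates in order. The helper read_csv_with_fallback below is hypothetical (it is not part of pandas), and the demo file name student_gbk.csv is made up for this sketch.

```python
import os
import tempfile

import pandas as pd


def read_csv_with_fallback(path, encodings=("utf-8", "gbk")):
    """Hypothetical helper: try each encoding in turn, return the first successful read."""
    last_error = None
    for enc in encodings:
        try:
            return pd.read_csv(path, encoding=enc)
        except UnicodeDecodeError as err:
            last_error = err  # wrong encoding for this file; try the next candidate
    raise last_error


# Demo: write a small GBK-encoded file, then read it back without naming the encoding.
with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "student_gbk.csv")
    with open(path, "w", encoding="gbk", newline="") as f:
        f.write("姓名,语文\n张三,85\n李四,90\n")
    df_gbk = read_csv_with_fallback(path)

print(df_gbk.columns.tolist())  # ['姓名', '语文']
```

The order of candidates matters: utf-8 is tried first because it rejects most non-UTF-8 byte sequences, whereas gbk accepts many byte sequences and could silently produce wrong characters if tried first.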
2. Non-Comma Separators: Use the sep Parameter
Although CSV typically uses commas, some files may use tabs (\t) or other symbols. For example, to read a tab-separated file:
# Read a tab-separated file
df = pd.read_csv('data.tsv', sep='\t')
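The sep parameter works the same way for any single-character delimiter. As a small self-contained sketch (the in-memory strings below are made-up sample data, not files from this tutorial), the same table is read once with tabs and once with semicolons:

```python
import io

import pandas as pd

# The same two-row table, written with two different separators.
tsv_text = "姓名\t语文\t数学\n张三\t85\t92\n李四\t90\t88\n"
semi_text = "姓名;语文;数学\n张三;85;92\n李四;90;88\n"

df_tab = pd.read_csv(io.StringIO(tsv_text), sep="\t")
df_semi = pd.read_csv(io.StringIO(semi_text), sep=";")

# Both reads produce identical DataFrames.
print(df_tab.equals(df_semi))  # True
```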
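3. No Headers: Custom Column Names
By default, read_csv treats the first row as the header, so a file without a header row would have its first data record used as column names. Passing header=None tells pandas there is no header and numbers the columns automatically (0, 1, 2…); add the names parameter to supply your own column names: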
# Assume the file has no header; customize column names
df = pd.read_csv('student.csv', header=None, names=['姓名', '学号', '语文', '数学', '英语'])
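The effect of these options is easiest to see side by side. The sketch below uses an in-memory, header-less copy of two sample rows (an assumption for illustration) and reads it three ways:

```python
import io

import pandas as pd

# Two data rows with no header line.
no_header_text = "张三,001,85,92,78\n李四,002,90,88,95\n"

# Default read: the first record is wrongly consumed as the header.
df_wrong = pd.read_csv(io.StringIO(no_header_text))
print(len(df_wrong))  # 1 (only one data row is left)

# header=None: both rows are kept and columns are numbered from 0.
df_numbered = pd.read_csv(io.StringIO(no_header_text), header=None)
print(df_numbered.columns.tolist())  # [0, 1, 2, 3, 4]

# header=None plus names: both rows are kept, with readable column names.
columns = ["姓名", "学号", "语文", "数学", "英语"]
df_named = pd.read_csv(io.StringIO(no_header_text), header=None, names=columns)
print(df_named.columns.tolist())
```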
Summary
Through this tutorial, you’ve learned how to:
- Install and Import pandas: pip install pandas + import pandas as pd
- Read CSV Files: pd.read_csv('filename.csv')
- View Data: head()/tail() to preview, info() to check types and missing values, describe() to summarize numeric data
- Handle Special Formats: Solutions for encoding, separators, and missing headers
Reading CSV files with pandas is just the beginning of data processing. You can further learn data filtering, cleaning, merging, and other operations. Now, try practicing with your own CSV files to get familiar with these steps!