In the world of data analysis, we often need to “combine” data from different sources, such as merging a user information table with an order table for combined analysis, or concatenating monthly sales data into a large table. pandas provides two extremely useful tools, merge and concat, which can help us easily accomplish these operations. Today, we’ll master these two basic operations in the simplest way possible, making it easy for beginners to get started quickly!
I. Why Data Merging?¶
Imagine you have two tables: a “Student Basic Information Table” (containing name and age) and a “Student Grades Table” (containing name, subject, and score). If you want to analyze each student’s grades and age, you need to “merge” these two tables. pandas’ merge and concat are designed for this!
II. concat: Simple Concatenation, Ideal for “Adding Rows” or “Adding Columns”¶
concat is like “gluing two pieces of paper together vertically or horizontally”—it doesn’t require a key and directly concatenates the data in sequence.
1. Basic Syntax¶
pd.concat(objs, axis=0, ignore_index=False)
objs: List of DataFrames to concatenate (e.g.,[df1, df2])axis=0: Default “vertical concatenation” (row-wise);axis=1: “horizontal concatenation” (column-wise)ignore_index=True: Reset the index (avoid duplicate indices after concatenation—critical for beginners!)
2. Vertical Concatenation (Row-wise)¶
Scenario: Merge two tables with the same structure by rows (e.g., monthly sales data).
Example:
import pandas as pd
# Create two simple DataFrames
df1 = pd.DataFrame({
'Name': ['Zhang San', 'Li Si'],
'Age': [20, 21]
})
df2 = pd.DataFrame({
'Name': ['Wang Wu', 'Zhao Liu'],
'Age': [22, 23]
})
# Vertical concatenation (default axis=0)
result_row = pd.concat([df1, df2])
print("Result without index reset:")
print(result_row)
# Reset index (recommended! Avoid duplicate indices)
result_row_reset = pd.concat([df1, df2], ignore_index=True)
print("\nResult with index reset:")
print(result_row_reset)
Output Comparison:
- Without index reset: The original indices are retained (0,1,0,1).
- With index reset: Indices become 0,1,2,3 (clearer).
3. Horizontal Concatenation (Column-wise)¶
Scenario: Merge two tables with different structures but a common “row identifier” (e.g., student info + grades table).
Example:
# Add a grades table
score = pd.DataFrame({
'Subject': ['Math', 'Chinese'],
'Score': [90, 85]
})
# Horizontal concatenation (axis=1)
result_col = pd.concat([df1, score], axis=1)
print("\nResult of column-wise concatenation:")
print(result_col)
Output:
Name Age Subject Score
0 Zhang San 20 Math 90
1 Li Si 21 Chinese 85
(Note: Here, df1 and score have different row counts. Will missing values be filled automatically? No! Column-wise concatenation requires matching row counts; otherwise, an error occurs. If row counts differ, pandas raises an error. Thus, column-wise concatenation typically requires identical row counts.)
III. merge: Key-based Merging, Like SQL JOIN¶
merge is like “matching two people’s information by ID”—it requires a common “key” (e.g., name, ID) to associate tables, similar to SQL’s JOIN operation.
1. Basic Syntax¶
pd.merge(left, right, on=None, how='inner', left_on=None, right_on=None)
left/right: The two DataFrames to mergeon: Common column name (if keys have the same name in both tables)how: Merge method (inner/left/right/outer; defaultinner)left_on/right_on: Specify keys if names differ (e.g., left table usesid, right table usesstudent_id)
2. Core Merge Methods (Using “Student” and “Score” Tables as Examples)¶
Assume we have:
- Student Table (student): Contains student_id (student number), Name, Age
- Score Table (score): Contains student_id (student number), Subject, Score
# Create sample data
student = pd.DataFrame({
'student_id': [1, 2, 3],
'Name': ['Zhang San', 'Li Si', 'Wang Wu'],
'Age': [20, 21, 22]
})
score = pd.DataFrame({
'student_id': [1, 2, 4], # Note: Student 4 does not exist in the student table
'Subject': ['Math', 'Chinese', 'English'],
'Score': [90, 85, 95]
})
(1) Inner Join (how='inner', default)¶
Effect: Keep only rows with common keys (student_id) in both tables.
inner_merge = pd.merge(student, score, on='student_id')
print("Inner Join Result:")
print(inner_merge)
Output:
student_id Name Age Subject Score
0 1 Zhang San 20 Math 90
1 2 Li Si 21 Chinese 85
- Result includes only
student_id=1and2(Student 3 and Score 4 are excluded).
(2) Left Join (how='left')¶
Effect: Retain all rows from the left table (student), filling unmatched values from the right table (score) with NaN.
left_merge = pd.merge(student, score, on='student_id', how='left')
print("\nLeft Join Result:")
print(left_merge)
Output:
student_id Name Age Subject Score
0 1 Zhang San 20 Math 90
1 2 Li Si 21 Chinese 85
2 3 Wang Wu 22 NaN NaN
- Student 3 (from the student table) has no scores, so
SubjectandScoreareNaN.
(3) Right Join (how='right')¶
Effect: Retain all rows from the right table (score), filling unmatched values from the left table (student) with NaN.
right_merge = pd.merge(student, score, on='student_id', how='right')
print("\nRight Join Result:")
print(right_merge)
Output:
student_id Name Age Subject Score
0 1 Zhang San 20.0 Math 90
1 2 Li Si 21.0 Chinese 85
2 4 NaN NaN English 95
- Score 4 (from the score table) has no student data, so
NameandAgeareNaN.
(4) Outer Join (how='outer')¶
Effect: Retain all rows from both tables, filling unmatched values with NaN.
outer_merge = pd.merge(student, score, on='student_id', how='outer')
print("\nOuter Join Result:")
print(outer_merge)
Output:
student_id Name Age Subject Score
0 1 Zhang San 20.0 Math 90
1 2 Li Si 21.0 Chinese 85
2 3 Wang Wu 22.0 NaN NaN
3 4 NaN NaN English 95
(5) When Keys Have Different Names¶
If key names differ (e.g., left table uses id, right table uses student_id), use left_on and right_on.
# Rename keys for demonstration
student_renamed = student.rename(columns={'student_id': 'id'})
score_renamed = score.rename(columns={'student_id': 'student_id'})
diff_keys_merge = pd.merge(
student_renamed,
score_renamed,
left_on='id', # Key column from left table
right_on='student_id' # Key column from right table
)
print("\nResult with different key names:")
print(diff_keys_merge)
Output:
id Name Age student_id Subject Score
0 1 Zhang San 20 1 Math 90
1 2 Li Si 21 2 Chinese 85
IV. Summary: When to Use merge vs concat?¶
| Method | Applicable Scenarios | Key Parameters |
|---|---|---|
| concat | Simple concatenation (no key) or “add rows/columns” | axis (0=row, 1=column), ignore_index |
| merge | Table merging with common keys (like SQL JOIN) | how (inner/left/right/outer), on/left_on/right_on |
V. Key Takeaways for Beginners¶
-
concat:
- No key required; direct concatenation.
- Row-wise (axis=0): Increases row count, columns unchanged.
- Column-wise (axis=1): Increases column count, row counts must match.
- Useignore_index=Trueto avoid duplicate indices. -
merge:
- Requires a common key for matching.
-howparameter controls merge type (inner/left/right/outer).
- Useleft_on/right_onif key names differ.
Practice with small examples and experiment with different merging methods—you’ll master them quickly!