In the world of data analysis, we often need to “combine” data from different sources, such as merging a user information table with an order table for combined analysis, or concatenating monthly sales data into a large table. pandas provides two extremely useful tools, merge and concat, which can help us easily accomplish these operations. Today, we’ll master these two basic operations in the simplest way possible, making it easy for beginners to get started quickly!

I. Why Data Merging?

Imagine you have two tables: a “Student Basic Information Table” (containing name and age) and a “Student Grades Table” (containing name, subject, and score). If you want to analyze each student’s grades and age, you need to “merge” these two tables. pandas’ merge and concat are designed for this!

II. concat: Simple Concatenation, Ideal for “Adding Rows” or “Adding Columns”

concat is like “gluing two pieces of paper together vertically or horizontally”—it doesn’t require a key and directly concatenates the data in sequence.

1. Basic Syntax

pd.concat(objs, axis=0, ignore_index=False)
  • objs: List of DataFrames to concatenate (e.g., [df1, df2])
  • axis=0: Default “vertical concatenation” (row-wise); axis=1: “horizontal concatenation” (column-wise)
  • ignore_index=True: Reset the index (avoid duplicate indices after concatenation—critical for beginners!)

2. Vertical Concatenation (Row-wise)

Scenario: Merge two tables with the same structure by rows (e.g., monthly sales data).

Example:

import pandas as pd

# Create two simple DataFrames
df1 = pd.DataFrame({
    'Name': ['Zhang San', 'Li Si'],
    'Age': [20, 21]
})

df2 = pd.DataFrame({
    'Name': ['Wang Wu', 'Zhao Liu'],
    'Age': [22, 23]
})

# Vertical concatenation (default axis=0)
result_row = pd.concat([df1, df2])
print("Result without index reset:")
print(result_row)

# Reset index (recommended! Avoid duplicate indices)
result_row_reset = pd.concat([df1, df2], ignore_index=True)
print("\nResult with index reset:")
print(result_row_reset)

Output Comparison:
- Without index reset: The original indices are retained (0,1,0,1).
- With index reset: Indices become 0,1,2,3 (clearer).

3. Horizontal Concatenation (Column-wise)

Scenario: Merge two tables with different structures but a common “row identifier” (e.g., student info + grades table).

Example:

# Add a grades table
score = pd.DataFrame({
    'Subject': ['Math', 'Chinese'],
    'Score': [90, 85]
})

# Horizontal concatenation (axis=1)
result_col = pd.concat([df1, score], axis=1)
print("\nResult of column-wise concatenation:")
print(result_col)

Output:

   Name  Age  Subject  Score
0  Zhang San  20    Math     90
1  Li Si  21  Chinese    85

(Note: Here, df1 and score have different row counts. Will missing values be filled automatically? No! Column-wise concatenation requires matching row counts; otherwise, an error occurs. If row counts differ, pandas raises an error. Thus, column-wise concatenation typically requires identical row counts.)

III. merge: Key-based Merging, Like SQL JOIN

merge is like “matching two people’s information by ID”—it requires a common “key” (e.g., name, ID) to associate tables, similar to SQL’s JOIN operation.

1. Basic Syntax

pd.merge(left, right, on=None, how='inner', left_on=None, right_on=None)
  • left/right: The two DataFrames to merge
  • on: Common column name (if keys have the same name in both tables)
  • how: Merge method (inner/left/right/outer; default inner)
  • left_on/right_on: Specify keys if names differ (e.g., left table uses id, right table uses student_id)

2. Core Merge Methods (Using “Student” and “Score” Tables as Examples)

Assume we have:
- Student Table (student): Contains student_id (student number), Name, Age
- Score Table (score): Contains student_id (student number), Subject, Score

# Create sample data
student = pd.DataFrame({
    'student_id': [1, 2, 3],
    'Name': ['Zhang San', 'Li Si', 'Wang Wu'],
    'Age': [20, 21, 22]
})

score = pd.DataFrame({
    'student_id': [1, 2, 4],  # Note: Student 4 does not exist in the student table
    'Subject': ['Math', 'Chinese', 'English'],
    'Score': [90, 85, 95]
})
(1) Inner Join (how='inner', default)

Effect: Keep only rows with common keys (student_id) in both tables.

inner_merge = pd.merge(student, score, on='student_id')
print("Inner Join Result:")
print(inner_merge)

Output:

   student_id       Name  Age  Subject  Score
0           1  Zhang San   20     Math     90
1           2      Li Si   21  Chinese     85
  • Result includes only student_id=1 and 2 (Student 3 and Score 4 are excluded).
(2) Left Join (how='left')

Effect: Retain all rows from the left table (student), filling unmatched values from the right table (score) with NaN.

left_merge = pd.merge(student, score, on='student_id', how='left')
print("\nLeft Join Result:")
print(left_merge)

Output:

   student_id       Name  Age  Subject  Score
0           1  Zhang San   20     Math     90
1           2      Li Si   21  Chinese     85
2           3   Wang Wu   22      NaN    NaN
  • Student 3 (from the student table) has no scores, so Subject and Score are NaN.
(3) Right Join (how='right')

Effect: Retain all rows from the right table (score), filling unmatched values from the left table (student) with NaN.

right_merge = pd.merge(student, score, on='student_id', how='right')
print("\nRight Join Result:")
print(right_merge)

Output:

   student_id       Name   Age  Subject  Score
0           1  Zhang San  20.0     Math     90
1           2      Li Si  21.0  Chinese     85
2           4        NaN   NaN  English     95
  • Score 4 (from the score table) has no student data, so Name and Age are NaN.
(4) Outer Join (how='outer')

Effect: Retain all rows from both tables, filling unmatched values with NaN.

outer_merge = pd.merge(student, score, on='student_id', how='outer')
print("\nOuter Join Result:")
print(outer_merge)

Output:

   student_id       Name   Age  Subject  Score
0           1  Zhang San  20.0     Math     90
1           2      Li Si  21.0  Chinese     85
2           3   Wang Wu  22.0      NaN    NaN
3           4        NaN   NaN  English     95
(5) When Keys Have Different Names

If key names differ (e.g., left table uses id, right table uses student_id), use left_on and right_on.

# Rename keys for demonstration
student_renamed = student.rename(columns={'student_id': 'id'})
score_renamed = score.rename(columns={'student_id': 'student_id'})

diff_keys_merge = pd.merge(
    student_renamed, 
    score_renamed, 
    left_on='id',   # Key column from left table
    right_on='student_id'  # Key column from right table
)
print("\nResult with different key names:")
print(diff_keys_merge)

Output:

   id       Name  Age  student_id  Subject  Score
0   1  Zhang San   20           1     Math     90
1   2      Li Si   21           2  Chinese     85

IV. Summary: When to Use merge vs concat?

Method Applicable Scenarios Key Parameters
concat Simple concatenation (no key) or “add rows/columns” axis (0=row, 1=column), ignore_index
merge Table merging with common keys (like SQL JOIN) how (inner/left/right/outer), on/left_on/right_on

V. Key Takeaways for Beginners

  1. concat:
    - No key required; direct concatenation.
    - Row-wise (axis=0): Increases row count, columns unchanged.
    - Column-wise (axis=1): Increases column count, row counts must match.
    - Use ignore_index=True to avoid duplicate indices.

  2. merge:
    - Requires a common key for matching.
    - how parameter controls merge type (inner/left/right/outer).
    - Use left_on/right_on if key names differ.

Practice with small examples and experiment with different merging methods—you’ll master them quickly!

Xiaoye