1. What is MongoDB Aggregation Pipeline?¶
Imagine you have a pile of scattered data, such as students’ exam scores or shopping order information, and you need to analyze it—for example, calculating average scores per subject, identifying the student with the highest score, or grouping statistics by category. MongoDB’s Aggregation Pipeline is like a “data assembly line.” You can pass data through multiple “processing stations” (called “stages”), where each station performs different operations on the data, ultimately yielding the analysis results you need.
For example:
If you have a collection named “StudentScores” (similar to an Excel table) recording each student’s name, subject, and score, to calculate the “average math score,” you don’t need to export all data to Excel and manually compute it. Instead, use the aggregation pipeline to do it in one step: first filter students with “subject = Math” (filtering stage), then calculate the average score of these students (grouping and statistics stage).
2. Core Concepts of Aggregation Pipeline¶
The aggregation pipeline consists of multiple “stages”, each acting as a “processing station.” Data flows from the first stage to the next, with the order of stages being critical because each stage processes the output of the previous one.
Core Stages and Operators¶
| Stage Name (Operator) | Simple Explanation | SQL Equivalent |
|---|---|---|
$match |
Filter data (similar to find() query) |
WHERE |
$project |
Keep only necessary fields (or rename/calculate new fields) | SELECT |
$group |
Group data by conditions and perform statistics (e.g., average, total) | GROUP BY |
$sort |
Sort the results | ORDER BY |
$limit |
Limit the amount of returned data | LIMIT |
3. Practical Cases: From Simple to Complex¶
Assume we have a collection named students with the following simplified data structure:
{
"_id": 1,
"name": "Zhang San",
"class": 1,
"subject": "Math",
"score": 85
},
{
"_id": 2,
"name": "Li Si",
"class": 1,
"subject": "Math",
"score": 92
},
{
"_id": 3,
"name": "Wang Wu",
"class": 2,
"subject": "Math",
"score": 78
}
Case 1: Filtering + Projection (Only View Math Scores)¶
Requirement: Find math scores of students in class 1, showing only name and score.
Code Implementation:
db.students.aggregate([
// Stage 1: Filter students in class 1 with subject "Math"
{ $match: { class: 1, subject: "Math" } },
// Stage 2: Keep only name and score fields
{ $project: { _id: 0, name: 1, score: 1 } }
])
Output:
{ "name": "Zhang San", "score": 85 },
{ "name": "Li Si", "score": 92 }
Explanation:
- $match is equivalent to SQL WHERE class=1 AND subject="Math", retaining only matching documents.
- _id: 0 in $project hides the default _id field, while name: 1 and score: 1 include these fields.
Case 2: Grouping Statistics (Average Score by Subject)¶
Requirement: Group by “subject” and calculate the average score for each subject.
Code Implementation:
db.students.aggregate([
// Stage 1: Group by subject and calculate average score
{ $group: {
_id: "$subject", // Group key: "subject" field
avgScore: { $avg: "$score" } // Calculate average score
}
},
// Stage 2: Sort by average score in descending order
{ $sort: { avgScore: -1 } }
])
Output:
{ "_id": "Math", "avgScore": 85.0 },
{ "_id": "Chinese", "avgScore": 76.5 } // Assume Chinese data exists
Explanation:
- _id: "$subject" in $group groups documents by the “subject” field.
- avgScore is a custom statistic, with $avg: "$score" computing the average of the “score” field in each group.
- $sort orders results by avgScore in descending order (-1 for descending, 1 for ascending).
Case 3: Multi-Stage Combination (Class + Subject Average Score + Total Students)¶
Requirement: Calculate average score and total number of students for each class and subject.
Code Implementation:
db.students.aggregate([
// Stage 1: Group by class + subject
{ $group: {
_id: { class: "$class", subject: "$subject" }, // Composite group key
totalStudents: { $sum: 1 }, // Total students: count each document
avgScore: { $avg: "$score" } // Average score
}
},
// Stage 2: Sort by class and subject
{ $sort: { "_id.class": 1, "_id.subject": 1 } }
])
Output:
{ "_id": { "class": 1, "subject": "Math" }, "totalStudents": 2, "avgScore": 88.5 },
{ "_id": { "class": 2, "subject": "Math" }, "totalStudents": 1, "avgScore": 78.0 }
Explanation:
- _id: { class: "$class", subject: "$subject" } groups by both “class” and “subject” fields.
- $sum: 1 counts the number of documents in each group (total students).
4. Quick Reference for Common Operators¶
| Operator | Purpose | Example |
|---|---|---|
$match |
Filter documents | { $match: { score: { $gt: 60 } } } (score > 60) |
$project |
Control output fields | { $project: { name: 1, newScore: { $add: ["$score", 5] } } } |
$group |
Group and aggregate | { $group: { _id: "$subject", total: { $sum: "$score" } } } |
$sort |
Sort results | { $sort: { score: -1 } } (descending order) |
$limit |
Limit result count | { $limit: 10 } (return top 10) |
$skip |
Skip documents | { $skip: 5 } (skip first 5) |
5. Summary¶
MongoDB’s aggregation pipeline is like a “data factory” where you can complete complex analysis through stage combinations:
1. Filter: Use $match to exclude irrelevant data.
2. Project: Use $project to retain only key information.
3. Group: Use $group to perform statistics (average, total, etc.) by conditions.
4. Sort/Limit: Use $sort and $limit to optimize results.
Start with simple examples, then experiment with multi-stage combinations. You’ll find the aggregation pipeline more efficient and flexible than manual data processing!
Pro Tip: Practice combining scenarios (e.g., “filter then group” or “sort then limit”) to familiarize yourself with each stage’s role. Mastery comes with regular practice!