MongoDB Aggregation Query Example: Statistical Analysis of User Data

In MongoDB, Aggregation is a powerful data processing tool that allows us to perform multi-stage transformations and analysis on documents in a collection, similar to a pipeline for data processing. For example, statistics on the number of users in different regions, calculation of average user age, analysis of total order amounts, etc., can all be easily achieved through the aggregation pipeline.

Prepare Sample Data

To facilitate understanding, we assume a user data collection users with each user document structured as follows (simplified version):

{
  "_id": ObjectId("..."),
  "name": "张三",
  "age": 25,
  "gender": "男",
  "region": "北京",
  "orders": [
    { "amount": 100, "date": "2023-01-01" },
    { "amount": 150, "date": "2023-02-15" }
  ]
}

Included fields: name (name), age (age), gender (gender), region (region), and order array (orders, where each order has an amount amount and date date).

Introduction to Basic Aggregation Stages

The MongoDB aggregation pipeline consists of multiple Stages, where each stage handles one operation on the data. The output of the previous stage serves as the input for the next stage. Common stages and operators:
- $match: Filters documents (similar to SQL’s WHERE clause)
- $group: Groups documents by specified fields and uses accumulators (e.g., $sum, $avg) for statistics
- $project: Retains only necessary fields (similar to SQL’s SELECT)
- $sort: Sorts documents by specified fields (similar to SQL’s ORDER BY)
- $unwind: Unwinds an array field (e.g., splits the orders array into multiple documents)
- Accumulator Operators: $sum (sum), $avg (average), $max (maximum), $min (minimum)

Example 1: Count Users by Gender

Requirement: Count how many male and female users there are.

Aggregation Pipeline:

db.users.aggregate([
  {
    $group: {
      _id: "$gender",  // Group by the gender field
      count: { $sum: 1 }  // Count each document in the group (+1 for each user)
    }
  },
  {
    $sort: { count: -1 }  // Sort by count in descending order
  }
])

Explanation:
- $group: Groups by gender, with _id specifying the grouping field. count is the result field, and $sum:1 increments the count for each document (i.e., each user is counted once).
- $sort: Sorts by count in descending order for a more intuitive result.

Sample Output:

[
  { "_id": "男", "count": 120 },
  { "_id": "女", "count": 80 }
]

Example 2: Calculate Average Age by Region

Requirement: Calculate the average age of users in each region (ignoring users with null age).

Aggregation Pipeline:

db.users.aggregate([
  {
    $match: { age: { $exists: true } }  // Filter out users without age data
  },
  {
    $group: {
      _id: "$region",  // Group by region
      avg_age: { $avg: "$age" }  // Calculate average age for the region
    }
  },
  {
    $project: {  // Retain only necessary fields, hide _id
      region: "$_id",
      avg_age: 1,
      _id: 0
    }
  },
  {
    $sort: { avg_age: -1 }  // Sort by average age in descending order
  }
])

Explanation:
- $match: Uses $exists: true to filter out users without age data, avoiding incorrect calculations in $avg.
- $group: $avg: "$age" calculates the average age for the group (automatically ignores null/non-existent fields).
- $project: Renames _id to region and hides the original _id for a clearer result.

Example 3: Total Spending per User

Requirement: Users may have multiple orders; calculate the total order amount for each user.

Aggregation Pipeline:

db.users.aggregate([
  {
    $unwind: "$orders"  // Unwind the orders array, splitting each order into a separate document
  },
  {
    $group: {
      _id: "$_id",  // Group by user ID (each user)
      total_amount: { $sum: "$orders.amount" }  // Sum the amount of each order
    }
  },
  {
    $sort: { total_amount: -1 },  // Sort by total amount in descending order
    $limit: 10  // Only take the top 10 users with the highest spending
  }
])

Explanation:
- $unwind: "$orders": Splits each element in the orders array into a separate document (e.g., a user with 2 orders becomes 2 documents).
- $group: Groups by user _id and accumulates the amount of each order.
- $limit: 10: Returns only the top 10 results (can be combined with $skip for pagination, e.g., $skip: 20 to skip the first 20 results).

Example 4: Count, Average Age, and Max Age by Region

Requirement: Simultaneously count the number of users, average age, and maximum age in each region.

Aggregation Pipeline:

db.users.aggregate([
  {
    $group: {
      _id: "$region",
      user_count: { $sum: 1 },
      avg_age: { $avg: "$age" },
      max_age: { $max: "$age" }
    }
  },
  {
    $sort: { user_count: -1 }
  }
])

Explanation:
- $group can use multiple accumulators simultaneously: $sum:1 counts users, $avg: "$age" calculates the average age, and $max: "$age" captures the maximum age.
- The result includes user_count, avg_age, and max_age for each region.

Summary

MongoDB aggregation queries achieve flexible data processing through pipeline stages. Key operations for beginners include:
1. Filtering: Use $match to narrow down data (similar to SQL’s WHERE).
2. Grouping & Statistics: Use $group with accumulators ($sum/$avg, etc.) for statistics.
3. Field Handling: Use $project to hide unnecessary fields for a cleaner result.
4. Sorting/Pagination: Use $sort and $limit/$skip to control result order and quantity.

Start with simple grouping statistics and gradually experiment with complex scenarios (e.g., nested aggregations, array processing). For issues, refer to the official MongoDB documentation on aggregation pipeline syntax (https://docs.mongodb.com/manual/reference/operator/aggregation/) and verify results through hands-on coding!

Xiaoye