ISLAMIC EMIRATE OF AFGHANISTAN
MINISTRY OF HIGHER EDUCATION
GHALIB UNIVERSITY
COMPUTER SCEINE FACULTY
Distributed Database(Semester 5)
(Aggregation)
LECTURER: Zabihullah Rahmani
1
Table of Contents
Aggregation
 Aggregation on the e-commerce data model
 Aggregation framework details
 Performance and limitations
 Other aggregation capabilities
6/14/2025 2
Aggregation
In this chapter, we’ll extend that topic to include more complex queries using the MongoDB aggregation
framework.
 The aggregation framework is MongoDB’s advanced query language, and it allows you to transform and
combine data from multiple documents to generate new information not available in any single
document.
For example, you might use the aggregation framework to determine sales by month, sales by product,
or order totals by user.
For those familiar with relational databases, you can think of the aggregation framework as MongoDB’s
equivalent to the SQL GROUP BY clause.
6/14/2025 3
Cont.
In this chapter, we’ll show you a number of examples using the e-commerce data model that’s used in
the rest of the book and then provide a detailed look at all the aggregation framework operators and
various options for each operator.
Up to now, you’ve designed your data model and database queries to support fast and responsive
website performance.
The aggregation framework can also help with real-time information summarization that may be
needed for an e-commerce website, but it can do much more: providing answers to a wide variety of
questions you might want to answer from your data but that may require crunching large amounts of
data.
6/14/2025 4
Aggregation framework overview
A call to the aggregation framework defines a pipeline (figure 6.1), the aggregation pipeline, where
the output from each step in the pipeline provides input to the next step.
Each step executes a single operation on the input documents to transform the input and generate
output documents.
6/14/2025 5
Aggregation framework overview
Aggregation pipeline operations include the following:
$project—Specify fields to be placed in the output document (projected).
$match—Select documents to be processed, similar to find().
$limit—Limit the number of documents to be passed to the next step.
 $skip—Skip a specified number of documents.
$unwind—Expand an array, generating one output document for each array
entry.
 $group—Group documents by a specified key.
$sort—Sort documents.
$geoNear—Select documents near a geospatial location
$out—Write the results of the pipeline to a collection (new in v2.6).
$redact—Control access to certain data (new in v2.6).
6/14/2025 6
Aggregation framework overview
This code example defines an aggregation framework pipeline that consists of a match, a group, and
then a sort:
db.products.aggregate([ {$match: ...}, {$group: ...}, {$sort: ...} ] )
This series of operations is illustrated in figure 6.2.
6/14/2025 7
Aggregation framework overview
If you’re familiar with the SQL GROUP BY clause, you know that it’s used to provide summary
information similar to the summaries outlined here.
Table 6.1 provides a detailed comparison of SQL commands to the aggregation framework operators.
6/14/2025 8
E-commerce aggregation example
In this section you’ll produce a few example queries for your e-commerce database, illustrating how
to answer a few of the many questions you may want to answer from your data using aggregation.
Before we continue, let’s revisit the e-commerce data model.
6/14/2025 9
.
6/14/2025 10
Products, categories, and reviews
Now let’s look at a simple example of how the aggregation framework can be used to summarize information
about a product.
Chapter 5 showed an example of counting the number of reviews for a given product using this query:
product = db.products.findOne({'slug': 'wheelbarrow-9092'})
reviews_count = db.reviews.count({'product_id': product['_id']})
Let’s see how to do this using the aggregation framework. First, we’ll look at a query that will calculate the total
number of reviews for all products:
6/14/2025 11
Cont.
Next, add one more operator to your pipeline so that you select only the one product you want to get a count
for:
6/14/2025 12
Cont.
CALCULATING THE AVERAGE REVIEW
To calculate the average review for a product, you use the same pipeline as in the previous example and add
one more field:
6/14/2025 13
Cont.
COUNTING REVIEWS BY RATING
Next let’s extend the product summary further and show a breakdown of review counts for each rating.
This is probably something you’ve seen before when shopping online and is illustrated in figure 6.4.
6/14/2025 14
Cont.
This aggregation call would produce an array similar to this:
[ { "_id" : 5, "count" : 5 },
{ "_id" : 4, "count" : 2 },
{ "_id" : 3, "count" : 1 } ]
6/14/2025 15
Cont.
JOINING COLLECTIONS
Next, suppose you want to examine the contents of your database and count the number of products
for each main category.
 Recall that a product has only one main category.
The aggregation command looks like this:
6/14/2025 16
Cont.
Although MongoDB doesn’t allow automatic joins, starting with MongoDB 2.6, there are a couple of
options you can use to provide the equivalent of a SQL join.
One option is to use the forEach function to process the cursor returned from the aggregation
command and add the name using a pseudo-join. Here’s an example:
6/14/2025 17
Cont.
$OUT AND $PROJECT
In a moment you’ll see a much faster option for doing joins using the $unwind operator, but first you
should understand two other operators: $out and $project.
In the previous example, you saved the results of your aggregation pipeline into a collection
named mainCategorySummary using program code to process each output document.
You then saved the document using the following:
db.mainCategorySummary.insert(doc);
6/14/2025 18
Cont.
With the $out operator, you can automatically save the output from a pipeline into a collection.
The $out operator will create the collection if it doesn’t exist, or it’ll replace the collection completely
if it does exist.
6/14/2025 19
Cont.
The following is an example of a $project operator that limits the output documents to just the list of
category IDs used for each product:
6/14/2025 20
Cont.
FASTER JOINS WITH $UNWIND
Next we’ll look at another powerful feature of the aggregation framework, the $unwind operation.
 This operator allows you to expand an array, generating one output document for every input
document array entry.
6/14/2025 21
User and order
Grouping reviews by users and summarizing sales by month.
The example grouping reviews by user showed how many reviews each reviewer had and how many
helpful votes each reviewer had on average.
6/14/2025 22
Cont.
SUMMARIZING SALES BY YEAR AND MONTH
The following is an example that summarizes orders by month and year for orders
beginning in 2010.
6/14/2025 23
Cont.
FINDING BEST MANHATTAN CUSTOMERS
In order to find the highest spenders in Upper Manhattan.
This pipeline is summarized in figure 6.5. Notice that the $match is the first step in the pipeline,
greatly reducing the number of documents your pipeline has to process.
6/14/2025 24
Cont.
The query includes these steps:
■ $match—Find orders shipped to Upper Manhattan.
■ $group—Sum the order amounts for each customer.
■ $match—Select those customers with order totals greater than $100.
■ $sort—Sort the result by descending customer order total.
Let’s develop this pipeline using an approach that may make it easy to develop and test our pipelines
in general.
First we’ll define the parameters for each of the steps:
6/14/2025 25
Cont.
These commands define the parameters you’ll be passing to each of the steps of the aggregation
pipeline.
Given these definitions, the entire pipeline call would appear as shown here:
6/14/2025 26
Cont.
For example, let’s run just the part of the pipeline that summed all customers:
Let’s say you decide to keep the count of the number of orders. To do so, modify the sumByuserId
value:
6/14/2025 27
Cont.
Once you’re satisfied with the result, you can add the $out operator to save the results to a new
collection and thus make the results easily accessible by various applications:
6/14/2025 28
Aggregation pipeline operators
The aggregation framework supports 10 operators:
■ $project—Specify document fields to be processed.
■ $group—Group documents by a specified key.
■ $match—Select documents to be processed, similar to find(...).
■ $limit—Limit the number of documents passed to the next step.
■ $skip—Skip a specified number of documents and don’t pass them to the next step.
■ $unwind—Expand an array, generating one output document for each array entry.
■ $sort—Sort documents.
■ $geoNear—Select documents near a geospatial location.
■ $out—Write the results of the pipeline to a collection (new in v2.6).
■ $redact—Control access to certain data (new in v2.6).
6/14/2025 29
Aggregation pipeline operators
$project
The $project operator contains all of the functionality available in the query projection option covered
in chapter 5 and more.
 The following is a query based on the example in section 5.1.2 for reading the user’s first and last
name:
6/14/2025 30
Aggregation pipeline operators
$group
The $group operator is the main operator used by most aggregation pipelines.
This is the operator that handles the aggregation of data from multiple documents, providing you
with summary statistics using functions such as min, max, and average.
For those familiar with SQL, the $group function is equivalent to the SQL GROUP BY clause.
You tell the $group operator how to group documents by defining the _id field.
The $group operator then groups the input documents by the specified _id field, providing
aggregated information for each group of documents.
6/14/2025 31
Aggregation pipeline operators
The following example was shown in 6.2.2, where you summarized sales by month and year:
6/14/2025 32
Aggregation pipeline operators
The remaining fields in the $group output documents are limited to being defined using the $group
functions shown in table 6.2.
6/14/2025 33
Aggregation pipeline operators
The following example creates a list of customers, each with an array of products ordered by that
customer.
The array of products is created using the $push function:
6/14/2025 34
Aggregation pipeline operators
$match, $sort, $skip, $limit
Here’s an example based on the paging query shown in section 5.1.1:
6/14/2025 35
Aggregation pipeline operators
The identical query in the aggregation framework would look like this:
6/14/2025 36
Aggregation pipeline operators
$unwind
This operator expands an array by generating one output document for every entry in an array.
 The fields from the main document, as well as the fields from each array entry,are put into the
output document.
This example shows the categories for a product before and after a $unwind:
6/14/2025 37
Aggregation pipeline operators
$out
In section 6.2.2 you created a pipeline to find the best Manhattan customers.
We’ll use that example again here, but this time the final output of the pipeline is saved in the
collection targetedCustomers using the $out operator.
The $out operator must be the last operator in your pipeline:
6/14/2025 38
Reshaping documents
The Mongo DB aggregation pipeline contains a number of functions you can use to reshape a
document and thus produce an output document that contains fields not in the original input
document.
 You’ll typically use these functions with the $project operator, but you can also use them when
defining the _id for the $group operator.
if you wanted to create a subobject called name with two fields, first and last, you could use this code:
6/14/2025 39
Reshaping documents
String functions
The string functions shown in table 6.3 allow you to manipulate strings.
6/14/2025 40
Reshaping documents
This example uses three functions, $concat, $substr, and $toUpper:
6/14/2025 41
Cont.
Arithmetic functions
Arithmetic functions include the standard list of arithmetic operators shown in table 6.4.
6/14/2025 42
Cont.
Date functions
The date functions shown in table 6.5 create a new field by extracting part of an existing date time field, or by
calculating another aspect of a date such as day of year, day of month, or day of week.
6/14/2025 43
Cont.
Logical functions
The logical functions, shown in table 6.6, should look familiar. Most are similar to the find query operators
summarized in chapter 5, section 5.2.
6/14/2025 44
Cont.
6/14/2025 45
Cont.
6/14/2025 46
Set Operators
Set operators, summarized in table 6.7, allow you to compare the contents of two arrays.
 With set operators, you can compare two arrays to see if they’re exactly the same, what elements they
have in common, or what elements are in one but not the other.
Cont.
6/14/2025 47
Miscellaneous functions
The last group, miscellaneous functions, are summarized in table 6.8.
Understanding aggregation pipeline performance
6/14/2025 48
Here are some key considerations that can have a major impact on the performance of your
aggregation pipeline:
Try to reduce the number and size of documents as early as possible in your pipeline.
■ Indexes can only be used by $match and $sort operations and can greatly speed up these
operations.
■ You can’t use an index after your pipeline uses an operator other than $match or $sort.
■ If you use sharding (a common practice for extremely large collections), the $match and
$project operators will be run on individual shards. Once you use any other operator, the
remaining pipeline will be run on the primary shard.
There are still cases, especially when using the aggregation framework, where you’re
going to have to crunch through huge amounts of data, and indexing may not be an option.
 An example of this was when you calculated sales by year and month.
Understanding aggregation pipeline performance
6/14/2025 49
Aggregation pipeline options
Until now, we’ve only shown the aggregate() function when it’s passed an array of pipeline
operations. Starting with Mongo DB v2.6, there’s a second parameter you can pass to the
aggregate() function that you can use to specify options for the aggregation call.
The options available include the following:
■ explain()—Runs the pipeline and returns only pipeline process details
■ allowDiskUse—Uses disk for intermediate results
■ cursor—Specifies initial batch size
The options are passed using this format db.collection.aggregate(pipeline,additionalOptions)
The format of the additionalOptions parameter is as follows:
{explain:true, allowDiskUse:true, cursor: {batchSize: n} }
Understanding aggregation pipeline performance
6/14/2025 50
The aggregation framework’s explain( ) function
The Mongo DB explain() function, similar to the EXPLAIN function you might have
seen in SQL , describes query paths and allows developers to diagnose slow operations
by determining indexes that a query has used.
The explain() function for the aggregation framework is a bit different from the
explain() used in the find() query function but it provides similar capabilities.
As you might expect, for an aggregation pipeline you’ll receive explain output for each
operation in the pipeline, because each step in the pipeline is almost a call unto itself.
Understanding aggregation pipeline performance
6/14/2025 51
allowDiskUse option
Eventually, if you begin working with large enough collections, you’ll see an error sim-
ilar to this:
Understanding aggregation pipeline performance
6/14/2025 52
allowDiskUse option
Eventually, if you begin working with large enough collections, you’ll see an error sim-
ilar to this:
Even more frustrating, this error will probably happen after a long wait, during which
your aggregation pipeline has been processing millions of documents, only to fail.
What’s happening in this case is that the pipeline has intermediate results that exceed
the 100 MB of RAM limit allowed by MongoDB for pipeline stages.
Understanding aggregation pipeline performance
6/14/2025 53
The fix is simple and is even specified in the error message: Pass allowDiskUse:true to opt in.
Let’s see an example with your summary of sales by month, a pipeline that would
need this option because your site will have huge sales volumes:
Understanding aggregation pipeline performance
6/14/2025 54
Aggregation cursor option
Before MongoDB v2.6, the result of your pipeline was a single document with a limit of
16 MB. Starting with v2.6, the default is to return a cursor if you’re accessing Mongo DB
via the Mongo shell. But if you’re running the pipeline from a program, to avoid “breaking”
existing programs the default is unchanged and still returns a single docu-
ment limited to 16 MB.
Understanding aggregation pipeline performance
6/14/2025 55
The cursor returned by the aggregation pipeline supports the following calls:
■ cursor.hasNext()—Determine whether there’s a next document in the results.
■ cursor.next()—Return the next document in the results.
■ cursor.toArray()—Return the entire result as an array.
■ cursor.forEach()—Execute a function for each row in the results.
■ cursor.map()—Execute a function for each row in the results and return an array of function
return values.
■ cursor.itcount()—Return a count of items (for testing only).
■ cursor.pretty()—Display an array of formatted results.
Now let’s wrap up by looking at alternatives to the pipeline for performing aggregations.
Other aggregation capabilities
6/14/2025 56
Although the aggregation pipeline is now considered the preferred method for aggre-
gating data in MongoDB, a few alternatives are available. Some are much simpler, such
as the .count() function.
 Another, more complex alternative is the older MongoDB map-reduce function.
Let’s start with the simpler alternatives first.
count( ) and .distinct( )
product = db.products.findOne({'slug': 'wheelbarrow-9092'})
reviews_count = db.reviews.count({'product_id': product['_id']})
db.orders.distinct('shipping_address.zip')
Other aggregation capabilities
6/14/2025 57
map-reduce
map-reduce was Mongo DB ’s first attempt at providing a flexible aggregation capability.
With map-reduce, you have the ability to use JavaScript in defining your entire
process.
 This provides a great deal of flexibility but generally performs much slower
than the aggregation framework.
Other aggregation capabilities
6/14/2025 58
Other aggregation capabilities
6/14/2025 59
The corresponding reduce function should make this clearer:
Other aggregation capabilities
6/14/2025 60
ADDING A QUERY FILTER AND SAVING OUTPUT
The shell’s map-reduce method requires a map and a reduce function as arguments.
But this example adds two more.
The first is a query filter that limits the documents involved in the aggregation to orders made
since the beginning of 2010.
The second argument is the name of the output collection:
Other aggregation capabilities
6/14/2025 61
The process, as illustrated in figure 6.6, includes these steps:
1 filter will select only certain orders.
2 map then emits a key-value pair, usually one output for each input, but it can emit none or
many as well.
3 reduce is passed a key with an array of values emitted by map, usually one array
for each key, but it may be passed the same key multiple times with different
arrays of values.
Any Question??
62

13-Aggregations in MongoDB and MYSQL.pdf

  • 1.
    ISLAMIC EMIRATE OFAFGHANISTAN MINISTRY OF HIGHER EDUCATION GHALIB UNIVERSITY COMPUTER SCEINE FACULTY Distributed Database(Semester 5) (Aggregation) LECTURER: Zabihullah Rahmani 1
  • 2.
    Table of Contents Aggregation Aggregation on the e-commerce data model  Aggregation framework details  Performance and limitations  Other aggregation capabilities 6/14/2025 2
  • 3.
    Aggregation In this chapter,we’ll extend that topic to include more complex queries using the MongoDB aggregation framework.  The aggregation framework is MongoDB’s advanced query language, and it allows you to transform and combine data from multiple documents to generate new information not available in any single document. For example, you might use the aggregation framework to determine sales by month, sales by product, or order totals by user. For those familiar with relational databases, you can think of the aggregation framework as MongoDB’s equivalent to the SQL GROUP BY clause. 6/14/2025 3
  • 4.
    Cont. In this chapter,we’ll show you a number of examples using the e-commerce data model that’s used in the rest of the book and then provide a detailed look at all the aggregation framework operators and various options for each operator. Up to now, you’ve designed your data model and database queries to support fast and responsive website performance. The aggregation framework can also help with real-time information summarization that may be needed for an e-commerce website, but it can do much more: providing answers to a wide variety of questions you might want to answer from your data but that may require crunching large amounts of data. 6/14/2025 4
  • 5.
    Aggregation framework overview Acall to the aggregation framework defines a pipeline (figure 6.1), the aggregation pipeline, where the output from each step in the pipeline provides input to the next step. Each step executes a single operation on the input documents to transform the input and generate output documents. 6/14/2025 5
  • 6.
    Aggregation framework overview Aggregationpipeline operations include the following: $project—Specify fields to be placed in the output document (projected). $match—Select documents to be processed, similar to find(). $limit—Limit the number of documents to be passed to the next step.  $skip—Skip a specified number of documents. $unwind—Expand an array, generating one output document for each array entry.  $group—Group documents by a specified key. $sort—Sort documents. $geoNear—Select documents near a geospatial location $out—Write the results of the pipeline to a collection (new in v2.6). $redact—Control access to certain data (new in v2.6). 6/14/2025 6
  • 7.
    Aggregation framework overview Thiscode example defines an aggregation framework pipeline that consists of a match, a group, and then a sort: db.products.aggregate([ {$match: ...}, {$group: ...}, {$sort: ...} ] ) This series of operations is illustrated in figure 6.2. 6/14/2025 7
  • 8.
    Aggregation framework overview Ifyou’re familiar with the SQL GROUP BY clause, you know that it’s used to provide summary information similar to the summaries outlined here. Table 6.1 provides a detailed comparison of SQL commands to the aggregation framework operators. 6/14/2025 8
  • 9.
    E-commerce aggregation example Inthis section you’ll produce a few example queries for your e-commerce database, illustrating how to answer a few of the many questions you may want to answer from your data using aggregation. Before we continue, let’s revisit the e-commerce data model. 6/14/2025 9
  • 10.
  • 11.
    Products, categories, andreviews Now let’s look at a simple example of how the aggregation framework can be used to summarize information about a product. Chapter 5 showed an example of counting the number of reviews for a given product using this query: product = db.products.findOne({'slug': 'wheelbarrow-9092'}) reviews_count = db.reviews.count({'product_id': product['_id']}) Let’s see how to do this using the aggregation framework. First, we’ll look at a query that will calculate the total number of reviews for all products: 6/14/2025 11
  • 12.
    Cont. Next, add onemore operator to your pipeline so that you select only the one product you want to get a count for: 6/14/2025 12
  • 13.
    Cont. CALCULATING THE AVERAGEREVIEW To calculate the average review for a product, you use the same pipeline as in the previous example and add one more field: 6/14/2025 13
  • 14.
    Cont. COUNTING REVIEWS BYRATING Next let’s extend the product summary further and show a breakdown of review counts for each rating. This is probably something you’ve seen before when shopping online and is illustrated in figure 6.4. 6/14/2025 14
  • 15.
    Cont. This aggregation callwould produce an array similar to this: [ { "_id" : 5, "count" : 5 }, { "_id" : 4, "count" : 2 }, { "_id" : 3, "count" : 1 } ] 6/14/2025 15
  • 16.
    Cont. JOINING COLLECTIONS Next, supposeyou want to examine the contents of your database and count the number of products for each main category.  Recall that a product has only one main category. The aggregation command looks like this: 6/14/2025 16
  • 17.
    Cont. Although MongoDB doesn’tallow automatic joins, starting with MongoDB 2.6, there are a couple of options you can use to provide the equivalent of a SQL join. One option is to use the forEach function to process the cursor returned from the aggregation command and add the name using a pseudo-join. Here’s an example: 6/14/2025 17
  • 18.
    Cont. $OUT AND $PROJECT Ina moment you’ll see a much faster option for doing joins using the $unwind operator, but first you should understand two other operators: $out and $project. In the previous example, you saved the results of your aggregation pipeline into a collection named mainCategorySummary using program code to process each output document. You then saved the document using the following: db.mainCategorySummary.insert(doc); 6/14/2025 18
  • 19.
    Cont. With the $outoperator, you can automatically save the output from a pipeline into a collection. The $out operator will create the collection if it doesn’t exist, or it’ll replace the collection completely if it does exist. 6/14/2025 19
  • 20.
    Cont. The following isan example of a $project operator that limits the output documents to just the list of category IDs used for each product: 6/14/2025 20
  • 21.
    Cont. FASTER JOINS WITH$UNWIND Next we’ll look at another powerful feature of the aggregation framework, the $unwind operation.  This operator allows you to expand an array, generating one output document for every input document array entry. 6/14/2025 21
  • 22.
    User and order Groupingreviews by users and summarizing sales by month. The example grouping reviews by user showed how many reviews each reviewer had and how many helpful votes each reviewer had on average. 6/14/2025 22
  • 23.
    Cont. SUMMARIZING SALES BYYEAR AND MONTH The following is an example that summarizes orders by month and year for orders beginning in 2010. 6/14/2025 23
  • 24.
    Cont. FINDING BEST MANHATTANCUSTOMERS In order to find the highest spenders in Upper Manhattan. This pipeline is summarized in figure 6.5. Notice that the $match is the first step in the pipeline, greatly reducing the number of documents your pipeline has to process. 6/14/2025 24
  • 25.
    Cont. The query includesthese steps: ■ $match—Find orders shipped to Upper Manhattan. ■ $group—Sum the order amounts for each customer. ■ $match—Select those customers with order totals greater than $100. ■ $sort—Sort the result by descending customer order total. Let’s develop this pipeline using an approach that may make it easy to develop and test our pipelines in general. First we’ll define the parameters for each of the steps: 6/14/2025 25
  • 26.
    Cont. These commands definethe parameters you’ll be passing to each of the steps of the aggregation pipeline. Given these definitions, the entire pipeline call would appear as shown here: 6/14/2025 26
  • 27.
    Cont. For example, let’srun just the part of the pipeline that summed all customers: Let’s say you decide to keep the count of the number of orders. To do so, modify the sumByuserId value: 6/14/2025 27
  • 28.
    Cont. Once you’re satisfiedwith the result, you can add the $out operator to save the results to a new collection and thus make the results easily accessible by various applications: 6/14/2025 28
  • 29.
    Aggregation pipeline operators Theaggregation framework supports 10 operators: ■ $project—Specify document fields to be processed. ■ $group—Group documents by a specified key. ■ $match—Select documents to be processed, similar to find(...). ■ $limit—Limit the number of documents passed to the next step. ■ $skip—Skip a specified number of documents and don’t pass them to the next step. ■ $unwind—Expand an array, generating one output document for each array entry. ■ $sort—Sort documents. ■ $geoNear—Select documents near a geospatial location. ■ $out—Write the results of the pipeline to a collection (new in v2.6). ■ $redact—Control access to certain data (new in v2.6). 6/14/2025 29
  • 30.
    Aggregation pipeline operators $project The$project operator contains all of the functionality available in the query projection option covered in chapter 5 and more.  The following is a query based on the example in section 5.1.2 for reading the user’s first and last name: 6/14/2025 30
  • 31.
    Aggregation pipeline operators $group The$group operator is the main operator used by most aggregation pipelines. This is the operator that handles the aggregation of data from multiple documents, providing you with summary statistics using functions such as min, max, and average. For those familiar with SQL, the $group function is equivalent to the SQL GROUP BY clause. You tell the $group operator how to group documents by defining the _id field. The $group operator then groups the input documents by the specified _id field, providing aggregated information for each group of documents. 6/14/2025 31
  • 32.
    Aggregation pipeline operators Thefollowing example was shown in 6.2.2, where you summarized sales by month and year: 6/14/2025 32
  • 33.
    Aggregation pipeline operators Theremaining fields in the $group output documents are limited to being defined using the $group functions shown in table 6.2. 6/14/2025 33
  • 34.
    Aggregation pipeline operators Thefollowing example creates a list of customers, each with an array of products ordered by that customer. The array of products is created using the $push function: 6/14/2025 34
  • 35.
    Aggregation pipeline operators $match,$sort, $skip, $limit Here’s an example based on the paging query shown in section 5.1.1: 6/14/2025 35
  • 36.
    Aggregation pipeline operators Theidentical query in the aggregation framework would look like this: 6/14/2025 36
  • 37.
    Aggregation pipeline operators $unwind Thisoperator expands an array by generating one output document for every entry in an array.  The fields from the main document, as well as the fields from each array entry,are put into the output document. This example shows the categories for a product before and after a $unwind: 6/14/2025 37
  • 38.
    Aggregation pipeline operators $out Insection 6.2.2 you created a pipeline to find the best Manhattan customers. We’ll use that example again here, but this time the final output of the pipeline is saved in the collection targetedCustomers using the $out operator. The $out operator must be the last operator in your pipeline: 6/14/2025 38
  • 39.
    Reshaping documents The MongoDB aggregation pipeline contains a number of functions you can use to reshape a document and thus produce an output document that contains fields not in the original input document.  You’ll typically use these functions with the $project operator, but you can also use them when defining the _id for the $group operator. if you wanted to create a subobject called name with two fields, first and last, you could use this code: 6/14/2025 39
  • 40.
    Reshaping documents String functions Thestring functions shown in table 6.3 allow you to manipulate strings. 6/14/2025 40
  • 41.
    Reshaping documents This exampleuses three functions, $concat, $substr, and $toUpper: 6/14/2025 41
  • 42.
    Cont. Arithmetic functions Arithmetic functionsinclude the standard list of arithmetic operators shown in table 6.4. 6/14/2025 42
  • 43.
    Cont. Date functions The datefunctions shown in table 6.5 create a new field by extracting part of an existing date time field, or by calculating another aspect of a date such as day of year, day of month, or day of week. 6/14/2025 43
  • 44.
    Cont. Logical functions The logicalfunctions, shown in table 6.6, should look familiar. Most are similar to the find query operators summarized in chapter 5, section 5.2. 6/14/2025 44
  • 45.
  • 46.
    Cont. 6/14/2025 46 Set Operators Setoperators, summarized in table 6.7, allow you to compare the contents of two arrays.  With set operators, you can compare two arrays to see if they’re exactly the same, what elements they have in common, or what elements are in one but not the other.
  • 47.
    Cont. 6/14/2025 47 Miscellaneous functions Thelast group, miscellaneous functions, are summarized in table 6.8.
  • 48.
    Understanding aggregation pipelineperformance 6/14/2025 48 Here are some key considerations that can have a major impact on the performance of your aggregation pipeline: Try to reduce the number and size of documents as early as possible in your pipeline. ■ Indexes can only be used by $match and $sort operations and can greatly speed up these operations. ■ You can’t use an index after your pipeline uses an operator other than $match or $sort. ■ If you use sharding (a common practice for extremely large collections), the $match and $project operators will be run on individual shards. Once you use any other operator, the remaining pipeline will be run on the primary shard. There are still cases, especially when using the aggregation framework, where you’re going to have to crunch through huge amounts of data, and indexing may not be an option.  An example of this was when you calculated sales by year and month.
  • 49.
    Understanding aggregation pipelineperformance 6/14/2025 49 Aggregation pipeline options Until now, we’ve only shown the aggregate() function when it’s passed an array of pipeline operations. Starting with Mongo DB v2.6, there’s a second parameter you can pass to the aggregate() function that you can use to specify options for the aggregation call. The options available include the following: ■ explain()—Runs the pipeline and returns only pipeline process details ■ allowDiskUse—Uses disk for intermediate results ■ cursor—Specifies initial batch size The options are passed using this format db.collection.aggregate(pipeline,additionalOptions) The format of the additionalOptions parameter is as follows: {explain:true, allowDiskUse:true, cursor: {batchSize: n} }
  • 50.
    Understanding aggregation pipelineperformance 6/14/2025 50 The aggregation framework’s explain( ) function The Mongo DB explain() function, similar to the EXPLAIN function you might have seen in SQL , describes query paths and allows developers to diagnose slow operations by determining indexes that a query has used. The explain() function for the aggregation framework is a bit different from the explain() used in the find() query function but it provides similar capabilities. As you might expect, for an aggregation pipeline you’ll receive explain output for each operation in the pipeline, because each step in the pipeline is almost a call unto itself.
  • 51.
    Understanding aggregation pipelineperformance 6/14/2025 51 allowDiskUse option Eventually, if you begin working with large enough collections, you’ll see an error sim- ilar to this:
  • 52.
    Understanding aggregation pipelineperformance 6/14/2025 52 allowDiskUse option Eventually, if you begin working with large enough collections, you’ll see an error sim- ilar to this: Even more frustrating, this error will probably happen after a long wait, during which your aggregation pipeline has been processing millions of documents, only to fail. What’s happening in this case is that the pipeline has intermediate results that exceed the 100 MB of RAM limit allowed by MongoDB for pipeline stages.
  • 53.
    Understanding aggregation pipelineperformance 6/14/2025 53 The fix is simple and is even specified in the error message: Pass allowDiskUse:true to opt in. Let’s see an example with your summary of sales by month, a pipeline that would need this option because your site will have huge sales volumes:
  • 54.
    Understanding aggregation pipelineperformance 6/14/2025 54 Aggregation cursor option Before MongoDB v2.6, the result of your pipeline was a single document with a limit of 16 MB. Starting with v2.6, the default is to return a cursor if you’re accessing Mongo DB via the Mongo shell. But if you’re running the pipeline from a program, to avoid “breaking” existing programs the default is unchanged and still returns a single docu- ment limited to 16 MB.
  • 55.
    Understanding aggregation pipelineperformance 6/14/2025 55 The cursor returned by the aggregation pipeline supports the following calls: ■ cursor.hasNext()—Determine whether there’s a next document in the results. ■ cursor.next()—Return the next document in the results. ■ cursor.toArray()—Return the entire result as an array. ■ cursor.forEach()—Execute a function for each row in the results. ■ cursor.map()—Execute a function for each row in the results and return an array of function return values. ■ cursor.itcount()—Return a count of items (for testing only). ■ cursor.pretty()—Display an array of formatted results. Now let’s wrap up by looking at alternatives to the pipeline for performing aggregations.
  • 56.
    Other aggregation capabilities 6/14/202556 Although the aggregation pipeline is now considered the preferred method for aggre- gating data in MongoDB, a few alternatives are available. Some are much simpler, such as the .count() function.  Another, more complex alternative is the older MongoDB map-reduce function. Let’s start with the simpler alternatives first. count( ) and .distinct( ) product = db.products.findOne({'slug': 'wheelbarrow-9092'}) reviews_count = db.reviews.count({'product_id': product['_id']}) db.orders.distinct('shipping_address.zip')
  • 57.
    Other aggregation capabilities 6/14/202557 map-reduce map-reduce was Mongo DB ’s first attempt at providing a flexible aggregation capability. With map-reduce, you have the ability to use JavaScript in defining your entire process.  This provides a great deal of flexibility but generally performs much slower than the aggregation framework.
  • 58.
  • 59.
    Other aggregation capabilities 6/14/202559 The corresponding reduce function should make this clearer:
  • 60.
    Other aggregation capabilities 6/14/202560 ADDING A QUERY FILTER AND SAVING OUTPUT The shell’s map-reduce method requires a map and a reduce function as arguments. But this example adds two more. The first is a query filter that limits the documents involved in the aggregation to orders made since the beginning of 2010. The second argument is the name of the output collection:
  • 61.
    Other aggregation capabilities 6/14/202561 The process, as illustrated in figure 6.6, includes these steps: 1 filter will select only certain orders. 2 map then emits a key-value pair, usually one output for each input, but it can emit none or many as well. 3 reduce is passed a key with an array of values emitted by map, usually one array for each key, but it may be passed the same key multiple times with different arrays of values.
  • 62.