13-Aggregations in MongoDB and MYSQL.pdf

ISLAMIC EMIRATE OF AFGHANISTAN
MINISTRY OF HIGHER EDUCATION
GHALIB UNIVERSITY
COMPUTER SCEINE FACULTY
Distributed Database(Semester 5)
(Aggregation)
LECTURER: Zabihullah Rahmani
1

Table of Contents
Aggregation
 Aggregation on the e-commerce data model
 Aggregation framework details
 Performance and limitations
 Other aggregation capabilities
6/14/2025 2

Aggregation
In this chapter, we’ll extend that topic to include more complex queries using the MongoDB aggregation
framework.
 The aggregation framework is MongoDB’s advanced query language, and it allows you to transform and
combine data from multiple documents to generate new information not available in any single
document.
For example, you might use the aggregation framework to determine sales by month, sales by product,
or order totals by user.
For those familiar with relational databases, you can think of the aggregation framework as MongoDB’s
equivalent to the SQL GROUP BY clause.
6/14/2025 3

Cont.
In this chapter, we’ll show you a number of examples using the e-commerce data model that’s used in
the rest of the book and then provide a detailed look at all the aggregation framework operators and
various options for each operator.
Up to now, you’ve designed your data model and database queries to support fast and responsive
website performance.
The aggregation framework can also help with real-time information summarization that may be
needed for an e-commerce website, but it can do much more: providing answers to a wide variety of
questions you might want to answer from your data but that may require crunching large amounts of
data.
6/14/2025 4

Aggregation framework overview
A call to the aggregation framework defines a pipeline (figure 6.1), the aggregation pipeline, where
the output from each step in the pipeline provides input to the next step.
Each step executes a single operation on the input documents to transform the input and generate
output documents.
6/14/2025 5

Aggregation pipeline operations include the following:
$project—Specify fields to be placed in the output document (projected).
$match—Select documents to be processed, similar to find().
$limit—Limit the number of documents to be passed to the next step.
 $skip—Skip a specified number of documents.
$unwind—Expand an array, generating one output document for each array
entry.
 $group—Group documents by a specified key.
$sort—Sort documents.
$geoNear—Select documents near a geospatial location
$out—Write the results of the pipeline to a collection (new in v2.6).
$redact—Control access to certain data (new in v2.6).
6/14/2025 6

This code example defines an aggregation framework pipeline that consists of a match, a group, and
then a sort:
db.products.aggregate([ {$match: ...}, {$group: ...}, {$sort: ...} ] )
This series of operations is illustrated in figure 6.2.
6/14/2025 7

If you’re familiar with the SQL GROUP BY clause, you know that it’s used to provide summary
information similar to the summaries outlined here.
Table 6.1 provides a detailed comparison of SQL commands to the aggregation framework operators.
6/14/2025 8

E-commerce aggregation example
In this section you’ll produce a few example queries for your e-commerce database, illustrating how
to answer a few of the many questions you may want to answer from your data using aggregation.
Before we continue, let’s revisit the e-commerce data model.
6/14/2025 9

Products, categories, and reviews
Now let’s look at a simple example of how the aggregation framework can be used to summarize information
about a product.
Chapter 5 showed an example of counting the number of reviews for a given product using this query:
product = db.products.findOne({'slug': 'wheelbarrow-9092'})
reviews_count = db.reviews.count({'product_id': product['_id']})
Let’s see how to do this using the aggregation framework. First, we’ll look at a query that will calculate the total
number of reviews for all products:
6/14/2025 11

Cont.
Next, add one more operator to your pipeline so that you select only the one product you want to get a count
for:
6/14/2025 12

Cont.
CALCULATING THE AVERAGE REVIEW
To calculate the average review for a product, you use the same pipeline as in the previous example and add
one more field:
6/14/2025 13

Cont.
COUNTING REVIEWS BY RATING
Next let’s extend the product summary further and show a breakdown of review counts for each rating.
This is probably something you’ve seen before when shopping online and is illustrated in figure 6.4.
6/14/2025 14

Cont.
This aggregation call would produce an array similar to this:
[ { "_id" : 5, "count" : 5 },
{ "_id" : 4, "count" : 2 },
{ "_id" : 3, "count" : 1 } ]
6/14/2025 15

Cont.
JOINING COLLECTIONS
Next, suppose you want to examine the contents of your database and count the number of products
for each main category.
 Recall that a product has only one main category.
The aggregation command looks like this:
6/14/2025 16

Cont.
Although MongoDB doesn’t allow automatic joins, starting with MongoDB 2.6, there are a couple of
options you can use to provide the equivalent of a SQL join.
One option is to use the forEach function to process the cursor returned from the aggregation
command and add the name using a pseudo-join. Here’s an example:
6/14/2025 17

Cont.
$OUT AND $PROJECT
In a moment you’ll see a much faster option for doing joins using the $unwind operator, but first you
should understand two other operators: $out and $project.
In the previous example, you saved the results of your aggregation pipeline into a collection
named mainCategorySummary using program code to process each output document.
You then saved the document using the following:
db.mainCategorySummary.insert(doc);
6/14/2025 18

Cont.
With the $out operator, you can automatically save the output from a pipeline into a collection.
The $out operator will create the collection if it doesn’t exist, or it’ll replace the collection completely
if it does exist.
6/14/2025 19

Cont.
The following is an example of a $project operator that limits the output documents to just the list of
category IDs used for each product:
6/14/2025 20

Cont.
FASTER JOINS WITH $UNWIND
Next we’ll look at another powerful feature of the aggregation framework, the $unwind operation.
 This operator allows you to expand an array, generating one output document for every input
document array entry.
6/14/2025 21

User and order
Grouping reviews by users and summarizing sales by month.
The example grouping reviews by user showed how many reviews each reviewer had and how many
helpful votes each reviewer had on average.
6/14/2025 22

Cont.
SUMMARIZING SALES BY YEAR AND MONTH
The following is an example that summarizes orders by month and year for orders
beginning in 2010.
6/14/2025 23

Cont.
FINDING BEST MANHATTAN CUSTOMERS
In order to find the highest spenders in Upper Manhattan.
This pipeline is summarized in figure 6.5. Notice that the $match is the first step in the pipeline,
greatly reducing the number of documents your pipeline has to process.
6/14/2025 24

Cont.
The query includes these steps:
■ $match—Find orders shipped to Upper Manhattan.
■ $group—Sum the order amounts for each customer.
■ $match—Select those customers with order totals greater than $100.
■ $sort—Sort the result by descending customer order total.
Let’s develop this pipeline using an approach that may make it easy to develop and test our pipelines
in general.
First we’ll define the parameters for each of the steps:
6/14/2025 25

Cont.
These commands define the parameters you’ll be passing to each of the steps of the aggregation
pipeline.
Given these definitions, the entire pipeline call would appear as shown here:
6/14/2025 26

Cont.
For example, let’s run just the part of the pipeline that summed all customers:
Let’s say you decide to keep the count of the number of orders. To do so, modify the sumByuserId
value:
6/14/2025 27

Cont.
Once you’re satisfied with the result, you can add the $out operator to save the results to a new
collection and thus make the results easily accessible by various applications:
6/14/2025 28

Aggregation pipeline operators
The aggregation framework supports 10 operators:
■ $project—Specify document fields to be processed.
■ $group—Group documents by a specified key.
■ $match—Select documents to be processed, similar to find(...).
■ $limit—Limit the number of documents passed to the next step.
■ $skip—Skip a specified number of documents and don’t pass them to the next step.
■ $unwind—Expand an array, generating one output document for each array entry.
■ $sort—Sort documents.
■ $geoNear—Select documents near a geospatial location.
■ $out—Write the results of the pipeline to a collection (new in v2.6).
■ $redact—Control access to certain data (new in v2.6).
6/14/2025 29

$project
The $project operator contains all of the functionality available in the query projection option covered
in chapter 5 and more.
 The following is a query based on the example in section 5.1.2 for reading the user’s first and last
name:
6/14/2025 30

$group
The $group operator is the main operator used by most aggregation pipelines.
This is the operator that handles the aggregation of data from multiple documents, providing you
with summary statistics using functions such as min, max, and average.
For those familiar with SQL, the $group function is equivalent to the SQL GROUP BY clause.
You tell the $group operator how to group documents by defining the _id field.
The $group operator then groups the input documents by the specified _id field, providing
aggregated information for each group of documents.
6/14/2025 31

The following example was shown in 6.2.2, where you summarized sales by month and year:
6/14/2025 32

The remaining fields in the $group output documents are limited to being defined using the $group
functions shown in table 6.2.
6/14/2025 33

The following example creates a list of customers, each with an array of products ordered by that
customer.
The array of products is created using the $push function:
6/14/2025 34

$match, $sort, $skip, $limit
Here’s an example based on the paging query shown in section 5.1.1:
6/14/2025 35

The identical query in the aggregation framework would look like this:
6/14/2025 36

$unwind
This operator expands an array by generating one output document for every entry in an array.
 The fields from the main document, as well as the fields from each array entry,are put into the
output document.
This example shows the categories for a product before and after a $unwind:
6/14/2025 37

$out
In section 6.2.2 you created a pipeline to find the best Manhattan customers.
We’ll use that example again here, but this time the final output of the pipeline is saved in the
collection targetedCustomers using the $out operator.
The $out operator must be the last operator in your pipeline:
6/14/2025 38

Reshaping documents
The Mongo DB aggregation pipeline contains a number of functions you can use to reshape a
document and thus produce an output document that contains fields not in the original input
document.
 You’ll typically use these functions with the $project operator, but you can also use them when
defining the _id for the $group operator.
if you wanted to create a subobject called name with two fields, first and last, you could use this code:
6/14/2025 39

Reshaping documents
String functions
The string functions shown in table 6.3 allow you to manipulate strings.
6/14/2025 40

Reshaping documents
This example uses three functions, $concat, $substr, and $toUpper:
6/14/2025 41

Cont.
Arithmetic functions
Arithmetic functions include the standard list of arithmetic operators shown in table 6.4.
6/14/2025 42

Cont.
Date functions
The date functions shown in table 6.5 create a new field by extracting part of an existing date time field, or by
calculating another aspect of a date such as day of year, day of month, or day of week.
6/14/2025 43

Cont.
Logical functions
The logical functions, shown in table 6.6, should look familiar. Most are similar to the find query operators
summarized in chapter 5, section 5.2.
6/14/2025 44

Cont.
6/14/2025 46
Set Operators
Set operators, summarized in table 6.7, allow you to compare the contents of two arrays.
 With set operators, you can compare two arrays to see if they’re exactly the same, what elements they
have in common, or what elements are in one but not the other.

Cont.
6/14/2025 47
Miscellaneous functions
The last group, miscellaneous functions, are summarized in table 6.8.

Understanding aggregation pipeline performance
6/14/2025 48
Here are some key considerations that can have a major impact on the performance of your
aggregation pipeline:
Try to reduce the number and size of documents as early as possible in your pipeline.
■ Indexes can only be used by $match and $sort operations and can greatly speed up these
operations.
■ You can’t use an index after your pipeline uses an operator other than $match or $sort.
■ If you use sharding (a common practice for extremely large collections), the $match and
$project operators will be run on individual shards. Once you use any other operator, the
remaining pipeline will be run on the primary shard.
There are still cases, especially when using the aggregation framework, where you’re
going to have to crunch through huge amounts of data, and indexing may not be an option.
 An example of this was when you calculated sales by year and month.

6/14/2025 49
Aggregation pipeline options
Until now, we’ve only shown the aggregate() function when it’s passed an array of pipeline
operations. Starting with Mongo DB v2.6, there’s a second parameter you can pass to the
aggregate() function that you can use to specify options for the aggregation call.
The options available include the following:
■ explain()—Runs the pipeline and returns only pipeline process details
■ allowDiskUse—Uses disk for intermediate results
■ cursor—Specifies initial batch size
The options are passed using this format db.collection.aggregate(pipeline,additionalOptions)
The format of the additionalOptions parameter is as follows:
{explain:true, allowDiskUse:true, cursor: {batchSize: n} }

6/14/2025 50
The aggregation framework’s explain( ) function
The Mongo DB explain() function, similar to the EXPLAIN function you might have
seen in SQL , describes query paths and allows developers to diagnose slow operations
by determining indexes that a query has used.
The explain() function for the aggregation framework is a bit different from the
explain() used in the find() query function but it provides similar capabilities.
As you might expect, for an aggregation pipeline you’ll receive explain output for each
operation in the pipeline, because each step in the pipeline is almost a call unto itself.

6/14/2025 51
allowDiskUse option
Eventually, if you begin working with large enough collections, you’ll see an error sim-
ilar to this:

6/14/2025 52
allowDiskUse option
Eventually, if you begin working with large enough collections, you’ll see an error sim-
ilar to this:
Even more frustrating, this error will probably happen after a long wait, during which
your aggregation pipeline has been processing millions of documents, only to fail.
What’s happening in this case is that the pipeline has intermediate results that exceed
the 100 MB of RAM limit allowed by MongoDB for pipeline stages.

6/14/2025 53
The fix is simple and is even specified in the error message: Pass allowDiskUse:true to opt in.
Let’s see an example with your summary of sales by month, a pipeline that would
need this option because your site will have huge sales volumes:

6/14/2025 54
Aggregation cursor option
Before MongoDB v2.6, the result of your pipeline was a single document with a limit of
16 MB. Starting with v2.6, the default is to return a cursor if you’re accessing Mongo DB
via the Mongo shell. But if you’re running the pipeline from a program, to avoid “breaking”
existing programs the default is unchanged and still returns a single docu-
ment limited to 16 MB.

6/14/2025 55
The cursor returned by the aggregation pipeline supports the following calls:
■ cursor.hasNext()—Determine whether there’s a next document in the results.
■ cursor.next()—Return the next document in the results.
■ cursor.toArray()—Return the entire result as an array.
■ cursor.forEach()—Execute a function for each row in the results.
■ cursor.map()—Execute a function for each row in the results and return an array of function
return values.
■ cursor.itcount()—Return a count of items (for testing only).
■ cursor.pretty()—Display an array of formatted results.
Now let’s wrap up by looking at alternatives to the pipeline for performing aggregations.

Other aggregation capabilities
6/14/2025 56
Although the aggregation pipeline is now considered the preferred method for aggre-
gating data in MongoDB, a few alternatives are available. Some are much simpler, such
as the .count() function.
 Another, more complex alternative is the older MongoDB map-reduce function.
Let’s start with the simpler alternatives first.
count( ) and .distinct( )
product = db.products.findOne({'slug': 'wheelbarrow-9092'})
reviews_count = db.reviews.count({'product_id': product['_id']})
db.orders.distinct('shipping_address.zip')

6/14/2025 57
map-reduce
map-reduce was Mongo DB ’s first attempt at providing a flexible aggregation capability.
With map-reduce, you have the ability to use JavaScript in defining your entire
process.
 This provides a great deal of flexibility but generally performs much slower
than the aggregation framework.

6/14/2025 58

6/14/2025 59
The corresponding reduce function should make this clearer:

6/14/2025 60
ADDING A QUERY FILTER AND SAVING OUTPUT
The shell’s map-reduce method requires a map and a reduce function as arguments.
But this example adds two more.
The first is a query filter that limits the documents involved in the aggregation to orders made
since the beginning of 2010.
The second argument is the name of the output collection:

6/14/2025 61
The process, as illustrated in figure 6.6, includes these steps:
1 filter will select only certain orders.
2 map then emits a key-value pair, usually one output for each input, but it can emit none or
many as well.
3 reduce is passed a key with an array of values emitted by map, usually one array
for each key, but it may be passed the same key multiple times with different
arrays of values.

13-Aggregations in MongoDB and MYSQL.pdf

More Related Content

Similar to 13-Aggregations in MongoDB and MYSQL.pdf

More from zakeraomari454

Recently uploaded

13-Aggregations in MongoDB and MYSQL.pdf