The Aggregation Framework

Aggregation Framework
Senior Solutions Architect, MongoDB
Rick Houlihan
MongoDB World

Agenda
• What is theAggregation Framework?
• The Aggregation Pipeline
• Usage and Limitations
• Aggregation and Sharding
• Summary

What is the Aggregation
Framework?

Aggregation in Nutshell
• We're storing our data in
MongoDB
• Our applications need ad-hoc
queries
• We must have a way to reshape
data easily
• You can use Aggregation Framework for
this!

• Extremely versatile, powerful
• Overkill for simple aggregation
tasks
• Averages
• Summation
• Grouping
• Reshaping
MapReduce is great, but…
• High level of complexity
• Difficult to program and debug

Aggregation Framework
• Plays nice with sharding
• Executes in native code
– Written in C++
– JSON parameters
• Flexible, functional, and simple
– Operation pipeline
– Computational expressions

What is an Aggregation Pipeline?
• ASeries of Document Transformations
– Executed in stages
– Original input is a collection
– Output as a document, cursor or a collection
• Rich Library of Functions
– Filter, compute, group, and summarize data
– Output of one stage sent to input of next
– Operations executed in sequential order
$match $project $group $sort

Pipeline Operators
• $sort
• Order documents
• $limit / $skip
• Paginate documents
• $redact
• Restrict documents
• $geoNear
• Proximity sort
documents
• $let, $map
• Subexpression variables
• $match
• Filter documents
• $project
• Reshape documents
• $group
• Summarize documents
• $unwind
• Expand documents

{
_id: 375,
title: "The Great Gatsby",
ISBN: "9781857150193",
available: true,
pages: 218,
chapters: 9,
subjects: [
"Long Island",
"New York",
"1920s"
],
language: "English"
}
Our Example Data

$match
• Filter documents
– Uses existing query syntax
– Can facilitate shard exclusion
– No $where (server side Javascript)

Matching Field Values
{
title: "Atlas Shrugged",
pages: 1088,
language: "English"
}
{
pages: 218,
language: "English"
}
{
title: "War and Peace",
pages: 1440,
language: "Russian"
}
{ $match: {
language: "Russian"
}}
{
pages: 1440,
language: "Russian"
}

Matching with Query Operators
{
pages: 1088,
language: "English"
}
{
pages: 218,
language: "English"
}
{
pages: 1440,
language: "Russian"
}
{ $match: {
pages: {$gt:100}
}}
{
pages: 1440,
language: "Russian"
}
{
title: ”Atlas Shrugged",
pages: 1088,
language: “English"
}

$project
• Reshape Documents
– Include, exclude or rename
fields
– Inject computed fields
– Create sub-document fields

Including and Excluding Fields
{
_id: 375,
title: "Great Gatsby",
ISBN: "9781857150193",
available: true,
pages: 218,
subjects: [
"Long Island",
"New York",
"1920s"
],
language: "English"
}
{ $project: {
_id: 0,
title: 1,
language: 1
}}
{
language: "English"
}

Renaming and Computing Fields
{
_id: 375,
ISBN: "9781857150193",
available: true,
pages: 218,
chapters: 9,
subjects: [
"Long Island",
"New York",
"1920s"
],
language: "English"
}
{ $project: {
avgChapterLength: {
$divide: ["$pages",
"$chapters"]
},
lang: "$language"
}}
{
_id: 375,
avgChapterLength: 24.2222,
lang: "English"
}

Creating Sub-Document Fields
{
_id: 375,
ISBN: "9781857150193",
available: true,
pages: 218,
chapters: 9,
subjects: [
"Long Island",
"New York",
"1920s"
],
language: "English"
}
{ $project: {
title: 1,
stats: {
pages: "$pages",
language: "$language",
}
}}
{
_id: 375,
stats: {
pages: 218,
language: "English"
}
}

$group
• Group documents by value
– Field reference, object, constant
– Other output fields are computed
• $max, $min, $avg, $sum
• $addToSet, $push
• $first, $last
– Processes all data in memory by
default

Calculating An Average
{
pages: 218,
language: "English"
}
{ $group: {
_id: "$language",
avgPages: { $avg:
"$pages" }
}}
{
_id: "Russian",
avgPages: 1440
}
{
pages: 1440,
language: "Russian"
}
{
pages: 1088,
language: "English"
}
{
_id: "English",
avgPages: 653
}

Summing Fields and Counting
{
pages: 218,
language: "English"
}
{ $group: {
_id: "$language",
pages: { $sum: "$pages" },
books: { $sum: 1 }
}}
{
_id: "Russian",
pages: 1440,
books: 1
}
{
pages: 1440,
language: "Russian"
}
{
pages: 1088,
language: "English"
}
{
_id: "English",
pages: 1316,
books: 2
}

Collecting Distinct Values
{
pages: 218,
language: "English"
}
{ $group: {
_id: "$language",
titles: { $addToSet: "$title" }
}}
{
_id: "Russian",
titles: [“War and Peace”]
}
{
pages: 1440,
language: "Russian"
}
{
pages: 1088,
language: "English"
}
{
_id: "English",
titles: [
"Atlas Shrugged",
"The Great Gatsby” ]
}

$unwind
• Operate on an array field
– Create documents from array elements
• Array replaced by element value
• Missing/empty fields → no output
• Non-array fields → error
– Pipe to $group to aggregate

Collecting Distinct Values
{
ISBN: "9781857150193",
subjects: [
"Long Island",
"New York",
"1920s"
]
}
{ title: "The Great Gatsby",
ISBN: "9781857150193",
subjects: "Long Island” }
ISBN: "9781857150193",
subjects: "New York” }
ISBN: "9781857150193",
subjects: "1920s” }
{ $unwind: "$subjects" }

$sort, $limit, $skip
• Sort documents by one or more fields
– Same order syntax as cursors
– Waits for earlier pipeline operator to return
– In-memory unless early and indexed
• Limit and skip follow cursor behavior

Sort All the Documents in the
Pipeline
{ title: “Animal Farm” }
{ $sort: {title: 1} }
{ title: “Brave New World” }
{ title: “Great Gatsby” }
{ title: “Grapes of Wrath, The” }
{ title: “Lord of the Flies” }
{ title: “Great Gatsby, The” }
{ title: “Grapes of Wrath” }

Limit Documents Through the
Pipeline
{ $limit: 5 }
{ title: “Fathers and Sons” }
{ title: “Invisible Man” }

Skip Documents in the Pipeline
{ $skip: 3 }

$redact
• Restrict access to Documents
– Use document fields to define privileges
– Apply conditional queries to validate users
• Field LevelAccess Control
– $$DESCEND, $$PRUNE, $$KEEP
– Applies to root and subdocument fields

{
_id: 375,
item: "Sony XBR55X900A 55Inch 4K Ultra High Definition TV",
Manufacturer: "Sony",
security: 0,
quantity: 12,
list: 4999,
pricing: {
security: 1,
sale: 2698,
wholesale: {
security: 2,
amount: 2300 }
}
}
$redact Example Data

Query by Security Level
security =
0
db.catalog.aggregate([
{
$match: {item: /^.*XBR55X900A*/}
},
{
$redact: {
$cond: {
if: { $lte: [ "$security", ?? ] },
then: "$$DESCEND",
else: "$$PRUNE"
}
}
}])
{
"_id" : 375,
"item" : "Sony XBR55X900A 55Inch 4K Ultra High Definition TV",
"Manufacturer" : "Sony”,
"security" : 0,
"quantity" : 12,
"list" : 4999
}
{
"_id" : 375,
"item" : "Sony XBR55X900A 55Inch 4K Ultra High Definition
TV",
"Manufacturer" : "Sony",
"security" : 0,
"quantity" : 12,
"list" : 4999,
"pricing" : {
"security" : 1,
"sale" : 2698,
"wholesale" : {
"security" : 2,
"amount" : 2300
}
}
}
security =
2

$geoNear
• Order/Filter Documents by Location
– Requires a geospatial index
– Output includes physical distance
– Must be first aggregation stage

{
"_id" : 10021,
"city" : “NEW YORK”,
"loc" : [
-73.958805,
40.768476
],
"pop" : 106564,
"state" : ”NY”
}
$geonear Example Data

Query by Proximity
db.catalog.aggregate([
{
$geoNear : {
near: [ -86.000, 33.000 ],
distanceField: "dist",
maxDistance: .050,
spherical: true,
num: 3
}
}])
{
"_id" : "35089",
"city" : "KELLYTON",
"loc" : [ -86.048397, 32.979068 ],
"pop" : 1584,
"state" : "AL",
"dist" : 0.0007971432165364155
},
{
"_id" : "35010",
"city" : "NEW SITE",
"loc" : [ -85.951086, 32.941445 ],
"pop" : 19942,
"state" : "AL",
"dist" : 0.0012479615347306806
},
{
"_id" : "35072",
"city" : "GOODWATER",
"loc" : [ -86.078149, 33.074642 ],
"pop" : 3813,
"state" : "AL",
"dist" : 0.0017333719627032555
}

$let / $map
• Bind variables to subexpressions
– Apply conditional logic
– Define complex calculations
– Operate on array field values

{
"_id" : 1,
”price" : 10,
”tax" : 0.50,
”discount" : true
}
$let Example Data

Subexpression Calculations
db.sales.aggregate( [
{
$project: {
finalPrice: {
$let: {
vars: {
total: { $cond: {
if: '$applyDiscount',
then: { $multiply: [0.9, '$price’] },
else: '$price'
}
}
},
in: { $add: [ "$$total", '$tax'] }
}}}}])
{ "_id" : 1, "finalPrice" : 9.5 }
{ "_id" : 2, "finalPrice" : 10.25 }

{
"_id" : 1,
”price" : 10,
”tax" : 0.50,
”discount" : true,
”units" : [ 1, 0, 3, 4, 0, 0, 10, 12, 6, 5 ]
}
$map Example Data

Subexpressions on Arrays
db.sales.aggregate( [ {
$project: {
finalPrice: {
$map: {
input: "$units",
as: "unit",
in: {
$multiply: [ “$$unit”, {
$cond: {
if: '$applyDiscount', then: {
$add : [
{ $multiply: [ 0.9, '$price'] }, '$tax’ ] },
else: { $add: [ '$price', '$tax’ ] }
} } ] } } } } } ] )
{
"_id" : 1,
"finalPrice" :
[ 9.5, 0, 28.5, 38, 0, 0, 95, 114, 57, 47.5 ]
}
{
"_id" : 2,
"finalPrice" :
[ 51.25, 30.75, 20.5, 51.25, 0, 0, 0, 30.75, 41, 71.75 ]
}

Sharding
Result
mongos
Shard 1
(Primary)
$match,
$project, $group
Shard 2
$match,
$project, $group
Shard 3
excluded
Shard 4
$match,
$project, $group
• Workload split between shards
– Shards execute pipeline up to a point
– Primary shard merges cursorsand
continues processing*
– Use explain to analyze pipeline split
– Early $match may excuse shards
– Potential CPU and memory implications
for primary shard host
* Priortov2.6secondstagepipelineprocessingwasdonebymongos

Usage
• collection.aggregate([…], {<options>})
– Returns a cursor
– Takes an optional document to specify aggregation options
• allowDiskUse, explain
– Use $out to send results to a Collection
• db.runCommand({aggregate:<collection>, pipeline:[…]})
– Returns a document, limited to 16 MB

Collection
db.books.aggregate([
{ $project: { language: 1 }},
{ $group: { _id: "$language", numTitles: { $sum: 1 }}}
])
{ _id: "Russian", numTitles: 1 },
{ _id: "English", numTitles: 2 }

Database Command
db.runCommand({
aggregate: "books",
pipeline: [
{ $project: { language: 1 }},
{ $group: { _id: "$language", numTitles: { $sum: 1
}}}
]
})
{
result : [
{ _id: "Russian", numTitles: 1 },
{ _id: "English", numTitles: 2 }
],
“ok” : 1
}

Limitations
• Pipeline operator memory limits
– Stages limited to 100 MB
– “allowDiskUse” for larger data sets
• Some BSON types unsupported
– Symbol, MinKey, MaxKey, DBRef, Code, and
CodeWScope

Aggregation Use Cases
Ad-hoc reporting
Real-timeAnalytics
Transforming Data

Enabling Developers and DBA’s
• Do more with MongoDB and do it
faster
• Eliminate MapReduce
– Replace pages of JavaScript
– More efficient data processing
• Not just a nice feature
– Enabler for real time big data analytics

The Aggregation Framework

More Related Content

What's hot

Viewers also liked

Similar to The Aggregation Framework

More from MongoDB

Recently uploaded

The Aggregation Framework