Aggregation Framework
Senior Solutions Architect, MongoDB
Norberto Leite
#mongodbdays @nleite #aggfwk
Agenda
• What is theAggregation Framework?
• The Aggregation Pipeline
• Usage and Limitations
• Aggregation and Sharding
• Summary
What is the Aggregation
Framework?
Aggregation Framework
Aggregation in Nutshell
• We're storing our data in MongoDB
• Our applications need to run ad-hoc queries for
grouping, summarizations, reporting, etc.
• We must have a way to reshape data easily to
support these access patterns
• You can useAggregation Framework for this!
• Extremely versatile, powerful
• Overkill for simple aggregation
tasks
• Averages
• Summation
• Grouping
• Reshaping
MapReduce is great, but…
• High level of complexity
• Difficult to program and debug
Aggregation Framework
• Plays nice with sharding
• Executes in native code
– Written in C++
– JSON parameters
• Flexible, functional, and simple
– Operation pipeline
– Computational expressions
Aggregation Pipeline
Pipeline
What is an Aggregation Pipeline?
• ASeries of Document Transformations
– Executed in stages
– Original input is a collection
– Output as a cursor or a collection
• Rich Library of Functions
– Filter, compute, group, and summarize data
– Output of one stage sent to input of next
– Operations executed in sequential order
$match $project $group $sort
Pipeline Operators
• $sort
• Order documents
• $limit / $skip
• Paginate documents
• $redact
• Restrict documents
• $geoNear
• Proximity sort
documents
• $let, $map
• Define variables
• $match
• Filter documents
• $project
• Reshape documents
• $group
• Summarize documents
• $unwind
• Expand documents
{
"_id" : ObjectId("54523d2d25784427c6fabce1"),
"From" : "norberto@mongodb.com",
"To" : "mongodb-user@googlegroups.com",
"Date" : ISODate("2012-08-15T22:32:34Z"),
"body" : {
"text/plain" : ”Hello Munich, nice to see yalll!"
},
"Subject" : ”Live From MongoDB World"
}
Our Example Data
$match
• Filter documents
– Uses existing query syntax
– No $where (server side Javascript)
Matching Field Values
{
subject: "Hello There",
words: 218,
from: "norberto@mongodb.com"
}
{ $match: {
from: "hipster@somemail.com"
}}
{
subject: "I love Hofbrauhaus",
words: 90,
from: "norberto@mongodb.com"
}
{
subject: "MongoDB Rules!",
words: 100,
from: "hipster@somemail.com"
}
{
subject: "MongoDB Rules!",
words: 100,
from: "hipster@somemail.com"
}
Matching with Query Operators
{
subject: "Hello There",
words: 218,
from: "norberto@mongodb.com"
}
{ $match: {
words: {$gt: 100}
}}
{
subject: "I love Hofbrauhaus",
words: 90,
from: "norberto@mongodb.com"
}
{
subject: "MongoDB Rules!",
words: 100,
from: "hipster@somemail.com"
}
{
subject: "MongoDB Rules!",
words: 100,
from: "hipster@somemail.com"
}
{
subject: "Hello There",
words: 218,
from: "norberto@mongodb.com"
}
$project
• Reshape Documents
– Include, exclude or rename
fields
– Inject computed fields
– Create sub-document fields
Including and Excluding Fields
{
_id: 12345,
subject: "Hello There",
words: 218,
from:"norberto@mongodb.com"
to: [ "marc@mongodb.com",
"sam@mongodb.com" ],
account: "mongodb mail",
date: ISODate("2012-08-05"),
replies: 3,
folder: "Inbox",
...
}
{ $project: {
_id: 0,
subject: 1,
from: 1
}}
{
subject: "Hello There",
from:"norberto@mongodb.com"
}
Including and Excluding Fields
{
_id: 12345,
subject: "Hello There",
words: 218,
from:"norberto@mongodb.com"
to: [ "marc@mongodb.com",
"sam@mongodb.com" ],
account: "mongodb mail",
date: ISODate("2012-08-05"),
replies: 3,
folder: "Inbox",
...
}
{ $project: {
_id: 0,
subject: 1,
from: 1
}}
{
subject: "Hello There",
from:"norberto@mongodb.com"
}
Renaming and Computing Fields
{ $project: {
spamIndex: {
$mul: ["$words",
"$replies"]
},
user: "$from"
}}
{
_id: 12345,
spamIndex: 72.6666 ,
user: "norberto@mongodb.com"
}
{
_id: 12345,
subject: "Hello There",
words: 218,
from:"norberto@mongodb.com"
to: [ "marc@mongodb.com",
"sam@mongodb.com" ],
account: "mongodb mail",
date: ISODate("2012-08-05"),
replies: 3,
folder: "Inbox",
...
}
Creating Sub-Document Fields
{ $project: {
subject: 1,
stats: {
replies: "$replies",
from: "$from",
date: "$date"
}}}
{
_id: 375,
subject: "Hello There",
stats: {
replies: 3,
from: "norberto@mongodb.com",
date: ISODate("2012-08-05")
}}
{
_id: 12345,
subject: "Hello There",
words: 218,
from:"norberto@mongodb.com"
to: [ "marc@mongodb.com",
"sam@mongodb.com" ],
account: "mongodb mail",
date: ISODate("2012-08-05"),
replies: 3,
folder: "Inbox",
...
}
$group
• Group documents by value
– Field reference, object, constant
– Other output fields are computed
• $max, $min, $avg, $sum
• $addToSet, $push
• $first, $last
– Processes all data in memory by
default
Calculating An Average
{ $group: {
_id: "$from",
avgWords: { $avg:
"$words" }
}}
{
_id: "norberto@mongodb.com",
avgPages: 154
}
{
_id: "hipster@somemail.com",
avgPages: 100
}
{
subject: "Hello There",
words: 218,
from: "norberto@mongodb.com"
}
{
subject: "I love Hofbrauhaus",
words: 90,
from: "norberto@mongodb.com"
}
{
subject: "MongoDB Rules!",
words: 100,
from: "hipster@somemail.com"
}
Summing Fields and Counting
{ $group: {
_id: "$from",
words: { $sum: "$words" },
mails: { $sum: 1 }
}}
{
_id: "norberto@mongodb.com",
words: 308,
mails: 2
}
{
_id: "hipster@somemail.com",
words: 100,
mails: 1
}
{
subject: "Hello There",
words: 218,
from: "norberto@mongodb.com"
}
{
subject: "I love Hofbrauhaus",
words: 90,
from: "norberto@mongodb.com"
}
{
subject: "MongoDB Rules!",
words: 100,
from: "hipster@somemail.com"
}
$unwind
• Operate on an array field
– Create documents from array elements
• Array replaced by element value
• Missing/empty fields → no output
• Non-array fields → error
– Pipe to $group to aggregate
Collecting Distinct Values
{ subject: "2.8 will be great!",
to: "marc@mongodb.com",
account : "mongodb mail” }
{ $unwind: "$to" }
{
_id: 2222,
subject: "2.8 will be great!",
to: [ "marc@mongodb.com",
"eliot@mongodb.com",
"asya@mongodb.com",
],
account: "mongodb mail"
}
{ subject: "2.8 will be great!",
to: "eliot@mongodb.com",
account : "mongodb mail” }
{ subject: "2.8 will be great!",
to: "asya@mongodb.com",
account : "mongodb mail” }
$sort, $limit, $skip
• Sort documents by one or more fields
– Same order syntax as cursors
– Waits for earlier pipeline operator to return
– In-memory unless early and indexed
• Limit and skip follow cursor behavior
$redact
• Restrict access to Documents
– Use document fields to define privileges
– Apply conditional queries to validate users
• Field LevelAccess Control
– $$DESCEND, $$PRUNE, $$KEEP
– Applies to root and subdocument fields
{
_id: 375,
item: "Sony XBR55X900A 55Inch 4K Ultra High Definition TV",
Manufacturer: "Sony",
security: 0,
quantity: 12,
list: 4999,
pricing: {
security: 1,
sale: 2698,
wholesale: {
security: 2,
amount: 2300 }
}
}
$redact Example Data
Query by Security Level
security =
0
db.catalog.aggregate([
{
$match: {item: /^.*XBR55X900A*/}
},
{
$redact: {
$cond: {
if: { $lte: [ "$security", ?? ] },
then: "$$DESCEND",
else: "$$PRUNE"
}
}
}])
{
"_id" : 375,
"item" : "Sony XBR55X900A 55Inch 4K Ultra High Definition TV",
"Manufacturer" : "Sony”,
"security" : 0,
"quantity" : 12,
"list" : 4999
}
{
"_id" : 375,
"item" : "Sony XBR55X900A 55Inch 4K Ultra High Definition
TV",
"Manufacturer" : "Sony",
"security" : 0,
"quantity" : 12,
"list" : 4999,
"pricing" : {
"security" : 1,
"sale" : 2698,
"wholesale" : {
"security" : 2,
"amount" : 2300
}
}
}
security =
2
$geoNear
• Order/Filter Documents by Location
– Requires a geospatial index
– Output includes physical distance
– Must be first aggregation stage
{
"_id" : 35089,
"city" : “Sony”,
"loc" : [
-86.048397,
32.979068
],
"pop" : 1584,
"state" : "AL”
}
$geonear Example Data
Query by Proximity
db.catalog.aggregate([
{
$geoNear : {
near: [ -86.000, 33.000 ],
distanceField: "dist",
maxDistance: .050,
spherical: true,
num: 3
}
}])
{
"_id" : "35089",
"city" : "KELLYTON",
"loc" : [ -86.048397, 32.979068 ],
"pop" : 1584,
"state" : "AL",
"dist" : 0.0007971432165364155
},
{
"_id" : "35010",
"city" : "NEW SITE",
"loc" : [ -85.951086, 32.941445 ],
"pop" : 19942,
"state" : "AL",
"dist" : 0.0012479615347306806
},
{
"_id" : "35072",
"city" : "GOODWATER",
"loc" : [ -86.078149, 33.074642 ],
"pop" : 3813,
"state" : "AL",
"dist" : 0.0017333719627032555
}
Usage and Limitations
Usage
• collection.aggregate([…], {<options>})
– Returns a cursor
– Takes an optional document to specify aggregation options
• allowDiskUse, explain
– Use $out to send results to a Collection
• db.runCommand({aggregate:<collection>, pipeline:[…]})
– Returns a document, limited to 16 MB
Collection
db.books.aggregate([
{ $project: { language: 1 }},
{ $group: { _id: "$language", numTitles: { $sum: 1 }}}
])
{ _id: "Russian", numTitles: 1 },
{ _id: "English", numTitles: 2 }
Database Command
db.runCommand({
aggregate: "books",
pipeline: [
{ $project: { language: 1 }},
{ $group: { _id: "$language", numTitles: { $sum: 1
}}}
]
})
{
result : [
{ _id: "Russian", numTitles: 1 },
{ _id: "English", numTitles: 2 }
],
“ok” : 1
}
Limitations
• Pipeline operator memory limits
– Stages limited to 100 MB
– Use “allowDiskUse” option to use disk for larger data sets
• Some BSON types unsupported
– Symbol, MinKey, MaxKey, DBRef, Code, and
CodeWScope
Aggregation and Sharding
Sharding
Result
mongos
Shard 1
(Primary)
$match,
$project, $group
Shard 2
$match,
$project, $group
Shard 3
excluded
Shard 4
$match,
$project, $group
• Workload split between shards
– Shards execute pipeline up to a point
– Primary shard merges cursors and
continues processing*
– Use explain to analyze pipeline split
– Early $match may excuse shards
– Potential CPU and memory implications
for primary shard host
* Priortov2.6secondstagepipelineprocessingwasdonebymongos
Summary
Framework Use Cases
• Basic aggregation queries
• Ad-hoc reporting
• Real-time analytics
• Visualizing and reshaping data
Extending the Framework
• Adding new pipeline operators, expressions
• $out and $tee for output control
– https://jira.mongodb.org/browse/SERVER-3253
Future Enhancements
• Automatically move $match earlier if possible
• Pipeline explain facility
• Memory usage improvements
– Grouping input sorted by _id
– Sorting with limited output
Enabling Developers
• Doing more within MongoDB, faster
• Refactoring MapReduce and groupings
– Replace pages of JavaScript
– Longer aggregation pipelines
• Quick aggregations from the shell
Obrigado!
SA | Eng – norberto@mongodb.com
Norberto Leite
#mongodbdays #aggfwk #devs @mongodb

The Aggregation Framework

  • 1.
    Aggregation Framework Senior SolutionsArchitect, MongoDB Norberto Leite #mongodbdays @nleite #aggfwk
  • 2.
    Agenda • What istheAggregation Framework? • The Aggregation Pipeline • Usage and Limitations • Aggregation and Sharding • Summary
  • 3.
    What is theAggregation Framework?
  • 4.
  • 5.
    Aggregation in Nutshell •We're storing our data in MongoDB • Our applications need to run ad-hoc queries for grouping, summarizations, reporting, etc. • We must have a way to reshape data easily to support these access patterns • You can useAggregation Framework for this!
  • 6.
    • Extremely versatile,powerful • Overkill for simple aggregation tasks • Averages • Summation • Grouping • Reshaping MapReduce is great, but… • High level of complexity • Difficult to program and debug
  • 7.
    Aggregation Framework • Playsnice with sharding • Executes in native code – Written in C++ – JSON parameters • Flexible, functional, and simple – Operation pipeline – Computational expressions
  • 8.
  • 9.
  • 10.
    What is anAggregation Pipeline? • ASeries of Document Transformations – Executed in stages – Original input is a collection – Output as a cursor or a collection • Rich Library of Functions – Filter, compute, group, and summarize data – Output of one stage sent to input of next – Operations executed in sequential order $match $project $group $sort
  • 11.
    Pipeline Operators • $sort •Order documents • $limit / $skip • Paginate documents • $redact • Restrict documents • $geoNear • Proximity sort documents • $let, $map • Define variables • $match • Filter documents • $project • Reshape documents • $group • Summarize documents • $unwind • Expand documents
  • 12.
    { "_id" : ObjectId("54523d2d25784427c6fabce1"), "From": "[email protected]", "To" : "[email protected]", "Date" : ISODate("2012-08-15T22:32:34Z"), "body" : { "text/plain" : ”Hello Munich, nice to see yalll!" }, "Subject" : ”Live From MongoDB World" } Our Example Data
  • 13.
    $match • Filter documents –Uses existing query syntax – No $where (server side Javascript)
  • 14.
    Matching Field Values { subject:"Hello There", words: 218, from: "[email protected]" } { $match: { from: "[email protected]" }} { subject: "I love Hofbrauhaus", words: 90, from: "[email protected]" } { subject: "MongoDB Rules!", words: 100, from: "[email protected]" } { subject: "MongoDB Rules!", words: 100, from: "[email protected]" }
  • 15.
    Matching with QueryOperators { subject: "Hello There", words: 218, from: "[email protected]" } { $match: { words: {$gt: 100} }} { subject: "I love Hofbrauhaus", words: 90, from: "[email protected]" } { subject: "MongoDB Rules!", words: 100, from: "[email protected]" } { subject: "MongoDB Rules!", words: 100, from: "[email protected]" } { subject: "Hello There", words: 218, from: "[email protected]" }
  • 16.
    $project • Reshape Documents –Include, exclude or rename fields – Inject computed fields – Create sub-document fields
  • 17.
    Including and ExcludingFields { _id: 12345, subject: "Hello There", words: 218, from:"[email protected]" to: [ "[email protected]", "[email protected]" ], account: "mongodb mail", date: ISODate("2012-08-05"), replies: 3, folder: "Inbox", ... } { $project: { _id: 0, subject: 1, from: 1 }} { subject: "Hello There", from:"[email protected]" }
  • 18.
    Including and ExcludingFields { _id: 12345, subject: "Hello There", words: 218, from:"[email protected]" to: [ "[email protected]", "[email protected]" ], account: "mongodb mail", date: ISODate("2012-08-05"), replies: 3, folder: "Inbox", ... } { $project: { _id: 0, subject: 1, from: 1 }} { subject: "Hello There", from:"[email protected]" }
  • 19.
    Renaming and ComputingFields { $project: { spamIndex: { $mul: ["$words", "$replies"] }, user: "$from" }} { _id: 12345, spamIndex: 72.6666 , user: "[email protected]" } { _id: 12345, subject: "Hello There", words: 218, from:"[email protected]" to: [ "[email protected]", "[email protected]" ], account: "mongodb mail", date: ISODate("2012-08-05"), replies: 3, folder: "Inbox", ... }
  • 20.
    Creating Sub-Document Fields {$project: { subject: 1, stats: { replies: "$replies", from: "$from", date: "$date" }}} { _id: 375, subject: "Hello There", stats: { replies: 3, from: "[email protected]", date: ISODate("2012-08-05") }} { _id: 12345, subject: "Hello There", words: 218, from:"[email protected]" to: [ "[email protected]", "[email protected]" ], account: "mongodb mail", date: ISODate("2012-08-05"), replies: 3, folder: "Inbox", ... }
  • 21.
    $group • Group documentsby value – Field reference, object, constant – Other output fields are computed • $max, $min, $avg, $sum • $addToSet, $push • $first, $last – Processes all data in memory by default
  • 22.
    Calculating An Average {$group: { _id: "$from", avgWords: { $avg: "$words" } }} { _id: "[email protected]", avgPages: 154 } { _id: "[email protected]", avgPages: 100 } { subject: "Hello There", words: 218, from: "[email protected]" } { subject: "I love Hofbrauhaus", words: 90, from: "[email protected]" } { subject: "MongoDB Rules!", words: 100, from: "[email protected]" }
  • 23.
    Summing Fields andCounting { $group: { _id: "$from", words: { $sum: "$words" }, mails: { $sum: 1 } }} { _id: "[email protected]", words: 308, mails: 2 } { _id: "[email protected]", words: 100, mails: 1 } { subject: "Hello There", words: 218, from: "[email protected]" } { subject: "I love Hofbrauhaus", words: 90, from: "[email protected]" } { subject: "MongoDB Rules!", words: 100, from: "[email protected]" }
  • 24.
    $unwind • Operate onan array field – Create documents from array elements • Array replaced by element value • Missing/empty fields → no output • Non-array fields → error – Pipe to $group to aggregate
  • 25.
    Collecting Distinct Values {subject: "2.8 will be great!", to: "[email protected]", account : "mongodb mail” } { $unwind: "$to" } { _id: 2222, subject: "2.8 will be great!", to: [ "[email protected]", "[email protected]", "[email protected]", ], account: "mongodb mail" } { subject: "2.8 will be great!", to: "[email protected]", account : "mongodb mail” } { subject: "2.8 will be great!", to: "[email protected]", account : "mongodb mail” }
  • 26.
    $sort, $limit, $skip •Sort documents by one or more fields – Same order syntax as cursors – Waits for earlier pipeline operator to return – In-memory unless early and indexed • Limit and skip follow cursor behavior
  • 27.
    $redact • Restrict accessto Documents – Use document fields to define privileges – Apply conditional queries to validate users • Field LevelAccess Control – $$DESCEND, $$PRUNE, $$KEEP – Applies to root and subdocument fields
  • 28.
    { _id: 375, item: "SonyXBR55X900A 55Inch 4K Ultra High Definition TV", Manufacturer: "Sony", security: 0, quantity: 12, list: 4999, pricing: { security: 1, sale: 2698, wholesale: { security: 2, amount: 2300 } } } $redact Example Data
  • 29.
    Query by SecurityLevel security = 0 db.catalog.aggregate([ { $match: {item: /^.*XBR55X900A*/} }, { $redact: { $cond: { if: { $lte: [ "$security", ?? ] }, then: "$$DESCEND", else: "$$PRUNE" } } }]) { "_id" : 375, "item" : "Sony XBR55X900A 55Inch 4K Ultra High Definition TV", "Manufacturer" : "Sony”, "security" : 0, "quantity" : 12, "list" : 4999 } { "_id" : 375, "item" : "Sony XBR55X900A 55Inch 4K Ultra High Definition TV", "Manufacturer" : "Sony", "security" : 0, "quantity" : 12, "list" : 4999, "pricing" : { "security" : 1, "sale" : 2698, "wholesale" : { "security" : 2, "amount" : 2300 } } } security = 2
  • 30.
    $geoNear • Order/Filter Documentsby Location – Requires a geospatial index – Output includes physical distance – Must be first aggregation stage
  • 31.
    { "_id" : 35089, "city": “Sony”, "loc" : [ -86.048397, 32.979068 ], "pop" : 1584, "state" : "AL” } $geonear Example Data
  • 32.
    Query by Proximity db.catalog.aggregate([ { $geoNear: { near: [ -86.000, 33.000 ], distanceField: "dist", maxDistance: .050, spherical: true, num: 3 } }]) { "_id" : "35089", "city" : "KELLYTON", "loc" : [ -86.048397, 32.979068 ], "pop" : 1584, "state" : "AL", "dist" : 0.0007971432165364155 }, { "_id" : "35010", "city" : "NEW SITE", "loc" : [ -85.951086, 32.941445 ], "pop" : 19942, "state" : "AL", "dist" : 0.0012479615347306806 }, { "_id" : "35072", "city" : "GOODWATER", "loc" : [ -86.078149, 33.074642 ], "pop" : 3813, "state" : "AL", "dist" : 0.0017333719627032555 }
  • 33.
  • 34.
    Usage • collection.aggregate([…], {<options>}) –Returns a cursor – Takes an optional document to specify aggregation options • allowDiskUse, explain – Use $out to send results to a Collection • db.runCommand({aggregate:<collection>, pipeline:[…]}) – Returns a document, limited to 16 MB
  • 35.
    Collection db.books.aggregate([ { $project: {language: 1 }}, { $group: { _id: "$language", numTitles: { $sum: 1 }}} ]) { _id: "Russian", numTitles: 1 }, { _id: "English", numTitles: 2 }
  • 36.
    Database Command db.runCommand({ aggregate: "books", pipeline:[ { $project: { language: 1 }}, { $group: { _id: "$language", numTitles: { $sum: 1 }}} ] }) { result : [ { _id: "Russian", numTitles: 1 }, { _id: "English", numTitles: 2 } ], “ok” : 1 }
  • 37.
    Limitations • Pipeline operatormemory limits – Stages limited to 100 MB – Use “allowDiskUse” option to use disk for larger data sets • Some BSON types unsupported – Symbol, MinKey, MaxKey, DBRef, Code, and CodeWScope
  • 38.
  • 39.
    Sharding Result mongos Shard 1 (Primary) $match, $project, $group Shard2 $match, $project, $group Shard 3 excluded Shard 4 $match, $project, $group • Workload split between shards – Shards execute pipeline up to a point – Primary shard merges cursors and continues processing* – Use explain to analyze pipeline split – Early $match may excuse shards – Potential CPU and memory implications for primary shard host * Priortov2.6secondstagepipelineprocessingwasdonebymongos
  • 40.
  • 41.
    Framework Use Cases •Basic aggregation queries • Ad-hoc reporting • Real-time analytics • Visualizing and reshaping data
  • 42.
    Extending the Framework •Adding new pipeline operators, expressions • $out and $tee for output control – https://jira.mongodb.org/browse/SERVER-3253
  • 43.
    Future Enhancements • Automaticallymove $match earlier if possible • Pipeline explain facility • Memory usage improvements – Grouping input sorted by _id – Sorting with limited output
  • 44.
    Enabling Developers • Doingmore within MongoDB, faster • Refactoring MapReduce and groupings – Replace pages of JavaScript – Longer aggregation pipelines • Quick aggregations from the shell
  • 45.
    Obrigado! SA | Eng[email protected] Norberto Leite #mongodbdays #aggfwk #devs @mongodb