Advanced MongoDB
Aggregation
Joe Drumgoole
Director of Developer Advocacy, EMEA
@jdrumgoole
#MDBW16
MongoDB Aggregation Framework
• One of the least understood parts of MongoDB
• Many customers never get to the aggregation framework
• Had limited utility 2.6 (16MB limit)
• The aggregation framework “grew up” in 3.0
• We continue to enhance (3.2 and now 3..4)
#MDBW16
A quick reminder
• A Processing Pipeline
• Design to process large groups of documents in parallel
• Is shard aware
• Can create new data from old
#MDBW16
A Typical Pipeline
Match Project Group Sort
Find Query Select Fields
Combine Fields
Rename fields
Calculate
Group By
Execute accumulators
Rename fields
Sort results
#MDBW16
Description of Data Set
• MOT : Ministry of Transport, then
• VOSA : Vehicle and Operator Services Agency and then
• DVSA : Driver and Vehicle Services Agency
• The test is still called the MOT Test
• It is a test of road worthiness
• Nearly every motorised vehicle must pass an MOT every year
• New cars are exempt for three years
#MDBW16
Introduction to the MOT Public Data Set
#MDBW16
This is a big Collection
> db.test_results.count()
253472477
>
253 million records
#MDBW16
Example Document
{ "_id" : ObjectId("5759ee6e8684975e1098af68"),
"TestID" : 400,
"VehicleID" : "278",
"TestDate" : ISODate("2013-04-23T00:00:00Z"),
"TestClassID" : "4",
"TestType" : "N",
"TestResult" : "P",
"TestMileage" : 99284,
"Postcode" : "E",
"Make" : "AUDI",
"Model" : "A3",
"Colour" : "BLACK",
"FuelType" : "P",
"CylinderCapacity" : 1598,
"FirstUseDate" : ISODate("2003-11-11T00:00:00Z“) }
#MDBW16
We will work with one year : 2013
> db.results_2013.count()
37390457
Still 37.3 million records
#MDBW16
Use $match to filter
cars = { "$match" : { "TestClassID" : { "$eq" : "4" }}}
motorcycles = {
"$match" : {
"$or" : [
{
"TestClassID" : "1"
},
{
"TestClassID" : "2"
}
]
}
}
#MDBW16
Lets make a collection of Cars
removeNulls = { "$match" : { "FirstUseDate" : { "$ne" : "NULL" }}}
carsonly = { "$match" : { "TestClassID" : “4”}}
output = { “$out” : “cars_2013” }
db.results_2013.aggregate( [ removeNulls, carsonly, output ] )
#MDBW16
Now lets create a Summary
For each car of a given make and age:
• The total number of cars
• The average mileage for this collection of cars
• The total number of passes
#MDBW16
We $project to get the data we want
ageinusecs ={ "$subtract" :[ "$TestDate", "$FirstUseDate”]}
ageinyears = { "$divide" :[ ageinusecs,(1000*3600*24*365)]}
floorage = { "$floor" : ageinyears }
ispass = { "$cond" : [{"$eq": ["$TestResult","P"]},1,0]}
#MDBW16
The actual $projection
project =
{ "$project" : { "Make” :1,
"VehicleID" :1,
"TestResult” :1,
"TestDate” :1,
"TestMileage” :1,
"FirstUseDate” :1,
"Age” :floorage,
"pass” :ispass }}
#MDBW16
We Then $group to get the accumulated values
group = { "$group" :
{ "_id" : { "make": "$Make", "age" : "$Age" },
"count" : {"$sum":1} ,
"miles” : {"$avg":"$TestMileage"},
"passes": {"$sum":"$pass” }}}
#MDBW16
Now put it all in pipeline
db.results_2013.aggregate([project,group,out])
out = { "$out" : ”cars_summary" }
The $out operator creates a new collection
#MDBW16
Restrictions
• 16MB result limit if not using cursors or $out
• 100MB in memory limit unless $allowDiskUse is specified
• $geoNear must be the first operator in a pipeline
• $out must be the last operator in the pipeline
#MDBW16
Market Size
$36 Billion
Partners
1,000+
International Offices
15
Global Employees
575+
Downloads Worldwide
15,000,000+
Make a GIANT Impact
www.mongodb.com/careers
MongoDB World 2016 : Advanced Aggregation

MongoDB World 2016 : Advanced Aggregation

  • 1.
    Advanced MongoDB Aggregation Joe Drumgoole Directorof Developer Advocacy, EMEA @jdrumgoole
  • 2.
    #MDBW16 MongoDB Aggregation Framework •One of the least understood parts of MongoDB • Many customers never get to the aggregation framework • Had limited utility 2.6 (16MB limit) • The aggregation framework “grew up” in 3.0 • We continue to enhance (3.2 and now 3..4)
  • 3.
    #MDBW16 A quick reminder •A Processing Pipeline • Design to process large groups of documents in parallel • Is shard aware • Can create new data from old
  • 4.
    #MDBW16 A Typical Pipeline MatchProject Group Sort Find Query Select Fields Combine Fields Rename fields Calculate Group By Execute accumulators Rename fields Sort results
  • 5.
    #MDBW16 Description of DataSet • MOT : Ministry of Transport, then • VOSA : Vehicle and Operator Services Agency and then • DVSA : Driver and Vehicle Services Agency • The test is still called the MOT Test • It is a test of road worthiness • Nearly every motorised vehicle must pass an MOT every year • New cars are exempt for three years
  • 6.
    #MDBW16 Introduction to theMOT Public Data Set
  • 7.
    #MDBW16 This is abig Collection > db.test_results.count() 253472477 > 253 million records
  • 8.
    #MDBW16 Example Document { "_id": ObjectId("5759ee6e8684975e1098af68"), "TestID" : 400, "VehicleID" : "278", "TestDate" : ISODate("2013-04-23T00:00:00Z"), "TestClassID" : "4", "TestType" : "N", "TestResult" : "P", "TestMileage" : 99284, "Postcode" : "E", "Make" : "AUDI", "Model" : "A3", "Colour" : "BLACK", "FuelType" : "P", "CylinderCapacity" : 1598, "FirstUseDate" : ISODate("2003-11-11T00:00:00Z“) }
  • 9.
    #MDBW16 We will workwith one year : 2013 > db.results_2013.count() 37390457 Still 37.3 million records
  • 10.
    #MDBW16 Use $match tofilter cars = { "$match" : { "TestClassID" : { "$eq" : "4" }}} motorcycles = { "$match" : { "$or" : [ { "TestClassID" : "1" }, { "TestClassID" : "2" } ] } }
  • 11.
    #MDBW16 Lets make acollection of Cars removeNulls = { "$match" : { "FirstUseDate" : { "$ne" : "NULL" }}} carsonly = { "$match" : { "TestClassID" : “4”}} output = { “$out” : “cars_2013” } db.results_2013.aggregate( [ removeNulls, carsonly, output ] )
  • 12.
    #MDBW16 Now lets createa Summary For each car of a given make and age: • The total number of cars • The average mileage for this collection of cars • The total number of passes
  • 13.
    #MDBW16 We $project toget the data we want ageinusecs ={ "$subtract" :[ "$TestDate", "$FirstUseDate”]} ageinyears = { "$divide" :[ ageinusecs,(1000*3600*24*365)]} floorage = { "$floor" : ageinyears } ispass = { "$cond" : [{"$eq": ["$TestResult","P"]},1,0]}
  • 14.
    #MDBW16 The actual $projection project= { "$project" : { "Make” :1, "VehicleID" :1, "TestResult” :1, "TestDate” :1, "TestMileage” :1, "FirstUseDate” :1, "Age” :floorage, "pass” :ispass }}
  • 15.
    #MDBW16 We Then $groupto get the accumulated values group = { "$group" : { "_id" : { "make": "$Make", "age" : "$Age" }, "count" : {"$sum":1} , "miles” : {"$avg":"$TestMileage"}, "passes": {"$sum":"$pass” }}}
  • 16.
    #MDBW16 Now put itall in pipeline db.results_2013.aggregate([project,group,out]) out = { "$out" : ”cars_summary" } The $out operator creates a new collection
  • 17.
    #MDBW16 Restrictions • 16MB resultlimit if not using cursors or $out • 100MB in memory limit unless $allowDiskUse is specified • $geoNear must be the first operator in a pipeline • $out must be the last operator in the pipeline
  • 18.
    #MDBW16 Market Size $36 Billion Partners 1,000+ InternationalOffices 15 Global Employees 575+ Downloads Worldwide 15,000,000+ Make a GIANT Impact www.mongodb.com/careers

Editor's Notes

  • #7 34 GB uncompressed
  • #11 Early matching can use indexes to reduce input set efficiently. $Match usings same syntax as the find() function.