Data Processing and Aggregation
Achille Brighton
Consulting Engineer, MongoDB
Big Data
Exponential Data Growth
[Chart: billions of URLs indexed by Google, 2000–2008]
For over a decade, Big Data == Custom Software.
In the past few years, open source software has emerged enabling the rest of us to handle Big Data.
How MongoDB Meets Our Requirements
•  MongoDB is an operational database
•  MongoDB provides high performance for storage and retrieval at large scale
•  MongoDB has a robust query interface permitting intelligent operations
•  MongoDB is not a data processing engine, but provides processing functionality
MongoDB data processing options
Getting Example Data
The “hello world” of
MapReduce is counting words
in a paragraph of text.
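That classic can be sketched in a few lines of plain Python (a toy in-memory stand-in, not MongoDB's MapReduce):

```python
from collections import defaultdict

def map_words(text):
    # map phase: emit a (word, 1) pair for every word
    for word in text.lower().split():
        yield word, 1

def reduce_counts(pairs):
    # reduce phase: sum the emitted values per key
    counts = defaultdict(int)
    for word, n in pairs:
        counts[word] += n
    return dict(counts)

counts = reduce_counts(map_words("the quick brown fox jumps over the lazy dog"))
# counts["the"] == 2
```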
Let’s try something a little more
interesting…
What is the most popular pub name?
Open Street Map Data
#!/usr/bin/env python
# Data Source
# http://www.overpass-api.de/api/xapi?*[amenity=pub][bbox=-10.5,49.78,1.78,59]
import re
import sys

from imposm.parser import OSMParser
import pymongo

# connection and collection names assumed; not shown on the original slide
collection = pymongo.MongoClient().demo.pubs
node_points = {}  # unnamed nodes, kept aside by OSM id

class Handler(object):
    def nodes(self, nodes):
        if not nodes:
            return
        docs = []
        for node in nodes:
            osm_id, doc, (lon, lat) = node
            if "name" not in doc:
                node_points[osm_id] = (lon, lat)
                continue
            # caution: str.lstrip strips a *character set*, not the prefix "The "
            doc["name"] = doc["name"].title().lstrip("The ").replace("And", "&")
            doc["_id"] = osm_id
            doc["location"] = {"type": "Point", "coordinates": [lon, lat]}
            docs.append(doc)
        collection.insert(docs)  # insert_many in modern PyMongo
Example Pub Data
{
    "_id" : 451152,
    "amenity" : "pub",
    "name" : "The Dignity",
    "addr:housenumber" : "363",
    "addr:street" : "Regents Park Road",
    "addr:city" : "London",
    "addr:postcode" : "N3 1DH",
    "toilets" : "yes",
    "toilets:access" : "customers",
    "location" : {
        "type" : "Point",
        "coordinates" : [-0.1945732, 51.6008172]
    }
}
MongoDB MapReduce
[Diagram: documents flow through map → reduce → finalize stages inside MongoDB]
MongoDB MapReduce
Map Function
> var map = function() {
    emit(this.name, 1);
}

Reduce Function
> var reduce = function (key, values) {
    var sum = 0;
    values.forEach( function (val) {sum += val;} );
    return sum;
}
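The pair of functions above can be exercised outside the shell; here is a plain-Python stand-in that mimics how MongoDB groups emitted values by key before handing them to reduce (the sample documents are made up):

```python
from itertools import groupby

docs = [{"name": "The Red Lion"}, {"name": "The Crown"}, {"name": "The Red Lion"}]

emitted = []
def emit(key, value):
    emitted.append((key, value))

# map phase: emit(this.name, 1) for each document
for doc in docs:
    emit(doc["name"], 1)

# reduce phase: group the emitted values by key and sum them
def reduce_fn(key, values):
    return sum(values)

results = {
    key: reduce_fn(key, [v for _, v in group])
    for key, group in groupby(sorted(emitted), key=lambda kv: kv[0])
}
# results == {"The Crown": 1, "The Red Lion": 2}
```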
Results
> db.pub_names.find().sort({value: -1}).limit(10)
{ "_id" : "The Red Lion", "value" : 407 }
{ "_id" : "The Royal Oak", "value" : 328 }
{ "_id" : "The Crown", "value" : 242 }
{ "_id" : "The White Hart", "value" : 214 }
{ "_id" : "The White Horse", "value" : 200 }
{ "_id" : "The New Inn", "value" : 187 }
{ "_id" : "The Plough", "value" : 185 }
{ "_id" : "The Rose & Crown", "value" : 164 }
{ "_id" : "The Wheatsheaf", "value" : 147 }
{ "_id" : "The Swan", "value" : 140 }
Pub Names in the Center of London
> db.pubs.mapReduce(map, reduce, { out: "pub_names",
    query: {
        location: {
            $within: { $centerSphere: [[-0.12, 51.516], 2 / 3959] }
        }
    }
})
{
    "result" : "pub_names",
    "timeMillis" : 116,
    "counts" : {
        "input" : 643,
        "emit" : 643,
        "reduce" : 54,
        "output" : 537
    },
    "ok" : 1
}
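The 2 / 3959 in the query above is the search radius in radians: 2 miles divided by the Earth's radius in miles (~3959). A quick sanity check using the haversine formula (the second coordinate pair here is illustrative):

```python
from math import radians, sin, cos, asin, sqrt

def central_angle(lon1, lat1, lon2, lat2):
    """Great-circle angle between two points, in radians (haversine formula)."""
    lon1, lat1, lon2, lat2 = map(radians, (lon1, lat1, lon2, lat2))
    h = sin((lat2 - lat1) / 2) ** 2 + cos(lat1) * cos(lat2) * sin((lon2 - lon1) / 2) ** 2
    return 2 * asin(sqrt(h))

radius = 2 / 3959.0       # 2 miles as a fraction of Earth's radius -> radians
center = (-0.12, 51.516)  # central London, as in the query
pub = (-0.13, 51.52)      # a point well inside the 2-mile circle
inside = central_angle(*center, *pub) <= radius
# inside is True
```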
Results
> db.pub_names.find().sort({value: -1}).limit(10)
{ "_id" : "All Bar One", "value" : 11 }
{ "_id" : "The Slug & Lettuce", "value" : 7 }
{ "_id" : "The Coach & Horses", "value" : 6 }
{ "_id" : "The Green Man", "value" : 5 }
{ "_id" : "The Kings Arms", "value" : 5 }
{ "_id" : "The Red Lion", "value" : 5 }
{ "_id" : "Corney & Barrow", "value" : 4 }
{ "_id" : "O'Neills", "value" : 4 }
{ "_id" : "Pitcher & Piano", "value" : 4 }
{ "_id" : "The Crown", "value" : 4 }
Double Checking
MongoDB MapReduce
•  Real-time
•  Output directly to document or collection
•  Runs inside MongoDB on local data
−  Adds load to your DB
−  In JavaScript, so debugging can be a challenge
−  Translating in and out of C++
Aggregation Framework
Data Processing in MongoDB
•  Declared in JSON, executes in C++
•  Flexible, functional, and simple
•  Plays nice with sharding
Pipeline
Piping command-line operations:

ps ax | grep mongod | head -1

Piping aggregation operations:

$match | $group | $sort

A stream of documents flows in; a result document comes out.
Pipeline Operators
•  $match
•  $sort
•  $project
•  $limit
•  $group
•  $skip
•  $unwind
•  $geoNear
$match
•  Filter documents
•  Uses existing query syntax
•  If using $geoNear, it has to be first in the pipeline
•  $where is not supported
Matching Field Values
{
    "_id" : 271421,
    "amenity" : "pub",
    "name" : "Sir Walter Tyrrell",
    "location" : {
        "type" : "Point",
        "coordinates" : [ -1.6192422, 50.9131996 ]
    }
}
{
    "_id" : 271466,
    "amenity" : "pub",
    "name" : "The Red Lion",
    "location" : {
        "type" : "Point",
        "coordinates" : [ -1.5494749, 50.7837119 ]
    }
}

{ "$match": {
    "name": "The Red Lion"
}}

{
    "_id" : 271466,
    "amenity" : "pub",
    "name" : "The Red Lion",
    "location" : {
        "type" : "Point",
        "coordinates" : [ -1.5494749, 50.7837119 ]
    }
}
$project
•  Reshape documents
•  Include, exclude or rename fields
•  Inject computed fields
•  Create sub-document fields
Including and Excluding Fields
{
    "_id" : 271466,
    "amenity" : "pub",
    "name" : "The Red Lion",
    "location" : {
        "type" : "Point",
        "coordinates" : [ -1.5494749, 50.7837119 ]
    }
}

{ "$project": {
    "_id": 0,
    "amenity": 1,
    "name": 1
}}

{
    "amenity" : "pub",
    "name" : "The Red Lion"
}
Reformatting Documents
{
    "_id" : 271466,
    "amenity" : "pub",
    "name" : "The Red Lion",
    "location" : {
        "type" : "Point",
        "coordinates" : [ -1.5494749, 50.7837119 ]
    }
}

{ "$project": {
    "_id": 0,
    "name": 1,
    "meta": { "type": "$amenity" }
}}

{
    "name" : "The Red Lion",
    "meta" : {
        "type" : "pub"
    }
}
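The two projections above can be mimicked in plain Python (a toy sketch of the semantics, not MongoDB's implementation): a value of 1 includes a field, 0 excludes it, a "$field" string references another field, and a nested dict builds a sub-document.

```python
def project(doc, spec):
    """Apply a simplified $project spec to one document."""
    out = {}
    for field, rule in spec.items():
        if rule == 0:
            continue  # explicit exclusion
        if rule == 1:
            if field in doc:
                out[field] = doc[field]  # plain inclusion
        elif isinstance(rule, str) and rule.startswith("$"):
            out[field] = doc[rule[1:]]   # field reference
        elif isinstance(rule, dict):
            out[field] = project(doc, rule)  # sub-document
    return out

doc = {"_id": 271466, "amenity": "pub", "name": "The Red Lion"}
# {"$project": {"_id": 0, "name": 1, "meta": {"type": "$amenity"}}}
reshaped = project(doc, {"_id": 0, "name": 1, "meta": {"type": "$amenity"}})
# reshaped == {"name": "The Red Lion", "meta": {"type": "pub"}}
```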
$group
•  Group documents by an ID
–  Field reference, object, constant
•  Other output fields are computed:
   $max, $min, $avg, $sum
   $addToSet, $push, $first, $last
•  Processes all data in memory
Summing Fields
{
    title: "The Great Gatsby",
    pages: 218,
    language: "English"
}
{
    title: "War and Peace",
    pages: 1440,
    language: "Russian"
}
{
    title: "Atlas Shrugged",
    pages: 1088,
    language: "English"
}

{ $group: {
    _id: "$language",
    numTitles: { $sum: 1 },
    sumPages: { $sum: "$pages" }
}}

{
    _id: "Russian",
    numTitles: 1,
    sumPages: 1440
}
{
    _id: "English",
    numTitles: 2,
    sumPages: 1306
}
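The $group stage above behaves like a dictionary of accumulators keyed by the group ID; a plain-Python equivalent of that sum over the three book documents:

```python
from collections import defaultdict

books = [
    {"title": "The Great Gatsby", "pages": 218, "language": "English"},
    {"title": "War and Peace", "pages": 1440, "language": "Russian"},
    {"title": "Atlas Shrugged", "pages": 1088, "language": "English"},
]

# _id: "$language", numTitles: {$sum: 1}, sumPages: {$sum: "$pages"}
groups = defaultdict(lambda: {"numTitles": 0, "sumPages": 0})
for book in books:
    acc = groups[book["language"]]
    acc["numTitles"] += 1
    acc["sumPages"] += book["pages"]

results = [{"_id": lang, **acc} for lang, acc in groups.items()]
# English: numTitles 2, sumPages 1306; Russian: numTitles 1, sumPages 1440
```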
Add To Set
{
    title: "The Great Gatsby",
    pages: 218,
    language: "English"
}
{
    title: "War and Peace",
    pages: 1440,
    language: "Russian"
}
{
    title: "Atlas Shrugged",
    pages: 1088,
    language: "English"
}

{ $group: {
    _id: "$language",
    titles: { $addToSet: "$title" }
}}

{
    _id: "Russian",
    titles: [ "War and Peace" ]
}
{
    _id: "English",
    titles: [
        "Atlas Shrugged",
        "The Great Gatsby"
    ]
}
Expanding Arrays
{ $unwind: "$subjects" }

{
    title: "The Great Gatsby",
    ISBN: "9781857150193",
    subjects: [
        "Long Island",
        "New York",
        "1920s"
    ]
}

{
    title: "The Great Gatsby",
    ISBN: "9781857150193",
    subjects: "Long Island"
}
{
    title: "The Great Gatsby",
    ISBN: "9781857150193",
    subjects: "New York"
}
{
    title: "The Great Gatsby",
    ISBN: "9781857150193",
    subjects: "1920s"
}
Back to the pub!
[Photo credit: http://www.offwestend.com/index.php/theatres/pastshows/71]
Popular Pub Names
> var popular_pub_names = [
    { $match: { location:
        { $within: { $centerSphere:
            [[-0.12, 51.516], 2 / 3959] }}}
    },
    { $group:
        { _id: "$name",
          value: { $sum: 1 } }
    },
    { $sort: { value: -1 } },
    { $limit: 10 }
]
Results
> db.pubs.aggregate(popular_pub_names)
{
"result" : [
{ "_id" : "All Bar One", "value" : 11 }
{ "_id" : "The Slug & Lettuce", "value" : 7 }
{ "_id" : "The Coach & Horses", "value" : 6 }
{ "_id" : "The Green Man", "value" : 5 }
{ "_id" : "The Kings Arms", "value" : 5 }
{ "_id" : "The Red Lion", "value" : 5 }
{ "_id" : "Corney & Barrow", "value" : 4 }
{ "_id" : "O'Neills", "value" : 4 }
{ "_id" : "Pitcher & Piano", "value" : 4 }
{ "_id" : "The Crown", "value" : 4 }
],
"ok" : 1
}
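From Python, the same pipeline is just a list of plain dicts handed to PyMongo's aggregate; a sketch (the demo.pubs names follow the earlier loading script, and on newer servers the operator is spelled $geoWithin rather than $within):

```python
pipeline = [
    {"$match": {"location":
        {"$within": {"$centerSphere": [[-0.12, 51.516], 2 / 3959]}}}},
    {"$group": {"_id": "$name", "value": {"$sum": 1}}},
    {"$sort": {"value": -1}},
    {"$limit": 10},
]

# With a running server this would be:
#   import pymongo
#   results = pymongo.MongoClient().demo.pubs.aggregate(pipeline)
```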
Aggregation Framework Benefits
•  Real-time
•  Simple yet powerful interface
•  Declared in JSON, executes in C++
•  Runs inside MongoDB on local data
−  Adds load to your DB
−  Limited operators
−  Data output is limited
Analyzing MongoDB Data in External Systems
MongoDB with Hadoop
[Diagram: MongoDB as input and output for a Hadoop cluster]
Hadoop MongoDB Connector
•  MongoDB or BSON files as input/output
•  Source data can be filtered with queries
•  Hadoop Streaming support
–  For jobs written in Python, Ruby, Node.js
•  Supports Hadoop tools such as Pig and Hive
Map Pub Names in Python
#!/usr/bin/env python
import sys

from pymongo_hadoop import BSONMapper

def mapper(documents):
    bounds = get_bounds()  # ~2 mile polygon; helper defined elsewhere
    for doc in documents:
        geo = get_geo(doc["location"])  # convert the geo type; helper defined elsewhere
        if not geo:
            continue
        if bounds.intersects(geo):
            yield {'_id': doc['name'], 'count': 1}

BSONMapper(mapper)
print >> sys.stderr, "Done Mapping."
Reduce Pub Names in Python
#!/usr/bin/env python
from pymongo_hadoop import BSONReducer

def reducer(key, values):
    _count = 0
    for v in values:
        _count += v['count']
    return {'_id': key, 'value': _count}

BSONReducer(reducer)
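Because the reducer is ordinary Python, its logic can be sanity-checked without a Hadoop cluster; restated here for Python 3, without the pymongo_hadoop wrapper:

```python
def reducer(key, values):
    # same logic as the streaming reducer: sum the per-document counts
    return {'_id': key, 'value': sum(v['count'] for v in values)}

result = reducer('The Crown', [{'count': 1}, {'count': 1}, {'count': 2}])
# result == {'_id': 'The Crown', 'value': 4}
```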
Execute MapReduce
hadoop jar target/mongo-hadoop-streaming-assembly-1.1.0-rc0.jar \
    -mapper examples/pub/map.py \
    -reducer examples/pub/reduce.py \
    -mongo mongodb://127.0.0.1/demo.pubs \
    -outputURI mongodb://127.0.0.1/demo.pub_names
Popular Pub Names Nearby
> db.pub_names.find().sort({value: -1}).limit(10)
{ "_id" : "All Bar One", "value" : 11 }
{ "_id" : "The Slug & Lettuce", "value" : 7 }
{ "_id" : "The Coach & Horses", "value" : 6 }
{ "_id" : "The Kings Arms", "value" : 5 }
{ "_id" : "Corney & Barrow", "value" : 4 }
{ "_id" : "O'Neills", "value" : 4 }
{ "_id" : "Pitcher & Piano", "value" : 4 }
{ "_id" : "The Crown", "value" : 4 }
{ "_id" : "The George", "value" : 4 }
{ "_id" : "The Green Man", "value" : 4 }
MongoDB with Hadoop
[Diagram: MongoDB as a source feeding a data warehouse]
[Diagram: an ETL pipeline loading processed data back into MongoDB]
Limitations
•  Batch processing
•  Requires synchronization between data store and processor
•  Adds complexity to infrastructure
Advantages
•  Processing decoupled from data store
•  Parallel processing
•  Leverage existing infrastructure
•  Java has a rich set of data processing libraries
–  And other languages if using Hadoop Streaming
Storm
Storm MongoDB connector
•  Spout for MongoDB oplog or capped collections
–  Filtering capabilities
–  Threaded and non-blocking
•  Output to new or existing documents
–  Insert/update bolt
Aggregating MongoDB’s Data Processing Options
Data Processing with MongoDB
•  Process in MongoDB using Map/Reduce
•  Process in MongoDB using the Aggregation Framework
•  Also: storing pre-aggregated data
–  An exercise in schema design
•  Process outside MongoDB using Hadoop and other external tools
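Pre-aggregation means updating counters at write time rather than scanning later. The shape of the idea in plain Python (in MongoDB this maps to an update_one with $inc and upsert=True; the counter layout here is illustrative):

```python
def record_pub(counters, name):
    """Upsert-style increment: create the counter document on first sight."""
    doc = counters.setdefault(name, {"_id": name, "value": 0})
    doc["value"] += 1

counters = {}
for name in ["The Crown", "The Crown", "The Red Lion"]:
    record_pub(counters, name)
# counters["The Crown"]["value"] == 2
```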
External Tools
Questions?
References
•  MapReduce docs
–  http://docs.mongodb.org/manual/core/map-reduce/
•  Aggregation Framework
–  Examples: http://docs.mongodb.org/manual/applications/aggregation
–  SQL comparison: http://docs.mongodb.org/manual/reference/sql-aggregation-comparison/
•  Multi-threaded MapReduce: http://edgystuff.tumblr.com/post/54709368492/how-to-speedup-mongodb-map-reduce-by-20x
Thanks!
Achille Brighton
Consulting Engineer, MongoDB

Webinar: Data Processing and Aggregation Options

  • 1.
    Data Processing andAggregation Achille Brighton Consulting Engineer, MongoDB
  • 2.
  • 3.
    Exponential Data Growth Billionsof URLs indexed by Google 1200 1000 800 600 400 200 0 2000 2002 2004 2006 2008
  • 4.
    For over adecade Big Data == Custom Software
  • 5.
    In the pastfew years Open source software has emerged enabling the rest of us to handle Big Data
  • 6.
    How MongoDB MeetsOur Requirements •  MongoDB is an operational database •  MongoDB provides high performance for storage and retrieval at large scale •  MongoDB has a robust query interface permitting intelligent operations •  MongoDB is not a data processing engine, but provides processing functionality
  • 7.
  • 8.
  • 9.
    The “hello world”of MapReduce is counting words in a paragraph of text. Let’s try something a little more interesting…
  • 10.
    What is themost popular pub name?
  • 11.
    Open Street MapData #!/usr/bin/env python # Data Source # http://www.overpass-api.de/api/xapi?*[amenity=pub][bbox=-10.5,49.78,1.78,59] import re import sys from imposm.parser import OSMParser import pymongo class Handler(object): def nodes(self, nodes): if not nodes: return docs = [] for node in nodes: osm_id, doc, (lon, lat) = node if "name" not in doc: node_points[osm_id] = (lon, lat) continue doc["name"] = doc["name"].title().lstrip("The ").replace("And", "&") doc["_id"] = osm_id doc["location"] = {"type": "Point", "coordinates": [lon, lat]} docs.append(doc) collection.insert(docs)
  • 12.
    Example Pub Data { "_id": 451152, "amenity" : "pub", "name" : "The Dignity", "addr:housenumber" : "363", "addr:street" : "Regents Park Road", "addr:city" : "London", "addr:postcode" : "N3 1DH", "toilets" : "yes", "toilets:access" : "customers", "location" : { "type" : "Point", "coordinates" : [-0.1945732, 51.6008172] } }
  • 13.
  • 14.
  • 15.
    map Map Function MongoDB > varmap = function() { emit(this.name, 1); reduce finalize
  • 16.
    map Reduce Function MongoDB > varreduce = function (key, values) { var sum = 0; values.forEach( function (val) {sum += val;} ); return sum; } reduce finalize
  • 17.
    Results > db.pub_names.find().sort({value: -1}).limit(10) {"_id" : "The Red Lion", "value" : 407 } { "_id" : "The Royal Oak", "value" : 328 } { "_id" : "The Crown", "value" : 242 } { "_id" : "The White Hart", "value" : 214 } { "_id" : "The White Horse", "value" : 200 } { "_id" : "The New Inn", "value" : 187 } { "_id" : "The Plough", "value" : 185 } { "_id" : "The Rose & Crown", "value" : 164 } { "_id" : "The Wheatsheaf", "value" : 147 } { "_id" : "The Swan", "value" : 140 }
  • 19.
    Pub Names inthe Center of London > db.pubs.mapReduce(map, reduce, { out: "pub_names", query: { location: { $within: { $centerSphere: [[-0.12, 51.516], 2 / 3959] } }} }) { "result" : "pub_names", "timeMillis" : 116, "counts" : { "input" : 643, "emit" : 643, "reduce" : 54, "output" : 537 }, "ok" : 1, }
  • 20.
    Results > db.pub_names.find().sort({value: -1}).limit(10) { { { { { { { { { { "_id" "_id" "_id" "_id" "_id" "_id" "_id" "_id" "_id" "_id" : : : : : : : : : : "AllBar One", "value" : 11 } "The Slug & Lettuce", "value" : 7 } "The Coach & Horses", "value" : 6 } "The Green Man", "value" : 5 } "The Kings Arms", "value" : 5 } "The Red Lion", "value" : 5 } "Corney & Barrow", "value" : 4 } "O'Neills", "value" : 4 } "Pitcher & Piano", "value" : 4 } "The Crown", "value" : 4 }
  • 21.
  • 22.
    MongoDB MapReduce •  Real-time • Output directly to document or collection •  Runs inside MongoDB on local data − Adds load to your DB − In Javascript – debugging can be a challenge − Translating in and out of C++
  • 23.
  • 24.
    •  Declared inJSON, executes in C++ Aggregation Framework Data Processing in MongoDB
  • 25.
    •  Declared inJSON, executes in C++ •  Flexible, functional, and simple Aggregation Framework Data Processing in MongoDB
  • 26.
    •  Declared inJSON, executes in C++ •  Flexible, functional, and simple •  Plays nice with sharding Aggregation Framework Data Processing in MongoDB
  • 27.
    Pipeline Piping command lineoperations ps ax | grep mongod | head 1 Data Processing in MongoDB
  • 28.
    Pipeline Piping aggregation operations $match| $group | $sort Stream of documents Result document Data Processing in MongoDB
  • 29.
    Pipeline Operators •  $match • $sort •  $project •  $limit •  $group •  $skip •  $unwind •  $geoNear Data Processing in MongoDB
  • 30.
    $match •  Filter documents • Uses existing query syntax •  If using $geoNear it has to be first in pipeline •  $where is not supported
  • 31.
    Matching Field Values { "_id": 271421, "amenity" : "pub", "name" : "Sir Walter Tyrrell", "location" : { "type" : "Point", "coordinates" : [ -1.6192422, 50.9131996 ] } } { "$match": { "name": "The Red Lion" }} { "_id" : 271466, "amenity" : "pub", "name" : "The Red Lion", "location" : { "type" : "Point", "coordinates" : [ -1.5494749, 50.7837119 ]} { "_id" : 271466, "amenity" : "pub", "name" : "The Red Lion", "location" : { "type" : "Point", "coordinates" : [ -1.5494749, 50.7837119 ] } }
  • 32.
    $project •  Reshape documents • Include, exclude or rename fields •  Inject computed fields •  Create sub-document fields
  • 33.
    Including and ExcludingFields { "_id" : 271466, "amenity" : "pub", "name" : "The Red Lion", "location" : { "type" : "Point", { “$project”: { “_id”: 0, “amenity”: 1, “name”: 1, }} "coordinates" : [ -1.5494749, 50.7837119 ] } } { “amenity” : “pub”, “name” : “The Red Lion” }
  • 34.
    Reformatting Documents { "_id" :271466, "amenity" : "pub", "name" : "The Red Lion", "location" : { "type" : "Point", { “$project”: { “_id”: 0, “name”: 1, “meta”: { “type”: “$amenity”} }} "coordinates" : [ -1.5494749, 50.7837119 ] } } { “name” : “The Red Lion” “meta” : { “type” : “pub” }}
  • 35.
    $group •  Group documentsby an ID •  Field reference, object, constant •  Other output fields are computed $max, $min, $avg, $sum $addToSet, $push $first, $last •  Processes all data in memory
  • 36.
    Summating fields } { $group:{ _id: "$language", numTitles: { $sum: 1 }, sumPages: { $sum: "$pages" } }} { { { title: "The Great Gatsby", pages: 218, language: "English" title: "War and Peace", pages: 1440, language: "Russian” } } { _id: "Russian", numTitles: 1, sumPages: 1440 { title: "Atlas Shrugged", pages: 1088, language: "English" } } _id: "English", numTitles: 2, sumPages: 1306
  • 37.
    Add To Set { title:"The Great Gatsby", pages: 218, language: "English" { $group: { _id: "$language", titles: { $addToSet: "$title" } }} } { { title: "War and Peace", pages: 1440, language: "Russian" } { } { title: "Atlas Shrugged", pages: 1088, language: "English" } } _id: "Russian", titles: [ "War and Peace" ] _id: "English", titles: [ "Atlas Shrugged", "The Great Gatsby" ]
  • 38.
    Expanding Arrays { $unwind:"$subjects" } { title: "The Great Gatsby", ISBN: "9781857150193", subjects: [ "Long Island", "New York", "1920s" ] { } { } } { } title: "The Great Gatsby", ISBN: "9781857150193", subjects: "Long Island" title: "The Great Gatsby", ISBN: "9781857150193", subjects: "New York" title: "The Great Gatsby", ISBN: "9781857150193", subjects: "1920s"
  • 39.
    Back to thepub! •  http://www.offwestend.com/index.php/theatres/pastshows/71
  • 40.
    Popular Pub Names >varpopular_pub_names = [ { $match : location: { $within: { $centerSphere: [[-0.12, 51.516], 2 / 3959]}}} }, { $group : { _id: “$name” value: {$sum: 1} } }, { $sort : {value: -1} }, { $limit : 10 }
  • 41.
    Results > db.pubs.aggregate(popular_pub_names) { "result" :[ { "_id" : "All Bar One", "value" : 11 } { "_id" : "The Slug & Lettuce", "value" : 7 } { "_id" : "The Coach & Horses", "value" : 6 } { "_id" : "The Green Man", "value" : 5 } { "_id" : "The Kings Arms", "value" : 5 } { "_id" : "The Red Lion", "value" : 5 } { "_id" : "Corney & Barrow", "value" : 4 } { "_id" : "O'Neills", "value" : 4 } { "_id" : "Pitcher & Piano", "value" : 4 } { "_id" : "The Crown", "value" : 4 } ], "ok" : 1 }
  • 42.
    Aggregation Framework Benefits • Real-time •  Simple yet powerful interface •  Declared in JSON, executes in C++ •  Runs inside MongoDB on local data − Adds load to your DB − Limited Operators − Data output is limited
  • 43.
    Analyzing MongoDB Datain External Systems
  • 44.
  • 45.
    Hadoop MongoDB Connector • MongoDB or BSON files as input/output •  Source data can be filtered with queries •  Hadoop Streaming support –  For jobs written in Python, Ruby, Node.js •  Supports Hadoop tools such as Pig and Hive
  • 46.
    Map Pub Namesin Python #!/usr/bin/env python from pymongo_hadoop import BSONMapper def mapper(documents): bounds = get_bounds() # ~2 mile polygon for doc in documents: geo = get_geo(doc["location"]) # Convert the geo type if not geo: continue if bounds.intersects(geo): yield {'_id': doc['name'], 'count': 1} BSONMapper(mapper) print >> sys.stderr, "Done Mapping."
  • 47.
    Reduce Pub Namesin Python #!/usr/bin/env python from pymongo_hadoop import BSONReducer def reducer(key, values): _count = 0 for v in values: _count += v['count'] return {'_id': key, 'value': _count} BSONReducer(reducer)
  • 48.
    Execute MapReduce hadoop jartarget/mongo-hadoop-streaming-assembly-1.1.0-rc0.jar -mapper examples/pub/map.py -reducer examples/pub/reduce.py -mongo mongodb://127.0.0.1/demo.pubs -outputURI mongodb://127.0.0.1/demo.pub_names
  • 49.
    Popular Pub NamesNearby > db.pub_names.find().sort({value: -1}).limit(10) { { { { { { { { { { "_id" "_id" "_id" "_id" "_id" "_id" "_id" "_id" "_id" "_id" : : : : : : : : : : "All Bar One", "value" : 11 } "The Slug & Lettuce", "value" : 7 } "The Coach & Horses", "value" : 6 } "The Kings Arms", "value" : 5 } "Corney & Barrow", "value" : 4 } "O'Neills", "value" : 4 } "Pitcher & Piano", "value" : 4 } "The Crown", "value" : 4 } "The George", "value" : 4 } "The Green Man", "value" : 4 }
  • 50.
  • 51.
  • 52.
    Limitations •  Batch processing • Requires synchronization between data store and processor •  Adds complexity to infrastructure
  • 53.
    Advantages •  Processing decoupledfrom data store •  Parallel processing •  Leverage existing infrastructure •  Java has rich set of data processing libraries –  And other languages if using Hadoop Streaming
  • 54.
  • 55.
  • 56.
    Storm MongoDB connector • Spout for MongoDB oplog or capped collections –  Filtering capabilities –  Threaded and non-blocking •  Output to new or existing documents –  Insert/update bolt
  • 57.
  • 58.
    Data Processing withMongoDB •  Process in MongoDB using Map/Reduce •  Process in MongoDB using Aggregation Framework •  Also: Storing pre-aggregated data –  An exercise in schema design •  Process outside MongoDB using Hadoop and other external tools
  • 59.
  • 60.
  • 61.
    References •  Map Reducedocs –  http://docs.mongodb.org/manual/core/map-reduce/ •  Aggregation Framework –  Examples http://docs.mongodb.org/manual/applications/aggregation –  SQL Comparison http://docs.mongodb.org/manual/reference/sql-aggregation-comparison/ •  Multi Threaded Map Reduce: http://edgystuff.tumblr.com/post/54709368492/how-to-speedup-mongodb-map-reduce-by-20x
  • 62.