Aggregation Framework

#mongodbdays
Aggregation Framework
Emily Stolfo
Ruby Engineer/Evangelist, 10gen
@EmStolfo
Tuesday, January 29, 13

Agenda
• State of Aggregation
• Pipeline
• Usage and Limitations
• Optimization
• Sharding
• (Expressions)
• Looking Ahead

State of Aggregation

State of Aggregation
• We're storing our data in MongoDB
• We need to do ad-hoc reporting, grouping, common aggregations, etc.
• What are we using for this?

Data Warehousing

Data Warehousing
• SQL for reporting and analytics
• Infrastructure complications
– Additional maintenance
– Data duplication
– ETL processes
– Real time?

MapReduce

MapReduce
• Extremely versatile, powerful
• Intended for complex data analysis
• Overkill for simple aggregation tasks, such as
– Averages
– Summation
– Grouping

MapReduce in MongoDB
• Implemented with JavaScript
– Single-threaded
– Difficult to debug
• Concurrency
– Appearance of parallelism
– Write locks


Aggregation Framework
• Declared in JSON, executes in C++
• Flexible, functional, and simple
– Operation pipeline
– Computational expressions
• Works well with sharding

Enabling Developers
• Doing more within MongoDB, faster
• Refactoring MapReduce and groupings
– Replace pages of JavaScript
– Longer aggregation pipelines
• Quick aggregations from the shell

Pipeline

Pipeline
• Process a stream of documents
– Original input is a collection
– Final output is a result document
• Series of operators
– Filter or transform data
– Input/output chain
ps ax | grep mongod | head -n 1
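
The Unix pipe analogy can be modelled in a few lines of plain JavaScript. This is only a sketch, assuming each stage is a function from an array of documents to a new array; the names `runPipeline`, `grepMongod`, and `head1` and the sample process list are made up for illustration and say nothing about how the server executes a pipeline.

```javascript
// Illustrative model only: a stage maps an array of documents to a new array.
const runPipeline = (docs, stages) =>
  stages.reduce((current, stage) => stage(current), docs);

// Hypothetical sample input standing in for the output of `ps ax`.
const processes = [
  { pid: 1, cmd: "init" },
  { pid: 412, cmd: "mongod --dbpath /data/db" },
  { pid: 977, cmd: "mongod --port 27018" },
];

// Counterparts of `grep mongod` and `head -n 1`.
const grepMongod = (docs) => docs.filter((d) => d.cmd.includes("mongod"));
const head1 = (docs) => docs.slice(0, 1);

const firstMongod = runPipeline(processes, [grepMongod, head1]);
console.log(firstMongod);
```

Each stage sees only the output of the stage before it, which is why the order of operators in a pipeline matters.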

Pipeline Operators
• $match
• $project
• $group
• $unwind
• $sort
• $limit
• $skip

Example book data
{
  _id: 375,
  title: "The Great Gatsby",
  ISBN: "9781857150193",
  available: true,
  pages: 218,
  chapters: 9,
  subjects: [
    "Long Island",
    "New York",
    "1920s"
  ],
  language: "English"
}

$match
• Filter documents
• Uses existing query syntax
• (No geospatial operations or $where)

Matching Field Values
Input:
{ title: "The Great Gatsby", pages: 218, language: "English" }
{ title: "War and Peace", pages: 1440, language: "Russian" }
{ title: "Atlas Shrugged", pages: 1088, language: "English" }
Pipeline stage:
{ $match: { language: "Russian" }}
Output:
{ title: "War and Peace", pages: 1440, language: "Russian" }

Matching with Query Operators
Input:
{ title: "The Great Gatsby", pages: 218, language: "English" }
{ title: "War and Peace", pages: 1440, language: "Russian" }
{ title: "Atlas Shrugged", pages: 1088, language: "English" }
Pipeline stage:
{ $match: { pages: { $gt: 1000 }}}
Output:
{ title: "War and Peace", pages: 1440, language: "Russian" }
{ title: "Atlas Shrugged", pages: 1088, language: "English" }
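
The effect of the two $match slides can be imitated in plain JavaScript with `Array.prototype.filter`. This is a rough sketch of the semantics only, not of the query engine; the variable names are illustrative.

```javascript
const books = [
  { title: "The Great Gatsby", pages: 218, language: "English" },
  { title: "War and Peace", pages: 1440, language: "Russian" },
  { title: "Atlas Shrugged", pages: 1088, language: "English" },
];

// Stand-in for { $match: { language: "Russian" } }: equality on one field.
const russian = books.filter((doc) => doc.language === "Russian");

// Stand-in for { $match: { pages: { $gt: 1000 } } }: a range condition.
const longBooks = books.filter((doc) => doc.pages > 1000);

console.log(russian.map((doc) => doc.title));
console.log(longBooks.map((doc) => doc.title));
```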

$project
• Reshape documents
• Include, exclude or rename fields
• Inject computed fields
• Create sub-document fields

Including and Excluding Fields
Input:
{
  _id: 375,
  title: "Great Gatsby",
  ISBN: "9781857150193",
  available: true,
  pages: 218,
  subjects: [
    "Long Island",
    "New York",
    "1920s"
  ],
  language: "English"
}
Pipeline stage:
{ $project: {
  _id: 0,
  title: 1,
  language: 1
}}
Output:
{
  title: "Great Gatsby",
  language: "English"
}

Renaming and Computing Fields
Input:
{
  _id: 375,
  title: "Great Gatsby",
  ISBN: "9781857150193",
  available: true,
  pages: 218,
  chapters: 9,
  subjects: [
    "Long Island",
    "New York",
    "1920s"
  ],
  language: "English"
}
Pipeline stage:
{ $project: {
  avgChapterLength: {
    $divide: ["$pages", "$chapters"]
  },
  lang: "$language"
}}
Output:
{
  _id: 375,
  avgChapterLength: 24.2222,
  lang: "English"
}
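
A plain-JavaScript approximation of this $project stage, offered as a sketch: `projectStats` is a hypothetical helper, and the only behaviour taken from the slide is that `_id` is kept by default, `avgChapterLength` is computed with a division, and `language` is renamed to `lang`.

```javascript
const book = {
  _id: 375,
  title: "Great Gatsby",
  pages: 218,
  chapters: 9,
  language: "English",
};

// Keep _id, compute one field from two others, and rename a third.
const projectStats = (doc) => ({
  _id: doc._id,
  avgChapterLength: doc.pages / doc.chapters, // like { $divide: ["$pages", "$chapters"] }
  lang: doc.language,                         // like lang: "$language"
});

const projected = projectStats(book);
console.log(projected);
```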

Creating Sub-Document Fields
Input:
{
  _id: 375,
  title: "Great Gatsby",
  ISBN: "9781857150193",
  available: true,
  pages: 218,
  subjects: [
    "Long Island",
    "New York",
    "1920s"
  ],
  language: "English"
}
Pipeline stage:
{ $project: {
  title: 1,
  stats: {
    pages: "$pages",
    language: "$language"
  }
}}
Output:
{
  _id: 375,
  title: "Great Gatsby",
  stats: {
    pages: 218,
    language: "English"
  }
}

$group
• Group documents by an ID
– Field reference, object, constant
• Other output fields are computed
– $max, $min, $avg, $sum
– $addToSet, $push
– $first, $last
• Processes all data in memory

Calculating an Average
Input:
{ title: "The Great Gatsby", pages: 218, language: "English" }
{ title: "War and Peace", pages: 1440, language: "Russian" }
{ title: "Atlas Shrugged", pages: 1088, language: "English" }
Pipeline stage:
{ $group: {
  _id: "$language",
  avgPages: { $avg: "$pages" }
}}
Output:
{ _id: "Russian", avgPages: 1440 }
{ _id: "English", avgPages: 653 }

Summing Fields and Counting
Input:
{ title: "The Great Gatsby", pages: 218, language: "English" }
{ title: "War and Peace", pages: 1440, language: "Russian" }
{ title: "Atlas Shrugged", pages: 1088, language: "English" }
Pipeline stage:
{ $group: {
  _id: "$language",
  numTitles: { $sum: 1 },
  sumPages: { $sum: "$pages" }
}}
Output:
{ _id: "Russian", numTitles: 1, sumPages: 1440 }
{ _id: "English", numTitles: 2, sumPages: 1306 }
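
The grouping arithmetic on the last two slides can be checked with a small in-memory sketch. `groupByLanguage` is a hypothetical helper; it keeps one accumulator per group in a `Map`, which also hints at why $group has to hold its groups in memory.

```javascript
const books = [
  { title: "The Great Gatsby", pages: 218, language: "English" },
  { title: "War and Peace", pages: 1440, language: "Russian" },
  { title: "Atlas Shrugged", pages: 1088, language: "English" },
];

// Rough stand-in for:
// { $group: { _id: "$language", numTitles: { $sum: 1 },
//             sumPages: { $sum: "$pages" }, avgPages: { $avg: "$pages" } } }
const groupByLanguage = (docs) => {
  const groups = new Map();
  for (const doc of docs) {
    const group =
      groups.get(doc.language) ?? { _id: doc.language, numTitles: 0, sumPages: 0 };
    group.numTitles += 1;        // { $sum: 1 } counts documents
    group.sumPages += doc.pages; // { $sum: "$pages" } adds field values
    groups.set(doc.language, group);
  }
  // The average is derived once every document has been seen.
  return [...groups.values()].map((group) => ({
    ...group,
    avgPages: group.sumPages / group.numTitles,
  }));
};

const byLanguage = groupByLanguage(books);
console.log(byLanguage);
```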

Collecting Distinct Values
Input:
{ title: "The Great Gatsby", pages: 218, language: "English" }
{ title: "War and Peace", pages: 1440, language: "Russian" }
{ title: "Atlas Shrugged", pages: 1088, language: "English" }
Pipeline stage:
{ $group: {
  _id: "$language",
  titles: { $addToSet: "$title" }
}}
Output:
{ _id: "Russian", titles: [ "War and Peace" ] }
{ _id: "English", titles: [ "Atlas Shrugged", "The Great Gatsby" ] }
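
To see what distinguishes $addToSet from $push, here is a sketch that collects titles per language into a JavaScript `Set`. The duplicate "Atlas Shrugged" entry is added purely for illustration, `titlesByLanguage` is a hypothetical helper, and the order of elements in the resulting arrays should not be relied on.

```javascript
const books = [
  { title: "The Great Gatsby", language: "English" },
  { title: "War and Peace", language: "Russian" },
  { title: "Atlas Shrugged", language: "English" },
  { title: "Atlas Shrugged", language: "English" }, // illustrative duplicate
];

// Rough stand-in for { $group: { _id: "$language", titles: { $addToSet: "$title" } } }.
const titlesByLanguage = (docs) => {
  const sets = new Map();
  for (const doc of docs) {
    if (!sets.has(doc.language)) sets.set(doc.language, new Set());
    sets.get(doc.language).add(doc.title); // a Set silently ignores repeats
  }
  return [...sets].map(([language, titles]) => ({ _id: language, titles: [...titles] }));
};

const distinctTitles = titlesByLanguage(books);
console.log(distinctTitles);
```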

$unwind
• Applied to an array field
• Yield new documents for each array element
– Array replaced by element value
– Missing/empty fields → no output
– Non-array fields → error
• Pipe to $group to aggregate array values

Yielding Multiple Documents from One
Input:
{
  ISBN: "9781857150193",
  subjects: [
    "Long Island",
    "New York",
    "1920s"
  ]
}
Pipeline stage:
{ $unwind: "$subjects" }
Output:
{ ISBN: "9781857150193", subjects: "Long Island" }
{ ISBN: "9781857150193", subjects: "New York" }
{ ISBN: "9781857150193", subjects: "1920s" }
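
The $unwind rules listed earlier (one output document per element, nothing for missing or empty fields, an error for non-arrays) fit in a short sketch. `unwind` here is a hypothetical plain-JavaScript helper, and treating `null` like a missing field is an assumption of this sketch.

```javascript
const unwind = (docs, field) =>
  docs.flatMap((doc) => {
    const value = doc[field];
    // Missing field (and, by assumption here, null): no output documents.
    if (value === undefined || value === null) return [];
    // Non-array field: error, as the slide describes.
    if (!Array.isArray(value)) throw new TypeError(`${field} is not an array`);
    // One copy of the document per element; an empty array yields nothing.
    return value.map((element) => ({ ...doc, [field]: element }));
  });

const unwound = unwind(
  [{ ISBN: "9781857150193", subjects: ["Long Island", "New York", "1920s"] }],
  "subjects"
);
console.log(unwound);
```

Piping this output into a grouping step keyed on `subjects` is how per-element counts are usually obtained.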

$sort, $limit, $skip
• Sort documents by one or more fields
– Same order syntax as cursors
– Waits for earlier pipeline operator to return
– In-memory unless early and indexed
• Limit and skip follow cursor behavior

Sort All the Documents in the Pipeline
Input:
{ title: "The Great Gatsby" }
{ title: "Brave New World" }
{ title: "Grapes of Wrath" }
{ title: "Animal Farm" }
{ title: "Lord of the Flies" }
{ title: "Fathers and Sons" }
{ title: "Invisible Man" }
{ title: "Fahrenheit 451" }
Pipeline stage:
{ $sort: { title: 1 }}
Output:
{ title: "Animal Farm" }
{ title: "Brave New World" }
{ title: "Fahrenheit 451" }
{ title: "Fathers and Sons" }
{ title: "Grapes of Wrath" }
{ title: "Invisible Man" }
{ title: "Lord of the Flies" }
{ title: "The Great Gatsby" }

Limit Documents Through the Pipeline
Input:
{ title: "The Great Gatsby" }
{ title: "Brave New World" }
{ title: "Grapes of Wrath" }
{ title: "Animal Farm" }
{ title: "Lord of the Flies" }
{ title: "Fathers and Sons" }
{ title: "Invisible Man" }
{ title: "Fahrenheit 451" }
Pipeline stage:
{ $limit: 5 }
Output:
{ title: "The Great Gatsby" }
{ title: "Brave New World" }
{ title: "Grapes of Wrath" }
{ title: "Animal Farm" }
{ title: "Lord of the Flies" }

Skip Over Documents in the Pipeline
Input:
{ title: "The Great Gatsby" }
{ title: "Brave New World" }
{ title: "Grapes of Wrath" }
{ title: "Animal Farm" }
{ title: "Lord of the Flies" }
{ title: "Fathers and Sons" }
{ title: "Invisible Man" }
{ title: "Fahrenheit 451" }
Pipeline stage:
{ $skip: 5 }
Output:
{ title: "Fathers and Sons" }
{ title: "Invisible Man" }
{ title: "Fahrenheit 451" }
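
Combining the three operators gives simple pagination. The sketch below imitates `{ $sort: { title: 1 } }`, `{ $skip: 5 }`, and `{ $limit: 2 }` on an in-memory array; the helper names are illustrative and the comparison is a plain code-unit string comparison.

```javascript
const titles = [
  "The Great Gatsby", "Brave New World", "Grapes of Wrath", "Animal Farm",
  "Lord of the Flies", "Fathers and Sons", "Invisible Man", "Fahrenheit 451",
].map((title) => ({ title }));

// Ascending sort on title; copies the array so the input is left untouched.
const sortByTitle = (docs) =>
  [...docs].sort((a, b) => (a.title < b.title ? -1 : a.title > b.title ? 1 : 0));
const skip = (n) => (docs) => docs.slice(n);
const limit = (n) => (docs) => docs.slice(0, n);

// sort -> skip 5 -> limit 2: the sixth and seventh titles in sorted order.
const page = limit(2)(skip(5)(sortByTitle(titles)));
console.log(page);
```

Note that the sort has to see every input document before it can emit its first result, whereas skip and limit can work on a stream.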

Usage and Limitations

Usage
• collection.aggregate() method
– Mongo shell
– Most drivers
• aggregate database command

Collection
db.books.aggregate([
  { $project: { language: 1 }},
  { $group: { _id: "$language", numTitles: { $sum: 1 }}}
])
{
  result: [
    { _id: "Russian", numTitles: 1 },
    { _id: "English", numTitles: 2 }
  ],
  ok: 1
}

Database Command
db.runCommand({
  aggregate: "books",
  pipeline: [
    { $project: { language: 1 }},
    { $group: { _id: "$language", numTitles: { $sum: 1 }}}
  ]
})
{
  result: [
    { _id: "Russian", numTitles: 1 },
    { _id: "English", numTitles: 2 }
  ],
  ok: 1
}

Limitations
• Result limited by BSON document size
– Final command result
– Intermediate shard results
• Pipeline operator memory limits
• Some BSON types unsupported
– Binary, Code, deprecated types

Sharding

Sharding
• Split the pipeline at first $group or $sort
– Shards execute pipeline up to that point
– mongos merges results and continues
• Early $match may exclude shards from the operation
• CPU and memory implications for mongos

Sharding
[
{ $match: { /* filter by shard key */ }},
{ $project: { /* select fields */ }},
{ $group: { /* group by some field */ }},
{ $sort: { /* sort by some field */ }},
{ $project: { /* reshape result */ }}
]
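
The split rule from the previous slide can be illustrated with a small helper. This is a simplified sketch: `splitForSharding` and the field names in the sample pipeline are hypothetical, and the sketch ignores any partial work the server might still push to the shards for the splitting stage itself.

```javascript
// Split a pipeline at its first $group or $sort stage.
const splitForSharding = (stages) => {
  const isSplitPoint = (stage) => "$group" in stage || "$sort" in stage;
  const index = stages.findIndex(isSplitPoint);
  const cut = index === -1 ? stages.length : index;
  return {
    shardPart: stages.slice(0, cut), // run on each shard
    mongosPart: stages.slice(cut),   // run after results are merged
  };
};

// Hypothetical pipeline with the same shape as the slide's example.
const pipeline = [
  { $match: { language: "English" } },
  { $project: { language: 1, pages: 1 } },
  { $group: { _id: "$language", sumPages: { $sum: "$pages" } } },
  { $sort: { sumPages: -1 } },
  { $project: { sumPages: 1 } },
];

const { shardPart, mongosPart } = splitForSharding(pipeline);
console.log(shardPart.length, mongosPart.length);
```

Everything in `mongosPart` competes for CPU and memory on a single process, which is one reason to filter and trim documents as early as possible.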

Aggregation in a sharded cluster

Expressions

Expressions
• Return computed values
• Used with $project and $group
• Reference fields using $ (e.g. "$x")
• Expressions may be nested

Boolean Operators
• Input array of one or more values
– $and, $or
– Short-circuit logic
• Invert values with $not
• Evaluation of non-boolean types
– null, undefined, zero ▶ false
– Non-zero, strings, dates, objects ▶ true
{ $and: [true, false] } ▶ false
{ $or: ["foo", 0] } ▶ true
{ $not: null } ▶ true
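
The coercion rules above differ slightly from JavaScript's own truthiness (an empty string is falsy in JavaScript but, being a string, counts as true under the rule on the slide). A sketch of the slide's rules, with `toBool`, `and`, `or`, and `not` as hypothetical helper names:

```javascript
// null, undefined, and zero are false; everything else (including "") is true.
const toBool = (value) =>
  !(value === null || value === undefined || value === 0 || value === false);

// every/some stop at the first deciding element, i.e. they short-circuit.
const and = (values) => values.every(toBool);
const or = (values) => values.some(toBool);
const not = (value) => !toBool(value);

console.log(and([true, false]), or(["foo", 0]), not(null));
```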

Comparison Operators
• Compare numbers, strings, and dates
• Input array with two operands
– $cmp, $eq, $ne
– $gt, $gte, $lt, $lte
{ $cmp: [3, 4] } ▶ -1
{ $eq: ["foo", "bar"] } ▶ false
{ $ne: ["foo", "bar"] } ▶ true
{ $gt: [9, 7] } ▶ true

Arithmetic Operators
• Input array of one or more numbers
– $add, $multiply
• Input array of two operands
– $subtract, $divide, $mod
{ $add: [1, 2, 3] } ▶ 6
{ $multiply: [2, 2, 2] } ▶ 8
{ $subtract: [10, 7] } ▶ 3
{ $divide: [10, 2] } ▶ 5
{ $mod: [8, 3] } ▶ 2

String Operators
• $strcasecmp case-insensitive comparison
– $cmp is case-sensitive
• $toLower and $toUpper case change
• $substr for sub-string extraction
• Not encoding aware (assumes ASCII alphabet)
{ $strcasecmp: ["foo", "bar"] } ▶ 1
{ $substr: ["foo", 1, 2] } ▶ "oo"
{ $toUpper: "foo" } ▶ "FOO"
{ $toLower: "BAR" } ▶ "bar"
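
For readers more at home in JavaScript, the comparison and substring examples behave like the helpers below. This is a sketch with hypothetical helper names; JavaScript's `toLowerCase` is Unicode-aware, so it only matches the ASCII-only behaviour described above for ASCII input.

```javascript
// Case-insensitive comparison returning 1, 0, or -1.
const strcasecmp = (a, b) => {
  const x = a.toLowerCase();
  const y = b.toLowerCase();
  return x < y ? -1 : x > y ? 1 : 0;
};

// Substring by zero-based start offset and length.
const substr = (text, start, length) => text.slice(start, start + length);

console.log(strcasecmp("foo", "bar"), substr("foo", 1, 2));
```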

Date Operators
• Extract values from date objects
– $dayOfYear, $dayOfMonth, $dayOfWeek
– $year, $month, $week
– $hour, $minute, $second
{ $year: ISODate("2012-10-24T00:00:00.000Z") } ▶ 2012
{ $month: ISODate("2012-10-24T00:00:00.000Z") } ▶ 10
{ $dayOfMonth: ISODate("2012-10-24T00:00:00.000Z") } ▶ 24
{ $dayOfWeek: ISODate("2012-10-24T00:00:00.000Z") } ▶ 4
{ $dayOfYear: ISODate("2012-10-24T00:00:00.000Z") } ▶ 298
{ $week: ISODate("2012-10-24T00:00:00.000Z") } ▶ 43
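
The date parts for 2012-10-24 can be cross-checked with JavaScript's UTC date methods. The sketch assumes Sunday counts as day 1 of the week and that week numbers start at the first Sunday of the year, with earlier days falling in week 0; for this date it yields day 298 of the leap year 2012 and week 43.

```javascript
const date = new Date("2012-10-24T00:00:00.000Z");

const MS_PER_DAY = 24 * 60 * 60 * 1000;
const startOfYear = Date.UTC(date.getUTCFullYear(), 0, 1);
// Zero-based index of the day within its year (0 for January 1).
const dayIndex = Math.floor((date.getTime() - startOfYear) / MS_PER_DAY);

const parts = {
  year: date.getUTCFullYear(),
  month: date.getUTCMonth() + 1,   // JavaScript months are zero-based
  dayOfMonth: date.getUTCDate(),
  dayOfWeek: date.getUTCDay() + 1, // 1 = Sunday ... 7 = Saturday
  dayOfYear: dayIndex + 1,
  week: Math.floor((dayIndex + 7 - date.getUTCDay()) / 7),
};

console.log(parts);
```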

Conditional Operators
• $cond ternary operator
• $ifNull
{ $cond: [{ $eq: [1, 2] }, "same", "different"] } ▶ "different"
{ $ifNull: ["foo", "bar"] } ▶ "foo"
{ $ifNull: [null, "bar"] } ▶ "bar"
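
Both conditional operators have close JavaScript relatives: the ternary operator and the nullish-coalescing operator. A sketch with hypothetical helper names; `cond` expects an actual boolean as its first argument so that JavaScript truthiness does not leak in.

```javascript
// Pick one of two values based on a boolean test.
const cond = (test, whenTrue, whenFalse) => (test ? whenTrue : whenFalse);

// Fall back to a replacement only when the value is null or undefined.
const ifNull = (value, replacement) => value ?? replacement;

console.log(cond(1 === 2, "same", "different"), ifNull("foo", "bar"), ifNull(null, "bar"));
```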

Looking Ahead

Framework Use Cases
• Basic aggregation queries
• Ad-hoc reporting
• Real-time analytics
• Visualizing time series data

Extending the Framework
• Adding new pipeline operators, expressions
• $out and $tee for output control
– https://jira.mongodb.org/browse/SERVER-3253

Future Enhancements
• Automatically move $match earlier if possible
• Pipeline explain facility
• Memory usage improvements
– Grouping input sorted by _id
– Sorting with limited output

#mongodbdays
Thank You
Emily Stolfo
Ruby Engineer/Evangelist, 10gen
@EmStolfo
