Optimizing the order of aggregation pipelines

Optimizing the order of aggregation pipelines

Oct 13, 2017
Guy Harrison

MongoDB does have a query optimizer, and in most cases it’s effective at picking the best of multiple possible plans. However it’s worth remembering that in the case of the aggregate function the sequence in which various steps are executed is completely under your control. The optimizer won’t reorder steps into the optimal sequence to get you out of trouble.

Optimizing the order of steps probably comes mainly to reducing the amount of data in the pipeline as early as possible – this reduces the amount of work that has to be done by each successive step. The corollary is that steps that perfom a lot of work on data should be placed after any filter steps.

Nowhere is this more important that in $lookup steps. Since $lookup steps perform a separate collection lookup – hopefully using an index – we should make sure we delay them until all data has been filtered. Consider this aggregation function, which generates a “top 10” list of product purchases by customer:

     var output=db.orders.aggregate([
           {$sample:{size:sampleSize}},
           {$match:{orderStatus:"C"}},
           {$project:{CustId:1,lineItems:1}},
           {$unwind:"$lineItems"},
           {$group:{_id:{ CustId:"$CustId",ProdId:"$lineItems.prodId"},
                     "prodCount":{$sum:"$lineItems.prodCount"},
                     "prodCost":{$sum:"$lineItems.Cost"}}},
           {$sort:{prodCost:-1}},
           {$limit:10},
           {$lookup:{
                       from: "customers",
                         as: "c",
                 localField: "_id.CustId",
               foreignField: "_id"
           } },
           {$lookup:{
                       from: "products",
                       as: "p",
                 localField: "_id.ProdId",
               foreignField: "_id"
           }},
           {$unwind:"$p"},{$unwind:"$c"}, //Get rid of single element arrays
           {$project:{"Customer":"$c.CustomerName","Product":"$p.ProductName",
                      prodCount:1,prodCost:1,_id:0}}
       ]);

Lines 11-22 perform lookups on the customers and products collection to get customer and product names.

We could have done these lookups much earlier in the pipeline. So for instance, this code returns the exact same results, but does the lookup a little earlier in the sequence:

     var output=db.orders.aggregate([
           {$sample:{size:sampleSize}},
           {$match:{orderStatus:"C"}},
           {$project:{CustId:1,lineItems:1}},
           {$unwind:"$lineItems"},
           {$group:{_id:{ CustId:"$CustId",ProdId:"$lineItems.prodId"},
                     "prodCount":{$sum:"$lineItems.prodCount"},
                     "prodCost":{$sum:"$lineItems.Cost"}}},
           {$lookup:{
                       from: "customers",
                         as: "c",
                 localField: "_id.CustId",
              foreignField: "_id"
           }},
           {$lookup:{
                     from: "products",
                         as: "p",
                 localField: "_id.ProdId",
               foreignField: "_id"
           }},
           {$sort:{prodCost:-1}},
           {$limit:10},
           {$unwind:"$p"},{$unwind:"$c"}, //Get rid of single element arrays
           {$project:{"Customer":"$c.CustomerName","Product":"$p.ProductName",
                      prodCount:1,prodCost:1,_id:0}}
        ]);

The difference in performance is striking. By moving the $lookup a few lines earlier, we have created a much less scalable solution:

When the $lookups are before the $limit step, we have to perform as many lookups as there are matching rows. When we move after the $limit we only have to perform 10. It’s an obvious but important optimization.

Unlike a SQL query optimizer, MongoDB will not perform any of these optimizations for you - you have to use “the optimizer between your ears”.

Using the dbKoda aggregation builder, we can simply drag and drop the steps into whatever order we like. Providing the new order makes sense, dbKoda will rewrite the aggregation code for you. The dbKoda aggregation editor is the fastest way to create aggregration pipelines. And - like the rest of dbKoda - it is free and open source.

Optimizing the order of aggregation pipelines

Privacy Policy

dbKoda.com and dbKoda Application

Contacting us

Questions?

What personal information do we collect from the people that visit our blog, website or app?

When do we collect information?

How do we use your information?

How do we protect your information?

Do we use 'cookies'?

We use cookies to:

Third-party disclosure

Third-party links

Google

We have implemented the following:

Opting out:

COPPA (Children Online Privacy Protection Act)

Fair Information Practices

In order to be in line with Fair Information Practices we will take the following responsive action, should a data breach occur:

CAN SPAM Act

We collect your email address in order to:

To be in accordance with CANSPAM, we agree to the following:

If at any time you would like to unsubscribe from receiving future emails follow the instructions at the bottom of each email and we will promptly remove you from ALL correspondence.

dbKoda product data collections