Coding efficient MongoDB joins
One of the key tenants of MongoDB schema design is to design to avoid the need for joins. Data is joined all the time inside of application code of course, but originally there was no way to perform joins within the server itself.
This changed in 3.2 with the introduction of the
$lookup operator within the aggregation framework. The
$lookup operator performs the equivalent of a left outer join – eg: it retrieves matching data from another document and returns null data if no match is found.
Here’s an example using some of the data in our sample collections.
What we’re doing here is finding all customers who have ever hired a film staring “Christian Gable”; We start by finding those films in the films collection (lines 6-9), then use
$lookup to retrieve customer data (lines 12-18). Films embeds actors in the “Actors” array; the customers collection embeds films that have been hired in the “Rentals” array.
The result of the join contains all the customers who have borrowed a Christian Gable movie returned as an array, so we use the $unwind operator to “flatten” them out (line 19). The resulting output looks like this:
It’s not too difficult to construct a join using
$lookup, but there are some definite potential issues with respect to join performance. The
$lookup function is going to be executed once for each document returned by our
$match condition. There is no equivalent of a hash or sort merge join operation possible here, so we need to make sure that we’ve used an index to support the join.
Unfortunately, the explain() command doesn’t help us to determine if the join is efficient. For instance, here is the explain output from the operation above:
The explain output tells us if we have used an index to perform the initial
$match, but doesn’t show us if we used an index within the
However, we can see the queries created by the $lookup function if we enable profiling. For instance if we turn profiling on we can see a full collection scan of customers has have been generated for every film document that has been joined:
These “nested” collection scans are bad news. Below is the results of a benchmark in which I joined two collections using
$lookup with and without an index. As you can see, the unindexed
$lookup degrades steeply as the number of rows to be joined increases. The solution is obvious:
Always create an index on the foreignField attributes in a
$lookup, unless the collections are of trivial size.
The MongoDB company is putting a lot of new features into the aggregation framework: they clearly intend to create a very powerful and flexible capability that matches and maybe even exceeds what can be done with SQL. Anyone wanting to do any serious work in MongoDB should make sure they are very comfortable with aggregate. If you use
$lookup to perform joins in aggregate, make sure there is an index on the
Shameless dbKoda plug
All the screenshots above were taken from our dbKoda product, which makes it easy to perform all the tasks you need to write effective aggregations or perform other tasks. Our aggregation builder makes the construction of even the most complex aggregations easy, and we provide GUI access to create indexes, build queries, enable profiling and so on. Check it out at www.dbkoda.com - it’s a completely free, open source, cross platform tool for MongoDB that will improve your productivity and which has a pleasant fruity flavor.