
Effective MongoDB indexing (part 2)

  • Dec 7, 2017
  • Guy Harrison

In Part one of this blog series we looked at the structure and behaviour of MongoDB B-tree indexes and how to create concatenated indexes to optimise multi-key lookups. In this instalment, we’ll look at some tradeoffs to consider when using - or considering using - indexes. You might also want to read this article on execution plans, since we’ll be using MongoDB explain() to understand how indexes affect retrieval paths.

To index or not to index?

If you are planning to read a book, you don’t start at the index and jump between every index entry and the page of the book it refers to. That would be silly and extremely time consuming. You read a book by starting at the first page and reading subsequent pages in order. If you want to find a specific item in a book, that’s when you use the index.

That much is obvious and completely applicable to MongoDB - if you are reading the entire collection you don’t want to use an index. If you are reading a small number of documents, then an index is preferred. But at what point does the index become more effective than the collection scan? Should I use an index if I’m reading half the collection?

The answer is - unfortunately - it depends. Some of the factors which affect the “break even” point for indexed retrieval are:

  • Caching effects. Index retrievals tend to get very good hit rates in the MongoDB cache, while full collection scans generally get a much poorer hit rate. But if all of the collection is in cache then a collection scan will perform closer to index speeds.

  • Document size. Most of the time, a document will be retrieved in a single IO so the size of the document doesn’t affect index performance that much. However, larger documents mean larger collections which will push up the amount of IO needed for the collection scan.

  • Data distribution. If documents in the collection are stored in the order of the indexed attribute (which can happen if documents are inserted in key order), then the index will have fewer blocks to visit for each key lookup and will experience a much higher hit rate. Data that is stored in sorted order is sometimes referred to as highly clustered.
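The interplay between these factors can be illustrated with a toy cost model. The constants and formula below are purely illustrative assumptions for the sketch, not MongoDB internals:

```javascript
// Toy cost model comparing a collection scan with an indexed retrieval,
// in units of block reads. All numbers are illustrative assumptions,
// not MongoDB internals.
const DOCS_PER_BLOCK = 50;

function collectionScanCost(totalDocs) {
  // A collection scan reads every block exactly once, sequentially.
  return Math.ceil(totalDocs / DOCS_PER_BLOCK);
}

function indexedRetrievalCost(totalDocs, fractionRetrieved, clustering) {
  const docsRead = totalDocs * fractionRetrieved;
  // clustering = 1: documents are stored in key order, so each block read
  // yields a full block of wanted documents. clustering = 0: every document
  // costs its own random block read.
  const blockReadsPerDoc = clustering / DOCS_PER_BLOCK + (1 - clustering);
  return Math.ceil(docsRead * blockReadsPerDoc);
}

// Randomly ordered data: the collection scan wins well below 50% retrieved.
collectionScanCost(1000000);             // 20000 block reads
indexedRetrievalCost(1000000, 0.1, 0);   // 100000 block reads

// Highly clustered data: the index can win even at 90% retrieved.
indexedRetrievalCost(1000000, 0.9, 1);   // 18000 block reads
```

Even this crude model reproduces the broad shape of the measurements below: for unclustered data the break-even point is at a small fraction of the collection, while for clustered data the index stays competitive almost all the way up.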

breakevenChart.png

Figure 1: Elapsed time for index and collection scans plotted against percentage of collection accessed

Figure 1 shows the elapsed time for indexed scans and collection scans for clustered and unclustered data, plotted against the percentage of the collection being retrieved. In one case the data was loaded into the collection in sorted order, favouring an index lookup, and in the other the data was effectively in random order. For randomly distributed data, a collection scan completed more quickly than an index scan if more than about 8% of the collection was retrieved. However, if the data was highly clustered, the index scan outperformed the collection scan up to almost the 95% mark.

Although it’s not really possible to provide a “one-size-fits-all” cutoff point for index retrieval, the following statements are generally valid:

  • If all documents or a large proportion of documents in the collection need to be accessed, then a full collection scan will be the quickest access path.

  • If a single document is to be retrieved from a large collection, then an index based on that attribute will offer the quicker retrieval path.

  • Between these two extremes, it may be difficult to predict which access path will be quicker.

Overriding the optimizer

The MongoDB optimizer uses a combination of heuristics - rules - and “experiments” when deciding on the optimal access path. It will usually try a few different plans before settling on a plan for a specific query “shape”. However, the optimizer is biased in favour of using indexes when they exist. So for instance, the query in Figure 2 retrieves every document in the collection, because the a attribute contains only positive integers. Even though all the documents are being retrieved, MongoDB chooses an indexed path.

ixscanwithouthint.png

Figure 2: MongoDB will sometimes use an index even if all or most of the collection is being retrieved.

The execution plan shows that all 1,000,000 documents in the collection are retrieved by the IXSCAN step: clearly the use of an index was not ideal in this case.
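You can check which plan was chosen by walking the winningPlan tree in the explain() output. Below is a small helper for listing the stages; the explainDoc object is a simplified, hypothetical explain() result, not output captured from a real server:

```javascript
// Walk the winningPlan tree of an explain() document and collect stage names.
// Single-input stages hang off inputStage; merge stages (e.g. OR) use inputStages.
function planStages(plan) {
  const stages = [plan.stage];
  if (plan.inputStage) stages.push(...planStages(plan.inputStage));
  if (plan.inputStages) {
    plan.inputStages.forEach(s => stages.push(...planStages(s)));
  }
  return stages;
}

// Simplified, hypothetical explain() output for an indexed retrieval.
const explainDoc = {
  queryPlanner: {
    winningPlan: {
      stage: "FETCH",
      inputStage: { stage: "IXSCAN", indexName: "a_1" }
    }
  }
};

planStages(explainDoc.queryPlanner.winningPlan); // → [ "FETCH", "IXSCAN" ]
```

Seeing IXSCAN feeding a FETCH for a query that touches the whole collection is the tell-tale sign of the problem described above.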

We can force a collection scan instead by adding a hint as follows:

collscanwithhint.png

Figure 3: Using a hint to force a collection scan.

The $natural hint instructs MongoDB to perform a collection scan. Alternatively, you can specify an index, as in the snippet below, where we force the use of the index on the b attribute. Without the hint, MongoDB might choose an index on the a attribute instead.

db.baseCollection.find({a:1000,b:1000}).hint({b:1})

Indexes aren’t always the fastest way to retrieve data - when a large proportion of a collection is being retrieved, a collection scan will often be faster.

Index merges

In Part one we talked about how creating an index on all the conditions in the query was most effective. So in the query above, an index on {a:1,b:1} is indicated. However, it’s not always possible to create indexes on all combinations - the effect on insert/update/delete performance might be too severe (more on that below).

MongoDB can perform index “merges”, using multiple indexes to resolve a single query. This happens only occasionally for $and conditions and is indicated by an AND_SORTED step with two IXSCAN child steps.

Multiple indexes can also be used to support $or conditions. Unlike the AND_SORTED solution, MongoDB will almost always choose this plan when an $or condition operates on two indexed attributes.

orplan.png

Figure 4: Multiple indexes can be used to resolve $and or $or conditions.
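Conceptually, an AND_SORTED merge intersects the sorted streams of record ids returned by each index, while an $or plan takes their union and de-duplicates. A plain JavaScript sketch of both merges (illustrative only, not MongoDB internals):

```javascript
// Intersect two sorted arrays of record ids, as an AND_SORTED-style
// merge would: advance whichever side is behind, keep the matches.
function andSorted(idsA, idsB) {
  const out = [];
  let i = 0, j = 0;
  while (i < idsA.length && j < idsB.length) {
    if (idsA[i] === idsB[j]) { out.push(idsA[i]); i++; j++; }
    else if (idsA[i] < idsB[j]) i++;
    else j++;
  }
  return out;
}

// Union with de-duplication, as an $or plan must do to avoid returning
// the same document twice when it matches both conditions.
function orMerge(idsA, idsB) {
  return [...new Set([...idsA, ...idsB])].sort((x, y) => x - y);
}

andSorted([1, 4, 7, 9], [2, 4, 9, 12]); // → [4, 9]
orMerge([1, 4, 7], [4, 9]);             // → [1, 4, 7, 9]
```

The sorted-order requirement is why the $and merge is rarer: both index scans must deliver record ids in a compatible order before this kind of merge is cheap.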

Indexes for sorting

Indexes can be used to support returning data in sorted order. This can improve the response time for the query, and eliminate out-of-memory errors like this one:

Executor error during find command: OperationFailed: Sort operation used more than the maximum 33554432 bytes of RAM. Add an index, or specify a smaller limit.

A query that specifies a .sort() operation and which retrieves data by a collection scan will show a SORT_KEY_GENERATOR step followed by a SORT step:

SORT_KEY_GENERATOR.png

If the index supports the sort, then we’ll just see an IXSCAN and FETCH:

ixscansort.png

Using an index to return documents in a specific order is not always the best option. If you are looking for only the first few documents, the index will outperform a collection scan followed by a sort. However, if you are sorting all the documents in the collection, the index will reduce the sort time but increase the data retrieval time - as we saw earlier, using an index to retrieve all the documents in a collection can be substantially slower than a collection scan.

sortGraph.png

Figure 5: Effect of an index on sorting when retrieving all documents or only the first document.

Figure 5 shows how an index radically reduces the response time to retrieve the first sorted document, but actually degrades the time required to get the last sorted document in the collection.
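The first-document case is fast because an index already holds keys in sorted order, so the server can stop after fetching a handful of documents instead of sorting everything. A sketch of the difference (plain JavaScript; the tiny collection is invented for illustration):

```javascript
// Simulated collection in insertion (random) order.
const docs = [{a: 5}, {a: 1}, {a: 4}, {a: 2}, {a: 3}];

// Without an index: sort every document, then take the first. O(n log n),
// and the whole collection must be read before anything can be returned.
function firstBySort(collection) {
  return [...collection].sort((x, y) => x.a - y.a)[0];
}

// With an index on {a: 1}: the lowest key is simply the first index entry,
// so only one document needs to be fetched. (A real index is maintained
// incrementally at write time; the sort below just simulates that
// pre-existing ordering.)
const index = docs
  .map((d, pos) => ({key: d.a, pos}))
  .sort((x, y) => x.key - y.key);

function firstByIndex(collection, idx) {
  return collection[idx[0].pos];
}

firstBySort(docs);         // → { a: 1 }
firstByIndex(docs, index); // → { a: 1 }
```

Retrieving the last sorted document through the index, by contrast, means fetching every document via a random index-driven lookup, which is the slow path shown in Figure 5.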

If you decide that you want to use a non-indexed sort on an indexed attribute, you may need to invoke the hint({$natural:1}) mechanism that we looked at earlier. You might also need to increase the memory available for sorting using the following command (in this case, allowing 128MB for sorting):

db.getSiblingDB("admin").
  runCommand({ setParameter: 1, internalQueryExecMaxBlockingSortBytes: 134217728 });

db.baseCollection.find().hint({$natural:1}).sort({a:1});

Using an index to optimize a sort is a good strategy if you are only interested in the first few documents from the sort. When sorting all documents in a collection, you might be better off with a collection scan.

Using indexes for joins

The MongoDB aggregation framework allows joins between two collections. For instance, the following aggregation joins data from the Sakila_films collection to the Sakila_actors collection:

db.Sakila_films.aggregate([
  { $unwind:  "$Actors" },
  { $lookup:
     { from:         "Sakila_actors",
       localField:   "Actors.actorId",
       foreignField: "actorId",
       as:           "actors"
     }
  },
]);

Without an index on the Sakila_actors.actorId attribute, MongoDB will have to perform a collection scan of Sakila_actors for each actor lookup. The result will be a rapid degradation in performance as the join volumes increase. Figure 6 illustrates this degradation.

joinGraph.png

Figure 6. MongoDB join performance with and without an index
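The degradation is essentially the difference between a nested-loop join, which rescans the foreign collection for every input document, and an indexed lookup. A sketch using a plain JavaScript Map as a stand-in for the index (the collection contents are invented for illustration):

```javascript
// Invented sample data mirroring the aggregation above.
const films  = [{title: "F1", actorId: 1}, {title: "F2", actorId: 2}];
const actors = [{actorId: 1, name: "A"}, {actorId: 2, name: "B"}];

// Without an index: scan all of `actors` once per film. O(n × m) overall,
// which is why elapsed time climbs so steeply as both sides grow.
function lookupByScan(film) {
  return actors.filter(a => a.actorId === film.actorId);
}

// With an index on actorId (a Map as a stand-in): one probe per film,
// O(n + m) overall.
const actorIndex = new Map(actors.map(a => [a.actorId, [a]]));
function lookupByIndex(film) {
  return actorIndex.get(film.actorId) || [];
}
```

Both functions return the same joined rows; only the amount of work done to find them differs.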

You should index any attributes that appear in foreignField conditions within $lookup statements to achieve scalable join performance.

See this post for more information about optimising MongoDB joins.

Effect of indexes on updates, inserts and deletes

Although indexes can dramatically improve query performance, they do reduce the performance of updates, inserts and deletes. All of a collection's indexes will normally be updated when a document is inserted or deleted, and an index must also be amended when an update changes any attribute that appears in the index.

It is therefore important that all our indexes contribute to query performance, since these indexes will otherwise needlessly degrade update, insert and delete performance. In particular, you should be especially careful when creating indexes on frequently updated attributes. A document can only be inserted or deleted once, but it may be updated many times. Indexes on heavily updated attributes, or on collections that have a very high insert/delete rate, will therefore exact a particularly high cost.

indexesAndDML.png

Figure 7: Effect of adding indexes on insert and delete performance

Figure 7 illustrates the overhead that indexes add to insert and delete performance. 10,000 documents were inserted and then deleted; the elapsed time was directly proportional to the number of indexes on the collection.

The overhead of indexing is felt most critically during batch deletes. Whenever a document is deleted, every index entry that refers to that document must be removed. There’s no direct pointer from a document to its index entries, so a scan of all index entries for the relevant key value is often needed to find those that point to the deleted document. For instance, if a document with the SURNAME ‘Smith’ is deleted, we would scan all index entries for ‘Smith’ and remove any that point to the deleted document.
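In a sketch, a secondary index maps each key value to a list of document ids, and deleting a document means scanning that key's list to find and remove the matching entry (plain JavaScript, illustrative only; the ids are invented):

```javascript
// Illustrative secondary index: key value -> array of document ids.
const surnameIndex = new Map([
  ["Smith", [101, 205, 377]],
  ["Jones", [102]],
]);

// Removing a document must scan every entry for its key value, because
// there is no direct pointer from a document to its index entries.
function removeFromIndex(index, key, docId) {
  const entries = index.get(key) || [];
  index.set(key, entries.filter(id => id !== docId));
}

removeFromIndex(surnameIndex, "Smith", 205);
surnameIndex.get("Smith"); // → [101, 377]
```

Multiply that scan by every index on the collection and by every document in a batch delete, and the cost shown in Figure 7 follows directly.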

Indexes always add to the overhead of insert and delete operations, and may add to the overhead of update statements. Avoid over-indexing, especially on attributes which are frequently updated.

Conclusion

Understanding the costs of indexes on updates, inserts and deletes, and the circumstances in which indexes can improve performance, should help you determine the optimal set of indexes for your application.

Try out dbKoda - an open source, free IDE now available for MongoDB! You can experiment with indexing options, build queries and aggregates and analyse explain plans from within a modern graphical interface. The explain plans shown in this article were created using dbKoda.