While trying to optimize some slow queries in a MongoDB database, I found an unexpected and concerning surprise: Adding an index can alter the results returned by a query against the same dataset.
Supose we have a collection that looks like this (All samples from a mongo shell):
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20
Note that some documents have a numeric
rating, one has a
null value and one does not have the field.
Suppose we query for all documents with a rating of
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15
Metallica document is returned, even though it does not have a
Suppose that we want to optimize this collection and now we add an index on the rating field and re-run our query:
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18
Metallica document is gone. Surprised? I definetly was.
The behavior may seem a bit contrived, but I actually encountered it while trying to optimize a produciton database. This example just boils it down to something trivial to reproduce. I should mention that if the index is created without the
sparse option, the results are correct. The
sparse option allows saving space on the index itself, by only creating an entry for documents that have the field. A non-sparse index, creates a record for all documents and sets the value to
In my opinion, the above-described behavior is awful. It is up to the database engine to decide which index to use. A sparse index may be useful in less queries than a non-sparse index. However, my expectations of indexes is that they are all about performance and trading off disk space and insert time for query time. The existance of an index should never change the result set for the same query and dataset.