Ylan Segal

Adding an Index to Mongo Can Change Query Results

While trying to optimize some slow queries in a MongoDB database, I found an unexpected and concerning surprise: Adding an index can alter the results returned by a query against the same dataset.

Demonstration

Supose we have a collection that looks like this (All samples from a mongo shell):

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
> db.example.find()
{
  "_id" : ObjectId("5542ef97b08a749f8e8e4f0d"),
  "title" : "Pink Floyd",
  "rating" : 1
}
{
  "_id" : ObjectId("5542efa2b08a749f8e8e4f0e"),
  "title" : "Led Zeppelin",
  "rating" : 2
}
{
  "_id" : ObjectId("5542efb3b08a749f8e8e4f0f"),
  "title" : "Aerosmith",
  "rating" : null
}
{
  "_id" : ObjectId("5542efbab08a749f8e8e4f10"),
  "title" : "Metallica"
}

Note that some documents have a numeric rating, one has a null value and one does not have the field.

Suppose we query for all documents with a rating of 1 or null:

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
> db.example.find({rating: { $in: [1, null]}})
{
  "_id" : ObjectId("5542ef97b08a749f8e8e4f0d"),
  "title" : "Pink Floyd",
  "rating" : 1
}
{
  "_id" : ObjectId("5542efb3b08a749f8e8e4f0f"),
  "title" : "Aerosmith",
  "rating" : null
}
{
  "_id" : ObjectId("5542efbab08a749f8e8e4f10"),
  "title" : "Metallica"
}

The Metallica document is returned, even though it does not have a rating field.

Suppose that we want to optimize this collection and now we add an index on the rating field and re-run our query:

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
> db.example.ensureIndex({rating: 1}, {sparse: true})
{
  "createdCollectionAutomatically" : false,
  "numIndexesBefore" : 1,
  "numIndexesAfter" : 2,
  "ok" : 1
}
> db.example.find({rating: { $in: [1, null]}})
{
  "_id" : ObjectId("5542efb3b08a749f8e8e4f0f"),
  "title" : "Aerosmith",
  "rating" : null
}
{
  "_id" : ObjectId("5542ef97b08a749f8e8e4f0d"),
  "title" : "Pink Floyd",
  "rating" : 1
}

The Metallica document is gone. Surprised? I definetly was.

Thoughts

The behavior may seem a bit contrived, but I actually encountered it while trying to optimize a produciton database. This example just boils it down to something trivial to reproduce. I should mention that if the index is created without the sparse option, the results are correct. The sparse option allows saving space on the index itself, by only creating an entry for documents that have the field. A non-sparse index, creates a record for all documents and sets the value to null.

In my opinion, the above-described behavior is awful. It is up to the database engine to decide which index to use. A sparse index may be useful in less queries than a non-sparse index. However, my expectations of indexes is that they are all about performance and trading off disk space and insert time for query time. The existance of an index should never change the result set for the same query and dataset.

Recursion and Pattern Matching in Elixir

In order to teach myself Elixir, I have been working my way through Exercism.io, which is a set of practice coding exercises with mentorship from the community. All exercises have the tests written for you and it’s up to the user to write a passing implementation.

Being new to Elixir and functional programming, the exercises are a great way for me to learn about syntax, idiomatic code and functional programming patterns. One of exercises consists of re-implementing common list operations, like count, map and reduce.

Implementing Count With Recursion

The test that the implementation must pass looks like this:

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
defmodule ListOpsTest do
  alias ListOps, as: L

  use ExUnit.Case, async: true

  test "count of empty list" do
    assert L.count([]) == 0
  end

  test "count of normal list" do
    assert L.count([1,3,5,7]) == 4
  end

  test "count of huge list" do
    assert L.count(Enum.to_list(1..1_000_000)) == 1_000_000
  end
end

My first implementation, looked like this:

1
2
3
4
5
6
7
8
9
10
11
12
13
defmodule ListOps do
  def count(list) do
    count(0, list)
  end

  def count(acc, []) do
    acc
  end

  def count(acc, [_|tail]) do
    count(acc + 1, tail)
  end
end

First thing of note: count/21 is defined twice. This is part of the language provided functionality. In Java, method overloading required a different number of parameters which is how the dispatching picked the correct method at runtime. In Ruby, there can’t exist to method definitions in the same scope. In Elixir, the correct function is called at run-time depending on which pattern is matched.

On our first test, when L.count([]) is called, the count/1 function matches, because it only has one parameters. That function calls count(0, []). This will match the first count/2 definition, because it is being passed with an empty list. (Any acc will match). That in turn returns acc, which is 0, making the test pass.

For the second test, count/1 is matched, which ends up calling count(0, [1,3,5,7]). That call, matches the second count/2 definition, because it matches a list that is not empty2. That function call will call recursively, adding 1 to the accumulator each call, until the list is empty and the accumulator is returned.

The calls will look like:

1
2
3
4
5
6
count([1,3,5,7])
count(0, [1,3,5,7])
count(1, [3,5,7])
count(2, [5,7])
count(3, [5])
count(4, []) # Returns 4

Note that recursion and pattern matching have taken the place of conditionals or explicit loops in the code, as you would have in non-functional programming languages.

Implementing Count With Reduce

The same exercise asks to implement a reduce function that will run a generic function on each element of a list and pass the resulting accumulator. My implementation looks like this:

1
2
3
4
5
6
7
8
9
defmodule ListOps do
  def reduce([], acc, _fun) do
    acc
  end

  def reduce([head|tail], acc, fun) do
    reduce(tail, fun.(head, acc), fun)
  end
end

The same trick as before is used here, where matching on an empty list returns the accumulator. When a list has at list one member, the function is called for that member and reduce/3 is called with the tail of the list recursively.

With reduce/3 in place, the count/1 implementation becomes much simpler:

1
2
3
4
5
defmodule ListOps do
  def count(list) do
    reduce(list, 0, fn(_, acc) -> acc + 1 end)
  end
end

Conclusion

The exercise has some other operations as well: map, reverse, filter, append and concat. I learned a lot working on the solutions and started to get a feel for functional programming. If you are learning a new language, I would recommend trying Exercism.io. It currently supports 23 languages!


  1. In Elixir, when referring to functions, it is customary to add / and the arity to the name. foo/2 refers to the function foo defined with 2 parameters.

  2. Elixir includes matching a list to it’s head and tail with the [head|tail] syntax. The _ signals that the parameter will not be used.

The REPL: Issue 8 - March 2015

Turning The Database Inside Out With Apache Samza

Based on a talk at Strange Loop 2014, this post was eye-opening. Although it’s supposed to be about Apache Samza, most of the talk is devoted to talking about databases in general and what they are good at: Keeping global state, replication, secondary indexing, caching, and materialized views. This high-level view provided me with a lot of new perspective of how to think of databases. The many illustrations in the article are beautiful. Please go and read.

Your Most Important Skill: Empathy

The legendary Chad Fowler makes the case that empathy is a skill that everyone will benefit from developing further. Provides great list of why that is. Most importantly, he also details how to practice.

Git From The Inside Out

Git has often been criticized for having an inconsistent interface and leaking unneeded abstractions to the user. Some of that criticism is warranted. Nonetheless, git is one of my favorite programs. I use it hundreds of times throughout the day, always on the command-line, complemented by tig, the ncurses client for git. This article talks about the internals of git: How it stores data on disk for commits, trees, objects, tags, branches, etc. It is well written, well organized and a pleasure to read. If you read this guide, it will make it easier for you to interact with git because you will understand it’s intrenals. However, I think you should read it because it shows how great functionality can be achieved with software with minimal dependencies and using only the local filesystem as a data store.

Dipping My Toes in Elixir

I recently came across two interesting posts about the Phoenix framework: A benchmark comparing the performance against Rails and a how-to for creating a simple JSON API. I was interested by the performance characteristics described in both articles. What really got my attention though was the syntax: The feel for it was very ruby-like.

Phoenix is written in Elixir, a relatively young functional programming language design by Jose Valim, committer in the Rails Core team. Elixir compiles to byte-code that runs on the Erlang virtual machine, which lists as it’s main features scalability and fault-tolerance.

The main Elixir website has the best getting started guide I have ever read. I started off by installing Elixir, which on my Mac was as simple as $ brew install elixir. That took care of the Erlang VM and whatever else it needed. A few moments later I had a working elixir installation, ready to go.

The guide is written in clear and concise manner, introducing the reader to the language, the syntax, the tools and the concepts in a gradual manner. I am not familiar with functional programming, other than what I have heard described in a few podcast. Even so, I was able to follow along quickly and start exploring. The guide also includes a more advanced section covering mix and OTP.

mix is the main tool that Elixir uses for compiling code, resolving and getting dependencies, running tests, etc. Think of it as the equivalent of make or rake. OTP stands for “Open Telecom Platform” and is a set of libraries included with Erlang that allows developers to build fault-tolerant, distributed applications.

mix is a pleasure to work with: Be it creating a new project, installing dependencies or running your test, you go through mix. (As opposed to rails, rake, bundler)

ExUnit, the included testing framework is familiar to anyone having used any XUnit framework before. The error messages are helpful by default. It includes a great feature where examples inside the documentation of a module can be executed as tests directly. I was very happy to see that the guide introduces tests and TDD early on. It shows the values of the Elixir community, load and clear.

Concurrency is prominent. In fact, in order to have any kind of state at all, you need to use separate Elixir Processes, which are like light-weight threads, not system processes. Messages are sent between processes. I like the idea of having to think about concurrency early, as opposed than doing later in the application life-cycle. Elixir processes are allowed to fail fast on errors. It’s not a big deal, since they are under Supervisors that will just restart them if they fail.

Pattern-matching at first seems weird. They grow on you. It allows the programmer to deal with different cases easily and separate them out into their own functions, foregoing a lot of conditionals. Defining functions many times, with different guard clauses is great. It’s hard to describe (because I don’t know the correct language yet), but here is my take on the Fibonacci sequence:

1
2
3
4
5
6
7
8
9
10
11
defmodule F do
  def fib(x) when x in 1..2  do
    1
  end

  def fib(x) do
    fib(x - 2) + fib(x - 1)
  end
end

IO.puts F.fib(5)

In conclusion: I have barely gotten my feet wet, but have been impressed with what I have seen. I was not expecting this level of polish from a new language. I am itching to find a personal project to write in Elixir!

Book Review: The Ruby Way

Probably one of the most well-known books among rubyists, “The Ruby Way” by Hal Fulton with AndrĂ© Arko, has now been updated and released in its third edition. The first part of the book is dedicated to the language itself and covers syntax, semantics, some comparison to other languages and specific issues, like garbage collection, that developers are well served to know when writing ruby.

We do find OOP to be a useful tool and a meaningful way of thinking about problems; we do not claim that it cures cancer

The Ruby Way

The majority of the book is divided into sections that deal with specific task that a developer may encounter. From basics like working with String, numerical calculations and Enumerable collections to more advanced techniques like Threads and Concurrency, Metaprogramming, Network Programming and Distributed Ruby. Each chapter has plenty of code examples and thorough explanations.

I expect my copy to get plenty of used as my programming takes me to unknown or forgotten parts of ruby.

Links: