Ylan Segal

World Cup Player Age, Unix Style

I love watching the World Cup: it's more soccer than you could hope for, mixed with national rivalries. What could be better? Now that I am older than most of the players, it dawned on me just how much pressure they are under to perform for their country, and a question came to me: just how old are these kids? Let's find out.

Source Data

With a quick Google search, I came up with what seemed like a good source of data for the task at hand. I simply copied and pasted the data from my browser into a text editor to get a file that looks like this:

Alan PULIDO  Mexico  08/03/1991  5   4
Adam TAGGART  Australia   02/06/1993  4   3
Reza GHOOCHANNEJAD    Iran    20/09/1987  13  9
NEYMAR    Brazil  05/02/1992  48  31
Didier DROGBA Ivory Coast 11/03/1978  100 61
David VILLA   Spain   03/12/1981  95  56
Abel HERNANDEZ    Uruguay 08/08/1990  12  7
Javier HERNANDEZ  Mexico  01/06/1988  61  35
Islam SLIMANI Algeria 18/06/1988  19  10
Shinji OKAZAKI    Japan   16/04/1986  75  38
...

Cutting and Slicing

Using the power of unix pipes, we can easily extract the parts we want from the data. Let's start by getting all the birthdates:

$ cat players.txt | cut -f3
08/03/1991
02/06/1993
20/09/1987
05/02/1992
11/03/1978
03/12/1981
08/08/1990
01/06/1988
18/06/1988
16/04/1986
...

As the man page says, cut cuts out selected portions of each line of a file. In our case, we want the 3rd field of each line (cut splits on tabs by default).

Now, we can cut again to get the birth year of each player:

$ cat players.txt | cut -f3 | cut -d '/' -f3
1991
1993
1987
1992
1978
1981
1990
1988
1988
1986
...

In this case, we are cutting again, this time using / as a delimiter. Now we have a list of all the players' birth years.

Histogram

I searched around for some quick utilities that would generate a histogram, and the most promising seemed to be a python utility called data hacks. Unfortunately, it did not install cleanly for me, and I didn't have the inclination to mess with my python installation. I did, however, find something similar to what I needed in a blog post about visualizing your shell history. After adapting it a bit for my purposes, I created a small bash function that now lives in my profile:

function histogram() {
  # Count identical input lines, sort by frequency, and draw a bar scaled so the most common value gets 60 '#'
  sort | uniq -c | sort -rn | awk '!max{max=$1;}{r="";i=s=60*$1/max;while(i-->0)r=r"#";printf "%15s %5d %s %s",$2,$1,r,"\n";}'
}

This function leverages awk very heavily. awk is a pattern-directed scanning and processing language. I am not very familiar with it, but after seeing how powerful it is, I definitely want to get better acquainted with it.
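
For readers who, like me, are not fluent in awk, here is a rough ruby equivalent of what the function does (a sketch for illustration only, not a drop-in replacement):

# Count how many times each line appears on standard input
counts = Hash.new(0)
STDIN.each_line { |line| counts[line.strip] += 1 }

# Print each value with a bar scaled so the most common one gets 60 '#'
max = counts.values.max
counts.sort_by { |_value, count| -count }.each do |value, count|
  bar = '#' * (60 * count / max)
  printf("%15s %5d %s\n", value, count, bar)
end

The awk one-liner packs the same idea into a single line: because its input is already sorted by count, the first (largest) count becomes the maximum, and every value then gets a proportional bar of # characters.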

With this function, we can now get a full histogram:

$ cat players.txt | cut -f3 | cut -d '/' -f3 | histogram | sort
   1971     1 #
   1976     1 #
   1977     2 ##
   1978     5 ####
   1979    15 ############
   1980    19 ###############
   1981    32 #########################
   1982    33 ##########################
   1983    46 ####################################
   1984    58 ##############################################
   1985    66 ####################################################
   1986    77 ############################################################
   1987    70 #######################################################
   1988    65 ###################################################
   1989    62 #################################################
   1990    63 ##################################################
   1991    38 ##############################
   1992    47 #####################################
   1993    22 ##################
   1994     7 ######
   1995     6 #####
   1996     1 #

Notice that the final sort is needed because histogram returns values ordered by how many times each appears in the data; since we are talking about birth years, I believe the graph is more telling in ascending order.

Other Findings

With all that in place, it is relatively easy to make a histogram of other data. I remember Malcolm Gladwell's theory in his book Outliers about how most professional hockey players are born in the first part of the year, because of how the developmental leagues work in Canada. With a database of 736 professional soccer players, the best 23 from each of the 32 countries, I would expect about 61 players to be born in each month. Let's find out how many really are:

$ cat players.txt | cut -f3 | cut -d '/' -f2 | histogram | sort
   01    73 #########################################################
   02    77 ############################################################
   03    66 ####################################################
   04    63 ##################################################
   05    73 #########################################################
   06    59 ##############################################
   07    57 #############################################
   08    57 #############################################
   09    66 ####################################################
   10    46 ####################################
   11    48 ######################################
   12    51 ########################################

Notice that to arrive at this graph, the only thing we changed was the field extracted from the data: in this case, the birth month. Is there a trend here? It does suggest that players born in the first part of the year are favored, but I do not know if the effect is statistically significant.
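
For the curious, one way to put a number on that is a chi-square goodness-of-fit test against a uniform distribution of birth months. Here is a back-of-the-envelope sketch in ruby, using the counts from the histogram above (not rigorous statistics, and it ignores the fact that months have different lengths):

observed = [73, 77, 66, 63, 73, 59, 57, 57, 66, 46, 48, 51]
expected = observed.inject(:+) / 12.0

# Chi-square statistic: the sum of (observed - expected)^2 / expected
chi_square = observed.inject(0.0) { |sum, o| sum + (o - expected) ** 2 / expected }
puts chi_square.round(2)

The resulting statistic can be compared against the chi-square distribution with 11 degrees of freedom (one less than the number of months), whose 5% critical value is roughly 19.7, to judge how surprising the skew really is.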

Conclusion

Quick and dirty data analysis on the command line is pretty easy if you know a bit of unix and some awk!

Ruby Implicit `to_proc`

Ruby's blocks are one of the language features I like the most. They make iterating over collections extremely easy.

[1, 2, 3].map { |n| n.to_s }
=> ["1", "2", "3"]

You can shorten the (already short!) syntax above, like so:

[1, 2, 3].map &:to_s
=> ["1", "2", "3"]

The above is implicitly calling to_proc on the symbol. This is extremely handy when you are calling the same method on each object. However, it can also be useful to call a method with each object as an argument, as the example following this aside shows.
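
As an aside, the symbol shorthand works because the & operator calls Symbol#to_proc on the symbol and passes the resulting proc along as a block. You can do the conversion by hand to see what it produces (a small illustrative sketch):

block = :to_s.to_proc
block.call(42)
=> "42"

[1, 2, 3].map(&block)
=> ["1", "2", "3"]

Back to calling a method with each object as an argument: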

def fancy_format(value)
  "==::[[#{value}]]::=="
end

['1', '3', 'a'].map &method(:fancy_format)

=> ["==::[[1]]::==", "==::[[3]]::==", "==::[[a]]::=="]

In addition, note that the number of arguments yielded to the method depends on what the original iterator yields. For hashes, this is especially useful:

11
def fancy_format(key, value)
  puts "==::[[#{key} - #{value}]]::=="
end

ball = { color: 'red', size: 'large', type: 'bouncy' }


>> ball.each &method(:fancy_format)
==::[[color - red]]::==
==::[[size - large]]::==
==::[[type - bouncy]]::==

Book Review: The Agile Culture - Pixton, Gibson & Nickolaisen

Pollyanna Pixton, Paul Gibson and Niel Nickolaisen write a concise and practical book on how to foster an Agile culture inside your company. It is geared towards those responsible for leading teams of software developers and other IT professionals, although most of the material is applicable to any leader.

Agile, at its roots, is about delivering software that brings value to the company and ultimately to the customer. In order to achieve this, it proposes short iterations and embracing change. The authors' thesis centers on trust and ownership: trust from the leader that the team will do their job professionally, and ownership by the team of the product they work on. If both are present, the team will flourish. Leadership is about giving your team the tools to do their work and protecting them, not about meticulous control. Innovation comes from a culture where risk can be taken and failures can be learned from, as opposed to one of fear of being blamed and possibly fired. In a way, a lot of what the authors talk about is aligning the incentives of each individual with those of the organization. Of course, that requires clarifying the organization's goals and priorities and communicating them to its members.

The book reads very easily and is rich with anecdotes that illustrate good and bad leadership alike. The advice is sensible and not at all process-heavy like some other so-called 'Agile' books that prescribe meticulous schedules for stand-ups, backlog pruning, retrospectives, etc. The advice also feels grounded in the real world: for example, it recognizes that not everyone in your organization may embrace Agile and gives you tools to deal with difficult employees and managers.

If you are a team leader, there is a lot to learn from this book.


The REPL: Issue 2

The Little Mocker

Uncle Bob writes about mocks, stubs, doubles, spies, etc. He explains the differences between them, and how and when to use each. The examples are in Java, but they are easy to follow even with only a vague familiarity with the language.

Back To Basics: Regular Expressions

The fellows at thoughtbot give a great primer on regular expressions in ruby. The examples are easy to follow and yet manage to explain many of the more advanced concepts, like capture groups, expression modifiers and lookarounds.

Goto Fail, Heartbleed, and Unit Testing Culture

Back when Apple's Goto Fail bug was news, my reaction was: how did the introduction of this bug pass the tests? At the time I thought about writing a test suite around it and running it with and without the duplicated line that causes the bug, to demonstrate how tests catch regression mistakes. I never got around to it, mainly because of my lack of familiarity with the language. Martin Fowler has published a lengthy and thoughtful post that expresses the idea much better than I would have. It gives the same treatment to the Heartbleed bug and explains why testing is so important in software development.

I am lucky to work mainly in ruby, a community that is very test-oriented, even after the recent hoopla.

Unicorn vs. Puma: Round 3

MRI Ruby has gotten a lot faster since I ran my last benchmark, so it’s time for an update.

Methodology

The benchmark consists of hitting a single endpoint on a rails (4.1.1) app in production mode for 30 seconds. The endpoint reads a set number of posts from a Postgres database and then renders them in an html view using erb, without any view caching.
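
The endpoint is roughly of this shape (a hypothetical reconstruction with illustrative names; the actual controller and the number of posts may differ):

class PostsController < ApplicationController
  def index
    # Load a fixed number of posts from Postgres; the corresponding
    # erb template renders them without any view caching.
    @posts = Post.limit(100)
  end
end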

The purpose of the benchmark is to get an idea of the performance characteristics under varying load for MRI and jRuby. Unicorn was chosen for MRI because it uses the unix fork model for its processes, which is pretty much the de facto way to do concurrency in MRI. Puma was chosen for jRuby because it bills itself as a very efficient threading server (although recent versions can mix forking workers and threading). Threading is the de facto way to do concurrency on the JVM.

Of course, there are many parameters that can be tweaked in the server configuration. No benchmark is perfect, but I believe this is a good indication of the type of performance differences that can be seen between the two versions of ruby.

Here are the details:

Unicorn

  • Unicorn 4.8.3
  • Ruby 2.1.2
  • Configuration: 4 Workers, Timeout 30 seconds
  • Maximum observed memory usage: 450 MB
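
Those settings correspond to a config/unicorn.rb along these lines (a sketch; the actual file likely includes more, such as listen and preload_app directives):

# config/unicorn.rb
worker_processes 4  # each worker is a separate process handling one request at a time
timeout 30          # workers taking longer than 30 seconds are killed and respawned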

Puma

  • Puma 2.8.2
  • jRuby 1.7.12
  • Configuration: Default (maximum 16 threads)
  • Maximum observed memory usage: 482 MB
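
The defaults are equivalent to a config/puma.rb along these lines (again, just a sketch of the settings in play):

# config/puma.rb
threads 0, 16             # thread pool grows from 0 up to a maximum of 16 threads
environment 'production'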

The number of unicorn workers was chosen to match the amount of memory used by puma. In both cases, the observed memory stays below what a 1x Heroku dyno provides (but not by much).

The benchmark was run with Apache's benchmarking tool, ab, with varying levels of concurrency:

$ ab -c $USER_COUNT -t 30 $URL

Results

Both servers perform similarly in the number of requests they can handle per second. Unicorn seems to ramp up on par with its number of workers and then plateau: even though more users are hitting the endpoint concurrently, unicorn just handles 4 at a time. Puma seems to increase in capacity with more users, although there is a sharp drop-off [1] at the end when reaching 64 concurrent users.

With regard to the average (or 50th percentile) response time, it looks like both servers, surprisingly, perform exactly the same! Response times are significantly slower when the server is under heavier load, but still acceptable.

The 95th percentile and 99th percentile graphs paint a different story, though: unicorn's response times degrade more sharply as concurrency increases, which means that for some of the users, response times might easily fall into unacceptable levels.

How significant is this? For example, let's take the 32 concurrent users case: puma's 50th percentile response is 62 ms against unicorn's 64 ms. Not very different. However, when we look at the 95th percentile, puma comes in at 147 ms, which is 2.3 times the average. Unicorn comes in at 175 ms, 2.7 times the average. Looking at the 99th percentile, puma's response is 2.69 times its average response; unicorn's is a more dramatic 4.15 times. You should care about percentiles and not just the average response time.
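
To spell out that arithmetic (using the 95th percentile numbers quoted above):

# Ratio of the 95th percentile to the 50th percentile at 32 concurrent users
puts (147.0 / 62).round(2)  # puma    => 2.37
puts (175.0 / 64).round(2)  # unicorn => 2.73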

Conclusion

Since the benchmark was last run, MRI Ruby has gotten much faster (the previous round ran on 1.9.3); however, running a Rails app on jRuby still offers better performance characteristics under high load.


  [1] I do not know what that drop-off means, and it didn't seem to be there last year. However, I re-ran the benchmark many times, and got consistent results.