Histograms with time using arrow and matplotlib

I needed to make a histogram of a list of timestamps to better understand the load on a server, so I decided to see if I could use Arrow and matplotlib together.

Arrow is the requests of time for Python. In other words, it is a wonderful Python module that allows the user to manipulate times and timestamps very intuitively, much like requests makes dealing with HTTP/HTTPS requests easy. matplotlib is a well-known Python module for creating beautiful plots. It is very powerful, but its API shows its age, since it is quite convoluted.

Overall, creating such a histogram is simple, and I wanted to highlight a couple of pitfalls and nice things that I encountered during the process.

You can find the full script in this gist. It expects a file with a list of timestamps (in UTC format, but it should work with other formats too) and produces a histogram of their distribution. You can find my input data on this pastebin.

Pitfalls

1. make matplotlib understand dates

matplotlib manages dates internally in a numeric format that I don't really understand, but its API provides a date2num function (in matplotlib.dates) that translates datetime objects into matplotlib's date numbers.

To get a datetime object from an Arrow object just do: atime.datetime.
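As a minimal sketch of this conversion, the snippet below turns an Arrow object into a matplotlib date number and back. The variable names and the sample timestamp are only illustrative, and the exact number printed depends on the epoch used by the installed matplotlib version, so no value is shown.

```python
import arrow
from matplotlib.dates import date2num, num2date

# Illustrative timestamp, not taken from the real input data.
atime = arrow.get("2017-12-05T17:55:37+01:00")

# Arrow object -> datetime -> matplotlib date number (a float).
mpl_date = date2num(atime.datetime)
print(mpl_date)

# num2date goes the other way and returns a timezone-aware datetime (UTC by default).
roundtrip = num2date(mpl_date)
print(roundtrip.isoformat())
```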

2. print legible dates as labels for the x-axis

This does not need much comment, besides the fact that, in combination with the problem above, it took me quite a while to figure out why I was getting all sorts of weird numbers.

Nice moments

1. time spans

To create the histogram I wanted to take as minimum and maximum, respectively:

  • the beginning (XX:00:00) of the hour one hour before the smallest timestamp. So, if the minimum timestamp was 2017-12-05T17:55:37.806460+01:00, I wanted 2017-12-05T16:00:00.000000+01:00;
  • the end (XX:59:59) of the hour one hour after the greatest timestamp. So, if the maximum timestamp was 2017-12-13T19:54:37.361527+01:00, I wanted 2017-12-13T20:59:59.000000+01:00.

With Arrow all of this is quite straightforward.
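A minimal sketch of this computation follows. It assumes that timestamps is a list of Arrow objects (the three values here are made up for illustration) and a recent Arrow release, where shifting by a relative amount is done with shift rather than with the plural form of replace.

```python
import arrow

# Hypothetical sample data; the real script reads the timestamps from a file.
timestamps = [
    arrow.get("2017-12-05T17:55:37.806460+01:00"),
    arrow.get("2017-12-09T08:12:03.000000+01:00"),
    arrow.get("2017-12-13T19:54:37.361527+01:00"),
]

# Beginning of the hour that lies one hour before the smallest timestamp.
start = min(timestamps).shift(hours=-1).span("hour")[0]

# End of the hour that lies one hour after the greatest timestamp.
end = max(timestamps).shift(hours=+1).span("hour")[1]

print(start.isoformat())
print(end.isoformat())
```

Note that the end of a span returned by Arrow is the last microsecond of the hour, so end is 20:59:59.999999 rather than 20:59:59.000000.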

What the code above does:

  • min(timestamps) (max(timestamps)) finds the minimum (respectively, maximum) timestamp within the list timestamps;
  • .replace(hours=-1) (.replace(hours=+1)) returns a new timestamp with 1 hour subtracted from (respectively, added to) the given timestamp; recent Arrow releases spell this .shift(hours=-1) (.shift(hours=+1));
  • .span('hour') returns a tuple of length 2 with the beginning and the end of the hour containing the given timestamp, so [0] selects the beginning and [1] the end (you can use other spans like 'year' and 'minute').

2. time ranges

Looping over time ranges with Arrow is very simple: a list comprehension is enough to create a list of timestamps between start and end with a spacing of 30 minutes.
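One way to write such a list comprehension is sketched below. It is an assumption about how this can be done, not necessarily the code from the gist: it iterates minute by minute with Arrow.range and keeps only the half-hour marks. The start and end values are illustrative.

```python
import arrow

# Illustrative bounds; in the script they come from the time-span step.
start = arrow.get("2017-12-05T16:00:00+01:00")
end = arrow.get("2017-12-05T18:59:59+01:00")

# Arrow.range yields one Arrow object per minute between start and end;
# the filter keeps only the timestamps that fall on :00 and :30.
bins = [
    t
    for t in arrow.Arrow.range("minute", start.datetime, end.datetime)
    if t.minute % 30 == 0
]

for t in bins:
    print(t.isoformat())
```

The resulting list can be converted with date2num and passed as the bin edges of the histogram.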

3. personalized x-ticks with time

Timestamps are handled by matplotlib as numbers (days since an epoch), so we need to tell the plot that we want to print those numbers in a date format.

The code sets a DateFormatter on the x-axis so that the timestamps are shown in the format '%Y-%m-%d %H:00'; it then creates a list of timestamps from start to end spaced by 6 hours and sets them as ticks for the axis.
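A minimal sketch of these two steps is shown below. The time window is made up, the Agg backend is selected only so that the snippet runs without a display, and passing tz to DateFormatter is my assumption about the desired output: without it, matplotlib would print the labels in its default timezone (UTC unless configured otherwise).

```python
import arrow
import matplotlib

matplotlib.use("Agg")  # non-interactive backend, no display needed

import matplotlib.pyplot as plt
from matplotlib.dates import DateFormatter, date2num

# Illustrative time window; in the script it comes from the time-span step.
start = arrow.get("2017-12-05T16:00:00+01:00")
end = arrow.get("2017-12-06T16:59:59+01:00")

fig, ax = plt.subplots()

# Print the x values as dates, using the UTC offset of the data.
formatter = DateFormatter("%Y-%m-%d %H:00", tz=start.tzinfo)
ax.xaxis.set_major_formatter(formatter)

# One tick every 6 hours: iterate hour by hour and keep every sixth entry.
hours = list(arrow.Arrow.range("hour", start.datetime, end.datetime))
ticks = [date2num(t.datetime) for t in hours[::6]]
ax.set_xticks(ticks)
ax.set_xlim(date2num(start.datetime), date2num(end.datetime))

# Rotate the labels so that they do not overlap.
fig.autofmt_xdate()

# The formatter can also be called directly to preview the labels.
labels = [formatter(t) for t in ticks]
print(labels)
```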

The final result

And here's the final result. Enjoy!
Histogram of submission times


The image at the top is by Mrs Airwolfhound via Flickr, released under a Creative Commons Attribution-NonCommercial-NoDerivs 2.0 Generic (CC BY-NC-ND 2.0) license.