Tuvok picks high-rated episodes of tv-series

In: Projects
Published on 2016-09-15
Written from the perspective of a computer security analyst.

Last week, while attending the SANS SEC560 course in Brussels, I wrote a small tool in Python 3 called tuvok. It takes a tv-series and a minimum rating, and outputs a list of all episodes of that series with a rating equal to or higher than that minimum.

Why

I wrote this tool since I wanted to start watching the classic Star Trek series, but I don't have time to watch all episodes. Instead I decided to watch only The Very Best Of Star Trek: The Original Series. I needed a list of the episodes with an IMDB-rating of 7.8/10 or higher.

The Internet Movie Database allows you to view individual episodes and their properties. However, I found that IMDB does not provide a handy interface for listing all episodes of a series together with their ratings. Filtering episodes by rating is not an option either.

Thus, tuvok was born.

How

The tool uses the API of the Open Movie Database (OMDb), which returns neat JSON replies. The nice thing about OMDb is that, at the time of writing, you don't need an API key or developer registration to query the database, which makes it very accessible to developers. Of course, it is important to be 'nice' when using the API: tuvok needs only (1 + nr_seasons) queries in total, one for the series itself and one per season.
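To make the query count concrete, here is a minimal sketch of how those (1 + nr_seasons) request URLs could be built. This is not taken from tuvok itself: the endpoint `http://www.omdbapi.com/`, the `i` and `Season` parameters, the optional `apikey` parameter and all function names are my assumptions about the OMDb interface, so check them against the OMDb documentation before relying on them.

```python
from urllib.parse import urlencode

# Assumed OMDb endpoint; not confirmed by this post.
OMDB_ENDPOINT = "http://www.omdbapi.com/"


def build_url(imdb_id, season=None, api_key=None):
    """Build one query URL (hypothetical helper, parameter names assumed)."""
    params = {"i": imdb_id}
    if season is not None:
        params["Season"] = season
    if api_key is not None:
        # Assumption: a key may be required by newer versions of the API.
        params["apikey"] = api_key
    return OMDB_ENDPOINT + "?" + urlencode(params)


def all_urls(imdb_id, nr_seasons, api_key=None):
    """Return the 1 + nr_seasons URLs: one for the series, one per season."""
    urls = [build_url(imdb_id, api_key=api_key)]
    for season in range(1, nr_seasons + 1):
        urls.append(build_url(imdb_id, season=season, api_key=api_key))
    return urls


if __name__ == "__main__":
    for url in all_urls("tt0060028", 3):
        print(url)
```

For a series with three seasons this yields four URLs, and nothing is sent over the network until you actually fetch them.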

To start, tuvok needs the IMDB-identifier of a series (a string) and a minimum rating (a float). It then queries OMDb for information about the tv-series and each of its seasons, collecting information about all the episodes in the process. The JSON-reply to a season-query also contains the basic properties of all episodes in that season. I like this, as it keeps the number of requests very low.
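The filtering step itself is simple. The sketch below shows one way to pick episodes from a single season reply; it is an illustration, not the actual tuvok code. I am assuming the reply has an `Episodes` list whose items carry `Title`, `Episode` and `imdbRating` as strings, and that a missing rating shows up as `"N/A"`. The sample dictionary is hand-written: the two real titles and ratings come from the Season 2 output of tuvok, the other two entries are made up to show what gets dropped.

```python
def parse_rating(value):
    """Convert a rating string to float; return None if it is not a number."""
    try:
        return float(value)
    except (TypeError, ValueError):
        return None


def pick_episodes(season_reply, min_rating):
    """Return (rating, episode_number, title) for episodes >= min_rating."""
    picked = []
    for episode in season_reply.get("Episodes", []):
        rating = parse_rating(episode.get("imdbRating"))
        if rating is not None and rating >= min_rating:
            picked.append((rating, int(episode["Episode"]), episode["Title"]))
    return picked


# Hand-written sample in the assumed reply format (not a real API reply).
sample_reply = {
    "Season": "2",
    "Episodes": [
        {"Title": "Amok Time", "Episode": "1", "imdbRating": "8.8"},
        {"Title": "(made-up low-rated episode)", "Episode": "2", "imdbRating": "7.0"},
        {"Title": "(made-up unrated episode)", "Episode": "3", "imdbRating": "N/A"},
        {"Title": "Mirror, Mirror", "Episode": "4", "imdbRating": "9.2"},
    ],
}

if __name__ == "__main__":
    for rating, number, title in pick_episodes(sample_reply, 8.3):
        print(f"{rating:<8} {number} {title}")
```

Unrated episodes are skipped rather than treated as zero, so they can never slip through a low minimum rating by accident.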

The IMDB-identifier is easy to find: just surf to the series page on IMDB and look at the URL. The IMDB-identifier of a title (a movie, series or episode) is of the form 'ttXXXXXXX'. Note that there are also identifiers that start with different letters, such as 'nm' (name) and 'ch' (character).
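If you would rather paste the whole URL than copy the identifier by hand, a small regular expression does the job. This is a sketch under the assumption that series pages have `/title/tt<digits>` in their path; the helper name `imdb_id_from_url` is hypothetical and not part of tuvok.

```python
import re

# Matches the 'tt' identifier in a title URL; 'nm' and 'ch' pages are ignored.
TITLE_ID = re.compile(r"/title/(tt\d+)")


def imdb_id_from_url(url):
    """Return the 'tt...' identifier found in url, or None if there is none."""
    match = TITLE_ID.search(url)
    return match.group(1) if match else None


if __name__ == "__main__":
    print(imdb_id_from_url("https://www.imdb.com/title/tt0060028/"))
```

A URL that points to a name or character page returns None, which is a convenient early check before any query is sent.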

Here is a sample input/output of tuvok:

$ python3
>>> import tuvok
>>> tuvok.main("tt0060028",8.3)
Season 1
----------
8.5      11 The Menagerie: Part I
8.4      12 The Menagerie: Part II
9.0      14 Balance of Terror
9.0      22 Space Seed
8.5      25 The Devil in the Dark
8.3      26 Errand of Mercy
9.3      28 The City on the Edge of Forever

Season 2
----------
8.8      1 Amok Time
9.2      4 Mirror, Mirror
8.8      6 The Doomsday Machine
8.6      10 Journey to Babel
9.0      15 The Trouble with Tribbles

Season 3
----------
8.6      2 The Enterprise Incident
8.3      23 All Our Yesterdays

Star Trek (1966-1969) has 14 episodes with rating >= 8.3.

Limitations

Of course, taking a rating-based approach to episode-picking has its limitations. Some series get such (skewed?) high ratings for every episode that filtering purely on rating does not help in separating the average episodes from the truly good ones. Take for example Downton Abbey: all 52 episodes have a rating of 8.2 or higher! Consequently, using tuvok to pick a subset of episodes of such a series will not yield interesting results.
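A quick way to notice this problem is to look at how selective a given minimum rating actually is. The sketch below is my own illustration and not a feature of tuvok; the function name and both rating lists are made up purely to show the effect.

```python
def selectivity(ratings, min_rating):
    """Return the fraction of ratings that are >= min_rating (0.0 if empty)."""
    if not ratings:
        return 0.0
    passed = sum(1 for rating in ratings if rating >= min_rating)
    return passed / len(ratings)


# Made-up ratings, for illustration only.
uniformly_high = [8.2, 8.4, 8.5, 8.6, 8.9]
widely_spread = [6.9, 7.4, 7.8, 8.5, 9.0]

if __name__ == "__main__":
    print(selectivity(uniformly_high, 8.2))
    print(selectivity(widely_spread, 8.2))
```

When the fraction is close to 1, the filter keeps nearly everything and the chosen minimum rating tells you very little.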

Secondly, dealing with user ratings is hard to begin with. Sites like IMDB, Rotten Tomatoes and RateBeer all have their own algorithms to calculate ratings. The question of how to transform the ratings of thousands of users into a single number is a difficult one to answer.

We could start an endless discussion about the meaning and value of user ratings. I won't go into that here, although it is a fascinating subject, especially for data scientists.

More info

You can find the source of tuvok on GitHub.