NLP on social media messages

Today, Twitter user @iamdevloper started a new meme asking for Harry Potter titles with technical references in them. People posted so many replies that a friend asked me whether I could find out which ones were the best. This post is a write-up of my method. If you only want the results, skip to the top 50 at the end of this post.

Method

I used the Twitter API (Python 3 and python-twitter) to scrape 3203 tweets that reply to @iamdevloper. I filtered specifically on replies to @iamdevloper that are newer than the original meme, and I excluded retweets because I don’t want them in my results.

These elements make up the raw Twitter search query:

@iamdevloper 
filter:replies
since_id:1197281108213293056
exclude:retweets 
-filter:nativeretweets 
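
One way to assemble the elements above into a single raw query string is sketched below. The helper name build_raw_query and the count and tweet_mode parameters are assumptions made for illustration and are not part of the original script. Only the standard library is used, so the snippet runs without API credentials.

```python
from urllib.parse import urlencode, parse_qs

ORIGINAL_TWEET_ID = 1197281108213293056  # the original meme


def build_raw_query(account, since_id, count=100):
    """Combine the search operators listed above into one URL-encoded query string."""
    operators = [
        f"@{account}",             # tweets mentioning the account
        "filter:replies",          # replies only
        "exclude:retweets",        # no retweets
        "-filter:nativeretweets",  # no native retweets either
    ]
    params = {
        "q": " ".join(operators),
        "since_id": since_id,      # only tweets newer than the original meme
        "count": count,            # assumption: page size per request
        "tweet_mode": "extended",  # assumption: request the untruncated full_text
    }
    return urlencode(params)


raw_query = build_raw_query("iamdevloper", ORIGINAL_TWEET_ID)
print(raw_query)
# Decode it again to check that the operators survived the encoding.
print(parse_qs(raw_query)["q"][0])
```

python-twitter’s GetSearch takes a string of this shape through its raw_query argument, so the result could be passed along as api.GetSearch(raw_query=raw_query).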

I filtered these posts to just include direct replies to the original meme. This left me with a dataset of 2733 tweets.

target = api.GetStatus(1197281108213293056)  # the original meme
target = target.AsDict()
# keep only the tweets that reply directly to the original meme
replies = [t for t in tweets
           if t.get("in_reply_to_status_id") == target["id"]]

We came up with two ways to determine “the best” Twitter posts in the series.

Rank 1: reach

We can sort tweets by reach, i.e. the number of users that were (potentially) reached by a certain post. Each tweet reaches the followers of its author and the followers of every person who retweets it. I wasn’t able to get the number of followers for every user that retweeted a tweet, so I decided to multiply the number of retweets by the median number of followers of all users in my dataset.

import math

# followers_count is missing from the dict when a user has 0 followers
followers = [tweet["user"]["followers_count"] for tweet in tweets_
             if "followers_count" in tweet["user"]]
# add the users with 0 followers back, as the list comprehension skipped them
followers += (len(tweets_) - len(followers)) * [0]
average_followers = sum(followers) / len(followers)
followers.sort()
median_followers = followers[math.floor(len(followers) / 2)]  # 192 is the median of followers_count

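def reach(tweet, median_followers):
    # followers of the author, plus an estimate for the followers of every retweeter;
    # .get() falls back to 0 in case a key is missing from the dict
    return (tweet["user"].get("followers_count", 0) +
            tweet.get("retweet_count", 0) * median_followers)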

reach_statistics = [(tweet, reach(tweet, median_followers)) for tweet in replies]
reach_statistics.sort(key=lambda x: x[1], reverse=True)
for tweet, score in reach_statistics[:10]:
    print(tweet['full_text'], score)

@iamdevloper Harry Potter and the Useless Use of Cat 185845
@iamdevloper Harry Potter and the Bashed Bin of Unix 131775
@iamdevloper Config of WebPack 74163
@iamdevloper haslayout 67400
@iamdevloper Harry Potter and the #HollowsHunter 41347
@iamdevloper Harry Potter and the Prisoner of Kanban 40799
@iamdevloper Harry Potter and the Image Pull Policy 35719
@iamdevloper Harry Potter and the Room of Unclear Requirements 28759
@iamdevloper Race Condition Harry Potter and the 26434
@iamdevloper hollow of vim. 25424
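
To see why the median is a safer stand-in than the mean, here is a small self-contained sketch of the same estimate on made-up tweets. The name estimate_reach and all numbers are invented for illustration. It uses statistics.median, which agrees with the floor-index lookup above for lists of odd length but averages the two middle values for lists of even length.

```python
import statistics


def estimate_reach(tweet, typical_followers):
    """Followers of the author plus an estimate for the followers of each retweeter."""
    own = tweet["user"].get("followers_count", 0)  # key may be absent for 0 followers
    retweets = tweet.get("retweet_count", 0)       # key may be absent for 0 retweets
    return own + retweets * typical_followers


# Made-up tweets, not taken from the dataset.
toy_tweets = [
    {"full_text": "A", "user": {"followers_count": 10}, "retweet_count": 5},
    {"full_text": "B", "user": {"followers_count": 5000}},  # never retweeted
    {"full_text": "C", "user": {}, "retweet_count": 1},     # author has 0 followers
]

follower_counts = [t["user"].get("followers_count", 0) for t in toy_tweets]
typical = statistics.median(follower_counts)  # 10; the mean would be 1670
mean = statistics.mean(follower_counts)

ranked = sorted(toy_tweets, key=lambda t: estimate_reach(t, typical), reverse=True)
for t in ranked:
    print(t["full_text"], estimate_reach(t, typical))
# B 5000
# A 60
# C 10
```

With the mean, the single large account would inflate the estimate for every retweet: tweet A would jump from 60 to 8360 and overtake B, even though nothing is known about who retweeted it.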

Rank 2: popularity

We can also sort tweets by popularity, i.e. calculate a popularity metric based on number of likes and number of retweets. I decided to count favorites with weight 1, and retweets with weight 3.

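def popularity(tweet, favorite_weight, retweet_weight):
    # weighted sum of likes and retweets;
    # .get() falls back to 0 in case a key is missing from the dict
    return (tweet.get("favorite_count", 0) * favorite_weight +
            tweet.get("retweet_count", 0) * retweet_weight)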

popularity_statistics = [ (tweet, popularity(tweet, 1, 3)) for tweet in replies]
popularity_statistics.sort(key=lambda x:x[1], reverse=True)
for i in range(10):
    print(popularity_statistics[i][0]['full_text'], popularity_statistics[i][1])

@iamdevloper Harry Potter and the Prisoner of Kanban 2233
@iamdevloper Harry Potter and the Room of Unclear Requirements 2030
@iamdevloper Race Condition Harry Potter and the 1936
@iamdevloper Harry Potter and the Prisoner of Vim 1417
@iamdevloper Chamber of NullPointerExceptions 854
@iamdevloper Harry potter and the unexpected ":" on line 43. 724
@iamdevloper szynszyliszys Harry Potter and the [object Object] 532
@iamdevloper Prisoner of jQuery 404
@iamdevloper Harry Potter and the ;SELECT * FROM users; 362
@iamdevloper HaaS (Hogwarts as a Service) 347
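
The choice of weights is a judgment call, and it can change the ranking. The sketch below shows this on two made-up tweets. The helper top and the numbers are invented for illustration.

```python
def popularity(tweet, favorite_weight=1, retweet_weight=3):
    """Weighted sum of likes and retweets; missing counts are treated as 0."""
    return (tweet.get("favorite_count", 0) * favorite_weight
            + tweet.get("retweet_count", 0) * retweet_weight)


# Made-up tweets, not taken from the dataset.
toy = [
    {"full_text": "liked a lot", "favorite_count": 100, "retweet_count": 10},
    {"full_text": "shared a lot", "favorite_count": 40, "retweet_count": 35},
]


def top(tweets, **weights):
    """Return the text of the highest-scoring tweet under the given weights."""
    return max(tweets, key=lambda t: popularity(t, **weights))["full_text"]


print(top(toy))                    # weights 1 and 3: 130 vs 145 -> shared a lot
print(top(toy, retweet_weight=1))  # weights 1 and 1: 110 vs 75  -> liked a lot
```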

Prett(y|ier) printing

I liked the results of the popularity metric best. I sorted the tweets by popularity and formatted the results into something a bit more readable.

First, I removed all usernames at the beginning of the tweet with a regular expression. Then I checked whether “Harry Potter” was already part of the tweet. If it wasn’t, I prepended “Harry Potter and the ” to the tweet. Unfortunately, this does not work perfectly, as some people inserted ampersands, abbreviations and other deviations from @iamdevloper’s fill-in-the-blank template. There’s a lot more I could do to make it prettier. However, I’m just interested in getting the results in a readable format /fast/.

import re

for i in range(50):
    booktitle = popularity_statistics[i][0]['full_text']
    # delete usernames at the start of the tweet
    booktitle = re.sub(r"^(@[A-Za-z0-9_]+ )+\s*", "", booktitle)
    if "Harry Potter" not in booktitle:
        print("Harry Potter and the", booktitle)
    else:
        print(booktitle)
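
The same clean-up can be wrapped in a function so it is easy to try on single tweets. This is a sketch, not the code used for the results: the name clean_title is made up, and unlike the loop above it compares case-insensitively, so a tweet that starts with “Harry potter” does not get a second prefix.

```python
import re

# One or more @mentions at the very start of the tweet.
LEADING_MENTIONS = re.compile(r"^(@[A-Za-z0-9_]+\s+)+")


def clean_title(text):
    """Strip leading @mentions and prepend the title template when it is missing."""
    title = LEADING_MENTIONS.sub("", text).strip()
    if "harry potter" in title.lower():  # assumption: ignore case when checking
        return title
    return "Harry Potter and the " + title


print(clean_title("@iamdevloper Prisoner of jQuery"))
print(clean_title('@iamdevloper Harry potter and the unexpected ":" on line 43.'))
```

Mentions in the middle of a tweet and usernames without an @ are left alone, so some manual clean-up is still needed.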

Bonus: puns

So, which Harry Potter title is most suitable for making puns? Let’s count how many tweets refer to each book in their text:

books = [
    "Harry Potter and the Philosopher's Stone", 
    'Harry Potter and the Chamber of Secrets', 
    'Harry Potter and the Prisoner of Azkaban', 
    'Harry Potter and the Goblet of Fire', 
    'Harry Potter and the Order of the Phoenix', 
    'Harry Potter and the Half-Blood Prince', 
    'Harry Potter and the Deathly Hallows']
keywords = [
    ['sorcerer','philosopher', 'stone'], 
    ['chamber', 'secret'], 
    ['prisoner', 'azkaban'], 
    ['goblet', 'fire'], 
    ['order', 'phoenix'], 
    ['blood', 'prince', 'half'], 
    ['deathly', 'hallows']]

bookcounts = {}
for book in books:
  bookcounts[book] = 0

for t in tweets_:
  for index, book in enumerate(books):
    for keyword in keywords[index]:
      if keyword in t["full_text"].lower():
        bookcounts[book] += 1
        break # count each book only once

for book in books:
  print(book, bookcounts[book])

Harry Potter and the Philosopher's Stone 83
Harry Potter and the Chamber of Secrets 178
Harry Potter and the Prisoner of Azkaban 131
Harry Potter and the Goblet of Fire 85
Harry Potter and the Order of the Phoenix 131
Harry Potter and the Half-Blood Prince 103
Harry Potter and the Deathly Hallows 94

Most tweets don’t refer to a specific book at all, but Chamber of Secrets has the most puns.
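
For experimenting with other keyword lists, the counting can also be written as a small function. The sketch below uses a shortened keyword table and a handful of titles from the results list at the end of this post. The names BOOK_KEYWORDS and count_book_references are made up for illustration.

```python
from collections import Counter

# Shortened keyword table for the example; extend it with the other books as needed.
BOOK_KEYWORDS = {
    "Chamber of Secrets": ["chamber", "secret"],
    "Prisoner of Azkaban": ["prisoner", "azkaban"],
    "Deathly Hallows": ["deathly", "hallows"],
}


def count_book_references(texts, book_keywords):
    """Count, per book, how many texts contain at least one of its keywords."""
    counts = Counter({book: 0 for book in book_keywords})
    for text in texts:
        lowered = text.lower()
        for book, keywords in book_keywords.items():
            # any() stops at the first hit, so each text counts once per book
            if any(keyword in lowered for keyword in keywords):
                counts[book] += 1
    return counts


sample = [
    "Harry Potter and the Prisoner of Kanban",
    "Harry Potter and the Chamber of Shared Secrets",
    "Harry Potter and the Container of Secrets",
    "Harry Potter and the Blue Screen of Deathly Hallows",
    "Harry Potter and the Infinite Loop",
]
counts = count_book_references(sample, BOOK_KEYWORDS)
for book, count in counts.items():
    print(book, count)
```

Keep in mind that this is plain substring matching, so short keywords overcount: “fire” also matches “firewall”, and “order” also matches “border”.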

So, which things are described as DEATHLY or SECRET by the replying punsters?

for t in tweets_:
  if "deathly" in t["full_text"].lower():
    print(re.search("deathly.*",t["full_text"].lower()).group())

    [ some omitted grep, tr, sort and uniq magic ]

      2 deathly deploy
      2 deathly exception
      2 deathly hallow
      2 deathly haskell
      2 deathly nullpointerexception
      3 deathly bug
      3 deathly ehlo
      3 deathly malloc
      4 deathly hello world

for t in tweets_:
  if "of secrets" in t["full_text"].lower():
    print(re.search("[^ ]* of secrets",t["full_text"].lower()).group())

    [ some omitted grep, tr, sort and uniq magic ]

      2 docker of secrets
      2 documentation of secrets
      2 repository of secrets
      2 vault of secrets
      3 container of secrets
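
The counting step can also be done in Python instead of on the command line. The sketch below is a stand-in and not the omitted shell pipeline: the name count_phrases, the one-word pattern and the min_count cut-off are assumptions, and the sample texts are only there to make it runnable.

```python
import re
from collections import Counter


def count_phrases(texts, pattern, min_count=2):
    """Count the first match of pattern in each lowercased text.

    Returns (phrase, count) pairs seen at least min_count times,
    sorted by ascending count and then alphabetically.
    """
    counts = Counter()
    for text in texts:
        match = re.search(pattern, text.lower())
        if match:
            counts[match.group()] += 1
    frequent = (item for item in counts.items() if item[1] >= min_count)
    return sorted(frequent, key=lambda item: (item[1], item[0]))


# Sample texts for illustration only.
toy_texts = [
    "Harry Potter and the Deathly Malloc",
    "harry potter and the deathly malloc",
    "Harry Potter and the Deathly Bug",
    "Harry Potter and the Container of Secrets",
]

for phrase, count in count_phrases(toy_texts, r"deathly \w+"):
    print(f"{count:7d} {phrase}")
```

Passing r"\w+ of secrets" as the pattern gives the equivalent count for the second listing.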

Results: top 50 Tech Hogwarts titles

Harry Potter and the Prisoner of Kanban
Harry Potter and the Room of Unclear Requirements
Race Condition Harry Potter and the
Harry Potter and the Prisoner of Vim
Harry Potter and the Chamber of NullPointerExceptions
Harry potter and the unexpected ":" on line 43.
Harry Potter and the [object Object]
Harry Potter and the Prisoner of jQuery
Harry Potter and the ;SELECT * FROM users;
Harry Potter and the HaaS (Hogwarts as a Service)
Harry Potter and the Stack Trace of Secrets
Harry Potter and the Static site generator
Harry Potter and the Hotfix on Production
Harry Potter and the ________ // TODO
Harry Potter and the Order of the Arguments
Harry Potter and the Unexpected Identifier
Harry potter and the prisoner of the legacy system
Harry Potter and the Container of Secrets
Harry Potter and the Bubble-Sorting Hat
Harry Potter and the Unsecured AWS Container
Harry Potter and the harry potter and the harry potter and the 
    harry potter and the harry potter and the harry potter and 
    the harry potter and the harry potter and the harry potter 
    and the CTRL-C [as Voldermort: a while(true) loop]
Harry Potter and the Problem That Couldn't Possibly Be DNS.
Harry Potter and the Conflicts of Git
Harry Potter and the Passwords in Plain Text
Harry Potter and the Code that Worked on His Machine
Harry Potter and the Kernel of Panic
Harry Potter and the marketing manager who always blames site 
    speed for poor sales performance.
Harry Potter and the Full-Stack Prince
Harry Potter and the Infinite Loop
Harry Potter and the Chamber of Application Secrets
Harry Potter and the Order of the Phoenix Project
Harry Potter and the Half-Blood Print Statement
Harry Potter and the Deathly Malloc
Harry Potter and the Config of WebPack
Harry Potter and the [Object object]
Harry Potter and the Prisoner of Friday Deployments.
Harry Potter and the Goblet of Java
Harry Potter and the Chamber of Shared Secrets
Harry Potter and the Blue Screen of Deathly Hallows
Harry Potter and the Order of the Nginx
Harry Potter and the Missing Node Module
Harry Potter and the Useless Use of Cat
Harry Potter and the De-referenced Null Pointer
Harry Potter and the conflict of kubernetes.
Harry Potter and the facepalm of vulnerabilities.
Harry Potter and the Deathly Requirements.
Harry Potter and the Arrow of Lambda
Harry Potter and the Source on Your Phone.
Harry Potter and the Order of the Callbacks
Harry Potter and the Accidently Delete Master Branch
Harry Potter anD 7#!../'*zz)n3,+1j¥%£<22e.-3;_)#;dj82;I✊$1@;
    Segmentation fault (core dumped)
Harry Potter and Order of Flex Items
Harry Potter and the Chamber of Segfaults.
Harry Potter and the Image Pull Policy
Harry Potter and the CHMOD of secrets.
Harry Potter and prisoner of ASCII-ban.

If you want even more, here’s the (uncleaned) top 1000.

Beware of unescaped, unsafe, untrusted user input. ;)