In this post I want to share some fun data sources for NLG and NLP. In an earlier post, I mentioned why a suitable corpus is important, especially when you’re doing natural language generation. Since I’m working on natural language generation for video games, I try to find good datasets for that domain. However, finding datasets with text data from games is hard, which means that often I need to scrape and compile my own corpora.

I’m also collecting examples of games that already use NLG, to see which techniques are used by developers and how these affect the quality of the game.

Game code

Finding information about NLG in games is hard, since game companies generally don’t publish their game code. Luckily for me, it’s easy to inspect the code of games that run on client-side JavaScript, such as Cookie Clicker, and of open-source games like Max Kreminski’s idle game Epitaph.

Both games feature NLG techniques, which is an added bonus. Cookie Clicker has a news ticker that shows weird headlines depending on your progress in the game. Epitaph is text-based and procedurally generated, which means it runs completely on NLG. The game generates fictional societies, languages and Civilization-like tech trees.
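To make the idea of a progress-dependent news ticker concrete, here is a minimal Python sketch of template-based headline selection. It is not taken from either game's code: the `TEMPLATES` list, the score thresholds and the `pick_headline` function are all made up for illustration.

```python
import random

# Hypothetical headline templates, each unlocked once the player reaches
# a minimum score. Thresholds and wording are invented for this sketch.
TEMPLATES = [
    (0, "Local bakery opens; nobody notices."),
    (100, "Town reports {score} freshly baked goods and counting."),
    (10_000, "Scientists alarmed as output passes {score} units."),
]


def pick_headline(score, rng=random):
    """Pick a random headline among the templates unlocked at this score."""
    unlocked = [text for minimum, text in TEMPLATES if score >= minimum]
    # Fill in the {score} slot; templates without a slot are left unchanged.
    return rng.choice(unlocked).format(score=score)


print(pick_headline(150))
```

Because later templates only unlock at higher scores, the pool of possible headlines grows as the player progresses, which is one simple way to tie generated text to game state.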

Fan wikis

Popular games such as the RPGs Diablo, Deus Ex and BioShock have crowd-sourced FANDOM wikis, where you can find collections of game text. The Deus Ex wiki contains the text of in-game ebooks and newspapers. The BioShock wiki has transcriptions of all audio diaries that the player can find in the game. Warning: crawling these wikis might lead to spoilers. ;)
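Wikis built on MediaWiki usually expose an `api.php` endpoint, which is friendlier than scraping HTML. Below is a rough Python sketch that asks the MediaWiki Action API for the raw wikitext of a single page. The endpoint URL and page title in the usage comment are placeholders, not real addresses, and the sketch assumes the wiki runs a MediaWiki version that supports `rvslots` and `formatversion=2`.

```python
import json
import urllib.parse
import urllib.request


def build_query_url(api_url, title):
    """Build a MediaWiki API URL that requests the latest wikitext of one page."""
    params = {
        "action": "query",
        "prop": "revisions",
        "rvprop": "content",
        "rvslots": "main",
        "titles": title,
        "format": "json",
        "formatversion": "2",
    }
    return api_url + "?" + urllib.parse.urlencode(params)


def extract_wikitext(response):
    """Return the wikitext from a decoded formatversion=2 response, or None."""
    page = response["query"]["pages"][0]
    if page.get("missing"):
        return None  # the wiki has no page with this title
    return page["revisions"][0]["slots"]["main"]["content"]


def fetch_wikitext(api_url, title):
    """Download one page's wikitext. Needs network access."""
    request = urllib.request.Request(
        build_query_url(api_url, title),
        headers={"User-Agent": "nlg-corpus-sketch/0.1"},  # hypothetical name
    )
    with urllib.request.urlopen(request) as reply:
        return extract_wikitext(json.load(reply))


# Usage (placeholder endpoint and title, replace with the wiki you want):
# text = fetch_wikitext("https://example.org/api.php", "Some Page Title")
```

The result is still wiki markup, so templates and link brackets need to be stripped before the text is usable as a corpus.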

Extracting dialogue data from games

Javier Torres has written a manual for extracting dialogue from Morrowind using the Elder Scrolls toolkit.


Scripts and transcripts

Film scripts do not contain video game text, but they are still fun for my application domain, especially for generating dialogue.

I’m really impressed by the fan community of the D&D web series Critical Role, who have transcribed all of its episodes.

Transcripts of Star Trek: Voyager episodes are also available online. I’m secretly planning on turning these into a Vulcan IRC bot.
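A first step towards a character-specific bot is grouping transcript lines by speaker. Here is a small Python sketch, assuming each spoken line looks like `SPEAKER: utterance` with the speaker name in capitals. Real transcripts may be formatted differently, and the sample lines below are invented, not quoted from any episode.

```python
import re
from collections import defaultdict

# Assumed line format: an upper-case speaker name, a colon, then the utterance.
SPEAKER_LINE = re.compile(r"^([A-Z][A-Z0-9 .'\-]*):\s*(.+)$")


def lines_by_speaker(transcript):
    """Group utterances by speaker; lines without a speaker prefix are skipped."""
    spoken = defaultdict(list)
    for raw in transcript.splitlines():
        match = SPEAKER_LINE.match(raw.strip())
        if match:
            speaker, utterance = match.groups()
            spoken[speaker].append(utterance)
    return dict(spoken)


# Invented sample in the assumed format.
sample = """[Bridge]
TUVOK: That conclusion is not supported by the data.
JANEWAY: Then find me better data.
TUVOK: I will begin at once."""

corpus = lines_by_speaker(sample)
print(corpus["TUVOK"])
```

Collecting everything one character says gives a small single-speaker corpus, which can then feed whatever generation method the bot uses.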

I hope you liked this list of text resources. If you know other fun datasets that could be of use to NLG/NLP research in the context of video games, let me know!