Talk transcript: Fantastic strings and where to find them

Written from the perspective of a third-year PhD candidate in the Netherlands.

This is a textual version of the talk I gave at the Intelligent Narrative Technologies workshop at AIIDE 2020, for the research paper titled Fantastic Strings and Where to Find Them. The talk was transcribed using Whisper.cpp and co-edited with the open beta of ChatGPT to improve the flow of the text.


Hi everyone, thank you for tuning in to my talk. My name is Judith van Stegeren and I collected three new datasets with video game text.

But beyond just discussing the three datasets, I also want to address the underlying problem. Why is it so challenging for researchers to create new video game text corpora? And what are some possible approaches that you could take if you want to create a new dataset?

I am a PhD student at the University of Twente and my research focuses on language generation for video games. There are many different types of texts in video games, but the ones that I am most interested in are in-game dialogue, game narratives, and flavor text for RPGs. Flavor text is the decorative text that is not essential for gameplay or completing the game, but adds more depth to the game world.

Video game datasets

As I began my research, I was curious to learn what other datasets were being used by researchers in the field. Personally, I read a lot of papers from the procedural content generation and natural language processing fields. And I found that most researchers tend to work with texts from other domains, even if they list video games as their intended application domain. For example, researchers working on story generation use story corpora, and those working on dialogue generation use dialogue corpora. However, these corpora are typically sourced from domains outside of video games.

One exception is Skyrim. Many papers use data from Skyrim because it comes with its own modding software, which is relatively user-friendly. This makes me think that we should encourage game developers to publish official modding software, as it can be very helpful for researchers.

Game writing is different from other writing

Additionally, I started reading about how game writers work in the industry, as I believe this could provide useful inspiration for creating my own text generators. I came across a great book by Chris Bateman in which the authors argue that game writing is fundamentally different from other types of writing, including other fiction writing.

Firstly, there is the interactive aspect of games to consider. But it's also because of the game development process. Game writers have to be able to adapt to the sometimes unpredictable development process, as games can change stories, genres, or styles within a few months. Game writers need to be flexible enough to handle this. Learning all of this made me realize that if I want to analyze and generate new video game texts, I want to work with source data from the video game domain.

Creating new datasets

However, I learned that there are not many available datasets with video game texts. So I had to create my own. But that turned out to be quite difficult. Ideally, I think the best data can be found directly in game files. But extracting those strings from an installed game is not straightforward. The strings might be spread across many different files, and the game engine reassembles them into what you see on screen during gameplay. Extracting this data is not easy, even for researchers with a computer science degree. You need to understand how the game works at a low level to access this data.

I talked to a game developer friend of mine and he gave me some ideas. I also read about how others have solved this problem in scientific papers.

I eventually found three possible approaches that I wanted to investigate further. These are ordered from easiest to most complex.

  1. The first option is to build a web scraper and search the internet for fan-made websites with in-game texts. There are many places on the internet where you can find transcriptions of quest texts, dialogue, cutscenes, and other in-game texts.
  2. The second option is to use official modding software, if it is available for the game. Not all games come with their own modding software, but when it exists it is generally user-friendly and well-documented. Since the publisher released the software, they often have a vested interest in its success, making it a good option to consider.
  3. The final option is to search online for fan-made tooling for modification or extraction of game assets on websites like Nexus Mods. This could also be a viable option.
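To make the first option a bit more concrete, here is a minimal sketch of the text extraction step of a scraper, using only the Python standard library. It is an illustration, not the scraper I used: the sample page is made up, and the assumption that the in-game text sits in `<p>` elements will not hold for every fan wiki.

```python
from html.parser import HTMLParser


class ParagraphExtractor(HTMLParser):
    """Collects the text content of every <p> element on a page."""

    def __init__(self):
        super().__init__()
        self.paragraphs = []   # finished paragraphs
        self._buffer = None    # pieces of the paragraph being read, or None

    def handle_starttag(self, tag, attrs):
        if tag == "p":
            self._buffer = []

    def handle_data(self, data):
        # Only keep text that appears inside a <p> element.
        if self._buffer is not None:
            self._buffer.append(data)

    def handle_endtag(self, tag):
        if tag == "p" and self._buffer is not None:
            # Join the pieces and normalize runs of whitespace.
            text = " ".join("".join(self._buffer).split())
            if text:
                self.paragraphs.append(text)
            self._buffer = None


def extract_paragraphs(html):
    """Return the cleaned text of all paragraphs in an HTML string."""
    parser = ParagraphExtractor()
    parser.feed(html)
    parser.close()
    return parser.paragraphs


# Hypothetical wiki page. A real scraper would download the HTML first,
# for example with urllib.request.urlopen(url).read().decode("utf-8").
SAMPLE_PAGE = """
<html><body>
<h1>A Made-Up Book</h1>
<p>This is the <i>first</i> paragraph.</p>
<p>   And this is
the second.  </p>
<div>Navigation, not book text</div>
</body></html>
"""

book_paragraphs = extract_paragraphs(SAMPLE_PAGE)
```

In practice, most of the work goes into finding out which elements hold the text you want on a particular wiki, and into being polite to the site by limiting how fast you send requests.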

Pros and cons

I tried out all these approaches in practice and they led to three different datasets. However, each has its own pros and cons, so depending on your needs and technical proficiency, as well as whether the publisher has released anything, you can choose one of these methods.

Unfortunately, none of these methods are scalable. You can build one extraction tool for one game, but it is unlikely that you can use it for another game, with the exception of games from the same publisher.

The first option, using wikis, is the easiest to use. However, as with Wikipedia, the data is crowd-sourced, so it may contain typos, be incomplete, or require extra cleaning. How much data is available also depends on the fan community, so it varies from game to game.

The second method is my favorite, but only a minority of games come with publisher-sanctioned modding software.

The third method, using fan-made tools, is for adventurers only. It requires installing a lot of undocumented or old tools and seeing if they can do what you need. However, the hard work paid off in the end, as it led to some high-quality data for my final dataset.

Dataset 1: Elder Scrolls

For my first dataset, I scraped all of the in-game books from the RPGs in the Elder Scrolls series from a fan wiki. For the second dataset, I used the official Torchlight 2 modding software, called Guts, to extract all the strings from that game. And for the third dataset, I used a collection of open-source fan-made tools to extract all the branching dialogue from Star Wars Knights of the Old Republic 1.

So, what does "in-game books from the Elder Scrolls" mean? The Elder Scrolls is a series of action RPGs, and the developers use in-game books to convey more information about the in-game world to the player. The player character can walk around the world and find books in shops and houses, and can read every book they encounter.

The advantage of this dataset is that the books contain relatively long texts. Most game text is kept short to save the player time, but these books are long, which makes this dataset quite large. In a previous project, I used it to learn the semantics of non-English words and create a sentiment analyzer based on those semantics. I believe it can be useful for many different NLP applications focusing on video games. It's a great starting dataset because of its size.

Dataset 2: Torchlight 2

The second dataset is very different from the first. The first dataset contains only fictional book texts, while this one is smaller but has a greater variety of texts. It contains the starting, intermediate, and completion dialogue for every quest in Torchlight 2. In addition to quest dialogue, it contains GUI texts such as one-line quest objectives and tooltips, as well as long-form summaries of the main quest and the entire main quest narrative of the game. I really like the variety of this dataset.

Dataset 3: Star Wars

The third dataset contains branching dialogue, which is a feature that Star Wars Knights of the Old Republic 1 is known for. The game was developed by BioWare, who are famous for their games with high-quality writing and voice acting.

Through branching dialogue, the player influences how their character reacts to situations and dialogue from other characters in the game. Not every choice is significant, but some can change the direction of the story. In the Star Wars games, there is a big focus on good versus evil, Jedi versus Sith alignment, and branching dialogue is one way they implement this.

Branching dialogue is not something you find in every RPG, but I am happy to have collected this dataset as it contains over 30,000 lines of dialogue, including information about the conversation and speakers, and all branching options. This dataset can be used to research how branching dialogue in games works and to potentially generate new branching dialogue. I hope it is helpful for researchers interested in this topic.
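For readers who have not worked with branching dialogue before, one way to think about it is as a graph of lines, where each line lists the lines that may follow it. The sketch below is only an illustration of that idea: the node ids, field names, and example lines are made up and are not the format of the dataset, and the code assumes the conversation has no loops.

```python
# Hypothetical dialogue graph: every node has a speaker, a line of text,
# and the ids of the lines that may follow it.
DIALOGUE = {
    "greet": {"speaker": "NPC", "text": "Can you help me?", "next": ["help", "refuse"]},
    "help": {"speaker": "Player", "text": "Of course.", "next": ["thanks"]},
    "refuse": {"speaker": "Player", "text": "Not my problem.", "next": []},
    "thanks": {"speaker": "NPC", "text": "Thank you!", "next": []},
}


def conversation_paths(dialogue, start):
    """Return every path from `start` to a line without follow-ups.

    Assumes the graph is acyclic; a conversation that loops back on
    itself would make this recursion run forever.
    """
    node = dialogue[start]
    if not node["next"]:
        return [[start]]
    paths = []
    for next_id in node["next"]:
        for tail in conversation_paths(dialogue, next_id):
            paths.append([start] + tail)
    return paths


paths = conversation_paths(DIALOGUE, "greet")
```

Each path is one possible playthrough of the conversation, so counting or comparing paths is a simple first step when studying how much a player's choices change a dialogue.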

The third dataset again contains only one type of text, but it is a large dataset with a wide variety of dialogue that is heavy on emotion and personality. Unlike the books in the Elder Scrolls dataset, which contain neutral "facts", this dataset has a diverse cast of characters. Researchers interested in analyzing personality, linguistic style, and emotions in narratives may be able to use this dataset.


I will now take questions in the workshop Zoom session, but if you watch this talk later or have any questions, feel free to send me a message and I'd be happy to respond. You can find some information about the dataset in the GitHub repository.

For those who are not technically savvy, don't worry, I made sure that all the data is easy to use by formatting it into a CSV file. It can be read with Excel or your favorite programming language. I want to make it as usable as possible.
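As a rough example of what reading such a file could look like in Python, here is a sketch that uses the standard `csv` module. The column names `speaker` and `text`, the sample rows, and the file name `dataset.csv` are assumptions for this example, so check the header of the actual file before reusing it.

```python
import csv
import io

# Stand-in for a dataset file; the column names and rows are made up.
SAMPLE_CSV = io.StringIO(
    "speaker,text\n"
    'Guard,"Halt, who goes there?"\n'
    "Player,A friend.\n"
)

# For a real file, replace SAMPLE_CSV with something like
# open("dataset.csv", newline="", encoding="utf-8"), where the
# file name is hypothetical.
rows = list(csv.DictReader(SAMPLE_CSV))

# Group the lines of text by the character who says them.
lines_by_speaker = {}
for row in rows:
    lines_by_speaker.setdefault(row["speaker"], []).append(row["text"])
```

`csv.DictReader` uses the first row as column names and handles quoted fields, which matters for game text because dialogue often contains commas.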

Thanks for listening and I'll take some questions now.