Talk transcript: Fine-tuning GPT-2 on World of Warcraft quests

Written from the perspective of a fourth-year PhD candidate in the Netherlands.

This is a textual version of the talk I gave at Foundations of Digital Games 2021, for the research paper titled Fine-tuning GPT-2 on annotated RPG quests for NPC dialogue generation. The talk was transcribed using Whisper.cpp and co-edited with the open beta of ChatGPT to improve the flow of the text.


Good morning everyone. My name is Judith van Stegeren and I'm here to present a fun project that I worked on together with my co-author Jakub.

The project is a quest-giver dialogue generator using GPT-2. We searched for a suitable data set and looked at multiple online RPGs. In the end, we chose a World of Warcraft data set because it had the right amount of data points and context to make the project interesting.

Can GPT-2 support game writers?

This project wasn't initially focused on World of Warcraft. We simply wanted to see if GPT-2 could be used to support game writers.

This project is part of my PhD research, which has three main themes.

  1. First, anyone who wants to apply machine learning to video game text runs into a lack of example texts from video games. So, finding new video game corpora and exploring how they can be used is something I'm interested in.
  2. Second, procedural content generation is a thriving field, but there aren't many researchers focusing on generating textual game assets. Combining natural language processing or generation with procedural content generation is an area that needs more research in my opinion.
  3. Third, there is a gap between academic research and industry practice. Even when natural language generation researchers find a new method, it may not be practical for the game industry. I believe we should try to address this by keeping industry requirements in mind when conducting research like this.

There are many Transformer-based large-scale language models, but we chose GPT-2 for our research because it is the most accessible for people who are not machine learning experts or computer scientists. This is mainly due to GPT-2's popularity and the many tutorials and Google Colab notebooks written about it. It is also supported by a Python wrapper called gpt-2-simple by Max Woolf, which makes it easy to use. Additionally, GPT-2 is already being used in AI Dungeon, which shows its potential in a gaming context.

Finding a suitable dataset

Our original idea was to create an NPC dialogue generator, but we hadn't decided on a game yet. To fine-tune a large-scale language model on game text, we needed examples. Jakub looked at Destiny fan wikis, the EVE Online quest database, and the Wowhead quest database for World of Warcraft. We ultimately chose World of Warcraft because it had more data points and richer content than the other options. The quests were longer and had more text variation, providing us with more information to work with.

GPT-2 is a powerful language model that can learn latent NLP tasks from examples. This means that even if we don't explicitly code a task for it, GPT-2 can learn the task by looking at the examples. The original transformer paper and the GPT-2 paper used machine translation examples to demonstrate this.

For example, if we give GPT-2 a lot of examples that include the word "translate" and strings in the source and target languages, GPT-2 will learn that this is the correct output format. This means that if we then give it a sentence in the source language and the name of the target language, it will automatically generate a translation that fulfills the latent task it learned. We want to use this same approach for generating quest dialogue.

We wanted to use a data set that not only included raw text strings from the game, but also had metadata about those strings. This would allow us to use the metadata to structure the task for GPT-2. For example, we could say "give us quest dialogue given this quest title and quest objective." We could also train GPT-2 to generate the quest title and objective given a piece of quest dialogue.

Generating World of Warcraft quests

We used the World of Warcraft quest database to create data points with the structure of "start of text," "title," "objective," "NPC dialogue," and "end of text." This information is shown to the player when they start a quest or subquest. By feeding GPT-2 enough of these data points, it will become an NPC quest dialogue generator.
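To make the data point structure concrete, here is a minimal sketch in Python. The delimiter strings, the field labels, and the example quest are all assumptions made up for illustration; they are not the exact format or data used in the paper.

```python
# Hypothetical delimiter tokens. The talk names the fields of a data point,
# but not the exact markers, so these strings are assumptions.
START = "<|startoftext|>"
END = "<|endoftext|>"


def format_quest(title: str, objective: str, dialogue: str) -> str:
    """Serialise one quest into a single training string."""
    return (
        f"{START}\n"
        f"title: {title}\n"
        f"objective: {objective}\n"
        f"dialogue: {dialogue}\n"
        f"{END}"
    )


def build_prompt(title: str, objective: str) -> str:
    """Everything up to the dialogue field, for a fine-tuned model to continue."""
    return f"{START}\ntitle: {title}\nobjective: {objective}\ndialogue:"


# Invented example quest, not taken from the World of Warcraft data set.
example = format_quest(
    title="The Missing Lantern",
    objective="Find the lantern in the old mine.",
    dialogue="My lantern is gone, and the mine is dark. Will you fetch it?",
)
prompt = build_prompt("The Missing Lantern", "Find the lantern in the old mine.")

print(example)
print("---")
print(prompt)
```

Because every training string opens with the same fields in the same order, a prompt that stops right after the dialogue label asks the model to fill in the part a game writer would otherwise write.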

We generated a bunch of quests using GPT-2 and mixed them with original World of Warcraft quests written by humans. We asked 32 people to rate the quests on several aspects, which is a common evaluation method in computational creativity. On average, the generated quests were rated worse than the handwritten quests, but in two out of five categories, the differences were not significant. We also found that the ratings for generated quests were sometimes spread over a wider range of quality than the ratings for handwritten quests. This means that an individual generated quest might be rated higher in quality than a human-written one. This suggests that GPT-2 can be successfully used for human-computer co-creation.

Overall, these results were not surprising, but they were encouraging. This was only a pilot project, and we found that picking the right data set and looking at the limitations of this approach are more interesting than the approach itself, as many people have used similar methods for fine-tuning GPT-2.

What can we learn from this?

The new contribution of this work is not the approach itself, but rather exploring how we can generalize a large-scale language model trained on game data to other games and video game genres. For example, we trained GPT-2 on World of Warcraft quests and used it to create more World of Warcraft quests or dialogue. But what would happen if we used the same language model to create content for a Star Wars game, a My Little Pony game, or a Dungeons & Dragons game, which are similar but not the same as World of Warcraft? This is an interesting area for future research.

It is important to consider other models besides GPT-2. While GPT-2 was initially chosen for its accessibility to non-experts, it has limitations, such as its large size and limited language support. I would be interested in seeing similar research projects using models like BERT, which supports multiple languages and is being developed to be smaller and more efficient. The Google BERT team is also researching this direction, whereas OpenAI is focusing on making larger and larger language models, such as GPT-3. Both approaches have merit and are worth researching.

Finally, I would like to experiment with incorporating more information about the player or player character into each data point. This could be useful for generating more personalized content, and potentially for creating adaptive games that change in response to the player's actions or choices. It would be interesting to explore this further and see how it influences the generated content.
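As a rough illustration of what such a data point could look like, the sketch below adds player fields in front of the quest fields. This is future work, so everything here is hypothetical: the player attributes (class and level), the field labels, the delimiter strings, and the example values are assumptions, not something we built or evaluated.

```python
# Hypothetical field lists: which player attributes would be useful is an open question.
PLAYER_FIELDS = ("class", "level")
QUEST_FIELDS = ("title", "objective", "dialogue")


def format_quest_with_player(quest: dict, player: dict) -> str:
    """Serialise a quest together with player context into one training string."""
    lines = ["<|startoftext|>"]  # assumed delimiter
    for key in PLAYER_FIELDS:
        lines.append(f"player_{key}: {player[key]}")
    for key in QUEST_FIELDS:
        lines.append(f"{key}: {quest[key]}")
    lines.append("<|endoftext|>")  # assumed delimiter
    return "\n".join(lines)


# Invented values for illustration only.
player = {"class": "mage", "level": 12}
quest = {
    "title": "The Missing Lantern",
    "objective": "Find the lantern in the old mine.",
    "dialogue": "A mage like you should have no trouble lighting the way.",
}

data_point = format_quest_with_player(quest, player)
print(data_point)
```

Putting the player fields before the dialogue means that, at generation time, they can be part of the prompt, so the same quest could get different dialogue for different players.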


In conclusion, this project shows that it is possible to use GPT-2 for human-computer co-creation. It is a promising first try and can be expanded in many ways. Making machine learning tools more accessible to game developers, as well as providing high-quality video game text corpora with metadata, would be beneficial for research in both procedural content generation and natural language generation. Additionally, exploring how we can generalize and transfer large language models to be reused across multiple games, genres, and domains is a valuable direction for future research.

Thank you very much and I'll be happy to answer any questions you have.