A comparison of GPT-2 and BERT

Published on 2020-08-04
Written from the perspective of a third-year PhD candidate in the Netherlands.

GPT-2 and BERT are two methods for creating language models, both based on neural networks and deep learning. Both are fairly young, but they are 'state-of-the-art', which means they outperform almost every earlier method on common benchmarks in the natural language processing field.

GPT-2 and BERT are especially usable because they come with a set of pre-trained language models, which anyone can download and use. The main advantage of pre-trained models is that users don't have to train a language model from scratch, which is computationally expensive and requires a huge dataset. Instead, you can take a smaller dataset and "fine-tune" the large, pre-trained model on it with a bit of additional training, which is much cheaper.
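
To make the "download and use" part concrete, here is a minimal sketch of text generation with the smallest pre-trained GPT-2 model through HuggingFace's Transformers library. It assumes that the `transformers` and `torch` packages are installed and that the `gpt2` checkpoint can be downloaded; the function name `generate_text` and its parameters are made up for this illustration. A model that was fine-tuned on your own dataset can be passed in instead of the default.

```python
def generate_text(prompt, model=None, tokenizer=None, max_length=60):
    """Continue `prompt` with a pre-trained (or fine-tuned) GPT-2 model."""
    # The library is imported lazily, so a model and tokenizer that were
    # loaded elsewhere (e.g. a fine-tuned checkpoint) can be passed in.
    if tokenizer is None:
        from transformers import GPT2TokenizerFast
        tokenizer = GPT2TokenizerFast.from_pretrained("gpt2")
    if model is None:
        from transformers import GPT2LMHeadModel
        model = GPT2LMHeadModel.from_pretrained("gpt2")

    # Turn the prompt into token ids (PyTorch tensors for the real model).
    inputs = tokenizer(prompt, return_tensors="pt")

    # Sample a continuation; top-k sampling is one common choice among many.
    output_ids = model.generate(
        **inputs,
        max_length=max_length,
        do_sample=True,
        top_k=50,
        pad_token_id=tokenizer.eos_token_id,  # GPT-2 has no pad token
    )
    # The decoded text contains the prompt followed by the continuation.
    return tokenizer.decode(output_ids[0], skip_special_tokens=True)
```

Calling `generate_text("Once upon a time")` downloads the model on first use and returns the prompt followed by a sampled continuation. Because the continuation is sampled, the output differs between runs.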

For my next research project, I want to use one of the "state of the art" deep learning approaches to text generation, so I read up on GPT-2 and BERT and created this overview. I hope it is helpful to other people on the internet!

Note that I only compared GPT-2 and BERT for the two languages that are relevant to my research (i.e. Dutch and English). BERT in particular has spawned a long list of alternative, spin-off models. Before you decide on a language model for your particular problem, read up on its strengths and weaknesses for your specific task!

Quick decision table

I would use the following approach for specific task/language combinations, based on what I read about GPT-2 and BERT so far:

Task	English	Dutch
Natural language processing (NLP)	BERT	BERT (one of the BERT models for Dutch)
Natural language generation (NLG)	GPT-2	It depends. (1) If input and output are not sensitive data: combine GPT-2 (English) with the Google Translate API (Dutch to English and vice versa). (2) If you only need short texts (15-token sentences) and you have bi-directional prompts: BERT (one of the BERT models for Dutch).
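
The translation route for Dutch text generation in the table can be sketched as a small pipeline. This is only an outline under assumptions: the three callables are hypothetical placeholders for whatever translation service and English generator you use, and none of the names below come from a real API.

```python
def generate_dutch_via_english(dutch_prompt, translate_nl_en, generate_en,
                               translate_en_nl, sensitive=False):
    """Generate Dutch text by detouring through an English language model.

    translate_nl_en, generate_en and translate_en_nl are hypothetical
    callables (str -> str) that you have to supply yourself.
    """
    # The table only recommends this route for non-sensitive data,
    # because the text leaves your machine when it is translated.
    if sensitive:
        raise ValueError("Do not send sensitive data to an external translation service.")

    english_prompt = translate_nl_en(dutch_prompt)   # Dutch -> English
    english_text = generate_en(english_prompt)       # e.g. GPT-2
    return translate_en_nl(english_text)             # English -> Dutch
```

Keeping the three steps as separate arguments makes it easy to swap the translation service or the generator, and to test the pipeline with simple stand-in functions.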

A very quick introduction to GPT-2 and BERT

GPT-2

Useful general information about this approach can be found in OpenAI's README and the model card for GPT-2.
GPT-2 is only partially open source. The training dataset (called 'WebText') is proprietary and has not been published (although there are ongoing discussions about this on GitHub). The open-source community is trying to create an open-source version of the training dataset, such as this initiative of Brown University researchers and this project.
Pre-trained models of all sizes are freely available. The biggest model was at first kept closed source for fear of abuse.
Models can be finetuned with smaller datasets for particular generation purposes.
It is only available for English. As far as I know, no model is available for Dutch or other languages. Some people are trying to train a model for languages other than English, but the lack of similarly sized, heterogeneous training sets remains a practical problem.
It is widely used for experimental natural language generation, ranging from hobby projects to exploratory research. Particularly cool examples: Procedural Magic cards and the AI Dungeon game.
It is supported by Transformers, HuggingFace's open-source Python library for NLP and one of the most widely used libraries of its kind.
There is lots of information available on this model: example code, colabs, tutorials, spin-off opensource projects, etc.
It is easy to use, even for students. I have seen multiple successful study projects of my students (for the courses "NLP", "advanced NLP", bachelor theses, etc) in the past two years. It's great for letting students experience the entire NLP research pipeline in a short time span: GPT-2 allows them to create a text generator, generate some outputs and evaluate them with humans in a couple of weeks.
Outputs of smaller models tend to converge on repetitive language, loops, etc during generation. If output quality is important, you need to either use a bigger language model (larger computation cost), or take some time to do automated output cleaning, or manual inspection and selection of artifacts (i.e. human-computer co-creation).
Can be used in combination with "field-specific tokens" to control the flow/coherence/structure of output text. This is best explained as using xml-tags in your training data and output prompt, to guide (force) the generator to generate a specific data structure. See for example the method used in this paper (page 3).
The next version, GPT-3, has a similar architecture to GPT-2 but is even larger in size.
GPT-3 is commercially available (not open source!) via an API. The authors have written a paper and published some research data, but training data and models are not available.
Demo API access is available for researchers ("private beta"), but there's a long waiting list as of June 2020. More media coverage about GPT-3: The Next Web, New York Times and The Verge.
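
Two of the points above, field-specific tokens and automated output cleaning, can be combined in a small sketch. Everything here is an assumption for illustration: the field names `title` and `body`, the function names, and the choice of repeated word trigrams as a sign of looping are not taken from the paper mentioned above.

```python
import re

FIELDS = ("title", "body")  # hypothetical field names, for illustration only

def to_training_text(record):
    """Wrap each field of a record in xml-like tags, in a fixed order."""
    return "".join(f"<{name}>{record[name]}</{name}>" for name in FIELDS)

def make_prompt(title):
    """Prompt that ends in an open <body> tag, so the model fills in the body."""
    return f"<title>{title}</title><body>"

def parse_fields(generated):
    """Return a dict of field values, or None if a tag is missing or unclosed."""
    record = {}
    for name in FIELDS:
        match = re.search(f"<{name}>(.*?)</{name}>", generated, flags=re.DOTALL)
        if match is None:
            return None
        record[name] = match.group(1).strip()
    return record

def has_repeated_ngram(text, n=3):
    """True if any word n-gram occurs more than once (a cheap loop detector)."""
    words = text.split()
    seen = set()
    for i in range(len(words) - n + 1):
        ngram = tuple(words[i:i + n])
        if ngram in seen:
            return True
        seen.add(ngram)
    return False

def clean_outputs(samples, n=3):
    """Keep samples that are well-formed and whose body does not loop."""
    kept = []
    for sample in samples:
        record = parse_fields(sample)
        if record is not None and not has_repeated_ngram(record["body"], n):
            kept.append(record)
    return kept
```

The idea is to fine-tune on texts produced by `to_training_text`, prompt the model with `make_prompt`, and pass the raw generations through `clean_outputs` before a human looks at them. The `generate` method in Transformers also accepts a `no_repeat_ngram_size` argument, which blocks repeated n-grams during generation instead of filtering them afterwards.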

BERT

BERT stands for "Bidirectional Encoder Representations from Transformers", and "is a new method of pre-training language representations which obtains state-of-the-art results on a wide array of Natural Language Processing (NLP) tasks."
There is an academic paper available on arXiv, in which the creators explain the theoretical details. The same paper was later published at NAACL, the conference of the North American Chapter of the ACL.
BERT was created by Google researchers, and it seems to be fully open source.
Fun fact: the name BERT fits into the long-standing NLP tradition of naming systems after Muppets.
Pre-trained models of all sizes are freely available via GitHub. Sizes range from Tiny to Large.
There are a number of smaller BERT models (tiny, mini, small, etc), which are intended for environments with restricted computational resources. Paper related to the smaller models: Well-Read Students Learn Better: On the Importance of Pre-training Compact Models.
Just as with GPT-2, models can be finetuned with smaller datasets for particular purposes.
Some of the models are integrated in HuggingFace's Python library Transformers.
Researchers and users are encouraged to share new models and performance comparisons online. BERT has a very active open source community, which seems to be mostly driven by the Transformers library of HuggingFace. BERT has various spin-off models, such as RoBERTa (from Facebook) and DistilBERT (by HuggingFace). Some examples can be found in the 'supported models' list of HuggingFace: https://github.com/huggingface/transformers#model-architectures
Contrary to GPT-2, BERT has models for various languages: both monolingual models (English, Dutch, etc) and multilingual models. Multilingual models do not seem to perform as well as monolingual ones, possibly due to undertraining on the various included languages. This is a nice comparison paper for monolingual vs multilingual BERT-models, focused on Nordic languages: Is Multilingual BERT Fluent in Language Generation? by Rönnqvist et al. (2019).
Performs well on NLP tasks, even compared to GPT-2.
BERT was originally not meant for language generation, but Wang and Cho (2019) published a series of 'hacks' that use BERT's training method (a cloze test, i.e. predicting masked words, plus a follow-up sentence test, i.e. next-sentence prediction) for generation. Relevant paper: BERT has a Mouth, and It Must Speak: BERT as a Markov Random Field Language Model by Wang and Cho (2019).
Because of the deep bi-directionality of the neural architecture, BERT seems to perform best for text generation if there is a prompt prepended AND appended to each output sequence. GPT-2, by contrast, is unidirectional: it only conditions on the tokens to the left of the position it is predicting, which is why it can perform better than BERT on 'standard' left-to-right text generation.
There are at least two BERT models for Dutch: BERTje (De Vries et al.) and RobBERT (Delobelle et al.), and there might be more.
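
The cloze-style generation with a prompt before and after the output, as described above, can be sketched in plain Python. This is a loose illustration under assumptions, not the exact procedure of Wang and Cho: `predict_token` is a hypothetical callable standing in for a real BERT model, the other function names are made up, and whitespace splitting replaces BERT's WordPiece tokenizer.

```python
import random

MASK, CLS, SEP = "[MASK]", "[CLS]", "[SEP]"  # BERT's special tokens

def build_cloze_input(left_prompt, right_prompt, n_blanks):
    """Return the token list and the positions of the blanks to fill in."""
    left = left_prompt.split()    # simplification: real BERT uses WordPiece
    right = right_prompt.split()
    tokens = [CLS] + left + [MASK] * n_blanks + right + [SEP]
    first_blank = 1 + len(left)
    blank_positions = list(range(first_blank, first_blank + n_blanks))
    return tokens, blank_positions

def fill_blanks(tokens, blank_positions, predict_token, n_rounds=3, seed=0):
    """Repeatedly re-predict each blank, given everything left and right of it.

    predict_token(tokens, position) -> str is a hypothetical stand-in for
    sampling a word from BERT's prediction at a masked position.
    """
    rng = random.Random(seed)
    tokens = list(tokens)  # work on a copy
    for _ in range(n_rounds):
        order = list(blank_positions)
        rng.shuffle(order)  # visit the blanks in random order
        for position in order:
            tokens[position] = MASK  # hide the current guess
            tokens[position] = predict_token(tokens, position)
    return tokens
```

Because every blank is predicted from the words on both sides, the right-hand prompt steers the output just as much as the left-hand one, which is what a left-to-right model cannot do.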

It seems that if you want normal left-to-right generation in English, GPT-2 is still the best way to go. BERT's main strength is NLP tasks, and the variety of languages for which a pre-trained model is available.
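
The difference between left-to-right and bidirectional models can be pictured as an attention mask: which positions a token may look at while it is being processed. The sketch below uses plain Python lists instead of a real model, and the function names are made up for illustration.

```python
def causal_mask(n):
    """Left-to-right (GPT-2 style): position i may attend to positions 0..i."""
    return [[j <= i for j in range(n)] for i in range(n)]

def bidirectional_mask(n):
    """BERT style: every position may attend to every position."""
    return [[True for _ in range(n)] for _ in range(n)]

def render(mask):
    """Draw a mask as rows of 'x' (visible) and '.' (hidden)."""
    return "\n".join(
        "".join("x" if cell else "." for cell in row) for row in mask
    )

print(render(causal_mask(4)))
# x...
# xx..
# xxx.
# xxxx
```

In the causal mask each row only sees itself and earlier positions, so the model can produce one token at a time from left to right. In the bidirectional mask every token sees both sides, which suits filling in blanks but not open-ended continuation.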

If you've used this overview to help you choose a language model, let me know in the comments. I'd love to hear about your NLP/NLG projects.