A suitable corpus, or dataset of text documents, is one of the main ingredients of many natural language generation projects. There are many general-purpose corpora available for download on the internet. For example, the natural language processing library nltk has a built-in download module through which you can access various standard datasets. Anyone can download these datasets for free, which makes them a great resource for programmers and language researchers alike.
However, sometimes you need a corpus specific to your project. This is particularly true for data-driven approaches to language generation, where, instead of hand-coding a model of what you want to generate, you train a model on a dataset. The resulting model reflects certain properties of the corpus, because the text in the dataset serves as an example of “valid output text”. Depending on the chosen model and preprocessing, output texts will share the vocabulary, word lengths, topics, sentence structure or grammar of the text in the corpus. And if your input dataset is biased in any way, that same bias will carry over into the output of any generator trained on it.
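To make this concrete, here is a minimal sketch of a data-driven generator: a word-level Markov chain trained on a toy two-line “corpus”. The training sentences and function names are illustrative, not from any real dataset, but the key property the post describes falls out immediately: every word the generator can ever produce comes from the training text.

```python
# Minimal sketch: a word-level Markov chain as a data-driven text generator.
import random
from collections import defaultdict

def train(corpus):
    """Map each word to the list of words that follow it in the corpus."""
    model = defaultdict(list)
    for sentence in corpus:
        words = sentence.split()
        for current, nxt in zip(words, words[1:]):
            model[current].append(nxt)
    return model

def generate(model, start, length=8, seed=0):
    """Walk the chain from a start word, sampling a successor at each step."""
    rng = random.Random(seed)
    out = [start]
    for _ in range(length - 1):
        successors = model.get(out[-1])
        if not successors:
            break  # dead end: the word never had a successor in the corpus
        out.append(rng.choice(successors))
    return " ".join(out)

# A toy "weather report" corpus: the output vocabulary is limited to these words.
corpus = [
    "rain expected in the north",
    "sunny spells expected in the south",
]
model = train(corpus)
print(generate(model, "rain"))
```

Every generated word is drawn from the corpus, so a model trained on weather reports can only ever sound like weather reports, which is exactly the contamination effect described above.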
In other words, if I use weather reports as a basis for generating text, my output will also look like weather reports!
If I use newspaper articles as a starting point for generating new text, my output will probably contain terms related to politics and economics. Books from Project Gutenberg would infuse my output with the old-fashioned language of 19th-century novels. Using Wikipedia articles as a source might introduce crowd-sourced spelling mistakes into the text. And a dataset of opinion-riddled Amazon product reviews will lead to output full of subjective language.
This “contamination” is fine, as long as it is deliberate and fits the intended application.
Keep this in mind next time you choose a corpus for your NLG project.
Thanks to Gerdriaan Mulder for proofreading this post.