A workflow for turning talks into blogposts with ML

In: Natural Language Generation, Projects
Written from the perspective of a machine learning engineer.

This blogpost was co-created with ChatGPT.

In this blogpost, I will describe the workflow I used to efficiently turn my talk videos into blogposts with machine learning. The heavy lifting is done by two relatively new machine learning tools: whisper.cpp and ChatGPT.

I use whisper.cpp, created by Georgi Gerganov, for audio transcription. Whisper.cpp is a lightning-fast open-source implementation of OpenAI's automatic speech recognition (ASR) model Whisper. I love that it runs offline, on my five-year-old computer, without any third-party dependencies. It transcribes a 15-30 minute talk in mere minutes. The audio transcripts are edited by hand and then fed to ChatGPT for further processing.

ChatGPT is a large language model by OpenAI -- as it will repeatedly tell you if you feed it the wrong prompt. It can be used for a variety of natural language processing tasks, such as text completion, translation, and summarization. I use ChatGPT to help me rewrite my transcripts into more fluent and natural-sounding texts.

Here's the workflow:

  1. Clone the whisper.cpp git repository and follow the excellent installation instructions in the readme.
  2. Optional: download a specific GGML model for Whisper. However, you don't NEED a fancy model. The base model that is downloaded as part of the whisper.cpp demo worked fine for transcribing talks. The base model supports multiple languages. (Edit: for transcribing English talks I now use the medium-sized model for English: ggml-medium.en.bin)
  3. Prep the audio of the talk you want to transcribe. My old talks are on YouTube, so I used yt-dlp to download the video and extract the audio as an mp3 file. Example: yt-dlp -x https://www.youtube.com/watch?v=dQw4w9WgXcQ --audio-format mp3
  4. Whisper.cpp only works on 16-bit WAV files sampled at 16 kHz, so I then convert the mp3 to wav with ffmpeg. Example: ffmpeg -i filename.mp3 -ar 16000 -ac 1 -c:a pcm_s16le filename.wav
  5. I tell whisper.cpp to transcribe the file, following the demo in the project's README. Example: ./main -m models/ggml-base.bin -otxt filename.wav (Steps 1-5 are collected into a single copy-pasteable sketch after this list.)
  6. Now we start prepping the transcript to make it ingestible by ChatGPT. Start by adding newlines to the transcript and breaking the text up into clear paragraphs.
  7. Edit the text to remove any irrelevant paragraphs.
  8. Remove incorrect words or replace them with placeholders such as [name]. Things like names, URLs, and hyperspecific jargon can be tricky for Whisper to transcribe.
  9. Use ChatGPT to help you rewrite the transcript into a more fluent and natural-sounding text. A useful prompt for this step is: "You are a professional editor. Rewrite this transcript into a more fluent text. Break up long sentences without changing the meaning or leaving out information. Make the flow of the text more natural." I started out by adding the prompt to every interaction with ChatGPT. However, the bot is optimized for dialogue and is pretty good at remembering the conversation so far, so in retrospect this was not really necessary.
  10. Feed one or two paragraphs at a time to the language model. If you provide it with too much text (e.g. three or more paragraphs), it may start to leave out important information. (A scripted alternative to the chat interface is sketched after this list.)
  11. Concatenate all rewritten paragraphs.
  12. Factcheck and spellcheck the text. If you skip this step, I will cry and/or lose my hope for the future of humanity.
  13. Finish up the blogpost by adding links, references, and images.
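
For reference, here is how steps 1 through 5 fit together in a single shell session. This is a sketch, not a definitive recipe: it assumes a Unix-like system with git, make, a C/C++ compiler, yt-dlp, and ffmpeg installed; the file name talk is illustrative; and whisper.cpp's build steps and flags may have changed since this was written, so check the project's README.

    # Step 1: clone and build whisper.cpp
    git clone https://github.com/ggerganov/whisper.cpp
    cd whisper.cpp
    make

    # Step 2: download a model (here: the medium English model)
    bash ./models/download-ggml-model.sh medium.en

    # Step 3: download the talk and extract the audio as mp3
    yt-dlp -x --audio-format mp3 -o "talk.%(ext)s" https://www.youtube.com/watch?v=dQw4w9WgXcQ

    # Step 4: convert to a 16-bit mono WAV file sampled at 16 kHz
    ffmpeg -i talk.mp3 -ar 16000 -ac 1 -c:a pcm_s16le talk.wav

    # Step 5: transcribe; -otxt writes the transcript to talk.wav.txt
    ./main -m models/ggml-medium.en.bin -otxt -f talk.wav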
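
Steps 9 and 10 can also be scripted instead of done in the ChatGPT web interface. Here is a minimal sketch against OpenAI's chat completions HTTP API; it assumes an API key in the OPENAI_API_KEY environment variable, jq (1.6 or later) to build the JSON payload, and one or two paragraphs of transcript in a file called paragraph.txt (the file name and model name are illustrative of what was available at the time of writing).

    # Steps 9-10: rewrite one chunk of transcript via the API
    PROMPT="You are a professional editor. Rewrite this transcript into a more fluent text. Break up long sentences without changing the meaning or leaving out information. Make the flow of the text more natural."
    curl https://api.openai.com/v1/chat/completions \
      -H "Content-Type: application/json" \
      -H "Authorization: Bearer $OPENAI_API_KEY" \
      -d "$(jq -n --arg sys "$PROMPT" --rawfile user paragraph.txt \
            '{model: "gpt-3.5-turbo",
              messages: [{role: "system", content: $sys},
                         {role: "user", content: $user}]}')"

Unlike the chat interface, the API does not remember previous messages, so the editor prompt has to be sent along with every chunk.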

Writing a blogpost this way can still be a lot of work, but it definitely beats transcribing 30 minutes of audio by hand. As an added bonus, I now have text-searchable (and search engine discoverable!) textual versions of my past talks.