Three Trillion Hours of Conversation, Waiting to Be Unlocked

AI transcription services decode the mass of scattered audio content into text—the perfect source material for transformation.

Dec 29, 2023

In the biblical account, the Tower of Babel was an edifice of human pride. It was erected with the ambition to reach the heavens and "make a name for the builders," but this unity of purpose was fractured by a divine intervention that confounded their language, leaving the tower as a testament to their scattered aspirations.

In a similar vein, today's digital landscape mirrors the fragmentation of Babel. Instead of a physical tower, we have constructed vast repositories of digital content. The billions of hours of podcast recordings, videos, voice notes, and meetings captured each day form a virtual monument to human thought (and a fair amount of hubris). And like the ancient tower, our auditory collection teeters on the brink of incomprehensibility, with the vast majority of data scattered across cloud storage drives and servers.

However, we now have the tools to bring order to this chaos as AI transcription services begin to decode this mass of scrambled chatter into a single format accessible to all: text.

Transcription as a Service (TaaS)

Transcription as a Service has undergone a mini-revolution in recent years thanks to advances in automated speech recognition. Transformer architectures, the foundation of models such as GPT-3, have further enhanced this capability through sophisticated context modeling, enabling AI to predict words more accurately. The quality keeps getting better, even as the cost plunges, in true Moore's Law fashion.

In 2016, when I first began working with speech recognition tools, a mediocre AI-generated transcript cost me around $15. Today you can generate a much more accurate version for pennies. Quality transcription used to be reserved for television closed captioning, courtroom dialogue, depositions, and other matters of importance where a clear record was needed. Now, it's everywhere:

Instagram and YouTube generate automatic transcripts to caption their videos, at no cost to users

Slack adds a transcript to any audio or video message you send to your coworkers

Zoom's meeting assistant "takes notes" while you talk, based on real-time transcription

The Real Untapped Potential

By itself, a transcript functions as a useful reference for what was said in a conversation. However, the real untapped potential lies in the combination of AI-powered transcription with LLM transformation. Vast troves of audio content can now be unlocked and repurposed, changing the way people do business in a variety of industries:

Conference calls into actionable tasks

Corporate training sessions into manuals

Educational lectures into study notes

Screenshare videos into how-to guides

Customer service calls into FAQ resources

But the industry that will change the most will be that of content creation and writing itself:

Podcasts into engaging blog posts or articles

YouTube series into e-books or compiled narratives

Webinars into instructional e-books or lead magnets

Interviews into biographical features

Speeches into opinion pieces or editorials

Celebrity podcasters like Tim Ferriss and Guy Raz were early to this idea, turning their most popular episodes into best-selling books. Ferriss's Tools of Titans, for example, condensed insights from over 100 podcast interviews with "world-class performers" into an 800-page tome. The book debuted in 2016 as a #1 New York Times bestseller.

Generating a Raw Transcript

The first step in this process is to generate a high-quality transcript of your audio.

For podcasters, recording software like Riverside.fm offers quality transcription, as does editing software like Descript. However, in my experience, these tools provide good but not great transcription compared to services that specialize in transcription only, like Otter.ai and Rev.com.

Finally, OpenAI makes its in-house Automatic Speech Recognition (ASR) technology, Whisper, available as an open-source model and through its API, and a number of products build on it. ChatGPT's "voice mode," which lets you talk into your phone's microphone rather than type in the chat box, is powered by Whisper. The transcription quality is excellent: it puts punctuation in the right place and even papers over errors and filler words in your speech.

Most of these services claim around 95% accuracy. Whisper claims to have achieved 98.5%. With these rates, you might expect that human transcription would be a thing of the past. On the contrary, customers are now paying a premium for humans to provide the last few percentage points of accuracy that the machines miss. Once again, human + AI beats either one alone.
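To make those percentages concrete: transcription accuracy is commonly reported as one minus the word error rate (WER), the number of word substitutions, deletions, and insertions needed to turn the machine transcript into a trusted reference, divided by the length of the reference. I don't know exactly how each service arrives at its advertised figure, so treat the following as a minimal sketch of the general idea; the function name and the lowercase-and-split normalization are my own assumptions.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level edit distance divided by the number of reference words."""
    # Assumption: lowercase and split on whitespace; real evaluations
    # usually normalize punctuation and numbers as well.
    ref = reference.lower().split()
    hyp = hypothesis.lower().split()
    if not ref:
        raise ValueError("reference transcript must contain at least one word")

    # prev[j] holds the edit distance between the reference words seen
    # so far and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            substitution = prev[j - 1] + (ref_word != hyp_word)
            deletion = prev[j] + 1       # reference word missing from hypothesis
            insertion = curr[j - 1] + 1  # extra word in hypothesis
            curr.append(min(substitution, deletion, insertion))
        prev = curr
    return prev[-1] / len(ref)


human = "the quick brown fox"
machine = "the quick brown box"
accuracy = 1 - word_error_rate(human, machine)
print(f"accuracy: {accuracy:.0%}")  # one wrong word out of four
```

By this measure, a 95%-accurate transcript of a 10,000-word interview still contains roughly 500 word-level errors, which is why a human pass remains worth paying for.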

The Coming Upcycling Industry

But correcting basic errors is only one way that a human can increase the value of a raw AI-generated transcript. I predict that a whole new industry will emerge based on the creative transformation and "upcycling" of ideas from the heaps of audio content that have been accumulating.

This ability to transform spoken words into text is crucial in today's information landscape, given that it's often easier to articulate ideas aloud than to write them down, yet written content remains the preferred medium for consumption.

We begin this upcycling process by liberating your ideas from audio into editable text. But capturing this raw source text is just the first step in our alchemical process. Next, we must learn to wield AI commands for refining that text and increasing its value, starting with the simple but powerful use case of the sweetened, condensed transcript.
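For readers who like to see the mechanics, here is a rough sketch of how a raw transcript might be prepared for such a command. Long transcripts rarely fit into a single LLM request, so one common approach is to split the text into pieces and attach the same instruction to each. Everything here is hypothetical: the function names, the 500-word limit, and the wording of the instruction are illustrative assumptions, and the step that actually sends each prompt to a model is deliberately left out.

```python
def chunk_transcript(transcript: str, max_words: int = 500) -> list[str]:
    """Split a transcript into consecutive pieces of at most max_words words."""
    if max_words <= 0:
        raise ValueError("max_words must be positive")
    words = transcript.split()
    return [
        " ".join(words[start:start + max_words])
        for start in range(0, len(words), max_words)
    ]


def build_prompts(transcript: str, instruction: str, max_words: int = 500) -> list[str]:
    """Pair the same instruction with each chunk of the transcript."""
    chunks = chunk_transcript(transcript, max_words)
    total = len(chunks)
    return [
        f"{instruction}\n\nTranscript excerpt {number} of {total}:\n{chunk}"
        for number, chunk in enumerate(chunks, start=1)
    ]


# Hypothetical instruction; adapt the wording to the transformation you want.
instruction = "Condense this excerpt, removing filler words and false starts."
prompts = build_prompts("so um today we are going to talk about", instruction, max_words=4)
for prompt in prompts:
    print(prompt)
    print("---")
```

Splitting on a fixed word count is the simplest possible choice. It can cut a sentence in half, so a more careful version would break on speaker turns or paragraph boundaries instead.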

This post is adapted from "Commanding the Page" (2023).
