The Knowledge Refinery
Every podcaster with 200 episodes is sitting on an oil field they can't drill. The refinery turns crude transcripts into books, skills, and products. Here's the category.
There are roughly four million podcasts in the world. About 400,000 of them have published more than fifty episodes. Maybe 40,000 have published over two hundred.
Each of those 40,000 shows is sitting on something between half a million and five million words of spoken knowledge - transcribed, timestamped, and doing absolutely nothing. It sits in an RSS feed that Apple indexes and nobody searches, buried in a hosting dashboard that the creator checks for download numbers and nothing else.
This is the largest untapped knowledge base on the internet, and almost none of it has been refined into anything you can actually use.
I've been refining it for three years.
Before AI, I did it by hand. A client would send me a hundred hours of podcast tape and I'd spend months extracting the good parts, rewriting them, stitching chapters together, polishing prose until the speaker's voice survived on the page. It was brutal, beautiful work. The output was a book. One book, from one client, in maybe six months.
Now I run the same operation in weeks. The transcription happens overnight. A set of skills chunks the material into semantic topics, extracts the entities and frameworks, builds a searchable index, and queues up research packets for each chapter. I still make every editorial judgment - which stories to lead with, which frameworks deserve their own section, where the speaker's actual words hit harder than any paraphrase. But the friction between raw and refined has nearly disappeared.
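The chunk-and-index step can be sketched in a few dozen lines. This is a toy version, not the actual tooling: the word-count chunker stands in for a real semantic chunker, and the inverted index stands in for the searchable index described above. Function names like `chunk_transcript` are illustrative.

```python
import re
from collections import defaultdict

def chunk_transcript(text, max_words=200):
    """Group sentences into roughly topic-sized chunks.
    A real semantic chunker would split on topic shifts, not word count."""
    sentences = re.split(r"(?<=[.!?])\s+", text.strip())
    chunks, current, count = [], [], 0
    for sentence in sentences:
        words = len(sentence.split())
        if current and count + words > max_words:
            chunks.append(" ".join(current))
            current, count = [], 0
        current.append(sentence)
        count += words
    if current:
        chunks.append(" ".join(current))
    return chunks

def build_index(chunks):
    """Inverted index: lowercase token -> set of chunk ids."""
    index = defaultdict(set)
    for i, chunk in enumerate(chunks):
        for token in re.findall(r"[a-z]+", chunk.lower()):
            index[token].add(i)
    return index

def search(index, chunks, query):
    """Return every chunk containing all of the query's tokens."""
    tokens = re.findall(r"[a-z]+", query.lower())
    if not tokens:
        return []
    ids = set.intersection(*(index.get(t, set()) for t in tokens))
    return [chunks[i] for i in sorted(ids)]
```

The point of the sketch is the shape of the operation: one pass to chunk, one pass to index, and from then on any question against the corpus is a lookup instead of a hundred hours of listening.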
Six books in two production rounds. Two hundred and fifteen thousand words. From transcripts that were just sitting there.
The villain in this story is waste.
Not malicious waste. Comfortable waste. The kind where a creator with three hundred episodes and a quarter million followers keeps recording new material because that's what creators do, while the backlog appreciates in volume and depreciates in relevance. Every week the archive gets bigger. Every week the window to publish a book from the early material gets a little narrower, because the conversation has moved on and the guest's claims have aged and the anecdotes that felt urgent in 2022 now need context.
The creator knows this. They've thought about writing a book, probably mentioned it on the show, maybe even talked to Scribe Media and heard the number. Forty thousand dollars. Six months. For one book extracted from content that already exists. The economics don't make sense, so the project sits in a "someday" list and the archive keeps growing.
Meanwhile, their audience is doing something interesting. They're listening to three-hour episodes, taking notes in Notion, building personal wikis, clipping quotes for Twitter threads. The audience is refining the content by hand, one listener at a time, because no systematic version exists.
Eric Jorgenson saw this with Naval Ravikant. He took Naval's public tweets and podcast appearances, organized them into chapters, and published The Almanack of Naval Ravikant. Naval didn't write a word. He authorized it - gave his blessing to someone else's curation. The book has moved over a million copies and it's free.
That was a one-off. What if it were a pipeline?
The pipeline exists. I've been running it, and the economics look nothing like traditional publishing.
Here's what the refinery produces from a single podcast corpus:
Books. Not ghostwritten-from-scratch books where someone interviews the author for twenty hours and fabricates a narrative. Compiled books - the speaker's actual words, reorganized and edited, with sixty percent editorial voice and ten percent direct quotes and every claim traced to a specific episode. Fair use compliant. Companion to the original, not substitute. The creator approves an outline and a final draft. That's their total involvement.
Wikis. Searchable encyclopedias built from the full transcript archive. Every framework, every guest, every protocol cross-referenced and linked. Obsidian-style markdown that works as a static site. I've built eight of these. The Ray Peat wiki has 2,898 transcripts and 23,833 semantic chunks. The Scott Adams wiki has 1,151 episodes indexed. These are knowledge bases that didn't exist before because nobody had the tooling to build them.
Skills. This is the part most people haven't considered. Inside every methodology-heavy podcast, there are repeatable workflows that can be codified as executable instructions. An SEO podcast teaches keyword research - that becomes a skill an AI can run. A sales podcast teaches objection handling - that becomes a skill with templates and decision trees. A health podcast teaches supplement protocols - that becomes a skill with dosages and timing and contraindications. The knowledge locked in transcripts can become tools that actually do things.
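To make "codified as executable instructions" concrete, here is one hypothetical shape a skill record could take: a structured workflow where every step cites the episode it was extracted from. The schema, the skill name, and the episode numbers are all invented for illustration, not the actual format.

```python
# Hypothetical skill schema: a workflow distilled from a podcast archive,
# each step traced back to its source episode. Field names are illustrative.
skill = {
    "name": "objection-handling",
    "source": "sales podcast archive",
    "steps": [
        {"instruction": "Acknowledge the objection without agreeing to it", "episode": 41},
        {"instruction": "Ask one clarifying question to isolate the real concern", "episode": 67},
        {"instruction": "Reframe with a customer story that resolved the same concern", "episode": 112},
    ],
}

def render_skill(skill):
    """Turn the record into a runnable checklist, keeping citations visible."""
    lines = [f"Skill: {skill['name']} (from {skill['source']})"]
    for n, step in enumerate(skill["steps"], start=1):
        lines.append(f"{n}. {step['instruction']} [ep. {step['episode']}]")
    return "\n".join(lines)
```

The citation field is the important design choice: a skill that can't point back to the episode it came from is just generic advice wearing the creator's name.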
Answer agents. A RAG-powered AI that has read every episode and answers questions in the creator's voice, grounded in their actual words, with citations back to the source material. Not a chatbot trained on the internet pretending to be someone. A knowledge base with a conversation layer.
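A toy version of that grounding loop, assuming nothing about the real system: token overlap stands in for the embedding retrieval an actual RAG agent would use, and the generation step, where an LLM paraphrases in the creator's voice, is reduced to returning the best-matching passage with its citation.

```python
import re

def _tokens(text):
    """Lowercase word tokens; a stand-in for real embedding similarity."""
    return re.findall(r"[a-z]+", text.lower())

def retrieve(corpus, question, k=2):
    """Rank passages by token overlap with the question, keep the top k
    that overlap at all. Each passage is {"text": ..., "episode": ...}."""
    q = set(_tokens(question))
    scored = sorted(corpus, key=lambda p: -sum(t in q for t in _tokens(p["text"])))
    return [p for p in scored[:k] if any(t in q for t in _tokens(p["text"]))]

def answer(corpus, question):
    """Return a grounded answer with episode citations. In a real agent an
    LLM would paraphrase the retrieved passages instead of quoting the top hit;
    the refusal branch is what keeps the agent from inventing answers."""
    hits = retrieve(corpus, question)
    if not hits:
        return "No grounded answer found in the archive."
    cites = ", ".join(f"ep. {h['episode']}" for h in hits)
    return f"{hits[0]['text']} ({cites})"
```

The refusal branch is the whole difference between this and "a chatbot trained on the internet pretending to be someone": when the archive doesn't cover the question, the agent says so instead of improvising.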
One corpus. Four products. The crude goes in, and books, wikis, skills, and agents come out the other end.
The word for this is refinery, and I mean it literally.
An oil refinery takes one input - crude petroleum - and produces gasoline, diesel, jet fuel, kerosene, lubricants, and asphalt. Different products for different markets, all from the same barrel. The economics work because the refinery is expensive to build and cheap to operate. The capital expenditure is the plant. The marginal cost of processing one more barrel is close to nothing.
The knowledge refinery works the same way. The expensive part was building the pipeline - the transcription engine, the semantic chunker, the entity extractor, the indexer, the chapter-writing skill with its anti-AI detection and citation verification and legal compliance checks, the six-phase production process with human review gates. That took a year. Now the marginal cost of processing one more podcast corpus is maybe twenty dollars in API calls and a few days of editorial attention.
The creator doesn't need to know any of this. They hand over an RSS feed. They get back products.
Every interesting category creates its own vocabulary. The knowledge refinery creates this one:
Corpus. The creator's body of work treated as a unified asset. Not "content" - that word implies disposability. A corpus is a library. It appreciates.
Crude. Raw transcripts, unstructured, full of repetition and filler and buried insight. Valuable but unusable in this form.
Refining. The systematic transformation of crude into products. Not editing - that implies cleanup. Refining implies extraction, separation, recombination. You're producing multiple outputs from a single feedstock.
Authorization. The creator's blessing. They didn't write the book, but they stand behind it. The Jorgenson-Naval model. This is the relationship that makes everything else possible and everything else legal.
The backlog problem. The uncomfortable truth that most creators are sitting on more value in their archives than in their next episode.
I can tell you exactly who this is for because I've done the work for three of them already.
It's the podcaster with a hundred-plus episodes and fifty thousand followers who has never published a book. The one whose audience would buy a book tomorrow if it existed. The one who mentioned writing a book two years ago on the show and hasn't started because Scribe quoted forty thousand and that's money they'd rather spend on production.
It's the YouTuber with a methodology - a system they teach across dozens of videos - that has never been organized into a single reference. Their audience takes notes by hand. The knowledge exists in fragments scattered across a playlist.
It's the course creator who recorded forty hours of lectures and put them behind a paywall, not realizing that the transcripts alone - organized, indexed, refined - could be a product line. Books from the lectures. A wiki from the curriculum. Skills from the exercises.
The common thread: they've already created the raw material. They don't need to create anything new. They need a refinery.
What I'm describing is not a service. It's a category.
Services compete on price and trust. Categories create their own demand. "Ghostwriting" is a service - you hire someone to write for you. "Knowledge refining" is a category - the systematic transformation of a creator's existing corpus into multiple knowledge products using AI-augmented pipelines.
The difference matters because a service scales with headcount and a category scales with tooling. I can refine one podcast corpus myself. But the pipeline - the skills, the SOPs, the quality gates, the production process - can be operated by anyone I train. And the training is the pipeline itself. You learn knowledge refining by running the refinery.
That's the endgame. Not a boutique where I produce books for ten clients. A refinery that other people operate, processing corpora I'll never touch, producing products in niches I don't know anything about. The platform takes a cut. The skills get better with every run. The creators get products they couldn't build themselves.
And somewhere in the archive of every podcast that's ever published fifty episodes, there's a book waiting to be refined.
The question I keep asking myself is whether Keynes would recognize this as progress.
A technology that turns dormant archives into living knowledge. That takes the best thinking from a thousand podcast conversations and makes it searchable, quotable, teachable. That gives a creator's ideas a life beyond the feed.
Or a technology that turns every backlog into a production opportunity and every corpus into a pipeline and never lets anything just sit there, being what it is, without someone calculating the marginal return on refining it.
I don't have a clean answer. Both things are true. The refinery produces real value - books that people read, skills that people use, wikis that organize knowledge that was previously inaccessible. And the refinery is a machine, and machines have a way of running you if you're not careful about running them.
So I run it on things I care about. Niches where I have conviction. Creators whose work I'd read anyway. I point the pipeline at the Catholic land movement and rural newsletters and a psychologist who teaches breathing exercises to anxious people. Not because these are the most profitable niches - they're not - but because the refinery should serve something, and if I can't name what it serves, I'm just processing crude.
The intelligence is abundant. The transcripts are abundant. The question, as always, is what's worth refining.


