Introducing Talkie: A 13B Vintage AI Model Trained on Pre-1931 Text
Researchers Nick Levine, David Duvenaud, and Alec Radford have unveiled Talkie, a novel 13-billion-parameter language model with a unique constraint: it was trained exclusively on English text published before 1931. This 'vintage' AI is designed not just as a conversational oddity, but as a serious research tool to probe the limits of AI generalization and study historical culture through a computational lens.
Why Build an AI from 1930?
The core idea, pioneered by researchers like Owain Evans, is to create a controlled environment for studying AI. By establishing a firm knowledge cutoff, researchers can cleanly test a model's ability to predict future events, generalize to novel concepts, and invent ideas beyond its training data. A vintage model also serves as a contamination-free evaluation baseline.
Contamination, where models inadvertently learn from test data, is a persistent problem in AI evaluation. Vintage models are 'clean' by design. The team is already using Talkie to evaluate forecasting performance, measuring the 'surprisingness' of post-1930 historical events to the model.
Early analysis shows a pronounced spike in surprisingness for events of the 1950s and 1960s, followed by a plateau. The researchers also aim to test whether such a model could independently arrive at post-1930 inventions, such as the helicopter or the Turing machine, or even discover scientific principles like General Relativity.
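In practice, "surprisingness" of this kind can be operationalized as token-level surprisal: the model's average negative log-likelihood on a textual description of the event. A minimal sketch, assuming a Hugging Face-style causal LM interface; the checkpoint name is a placeholder, not a released artifact:

```python
# Sketch: score how "surprising" a post-1930 event description is to the model,
# using mean negative log-likelihood (surprisal) per token.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("talkie-13b")  # placeholder name
model = AutoModelForCausalLM.from_pretrained("talkie-13b")
model.eval()

def surprisal(text: str) -> float:
    """Mean negative log-likelihood per token, in nats."""
    ids = tokenizer(text, return_tensors="pt").input_ids
    with torch.no_grad():
        # With labels == input_ids, the Hugging Face causal-LM loss is the
        # mean next-token cross-entropy over the sequence.
        loss = model(ids, labels=ids).loss
    return loss.item()

events = {
    1945: "The United States dropped an atomic bomb on Hiroshima.",
    1969: "Neil Armstrong walked on the surface of the Moon.",
}
for year, description in sorted(events.items()):
    print(year, round(surprisal(description), 3))
```

Plotting this score against event year is what yields the spike-and-plateau curve described above.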
Benchmarking a Model from Another Era
To contextualize Talkie's performance, the team trained an architecturally identical 'modern twin' on contemporary web data (FineWeb). On standard knowledge evaluations, Talkie underperforms its modern counterpart, even after excluding anachronistic questions that presuppose post-1930 knowledge.
However, its performance on core language understanding and numeracy tasks is comparable. The remaining gap is attributed to differences in data quality, chiefly noisy Optical Character Recognition (OCR) of historical documents, and to the subject-matter distribution of the pre-1931 corpus.
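The article does not detail how anachronistic questions were identified, but the simplest version of such a correction is mechanical: drop benchmark items that explicitly reference the post-cutoff world. A sketch under that assumption, using a crude year-regex filter:

```python
# Sketch: screen benchmark questions for obvious anachronisms before scoring
# a pre-1931 model. A year regex is a crude illustrative heuristic, not the
# team's actual correction method.
import re

CUTOFF_YEAR = 1930
YEAR_RE = re.compile(r"\b(1[0-9]{3}|20[0-9]{2})\b")

def is_anachronistic(question: str) -> bool:
    """Flag questions that explicitly mention a post-cutoff year."""
    return any(int(y) > CUTOFF_YEAR for y in YEAR_RE.findall(question))

questions = [
    "In what year did the French Revolution begin?",
    "Which country hosted the 1994 FIFA World Cup?",
]
kept = [q for q in questions if not is_anachronistic(q)]
print(kept)  # only the pre-cutoff question survives
```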
The Daunting Challenges of Vintage AI
Building Talkie presented unique hurdles not faced in modern model training. The three primary challenges were temporal leakage, data quality, and era-appropriate post-training.
Combating Temporal Leakage
Ensuring no post-1930 data contaminates the training set is paramount. Leakage can occur through faulty metadata or modern editorial insertions in old texts. The team used an n-gram-based classifier to filter the corpus, but it wasn't perfect.
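A filter of this kind can be pictured as scoring each document by how many of its n-grams occur only in post-1930 reference text. A minimal sketch; the frequency tables, trigram order, and threshold are illustrative assumptions, not the team's configuration:

```python
# Sketch: flag documents whose n-grams look distinctly post-1930.
# `pre_counts` / `post_counts` are assumed precomputed n-gram frequency
# tables built from dated reference corpora; values here are illustrative.
from collections import Counter

def ngrams(tokens: list[str], n: int = 3):
    return zip(*(tokens[i:] for i in range(n)))

def leakage_score(text: str, pre_counts: Counter, post_counts: Counter) -> float:
    """Fraction of the document's trigrams seen in post-1930 text but
    never in pre-1931 text; higher means likelier leakage."""
    grams = list(ngrams(text.lower().split()))
    if not grams:
        return 0.0
    suspicious = sum(1 for g in grams if post_counts[g] > 0 and pre_counts[g] == 0)
    return suspicious / len(grams)

pre = Counter()                                   # pre-1931 reference table
post = Counter({("the", "new", "deal"): 512})     # post-1930 reference table
doc = "Congress enacted the New Deal relief programs."
if leakage_score(doc, pre, post) > 0.05:          # threshold is illustrative
    print("flag for review")
```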
An earlier 7B version demonstrated knowledge of Franklin D. Roosevelt's presidency and New Deal legislation, both of which began in 1933. The current 13B model also shows some awareness of World War II and the postwar order. The researchers are developing more advanced leakage-detection techniques for future versions.
The OCR Problem
All pre-1931 text must be transcribed, introducing noise. The team found that models trained on conventionally OCR'd text achieve only 30% of the learning efficiency of models trained on human-transcribed versions. Simple regex cleaning improves this to 70%.
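The team's regex pass is not published, but cleanups of this flavor typically rejoin hyphenated line breaks, normalize whitespace, and strip scan artifacts. A sketch with illustrative rules:

```python
# Sketch: lightweight regex cleanup for noisy OCR of historical print.
# These particular rules are illustrative, not the team's published pass.
import re

def clean_ocr(text: str) -> str:
    text = re.sub(r"-\s*\n\s*", "", text)        # rejoin words hyphenated across line breaks
    text = re.sub(r"[ \t]+", " ", text)          # collapse runs of spaces/tabs
    text = re.sub(r"\n{3,}", "\n\n", text)       # squeeze excessive blank lines
    text = re.sub(r"[^\x20-\x7E\n]", "", text)   # drop non-printing scan artifacts (ASCII-only; too aggressive for accented text)
    return text.strip()

raw = "The parlia-\nment assembled  in\x0c the autumn."
print(clean_ocr(raw))  # -> "The parliament assembled in the autumn."
```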
Modern Vision-Language Models (VLMs) offer higher accuracy but risk hallucinating modern facts into the transcriptions. The team is building a dedicated 'vintage OCR' system to retranscribe the corpus and close this performance gap.
Post-Training Without Modern Bias
Fine-tuning Talkie on standard modern chat data would bake in anachronistic knowledge and style. Instead, the team built a pipeline from historical sources. They generated instruction-response pairs from structured historical texts like etiquette manuals, letter-writing guides, and cookbooks.
They then used synthetic prompts and online Direct Preference Optimization (DPO) with Claude Sonnet 4.6 as a judge, improving instruction-following ratings. A final round of supervised fine-tuning on synthetic multi-turn chats smoothed conversational abilities. The goal is to eventually use vintage models themselves as judges for a fully era-appropriate pipeline.
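Pipelines of this shape typically sample two candidate responses per prompt, ask the judge to pick the better one, and train DPO on the resulting (chosen, rejected) pairs. A sketch of the data-construction step, where `generate` and `judge_prefers` are hypothetical stand-ins for the vintage model's sampler and the Claude-based judge:

```python
# Sketch: build DPO preference pairs using an external judge.
# `generate` and `judge_prefers` are hypothetical stand-ins, not real APIs.
from dataclasses import dataclass

@dataclass
class PreferencePair:
    prompt: str
    chosen: str
    rejected: str

def generate(prompt: str, temperature: float) -> str:
    raise NotImplementedError("call the vintage model's sampler here")

def judge_prefers(prompt: str, a: str, b: str) -> bool:
    raise NotImplementedError("ask the judge which response follows instructions better")

def build_pairs(prompts: list[str]) -> list[PreferencePair]:
    pairs = []
    for prompt in prompts:
        # Two stochastic samples from the same policy; the judge orders them.
        a = generate(prompt, temperature=0.9)
        b = generate(prompt, temperature=0.9)
        if judge_prefers(prompt, a, b):
            pairs.append(PreferencePair(prompt, chosen=a, rejected=b))
        else:
            pairs.append(PreferencePair(prompt, chosen=b, rejected=a))
    return pairs
```

Each pair then feeds a standard DPO objective; in the online variant, sampling and training alternate as the policy improves.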
Data Collection and Future Scaling
The Talkie corpus, built on work by the Internet Archive and Common Pile, contains 260 billion tokens from pre-1931 books, newspapers, journals, patents, and case law. The 1930 cutoff aligns with U.S. copyright law, under which works enter the public domain 95 years after publication. The current model is English-only, but multilingual expansion is a priority.
The researchers plan to scale Talkie rapidly by increasing corpus size, improving OCR, strengthening leakage detection, and refining post-training with historians. They aim to release a GPT-3-level model this summer and believe a trillion-token historical corpus could support a model similar in capability to the original ChatGPT.
Research Implications and Ethical Considerations
Talkie represents a growing niche alongside projects like Ranke-4B and Machina Mirabilis. It promises to help disentangle what we know about AI in general from what we know about models trained specifically on the modern web. The team invites collaboration from researchers, historians, and artists.
A critical disclaimer accompanies the model: Talkie reflects the culture and values of its pre-1931 training data, which means it can produce outputs that will be offensive to modern users. This inherent bias is a feature of the experimental design, not an endorsement.
Context in a Competitive AI Landscape
The release of Talkie comes amid a period of intense competition and scrutiny in the AI industry. While Talkie is not a direct competitor to the commercial giants, its research-focused goals contrast with the market pressures highlighted elsewhere. A contemporaneous New York Times DealBook report questioned whether OpenAI is 'falling further behind' after missing user and revenue targets, underscoring the divergent priorities of commercial and exploratory AI development.
Talkie is supported by funding and compute from Coefficient Giving and Anthropic. Its development highlights a path for AI research that values historical understanding and fundamental scientific inquiry as much as raw capability scaling.