Embodied AI == Unlimited Training Data

Community Article Published January 13, 2025

The End of Internet Data Scarcity: Why Embodied AI Changes Everything

We're at a pivotal moment in AI history. While industry leaders declare the end of pre-training as we know it, they're missing a crucial insight: we're not facing the end of pre-training itself. The next frontier isn't about scraping more data from the internet; it's about tapping into the boundless stream of real-world data through embodied AI.

Consider this: the largest open English training dataset, distilled from decades of internet content, equals just 15.6 years of footage from a single camera. Now imagine a million cameras, each capturing the world 24/7. This isn't just an incremental change in how we collect data; it's a paradigm shift that could fundamentally transform how AI learns and understands our world.

Figure: Ilya Sutskever's Presentation @ NeurIPS 2024

The Death of Pre-training? Not Quite

At NeurIPS 2024, Ilya Sutskever said that pre-training will end. But let's dig deeper: what's really ending is our reliance on internet-sourced data. Why? Because every piece of internet content, from articles and poems to textbooks and artwork, requires human effort to create and curate. We're hitting the ceiling of human content creation capacity. But what if we stopped depending on human-created data entirely?

The Hidden Economics of Data Collection

Numbers tell this story better than words ever could. Let me show you something striking: while it can take a human author months to produce 1M training tokens' worth of written content, the same volume of data flows through just 32.8 seconds of real-world video capture (see the appendix for the full calculation). But here's the critical nuance: I'm not suggesting that a text token and a video token are equivalent; they capture fundamentally different aspects of information. A text token might encode abstract concepts and relationships, while a video token captures visual patterns, motion, and physical interactions.

The real revelation isn't about token equivalence—it's about scale. Even accounting for these differences in information density, the sheer velocity of real-world data collection is staggering. Think about this: while you're reading this article, a network of 1M cameras could generate 1T training tokens. For perspective, FineWeb, the largest open-source English training dataset, contains just 15T tokens—equivalent to 15.6 years of a single camera's capture.
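
As a rough sanity check on these numbers, here's a minimal Python sketch (the per-camera token rate of 1M tokens per 32.8 seconds comes from the appendix calculation; FineWeb's 15T-token size is as stated above):

```python
# Sanity check: one always-on camera vs. FineWeb.
TOKENS_PER_CAMERA_SECOND = 1_000_000 / 32.8   # ~30.5k tokens/s (see appendix)
SECONDS_PER_YEAR = 365.25 * 24 * 3600

tokens_per_camera_year = TOKENS_PER_CAMERA_SECOND * SECONDS_PER_YEAR
fineweb_tokens = 15e12  # FineWeb: ~15T tokens

print(f"{tokens_per_camera_year / 1e12:.2f}T tokens per camera-year")           # ~0.96T
print(f"{fineweb_tokens / tokens_per_camera_year:.1f} years to match FineWeb")  # ~15.6
```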

The math is beautifully simple: Data Scale = Number of Sensors × Time Elapsed

This isn't just about having more data; it's about having fundamentally unlimited data collection capacity that grows with every second that elapses. The implications of this shift from scarce, human-created content to boundless, real-world capture are profound.
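
In code, that relationship is a one-liner. Here's a minimal sketch, assuming the per-sensor token rate from the appendix (the function name `data_scale` is my own framing):

```python
def data_scale(num_sensors: int, seconds_elapsed: float,
               tokens_per_sensor_second: float = 1_000_000 / 32.8) -> float:
    """Data Scale = Number of Sensors x Time Elapsed, measured in tokens."""
    return num_sensors * seconds_elapsed * tokens_per_sensor_second

# A fleet of 1M cameras crosses 1T tokens in just 32.8 seconds.
print(f"{data_scale(1_000_000, 32.8):.3g} tokens")  # 1e+12 tokens
```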

Beyond Human Bias

Here's where it gets even more interesting: internet content, no matter how objective it tries to be, carries inherent human biases. Every author's choice of words, every curator's selection, every moderator's decision—they're all filtered through limited human perception and expression.

Real-world capture, on the other hand, is fundamentally different. It records reality as it exists, bound by physics and social norms rather than human interpretation. While sensor distribution might create some bias, this is something we can systematically control and adjust—unlike the inherent biases in human-created content.

The Path to AGI: Unlimited Data, Unlimited Potential

We're entering uncharted territory. With compute and budgets expanding, data has become the primary bottleneck in AI development. But what happens when that bottleneck disappears? When data becomes truly unlimited?

Just as I once underestimated GPT-3's capabilities despite my deep understanding of transformers and GPT-2, I suspect we're underestimating the potential of unlimited real-world data. Could this be the key to achieving AGI—an AI that truly understands and interacts with the physical world?

If GPT-3 could surprise us with its capabilities, unlimited real-world data could enable breakthroughs across multiple domains: perhaps a robot that can adapt to any kitchen layout, or autonomous vehicles that can handle truly unpredictable scenarios.

Figure: Jensen Huang's Presentation @ CES 2025

The answer might lie in letting our compute loose on this unlimited stream of reality. The future of AI lies in giving our algorithms a direct window into the real world.

Appendix

Video Token Calculation

Video Input Parameters

Resolution: 1080p (1920×1080 pixels)
Frame rate: 30fps
Duration: 32.8 seconds
Color channels: RGB (3 channels)

Raw Data Calculation

Single frame pixels: 1920 × 1080 = 2,073,600 pixels
Total frames: 32.8 seconds × 30fps = 984 frames
Total raw pixels: 2,073,600 × 984 = 2,040,422,400 pixels

Token Generation at Maximum Compression Using the Cosmos Tokenizer (CV8x16x16 or DV8x16x16)

Compression factors:
8x temporal compression
16x spatial compression (width)
16x spatial compression (height)
Total compression rate: 8 × 16 × 16 = 2048x

Final Token Count

Token calculation: 2,040,422,400 ÷ 2048 = 996,300 ≈ 1M tokens
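
For readers who want to reproduce this, here's a minimal Python sketch of the same arithmetic (the video parameters and compression factors mirror the values above):

```python
# Reproduces the appendix: tokens from 32.8s of 1080p/30fps video
# under 8x temporal and 16x16 spatial compression (2048x total).
WIDTH, HEIGHT = 1920, 1080   # 1080p resolution
FPS = 30                     # frame rate
DURATION_S = 32.8            # clip length in seconds

TEMPORAL_COMPRESSION = 8     # Cosmos CV8x16x16 / DV8x16x16 factors
SPATIAL_COMPRESSION = 16     # applied to both width and height

frames = round(DURATION_S * FPS)            # 984 frames
total_pixels = WIDTH * HEIGHT * frames      # 2,040,422,400 pixels

compression = TEMPORAL_COMPRESSION * SPATIAL_COMPRESSION**2  # 2048x
tokens = total_pixels // compression        # 996,300, i.e. ~1M tokens

print(f"{frames=:,} {total_pixels=:,} {tokens=:,}")
```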