I like to train large deep neural nets too 🧠🤖💥 | First Paper (AutoAgents: A Framework for Automatic Agent Generation) Accepted @ IJCAI 2024 | Role Model Karpathy
Implemented, from first principles, the recently proposed Dynamic Tanh (DyT) as an alternative to LayerNorm. Specifically, we trained a nanoGPT (0.8M params) on Tiny Shakespeare with conventional LayerNorm, RMSNorm, and Dynamic Tanh, then compared performance. Observed performance seems to match the LayerNorm/RMSNorm baselines and is stable for α between 0.5 and 1.5; DyT might outperform if trained longer. Code: https://github.com/Jaykef/ai-algorithms/blob/main/Dynamic_Tanh.ipynb Background music by 周子珺
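For context, the layer itself is tiny: DyT(x) = γ · tanh(αx) + β, with a learnable scalar α and per-channel γ, β, used as a drop-in replacement for LayerNorm. A minimal sketch of that formulation (class name and init value here are illustrative, see the notebook for the exact code):

```python
import torch
import torch.nn as nn

class DyT(nn.Module):
    """Dynamic Tanh: drop-in replacement for LayerNorm.

    Computes gamma * tanh(alpha * x) + beta, where alpha is a learnable
    scalar and gamma/beta are learnable per-channel vectors. Minimal
    sketch; alpha0 = 0.5 is an illustrative init in the stable range.
    """
    def __init__(self, dim: int, alpha0: float = 0.5):
        super().__init__()
        self.alpha = nn.Parameter(torch.ones(1) * alpha0)  # learnable scalar
        self.gamma = nn.Parameter(torch.ones(dim))         # per-channel scale
        self.beta = nn.Parameter(torch.zeros(dim))         # per-channel shift

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.gamma * torch.tanh(self.alpha * x) + self.beta
```

Because there are no mean/variance statistics to compute, swapping it in for nn.LayerNorm in nanoGPT is a one-line change per block.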
Before 2020, most of the AI field was open and collaborative. For me, that was the key factor that accelerated scientific progress and made the impossible possible—just look at the “T” in ChatGPT, which comes from the Transformer architecture openly shared by Google.
Then came the myth that AI was too dangerous to share, and companies started optimizing for short-term revenue. That led many major AI labs and researchers to stop sharing and collaborating.
With OAI and sama now saying they're willing to share open weights again, we have a real chance to return to a golden age of AI progress and democratization—powered by openness and collaboration, in the US and around the world.
This is incredibly exciting. Let’s go, open science and open-source AI!
Implemented a custom multimodal GRPO trainer that scales down to small VLMs and supports both CPU and GPU with vLLM + Flash Attention. Used SmolVLM-256M-Instruct as the reference and reward model. It wasn't trained for long, btw, but still shows some sparks of "thinking" :) Code: https://github.com/Jaykef/ai-algorithms/blob/main/grpo_multimodal_reasoner.ipynb
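The core of GRPO is simple: sample a group of completions per prompt, score them with the reward model, and normalize each reward against its own group's mean and std instead of using a learned critic. A rough sketch of that objective (function names and hyperparameters are illustrative, not the exact notebook code):

```python
import torch

def grpo_advantages(rewards: torch.Tensor, eps: float = 1e-4) -> torch.Tensor:
    """Group-relative advantages, the core idea of GRPO.

    rewards: (num_prompts, group_size) rewards for G sampled completions
    per prompt. Each completion's advantage is its reward normalized
    against its own group, so no value function/critic is needed.
    """
    mean = rewards.mean(dim=-1, keepdim=True)
    std = rewards.std(dim=-1, keepdim=True)
    return (rewards - mean) / (std + eps)

def grpo_loss(logp_new, logp_old, advantages, clip_eps=0.2, kl_ref=None, beta=0.04):
    """Clipped PPO-style surrogate over the group-relative advantages,
    plus an optional KL penalty against the frozen reference model."""
    ratio = torch.exp(logp_new - logp_old)
    clipped = torch.clamp(ratio, 1 - clip_eps, 1 + clip_eps)
    loss = -torch.min(ratio * advantages, clipped * advantages).mean()
    if kl_ref is not None:
        loss = loss + beta * kl_ref.mean()
    return loss
```

For the multimodal case the only real change is that the prompt fed to the policy and reference model carries image tokens; the group-relative math is unchanged.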
Finally, the ground truth is available to all: AlexNet's original source code. Context: AlexNet had a historic win in the 2012 ImageNet Large Scale Visual Recognition Challenge (ILSVRC), reducing the top-5 error rate from 26% (previous best) to 15.3%. It's a deep CNN with 8 layers (5 convolutional + 3 fully connected) that pioneered the use of ReLU activations for faster training, dropout for regularization, and GPU acceleration for large-scale learning. This moment marked the beginning of the deep learning revolution, inspiring architectures like VGG, ResNet, and modern transformers. Code: https://github.com/computerhistory/AlexNet-Source-Code
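To make the architecture description concrete, here's the 8-layer topology in modern PyTorch. A single-tower sketch for illustration only; the original 2012 code splits channels across two GPUs and adds local response normalization:

```python
import torch.nn as nn

class AlexNet(nn.Module):
    """AlexNet topology: 5 conv + 3 fully connected layers, ReLU
    activations, dropout. Expects 3x227x227 input."""
    def __init__(self, num_classes: int = 1000):
        super().__init__()
        self.features = nn.Sequential(
            nn.Conv2d(3, 96, kernel_size=11, stride=4), nn.ReLU(inplace=True),
            nn.MaxPool2d(kernel_size=3, stride=2),
            nn.Conv2d(96, 256, kernel_size=5, padding=2), nn.ReLU(inplace=True),
            nn.MaxPool2d(kernel_size=3, stride=2),
            nn.Conv2d(256, 384, kernel_size=3, padding=1), nn.ReLU(inplace=True),
            nn.Conv2d(384, 384, kernel_size=3, padding=1), nn.ReLU(inplace=True),
            nn.Conv2d(384, 256, kernel_size=3, padding=1), nn.ReLU(inplace=True),
            nn.MaxPool2d(kernel_size=3, stride=2),
        )
        self.classifier = nn.Sequential(
            nn.Dropout(0.5), nn.Linear(256 * 6 * 6, 4096), nn.ReLU(inplace=True),
            nn.Dropout(0.5), nn.Linear(4096, 4096), nn.ReLU(inplace=True),
            nn.Linear(4096, num_classes),
        )

    def forward(self, x):
        x = self.features(x)           # (B, 256, 6, 6)
        return self.classifier(x.flatten(1))
```

Tiny by today's standards (~60M params), yet it set the template every later vision architecture iterated on.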
Nvidia brings Blue (a Star Wars-style droid) to life 🤯, super cute with flawless dexterity and a droid voice. It's the result of their collaborative research with Google DeepMind and Disney, revealed as part of their new open-source physics engine for robotics simulation, Newton, which enables robots to learn how to complete complex tasks with greater precision.
This is the most exciting of this week's releases for me: Gemini Robotics, a SOTA generalist Vision-Language-Action model that brings intelligence to the physical world. It comes with a verifiable real-world-knowledge Embodied Reasoning QA benchmark. The cool part is that the model can be specialized for new tasks via fast adaptation, and those adaptations can be transferred to new robot embodiments like humanoids. Looking forward to the model and data on hf, it's about time I go full physical :) Technical Report: https://storage.googleapis.com/deepmind-media/gemini-robotics/gemini_robotics_report.pdf
Super interesting paper! Proposes convolutional recurrent neural networks (CRNNs) that learn to produce traveling waves in their hidden state in response to visual stimuli, enabling the transfer and integration of spatial information across neural connections. In other words, they show that neural networks can exhibit wave-like dynamics that blend and process visual information over time; it's cool seeing a union of AI and physics in this way. Paper: https://arxiv.org/pdf/2502.06034 Code: https://github.com/KempnerInstitute/traveling-waves-integrate
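Roughly, the setup looks like this: the hidden state is a spatial map updated by a purely local convolution, so the only way distant pixels can communicate is through activity propagating (wave-like) across time steps. A toy sketch of that idea (names and sizes are mine, not the paper's code):

```python
import torch
import torch.nn as nn

class ConvRNN(nn.Module):
    """Toy convolutional RNN with a spatial hidden state.

    The recurrent update is a local 3x3 convolution, so information can
    move at most one pixel per step; waves traveling across the hidden
    map are how distant locations integrate information over time.
    Purely illustrative; see the repo for the actual architectures.
    """
    def __init__(self, channels: int = 16):
        super().__init__()
        self.encode = nn.Conv2d(1, channels, kernel_size=3, padding=1)
        self.recur = nn.Conv2d(channels, channels, kernel_size=3, padding=1)

    def forward(self, x: torch.Tensor, steps: int = 20) -> torch.Tensor:
        # x: (batch, 1, H, W) static visual stimulus
        inp = self.encode(x)
        h = torch.zeros_like(inp)
        states = []
        for _ in range(steps):
            h = torch.tanh(self.recur(h) + inp)  # local update -> wave dynamics
            states.append(h)
        return torch.stack(states, dim=1)  # (batch, steps, C, H, W)
```

Visualizing one channel of the returned states over the step axis is where the wave-like propagation shows up.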