deepseek-ai/DeepSeek-OCR is out! 🔥 my take ⤵️
> pretty insane that it can parse charts and re-render them in HTML
> it concatenates CLIP and SAM features, which gives better grounding
> very efficient: strong performance per vision token
> covers 100 languages
IBM just released a small Swiss Army knife for document models: granite-docling-258M on Hugging Face 🔥
> not only a document converter: it can also do document question answering and understands multiple languages 🤯
> best part: released under the Apache 2.0 license 👏 use it in your commercial projects!
> it supports transformers, vLLM and MLX from the get-go! 🤗
> built on SigLIP2 & granite-165M
Okay this is insane... WebGPU-accelerated semantic video tracking, powered by DINOv3 and Transformers.js! 🤯 Demo (+ source code): webml-community/DINOv3-video-tracking
This will revolutionize AI-powered video editors... which can now run 100% locally in your browser, no server inference required (costs $0)! 😍
How does it work? 🤔
1️⃣ Generate and cache image features for each frame
2️⃣ Create a list of embeddings for the selected patch(es)
3️⃣ Compute cosine similarity between each patch and the selected patch(es)
4️⃣ Highlight those whose score is above some threshold
... et voilà! 🥳
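To make the four steps concrete, here is a minimal pure-Python sketch. It is not the demo's actual code (the demo runs on Transformers.js and WebGPU). The toy 3-dimensional vectors stand in for real DINOv3 patch features, and the names `highlight_patches` and `feature_cache`, the 0.6 threshold, and the choice to take the best score over all selected patches are assumptions made for illustration.

```python
import math


def cosine_similarity(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0  # treat a zero vector as "not similar"
    return dot / (norm_a * norm_b)


def highlight_patches(frame_features, selected, threshold=0.6):
    """Steps 3 and 4: score each patch against the selected embeddings and
    return the indices whose best score is above the threshold.

    Taking the max over the selected embeddings is an assumption; the
    threshold value is hypothetical too.
    """
    highlighted = []
    for index, patch in enumerate(frame_features):
        score = max(
            (cosine_similarity(patch, ref) for ref in selected),
            default=0.0,  # nothing selected yet
        )
        if score > threshold:
            highlighted.append(index)
    return highlighted


# Step 1: cached features, one list of patch embeddings per frame (toy data).
feature_cache = [
    [[1.0, 0.0, 0.0], [0.9, 0.1, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]],
    [[0.0, 1.0, 0.0], [0.8, 0.2, 0.1], [1.0, 0.1, 0.0], [0.0, 0.1, 1.0]],
]

# Step 2: the user clicks patch 0 in frame 0.
selected = [feature_cache[0][0]]

# Steps 3 and 4, applied to every frame.
masks = [highlight_patches(frame, selected) for frame in feature_cache]
print(masks)
```

In a real pipeline the features would come from the vision model and the highlighted indices would be mapped back to pixel regions of each frame.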
You can also make selections across frames to improve temporal consistency! This is super useful if the object changes its appearance slightly throughout the video.
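Here is a rough sketch of why selecting across frames helps, again with toy vectors rather than real DINOv3 features. The `track` helper, the `(frame_index, patch_index)` selection format, the 0.7 threshold, and the rule of keeping the best score over all reference patches are assumptions for illustration, not the demo's confirmed behavior.

```python
import math


def normalize(vector):
    """Scale a vector to unit length (zero vectors are returned unchanged)."""
    norm = math.sqrt(sum(x * x for x in vector))
    return [x / norm for x in vector] if norm > 0 else list(vector)


def best_score(patch, references):
    """Highest cosine similarity between a patch and any reference embedding."""
    unit_patch = normalize(patch)
    return max(
        (sum(a * b for a, b in zip(unit_patch, normalize(ref))) for ref in references),
        default=0.0,
    )


def track(frames, selections, threshold=0.7):
    """Return, per frame, the patch indices that match any selection.

    `selections` is a list of (frame_index, patch_index) pairs, so the
    references can come from different frames (hypothetical format).
    """
    references = [frames[f][p] for f, p in selections]
    return [
        [i for i, patch in enumerate(frame) if best_score(patch, references) > threshold]
        for frame in frames
    ]


BACKGROUND = [0.0, 0.0, 1.0]
# The object's embedding drifts as its appearance changes over three frames.
frames = [
    [[1.0, 0.0, 0.0], BACKGROUND],   # frame 0: object at patch 0
    [BACKGROUND, [0.8, 0.6, 0.0]],   # frame 1: object at patch 1
    [BACKGROUND, [0.3, 0.95, 0.0]],  # frame 2: object at patch 1, looks different
]

one_click = track(frames, [(0, 0)])
two_clicks = track(frames, [(0, 0), (2, 1)])
print(one_click)   # the object is lost in frame 2
print(two_clicks)  # a second selection in frame 2 recovers it
```

With a single reference, the drifted embedding in the last frame falls below the threshold; adding a second reference from that frame brings it back without picking up the background.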