Introducing Moonshine Web: real-time speech recognition running 100% locally in your browser!
🚀 Faster and more accurate than Whisper
🔒 Privacy-focused (no data leaves your device)
⚡️ WebGPU accelerated (w/ WASM fallback)
🔥 Powered by ONNX Runtime Web and Transformers.js
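For the curious, here's a minimal sketch of what loading Moonshine through the Transformers.js pipeline API could look like. The model id and audio URL below are assumptions for illustration; check the Hub for the actual converted checkpoints:

```ts
import { pipeline } from "@huggingface/transformers";

// Load the ASR pipeline on WebGPU (Transformers.js falls back to WASM
// if WebGPU is unavailable). NOTE: the model id is an assumption.
const transcriber = await pipeline(
  "automatic-speech-recognition",
  "onnx-community/moonshine-tiny-ONNX",
  { device: "webgpu" },
);

// Pipelines accept a URL to an audio file (or a Float32Array of samples).
const output = await transcriber(
  "https://huggingface.co/datasets/Xenova/transformers.js-docs/resolve/main/jfk.wav",
);
console.log(output.text);
```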
Performance leap: TGI v3 is out. Processes 3x more tokens, 13x faster than vLLM on long prompts. Zero config!
3x more tokens.
By reducing our memory footprint, we're able to ingest many more tokens, and more dynamically, than before. A single L4 (24GB) can handle 30k tokens on Llama 3.1-8B, while vLLM barely manages 10k. A lot of work went into reducing the runtime's footprint, and its effects are most visible in smaller, constrained environments.

13x faster
On long prompts (200k+ tokens), conversation replies take 27.5s in vLLM, but only 2s in TGI. How so? We keep the initial conversation around, so when a new reply comes in, we can answer almost instantly. The overhead of the lookup is ~5us. Thanks @Daniël de Kok for the beast data structure.
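In practice, this means resending the same conversation prefix plus one new message hits the cache, so only the new tokens need a fresh prefill. A client-side sketch against TGI's OpenAI-compatible chat endpoint (the placeholder strings and local URL are illustrative):

```ts
// Hypothetical client sketch: TGI keeps the earlier conversation's state
// around, so repeated requests sharing the same long prefix are answered
// almost instantly (the prefix lookup itself costs ~5us).
const longDocument = "...imagine a 200k+ token document here...";
const history = [
  { role: "user", content: longDocument },
  { role: "assistant", content: "Here is my summary..." },
];

async function ask(question: string): Promise<string> {
  const res = await fetch("http://localhost:8080/v1/chat/completions", {
    method: "POST",
    headers: { "Content-Type": "application/json" },
    body: JSON.stringify({
      model: "tgi",
      // Shared prefix (history) is reused; only `question` is new work.
      messages: [...history, { role: "user", content: question }],
    }),
  });
  const json = await res.json();
  return json.choices[0].message.content;
}
```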
Zero config

That's it. Remove all the flags you are using and you're likely to get the best performance. By evaluating the hardware and model, TGI carefully selects automatic values that give the best performance. In production, we no longer use any flags in our deployments. We kept all existing flags around; they may come in handy in niche scenarios.
Introducing TTS WebGPU: the first-ever text-to-speech web app built with WebGPU acceleration!
🔥 High-quality, natural speech generation that runs 100% locally in your browser, powered by OuteTTS and Transformers.js.
🤗 Try it out yourself!
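As a rough sketch of browser-side synthesis with Transformers.js: both the model id below and the assumption that OuteTTS is exposed through the generic text-to-speech pipeline are unverified; the actual demo may use a more specialised code path.

```ts
import { pipeline } from "@huggingface/transformers";

// NOTE: model id and pipeline task for OuteTTS are assumptions.
const synthesizer = await pipeline(
  "text-to-speech",
  "onnx-community/OuteTTS-0.2-500M",
  { device: "webgpu" },
);

const result = await synthesizer("Hello from your browser!");
// result.audio is a Float32Array of samples at result.sampling_rate,
// ready to play back through the Web Audio API.
```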
We just released Transformers.js v3.1 and you're not going to believe what's now possible in the browser w/ WebGPU! 🤯 Let's take a look:
🔀 Janus from DeepSeek for unified multimodal understanding and generation (Text-to-Image and Image-Text-to-Text)
👁️ Qwen2-VL from Qwen for dynamic-resolution image understanding (sketch below)
🔢 JinaCLIP from Jina AI for general-purpose multilingual multimodal embeddings
🌋 LLaVA-OneVision from ByteDance for Image-Text-to-Text generation
🤸‍♀️ ViTPose for pose estimation
📄 MGP-STR for optical character recognition (OCR)
📈 PatchTST & PatchTSMixer for time series forecasting
That's right, everything running 100% locally in your browser (no data sent to a server)! 🔥 Huge for privacy!
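To make the Qwen2-VL item concrete, here's roughly what image understanding looks like in v3.1. The model id and image URL are assumptions; see the release notes for the canonical example:

```ts
import {
  AutoProcessor,
  Qwen2VLForConditionalGeneration,
  RawImage,
} from "@huggingface/transformers";

// NOTE: the model id is an assumption; check the Hub for converted checkpoints.
const model_id = "onnx-community/Qwen2-VL-2B-Instruct";
const processor = await AutoProcessor.from_pretrained(model_id);
const model = await Qwen2VLForConditionalGeneration.from_pretrained(model_id);

// Build a chat-formatted prompt around an image (placeholder URL).
const image = await RawImage.read("https://example.com/photo.jpg");
const conversation = [
  {
    role: "user",
    content: [
      { type: "image" },
      { type: "text", text: "Describe this image." },
    ],
  },
];
const prompt = processor.apply_chat_template(conversation, {
  add_generation_prompt: true,
});
const inputs = await processor(prompt, image);

// Generate; the decoded output includes the prompt tokens.
const output = await model.generate({ ...inputs, max_new_tokens: 128 });
console.log(processor.batch_decode(output, { skip_special_tokens: true }));
```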
LLaVA-o1 - a VLM capable of spontaneous, systematic reasoning, similar to GPT-o1. The 11B model outperforms Gemini-1.5-Pro, GPT-4o-mini, and Llama-3.2-90B-Vision. Xkev/Llama-3.2V-11B-cot
Jina AI Jina CLIP v2 - a general-purpose multilingual and multimodal (text & image) embedding model, 900M params, 512 x 512 image resolution, Matryoshka representations (1024 down to 64 dims). jinaai/jina-clip-v2
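Matryoshka representations mean the leading dimensions of the embedding carry most of the signal, so you can truncate and renormalize to trade a little accuracy for much cheaper storage and search. A minimal sketch; the helper below is illustrative, not part of any library:

```ts
// Truncate a Matryoshka embedding to its first `dim` components and
// L2-renormalize, e.g. 1024 -> 64 dims.
function truncateMatryoshka(embedding: number[], dim: number): number[] {
  const head = embedding.slice(0, dim);
  const norm = Math.hypot(...head);
  return head.map((x) => x / norm);
}

// Usage: cosine similarity on the truncated vectors still works,
// at a small cost in retrieval quality. (Random vector as a stand-in.)
const fullEmbedding = Array.from({ length: 1024 }, () => Math.random());
const tiny = truncateMatryoshka(fullEmbedding, 64);
console.log(tiny.length); // 64
```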
How do I test an LLM for my unique needs? If you work in finance, law, or medicine, generic benchmarks are not enough. This blog post uses Argilla, distilabel, and 🌤️ Lighteval to generate an evaluation dataset and evaluate models.
Have you tried out 🤗 Transformers.js v3? Here are the new features:
⚡ WebGPU support (up to 100x faster than WASM)
🔢 New quantization formats (dtypes)
🏛 120 supported architectures in total
📂 25 new example projects and templates
🤖 Over 1200 pre-converted models
🌐 Node.js (ESM + CJS), Deno, and Bun compatibility
🏡 A new home on GitHub and NPM
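To illustrate the first two items, here's roughly how device and dtype selection look when loading a v3 pipeline (the model id is just an example, not an endorsement):

```ts
import { pipeline } from "@huggingface/transformers";

// Pick WebGPU and a 4-bit quantized dtype when loading a pipeline.
// NOTE: the model id is an example; any compatible ONNX model works.
const generator = await pipeline(
  "text-generation",
  "onnx-community/Qwen2.5-0.5B-Instruct",
  { device: "webgpu", dtype: "q4" },
);

const out = await generator("The capital of France is", { max_new_tokens: 10 });
console.log(out);
```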