After some heated discussion 🔥, we clarify our intent re. storage limits on the Hub
TL;DR:
- public storage is free, and (unless blatant abuse) unlimited. We do ask that you consider upgrading to PRO and/or Enterprise Hub if possible
- private storage is paid above a significant free tier (1TB if you have a paid account, 100GB otherwise)
We optimize our infrastructure continuously to scale our storage for the coming years of growth in machine learning, to the benefit of the community 🔥
self.brag(): Kokoro finally got 300 votes in Pendrokar/TTS-Spaces-Arena after @Pendrokar was kind enough to add it 3 weeks ago. Discounting the small sample size of votes, I think it is safe to say that hexgrad/Kokoro-TTS is currently a top 3 model among the contenders in that Arena. This is notable because:
- At 82M params, Kokoro is one of the smaller models in the Arena
- MeloTTS has 52M params
- F5 TTS has 330M params
- XTTSv2 has 467M params
Another great week in open ML! Here's a small recap 🫰🏻
Model releases
⏯️ Video Language Models: AI at Meta released Vision-CAIR/LongVU_Qwen2_7B, a new state-of-the-art long video LM based on DINOv2, SigLIP, Qwen2 and Llama 3.2
💬 Small language models: Hugging Face released HuggingFaceTB/SmolLM2-1.7B, a family of new smol language models with an Apache 2.0 license that come in sizes 135M, 360M and 1.7B, along with datasets. Meta released facebook/MobileLLM-1B, a new family of on-device LLMs of sizes 125M, 350M and 600M
🖼️💬 Any-to-Any: gpt-omni/mini-omni2, the closest reproduction of GPT-4o, has been released; it is a new LLM that takes image, text and audio input and outputs speech
Dataset releases
🖼️ Spawning/PD12M, a new captioning dataset of 12.4 million examples generated using Florence-2
Is it time for the open-source AI robot revolution?
With @haixuantao and @Leyo we've been playing with a low-cost DJI robot controlled by three local open-source AI models (Whisper, Idefics2, Parler-TTS - all Apache2) and orchestrated by Dora-rs.
Good folks at @Apple have developed a novel method called KV Prediction that significantly reduces the "time to first token" (TTFT) for on-device LLM inference.
Some highlights of the paper:
• Uses a small auxiliary transformer model to efficiently predict the KV cache of a larger base model
• Reduces TTFT by up to 4x while retaining 60-80% accuracy on benchmarks
• Achieves Pareto-optimal efficiency-accuracy trade-off compared to baselines
• Demonstrates 15-50% relative accuracy improvements on TriviaQA at equal TTFT FLOP budgets
• Shows up to 30% accuracy gains on HumanEval code completion at fixed TTFT FLOP counts
• Validated on Apple M2 Pro CPU, proving FLOP gains translate to real-world speedups
So, how's it done?
The KV Prediction method described in the paper comes down to these key steps:
1. Choose a base model and an auxiliary model:
- The base model is a larger, pretrained transformer model that will be used for final generation.
- The auxiliary model is a smaller transformer model used to efficiently process the input prompt.
2. Design the KV predictor:
- Create a set of learned linear projections to map from the auxiliary model's KV cache to the base model's KV cache.
- Define a mapping from auxiliary cache layers to base cache layers.
3. Training process:
- Pass input tokens through the auxiliary model to get its KV cache.
- Use the KV predictor to generate a predicted KV cache for the base model.
- Run the base model using the predicted KV cache and compute losses.
- Backpropagate errors through the frozen base model to update the auxiliary model and KV predictor.
4. Inference process:
- Process the input prompt with the auxiliary model to get its KV cache.
- Use the KV predictor to generate the predicted base model KV cache.
- Run a single token generation step with the base model using the predicted KV cache.
- Continue autoregressive generation with the base model as normal.
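To make the KV predictor from steps 2 and 4 more concrete, here is a minimal toy sketch in plain Python. It is not the paper's code: the class name `KVPredictor`, the layer mapping `[0, 0, 1, 1]`, the cache widths and the random (untrained) weights are all assumptions chosen for illustration. It only shows the shape of the idea: one linear projection per base layer turns an auxiliary-layer cache into a predicted base-layer cache.

```python
import random

def matmul(x, w):
    """Multiply a (tokens x d_in) matrix by a (d_in x d_out) matrix (nested lists)."""
    return [[sum(a * b for a, b in zip(row, col)) for col in zip(*w)] for row in x]

def random_matrix(rows, cols, rng):
    """Small random matrix standing in for learned weights or cached activations."""
    return [[rng.gauss(0.0, 0.1) for _ in range(cols)] for _ in range(rows)]

class KVPredictor:
    """Maps an auxiliary model's KV cache to a predicted base-model KV cache.

    layer_map[i] is the auxiliary layer whose cache feeds base layer i
    (a hypothetical mapping, chosen only for this example).
    """

    def __init__(self, d_aux, d_base, layer_map, seed=0):
        rng = random.Random(seed)
        self.layer_map = layer_map
        # One linear projection per base layer, separately for keys and values.
        self.w_k = [random_matrix(d_aux, d_base, rng) for _ in layer_map]
        self.w_v = [random_matrix(d_aux, d_base, rng) for _ in layer_map]

    def predict(self, aux_cache):
        """aux_cache: list of (keys, values) per auxiliary layer, each tokens x d_aux."""
        predicted = []
        for base_layer, aux_layer in enumerate(self.layer_map):
            keys, values = aux_cache[aux_layer]
            predicted.append((matmul(keys, self.w_k[base_layer]),
                              matmul(values, self.w_v[base_layer])))
        return predicted

# Toy setup: 2 auxiliary layers (width 4) feeding 4 base layers (width 8).
rng = random.Random(1)
num_tokens, d_aux, d_base = 3, 4, 8
aux_cache = [(random_matrix(num_tokens, d_aux, rng),
              random_matrix(num_tokens, d_aux, rng)) for _ in range(2)]
predictor = KVPredictor(d_aux, d_base, layer_map=[0, 0, 1, 1])
base_cache = predictor.predict(aux_cache)
print(len(base_cache), len(base_cache[0][0]), len(base_cache[0][0][0]))  # 4 3 8
```

In a real setup the projections would be trained as in step 3, and the predicted cache would be handed to the base model for the first generation step.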
Made a notable change to the TTS Arena fork. I doubt anyone cares which bottom-feeder TTS beats the one next to it, so every challenge now always includes one of the top 5 TTS models for more scrutiny. The top 5 are taken from preliminary results. Pendrokar/TTS-Spaces-Arena
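For the curious, the pairing rule could look roughly like the sketch below. This is only a guess at the logic, not the Arena's actual code: the function name `pick_challenge` and the model names are hypothetical placeholders.

```python
import random

def pick_challenge(models, top_models, rng=random):
    """Pick two distinct models so that at least one comes from top_models."""
    first = rng.choice(top_models)
    second = rng.choice([m for m in models if m != first])
    pair = [first, second]
    rng.shuffle(pair)  # do not reveal which side holds the top model
    return tuple(pair)

# Hypothetical model names; the top 5 would come from preliminary results.
models = ["tts-a", "tts-b", "tts-c", "tts-d", "tts-e", "tts-f", "tts-g", "tts-h"]
top_models = models[:5]
pair = pick_challenge(models, top_models)
print(pair)
```

Note that the second model is drawn from the full list, so two top-5 models can still face each other.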
This is a multimodal assistant: Qwen 2.5 72B + SOTA diffusion models for image generation. Same architecture as Image Gen+ but with some MAJOR improvements! These are as follows:
- Switched the LLM to Qwen 2.5 72B, the most powerful model currently available on HuggingChat. This results in higher quality prompts for the txt2img model and much better adherence to the prompt-url format that the upstream provider requires (image gen models are hosted by Pollinations, as with most other assistants on HuggingChat that offer image generation).
- Cleaned up the system prompt, including the examples of the prompt-in-url format, and adjusted the logic that determines how many images to generate based on the quality of the user prompt... these changes further improve
- The Assistant has access to multiple image generation models and will by default choose whichever model is most appropriate for the task. This includes NSFW generations, which it makes using an uncensored SD3 turbo. For other workloads, the Assistant preferentially uses one of the flux variants or any-dark (an artistic SDXL finetune), based on the nature of the task. Available models include turbo, flux, flux-realism, flux-anime, flux-3d, any-dark
- Added verbiage to the system prompt which greatly reduces censorship / refusals by the LLM (the txt2img models are uncensored to start off)
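To illustrate what a prompt-in-url format means in practice, here is a small sketch that URL-encodes a prompt into the path and passes the model as a query parameter. The base URL is a placeholder, and both the path layout and the `model` parameter name are assumptions for illustration, not the provider's documented API.

```python
from urllib.parse import quote, urlencode

# Placeholder endpoint, NOT the real provider URL.
BASE_URL = "https://image-provider.example/prompt/"

def build_image_url(prompt, model="flux", base_url=BASE_URL):
    """Percent-encode the prompt into the URL path and append the model choice."""
    query = urlencode({"model": model})  # hypothetical parameter name
    return f"{base_url}{quote(prompt, safe='')}?{query}"

url = build_image_url("A self portrait of your consciousness", model="flux-realism")
print(url)
```

The encoding step is the part an LLM tends to get wrong, which is why a stronger model adheres to the format better.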
Here are the user-entered prompts used to create the images you see here... feel free to try them yourself!
"Ayatollah Khameini and Kamala Harris having a secret romantic rendezvous. Use flux-realism model"
"A self portrait of your consciousness"
"The chien of andalous, in a psychedelic style"
"Make me 4 paintings in the style of Frida Kahlo that I can sell to tourists in a mexican hippie town"
"Paint me a van gogh and greg rutkowski style scene involving elephants and gerbils"
Maybe, like me, you have always wanted a super easy way to compare llama3.2-1B vs. llama3.2-3B, or the same model with different temperatures?
Trying and comparing warm Inference API models has never been easier! Just go to https://hf.co/playground, set your token and you're ready to go. We'll keep improving, feedback welcome
Rhymes AI drops Aria: small Multimodal MoE that beats GPT-4o and Gemini-1.5-Flash ⚡️
New player entered the game! Rhymes AI has just been announced and has unveiled Aria, a multimodal powerhouse that's punching above its weight.
Key insights:
- Mixture-of-Experts architecture: 25.3B total params, but only 3.9B active.
- Multimodal: text/image/video → text.
- Novel training approach: "multimodal-native", where multimodal training starts directly during pre-training, not just tacked on later
- Long 64K token context window
- Apache 2.0 license, with weights, code, and demos all open
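Why "25.3B total, 3.9B active"? In a Mixture-of-Experts layer a router sends each token to only a few experts, so most weights sit idle for any given token. The toy sketch below shows generic top-k routing; it is not Aria's actual implementation, and the four scalar "experts", the router logits and `k=2` are made up for illustration.

```python
import math

def top_k_route(router_logits, k):
    """Return (expert_index, weight) pairs for the k highest-scoring experts."""
    top = sorted(range(len(router_logits)),
                 key=lambda i: router_logits[i], reverse=True)[:k]
    exps = [math.exp(router_logits[i]) for i in top]
    total = sum(exps)
    return [(i, e / total) for i, e in zip(top, exps)]  # softmax over the chosen k

def moe_forward(x, experts, router_logits, k=2):
    """Run only the selected experts and mix their outputs by router weight."""
    return sum(w * experts[i](x) for i, w in top_k_route(router_logits, k))

# Four toy experts that just scale their input; only two run per token.
experts = [lambda x, s=s: s * x for s in (1.0, 2.0, 3.0, 4.0)]
logits = [0.1, 2.0, -1.0, 2.0]
routed = top_k_route(logits, k=2)
output = moe_forward(10.0, experts, logits)
active_fraction = 3.9 / 25.3  # share of Aria's params active per token, from the post
print(routed, output, round(active_fraction, 3))
```

With the figures from the post, roughly 15% of the parameters do work on any single token.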
➡️ On the benchmark side, Aria leaves some big names in the dust.
- It beats Pixtral 12B or Llama-3.2-11B on several vision benchmarks like MMMU or MathVista.
- It even beats the much bigger GPT-4o on long video tasks and outshines Gemini 1.5 Flash when it comes to parsing lengthy documents.
But Rhymes AI isn't just showing off benchmarks. They've already got Aria powering a real-world augmented search app called "Beago". It's handling even recent events with great accuracy!
And they partnered with AMD to make it much faster than competitors like Perplexity or Gemini search.
Meta AI vision has been cooking @facebook. They shipped multiple models and demos for their papers at @ECCV 🤗
Here's a compilation of my top picks:
- Sapiens is a family of foundation models for human-centric depth estimation, segmentation and more. All models have open weights, demos and even TorchScript checkpoints! A collection of models and demos: facebook/sapiens-66d22047daa6402d565cb2fc
- VFusion3D is a state-of-the-art consistent 3D generation model from images
Huge news for Kohya GUI: you can now fully Fine Tune / DreamBooth FLUX Dev on GPUs with as little as 6 GB of VRAM, without any quality loss compared to 48 GB GPUs. Moreover, Fine Tuning / DreamBooth yields better results than any LoRA training could.
LoRA Extraction
The checkpoints are 23.8 GB, but you can extract a LoRA from them with almost no quality loss. I researched this and published a public article / guide on it as well.
Info
This is just mind-blowing. The recent improvements Kohya made to block swapping are just amazing.
Speeds are also amazing, as you can see in image 2. Of course, those values are based on my researched config and were tested on an RTX A6000, which runs at almost the same speed as an RTX 3090.
All training experiments were done at 1024x1024px. If you use a lower resolution, VRAM usage will be lower and training will be faster.
VRAM usage will vary with your own configuration, and speed likely will too.