This model essentially explores having different experts (MoE) for the image encoder part of a vision language model. How? The authors concatenate the vision encoder output tokens together and apply "pre-alignment": essentially, they fine-tune the experts against a frozen text encoder.
Then they freeze both the experts and the decoder and train only the projection layer; finally, they unfreeze everything for supervised fine-tuning.
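As a rough illustration of that staged recipe (the module names below are hypothetical placeholders, not the paper's code), the freezing and unfreezing comes down to toggling requires_grad per stage:

```python
import torch.nn as nn

def set_trainable(module: nn.Module, trainable: bool) -> None:
    """Freeze or unfreeze every parameter of a module."""
    for p in module.parameters():
        p.requires_grad = trainable

# Hypothetical stand-ins for the vision experts, projector, and language model.
experts = nn.ModuleList([nn.Linear(1024, 1024), nn.Linear(768, 768)])
projection = nn.Linear(1024, 4096)
decoder = nn.Linear(4096, 4096)

# Stage 2: experts and decoder frozen, only the projection layer learns.
set_trainable(experts, False)
set_trainable(decoder, False)
set_trainable(projection, True)

# Stage 3: unfreeze everything for supervised fine-tuning.
for module in (experts, projection, decoder):
    set_trainable(module, True)
```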
In the paper, they explore different fusion strategies and vision encoders, extending the basic CLIP encoder, and find that simply concatenating the visual tokens works well. The rest of the architecture is quite similar to LLaVA (see the architecture below).
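For intuition, here is a minimal toy sketch (my own code, not the paper's implementation) of fusing several vision experts by concatenating their tokens along the sequence axis before handing them to the language model:

```python
import torch
import torch.nn as nn

class ConcatVisionExperts(nn.Module):
    """Toy fusion module: project each expert's tokens to the LLM width,
    then concatenate along the sequence dimension."""

    def __init__(self, expert_dims, llm_dim):
        super().__init__()
        # One projector per expert so token widths match before concatenation.
        self.projectors = nn.ModuleList([nn.Linear(d, llm_dim) for d in expert_dims])

    def forward(self, expert_tokens):
        # expert_tokens[i]: (batch, num_tokens_i, expert_dims[i]) from a frozen encoder
        projected = [proj(t) for proj, t in zip(self.projectors, expert_tokens)]
        return torch.cat(projected, dim=1)  # (batch, sum of num_tokens_i, llm_dim)

# Usage with two hypothetical experts (e.g. a CLIP-like encoder plus a second one).
fusion = ConcatVisionExperts(expert_dims=[1024, 768], llm_dim=4096)
tokens = [torch.randn(1, 576, 1024), torch.randn(1, 256, 768)]
visual_embeds = fusion(tokens)  # interleaved with text embeddings downstream
```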
This isn't a goal of ours because we have plenty of money in the bank, but we're quite excited to see that @huggingface is profitable these days, with 220 team members and most of our platform being free (like model hosting) and open-source for the community!
Especially noteworthy at a time when most AI startups wouldn't survive a year or two without VC money. Yay!
reacted to bartowski's post with ❤️ 5 months ago
So turns out I've been spreading a bit of misinformation when it comes to imatrix in llama.cpp
It starts out true: imatrix runs the model against a corpus of text and tracks the activations to determine which weights are most important.
However, what the quantization then does with that information is where I was wrong.
I think I made an accidental connection between imatrix and ExLlamaV2's measuring, where ExLlamaV2 decides how many bits to assign to which weights depending on the target BPW.
Instead, what llama.cpp does with imatrix is attempt to select a scale for each quantization block that most accurately returns the important weights to their original values, i.e. minimizing the dequantization error weighted by the importance of the activations.
The mildly surprising part is that it actually just does a relatively brute-force search: it picks a bunch of scales, tries each one, and sees which results in the minimum error for the weights deemed important in the group.
But yeah, it turns out the quantization scheme is always the same; it's just that the scaling has a bit more logic to it when you use imatrix.
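A toy, importance-weighted version of that scale search might look like this (illustrative Python only, not llama.cpp's actual code or block format):

```python
import numpy as np

def pick_block_scale(block, importance, n_bits=4, n_candidates=32):
    """Try several candidate scales for one block of weights and keep the one
    that minimizes importance-weighted round-trip (dequantization) error."""
    qmax = 2 ** (n_bits - 1) - 1                 # e.g. 7 for symmetric 4-bit
    base_scale = np.max(np.abs(block)) / qmax    # naive max-abs scale
    best_scale, best_err = base_scale, np.inf

    # Brute-force: perturb the naive scale and measure the weighted error.
    for factor in np.linspace(0.8, 1.2, n_candidates):
        scale = base_scale * factor
        q = np.clip(np.round(block / scale), -qmax - 1, qmax)   # quantize
        err = np.sum(importance * (block - q * scale) ** 2)     # weighted error
        if err < best_err:
            best_scale, best_err = scale, err
    return best_scale

# Example: one 32-weight block with importances derived from activation stats.
block = np.random.randn(32).astype(np.float32)
importance = np.random.rand(32).astype(np.float32)
scale = pick_block_scale(block, importance)
```

The candidate that minimizes the importance-weighted error wins, even if it clips a few less important weights.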
Huge shoutout to @compilade for helping me wrap my head around it - feel free to add/correct as well if I've messed something up
It's a multimodal model based on Llama 3.1 that accepts an arbitrary number of interleaved images and text, with a huge context window (10k tokens!)
We are ready to announce a new series of Supple Diffusion models, a new generation of diffusion models (about 1-2 weeks left before release).
The new series aims to take diffusion models to the next level, with performance and versatility as the main goals.
How will our models be better than others? Firstly, we worked on the CLIP models: they now understand your requests better, so they will be easier to work with. Secondly, we trained the models to a higher quality than all of our previous ones. Thirdly, you won't have to keep 20 models on your disk; 4-6 will be enough.
Roadmap: 1. Create Supple Diffusion Small 2. Create Supple Diffusion Medium 3. Create Supple Diffusion Large
Our models are universal: they work for realism, cartoons, anime, and caricatures.
The project really needs your support, recommendations, and reviews, so please don't hesitate to leave comments under this post. Thank you!
Below are demo images made with the pre-release version of Supple Diffusion Small.
Forget about all the captioning datasets you've tried before!
PixelProse is a captioning dataset of 16M image-caption pairs, with less toxicity and more detailed captions: tomg-group-umd/pixelprose
The existing suite of captioning datasets consists of web scrapes with alt text that is either irrelevant or not descriptive. The authors of this paper took those datasets, filtered them for CSAM, and passed the images through Gemini Vision Pro with a captioning prompt. They also removed PII and detoxified the resulting dataset.
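If you want to poke at the data, a minimal sketch for streaming it from the Hub looks like this (the split name and schema are assumptions; check the dataset card first):

```python
from datasets import load_dataset

# Stream PixelProse so the full 16M pairs aren't downloaded up front.
ds = load_dataset("tomg-group-umd/pixelprose", split="train", streaming=True)

for example in ds.take(3):
    # Print the available fields rather than assuming a fixed schema.
    print(example.keys())
```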