This model essentially explores having different experts (MoE) for the image encoder part of a vision language model. How? The authors concatenate the vision encoder output tokens together and apply "pre-alignment": essentially, they fine-tune the experts against a frozen text encoder.
Then they freeze both the experts and the decoder and train only the projection layer; finally, they unfreeze everything for supervised fine-tuning.
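As a rough illustration of that staged recipe (the module names below are hypothetical placeholders, not the paper's code), the freezing and unfreezing comes down to toggling requires_grad per stage:

```python
import torch.nn as nn

def set_trainable(module: nn.Module, trainable: bool) -> None:
    """Freeze or unfreeze every parameter of a module."""
    for p in module.parameters():
        p.requires_grad = trainable

# Hypothetical stand-ins for the vision experts, projector, and language model.
experts = nn.ModuleList([nn.Linear(1024, 1024), nn.Linear(768, 768)])
projection = nn.Linear(1024, 4096)
decoder = nn.Linear(4096, 4096)

# Stage 2: experts and decoder frozen, only the projection layer learns.
set_trainable(experts, False)
set_trainable(decoder, False)
set_trainable(projection, True)

# Stage 3: unfreeze everything for supervised fine-tuning.
for module in (experts, projection, decoder):
    set_trainable(module, True)
```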
In the paper, they explore different fusion strategies and vision encoders, extending the basic CLIP encoder, and find that simply concatenating the visual tokens works well. The rest of the architecture is quite similar to LLaVA (see the architecture below).
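For intuition, here is a minimal toy sketch (my own code, not the paper's implementation) of fusing several vision experts by concatenating their tokens along the sequence axis before handing them to the language model:

```python
import torch
import torch.nn as nn

class ConcatVisionExperts(nn.Module):
    """Toy fusion module: project each expert's tokens to the LLM width,
    then concatenate along the sequence dimension."""

    def __init__(self, expert_dims, llm_dim):
        super().__init__()
        # One projector per expert so token widths match before concatenation.
        self.projectors = nn.ModuleList([nn.Linear(d, llm_dim) for d in expert_dims])

    def forward(self, expert_tokens):
        # expert_tokens[i]: (batch, num_tokens_i, expert_dims[i]) from a frozen encoder
        projected = [proj(t) for proj, t in zip(self.projectors, expert_tokens)]
        return torch.cat(projected, dim=1)  # (batch, sum of num_tokens_i, llm_dim)

# Usage with two hypothetical experts (e.g. a CLIP-like encoder plus a second one).
fusion = ConcatVisionExperts(expert_dims=[1024, 768], llm_dim=4096)
tokens = [torch.randn(1, 576, 1024), torch.randn(1, 256, 768)]
visual_embeds = fusion(tokens)  # interleaved with text embeddings downstream
```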
This isn't a goal of ours because we have plenty of money in the bank, but we're quite excited to see that @huggingface is profitable these days, with 220 team members and most of our platform being free (like model hosting) and open-source for the community!
Especially noteworthy at a time when most AI startups wouldn't survive a year or two without VC money. Yay!
reacted to bartowski's post with ❤️ 5 months ago
So turns out I've been spreading a bit of misinformation when it comes to imatrix in llama.cpp
It starts out true: imatrix runs the model against a corpus of text and tracks the activations to determine which weights are most important.
However, what the quantization then does with that information is where I was wrong.
I think I made an accidental connection between imatrix and ExLlamaV2's measuring, where ExLlamaV2 decides how many bits to assign to which weights depending on the target BPW.
Instead, what llama.cpp does with imatrix is attempt to select a scale for each quantization block that most accurately returns the important weights to their original values, i.e. minimizing the dequantization error weighted by the importance of the activations.
The mildly surprising part is that it actually just does a relatively brute-force search: it picks a bunch of scales, tries each one, and sees which results in the minimum error for the weights deemed important in the group.
But yeah, it turns out the quantization scheme is always the same; it's just that the scaling has a bit more logic to it when you use imatrix.
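A toy, importance-weighted version of that scale search might look like this (illustrative Python only, not llama.cpp's actual code or block format):

```python
import numpy as np

def pick_block_scale(block, importance, n_bits=4, n_candidates=32):
    """Try several candidate scales for one block of weights and keep the one
    that minimizes importance-weighted round-trip (dequantization) error."""
    qmax = 2 ** (n_bits - 1) - 1                 # e.g. 7 for symmetric 4-bit
    base_scale = np.max(np.abs(block)) / qmax    # naive max-abs scale
    best_scale, best_err = base_scale, np.inf

    # Brute-force: perturb the naive scale and measure the weighted error.
    for factor in np.linspace(0.8, 1.2, n_candidates):
        scale = base_scale * factor
        q = np.clip(np.round(block / scale), -qmax - 1, qmax)   # quantize
        err = np.sum(importance * (block - q * scale) ** 2)     # weighted error
        if err < best_err:
            best_scale, best_err = scale, err
    return best_scale

# Example: one 32-weight block with importances derived from activation stats.
block = np.random.randn(32).astype(np.float32)
importance = np.random.rand(32).astype(np.float32)
scale = pick_block_scale(block, importance)
```

The candidate that minimizes the importance-weighted error wins, even if it clips a few less important weights.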
Huge shoutout to @compilade for helping me wrap my head around it - feel free to add/correct as well if I've messed something up
It's a multimodal model based on Llama 3.1 that accepts an arbitrary number of interleaved images and text, with a huge context window (10k tokens!)
We are ready to announce a new series of Supple Diffusion models, a new generation of diffusion models (about 1-2 weeks left before release).
The new series aims to take diffusion models to the next level, with performance and versatility as the main goals.
How will our models be better than others? Firstly, we worked on the CLIP models: they now understand your requests better, so they will be easier to work with. Secondly, we trained the models to a higher quality than all of our previous ones. Thirdly, you won't have to keep 20 models on your disk; 4-6 will be enough.
Roadmap: 1. Create Supple Diffusion Small 2. Create Supple Diffusion Medium 3. Create Supple Diffusion Large
Our models are universal: they work for realism, cartoons, anime, and caricatures.
The project really needs your support, recommendations, and reviews, so please don't hesitate to leave comments under this post. Thank you!
Below are demo images made with the pre-release version of Supple Diffusion Small.
Forget about all the captioning datasets you've tried before!
PixelProse is a captioning dataset of 16M image-caption pairs, with less toxicity and more detailed captions: tomg-group-umd/pixelprose
The existing suite of captioning datasets consists of web scrapes with alt text that is either irrelevant or not descriptive. The authors of this paper took those datasets, filtered them for CSAM, and passed the images through Gemini Vision Pro with a captioning prompt. They also removed PII and detoxified the resulting dataset.
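If you want to poke at the data, a minimal sketch for streaming it from the Hub looks like this (the split name and schema are assumptions; check the dataset card first):

```python
from datasets import load_dataset

# Stream PixelProse so the full 16M pairs aren't downloaded up front.
ds = load_dataset("tomg-group-umd/pixelprose", split="train", streaming=True)

for example in ds.take(3):
    # Print the available fields rather than assuming a fixed schema.
    print(example.keys())
```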