* Decrease the rank of a LoRA
* Increase the rank of a LoRA
The first one is helpful in reducing memory requirements if the LoRA is of a high rank, while the second one is merely an experiment. Another implication of this study is in the unification of LoRA ranks when you would like to `torch.compile()` them. Check it out here:
sayakpaul/flux-lora-resizing
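For the rank-decrease direction, here is a minimal sketch of the SVD-based idea (the function name, tensor shapes, and factor convention are assumptions for illustration, not the repo's actual code):

```python
import torch

def decrease_lora_rank(A: torch.Tensor, B: torch.Tensor, new_rank: int):
    """Shrink a LoRA update (delta_W = B @ A) to `new_rank` via truncated SVD.

    Assumed shapes: A is (r, d_in), B is (d_out, r), with new_rank < r.
    """
    delta_w = B @ A  # reconstruct the full low-rank update, (d_out, d_in)
    U, S, Vh = torch.linalg.svd(delta_w, full_matrices=False)
    # Keep only the top `new_rank` singular directions.
    U, S, Vh = U[:, :new_rank], S[:new_rank], Vh[:new_rank, :]
    # Split the singular values evenly between the two factors.
    sqrt_s = torch.sqrt(S)
    B_new = U * sqrt_s            # (d_out, new_rank)
    A_new = sqrt_s[:, None] * Vh  # (new_rank, d_in)
    return A_new, B_new
```

Truncated SVD gives the best low-rank approximation of the original update in the Frobenius-norm sense, which is why shrinking a high-rank LoRA this way tends to degrade quality gracefully.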
Full snippet is here: https://gist.github.com/sayakpaul/cfaebd221820d7b43fae638b4dfa01ba
With @JW17
Merging code (very simple), inference code, merged params: sayakpaul/FLUX.1-merged
Enjoy the Monday 🤗
We present our findings from a series of experiments on quantizing different diffusion pipelines based on diffusion transformers.
We demonstrate excellent memory savings with a small sacrifice in inference latency, which is expected to improve in the coming days.
Diffusers 🤗 Quanto ❤️
This was a juicy collaboration between @dacorvo and myself.
Check out the post to learn all about it
https://huggingface.co/blog/quanto-diffusers
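As a taste, quantizing a DiT-based pipeline with Quanto boils down to a couple of calls; a minimal sketch with one such pipeline (the checkpoint choice and prompt are illustrative):

```python
import torch
from diffusers import PixArtSigmaPipeline
from optimum.quanto import freeze, qfloat8, quantize

pipe = PixArtSigmaPipeline.from_pretrained(
    "PixArt-alpha/PixArt-Sigma-XL-2-1024-MS", torch_dtype=torch.float16
).to("cuda")

# Quantize just the transformer's weights to FP8 and freeze them in place.
quantize(pipe.transformer, weights=qfloat8)
freeze(pipe.transformer)

image = pipe("a cute corgi astronaut, digital art").images[0]
```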
The guide takes you through simple prompt engineering, prompt weighting, prompt enhancement using GPT-2, and more.
Check out the guide here:
https://huggingface.co/docs/diffusers/main/en/using-diffusers/weighted_prompts
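For the prompt-weighting part, the guide leans on the Compel library; here's a minimal sketch (checkpoint and prompt are illustrative):

```python
import torch
from compel import Compel
from diffusers import StableDiffusionPipeline

pipe = StableDiffusionPipeline.from_pretrained(
    "runwayml/stable-diffusion-v1-5", torch_dtype=torch.float16
).to("cuda")
compel = Compel(tokenizer=pipe.tokenizer, text_encoder=pipe.text_encoder)

# "++" upweights a concept, "--" downweights it.
prompt_embeds = compel("a red++ cat playing with a ball--")
image = pipe(prompt_embeds=prompt_embeds).images[0]
```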
My personal favorite is the ability to run SD3 on a variety of different GPUs with minimal code changes.
Learn more about them here:
https://huggingface.co/blog/sd3
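For instance, model CPU offloading is one way to fit it on smaller cards; a minimal sketch (assuming you have access to the gated SD3 weights):

```python
import torch
from diffusers import StableDiffusion3Pipeline

pipe = StableDiffusion3Pipeline.from_pretrained(
    "stabilityai/stable-diffusion-3-medium-diffusers", torch_dtype=torch.float16
)
# Keep components on CPU and move each one to the GPU only when it's needed.
pipe.enable_model_cpu_offload()

image = pipe("a photo of a cat holding a sign that says hello world").images[0]
```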
I think you definitely missed out on another big release:
https://huggingface.co/posts/sayakpaul/557387472547604
It features the first non-generative pipeline of the library -- Marigold 🔥
Marigold shines at performing Depth Estimation and Surface Normal Estimation. It was contributed by @toshas, one of the authors of Marigold.
This release also features a massive refactor (led by @DN6) of the `from_single_file()` method, highlighting our efforts to make the library more amenable to community features 🤗 Check out the release notes here:
https://github.com/huggingface/diffusers/releases/tag/v0.28.0
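If you haven't tried `from_single_file()` before, it lets you load a pipeline straight from a single checkpoint file; a minimal sketch (the checkpoint URL is illustrative):

```python
from diffusers import StableDiffusionPipeline

# Load a full pipeline from one .safetensors file instead of a diffusers-format repo.
pipe = StableDiffusionPipeline.from_single_file(
    "https://huggingface.co/runwayml/stable-diffusion-v1-5/blob/main/v1-5-pruned-emaonly.safetensors"
)
```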
Wanted to use customized pipelines and other components (schedulers, unets, text encoders, etc.) in Diffusers?
Found it inflexible?
Since the first dawn on earth, we have supported loading custom pipelines via a `custom_pipeline` argument 🤗 These pipelines are inference-only, i.e., the assumption is that we're leveraging an existing checkpoint (e.g., runwayml/stable-diffusion-v1-5) and ONLY modifying the pipeline implementation.
We have many cool pipelines implemented that way. They all share the same benefits available to a `DiffusionPipeline`, no compromise there 🤗 Check them here:
https://github.com/huggingface/diffusers/tree/main/examples/community
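Loading one of these community pipelines is a one-liner; a minimal sketch (the long-prompt-weighting pipeline picked here is just one example):

```python
from diffusers import DiffusionPipeline

# Standard SD 1.5 weights, but with a community pipeline implementation
# pulled from diffusers' examples/community folder.
pipe = DiffusionPipeline.from_pretrained(
    "runwayml/stable-diffusion-v1-5",
    custom_pipeline="lpw_stable_diffusion",
)
```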
But then you might need everything customized, i.e., custom components along with a custom pipeline. Sure, that's all possible.
All you have to do is keep the implementations of those custom components in the Hub repository you're loading your pipeline checkpoint from.
SDXL Japanese was implemented like this 🔥
stabilityai/japanese-stable-diffusion-xl
Full guide is available here ⬇️
https://huggingface.co/docs/diffusers/main/en/using-diffusers/custom_pipeline_overview
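Loading such a fully customized pipeline looks like this; a minimal sketch (note that `trust_remote_code=True` opts in to running the code hosted in the repo):

```python
from diffusers import DiffusionPipeline

# Both the pipeline class and its custom components live in the Hub repo itself.
pipe = DiffusionPipeline.from_pretrained(
    "stabilityai/japanese-stable-diffusion-xl", trust_remote_code=True
)
```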
And, of course, these share all the benefits that come with `DiffusionPipeline`.
`device_map` in Diffusers 🤗 If you have multiple GPUs and want to distribute the pipeline's models across them, you can do so. Additionally, this becomes even more useful when you have multiple low-VRAM GPUs.
Documentation:
https://huggingface.co/docs/diffusers/main/en/training/distributed_inference#device-placement
🚨 Currently, only the "balanced" device-mapping strategy is supported.
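Usage is a single extra argument; a minimal sketch (the checkpoint choice is illustrative):

```python
import torch
from diffusers import DiffusionPipeline

# "balanced" distributes the pipeline's models evenly across the visible GPUs.
pipe = DiffusionPipeline.from_pretrained(
    "runwayml/stable-diffusion-v1-5",
    torch_dtype=torch.float16,
    device_map="balanced",
)
image = pipe("a dog running on the beach").images[0]
```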
Wrote about how we streamlined releases of the `diffusers` library. The post delves deeper into the workflows responsible for:
* Publishing the package on Test PyPI and main PyPI servers.
* Notifying an internal Slack channel after a release is published on the repository.
Check it out here:
https://sayak.dev/posts/streamlined-releases.html
@chansung and I worked on a weekend project combining the benefits of Gemini 1.0 and powerful chat models like Zephyr to demo this.
We use Gemini 1.0 to produce the personality traits of any character found in an input video. We then prepare a system prompt with the discovered traits to start chatting with an LLM (Zephyr in this case).
Managing a video captioning model is a little out of our expertise, hence Gemini FTW here 😶‍🌫️
👨‍💻 Code: https://github.com/deep-diver/Vid2Persona
🤗 Demo: chansung/vid2persona
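The chat half of the workflow is straightforward; a minimal sketch of the idea (the traits string, prompt wording, and use of `InferenceClient` are my own illustrative choices, not the project's actual code):

```python
from huggingface_hub import InferenceClient

# Hypothetical traits, as Gemini might describe a character in a video.
traits = "cheerful, sarcastic, speaks in short sentences, loves space trivia"
system_prompt = (
    f"You are a character with these personality traits: {traits}. "
    "Always stay in character."
)

client = InferenceClient("HuggingFaceH4/zephyr-7b-beta")
response = client.chat_completion(
    messages=[
        {"role": "system", "content": system_prompt},
        {"role": "user", "content": "Hey! What did you get up to today?"},
    ],
    max_tokens=256,
)
print(response.choices[0].message.content)
```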
Among other things, we shipped:
* Stable Cascade
* Playground v2.5 and EDM-style training
* EDM-formulated schedulers
* Trajectory Consistency Distillation for accelerated sampling
* A new guide on merging LoRAs (see the sketch below)
* A new image editing pipeline -- LEDITS++
Check out the release notes to catch everything that went into the release
https://github.com/huggingface/diffusers/releases/tag/v0.27.0
Thanks to everyone who contributed to the release 🤗
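Speaking of the LoRA merging guide, here is a minimal sketch of the simplest workflow it covers, combining two adapters with `set_adapters()` (the LoRA repos are illustrative):

```python
import torch
from diffusers import DiffusionPipeline

pipe = DiffusionPipeline.from_pretrained(
    "stabilityai/stable-diffusion-xl-base-1.0", torch_dtype=torch.float16
).to("cuda")

# Load two LoRAs under distinct adapter names.
pipe.load_lora_weights("nerijs/pixel-art-xl", adapter_name="pixel")
pipe.load_lora_weights("ostris/crayon_style_lora_sdxl", adapter_name="crayon")

# Blend them with custom weights at inference time.
pipe.set_adapters(["pixel", "crayon"], adapter_weights=[0.7, 0.3])
image = pipe("a corgi astronaut, pixel art style").images[0]
```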
I mean, we should be able to make the most of the GPU by reducing idle time as much as possible while also ensuring the throughput is really the highest we can get out of the card.
For example, if we are getting 60 QPS, is that the highest we can get out of the card? Is it the maximum limit?
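To make the question concrete, a minimal throughput probe might look like this (batch size, step count, and checkpoint are arbitrary choices for illustration):

```python
import time
import torch
from diffusers import StableDiffusionXLPipeline

pipe = StableDiffusionXLPipeline.from_pretrained(
    "stabilityai/stable-diffusion-xl-base-1.0", torch_dtype=torch.float16
).to("cuda")

prompt = "a photo of an astronaut riding a horse"
pipe(prompt, num_inference_steps=2)  # warmup

torch.cuda.synchronize()
start = time.perf_counter()
n_batches, batch_size = 5, 4
for _ in range(n_batches):
    pipe([prompt] * batch_size, num_inference_steps=30)
torch.cuda.synchronize()
elapsed = time.perf_counter() - start
print(f"throughput: {n_batches * batch_size / elapsed:.2f} images/sec")
```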
I think we can consider using the cheapest yet reasonable alternative. It's okay to not exhaustively consider all the specs. For example, it won't make much sense to do an SDXL deployment on a 4GB card. So, something in the range of 16-24GB should suffice.
How would you aim for the lowest latency at the cheapest cost using existing tooling?
Slick! Let's do a project on diffusion models using the cheapest option possible. But we can also show whether it can provide the highest efficiency. What say?
So, we replace the FFN layer with FFN layers from different models (which therefore requires the models to be of the same size).
Crazy that this works!
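If I'm reading the idea right, the basic swap might look something like this in PyTorch (the checkpoint names are placeholders and the `model.layers`/`mlp` attribute paths assume a Llama-style architecture):

```python
import copy
from transformers import AutoModelForCausalLM

# Placeholder checkpoints; the technique assumes same-sized architectures.
base = AutoModelForCausalLM.from_pretrained("org/model-a")
donor = AutoModelForCausalLM.from_pretrained("org/model-b")

merged = copy.deepcopy(base)
# Replace each FFN (the `mlp` submodule in Llama-style blocks) with the donor's.
for merged_layer, donor_layer in zip(merged.model.layers, donor.model.layers):
    merged_layer.mlp = copy.deepcopy(donor_layer.mlp)
```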
Haven't gone through the details but a follow-up question.
If the models need to be of the same size, how do we select FFN layers from one model to replace a single FFN layer in the other? If a Transformer block contains a single FFN block (a composition of dense layers), how do we accommodate multiple FFN layers, though?
How are the params of the MoE layers populated, though? It doesn't impact the performance? What's the intuition?
Super cool work.
For anyone curious: you can try out Marigold in diffusers through a custom pipeline too. Check it out here: https://github.com/huggingface/diffusers/tree/main/examples/community#marigold-depth-estimation.
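A minimal sketch of what that looks like (the checkpoint and output handling follow my reading of the linked README, so double-check there):

```python
import torch
from diffusers import DiffusionPipeline
from PIL import Image

pipe = DiffusionPipeline.from_pretrained(
    "Bingxin/Marigold",  # original Marigold checkpoint
    custom_pipeline="marigold_depth_estimation",
    torch_dtype=torch.float16,
).to("cuda")

image = Image.open("input.jpg")
output = pipe(image)  # returns the estimated depth map(s); see the README for fields
```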