
How does Hugging Face have so many hosted APIs running at once?

#132
by mishavee - opened

How does Hugging Face have so many hosted APIs running at once? There are literally a hundred thousand or even more models hosted. Do they load the models when needed, taking turns using available resources? If I were to have only BLOOM loaded, would there be a significant reduction in the time it takes to get a result vs. through the hosted API? Also, are many requests coming in concurrently to the model in the hosted API, slowing it down, or are requests rare in the hosted API?

BigScience Workshop org

How does Hugging Face have so many hosted APIs running at once? There are literally a hundred thousand or even more models hosted. Do they load the models when needed, taking turns using available resources?

I don't know specifically. Though let's try to keep discussion on this repository BLOOM-related; I think this question would be better suited somewhere else, addressed to the people responsible for infrastructure at Hugging Face.

If I were to have only BLOOM loaded, would there be a significant reduction in the time it takes to get a result vs. through the hosted API?

If you were to deploy BLOOM yourself, you'd probably see much lower latency, as you could decide where to deploy it and would have much more fine-grained control over what you want (which inference algorithm, removing the length limit, etc.). We didn't do too many specific things; @Narsil has written a great blog post about how we deployed it: https://huggingface.co/blog/bloom-inference-optimization and the code is available here: https://github.com/huggingface/transformers_bloom_parallel
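
If you want a starting point before diving into that repo, here is a minimal sketch of loading the public checkpoint with plain `transformers` (illustrative only, not the optimized parallel deployment from the blog post; it assumes several large GPUs and that `accelerate` is installed so `device_map="auto"` can shard the weights):

```python
# Minimal sketch: load bigscience/bloom with plain transformers and generate.
# NOT the optimized deployment described in the blog post above.
from transformers import AutoModelForCausalLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("bigscience/bloom")
model = AutoModelForCausalLM.from_pretrained(
    "bigscience/bloom",
    device_map="auto",    # shard the ~176B parameters across the available GPUs
    torch_dtype="auto",   # keep the checkpoint's native precision (bfloat16)
)

inputs = tokenizer("The capital of France is", return_tensors="pt").to(model.device)
output = model.generate(**inputs, max_new_tokens=20)
print(tokenizer.decode(output[0], skip_special_tokens=True))
```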

Also, are many requests coming in concurrently to the model in the hosted API, slowing it down, or are requests rare in the hosted API?

Yes, requests come in concurrently; I don't think we provide specific numbers about how the API is behaving. Is your worry that your requests are going to be too slow? We make our best effort so that you get the fastest inference, but it's never going to beat deploying it yourself.

BigScience Workshop org

Do they load the models when needed, taking turns using available resources?

Exactly, it's done that way. We use Kubernetes and share resources as much as possible (including GPUs, which comes with its own share of issues).

If I were to have only BLOOM loaded, would there be a significant reduction in the time it takes to get a result vs. through the hosted API?

Not necessarily. First of all, you would have to recreate some of the engineering. @TimeRobber gave you the correct pointers on where to get started, but it might require additional work and at least understanding what we have done. That being said, once you're in control of the code, you do have a lot more flexibility to adapt it to your own use case. For instance, the API is optimized to run on demand on a wide variety of requests and to do dynamic batching. This might not be useful to you.
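
To give an idea of what "dynamic batching" means here, a toy sketch (hypothetical code, not the actual API server): incoming requests are buffered for a few milliseconds so several of them can be run through the model as one batch.

```python
# Toy illustration of dynamic batching (hypothetical, NOT the real API code):
# requests are buffered for a short window, then run through the model together.
import asyncio

MAX_BATCH_SIZE = 8
MAX_WAIT_S = 0.05  # how long a request may wait for others to join its batch

def run_model(prompts):
    # Placeholder for a real batched generate() call on the model.
    return [p + " ...generated text" for p in prompts]

async def batching_loop(queue):
    while True:
        # Wait for the first request, then collect more until the batch is full
        # or the waiting window has elapsed.
        batch = [await queue.get()]
        deadline = asyncio.get_running_loop().time() + MAX_WAIT_S
        while len(batch) < MAX_BATCH_SIZE:
            remaining = deadline - asyncio.get_running_loop().time()
            if remaining <= 0:
                break
            try:
                batch.append(await asyncio.wait_for(queue.get(), remaining))
            except asyncio.TimeoutError:
                break
        outputs = run_model([prompt for prompt, _ in batch])
        for (_, future), text in zip(batch, outputs):
            future.set_result(text)

async def handle_request(queue, prompt):
    # What an individual request handler does: enqueue the prompt, await the result.
    future = asyncio.get_running_loop().create_future()
    await queue.put((prompt, future))
    return await future

async def main():
    queue = asyncio.Queue()
    asyncio.create_task(batching_loop(queue))
    answers = await asyncio.gather(*(handle_request(queue, f"prompt {i}") for i in range(5)))
    print(answers)

asyncio.run(main())
```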

We have gone to great lengths to make the API as fast as possible.
BLOOM is now generously sponsored by AzureML, which is a huge plus for the community.

If you're seeing any bad latency or overload, please mention it here; we did move recently, so there might be some specific quirks and options we haven't properly tuned yet.

What exactly do you mean by "adapt to your own case"? Do you mean fine-tune? How would fine-tuning make it work faster?

Are you saying that the fact that BLOOM needs to be actually loaded into the shared resources doesn't make a request much longer? I understand that if it is already loaded there would still be a delay, though not as big, since other people are sending requests; but if BLOOM hasn't been loaded yet, are you still saying "not necessarily"? Is that true for the scenario where it hasn't been loaded?

BigScience Workshop org

What exactly do you mean by "adapt to your own case"? Do you mean fine-tune? How would fine-tuning make it work faster?

I mean if you are running big batches, or just small requests, or always with the same parameters, etc., then your specs are different from the API's (which aims to be generalist).
When you have a specific enough use case, some optimizations can probably be made to get things running faster. For instance, batching improves throughput a lot, but it usually increases latency, so you could batch aggressively if you want massive throughput.

Are you saying that the fact that BLOOM needs to be actually loaded into the shared resources doesn't make a request much longer? I understand that if it is already loaded there would still be a delay, though not as big, since other people are sending requests; but if BLOOM hasn't been loaded yet, are you still saying "not necessarily"? Is that true for the scenario where it hasn't been loaded?

I am not sure I understand exactly.
BLOOM is really big, so it's really slow to load: it takes ~1mn even under optimal conditions. So it's not really realistic to load it "on demand"; it's always up in the API for that reason.

BLOOM is loaded and it's ALONE on the hardware hosting it, so it's not in a shared pool of resources, and I don't think anything would change if you were to host it on dedicated resources yourself.

Does that answer your question?

partly, ty
what is mn?

what is batching and how do you do it?

BigScience Workshop org

what is mn?

Minute

what is batching and how do you do it?

It's essentially the ability to pass multiple samples through the model at once, instead of processing them individually. It's a core concept in deep learning, as it significantly increases the speed at which you can process entire datasets.

ty,

can you give me an example of passing multiple samples at the same time, for example with paraphrasing or anything? How is this possible? I thought it just continues text from your input. How could it continue text from several points?

BigScience Workshop org

At this point, I would suggest you try out lectures and books.
For instance this: https://www.youtube.com/watch?v=M6adb1j2jPI
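
And, to make the batching idea above concrete in code, a minimal sketch of batched generation with `transformers` (gpt2 is used as a small, arbitrary stand-in for BLOOM so it runs on modest hardware; the prompts are made up):

```python
# Sketch of batching with transformers: several prompts are tokenized together
# (padded to the same length) and continued in a single generate() call.
from transformers import AutoModelForCausalLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("gpt2")
tokenizer.pad_token = tokenizer.eos_token   # gpt2 has no pad token by default
tokenizer.padding_side = "left"             # pad on the left so generation continues the real text
model = AutoModelForCausalLM.from_pretrained("gpt2")

prompts = [
    "The weather today is",
    "In a shocking turn of events, the team",
    "To bake a good loaf of bread you need",
]

batch = tokenizer(prompts, return_tensors="pt", padding=True)   # one rectangular batch
outputs = model.generate(
    **batch,
    max_new_tokens=20,
    pad_token_id=tokenizer.eos_token_id,
)
for text in tokenizer.batch_decode(outputs, skip_special_tokens=True):
    print(text)
```

Each prompt is padded to the length of the longest one, so the whole batch becomes a single rectangular tensor (which is what the GPU processes efficiently), and the attention mask tells the model to ignore the padding.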

I watched it. I don't understand why having sentences of the same length would optimize the speed.

BigScience Workshop org

I watched it. I don't understand why having sentences of the same length would optimize the speed.

Speed doesn't really mean anything in a webserver context without specifying the speed of what.
There are two core concepts.
Latency: how quickly you get your answer once you've sent your request (the client's view, let's say).
Throughput: how long the server takes to serve N requests (the server's view).

Both are really important.

And GPUs are really good at throughput (you can add more computation to do at once, and it will all be computed in pretty much the same time as a single request).
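
A toy calculation with made-up numbers shows the difference:

```python
# Made-up numbers, just to illustrate the latency/throughput trade-off of batching.
single_request_time = 1.0   # seconds for one request processed alone
batch_size = 8
batch_time = 1.5            # seconds for 8 requests processed together on a GPU

latency_alone = single_request_time       # 1.0 s: the lone request gets its answer quickly
latency_batched = batch_time              # 1.5 s: each request waits for the whole batch

throughput_alone = 1 / single_request_time        # 1.0 requests per second
throughput_batched = batch_size / batch_time      # ~5.3 requests per second

print(f"latency:    {latency_alone:.1f}s -> {latency_batched:.1f}s  (a bit worse per request)")
print(f"throughput: {throughput_alone:.1f} -> {throughput_batched:.1f} req/s (much better overall)")
```
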
Now, this is a vast topic, and as I said before, just go and check out blogs, books and videos that explain it; answering it fully here is not really feasible.

Also, you can ask questions here: https://discuss.huggingface.co/
or on our Discord server: https://discord.com/invite/JfAtkvEtRb

Please tell me: if I fine-tuned BLOOM for paraphrasing like this
sentence1: (sentence)
sentence2: (sentence)

How could I paraphrase more than one sentence at the same time?
