Adds to the README:

Usage with Infinity

Usage with Docker and Infinity: https://github.com/michaelfeil/infinity

docker run -it --gpus all -v ./data:/app/.cache -p 7997:7997 michaelf34/infinity:0.0.72 \
v2 --model-id BAAI/bge-en-icl --batch-size 8
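
Once the container is up, embeddings can be requested via the OpenAI-compatible /embeddings route. A minimal sketch (the input text is illustrative):

curl http://0.0.0.0:7997/embeddings \
  -H "Content-Type: application/json" \
  -d '{"model": "BAAI/bge-en-icl", "input": ["This is a test sentence."]}'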

This doesn't work on my macOS 15 with conda:

infinity_emb v2 --port 7997 --model-id BAAI/bge-en-icl --revision "refs/heads/main" --batch-size 8

@qdrddr This model is based on a 7B LLM and will be loaded in fp32 on a Mac. You will need at least 7 × 4 = 28 GB of unified memory for this deployment, plus activation memory. I would recommend 32 GB, so a Mac with an M2 Max would do.
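
As a back-of-the-envelope check of the weight footprint (weights only, parameter count rounded to 7e9):

# fp32: ~7e9 params * 4 bytes ≈ 28 GB; bf16/fp16: ~7e9 * 2 bytes ≈ 14 GB
python3 -c "print('fp32 ~%.0f GB' % (7e9*4/1e9)); print('bf16 ~%.0f GB' % (7e9*2/1e9))"

The bf16 figure is consistent with the ~16 GB reported on the H100 below, once activation memory and runtime overhead are added.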

/workspace# docker run -it --gpus "0,1" -p 7997:7997 michaelf34/infinity:0.0.72 \
v2 --model-id BAAI/bge-en-icl --batch-size 8
Unable to find image 'michaelf34/infinity:0.0.72' locally
0.0.72: Pulling from michaelf34/infinity
aece8493d397: Already exists 
dd4939a04761: Already exists 
b0d7cc89b769: Already exists 
1532d9024b9c: Already exists 
04fc8a31fa53: Already exists 
bcd69f21629b: Pull complete 
47c205b4c52f: Pull complete 
369eb047c1cd: Pull complete 
383e894c0e57: Pull complete 
Digest: sha256:cc88276bc996c8a33332fcd2a3850e2e896d0feff0e4732ad270010898fb5b28
Status: Downloaded newer image for michaelf34/infinity:0.0.72
INFO:     Started server process [1]
INFO:     Waiting for application startup.
INFO     2024-12-30 10:57:27,593 infinity_emb INFO: Creating 1engines: engines=['BAAI/bge-en-icl']                          infinity_server.py:92
INFO     2024-12-30 10:57:27,596 infinity_emb INFO: Anonymized telemetry can be disabled via environment variable                 telemetry.py:30
         `DO_NOT_TRACK=1`.                                                                                                                       
INFO     2024-12-30 10:57:27,601 infinity_emb INFO: model=`BAAI/bge-en-icl` selected, using engine=`torch` and device=`None`   select_model.py:64
INFO     2024-12-30 10:57:27,665 sentence_transformers.SentenceTransformer INFO: Load pretrained SentenceTransformer:  SentenceTransformer.py:216
         BAAI/bge-en-icl                                                                                                                         
WARNING  2024-12-30 10:57:27,696 sentence_transformers.SentenceTransformer WARNING: No sentence-transformers model    SentenceTransformer.py:1508
         found with name BAAI/bge-en-icl. Creating a new one with mean pooling.                                                                  
INFO     2024-12-30 10:58:06,176 infinity_emb INFO: Adding optimizations via Huggingface optimum.                              acceleration.py:56
The class `optimum.bettertransformers.transformation.BetterTransformer` is deprecated and will be removed in a future release.
WARNING  2024-12-30 10:58:06,178 infinity_emb WARNING: BetterTransformer is not available for model: <class                    acceleration.py:67
         'transformers.models.mistral.modeling_mistral.MistralModel'> Continue without bettertransformer modeling code.                          
Asking to truncate to max_length but no maximum length is provided and the model has no predefined maximum length. Default to no truncation.
INFO     2024-12-30 10:58:06,775 infinity_emb INFO: Getting timings for batch_size=8 and avg tokens per sentence=2             select_model.py:97
                 6.24     ms tokenization                                                                                                        
                 27.25    ms inference                                                                                                           
                 0.14     ms post-processing                                                                                                     
                 33.63    ms total                                                                                                               
         embeddings/sec: 237.89                                                                                                                  
INFO     2024-12-30 10:58:07,133 infinity_emb INFO: Getting timings for batch_size=8 and avg tokens per sentence=513          select_model.py:103
                 4.99     ms tokenization                                                                                                        
                 167.21   ms inference                                                                                                           
                 0.25     ms post-processing                                                                                                     
                 172.45   ms total                                                                                                               
         embeddings/sec: 46.39                                                                                                                   
INFO     2024-12-30 10:58:07,135 infinity_emb INFO: model warmed up, between 46.39-237.89 embeddings/sec at batch_size=8      select_model.py:104
INFO     2024-12-30 10:58:07,137 infinity_emb INFO: creating batching engine                                                 batch_handler.py:443
INFO     2024-12-30 10:58:07,139 infinity_emb INFO: ready to batch requests.                                                 batch_handler.py:512
INFO     2024-12-30 10:58:07,141 infinity_emb INFO:                                                                        infinity_server.py:106
                                                                                                                                                 
         ♾️  Infinity - Embedding Inference Server                                                                                                
         MIT License; Copyright (c) 2023-now Michael Feil                                                                                        
         Version 0.0.72                                                                                                                          
                                                                                                                                                 
         Open the Docs via Swagger UI:                                                                                                           
         http://0.0.0.0:7997/docs                                                                                                                
                                                                                                                                                 
         Access all deployed models via 'GET':                                                                                                   
         curl http://0.0.0.0:7997/models                                                                                                         
                                                                                                                                                 
         Visit the docs for more information:                                                                                                    
         https://michaelfeil.github.io/infinity                                                                                                  
                                                                                                                                                 
                                                                                                                                                 
INFO:     Application startup complete.
INFO:     Uvicorn running on http://0.0.0.0:7997 (Press CTRL+C to quit)
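
To sanity-check the deployment, one can request an embedding and inspect the vector length; a hedged sketch assuming the OpenAI-style response shape, where 4096 is the expected dimensionality given the Mistral-7B backbone:

curl -s http://0.0.0.0:7997/embeddings \
  -H "Content-Type: application/json" \
  -d '{"model": "BAAI/bge-en-icl", "input": ["hello world"]}' \
  | python3 -c "import json, sys; d = json.load(sys.stdin); print(len(d['data'][0]['embedding']))"   # expect 4096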

Just tested on an H100 80GB; it works. bfloat16 uses ~16 GB with batch size 8.
