Infinity usage
#9
opened by michaelfeil
Adds to the README:
Usage with Infinity
Usage with Docker and Infinity: https://github.com/michaelfeil/infinity
docker run -it --gpus all -v ./data:/app/.cache -p 7997:7997 michaelf34/infinity:0.0.72 \
v2 --model-id BAAI/bge-en-icl --batch-size 8
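Once the container is up, the server can be queried over HTTP. A minimal Python client sketch, assuming the OpenAI-compatible `POST /embeddings` route on port 7997 (as exposed by the docker command above); the helper names here are illustrative, not part of Infinity itself:

```python
import json
from urllib import request

# Assumption: server started via the docker command above.
BASE_URL = "http://0.0.0.0:7997"

def build_embedding_request(texts, model="BAAI/bge-en-icl"):
    """Build the JSON body for the OpenAI-compatible /embeddings route."""
    return {"model": model, "input": texts}

def embed(texts):
    """POST the request and return the parsed JSON response."""
    body = json.dumps(build_embedding_request(texts)).encode()
    req = request.Request(
        f"{BASE_URL}/embeddings",
        data=body,
        headers={"Content-Type": "application/json"},
    )
    with request.urlopen(req) as resp:
        return json.load(resp)

# Example payload, same shape as `curl -d '{"model": ..., "input": [...]}'`
payload = build_embedding_request(["hello world"])
```

The same request can be sent with curl; the Swagger UI at `/docs` (see the startup banner below) lists the exact routes and schemas.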
This doesn't work on my macOS 15 with conda:
infinity_emb v2 --port 7997 --model-id BAAI/bge-en-icl --revision "refs/heads/main" --batch-size 8
@qdrddr This model is based on a 7B backbone and will be loaded in fp32 on a Mac. You will need at least 7 × 4 = 28 GB of unified memory for the weights alone, plus activation memory. I would recommend 32 GB, so a Mac M2 Max would do.
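The 28 GB figure is just parameter count times bytes per parameter. A quick back-of-the-envelope check (weights only; activations come on top):

```python
def weight_memory_gb(params_billion: float, bytes_per_param: int) -> float:
    """Memory for model weights only, in GB (1e9 params * bytes/param)."""
    return params_billion * bytes_per_param

fp32_gb = weight_memory_gb(7, 4)  # fp32 on a Mac: 28 GB, as stated above
bf16_gb = weight_memory_gb(7, 2)  # bfloat16 (e.g. on CUDA GPUs): 14 GB
```

This is also consistent with the bfloat16 report further down: ~14 GB of weights plus a couple of GB of activations at batch size 8 lands around the observed ~16 GB.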
/workspace# docker run -it --gpus "0,1" -p 7997:7997 michaelf34/infinity:0.0.72 \
v2 --model-id BAAI/bge-en-icl --batch-size 8
Unable to find image 'michaelf34/infinity:0.0.72' locally
0.0.72: Pulling from michaelf34/infinity
aece8493d397: Already exists
dd4939a04761: Already exists
b0d7cc89b769: Already exists
1532d9024b9c: Already exists
04fc8a31fa53: Already exists
bcd69f21629b: Pull complete
47c205b4c52f: Pull complete
369eb047c1cd: Pull complete
383e894c0e57: Pull complete
Digest: sha256:cc88276bc996c8a33332fcd2a3850e2e896d0feff0e4732ad270010898fb5b28
Status: Downloaded newer image for michaelf34/infinity:0.0.72
INFO: Started server process [1]
INFO: Waiting for application startup.
INFO 2024-12-30 10:57:27,593 infinity_emb INFO: Creating 1engines: engines=['BAAI/bge-en-icl'] infinity_server.py:92
INFO 2024-12-30 10:57:27,596 infinity_emb INFO: Anonymized telemetry can be disabled via environment variable telemetry.py:30
`DO_NOT_TRACK=1`.
INFO 2024-12-30 10:57:27,601 infinity_emb INFO: model=`BAAI/bge-en-icl` selected, using engine=`torch` and device=`None` select_model.py:64
INFO 2024-12-30 10:57:27,665 sentence_transformers.SentenceTransformer INFO: Load pretrained SentenceTransformer: SentenceTransformer.py:216
BAAI/bge-en-icl
WARNING 2024-12-30 10:57:27,696 sentence_transformers.SentenceTransformer WARNING: No sentence-transformers model SentenceTransformer.py:1508
found with name BAAI/bge-en-icl. Creating a new one with mean pooling.
INFO 2024-12-30 10:58:06,176 infinity_emb INFO: Adding optimizations via Huggingface optimum. acceleration.py:56
The class `optimum.bettertransformers.transformation.BetterTransformer` is deprecated and will be removed in a future release.
WARNING 2024-12-30 10:58:06,178 infinity_emb WARNING: BetterTransformer is not available for model: <class acceleration.py:67
'transformers.models.mistral.modeling_mistral.MistralModel'> Continue without bettertransformer modeling code.
Asking to truncate to max_length but no maximum length is provided and the model has no predefined maximum length. Default to no truncation.
INFO 2024-12-30 10:58:06,775 infinity_emb INFO: Getting timings for batch_size=8 and avg tokens per sentence=2 select_model.py:97
6.24 ms tokenization
27.25 ms inference
0.14 ms post-processing
33.63 ms total
embeddings/sec: 237.89
INFO 2024-12-30 10:58:07,133 infinity_emb INFO: Getting timings for batch_size=8 and avg tokens per sentence=513 select_model.py:103
4.99 ms tokenization
167.21 ms inference
0.25 ms post-processing
172.45 ms total
embeddings/sec: 46.39
INFO 2024-12-30 10:58:07,135 infinity_emb INFO: model warmed up, between 46.39-237.89 embeddings/sec at batch_size=8 select_model.py:104
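The warm-up throughput numbers above follow directly from batch size and total latency; a quick sanity check of the logged figures:

```python
def embeddings_per_sec(batch_size: int, total_ms: float) -> float:
    """Throughput implied by one batch taking total_ms milliseconds."""
    return batch_size / (total_ms / 1000.0)

short_seq = embeddings_per_sec(8, 33.63)   # ~237.9 emb/s at ~2 tokens/sentence
long_seq = embeddings_per_sec(8, 172.45)   # ~46.4 emb/s at 513 tokens/sentence
```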
INFO 2024-12-30 10:58:07,137 infinity_emb INFO: creating batching engine batch_handler.py:443
INFO 2024-12-30 10:58:07,139 infinity_emb INFO: ready to batch requests. batch_handler.py:512
INFO 2024-12-30 10:58:07,141 infinity_emb INFO: infinity_server.py:106
♾️ Infinity - Embedding Inference Server
MIT License; Copyright (c) 2023-now Michael Feil
Version 0.0.72
Open the Docs via Swagger UI:
http://0.0.0.0:7997/docs
Access all deployed models via 'GET':
curl http://0.0.0.0:7997/models
Visit the docs for more information:
https://michaelfeil.github.io/infinity
INFO: Application startup complete.
INFO: Uvicorn running on http://0.0.0.0:7997 (Press CTRL+C to quit)
Just tested on an H100 80GB; it works. Bfloat16 uses ~16 GB with batch size 8.