Deploying the Helper LLM and Embedding Service
Helper LLM
The purpose of the Helper LLM is to handle auxiliary tasks for information retrieval (IR) systems, such as summarizing documents, splitting documents into propositions, and filtering retrieved documents. To optimize performance and reduce latency, the Helper LLM is typically much smaller than the main LLM (e.g., around 1B parameters), as it is invoked frequently during the IR process. In this project, we use the bloom-1b7 model as the Helper LLM, alongside the TGI (Text Generation Inference) framework for inference.
Deployment
Step 1: Install the TGI framework.
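TGI is Hugging Face's Text Generation Inference server. This project does not prescribe an install method; one common option (an assumption here) is to clone the official repository and build from source so that the text-generation-launcher binary used in Step 2 ends up on your PATH. The prebuilt Docker image is an alternative, though it would change the launch command below.
git clone https://github.com/huggingface/text-generation-inference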
Step 2: Launch the TGI service:
CUDA_VISIBLE_DEVICES=YOUR_GPU_ID text-generation-launcher --model-id PATH_TO_YOUR_HELPER_LLM_CHECKPOINT --port YOUR_PORT --num-shard 1 --disable-custom-kernels
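Once the service is running, you can smoke-test it against TGI's /generate endpoint. The snippet below is a minimal sketch; the host, port, and prompt are placeholders, not values from this project.
# Minimal smoke test for the Helper LLM TGI service (host/port are placeholders).
import requests

resp = requests.post(
    "http://YOUR_HELPER_LLM_IP:YOUR_PORT/generate",
    json={
        "inputs": "Summarize: The quick brown fox jumps over the lazy dog.",
        "parameters": {"max_new_tokens": 32},
    },
    timeout=30,
)
resp.raise_for_status()
print(resp.json()["generated_text"])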
Step 3: Configure service URLs.
Update the service_url_config.json file. Replace the values for the following keys with the IP address and port of your Helper LLM instance (see the sketch after this list):
- concept_perspective_generation
- proposition_generation
- concept_identification
- filter_doc
- dialog_summarization
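The exact schema of service_url_config.json is defined by the repository itself; the following is only a rough, hypothetical sketch of what the Helper LLM entries might look like (whether the values carry an http:// prefix depends on the actual file):
{
  "concept_perspective_generation": "http://YOUR_HELPER_LLM_IP:YOUR_PORT",
  "proposition_generation": "http://YOUR_HELPER_LLM_IP:YOUR_PORT",
  "concept_identification": "http://YOUR_HELPER_LLM_IP:YOUR_PORT",
  "filter_doc": "http://YOUR_HELPER_LLM_IP:YOUR_PORT",
  "dialog_summarization": "http://YOUR_HELPER_LLM_IP:YOUR_PORT"
}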
Text Embedding Model
The Text Embedding Model is used to compute text embeddings that support dense retrieval. In this project, we use the GTE large model for embedding generation.
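For reference, GTE-large embeddings can also be computed directly with sentence_transformers. The Hugging Face model ID thenlper/gte-large used below is an assumption; substitute whatever local checkpoint the service actually loads.
from sentence_transformers import SentenceTransformer

# Load the GTE-large encoder (model ID is an assumption; use your local checkpoint if different).
model = SentenceTransformer("thenlper/gte-large")

# Encode a query and a passage into dense vectors for retrieval (1024-dimensional for GTE-large).
embeddings = model.encode(
    [
        "what is dense retrieval?",
        "Dense retrieval matches queries and documents in a shared embedding space.",
    ],
    normalize_embeddings=True,
)
print(embeddings.shape)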
Deployment
Step 1: Install the required dependencies:
Install cherrypy and sentence_transformers.
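For example, both can be installed with pip (the second package is published on PyPI as sentence-transformers):
pip install cherrypy sentence-transformers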
Step 2: Launch the embedding service:
python text_embed_service.py --model gte_large --gpu YOUR_GPU_ID --port YOUR_PORT --batch_size 128
Step 3: Configure service URLs:
Update the service_url_config.json file. Replace the value of sentence_encoding with the IP address and port of your text embedding service instance.