Deploying the Helper LLM and Embedding Service
Helper LLM
The purpose of the Helper LLM is to handle auxiliary tasks for information retrieval (IR) systems, such as summarizing documents, splitting documents into propositions, and filtering retrieved documents. To optimize performance and reduce latency, the Helper LLM is typically much smaller than the main LLM (e.g., around 1B parameters), as it is invoked frequently during the IR process. In this project, we use the bloom-1b7 model as the Helper LLM, alongside the TGI (Text Generation Inference) framework for inference.
Deployment
Step 1: Install the TGI framework.
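TGI is Hugging Face's Text Generation Inference server. This project does not prescribe an install method; one common option (an assumption here) is to clone the official repository and build from source so that the text-generation-launcher binary used in Step 2 ends up on your PATH. The prebuilt Docker image is an alternative, though it would change the launch command below.
git clone https://github.com/huggingface/text-generation-inference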
Step 2: Launch the TGI service:
CUDA_VISIBLE_DEVICES=YOUR_GPU_ID text-generation-launcher --model-id PATH_TO_YOUR_HELPER_LLM_CHECKPOINT --port YOUR_PORT --num-shard 1 --disable-custom-kernels
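Once the service is running, you can smoke-test it against TGI's /generate endpoint. The snippet below is a minimal sketch; the host, port, and prompt are placeholders, not values from this project.
# Minimal smoke test for the Helper LLM TGI service (host/port are placeholders).
import requests

resp = requests.post(
    "http://YOUR_HELPER_LLM_IP:YOUR_PORT/generate",
    json={
        "inputs": "Summarize: The quick brown fox jumps over the lazy dog.",
        "parameters": {"max_new_tokens": 32},
    },
    timeout=30,
)
resp.raise_for_status()
print(resp.json()["generated_text"])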
Step 3: Configure service URLs.
Update the service_url_config.json file. Replace the values for the following keys with the IP address and port of your Helper LLM instance (see the sketch after this list):
- concept_perspective_generation
- proposition_generation
- concept_identification
- filter_doc
- dialog_summarization
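The exact schema of service_url_config.json is defined by the repository itself; the following is only a rough, hypothetical sketch of what the Helper LLM entries might look like (whether the values carry an http:// prefix depends on the actual file):
{
  "concept_perspective_generation": "http://YOUR_HELPER_LLM_IP:YOUR_PORT",
  "proposition_generation": "http://YOUR_HELPER_LLM_IP:YOUR_PORT",
  "concept_identification": "http://YOUR_HELPER_LLM_IP:YOUR_PORT",
  "filter_doc": "http://YOUR_HELPER_LLM_IP:YOUR_PORT",
  "dialog_summarization": "http://YOUR_HELPER_LLM_IP:YOUR_PORT"
}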
Text Embedding Model
The Text Embedding Model is used to compute text embeddings that support dense retrieval. In this project, we use the GTE large model for embedding generation.
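For reference, GTE-large embeddings can also be computed directly with sentence_transformers. The Hugging Face model ID thenlper/gte-large used below is an assumption; substitute whatever local checkpoint the service actually loads.
from sentence_transformers import SentenceTransformer

# Load the GTE-large encoder (model ID is an assumption; use your local checkpoint if different).
model = SentenceTransformer("thenlper/gte-large")

# Encode a query and a passage into dense vectors for retrieval (1024-dimensional for GTE-large).
embeddings = model.encode(
    [
        "what is dense retrieval?",
        "Dense retrieval matches queries and documents in a shared embedding space.",
    ],
    normalize_embeddings=True,
)
print(embeddings.shape)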
Deployment
Step 1: Install the required dependencies:
Install cherrypy and sentence_transformers.
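For example, both can be installed with pip (the second package is published on PyPI as sentence-transformers):
pip install cherrypy sentence-transformers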
Step 2: Launch the embedding service:
python text_embed_service.py --model gte_large --gpu YOUR_GPU_ID --port YOUR_PORT --batch_size 128
Step 3: Configure service URLs:
Update the service_url_config.json file. Replace the value of sentence_encoding with the IP address and port of your text embedding service instance.