GENOME: LoRA Expert Models
This repository contains 10 expert models fine-tuned via low-rank adaptation (LoRA) on 10 distinct domains extracted from the Tulu-v2-SFT-mixture dataset. Our base model is google/gemma-2-2b-it, and all expert models were trained using the llama-factory framework on an 8×A100-80GB GPU setup. Our goal is to contribute to the open-source community by sharing these domain-specific experts.
Experimental Setup
- Base Model: google/gemma-2-2b-it
- Dataset: 10 subsets from Tulu-v2-SFT-mixture
- Fine-tuning Framework: llama-factory
- Adaptation Technique: LoRA (see the adapter-loading sketch after this list)
- Training Hardware: 8×A100-80GB GPUs
- Note: Deploying a 2B model requires only about 12 GB of VRAM. For optimal performance, we recommend an RTX 3090 (24 GB) or a comparable GPU.
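Each expert is a standard LoRA adapter on top of google/gemma-2-2b-it, so it can also be loaded outside of vLLM. Below is a minimal sketch using transformers and peft; the adapter ID is a hypothetical placeholder, so substitute the path or Hub ID of the expert you downloaded.

import torch
from transformers import AutoModelForCausalLM, AutoTokenizer
from peft import PeftModel

base_id = "google/gemma-2-2b-it"
adapter_id = "path/or/hub-id-of-one-expert"  # placeholder: point at the expert adapter you downloaded

# Load the base model once, then attach the expert adapter on top of it.
tokenizer = AutoTokenizer.from_pretrained(base_id)
base_model = AutoModelForCausalLM.from_pretrained(base_id, torch_dtype=torch.bfloat16, device_map="auto")
model = PeftModel.from_pretrained(base_model, adapter_id)

inputs = tokenizer("Tell me about the impact of AI in healthcare.", return_tensors="pt").to(model.device)
outputs = model.generate(**inputs, max_new_tokens=64)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))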
A visualization of performance ranks across the evaluation datasets shows that each expert model excels in its respective domain. Because vLLM supports dynamic LoRA switching, the experts can be swapped in and out of a single deployment with minimal computational overhead, keeping serving costs low.
Usage Instructions
Below is an example deployment script that shows how to use vLLM to serve the base model along with the LoRA weights on a single GPU (adapted from the original multi-GPU script). Make sure to adjust the parameters (such as model path and log directory) to suit your environment.
Step 1. Deploying the Base Model on a Single GPU (or more)
Save the following script as deploy_single_gpu.sh and modify the placeholders accordingly:
#!/bin/bash
export VLLM_LOGGING_LEVEL=DEBUG
export VLLM_ALLOW_RUNTIME_LORA_UPDATING=True
export VLLM_ALLOW_LONG_MAX_MODEL_LEN=1
# Specify your model path here (this can be a local path or a Hugging Face Hub path)
MODEL="input your model path here"
# Set the maximum number of LoRAs
MAX_LORAS=20
# Name of the subdirectory under vllm_logs/ where this run's logs are written
ROOT="input your log dir here"
# Maximum LoRA rank
MAX_LORA_RANK=16
# Specify the port for the API server (single GPU deployment requires only one port)
PORT=9112
echo "Deploying model $MODEL with $MAX_LORAS LoRAs on a single GPU"
echo "Starting API server on port $PORT..."
# Create the log directory if it doesn't exist
mkdir -p vllm_logs/$ROOT
COMMON_ARGS="--model $MODEL \
--trust-remote-code \
--enable-lora \
--seed 42 \
--max-lora-rank $MAX_LORA_RANK \
--gpu-memory-utilization 0.95 \
--max-loras $MAX_LORAS \
--max-cpu-loras $MAX_LORAS \
--disable-sliding-window \
--max-model-len 8192"
# Single GPU deployment: use only GPU 0
CUDA_VISIBLE_DEVICES=0 nohup python -m vllm.entrypoints.openai.api_server \
$COMMON_ARGS \
--port $PORT > vllm_logs/$ROOT/port_1.log 2>&1 &
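After the server starts, you can verify it is reachable before loading any adapters. A quick check against the OpenAI-compatible endpoint (assuming the port configured above) looks like this:

import requests

# The server above listens on port 9112; adjust if you changed PORT.
base_url = "http://localhost:9112/v1"
resp = requests.get(f"{base_url}/models")
print(resp.status_code)  # 200 once the server is up
print(resp.json())       # lists the base model (and any loaded adapters)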
Step 2. Loading and Unloading LoRA Adapters Dynamically
vLLM supports online LoRA switching, allowing seamless adaptation of different expert models with minimal computational overhead.
- Download the LoRA weights and store them under /lora/*.
- Use the following Python code to load and unload LoRA adapters dynamically:
import requests
import time
from loguru import logger

def online_load_lora(base_url: str, lora_name: str, lora_path: str):
    counter = 1
    while True:
        try:
            response = requests.post(
                f"{base_url}/load_lora_adapter",
                json={
                    "lora_name": lora_name,
                    "lora_path": lora_path
                }
            )
            time.sleep(3)
            assert response.status_code == 200, f"Failed to load LORA: {response.text}"
            break
        except Exception as e:
            logger.warning(f"Load LORA Error: {e}, retrying in {min(counter, 10)} seconds ...")
            time.sleep(min(counter, 10))
            counter += 1
            continue

def online_unload_lora(base_url: str, lora_name: str):
    while True:
        try:
            response = requests.post(
                f"{base_url}/unload_lora_adapter",
                json={
                    "lora_name": lora_name
                }
            )
            assert response.status_code == 200, f"Failed to unload LORA: {response.text}"
            break
        except Exception as e:
            logger.warning(f"Unload LORA Error: {e}, retrying ...")
            time.sleep(1)
            continue
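For example, to hot-swap between two experts on the server deployed in Step 1 (the adapter names and paths below are placeholders for the weights you downloaded):

base_url = "http://localhost:9112/v1"

# Load one expert, query it (see Step 3), then release it before loading another.
online_load_lora(base_url, lora_name="science_expert", lora_path="/lora/science_expert")
# ... run requests against "science_expert" here ...
online_unload_lora(base_url, lora_name="science_expert")

online_load_lora(base_url, lora_name="code_expert", lora_path="/lora/code_expert")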
Step 3. Using the OpenAI SDK to Access the Deployed LoRA Models
Once a LoRA adapter is loaded, you can interact with it through the OpenAI SDK. Below is a minimal example:

import openai

def query_lora_model(base_url: str, lora_name: str, prompt: str):
    # vLLM's OpenAI-compatible server accepts any API key unless --api-key is set.
    client = openai.OpenAI(base_url=base_url, api_key="EMPTY")
    response = client.completions.create(
        model=lora_name,
        prompt=prompt,
        max_tokens=100
    )
    return response.choices[0].text

# Example usage
base_url = "http://localhost:9112/v1"
lora_name = "example_lora"
prompt = "Tell me about the impact of AI in healthcare."
response_text = query_lora_model(base_url, lora_name, prompt)
print(response_text)
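Since google/gemma-2-2b-it is an instruction-tuned model, the chat completions endpoint may be the more natural interface. Here is a sketch against the same server, reusing the placeholder adapter name from above:

import openai

client = openai.OpenAI(base_url="http://localhost:9112/v1", api_key="EMPTY")
chat = client.chat.completions.create(
    model="example_lora",  # the adapter name registered in Step 2
    messages=[{"role": "user", "content": "Tell me about the impact of AI in healthcare."}],
    max_tokens=100,
)
print(chat.choices[0].message.content)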
Related Projects
This repository is associated with the GENOME project. We welcome community feedback and contributions to help further open-source AI development.