GENOME: LoRA Expert Models
This repository contains 10 expert models fine-tuned via low-rank adaptation (LoRA) on 10 distinct domains extracted from the Tulu-v2-SFT-mixture dataset. Our base model is google/gemma-2-2b-it, and all expert models were trained using the llama-factory framework on an 8×A100-80GB GPU setup. Our goal is to contribute to the open-source community by sharing these domain-specific experts.
Experimental Setup
- Base Model: google/gemma-2-2b-it
- Dataset: 10 subsets from Tulu-v2-SFT-mixture
- Fine-tuning Framework: llama-factory
- Adaptation Technique: LoRA (see the adapter-loading sketch after this list)
- Training Hardware: 8×A100-80GB GPUs
- Note: Deploying a 2B model requires only about 12 GB of VRAM. For optimal performance, we recommend an RTX 3090 (24 GB) or a comparable GPU.
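Each expert is a standard LoRA adapter on top of google/gemma-2-2b-it, so it can also be loaded outside of vLLM. Below is a minimal sketch using transformers and peft; the adapter ID is a hypothetical placeholder, so substitute the path or Hub ID of the expert you downloaded.

import torch
from transformers import AutoModelForCausalLM, AutoTokenizer
from peft import PeftModel

base_id = "google/gemma-2-2b-it"
adapter_id = "path/or/hub-id-of-one-expert"  # placeholder: point at the expert adapter you downloaded

# Load the base model once, then attach the expert adapter on top of it.
tokenizer = AutoTokenizer.from_pretrained(base_id)
base_model = AutoModelForCausalLM.from_pretrained(base_id, torch_dtype=torch.bfloat16, device_map="auto")
model = PeftModel.from_pretrained(base_model, adapter_id)

inputs = tokenizer("Tell me about the impact of AI in healthcare.", return_tensors="pt").to(model.device)
outputs = model.generate(**inputs, max_new_tokens=64)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))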
A visualization of performance ranks across the evaluation datasets shows that each expert model excels in its respective domain. Because vLLM supports dynamic LoRA switching, the experts can be swapped in and out of a single deployment with minimal computational overhead, keeping serving costs low.
Usage Instructions
Below is an example deployment script that shows how to use vLLM to serve the base model along with the LoRA weights on a single GPU (adapted from the original multi-GPU script). Make sure to adjust the parameters (such as model path and log directory) to suit your environment.
Step 1. Deploying the Base Model on a Single GPU (or more)
Save the following script as deploy_single_gpu.sh and modify the placeholders accordingly:
#!/bin/bash
export VLLM_LOGGING_LEVEL=DEBUG
export VLLM_ALLOW_RUNTIME_LORA_UPDATING=True
export VLLM_ALLOW_LONG_MAX_MODEL_LEN=1
# Specify your model path here (this can be a local path or a Hugging Face Hub path)
MODEL="input your model path here"
# Set the maximum number of LoRAs
MAX_LORAS=20
# Name of the subdirectory under vllm_logs/ where this run's logs are written
ROOT="input your log dir here"
# Maximum LoRA rank
MAX_LORA_RANK=16
# Specify the port for the API server (single GPU deployment requires only one port)
PORT=9112
echo "Deploying model $MODEL with $MAX_LORAS LoRAs on a single GPU"
echo "Starting API server on port $PORT..."
# Create the log directory if it doesn't exist
mkdir -p vllm_logs/$ROOT
COMMON_ARGS="--model $MODEL \
--trust-remote-code \
--enable-lora \
--seed 42 \
--max-lora-rank $MAX_LORA_RANK \
--gpu-memory-utilization 0.95 \
--max-loras $MAX_LORAS \
--max-cpu-loras $MAX_LORAS \
--disable-sliding-window \
--max-model-len 8192"
# Single GPU deployment: use only GPU 0
CUDA_VISIBLE_DEVICES=0 nohup python -m vllm.entrypoints.openai.api_server \
$COMMON_ARGS \
--port $PORT > vllm_logs/$ROOT/port_1.log 2>&1 &
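After the server starts, you can verify it is reachable before loading any adapters. A quick check against the OpenAI-compatible endpoint (assuming the port configured above) looks like this:

import requests

# The server above listens on port 9112; adjust if you changed PORT.
base_url = "http://localhost:9112/v1"
resp = requests.get(f"{base_url}/models")
print(resp.status_code)  # 200 once the server is up
print(resp.json())       # lists the base model (and any loaded adapters)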
Step 2. Loading and Unloading LoRA Adapters Dynamically
vLLM supports online LoRA switching, allowing seamless adaptation of different expert models with minimal computational overhead.
- Download the LoRA weights and store them under /lora/*.
- Use the following Python code to load and unload LoRA adapters dynamically:
import requests
import time
from loguru import logger

def online_load_lora(base_url: str, lora_name: str, lora_path: str):
    counter = 1
    while True:
        try:
            response = requests.post(
                f"{base_url}/load_lora_adapter",
                json={
                    "lora_name": lora_name,
                    "lora_path": lora_path
                }
            )
            time.sleep(3)
            assert response.status_code == 200, f"Failed to load LORA: {response.text}"
            break
        except Exception as e:
            logger.warning(f"Load LORA Error: {e}, retrying in {min(counter, 10)} seconds ...")
            time.sleep(min(counter, 10))
            counter += 1
            continue

def online_unload_lora(base_url: str, lora_name: str):
    while True:
        try:
            response = requests.post(
                f"{base_url}/unload_lora_adapter",
                json={
                    "lora_name": lora_name
                }
            )
            assert response.status_code == 200, f"Failed to unload LORA: {response.text}"
            break
        except Exception as e:
            logger.warning(f"Unload LORA Error: {e}, retrying ...")
            time.sleep(1)
            continue
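For example, to hot-swap between two experts on the server deployed in Step 1 (the adapter names and paths below are placeholders for the weights you downloaded):

base_url = "http://localhost:9112/v1"

# Load one expert, query it (see Step 3), then release it before loading another.
online_load_lora(base_url, lora_name="science_expert", lora_path="/lora/science_expert")
# ... run requests against "science_expert" here ...
online_unload_lora(base_url, lora_name="science_expert")

online_load_lora(base_url, lora_name="code_expert", lora_path="/lora/code_expert")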
Step 3. Using the OpenAI SDK to Access the Deployed LoRA Models
Once a LoRA adapter is loaded, you can interact with it through the OpenAI SDK. Below is a minimal example:

import openai

def query_lora_model(base_url: str, lora_name: str, prompt: str):
    # vLLM's OpenAI-compatible server accepts any API key unless --api-key is set.
    client = openai.OpenAI(base_url=base_url, api_key="EMPTY")
    response = client.completions.create(
        model=lora_name,
        prompt=prompt,
        max_tokens=100
    )
    return response.choices[0].text

# Example usage
base_url = "http://localhost:9112/v1"
lora_name = "example_lora"
prompt = "Tell me about the impact of AI in healthcare."
response_text = query_lora_model(base_url, lora_name, prompt)
print(response_text)
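Since google/gemma-2-2b-it is an instruction-tuned model, the chat completions endpoint may be the more natural interface. Here is a sketch against the same server, reusing the placeholder adapter name from above:

import openai

client = openai.OpenAI(base_url="http://localhost:9112/v1", api_key="EMPTY")
chat = client.chat.completions.create(
    model="example_lora",  # the adapter name registered in Step 2
    messages=[{"role": "user", "content": "Tell me about the impact of AI in healthcare."}],
    max_tokens=100,
)
print(chat.choices[0].message.content)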
Related Projects
This repository is associated with the GENOME project. We welcome community feedback and contributions to help further open-source AI development.