---
inference: true
license: apache-2.0
datasets:
- bitext/customer-support-intent-dataset
pipeline_tag: text-generation
---


# longchat-7b-qlora-customer-support Model Card

This repo contains the 4-bit LoRA (low-rank) adapter weights for the [longchat-7b-16k model](https://huggingface.co/lmsys/longchat-7b-16k), fine-tuned on [Bitext's customer support domain dataset](https://huggingface.co/datasets/bitext/customer-support-intent-dataset).

The Supervised Fine-Tuning (SFT) method is based on the [QLoRA paper](https://arxiv.org/abs/2305.14314), using 🤗 PEFT adapters, transformers, and bitsandbytes.
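
The training script and hyperparameters are not included in this repo; the sketch below only illustrates the kind of QLoRA setup the paper describes, and the `LoraConfig` values (r, alpha, target modules) are illustrative placeholders rather than the ones used to train this adapter.

```ipython
import torch
from transformers import AutoConfig, AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig
from llama_condense_monkey_patch import replace_llama_with_condense
from peft import LoraConfig, get_peft_model, prepare_model_for_kbit_training

base_model_id = "lmsys/longchat-7b-16k"

# apply the RoPE condense patch required by the base model (see the install notes below)
config = AutoConfig.from_pretrained(base_model_id)
replace_llama_with_condense(config.rope_condense_ratio)
tokenizer = AutoTokenizer.from_pretrained(base_model_id, use_fast=False)

# load the frozen base model in 4-bit NF4, as described in the QLoRA paper
nf4_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_use_double_quant=True,
    bnb_4bit_compute_dtype=torch.bfloat16,
)
base_model = AutoModelForCausalLM.from_pretrained(
    base_model_id,
    quantization_config=nf4_config,
    device_map="auto",
    trust_remote_code=True,
)
base_model = prepare_model_for_kbit_training(base_model)

# attach trainable low-rank adapters; r, lora_alpha, and target_modules are
# illustrative placeholders, not necessarily the values used for this checkpoint
lora_config = LoraConfig(
    r=16,
    lora_alpha=32,
    lora_dropout=0.05,
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj"],
    bias="none",
    task_type="CAUSAL_LM",
)
model = get_peft_model(base_model, lora_config)
model.print_trainable_parameters()  # only the adapter weights are trainable
```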


## Model details

**Model type:**
longchat-7b-qlora-customer-support is a 4-bit LoRA (low-rank) adapter, supervised fine-tuned on top of the [longchat-7b-16k model](https://huggingface.co/lmsys/longchat-7b-16k) with [Bitext's customer support domain dataset](https://huggingface.co/datasets/bitext/customer-support-intent-dataset).

It is a decoder-only causal language model.

**Language:** 
English

**License:** 
apache-2.0, inherited from the [base model](https://huggingface.co/lmsys/longchat-7b-16k) and the [dataset](https://huggingface.co/datasets/bitext/customer-support-intent-dataset).

**Base Model:** 
lmsys/longchat-7b-16k 

**Dataset:**
bitext/customer-support-intent-dataset

**GPU Memory Consumption:**
~6 GB of GPU memory in 4-bit mode with both models loaded (base + QLoRA adapter)


## Install dependency packages

```shell
pip install -r requirements.txt
```

Per the [base model instructions](https://huggingface.co/lmsys/longchat-7b-16k), the [llama_condense_monkey_patch.py file](https://github.com/lm-sys/FastChat/blob/main/fastchat/model/llama_condense_monkey_patch.py) is needed to load the base model properly. This file is already included in this repo.


## Load the model in 4-bit mode

```ipython

from transformers import AutoConfig, AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig 
from llama_condense_monkey_patch import replace_llama_with_condense
from peft import PeftConfig
from peft import PeftModel
import torch

## config device params & load model
peft_model_id = "mingkuan/longchat-7b-qlora-customer-support"
base_model_id = "lmsys/longchat-7b-16k"

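# apply the RoPE condense monkey patch required by longchat-7b-16k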
config = AutoConfig.from_pretrained(base_model_id)
replace_llama_with_condense(config.rope_condense_ratio)
tokenizer = AutoTokenizer.from_pretrained(base_model_id, use_fast=False)

kwargs = {"torch_dtype": torch.float16}
kwargs["device_map"] = "auto"
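# 4-bit NF4 quantization with double quantization and bfloat16 compute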
nf4_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_use_double_quant=True,
    bnb_4bit_compute_dtype=torch.bfloat16
)
model = AutoModelForCausalLM.from_pretrained(
    base_model_id,
    return_dict=True, 
    trust_remote_code=True, 
    quantization_config=nf4_config,
    load_in_4bit=True,
    **kwargs
)
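# attach the fine-tuned LoRA adapter weights on top of the quantized base model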
model = PeftModel.from_pretrained(model, peft_model_id)
```
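
As a rough sanity check of the memory figure quoted above, you can inspect how much GPU memory is allocated once both models are loaded (a sketch; exact numbers vary with hardware and library versions, and generation adds activations and KV-cache on top):

```ipython
import torch

# weight memory allocated after loading the quantized base model plus adapter
if torch.cuda.is_available():
    allocated = torch.cuda.memory_allocated() / 1024**3
    print(f"GPU memory allocated after loading: {allocated:.2f} GB")
```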

## Run inference with the model

```ipython

def getLLMResponse(prompt):
    device = "cuda"
    input_ids = tokenizer(prompt, return_tensors='pt').input_ids.to(device)
    output = model.generate(inputs=input_ids, do_sample=True, temperature=0.5, max_new_tokens=256)
    promptLen = len(prompt)
    response = tokenizer.decode(output[0], skip_special_tokens=True)[promptLen:] ## omit the user input part
    return response

query = 'help me to setup my new shipping address.'
response = getLLMResponse(generate_prompt(query)) 
print(f'\nUserInput:{query}\n\nLLM:\n{response}\n\n')

```
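
Note that `generate_prompt` is not defined in the snippet above and its exact template is not included in this card. A minimal sketch, assuming a simple instruction-style wrapper (adjust it to whatever prompt format the adapter was actually trained with):

```ipython
def generate_prompt(query):
    # hypothetical instruction-style template; the actual prompt format used
    # during fine-tuning may differ
    return (
        "Below is a customer support request. Respond with a JSON object "
        "containing the category, intent, and answer.\n\n"
        f"### Input:\n{query}\n\n### Response:\n"
    )
```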

Inference Output:
```json
{
"category": "SHIPPING",
"intent": "setup_new_shipping_address",
"answer": "Sure, I can help you with that. Can you please provide me your full name, current shipping address, and the new shipping address you would like to set up?"
}
```
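
Because the model answers with a JSON-formatted string, the response can be parsed for downstream routing, e.g. by category or intent (a minimal sketch; add error handling in case the model emits malformed JSON):

```ipython
import json

parsed = json.loads(response)
print(parsed["category"])   # e.g. "SHIPPING"
print(parsed["intent"])     # e.g. "setup_new_shipping_address"
print(parsed["answer"])
```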