File size: 4,431 Bytes
0543aa3
 
 
 
e0b1078
313af0d
 
0543aa3
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
313af0d
 
 
 
 
0543aa3
 
c46926b
313af0d
c46926b
313af0d
0543aa3
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
6f81f69
0543aa3
6f81f69
 
0543aa3
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
158
159
160
161
162
163
164
165
166
167
168
169
170
171
172
173
174
175
176
177
178
179
180
181
182
183
184
185
186
187
188
189
190
191
192
193
194
195
196
197
198
# Content Classification LoRA Adapter for Gemma-2B

A LoRA adapter for unsloth/gemma-2b that determines content indexing suitability using chain-of-thought reasoning.

Used in a pipeline.


## Technical Specifications

### Base Model
- Model: unsloth/gemma-2b
- LoRA Rank: 64
- Target Modules: q_proj, up_proj, down_proj, gate_proj, o_proj, k_proj, v_proj
- Task: CAUSAL_LM
- Dropout: 0
- Alpha: 32

### Input/Output Format

Input XML structure:
```xml
<instruction>Determine true or false if the following content is suitable and should be indexed.</instruction>
<suitable>
  <content>{input_text}</content>
```

Output XML structure:
```xml
  <thinking>{reasoning_process}</thinking>
  <category>{content_type}</category>
  <should_index>{true|false}</should_index>
</suitable>

```

The model then expects an indefinite list of `<suitable> ... </suitable>` that you may not want. But you can use this to do fewshots with incontext learning to correct a mistake or enhance the results.

Your stop token should be `</suitable>`.

## Deployment

### VLLM Server Setup
```bash
export VLLM_ALLOW_RUNTIME_LORA_UPDATING=1
export VLLM_ALLOW_LONG_MAX_MODEL_LEN=1

vllm serve unsloth/gemma-2-2b \
  --gpu-memory-utilization=1 \
  --port 6002 \
  --served-model-name="gemma" \
  --trust-remote-code \
  --max-model-len 8192 \
  --disable-log-requests \
  --enable-lora \
  --lora-modules lora=./dataset/output/unsloth/lora_model \
  --max-lora-rank 64
```

### Processing Pipeline

1. Install Dependencies:
```bash
pip install requests tqdm concurrent.futures
```

2. Run Content Processor:
```bash
python process.py --input corpus.jsonl --output results.jsonl --threads 24
```

### Client Implementation

```python
import requests

def classify_content(text: str, vllm_url: str = "http://localhost:6002/v1/completions") -> dict:
    xml_content = (
        '<instruction>Determine true or false if the following content is '
        'suitable and should be indexed.</instruction>\n'
        '<suitable>\n'
        f'  <content>{text}</content>'
    )
    
    response = requests.post(
        vllm_url,
        json={
            "prompt": xml_content,
            "max_tokens": 6000,
            "temperature": 1,
            "model": "lora",
            "stop": ["</suitable>"]
        },
        timeout=30000
    )
    
    completion = response.json()["choices"][0]["text"]
    
    # Parse XML tags
    import re
    def extract_tag(tag: str) -> str:
        match = re.search(f'<{tag}>(.*?)</{tag}>', completion, re.DOTALL)
        return match.group(1).strip() if match else ""
        
    return {
        "thinking": extract_tag("thinking"),
        "category": extract_tag("category"),
        "should_index": extract_tag("should_index")
    }
```

### Example Usage

```python
text = """Multiservice Tactics, Techniques, and Procedures
for
Nuclear, Biological, and Chemical Aspects of Consequence
Management

TABLE OF CONTENTS..."""

result = classify_content(text)
print(result)
```

Example output:
```json
{
    "thinking": "This is a table of contents for a document, not the actual content.",
    "category": "table of contents",
    "should_index": "false"
}
```

## Batch Processing

The included processor supports parallel processing of JSONL files:

```python
from request_processor import RequestProcessor

processor = RequestProcessor(
    input_file="corpus.jsonl",
    output_file="results.jsonl",
    num_threads=24
)
processor.process_file()
```

Input JSONL format:
```json
{
    "pid": "document_id",
    "docid": "path/to/source",
    "content": "document text",
    "metadata": {
        "key": "value"
    }
}
```

Output JSONL format:
```json
{
    "pid": "document_id",
    "docid": "path/to/source",
    "content": "document text",
    "metadata": {
        "key": "value"
    },
    "thinking": "reasoning process",
    "category": "content type",
    "should_index": "true/false",
    "processed_at": "2024-10-22 02:52:33"
}
```

## Implementation and Performance Considerations

- Use thread pooling for parallel processing
- Implement atomic writes with file locking
- Progress tracking with tqdm
- Automatic error handling and logging
- Configurable thread count for optimization

## Error Handling

Errors are captured in the output JSONL:
```json
{
    "error": "error message",
    "processed_at": "timestamp"
}
```

Monitor errors in real-time:
```bash
tail -f results.jsonl | grep error
```