Spaces:
Running
Running
title: German Llm Outputs | |
emoji: 🦀 | |
colorFrom: green | |
colorTo: pink | |
sdk: gradio | |
sdk_version: 4.36.1 | |
app_file: app.py | |
pinned: false | |
license: mit | |
# Dataset | |
The dataset usesd is https://huggingface.co/datasets/lmsys/chatbot_arena_conversations | |
Preprocessing: | |
- filtered german conversations | |
- took first user prompt | |
- deleted short prompts (less than 70 chars) | |
```python | |
dataset = load_dataset('lmsys/chatbot_arena_conversations') | |
def get_message(x): | |
x['message'] = [x['conversation_a'][0]] | |
return x | |
dataset = dataset.filter(lambda x: x['language'] == 'German') | |
dataset = dataset['train'].map(get_message) | |
dataset = dataset.filter(lambda x: len(x['message'][0]['content']) > 70) | |
``` | |
# Generation | |
I rely on the huggingface `conversational` pipeline to generate the outputs. There are some issues with the chat template (esp. for the non-instruction tuned models) i'll fix later. | |
```python | |
messages = json.loads(Path('messages.json').read_text()) | |
outputs = [] | |
pipe = pipeline( | |
"conversational", | |
model=model_name, | |
torch_dtype="auto", | |
device_map=device, | |
max_new_tokens=1024, | |
trust_remote_code=True | |
) | |
for message in tqdm(messages): | |
output = pipe([message]) | |
outputs.append(output) | |
``` |