---
license: apache-2.0
language:
- ru
- en
base_model:
- jinaai/jina-embeddings-v3
---
|
|
|
## **JinaJudge: Proxy Judgement for Russian LLM Arena** |
|
|
|
### **Description** |
|
This model is trained to replicate the judgement patterns of GPT-4-1106-Preview in the [Russian LLM Arena](https://huggingface.co/spaces/Vikhrmodels/arenahardlb), enabling faster and more cost-effective evaluation of language models. While the focus is on Russian LLM evaluation, the model can also be used for English-centric models.
|
|
|
--- |
|
|
|
### **Model Details** |
|
|
|
This is an iterative update of the [kaleinaNyan/jina-v3-rullmarena-judge-300924](https://huggingface.co/kaleinaNyan/jina-v3-rullmarena-judge-300924) model:

- Increased the amount of training data (not by much, approximately 1.5x).

- Updated the data composition to fix erroneous judgements where GPT-4 picked English responses over Russian ones (see the filtering sketch below).

- The validation set was updated as well to exclude such errors.

- The test set did not change (it contained no such erroneous judgements).
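
For illustration, a filter along these lines could catch such judgements. The record fields (`winner`, `answer_a`, `answer_b`) and the `langdetect` dependency are assumptions made for the sketch, not the actual pipeline:

```python
from langdetect import detect  # pip install langdetect; an assumed dependency

def is_suspicious(judgement: dict) -> bool:
    """Flag judgements where the winning answer is English but the losing one is Russian."""
    lang_a = detect(judgement["answer_a"])
    lang_b = detect(judgement["answer_b"])
    if judgement["winner"] == "A":
        return lang_a == "en" and lang_b == "ru"
    if judgement["winner"] == "B":
        return lang_b == "en" and lang_a == "ru"
    return False  # ties are left untouched

judgements = [
    {"winner": "A", "answer_a": "The answer is 42.", "answer_b": "Ответ: сорок два."},
    {"winner": "B", "answer_a": "Ответ: сорок два.", "answer_b": "Forty-two, of course."},
]
clean = [j for j in judgements if not is_suspicious(j)]
```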
|
|
|
--- |
|
|
|
### **Evaluation** |
|
The validation process was based on **existing judgements** from the Russian LLM Arena, which were filtered and simplified to match the three-class scheme used in training.
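
For example, Arena-Hard-style verdicts can be collapsed into the judge's three classes roughly as follows. The five-way label set is an assumption for this sketch; the actual arena labels may be formatted differently:

```python
# Collapse fine-grained arena verdicts into the three classes the judge predicts.
FIVE_TO_THREE = {
    "A>>B": 0, "A>B": 0,  # A wins, by any margin
    "A=B": 1,             # tie
    "B>A": 2, "B>>A": 2,  # B wins, by any margin
}

def simplify(verdict: str) -> int:
    return FIVE_TO_THREE[verdict]

assert simplify("A>>B") == simplify("A>B") == 0
```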
|
|
|
NOTE: values in parentheses show the change relative to the previous model.
|
|
|
**Models evaluated**: |
|
- **gemma-2-9b-it-sppo-iter3** |
|
- **glm-4-9b-chat** |
|
- **gpt-3.5-turbo-1106** |
|
- **mistral-7b-instruct-v0.3** |
|
- **storm-7b** |
|
|
|
**Validation Performance (old validation set)**: |
|
- **Accuracy**: 79.97% (-0.78) |
|
- **Precision**: 78.25% (-0.31) |
|
- **Recall**: 78.25% (-1.23) |
|
- **F1-score**: 78.25% (-0.75) |
|
|
|
NOTE: what actually caused the drop (the subset of fixed judgements or something else) will be investigated and reported later.
|
|
|
**Validation Performance (new validation set)**: |
|
- **Accuracy**: 83.59% (+2.48) |
|
- **Precision**: 80.97% (+2.14) |
|
- **Recall**: 80.97% (+1.22) |
|
- **F1-score**: 80.97% (+1.77) |
|
|
|
For the **test** phase, new judgements were generated using GPT-4 for the `kolibri-mistral-0427-upd` model. |
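
A rough sketch of what generating such judgements with the OpenAI client can look like is shown below. The judge prompt and the plain `A`/`TIE`/`B` output format are simplified stand-ins, not the actual Russian LLM Arena judge prompt:

```python
from openai import OpenAI

client = OpenAI()

def judge(user_prompt: str, answer_a: str, answer_b: str) -> str:
    # Ask GPT-4-1106-Preview for a pairwise verdict (simplified prompt).
    response = client.chat.completions.create(
        model="gpt-4-1106-preview",
        messages=[{
            "role": "user",
            "content": (
                "Compare the two assistant answers to the prompt and reply "
                "with exactly one of: A, TIE, B.\n\n"
                f"Prompt:\n{user_prompt}\n\n"
                f"Answer A:\n{answer_a}\n\nAnswer B:\n{answer_b}"
            ),
        }],
    )
    return response.choices[0].message.content.strip()
```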
|
|
|
**Test Performance**: |
|
- **Accuracy**: 85.09% (+2.37) |
|
- **Precision**: 83.20% (+3.09) |
|
- **Recall**: 83.20% (+0.78) |
|
- **F1-score**: 83.20% (+2.02) |
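
For reference, scores of this kind can be computed with scikit-learn along the following lines. The macro averaging is an assumption made for the sketch; the averaging scheme behind the multi-class precision/recall/F1 above is not stated:

```python
from sklearn.metrics import accuracy_score, precision_recall_fscore_support

y_true = [0, 1, 2, 0, 2]  # GPT-4 judgements (0: A wins, 1: tie, 2: B wins)
y_pred = [0, 1, 2, 1, 2]  # JinaJudge predictions

accuracy = accuracy_score(y_true, y_pred)
precision, recall, f1, _ = precision_recall_fscore_support(
    y_true, y_pred, average="macro", zero_division=0
)
print(f"acc={accuracy:.2%} p={precision:.2%} r={recall:.2%} f1={f1:.2%}")
```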
|
|
|
--- |
|
|
|
### **Usage Example** |
|
|
|
```python
from transformers import AutoModel

# Load the judge (custom architecture, hence trust_remote_code=True)
jina = AutoModel.from_pretrained(
    "kaleinaNyan/jina-v3-rullmarena-judge-041024",
    trust_remote_code=True,
)

prompt_template = """
<user prompt>
{user_prompt}
<end>
<assistant A answer>
{assistant_a}
<end>
<assistant B answer>
{assistant_b}
<end>
""".strip()

user_prompt = "your prompt"
assistant_a = "assistant a response"
assistant_b = "assistant b response"

example = prompt_template.format(
    user_prompt=user_prompt,
    assistant_a=assistant_a,
    assistant_b=assistant_b,
)

# The model returns per-class scores; argmax gives the predicted verdict
judgement = jina([example])[0].argmax()

judgement_map = {
    0: "A is better than B",
    1: "A == B",
    2: "B is better than A",
}

print(judgement_map[judgement])
```
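
Continuing from the snippet above, per-pair judgements can be aggregated into a simple win rate against a baseline. The counting scheme below (ties worth half a win) is illustrative, not the official arena aggregation:

```python
# Reuses `jina` and `prompt_template` from the snippet above.
# Assistant A holds the baseline answers, assistant B the model under test.
pairs = [
    ("user prompt 1", "baseline answer 1", "candidate answer 1"),
    ("user prompt 2", "baseline answer 2", "candidate answer 2"),
]

wins, ties = 0, 0
for user_prompt, baseline, candidate in pairs:
    example = prompt_template.format(
        user_prompt=user_prompt,
        assistant_a=baseline,
        assistant_b=candidate,
    )
    verdict = int(jina([example])[0].argmax())
    if verdict == 2:    # B (the candidate) is better
        wins += 1
    elif verdict == 1:  # tie
        ties += 1

win_rate = (wins + 0.5 * ties) / len(pairs)
print(f"win rate vs baseline: {win_rate:.1%}")
```

Judging each pair in both orders and averaging the two verdicts is a common way to reduce position bias.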