---
license: apache-2.0
language:
- ru
- en
base_model:
- jinaai/jina-embeddings-v3
---
## **JinaJudge: Proxy Judgement for Russian LLM Arena**
### **Description**
This model is trained to replicate the judgement patterns of GPT-4-1106-Preview in the [Russian LLM Arena](https://huggingface.co/spaces/Vikhrmodels/arenahardlb), enabling faster and more cost-effective evaluation of language models. Although the model focuses on Russian LLM evaluation, it can also be used for English-centric models.
---
### **Model Details**
This is an iterative update of the [kaleinaNyan/jina-v3-rullmarena-judge-300924](https://huggingface.co/kaleinaNyan/jina-v3-rullmarena-judge-300924) model:
- Increased the amount of training data (modestly, by approximately 1.5x).
- Updated data composition to fix erroneous judgements where GPT-4 picked English responses over Russian ones.
- Validation set was updated as well to exclude such errors.
- Test set did not change (no bad judgements in that regard).
---
### **Evaluation**
The validation process was based on **existing judgements** from the Russian LLM Arena, which were already available. These judgements were filtered and simplified to match the three-class structure used in training.
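The filtering and simplification step is not spelled out here, so the following is only a minimal sketch of what it could look like. It assumes the arena judgements use Arena-Hard-style five-level verdicts (`A>>B`, `A>B`, `A=B`, `B>A`, `B>>A`); the label strings and the helper names `collapse_verdict` and `collapse_all` are hypothetical. The class indices follow the `judgement_map` from the usage example.

```python
# Assumed five-level verdict labels (Arena-Hard style); not confirmed by this model card.
VERDICT_TO_CLASS = {
    "A>>B": 0,  # A is better than B
    "A>B": 0,
    "A=B": 1,   # tie
    "B>A": 2,   # B is better than A
    "B>>A": 2,
}


def collapse_verdict(verdict):
    """Map a five-level verdict string to a three-class index, or None if unrecognised."""
    return VERDICT_TO_CLASS.get(verdict.strip())


def collapse_all(verdicts):
    """Collapse a list of verdicts, dropping any that cannot be mapped (the filtering step)."""
    classes = (collapse_verdict(v) for v in verdicts)
    return [c for c in classes if c is not None]
```

For example, `collapse_all(["A>>B", "A=B", "B>A", "???"])` keeps the first three verdicts as classes `0`, `1`, `2` and drops the unrecognised one.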
NOTE: values in parentheses show the change relative to the previous model.
**Models evaluated**:
- **gemma-2-9b-it-sppo-iter3**
- **glm-4-9b-chat**
- **gpt-3.5-turbo-1106**
- **mistral-7b-instruct-v0.3**
- **storm-7b**
**Validation Performance (old validation set)**:
- **Accuracy**: 79.97% (-0.78)
- **Precision**: 78.25% (-0.31)
- **Recall**: 78.25% (-1.23)
- **F1-score**: 78.25% (-0.75)
NOTE: the cause of the drop (the subset of fixed judgements or something else) will be reported later.
**Validation Performance (new validation set)**:
- **Accuracy**: 83.59% (+2.48)
- **Precision**: 80.97% (+2.14)
- **Recall**: 80.97% (+1.22)
- **F1-score**: 80.97% (+1.77)
For the **test** phase, new judgements were generated using GPT-4 for the `kolibri-mistral-0427-upd` model.
**Test Performance**:
- **Accuracy**: 85.09% (+2.37)
- **Precision**: 83.20% (+3.09)
- **Recall**: 83.20% (+0.78)
- **F1-score**: 83.20% (+2.02)
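For readers who want to compute comparable metrics on their own judgements, here is a minimal sketch for the three-class setting. The averaging scheme behind the reported precision, recall and F1 is not stated in this card; macro-averaging is used below purely as an assumption, and `classification_metrics` is a hypothetical helper name.

```python
def classification_metrics(y_true, y_pred, num_classes=3):
    """Return accuracy and macro-averaged precision, recall and F1.

    Macro-averaging is an assumption; the card does not say which scheme was used.
    """
    if len(y_true) != len(y_pred) or not y_true:
        raise ValueError("y_true and y_pred must be non-empty and of equal length")

    pairs = list(zip(y_true, y_pred))
    accuracy = sum(t == p for t, p in pairs) / len(pairs)

    precisions, recalls, f1s = [], [], []
    for c in range(num_classes):
        tp = sum(t == c and p == c for t, p in pairs)
        fp = sum(t != c and p == c for t, p in pairs)
        fn = sum(t == c and p != c for t, p in pairs)
        # Classes with no predictions or no true samples score 0.0.
        precision = tp / (tp + fp) if tp + fp else 0.0
        recall = tp / (tp + fn) if tp + fn else 0.0
        f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
        precisions.append(precision)
        recalls.append(recall)
        f1s.append(f1)

    return {
        "accuracy": accuracy,
        "precision": sum(precisions) / num_classes,
        "recall": sum(recalls) / num_classes,
        "f1": sum(f1s) / num_classes,
    }
```

Here `y_true` would hold the GPT-4 judgements and `y_pred` the judge model's predictions, both as class indices `0`, `1`, `2`.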
---
### **Usage Example**
```python
from transformers import AutoModel

# trust_remote_code=True lets transformers run the custom model code shipped in the repository.
jina = AutoModel.from_pretrained("kaleinaNyan/jina-v3-rullmarena-judge-041024", trust_remote_code=True)

prompt_template = """
<user prompt>
{user_prompt}
<end>
<assistant A answer>
{assistant_a}
<end>
<assistant B answer>
{assistant_b}
<end>
""".strip()

user_prompt = "your prompt"
assistant_a = "assistant a response"
assistant_b = "assistant b response"

example = prompt_template.format(
    user_prompt=user_prompt,
    assistant_a=assistant_a,
    assistant_b=assistant_b,
)

# Take the highest-scoring class; .item() turns the result into a plain int for the dict lookup.
judgement = jina([example])[0].argmax().item()

judgement_map = {
    0: "A is better than B",
    1: "A == B",
    2: "B is better than A"
}

print(judgement_map[judgement])
```
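When judging many prompt pairs, the per-example class indices can be aggregated into a simple summary. The sketch below is one possible way to do that and is not part of the model itself: `summarize_judgements` is a hypothetical helper, and counting a tie as half a win for A is an assumed scoring convention, not the arena's official one.

```python
from collections import Counter


def summarize_judgements(class_indices):
    """Count outcomes (0 = A wins, 1 = tie, 2 = B wins) and compute A's win rate.

    Ties count as half a win for A; this convention is an assumption.
    """
    if not class_indices:
        raise ValueError("class_indices must not be empty")

    counts = Counter(class_indices)
    a_wins, ties, b_wins = counts[0], counts[1], counts[2]
    return {
        "a_wins": a_wins,
        "ties": ties,
        "b_wins": b_wins,
        "a_win_rate": (a_wins + 0.5 * ties) / len(class_indices),
    }
```

The input is a list of class indices such as the ones produced by the `argmax` step in the usage example, one per judged prompt.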