---
license: apache-2.0
language:
- ru
- en
base_model:
- jinaai/jina-embeddings-v3
---
|
|
|
## **JinaJudge: Proxy Judgement for Russian LLM Arena** |
|
|
|
### **Description** |
|
This model is trained to replicate the judgement patterns of GPT-4-1106-Preview in the [Russian LLM Arena](https://huggingface.co/spaces/Vikhrmodels/arenahardlb), enabling faster and more cost-effective evaluation of language models. While the focus is on Russian LLM evaluation, the model can also be used for English-centric models.
|
|
|
--- |
|
|
|
### **Model Details** |
|
|
|
This is an iterative update of the [kaleinaNyan/jina-v3-rullmarena-judge-300924](https://huggingface.co/kaleinaNyan/jina-v3-rullmarena-judge-300924) model:

- Increased the amount of training data (not by much, approximately 1.5x).

- Updated the data composition to fix erroneous judgements where GPT-4 picked English responses over Russian ones (see the filtering sketch below).

- The validation set was updated as well to exclude such errors.

- The test set did not change (it contained no such erroneous judgements).
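
For illustration, a filter along these lines could catch such judgements. The record fields (`winner`, `answer_a`, `answer_b`) and the `langdetect` dependency are assumptions made for the sketch, not the actual pipeline:

```python
from langdetect import detect  # pip install langdetect; an assumed dependency

def is_suspicious(judgement: dict) -> bool:
    """Flag judgements where the winning answer is English but the losing one is Russian."""
    lang_a = detect(judgement["answer_a"])
    lang_b = detect(judgement["answer_b"])
    if judgement["winner"] == "A":
        return lang_a == "en" and lang_b == "ru"
    if judgement["winner"] == "B":
        return lang_b == "en" and lang_a == "ru"
    return False  # ties are left untouched

judgements = [
    {"winner": "A", "answer_a": "The answer is 42.", "answer_b": "Ответ: сорок два."},
    {"winner": "B", "answer_a": "Ответ: сорок два.", "answer_b": "Forty-two, of course."},
]
clean = [j for j in judgements if not is_suspicious(j)]
```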
|
|
|
--- |
|
|
|
### **Evaluation** |
|
The validation process was based on **existing judgements** from the Russian LLM Arena, which were filtered and simplified to match the three-class scheme used in training.
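
For example, Arena-Hard-style verdicts can be collapsed into the judge's three classes roughly as follows. The five-way label set is an assumption for this sketch; the actual arena labels may be formatted differently:

```python
# Collapse fine-grained arena verdicts into the three classes the judge predicts.
FIVE_TO_THREE = {
    "A>>B": 0, "A>B": 0,  # A wins, by any margin
    "A=B": 1,             # tie
    "B>A": 2, "B>>A": 2,  # B wins, by any margin
}

def simplify(verdict: str) -> int:
    return FIVE_TO_THREE[verdict]

assert simplify("A>>B") == simplify("A>B") == 0
```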
|
|
|
NOTE: values in parentheses show the change relative to the previous model.
|
|
|
**Models evaluated**: |
|
- **gemma-2-9b-it-sppo-iter3** |
|
- **glm-4-9b-chat** |
|
- **gpt-3.5-turbo-1106** |
|
- **mistral-7b-instruct-v0.3** |
|
- **storm-7b** |
|
|
|
**Validation Performance (old validation set)**: |
|
- **Accuracy**: 79.97% (-0.78) |
|
- **Precision**: 78.25% (-0.31) |
|
- **Recall**: 78.25% (-1.23) |
|
- **F1-score**: 78.25% (-0.75) |
|
|
|
NOTE: what actually caused the drop (the subset of fixed judgements or something else) will be investigated and reported later.
|
|
|
**Validation Performance (new validation set)**: |
|
- **Accuracy**: 83.59% (+2.48) |
|
- **Precision**: 80.97% (+2.14) |
|
- **Recall**: 80.97% (+1.22) |
|
- **F1-score**: 80.97% (+1.77) |
|
|
|
For the **test** phase, new judgements were generated using GPT-4 for the `kolibri-mistral-0427-upd` model. |
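
A rough sketch of what generating such judgements with the OpenAI client can look like is shown below. The judge prompt and the plain `A`/`TIE`/`B` output format are simplified stand-ins, not the actual Russian LLM Arena judge prompt:

```python
from openai import OpenAI

client = OpenAI()

def judge(user_prompt: str, answer_a: str, answer_b: str) -> str:
    # Ask GPT-4-1106-Preview for a pairwise verdict (simplified prompt).
    response = client.chat.completions.create(
        model="gpt-4-1106-preview",
        messages=[{
            "role": "user",
            "content": (
                "Compare the two assistant answers to the prompt and reply "
                "with exactly one of: A, TIE, B.\n\n"
                f"Prompt:\n{user_prompt}\n\n"
                f"Answer A:\n{answer_a}\n\nAnswer B:\n{answer_b}"
            ),
        }],
    )
    return response.choices[0].message.content.strip()
```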
|
|
|
**Test Performance**: |
|
- **Accuracy**: 85.09% (+2.37) |
|
- **Precision**: 83.20% (+3.09) |
|
- **Recall**: 83.20% (+0.78) |
|
- **F1-score**: 83.20% (+2.02) |
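
For reference, scores of this kind can be computed with scikit-learn along the following lines. The macro averaging is an assumption made for the sketch; the averaging scheme behind the multi-class precision/recall/F1 above is not stated:

```python
from sklearn.metrics import accuracy_score, precision_recall_fscore_support

y_true = [0, 1, 2, 0, 2]  # GPT-4 judgements (0: A wins, 1: tie, 2: B wins)
y_pred = [0, 1, 2, 1, 2]  # JinaJudge predictions

accuracy = accuracy_score(y_true, y_pred)
precision, recall, f1, _ = precision_recall_fscore_support(
    y_true, y_pred, average="macro", zero_division=0
)
print(f"acc={accuracy:.2%} p={precision:.2%} r={recall:.2%} f1={f1:.2%}")
```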
|
|
|
--- |
|
|
|
### **Usage Example** |
|
|
|
```python
from transformers import AutoModel

# Load the judge (custom architecture, hence trust_remote_code=True)
jina = AutoModel.from_pretrained(
    "kaleinaNyan/jina-v3-rullmarena-judge-041024",
    trust_remote_code=True,
)

prompt_template = """
<user prompt>
{user_prompt}
<end>
<assistant A answer>
{assistant_a}
<end>
<assistant B answer>
{assistant_b}
<end>
""".strip()

user_prompt = "your prompt"
assistant_a = "assistant a response"
assistant_b = "assistant b response"

example = prompt_template.format(
    user_prompt=user_prompt,
    assistant_a=assistant_a,
    assistant_b=assistant_b,
)

# The model returns per-class scores; argmax gives the predicted verdict
judgement = jina([example])[0].argmax()

judgement_map = {
    0: "A is better than B",
    1: "A == B",
    2: "B is better than A",
}

print(judgement_map[judgement])
```
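
Continuing from the snippet above, per-pair judgements can be aggregated into a simple win rate against a baseline. The counting scheme below (ties worth half a win) is illustrative, not the official arena aggregation:

```python
# Reuses `jina` and `prompt_template` from the snippet above.
# Assistant A holds the baseline answers, assistant B the model under test.
pairs = [
    ("user prompt 1", "baseline answer 1", "candidate answer 1"),
    ("user prompt 2", "baseline answer 2", "candidate answer 2"),
]

wins, ties = 0, 0
for user_prompt, baseline, candidate in pairs:
    example = prompt_template.format(
        user_prompt=user_prompt,
        assistant_a=baseline,
        assistant_b=candidate,
    )
    verdict = int(jina([example])[0].argmax())
    if verdict == 2:    # B (the candidate) is better
        wins += 1
    elif verdict == 1:  # tie
        ties += 1

win_rate = (wins + 0.5 * ties) / len(pairs)
print(f"win rate vs baseline: {win_rate:.1%}")
```

Judging each pair in both orders and averaging the two verdicts is a common way to reduce position bias.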