Model Card for mixtral-8x7B-redbioma-qg-v0.2

Model Details

Model Description

This model was trained to generate closed-domain question-answer pairs from scientific texts about Costa Rican biodiversity. These texts come from scientific papers describing a wide range of species and their characteristics. The texts are given to the model as input to obtain high-quality QA pairs.

All the resources and data used for this project come from the redbioma project, through the Computer Science School of the Instituto Tecnológico de Costa Rica (ITCR).

  • Developed by: Alejandro Díaz Pereira
  • Model type: Question-answer generation.
  • Language(s) (NLP): Spanish
  • License: Apache 2.0
  • Finetuned from model: Mixtral-8x7B-Instruct-v0.1

Uses

This model was trained for closed-domain question-answer generation in Spanish; it is not intended for use in other tasks.
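A minimal loading sketch with the transformers library is shown below; the repository id, dtype, and device settings are assumptions (the card does not specify how the weights are published) and should be adjusted to the actual repository:

import torch
from transformers import pipeline

# Hypothetical repository id; replace with the actual Hugging Face repo of this model.
generator = pipeline(
    "text-generation",
    model="alejandrodp/mixtral-8x7B-redbioma-qg-v0.2",
    torch_dtype=torch.float16,
    device_map="auto",
)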

Bias, Risks, and Limitations

The model was trained on a small dataset of QA pairs (26,035 examples), most of which are extractive, so the model is biased toward extracting answers directly from the context. It works well with input texts of roughly 400 to 4,000 characters. However, due to the low diversity of the texts, the model often generates only one or two QA pairs; the results can be improved with some prompt engineering.
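One simple way to keep inputs within that 400 to 4,000 character range is to split longer documents into paragraph-based chunks before generation. The helper below is only an illustration (it is not part of the released code) and assumes paragraphs are separated by blank lines:

# Naive chunker: groups paragraphs until a chunk approaches max_chars.
# A single paragraph longer than max_chars is kept as one oversized chunk.
def chunk_text(text: str, min_chars: int = 400, max_chars: int = 4000) -> list[str]:
    chunks, current = [], ""
    for paragraph in text.split("\n\n"):
        if current and len(current) + len(paragraph) > max_chars and len(current) >= min_chars:
            chunks.append(current.strip())
            current = ""
        current += paragraph + "\n\n"
    if current.strip():
        chunks.append(current.strip())
    return chunks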

The prompt that produced the best results is the following:

Genere un grupo de preguntas distintas y sus respuestas exactas correspondientes basadas en la entrada proporcionada sin repetir preguntas. Las preguntas deben explorar diferentes facetas de la información presentada, las respuestas deben ser precisas, detalladas y comprensibles para un público no especialista. Enfóquese en la claridad y profundidad para mejorar la comprensión.

(In English: Generate a group of distinct questions and their corresponding exact answers based on the provided input, without repeating questions. The questions should explore different facets of the information presented; the answers should be precise, detailed, and understandable for a non-specialist audience. Focus on clarity and depth to improve comprehension.)

Terms such as grupo, exactas, and sin repetir preguntas improved the way the model generates questions, making the results more diverse, more detailed, and focused on using only the knowledge from the context.
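In practice, the instruction and the source passage are sent as a single user turn so that the Mixtral-Instruct chat template is applied. The sketch below continues the hypothetical generator pipeline from the Uses section and assumes a recent transformers version whose text-generation pipeline accepts chat-formatted input:

instruction = (
    "Genere un grupo de preguntas distintas y sus respuestas exactas correspondientes "
    "basadas en la entrada proporcionada sin repetir preguntas. Las preguntas deben "
    "explorar diferentes facetas de la información presentada, las respuestas deben ser "
    "precisas, detalladas y comprensibles para un público no especialista. Enfóquese en "
    "la claridad y profundidad para mejorar la comprensión."
)
# Replace with a real passage of roughly 400 to 4,000 characters.
context = "Texto científico sobre una especie de la biodiversidad costarricense."

# Instruction and context form one user message; the pipeline applies the chat template.
messages = [{"role": "user", "content": f"{instruction}\n\n{context}"}]
result = generator(messages, max_new_tokens=512, do_sample=False, return_full_text=False)
print(result[0]["generated_text"])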

The model showed only one case of endless generation with this prompt, using the following input text. Further training or prompt engineering could resolve this issue.

Vuela recto y rápidamente hasta la percha, y luego mantiene el cuerpo inmóvil mientras inclina la cabeza en ángulos raros para revisar la vegetación a su alrededor. Atrapa insectos, sus larvas y pupas, lagartijas pequeñas o frutos del follaje mediante salidas súbitas y agitadas. Un solo individuo puede acompañar a bandadas mixtas de hormigueritos, furnáridos y tangaras.

(In English: It flies straight and quickly to its perch, then holds its body still while tilting its head at odd angles to scan the surrounding vegetation. It catches insects, their larvae and pupae, small lizards, or fruit from the foliage in sudden, agitated sallies. A single individual may join mixed flocks of antwrens, furnariids, and tanagers.)

Training Details

Training Data

This model was trained on a dataset of 26,035 question-answer pairs built from redbioma scientific texts on Costa Rican biodiversity.

Training Procedure

The model was trained with a self-supervised objective. The training parameters were as follows:

from transformers import TrainingArguments

# Requires a GPU, the deepspeed package, and the deepspeed_config.json from the model repo.
training_args = TrainingArguments(
    output_dir="./output",  # placeholder; not specified in this card
    per_device_train_batch_size=4,
    per_device_eval_batch_size=2,
    eval_accumulation_steps=2,
    gradient_accumulation_steps=4,
    gradient_checkpointing=True,
    fp16=True,
    evaluation_strategy="steps",
    adam_beta1=0.9,
    adam_beta2=0.95,
    num_train_epochs=1,
    learning_rate=5e-5,
    weight_decay=0.01,
    lr_scheduler_type="cosine",
    warmup_ratio=0.1,
    logging_steps=50,
    eval_steps=100,
    save_strategy="no",
    report_to="tensorboard",
    gradient_checkpointing_kwargs={"use_reentrant": False},
    deepspeed="./deepspeed_config.json",
    optim="adamw_bnb_8bit",
)

All of these parameters were chosen to keep memory consumption and training time reasonable. They were taken from https://blog.vessl.ai/finetuning-mixtral-8x7B. The DeepSpeed configuration can be found in the model repo.

Training Hyperparameters

  • Training regime: fp16 mixed precision.

Technical Specifications

Compute Infrastructure

Jarvis.ai was used for the training and inference phases.

Hardware

The model was trained with the following hardware specifications:

  • RAM: 256GB (total)
  • GPU: 2 x RTX6000 Ada (48GB)

Software

The model was trained using PyTorch.

Model Card Contact

Alejandro D铆az Pereira (Main developer)

Email: [email protected]

Github: alejandrodp
