metadata

license: apache-2.0
library_name: peft
tags:
  - generated_from_trainer
base_model: BioMistral/BioMistral-7B
model-index:
  - name: spanish_medica_llm
    results: []
datasets:
  - somosnlp/SMC
language:
  - es
pipeline_tag: text-generation
widget:
  - text: >-
      Is this review positive or negative? Review: Best cast iron skillet you
      will ever buy.
    example_title: Sentiment analysis
  - text: >-
      Barack Obama nominated Hilary Clinton as his secretary of state on Monday.
      He chose her because she had ...
    example_title: Coreference resolution
  - text: >-
      On a shelf, there are five books: a gray book, a red book, a purple book,
      a blue book, and a black book ...
    example_title: Logic puzzles
  - text: >-
      The two men running to become New York City's next mayor will face off in
      their first debate Wednesday night ...
    example_title: Reading comprehension

Model Card for SpanishMedicaLLM

More than 600 million Spanish speakers need resources, such as LLMs, to obtain medical information freely and safely, in line with the development objectives proposed by the UN: Health and Well-being, Quality Education, and End of Poverty. Yet few LLMs exist for the medical domain in Spanish.

The objective of this project is to create a large language model (LLM) for the medical domain in Spanish, enabling health information services and solutions in LATAM. The model will cover conventional, natural, and traditional medicine. One output of the project is a public medical-domain dataset that pools resources from other sources and can be used to create or fine-tune LLMs. The performance of the LLM is compared with other state-of-the-art models such as BioMistral, Meditron, and Med-PaLM.

Dataset Card in Spanish

Model Details

Model Description

Developed by: Dionis López Ramos, Alvaro Garcia Barragan, Dylan Montoya, Daniel Bermúdez
Funded by: SomosNLP, HuggingFace
Model type: Language model, instruction tuned
Language(s): Spanish (es-ES, es-CL)
License: apache-2.0
Fine-tuned from model: BioMistral/BioMistral-7B
Dataset used: somosnlp/SMC/

Model Sources

Repository: spaces/somosnlp/SpanishMedicaLLM/
Paper: "Coming soon!"
Demo: spaces/somosnlp/SpanishMedicaLLM
Video presentation: SpanishMedicaLLM | Proyecto Hackathon #SomosNLP

Uses

Direct Use

[More Information Needed]

Out-of-Scope Use

The creators of the LLM are not responsible for any harmful results it may generate. A rigorous evaluation of the generated results by medical specialists is recommended.

Bias, Risks, and Limitations

[More Information Needed]

Recommendations

How to Get Started with the Model

Use the code below to get started with the model.

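from peft import PeftModel, PeftConfig
from transformers import AutoModelForCausalLM

# Adapter configuration (records which base model the adapter was trained on)
config = PeftConfig.from_pretrained("somosnlp/spanish_medica_llm")
# Load the base model, then attach the fine-tuned PEFT adapter weights to it
model = AutoModelForCausalLM.from_pretrained("BioMistral/BioMistral-7B")
model = PeftModel.from_pretrained(model, "somosnlp/spanish_medica_llm")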
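
The snippet above only loads the model. As a minimal sketch of how text could then be generated, the helpers below wrap loading and generation. The function names, the `max_new_tokens` default, and the choice to reuse the tokenizer of the base model are assumptions made for illustration; the card does not specify a tokenizer or a prompt format.

```python
def load_model_and_tokenizer(base_id="BioMistral/BioMistral-7B",
                             adapter_id="somosnlp/spanish_medica_llm"):
    """Load the base model, attach the PEFT adapter, and load a tokenizer."""
    # Imported here so generate_text() below can be used without these libraries.
    from peft import PeftModel
    from transformers import AutoModelForCausalLM, AutoTokenizer

    # Assumption: the tokenizer of the base model is the right one for the adapter.
    tokenizer = AutoTokenizer.from_pretrained(base_id)
    base_model = AutoModelForCausalLM.from_pretrained(base_id)
    model = PeftModel.from_pretrained(base_model, adapter_id)
    model.eval()  # inference mode (disables dropout)
    return model, tokenizer


def generate_text(model, tokenizer, prompt, max_new_tokens=64):
    """Tokenize the prompt, generate a continuation, and decode it to a string."""
    inputs = tokenizer(prompt, return_tensors="pt")
    output_ids = model.generate(**inputs, max_new_tokens=max_new_tokens)
    return tokenizer.decode(output_ids[0], skip_special_tokens=True)


# Example usage (downloads the 7B base model, so it is left commented out):
# model, tokenizer = load_model_and_tokenizer()
# print(generate_text(model, tokenizer, "¿Qué es la hipertensión arterial?"))
```

The decoded string includes the prompt followed by the generated continuation, since the whole output sequence is decoded.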

Training Details

Training Data

The model was trained on the somosnlp/SMC dataset.

Training Procedure

Training Hyperparameters

Training regime:

learning_rate: 2.5e-05
train_batch_size: 16
eval_batch_size: 1
seed: 42
gradient_accumulation_steps: 4
total_train_batch_size: 64
optimizer: Adam with betas=(0.9,0.999) and epsilon=1e-08
lr_scheduler_type: linear
lr_scheduler_warmup_steps: 5
training_steps: 2
mixed_precision_training: Native AMP
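
To make the relationship between these values concrete, the sketch below recomputes the total batch size and a linear learning-rate schedule with warmup. It assumes a single training device and the common formulation in which the rate rises linearly from zero during warmup and then decays linearly to zero; the function names are illustrative and are not taken from the training code.

```python
def effective_batch_size(per_step_batch_size, gradient_accumulation_steps, num_devices=1):
    """Number of samples contributing to one optimizer update."""
    return per_step_batch_size * gradient_accumulation_steps * num_devices


def linear_lr_factor(step, warmup_steps, total_steps):
    """Multiplier applied to the base learning rate at a given optimizer step."""
    if step < warmup_steps:
        # Warmup: ramp up linearly from 0 towards 1.
        return step / max(1, warmup_steps)
    # After warmup: decay linearly to 0 at total_steps.
    return max(0.0, (total_steps - step) / max(1, total_steps - warmup_steps))


# Values reported in the list above (num_devices=1 is an assumption).
print(effective_batch_size(16, 4))  # 64

base_lr = 2.5e-05
for step in range(2):  # training_steps: 2
    print(step, base_lr * linear_lr_factor(step, warmup_steps=5, total_steps=2))
```

Under this formulation, both reported training steps fall inside the 5-step warmup window, so the learning rate would stay below its nominal value for the whole run.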

Evaluation

Testing Data, Factors & Metrics

Testing Data

The evaluation corpus was 20% of the somosnlp/SMC dataset.
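
The card does not say how the 20% was selected. As a minimal sketch, assuming a seeded random hold-out (the seed 42 mirrors the training seed, and the function name is hypothetical), such a split could look like this:

```python
import random


def train_test_split(records, test_fraction=0.2, seed=42):
    """Randomly hold out a fraction of the records for evaluation."""
    indices = list(range(len(records)))
    random.Random(seed).shuffle(indices)  # deterministic shuffle for a given seed
    n_test = int(round(len(records) * test_fraction))
    test_indices = set(indices[:n_test])
    train = [r for i, r in enumerate(records) if i not in test_indices]
    test = [r for i, r in enumerate(records) if i in test_indices]
    return train, test


# Toy stand-in for dataset rows; the real data is somosnlp/SMC.
train, test = train_test_split(list(range(10)))
print(len(train), len(test))  # 8 2
```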

Factors

[More Information Needed]

Metrics

[More Information Needed]

Results

[More Information Needed]

Environmental Impact

Carbon emissions can be estimated using the Machine Learning Impact calculator presented in Lacoste et al. (2019).

Hardware Type: GPU
Hours used: 4 hours
Cloud Provider: Hugging Face
Compute Region: [More Information Needed]
Carbon Emitted: [More Information Needed]
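
Since the compute region is not reported, the emitted carbon cannot be filled in here. The sketch below only shows the shape of such an estimate: energy is power multiplied by time, and emissions are energy multiplied by the carbon intensity of the grid. It assumes a single GPU drawing its 70 W rated board power (the Nvidia T4 listed under Hardware) for the full 4 hours, and it ignores CPU power and datacenter overhead. The function names are illustrative.

```python
def energy_kwh(power_watts, hours):
    """Energy consumed, in kilowatt-hours."""
    return power_watts * hours / 1000.0


def emissions_kg_co2eq(energy, carbon_intensity_kg_per_kwh):
    """Emissions for a given energy use and grid carbon intensity (kg CO2eq per kWh)."""
    return energy * carbon_intensity_kg_per_kwh


T4_TDP_WATTS = 70  # assumption: the GPU runs at its rated board power
energy = energy_kwh(T4_TDP_WATTS, hours=4)
print(energy)  # about 0.28 kWh

# The carbon intensity depends on the (unreported) compute region, so it is
# left as a parameter instead of being guessed:
# print(emissions_kg_co2eq(energy, carbon_intensity_kg_per_kwh=...))
```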

Model Architecture and Objective

The model uses the architecture of BioMistral/BioMistral-7B, chosen because it is a foundation model trained on a medical-domain dataset.

Compute Infrastructure

[More Information Needed]

Hardware

Nvidia T4 Small: 4 vCPU, 15 GB RAM, 16 GB VRAM

Software

transformers==4.38.0
torch>=2.1.1+cu113
trl @ git+https://github.com/huggingface/trl
peft
wandb
accelerate
datasets

License

Apache License 2.0

Citation

BibTeX:

@software{lopez2024spanishmedicallm,
  author = {López Ramos, Dionis and Garcia Barragan, Alvaro and Montoya, Dylan and Bermúdez, Daniel},
  title = {SpanishMedicaLLM},
  month = {February},
  year = 2024,
  url = {https://huggingface.co/somosnlp/spanish_medica_llm}
}

More Information

This project was developed during the [Hackathon #Somos600M](https://somosnlp.org/hackathon) organized by SomosNLP.
The model was trained using GPUs sponsored by HuggingFace.

Team:

Contact

For any questions or suggestions, contact Dionis López, PhD ([email protected])