language:
  - pt
  - en
license: cc
tags:
  - text-generation-inference
  - transformers
  - mistral
  - gguf
  - brazil
  - brasil
  - portuguese
base_model: mistralai/Mistral-7B-Instruct-v0.2
metrics:
  - name: assin2_rte f1_macro
    type: assin2_rte
    value: 90.13
  - name: assin2_rte acc
    type: assin2_rte
    value: 90.16
  - name: assin2_sts pearson
    type: assin2_sts
    value: 71.51
  - name: assin2_sts mse
    type: assin2_sts
    value: 68.03
  - name: bluex acc
    type: bluex
    value: 47.98
  - name: enem acc
    type: enem
    value: 58.43
  - name: faquad_nli f1_macro
    type: faquad_nli
    value: 64.24
  - name: faquad_nli acc
    type: faquad_nli
    value: 67.69
  - name: hatebr_offensive_binary f1_macro
    type: hatebr_offensive_binary
    value: 83.61
  - name: hatebr_offensive_binary acc
    type: hatebr_offensive_binary
    value: 83.71
  - name: oab_exams acc
    type: oab_exams
    value: 38.41
  - name: portuguese_hate_speech_binary f1_macro
    type: portuguese_hate_speech_binary
    value: 61.87
  - name: portuguese_hate_speech_binary acc
    type: portuguese_hate_speech_binary
    value: 63.22
pipeline_tag: text-generation
model-index:
  - name: CabraMistral7b
    results:
      - task:
          type: text-generation
          name: Text Generation
        dataset:
          name: ENEM Challenge (No Images)
          type: eduagarcia/enem_challenge
          split: train
          args:
            num_few_shot: 3
        metrics:
          - type: acc
            value: 60.81
            name: accuracy
        source:
          url: >-
            https://huggingface.co/spaces/eduagarcia/open_pt_llm_leaderboard?query=nicolasdec/CabraMistral7b
          name: Open Portuguese LLM Leaderboard
      - task:
          type: text-generation
          name: Text Generation
        dataset:
          name: BLUEX (No Images)
          type: eduagarcia-temp/BLUEX_without_images
          split: train
          args:
            num_few_shot: 3
        metrics:
          - type: acc
            value: 46.87
            name: accuracy
        source:
          url: >-
            https://huggingface.co/spaces/eduagarcia/open_pt_llm_leaderboard?query=nicolasdec/CabraMistral7b
          name: Open Portuguese LLM Leaderboard
      - task:
          type: text-generation
          name: Text Generation
        dataset:
          name: OAB Exams
          type: eduagarcia/oab_exams
          split: train
          args:
            num_few_shot: 3
        metrics:
          - type: acc
            value: 38.59
            name: accuracy
        source:
          url: >-
            https://huggingface.co/spaces/eduagarcia/open_pt_llm_leaderboard?query=nicolasdec/CabraMistral7b
          name: Open Portuguese LLM Leaderboard
      - task:
          type: text-generation
          name: Text Generation
        dataset:
          name: Assin2 RTE
          type: assin2
          split: test
          args:
            num_few_shot: 15
        metrics:
          - type: f1_macro
            value: 90.27
            name: f1-macro
          - type: pearson
            value: 72.25
            name: pearson
        source:
          url: >-
            https://huggingface.co/spaces/eduagarcia/open_pt_llm_leaderboard?query=nicolasdec/CabraMistral7b
          name: Open Portuguese LLM Leaderboard
      - task:
          type: text-generation
          name: Text Generation
        dataset:
          name: FaQuAD NLI
          type: ruanchaves/faquad-nli
          split: test
          args:
            num_few_shot: 15
        metrics:
          - type: f1_macro
            value: 64.35
            name: f1-macro
        source:
          url: >-
            https://huggingface.co/spaces/eduagarcia/open_pt_llm_leaderboard?query=nicolasdec/CabraMistral7b
          name: Open Portuguese LLM Leaderboard
      - task:
          type: text-generation
          name: Text Generation
        dataset:
          name: HateBR Binary
          type: eduagarcia/portuguese_benchmark
          split: test
          args:
            num_few_shot: 25
        metrics:
          - type: f1_macro
            value: 83.15
            name: f1-macro
          - type: f1_macro
            value: 64.82
            name: f1-macro
        source:
          url: >-
            https://huggingface.co/spaces/eduagarcia/open_pt_llm_leaderboard?query=nicolasdec/CabraMistral7b
          name: Open Portuguese LLM Leaderboard
      - task:
          type: text-generation
          name: Text Generation
        dataset:
          name: tweetSentBR
          type: eduagarcia-temp/tweetsentbr
          split: test
          args:
            num_few_shot: 25
        metrics:
          - type: f1_macro
            value: 64.8
            name: f1-macro
        source:
          url: >-
            https://huggingface.co/spaces/eduagarcia/open_pt_llm_leaderboard?query=nicolasdec/CabraMistral7b
          name: Open Portuguese LLM Leaderboard

Cabra Mistral 7b v2

This model is a fine-tune of Mistral 7b Instruct 0.2 on the internal Cabra 10k dataset. It is optimized for Portuguese and improves on the base model across several Brazilian benchmarks.

Try our demo here: CabraChat.

See our other models: Cabra.

Model Details

Modelo: Mistral 7b Instruct 0.2

Mistral-7B-v0.1 is a transformer model with the following architectural choices:

  • Grouped-Query Attention
  • Sliding-Window Attention
  • Byte-fallback BPE tokenizer

Dataset: Cabra 10k

Internal dataset for fine-tuning. We plan to release it soon.

Quantization / GGUF

We have placed several quantized (GGUF) versions in the "quantanization" branch.

Example

<s> [INST] who is Elon Musk? [/INST]Elon Musk é um empreendedor, inventor e capitalista americano. Ele é o fundador, CEO e CTO da SpaceX, CEO da Neuralink e fundador do The Boring Company. Musk também é o proprietário do Twitter.</s>
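The prompt layout in the example can be reproduced with a small helper. This is a minimal sketch that assumes the template is exactly the `<s> [INST] ... [/INST]` wrapper shown above; `build_prompt` is a hypothetical name, not part of any library.

```python
def build_prompt(user_message: str) -> str:
    """Wrap a single user turn in the [INST] ... [/INST] template from the example."""
    # <s> is the beginning-of-sequence marker. Many tokenizers add it on their
    # own, so drop it here if yours does, to avoid a doubled BOS token.
    return f"<s> [INST] {user_message.strip()} [/INST]"


prompt = build_prompt("who is Elon Musk?")
print(prompt)
```

The model's answer is expected to follow the closing `[/INST]` tag and end with `</s>`, as in the example.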

Training parameters

- learning_rate: 1e-05
- train_batch_size: 4
- eval_batch_size: 4
- seed: 42
- distributed_type: multi-GPU
- num_devices: 2
- gradient_accumulation_steps: 8
- total_train_batch_size: 64
- total_eval_batch_size: 8
- optimizer: Adam with betas=(0.9,0.999) and epsilon=1e-08
- lr_scheduler_type: cosine
- lr_scheduler_warmup_ratio: 0.01
- num_epochs: 3
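The totals in this list follow from the per-device values, and the scheduler settings describe a short warmup followed by cosine decay. The sketch below checks the arithmetic and illustrates one common form of that schedule (linear warmup, then cosine decay to zero). The `lr_at` function and the figure of 1000 total steps are assumptions for illustration; the actual number of training steps is not stated here.

```python
import math

# Values taken from the list above.
train_batch_size = 4              # per device
eval_batch_size = 4               # per device
num_devices = 2
gradient_accumulation_steps = 8

# Samples contributing to one optimizer step.
total_train_batch_size = train_batch_size * num_devices * gradient_accumulation_steps
# Evaluation does not accumulate gradients, so only the device count multiplies.
total_eval_batch_size = eval_batch_size * num_devices


def lr_at(step: int, total_steps: int, base_lr: float = 1e-5,
          warmup_ratio: float = 0.01) -> float:
    """Learning rate at `step` under linear warmup plus cosine decay (a sketch)."""
    warmup_steps = int(total_steps * warmup_ratio)
    if step < warmup_steps:
        # Linear ramp from 0 up to base_lr.
        return base_lr * step / max(1, warmup_steps)
    # Cosine decay from base_lr down to 0 over the remaining steps.
    progress = (step - warmup_steps) / max(1, total_steps - warmup_steps)
    return base_lr * 0.5 * (1.0 + math.cos(math.pi * progress))


print(total_train_batch_size, total_eval_batch_size)
# Hypothetical run of 1000 optimizer steps.
for step in (0, 5, 10, 500, 1000):
    print(step, lr_at(step, total_steps=1000))
```

With a warmup ratio of 0.01, only the first 1% of steps ramp the learning rate up; the rest of the run is spent on the cosine decay.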

Framework

  • Transformers 4.39.0.dev0
  • PyTorch 2.1.2+cu118
  • Datasets 2.14.6
  • Tokenizers 0.15.2

Usage

For now, the model is intended for research purposes. Possible research areas and tasks include:

  • Research on generative models.
  • Investigating and understanding the limitations and biases of generative models.

Commercial use is prohibited. Research only.

Evals

| Tasks | Version | Filter | n-shot | Metric | Value | Stderr |
|---|---|---|---|---|---|---|
| assin2_rte | 1.1 | all | 15 | f1_macro | 0.9013 | ± 0.0043 |
| | | all | 15 | acc | 0.9016 | ± 0.0043 |
| assin2_sts | 1.1 | all | 15 | pearson | 0.7151 | ± 0.0074 |
| | | all | 15 | mse | 0.6803 | ± N/A |
| bluex | 1.1 | all | 3 | acc | 0.4798 | ± 0.0107 |
| | | exam_id__USP_2019 | 3 | acc | 0.375 | ± 0.044 |
| | | exam_id__USP_2021 | 3 | acc | 0.3462 | ± 0.0382 |
| | | exam_id__USP_2020 | 3 | acc | 0.4107 | ± 0.0379 |
| | | exam_id__UNICAMP_2018 | 3 | acc | 0.4815 | ± 0.0392 |
| | | exam_id__UNICAMP_2020 | 3 | acc | 0.4727 | ± 0.0389 |
| | | exam_id__UNICAMP_2021_1 | 3 | acc | 0.413 | ± 0.0418 |
| | | exam_id__UNICAMP_2019 | 3 | acc | 0.42 | ± 0.0404 |
| | | exam_id__UNICAMP_2022 | 3 | acc | 0.5897 | ± 0.0456 |
| | | exam_id__USP_2022 | 3 | acc | 0.449 | ± 0.041 |
| | | exam_id__USP_2024 | 3 | acc | 0.6341 | ± 0.0434 |
| | | exam_id__UNICAMP_2024 | 3 | acc | 0.6 | ± 0.0422 |
| | | exam_id__USP_2023 | 3 | acc | 0.5455 | ± 0.0433 |
| | | exam_id__UNICAMP_2023 | 3 | acc | 0.5349 | ± 0.044 |
| | | exam_id__USP_2018 | 3 | acc | 0.4815 | ± 0.0393 |
| | | exam_id__UNICAMP_2021_2 | 3 | acc | 0.5098 | ± 0.0403 |
| enem | 1.1 | all | 3 | acc | 0.5843 | ± 0.0075 |
| | | exam_id__2010 | 3 | acc | 0.5726 | ± 0.0264 |
| | | exam_id__2009 | 3 | acc | 0.6 | ± 0.0264 |
| | | exam_id__2014 | 3 | acc | 0.633 | ± 0.0268 |
| | | exam_id__2022 | 3 | acc | 0.6165 | ± 0.0243 |
| | | exam_id__2012 | 3 | acc | 0.569 | ± 0.0265 |
| | | exam_id__2013 | 3 | acc | 0.5833 | ± 0.0274 |
| | | exam_id__2016_2 | 3 | acc | 0.5203 | ± 0.026 |
| | | exam_id__2011 | 3 | acc | 0.6325 | ± 0.0257 |
| | | exam_id__2023 | 3 | acc | 0.5778 | ± 0.0246 |
| | | exam_id__2016 | 3 | acc | 0.595 | ± 0.0258 |
| | | exam_id__2017 | 3 | acc | 0.5517 | ± 0.0267 |
| | | exam_id__2015 | 3 | acc | 0.563 | ± 0.0261 |
| faquad_nli | 1.1 | all | 15 | f1_macro | 0.6424 | ± 0.0138 |
| | | all | 15 | acc | 0.6769 | ± 0.013 |
| hatebr_offensive_binary | 1 | all | 25 | f1_macro | 0.8361 | ± 0.007 |
| | | all | 25 | acc | 0.8371 | ± 0.007 |
| oab_exams | 1.5 | all | 3 | acc | 0.3841 | ± 0.006 |
| | | exam_id__2011-03 | 3 | acc | 0.3636 | ± 0.0279 |
| | | exam_id__2014-14 | 3 | acc | 0.475 | ± 0.0323 |
| | | exam_id__2016-21 | 3 | acc | 0.4125 | ± 0.0318 |
| | | exam_id__2012-06a | 3 | acc | 0.3875 | ± 0.0313 |
| | | exam_id__2014-13 | 3 | acc | 0.325 | ± 0.0303 |
| | | exam_id__2015-16 | 3 | acc | 0.425 | ± 0.032 |
| | | exam_id__2010-02 | 3 | acc | 0.4 | ± 0.0283 |
| | | exam_id__2012-08 | 3 | acc | 0.3875 | ± 0.0314 |
| | | exam_id__2011-05 | 3 | acc | 0.375 | ± 0.0312 |
| | | exam_id__2017-22 | 3 | acc | 0.4 | ± 0.0316 |
| | | exam_id__2018-25 | 3 | acc | 0.4125 | ± 0.0318 |
| | | exam_id__2012-09 | 3 | acc | 0.3636 | ± 0.0317 |
| | | exam_id__2017-24 | 3 | acc | 0.3375 | ± 0.0304 |
| | | exam_id__2016-20a | 3 | acc | 0.3125 | ± 0.0299 |
| | | exam_id__2012-06 | 3 | acc | 0.425 | ± 0.0318 |
| | | exam_id__2013-12 | 3 | acc | 0.4375 | ± 0.0321 |
| | | exam_id__2016-20 | 3 | acc | 0.45 | ± 0.0322 |
| | | exam_id__2013-11 | 3 | acc | 0.4 | ± 0.0316 |
| | | exam_id__2015-17 | 3 | acc | 0.4231 | ± 0.0323 |
| | | exam_id__2015-18 | 3 | acc | 0.4 | ± 0.0316 |
| | | exam_id__2017-23 | 3 | acc | 0.35 | ± 0.0308 |
| | | exam_id__2010-01 | 3 | acc | 0.2471 | ± 0.0271 |
| | | exam_id__2011-04 | 3 | acc | 0.375 | ± 0.0313 |
| | | exam_id__2016-19 | 3 | acc | 0.4103 | ± 0.0321 |
| | | exam_id__2013-10 | 3 | acc | 0.3375 | ± 0.0305 |
| | | exam_id__2012-07 | 3 | acc | 0.3625 | ± 0.031 |
| | | exam_id__2014-15 | 3 | acc | 0.3846 | ± 0.0318 |
| portuguese_hate_speech_binary | 1 | all | 25 | f1_macro | 0.6187 | ± 0.0119 |
| | | all | 25 | acc | 0.6322 | ± 0.0117 |
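
The standard errors reported with each score can be turned into approximate confidence intervals, which helps when judging whether two scores really differ. The sketch below assumes a normal approximation with z = 1.96 for a 95% interval; `confidence_interval` is a hypothetical helper, and the inputs are the assin2_rte f1_macro and enem acc rows from the results above.

```python
def confidence_interval(value: float, stderr: float, z: float = 1.96):
    """Approximate interval value ± z * stderr (normal approximation)."""
    half_width = z * stderr
    return value - half_width, value + half_width


# (value, stderr) pairs copied from the evaluation results above.
rows = {
    "assin2_rte f1_macro": (0.9013, 0.0043),
    "enem acc": (0.5843, 0.0075),
}

for name, (value, stderr) in rows.items():
    low, high = confidence_interval(value, stderr)
    print(f"{name}: {low:.4f} to {high:.4f}")
```

The per-exam rows have much larger standard errors than the aggregated ones, so their intervals are correspondingly wider.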

Open Portuguese LLM Leaderboard Evaluation Results

Detailed results can be found here

| Metric | Value |
|---|---|
| Average | 65.1 |
| ENEM Challenge (No Images) | 60.81 |
| BLUEX (No Images) | 46.87 |
| OAB Exams | 38.59 |
| Assin2 RTE | 90.27 |
| Assin2 STS | 72.25 |
| FaQuAD NLI | 64.35 |
| HateBR Binary | 83.15 |
| PT Hate Speech Binary | 64.82 |
| tweetSentBR | 64.80 |
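
The reported average is consistent with an unweighted mean of the nine task scores. A quick check, assuming that equal weighting is how the leaderboard aggregates (the source does not state the method):

```python
# Task scores copied from the leaderboard results above.
scores = {
    "ENEM Challenge (No Images)": 60.81,
    "BLUEX (No Images)": 46.87,
    "OAB Exams": 38.59,
    "Assin2 RTE": 90.27,
    "Assin2 STS": 72.25,
    "FaQuAD NLI": 64.35,
    "HateBR Binary": 83.15,
    "PT Hate Speech Binary": 64.82,
    "tweetSentBR": 64.80,
}

# Unweighted mean across tasks.
average = sum(scores.values()) / len(scores)
print(round(average, 2))
```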