Jargon-NACHOS-4096

Jargon is an efficient transformer encoder language model for French that combines the Linformer attention mechanism with the RoBERTa model architecture.

Jargon is available in several versions with different context sizes and types of pre-training corpora.

Model | Initialised from | Training Data
jargon-general-base | scratch | 8.5GB Web Corpus
jargon-general-biomed | jargon-general-base | 5.4GB Medical Corpus
jargon-general-legal | jargon-general-base | 18GB Legal Corpus
jargon-multidomain-base | jargon-general-base | Medical+Legal Corpora
jargon-legal | scratch | 18GB Legal Corpus
jargon-legal-4096 | scratch | 18GB Legal Corpus
jargon-biomed | scratch | 5.4GB Medical Corpus
jargon-biomed-4096 | scratch | 5.4GB Medical Corpus
jargon-NACHOS | scratch | NACHOS
jargon-NACHOS-4096 | scratch | NACHOS

Evaluation

The Jargon models were evaluated on a range of specialized downstream tasks.

Biomedical Benchmark

Results are averaged across five runs with different random seeds.

Task | FrenchMedMCQA | MQC | CAS-POS | ESSAI-POS | CAS-SG | MEDLINE | EMEA | E3C-NER | CLISTER
Task Type | Sequence Classification | Sequence Classification | Token Classification | Token Classification | Token Classification | Token Classification | Token Classification | Token Classification | STS
Metric | EMR | Accuracy | Macro-F1 | Macro-F1 | Weighted F1 | Weighted F1 | Weighted F1 | Weighted F1 | Spearman Correlation
jargon-general-base | 12.9 | 76.7 | 96.6 | 96.0 | 69.4 | 81.7 | 96.5 | 91.9 | 78.0
jargon-biomed | 15.3 | 91.1 | 96.5 | 95.6 | 75.1 | 83.7 | 96.5 | 93.5 | 74.6
jargon-biomed-4096 | 14.4 | 78.9 | 96.6 | 95.9 | 73.3 | 82.3 | 96.3 | 92.5 | 65.3
jargon-general-biomed | 16.1 | 69.7 | 95.1 | 95.1 | 67.8 | 78.2 | 96.6 | 91.3 | 59.7
jargon-multidomain-base | 14.9 | 86.9 | 96.3 | 96.0 | 70.6 | 82.4 | 96.6 | 92.6 | 74.8
jargon-NACHOS | 13.3 | 90.7 | 96.3 | 96.2 | 75.0 | 83.4 | 96.8 | 93.1 | 70.9
jargon-NACHOS-4096 | 18.4 | 93.2 | 96.2 | 95.9 | 74.9 | 83.8 | 96.8 | 93.2 | 74.9
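
Each figure in the table is an average over five seeded runs. As a minimal sketch of that aggregation step, the snippet below averages one metric over several runs and also reports the sample standard deviation. The helper name `summarize_runs` and the scores are placeholders chosen for illustration; they are not taken from the paper or from the Jargon evaluation code, and the source does not say whether a spread was reported.

```python
from statistics import mean, stdev


def summarize_runs(scores):
    """Return (mean, sample standard deviation) of one metric over several seeded runs."""
    if len(scores) < 2:
        raise ValueError("need at least two runs to estimate a spread")
    return mean(scores), stdev(scores)


# Placeholder scores for five hypothetical seeds; NOT results from the paper.
placeholder_scores = [74.0, 75.0, 76.0, 75.0, 75.0]
avg, spread = summarize_runs(placeholder_scores)
print(f"{avg:.1f} ± {spread:.1f}")
```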

For more details, see the paper, accepted for publication at LREC-COLING 2024.

Using Jargon models with HuggingFace transformers

You can get started with jargon-NACHOS-4096 using the code snippet below:

from transformers import AutoModelForMaskedLM, AutoTokenizer, pipeline

# trust_remote_code=True is required because the model repository contains custom code.
tokenizer = AutoTokenizer.from_pretrained("PantagrueLLM/jargon-NACHOS-4096", trust_remote_code=True)
model = AutoModelForMaskedLM.from_pretrained("PantagrueLLM/jargon-NACHOS-4096", trust_remote_code=True)

# Build a fill-mask pipeline and predict candidates for the masked token.
jargon_maskfiller = pipeline("fill-mask", model=model, tokenizer=tokenizer)
output = jargon_maskfiller("Il est allé au <mask> hier")
print(output)

You can also use the classes AutoModel, AutoModelForSequenceClassification, or AutoModelForTokenClassification to load Jargon models, depending on the downstream task in question.
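
As a rough sketch of how the choice of class might be organised, the helper below maps a task label to one of the Auto classes named above and loads the matching model. The function name `load_jargon`, the task labels, and the `TASK_TO_AUTO_CLASS` table are our own illustrative choices, not part of the Jargon repository or of transformers. The sketch also assumes that extra keyword arguments such as `num_labels` are forwarded to the model configuration, as they are for standard transformers checkpoints.

```python
import importlib

MODEL_ID = "PantagrueLLM/jargon-NACHOS-4096"

# Hypothetical task labels (our own naming) mapped to transformers Auto class names.
TASK_TO_AUTO_CLASS = {
    "features": "AutoModel",
    "fill-mask": "AutoModelForMaskedLM",
    "sequence-classification": "AutoModelForSequenceClassification",
    "token-classification": "AutoModelForTokenClassification",
}


def load_jargon(task, model_id=MODEL_ID, **kwargs):
    """Load the tokenizer and the task-specific model for a Jargon checkpoint."""
    if task not in TASK_TO_AUTO_CLASS:
        raise ValueError(
            f"Unknown task {task!r}; expected one of {sorted(TASK_TO_AUTO_CLASS)}"
        )
    # Imported lazily so the task table can be inspected without transformers installed.
    transformers = importlib.import_module("transformers")
    model_cls = getattr(transformers, TASK_TO_AUTO_CLASS[task])
    tokenizer = transformers.AutoTokenizer.from_pretrained(
        model_id, trust_remote_code=True
    )
    # Extra keyword arguments (e.g. num_labels) are passed through to from_pretrained.
    model = model_cls.from_pretrained(model_id, trust_remote_code=True, **kwargs)
    return tokenizer, model
```

For instance, `load_jargon("token-classification", num_labels=5)` would load the checkpoint with a token-classification head; the label count here is purely illustrative, and the new head still has to be fine-tuned on the target task before its predictions mean anything.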

  • Language(s): French
  • License: MIT
  • Developed by: Vincent Segonne
  • Funded by
    • GENCI-IDRIS (Grant 2022 A0131013801)
    • French National Research Agency: Pantagruel grant ANR-23-IAS1-0001
    • MIAI@Grenoble Alpes ANR-19-P3IA-0003
    • PROPICTO ANR-20-CE93-0005
    • Lawbot ANR-20-CE38-0013
    • Swiss National Science Foundation (grant PROPICTO N°197864)
  • Authors
    • Vincent Segonne
    • Aidan Mannion
    • Laura Cristina Alonzo Canul
    • Alexandre Audibert
    • Xingyu Liu
    • Cécile Macaire
    • Adrien Pupier
    • Yongxin Zhou
    • Mathilde Aguiar
    • Felix Herron
    • Magali Norré
    • Massih-Reza Amini
    • Pierrette Bouillon
    • Iris Eshkol-Taravella
    • Emmanuelle Esperança-Rodier
    • Thomas François
    • Lorraine Goeuriot
    • Jérôme Goulian
    • Mathieu Lafourcade
    • Benjamin Lecouteux
    • François Portet
    • Fabien Ringeval
    • Vincent Vandeghinste
    • Maximin Coavoux
    • Marco Dinarelli
    • Didier Schwab

Citation

If you use this model for your own research work, please cite as follows:

@inproceedings{segonne:hal-04535557,
  TITLE = {{Jargon: A Suite of Language Models and Evaluation Tasks for French Specialized Domains}},
  AUTHOR = {Segonne, Vincent and Mannion, Aidan and Alonzo Canul, Laura Cristina and Audibert, Alexandre and Liu, Xingyu and Macaire, C{\'e}cile and Pupier, Adrien and Zhou, Yongxin and Aguiar, Mathilde and Herron, Felix and Norr{\'e}, Magali and Amini, Massih-Reza and Bouillon, Pierrette and Eshkol-Taravella, Iris and Esperan{\c c}a-Rodier, Emmanuelle and Fran{\c c}ois, Thomas and Goeuriot, Lorraine and Goulian, J{\'e}r{\^o}me and Lafourcade, Mathieu and Lecouteux, Benjamin and Portet, Fran{\c c}ois and Ringeval, Fabien and Vandeghinste, Vincent and Coavoux, Maximin and Dinarelli, Marco and Schwab, Didier},
  URL = {https://hal.science/hal-04535557},
  BOOKTITLE = {{LREC-COLING 2024 - Joint International Conference on Computational Linguistics, Language Resources and Evaluation}},
  ADDRESS = {Turin, Italy},
  YEAR = {2024},
  MONTH = May,
  KEYWORDS = {Self-supervised learning ; Pretrained language models ; Evaluation benchmark ; Biomedical document processing ; Legal document processing ; Speech transcription},
  PDF = {https://hal.science/hal-04535557/file/FB2_domaines_specialises_LREC_COLING24.pdf},
  HAL_ID = {hal-04535557},
  HAL_VERSION = {v1},
}