Model Card for mEdIT-xxl

The medit-xxl model was obtained by fine-tuning the MBZUAI/bactrian-x-llama-13b-lora model on the mEdIT dataset.

Paper: mEdIT: Multilingual Text Editing via Instruction Tuning

Authors: Vipul Raheja, Dimitris Alikaniotis, Vivek Kulkarni, Bashar Alhafni, Dhruv Kumar

Model Details

Model Description

  • Language(s) (NLP): Arabic, Chinese, English, German, Japanese, Korean, Spanish
  • Finetuned from model: MBZUAI/bactrian-x-llama-13b-lora

Model Sources

How to use

Given an edit instruction and an original text, our model can generate the edited version of the text.

Specifically, our models support both multilingual and cross-lingual text revision. The input and output texts are always in the same language. What distinguishes the two settings is the language of the edit instruction: when it matches the language of the input text, the setting is monolingual; when it differs, the setting is cross-lingual.

Instruction format

Follow the instruction format below exactly; prompts that deviate from it may produce lower-quality results.

instruction_tokens = [
    "Instruction",
    "Anweisung",
    ...
]

input_tokens = [
    "Input",
    "Aporte",
    ...
]

output_tokens = [
    "Output",
    "Produzione",
    ...
]

task_descriptions = [
    "Fix grammatical errors in this sentence",  # <-- GEC task
    "Umschreiben Sie den Satz",                 # <-- Paraphrasing
    ...
]

The full list of supported instruction tokens, input/output tokens, and task descriptions can be found in the Appendix of our paper.

prompt_template = """### <instruction_token>:\n<task_description>\n### <input_token>:\n<input>\n### <output_token>:\n\n"""

Note that the tokens and the task description need not be in the language of the input (in the case of cross-lingual revision).
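
The template can be filled in with a small helper. What follows is a minimal sketch, not code from the mEdIT repository: `build_prompt`, `PROMPT_TEMPLATE`, and the parameter names are hypothetical, and the default tokens are the English entries from the lists above. The example pairs an English instruction with a German input, which is the cross-lingual case.

```python
# Hypothetical helper (not part of the mEdIT codebase): fills in the
# prompt template described in this model card.
PROMPT_TEMPLATE = (
    "### {instruction_token}:\n{task_description}\n"
    "### {input_token}:\n{text}\n"
    "### {output_token}:\n\n"
)


def build_prompt(task_description, text, instruction_token="Instruction",
                 input_token="Input", output_token="Output"):
    """Return a prompt in the '### <token>:' layout of the template."""
    return PROMPT_TEMPLATE.format(
        instruction_token=instruction_token,
        task_description=task_description,
        input_token=input_token,
        text=text,
        output_token=output_token,
    )


# Cross-lingual example: English instruction, German input text.
prompt = build_prompt(
    "Fix grammatical errors in this sentence",
    "Ich haben eines kleines Katze ,",
)
print(repr(prompt))
```

Keeping the tokens as parameters lets the same helper produce prompts in any of the supported instruction languages, provided the tokens and task description are taken from the paper's Appendix.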

Run the model

Make sure you have the following libraries installed:

- peft
- protobuf
- sentencepiece
- tokenizers
- torch
- transformers

from transformers import AutoTokenizer, AutoModelForCausalLM

model_id = "grammarly/medit-xxl"
tokenizer = AutoTokenizer.from_pretrained(model_id)

model = AutoModelForCausalLM.from_pretrained(model_id)

# English GEC using Japanese instructions
prompt = '### 命令:\n文章を文法的にする\n### 入力:\nI has small cat ,\n### 出力:\n\n'

inputs = tokenizer(prompt, return_tensors='pt')

outputs = model.generate(**inputs, max_new_tokens=20)

print(tokenizer.decode(outputs[0], skip_special_tokens=True))

# --> I have a small cat ,

# German GEC using Japanese instructions
prompt = '### 命令:\n文章を文法的にする\n### 入力:\nIch haben eines kleines Katze ,\n### 出力:\n\n'

# ...
# --> Ich habe eine kleine Katze ,
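
For decoder-only models, `generate` returns the prompt tokens followed by the newly generated ones, so the decoded string contains the prompt as well as the edited text. The sketch below shows one way to isolate the edit. It assumes the decoded text still contains the `### <output_token>:` marker verbatim and that the model stops after the edit. `extract_edit` is a hypothetical helper, and the `decoded` string is an illustrative stand-in rather than captured model output.

```python
def extract_edit(decoded, output_token="Output"):
    """Return the text after the last '### <output_token>:' marker.

    Assumption: the marker survives decoding unchanged. If it is not
    found, the whole string is returned with whitespace stripped.
    """
    marker = f"### {output_token}:"
    _, found, tail = decoded.rpartition(marker)
    return tail.strip() if found else decoded.strip()


# Illustrative decoded string: the prompt followed by the edited text.
decoded = (
    "### Instruction:\nFix grammatical errors in this sentence\n"
    "### Input:\nI has small cat ,\n"
    "### Output:\n\nI have a small cat ,"
)
print(extract_edit(decoded))  # prints: I have a small cat ,
```

Pass the output token that was used in the prompt (for example the Japanese one) so the marker matches.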

Software

https://github.com/vipulraheja/medit

Citation

BibTeX:

@article{raheja2023medit,
      title={mEdIT: Multilingual Text Editing via Instruction Tuning},
      author={Vipul Raheja and Dimitris Alikaniotis and Vivek Kulkarni and Bashar Alhafni and Dhruv Kumar},
      year={2024},
      eprint={2402.16472v1},
      archivePrefix={arXiv},
      primaryClass={cs.CL}
}

APA: Raheja, V., Alikaniotis, D., Kulkarni, V., Alhafni, B., & Kumar, D. (2024). mEdIT: Multilingual Text Editing via Instruction Tuning. arXiv. https://arxiv.org/abs/2402.16472
