license: mit
language:
  - ko
  - vi
metrics:
  - bleu
base_model:
  - facebook/mbart-large-50-many-to-many-mmt
pipeline_tag: translation
library_name: transformers
tags:
  - mbart
  - mbart-50
  - text2text-generation

Model Card for mbart-large-50-mmt-ko-vi

This model is facebook/mbart-large-50-many-to-many-mmt fine-tuned on multilingual translation data of Korean legal documents for Korean-to-Vietnamese translation.


Model Details

Model Description

  • Developed by: Jaeyoon Myoung, Heewon Kwak
  • Shared by: ofu
  • Model type: Language model (Translation)
  • Language(s) (NLP): Korean, Vietnamese
  • License: Apache 2.0
  • Parent Model: facebook/mbart-large-50-many-to-many-mmt

Uses

Direct Use

This model is used for text translation from Korean to Vietnamese.
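A minimal sketch of direct use with the Transformers library follows. The repository id is a placeholder, since this card does not state the checkpoint's Hub path, and the function name and generation settings are illustrative assumptions rather than the authors' inference setup. The language codes `ko_KR` and `vi_VN` are the mBART-50 codes for Korean and Vietnamese.

```python
from typing import List

# Hypothetical repository id: replace with the actual Hub path of this checkpoint.
MODEL_ID = "your-namespace/mbart-large-50-mmt-ko-vi"
SRC_LANG = "ko_KR"  # mBART-50 language code for Korean
TGT_LANG = "vi_VN"  # mBART-50 language code for Vietnamese


def translate_ko_to_vi(
    sentences: List[str],
    model_id: str = MODEL_ID,
    max_new_tokens: int = 256,  # assumed limit, not taken from the card
) -> List[str]:
    """Translate Korean sentences to Vietnamese with an mBART-50 checkpoint."""
    # Imported lazily so this module loads even without the heavy dependencies.
    from transformers import MBart50TokenizerFast, MBartForConditionalGeneration

    tokenizer = MBart50TokenizerFast.from_pretrained(model_id, src_lang=SRC_LANG)
    model = MBartForConditionalGeneration.from_pretrained(model_id)

    batch = tokenizer(sentences, return_tensors="pt", padding=True, truncation=True)
    generated = model.generate(
        **batch,
        # mBART-50 selects the output language through the first generated token.
        forced_bos_token_id=tokenizer.convert_tokens_to_ids(TGT_LANG),
        max_new_tokens=max_new_tokens,
    )
    return tokenizer.batch_decode(generated, skip_special_tokens=True)
```

Calling `translate_ko_to_vi(["근로자는 근로계약을 체결한다."])` downloads the checkpoint on first use and returns a list with one Vietnamese string.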

Out-of-Scope Use

This model is not suitable for translation directions other than Korean to Vietnamese.


Bias, Risks, and Limitations

The model may contain biases inherited from the training data and may produce inappropriate translations for sensitive topics.


Training Details

Training Data

The model was trained using multilingual translation data of Korean legal documents provided by AI Hub.

Training Procedure

Preprocessing

  • Removed unnecessary whitespace, special characters, and line breaks.
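The cleaning step above could look roughly like the sketch below. The card does not list which special characters were removed, so the character classes here (control characters and a handful of decorative symbols) are assumptions, as is the function name `clean_text`.

```python
import re

# Assumption: "special characters" covers control characters and a few
# decorative symbols; the card does not specify the exact set.
_CONTROL_CHARS = re.compile(r"[\u0000-\u0008\u000b\u000c\u000e-\u001f\u007f]")
_DECORATIVE = re.compile(r"[※◆■□▶●○★☆]")
_WHITESPACE = re.compile(r"\s+")


def clean_text(text: str) -> str:
    """Drop special characters, then collapse whitespace and line breaks."""
    text = _CONTROL_CHARS.sub("", text)
    text = _DECORATIVE.sub("", text)
    # Runs of spaces, tabs, and line breaks become a single space.
    text = _WHITESPACE.sub(" ", text)
    return text.strip()
```

Applying the same function to both the Korean source and the Vietnamese target keeps the sentence pairs aligned.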

Speeds, Sizes, Times

  • Training Time: 1 hour 25 minutes (5,100 seconds) on Nvidia RTX 4090
  • Throughput: ~3.51 samples/second
  • Total Training Samples: 17,922
  • Model Checkpoint Size: Approximately 2.3 GB
  • Gradient Accumulation Steps: 4
  • FP16 Mixed Precision Enabled: Yes
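The reported throughput matches the total sample count divided by the wall-clock training time, as this short check shows. It uses only the figures listed above; the variable names are illustrative.

```python
TRAIN_SECONDS = 85 * 60   # 1 hour 25 minutes
TOTAL_SAMPLES = 17_922    # total training samples from the card

throughput = TOTAL_SAMPLES / TRAIN_SECONDS
print(f"{TRAIN_SECONDS} s, {throughput:.2f} samples/s")  # 5100 s, 3.51 samples/s
```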

Training hyperparameters

The following hyperparameters were used during training:

  • learning_rate: 0.0001
  • train_batch_size: 8 (per device)
  • eval_batch_size: 8 (per device)
  • seed: 42
  • distributed_type: single-node (since _n_gpu=1 and no distributed training setup is indicated)
  • num_devices: 1 (single NVIDIA GPU: RTX 4090)
  • gradient_accumulation_steps: 4
  • total_train_batch_size: 32 (calculated as train_batch_size * gradient_accumulation_steps)
  • total_eval_batch_size: 8 (evaluation does not use gradient accumulation)
  • optimizer: AdamW (indicated by optim=OptimizerNames.ADAMW_TORCH)
  • lr_scheduler_type: linear (indicated by lr_scheduler_type=SchedulerType.LINEAR)
  • lr_scheduler_warmup_steps: 100
  • num_epochs: 3
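To make the batch-size and scheduler settings concrete, the sketch below recomputes the effective batch size and evaluates a linear schedule with warmup in plain Python. The total step count is an assumption: it treats 17,922 as the number of training examples per epoch, which the card does not confirm, and the Trainer's own step rounding may differ slightly. The function `linear_warmup_lr` is illustrative and mirrors the usual linear-warmup, linear-decay rule rather than reproducing the training code.

```python
import math

PER_DEVICE_BATCH = 8
GRAD_ACCUM_STEPS = 4
NUM_DEVICES = 1
EFFECTIVE_BATCH = PER_DEVICE_BATCH * GRAD_ACCUM_STEPS * NUM_DEVICES  # 32

# Assumption: 17,922 examples per epoch, 3 epochs.
STEPS_PER_EPOCH = math.ceil(17_922 / EFFECTIVE_BATCH)
TOTAL_STEPS = STEPS_PER_EPOCH * 3


def linear_warmup_lr(
    step: int,
    peak_lr: float = 1e-4,
    warmup_steps: int = 100,
    total_steps: int = TOTAL_STEPS,
) -> float:
    """Learning rate at an optimizer step: linear warmup, then linear decay to 0."""
    if step < warmup_steps:
        return peak_lr * step / max(1, warmup_steps)
    remaining = max(0, total_steps - step)
    return peak_lr * remaining / max(1, total_steps - warmup_steps)
```

Under these assumptions the rate climbs from 0 to 0.0001 over the first 100 optimizer steps and then decays linearly to 0 at the final step.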

Evaluation

Testing Data

The evaluation set is a subset extracted from Korean labor law precedents.

Metrics

  • BLEU

Results

  • BLEU Score: 29.69
  • Accuracy: 95.65%
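A corpus-level BLEU score of this kind can be computed with sacrebleu along the lines below. This is a sketch with sacrebleu's default settings; the card does not state which tokenizer or smoothing options produced the reported score, and the function name is illustrative.

```python
from typing import List


def corpus_bleu_score(hypotheses: List[str], references: List[str]) -> float:
    """Corpus-level BLEU for one reference per hypothesis, sacrebleu defaults."""
    # Imported lazily so this module loads even if sacrebleu is not installed.
    import sacrebleu

    # sacrebleu expects a list of reference streams, hence the extra list.
    result = sacrebleu.corpus_bleu(hypotheses, [references])
    return result.score
```

The hypotheses are the model's Vietnamese outputs and the references are the human translations, in the same order.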

Environmental Impact

  • Hardware Type: NVIDIA RTX 4090
  • Power Consumption: ~450W
  • Training Time: 1 hour 25 minutes (1.42 hours)
  • Electricity Consumption: ~0.639 kWh
  • Carbon Emission Factor (South Korea): 0.459 kgCO₂/kWh
  • Estimated Carbon Emissions: ~0.293 kgCO₂
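The emissions estimate follows from multiplying power draw, training time, and the emission factor. The check below reproduces the figures above, using the rounded training time of 1.42 hours as the card does.

```python
POWER_KW = 0.450          # ~450 W power draw
HOURS = 1.42              # 1 hour 25 minutes, rounded as in the card
EMISSION_FACTOR = 0.459   # kgCO2 per kWh (South Korea)

energy_kwh = POWER_KW * HOURS
emissions_kg = energy_kwh * EMISSION_FACTOR
print(f"{energy_kwh:.3f} kWh, {emissions_kg:.3f} kgCO2")  # 0.639 kWh, 0.293 kgCO2
```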

Technical Specifications

  • Model Architecture: Based on mBART-large-50, a multilingual sequence-to-sequence transformer model designed for translation tasks. The architecture includes 12 encoder and 12 decoder layers with 1,024 hidden units.

  • Software:

    • sacrebleu for evaluation
    • Hugging Face Transformers library for fine-tuning
    • Python 3.11.9 and PyTorch 2.4.0
  • Hardware: NVIDIA RTX 4090 with 24 GB VRAM was used for training and inference.

  • Tokenization and Preprocessing: The tokenization was performed using the SentencePiece model pre-trained with mBART-large-50. Text preprocessing included removing special characters, unnecessary whitespace, and normalizing line breaks.


Citation

Currently, there are no papers or blog posts available for this model.


Model Card Contact