---
library_name: transformers
license: apache-2.0
language:
- ja
- en
---

# Retrieva BERT Model

**RetrievaBERT** is a Transformer encoder pre-trained with Megatron-LM.
It is designed for Japanese text.

## Model Details

### Model Description

This model offers several advanced features compared to traditional BERT models:

- **PreNorm**: Improved stability during training.
- **SwiGLU**: Enhanced activation function for better performance.
- **Grouped-Query Attention (Multi-Query Attention)**: Efficient attention mechanism.
- **Max Sequence Length**: 2048 tokens, allowing for longer context.
- **Parameters**: 1.3 billion parameters.
- **Pre-training Objective**: Masked Language Modeling (MLM) only; Next Sentence Prediction (NSP) is not used.
- **Token Type IDs**: Not used in this model.

### Model Sources

- **Developed by:** Retrieva, Inc.
- **Model type:** Based on MegatronBERT Architecture.
- **Language(s) (NLP):** Primarily Japanese (optional support for English).
- **License:** Apache 2.0

## Uses

This model can be used as a Masked Language Model (MLM).
However, it is primarily intended to be fine-tuned on downstream tasks.
Depending on your use case, follow the appropriate section below.

### Direct Use

This model is pre-trained using Masked Language Modeling.
The mask token used is `<MASK|LLM-jp>`.
Note that you need to set `trust_remote_code` to `True` because RetrievaBERT uses a custom model implementation.

Example code for direct use:

```python
from transformers import AutoModelForMaskedLM, AutoTokenizer, pipeline

model_id = "retrieva-jp/bert-1.3b"
# trust_remote_code=True is required because RetrievaBERT uses a custom model implementation.
model = AutoModelForMaskedLM.from_pretrained(model_id, trust_remote_code=True)
tokenizer = AutoTokenizer.from_pretrained(model_id)
pipe = pipeline("fill-mask", model=model, tokenizer=tokenizer)

# English gloss of the input: "Hello! My name is <MASK|LLM-jp>!"
text = "こんにちは！私の名前は<MASK|LLM-jp>です！"
print(pipe(text))
```

### Downstream Use

RetrievaBERT is compatible with Hugging Face's `AutoModel` classes.
To fine-tune RetrievaBERT for your specific task, use the corresponding `AutoModel` class.
For detailed configuration, refer to the `config.json` file.
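
As a rough illustration of the paragraph above, the sketch below loads RetrievaBERT with a sequence-classification head. It assumes that the repository's custom code registers `AutoModelForSequenceClassification`; the helper name `load_for_classification` and the choice of `num_labels` are hypothetical, so check `config.json` for the classes that are actually available.

```python
MODEL_ID = "retrieva-jp/bert-1.3b"


def load_for_classification(num_labels: int, model_id: str = MODEL_ID):
    """Return (tokenizer, model) for fine-tuning on a classification task.

    Hypothetical helper: assumes the repository's remote code provides a
    sequence-classification head for AutoModelForSequenceClassification.
    """
    # Imported lazily so this module can be imported without transformers installed.
    from transformers import AutoModelForSequenceClassification, AutoTokenizer

    tokenizer = AutoTokenizer.from_pretrained(model_id)
    model = AutoModelForSequenceClassification.from_pretrained(
        model_id,
        num_labels=num_labels,
        trust_remote_code=True,  # required for the custom model implementation
    )
    return tokenizer, model
```

Calling `load_for_classification(num_labels=2)` downloads the weights and returns a pair that can be passed to a standard fine-tuning loop. The classification head is not part of the pre-trained checkpoint, so it should be expected to start from random initialization.
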

## Training Details

### Training Data

The Retrieva BERT model was pre-trained on the union of five datasets:

- [Japanese CommonCrawl Dataset by LLM-jp](https://gitlab.llm-jp.nii.ac.jp/datasets/llm-jp-corpus-v2)
- [RefinedWeb](https://huggingface.co/datasets/tiiuae/falcon-refinedweb)
- Chinese Wikipedia (dump of 2024-01-20)
- Korean Wikipedia (dump of 2024-01-20)
- [The Stack](https://huggingface.co/datasets/bigcode/the-stack)

The model was trained on 180 billion tokens drawn from the above datasets.

### Training Procedure

The model was trained on 4 to 32 H100 GPUs with a batch size of 1,024.
We used curriculum learning similar to Sequence Length Warmup, training with the following sequence lengths and numbers of steps:

- Sequence length 128: 31,000 steps
- Sequence length 256: 219,000 steps
- Sequence length 512: 192,000 steps
- Sequence length 2048: 12,000 steps
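
The schedule above implies a total step count and an upper bound on the number of tokens processed. The sketch below works through that arithmetic. It assumes that the batch size of 1,024 counts sequences, that it applies to every stage, and that each sequence is filled to the stage's maximum length; the card does not state these details, so treat the result as an estimate.

```python
BATCH_SIZE = 1_024  # assumption: sequences per step, identical in every stage

# (sequence length, number of steps) for each curriculum stage
CURRICULUM = [(128, 31_000), (256, 219_000), (512, 192_000), (2048, 12_000)]


def max_tokens(seq_len: int, steps: int, batch_size: int = BATCH_SIZE) -> int:
    """Upper bound on tokens seen in one stage (every sequence fully filled)."""
    return seq_len * batch_size * steps


total_steps = sum(steps for _, steps in CURRICULUM)
total_tokens = sum(max_tokens(seq_len, steps) for seq_len, steps in CURRICULUM)

for seq_len, steps in CURRICULUM:
    print(f"seq_len {seq_len:>4}: {max_tokens(seq_len, steps) / 1e9:6.1f}B tokens at most")
print(f"total steps: {total_steps:,}")                             # 454,000
print(f"upper bound on tokens: {total_tokens / 1e9:.1f}B")         # 187.3B
```

Under these assumptions the bound of roughly 187 billion tokens is in line with the 180 billion tokens reported for pre-training.
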

#### Training Hyperparameters

The model was trained with the following hyperparameters:

- Learning rate: 1.5e-4
- Learning rate decay style: Linear
- Learning rate warmup fraction: 0.01
- Minimum learning rate: 1e-6
- Floating-point format: BF16
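
To make the interaction of these values concrete, the following sketch computes a learning rate with linear warmup followed by linear decay. It is an illustration, not the Megatron-LM implementation: it assumes that warmup starts from zero, that the warmup fraction is taken over all training steps, and that a single schedule spans the 454,000 steps obtained by summing the curriculum stages.

```python
PEAK_LR = 1.5e-4
MIN_LR = 1e-6
WARMUP_FRACTION = 0.01
# Assumption: one schedule covers all curriculum stages.
TOTAL_STEPS = 31_000 + 219_000 + 192_000 + 12_000


def learning_rate(step: int, total_steps: int = TOTAL_STEPS) -> float:
    """Linear warmup to PEAK_LR, then linear decay to MIN_LR."""
    warmup_steps = round(total_steps * WARMUP_FRACTION)
    if step < warmup_steps:
        # Warmup: ramp linearly from 0 (assumed start value) to the peak.
        return PEAK_LR * step / warmup_steps
    if step >= total_steps:
        return MIN_LR
    # Decay: interpolate linearly from the peak down to the minimum.
    progress = (step - warmup_steps) / (total_steps - warmup_steps)
    return PEAK_LR + (MIN_LR - PEAK_LR) * progress


for step in (0, 4_540, 227_000, 454_000):
    print(f"step {step:>7,}: lr = {learning_rate(step):.3e}")
```

With a warmup fraction of 0.01, the peak learning rate is reached after 4,540 of the 454,000 steps, and the rate then falls linearly to the minimum at the final step.
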

## Evaluation

We fine-tuned the following models and evaluated them on the [JGLUE](https://github.com/yahoojapan/JGLUE) development set.
We adjusted the learning rate and training epochs for each model and task in accordance with [the JGLUE paper](https://www.jstage.jst.go.jp/article/jnlp/30/1/30_63/_pdf/-char/ja).

| Model | MARC-ja/acc | JSTS/pearson | JSTS/spearman | JNLI/acc | JSQuAD/EM | JSQuAD/F1 | JComQA/acc |
| :--- |---:|---:|---:|---:|---:|---:|---:|
| tohoku-nlp/bert-base-japanese-v3 | 0.957 | 0.914 | 0.876 | 0.906 | 0.878 | 0.946 | 0.849 |
| tohoku-nlp/bert-large-japanese-v2 | 0.959 | 0.916 | 0.877 | 0.901 | 0.884 | 0.951 | 0.867 |
| ku-nlp/deberta-v3-base-japanese | 0.958 | 0.925 | 0.890 | 0.902 | 0.925 | 0.910 | 0.882 |
| retrieva-jp/bert-1.3b | 0.952 | 0.916 | 0.877 | 0.896 | 0.916 | 0.879 | 0.815 |

## Technical Specifications

### Model Architectures

The Retrieva BERT model is based on BERT with the following hyperparameters:

- Number of layers: 48
- Hidden layer size: 1536
- FFN hidden layer size: 4096
- Number of attention heads: 24
- Maximum length of position embeddings: 2048

As mentioned earlier, the main differences from the original BERT are:

- PreNorm: Improved stability during training.
- SwiGLU: Enhanced activation function for better performance.
- Grouped-Query Attention (Multi-Query Attention): Efficient attention mechanism.
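
As a sanity check on the figures above, the sketch below estimates the number of weights in the Transformer layers. The card does not state the number of key/value heads or the vocabulary size, so the estimate brackets the attention cost between one key/value head (multi-query) and 24 (full multi-head attention) and leaves out embeddings, biases, and normalization weights. It also assumes that the FFN size of 4096 is the width of each SwiGLU branch, which gives three weight matrices per layer.

```python
NUM_LAYERS = 48
HIDDEN = 1536
FFN_HIDDEN = 4096
NUM_HEADS = 24


def layer_weights(num_kv_heads: int) -> int:
    """Weight-matrix parameters in one Transformer layer (no biases or norms)."""
    head_dim = HIDDEN // NUM_HEADS  # 64
    query_and_output = 2 * HIDDEN * HIDDEN
    # Grouped-query attention shares key/value projections across query heads.
    key_and_value = 2 * HIDDEN * head_dim * num_kv_heads
    # Assumed SwiGLU layout: gate, up, and down projections.
    swiglu_ffn = 3 * HIDDEN * FFN_HIDDEN
    return query_and_output + key_and_value + swiglu_ffn


for kv_heads in (1, NUM_HEADS):
    total = NUM_LAYERS * layer_weights(kv_heads)
    print(f"{kv_heads:>2} KV heads: {total / 1e9:.2f}B weights in the encoder layers")
```

Under these assumptions the layers hold between about 1.14 and 1.36 billion weights, which is in the range of the 1.3 billion parameters stated for the model. The exact number of key/value heads should be read from `config.json`.
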

### Compute Infrastructure

[TSUBAME 4](https://www.t4.gsic.titech.ac.jp/en/hardware)

This model is based on results obtained from the TSUBAME deep-learning mini-camp.

#### Software

The model was trained using [Megatron-LM](https://github.com/NVIDIA/Megatron-LM).

## More Information

https://note.com/retrieva/n/n715bea2c2cd1 (in Japanese)

## Model Card Authors

Satoru Katsumata, Daisuke Kimura, Jiro Nishitoba

## Model Card Contact

[email protected]