metadata

library_name: transformers
license: apache-2.0
language:
  - ja
  - en

Retrieva BERT Model

RetrievaBERT is a Transformer encoder pre-trained with Megatron-LM. It is designed for use with Japanese text.

Model Details

Model Description

RetrievaBERT is a Transformer encoder pre-trained with Megatron-LM. It is designed for use with Japanese text.

This model offers several advanced features compared to traditional BERT models:

PreNorm: Improved stability during training.
SwiGLU: Enhanced activation function for better performance.
Grouped-Query Attention (Multi-Query Attention): Efficient attention mechanism.
Max Sequence Length: 2048 tokens, allowing for longer context.
Parameters: 1.3 billion parameters.
Pre-training Objective: Masked Language Modeling (MLM) only; Next Sentence Prediction (NSP) is not used.
Token Type IDs: Not used in this model.
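As a rough illustration of two of the features listed above, the following sketch implements a SwiGLU gate and a PreNorm residual step in plain Python on small lists. The function names, the toy identity weights, and the layer normalization without learned scale or bias are assumptions made for illustration; none of this is taken from the RetrievaBERT implementation.

```python
import math

def silu(x: float) -> float:
    """SiLU (swish) activation: x * sigmoid(x)."""
    return x / (1.0 + math.exp(-x))

def matvec(weights, vec):
    """Multiply a matrix (given as a list of rows) by a vector."""
    return [sum(w * v for w, v in zip(row, vec)) for row in weights]

def swiglu(x, w_gate, w_up):
    """SwiGLU: SiLU(W_gate x) multiplied element-wise by (W_up x)."""
    gate = matvec(w_gate, x)
    up = matvec(w_up, x)
    return [silu(g) * u for g, u in zip(gate, up)]

def layer_norm(x, eps=1e-5):
    """Normalize to zero mean and unit variance (no learned scale or bias)."""
    mean = sum(x) / len(x)
    var = sum((v - mean) ** 2 for v in x) / len(x)
    return [(v - mean) / math.sqrt(var + eps) for v in x]

def pre_norm_block(x, sublayer):
    """PreNorm: normalize the input, apply the sublayer, then add the residual."""
    return [a + b for a, b in zip(x, sublayer(layer_norm(x)))]

def post_norm_block(x, sublayer):
    """PostNorm (original BERT): add the residual first, then normalize."""
    return layer_norm([a + b for a, b in zip(x, sublayer(x))])

# Toy example: a 2-dimensional input with identity projections (hypothetical weights).
identity = [[1.0, 0.0], [0.0, 1.0]]
x = [1.0, -1.0]
ffn = lambda h: swiglu(h, identity, identity)
print("PreNorm: ", pre_norm_block(x, ffn))
print("PostNorm:", post_norm_block(x, ffn))
```

In the PreNorm variant the residual path carries the input through unchanged, which is the property usually credited with more stable training in deep stacks.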

Model Sources

Developed by: Retrieva, Inc.
Model type: Based on MegatronBERT Architecture.
Language(s) (NLP): Primarily Japanese (optional support for English).
License: Apache 2.0

Uses

This model can be used as a Masked Language Model (MLM). However, it is primarily intended to be fine-tuned on downstream tasks. Depending on your use case, follow the appropriate section below.

Direct Use

This model is pre-trained using Masked Language Modeling. The mask token used is <MASK|LLM-jp>. Note that you need to set trust_remote_code to True because RetrievaBERT uses a custom model implementation.

Example code for direct use:

from transformers import AutoModelForMaskedLM, AutoTokenizer, pipeline

# trust_remote_code=True is required because the model uses a custom implementation.
model_id = "retrieva-jp/bert-1.3b"
model = AutoModelForMaskedLM.from_pretrained(model_id, trust_remote_code=True)
tokenizer = AutoTokenizer.from_pretrained(model_id, trust_remote_code=True)
pipe = pipeline("fill-mask", model=model, tokenizer=tokenizer)

# "Hello! My name is <MASK|LLM-jp>!" with the model's mask token in place of the name.
text = "こんにちは！私の名前は<MASK|LLM-jp>です！"
print(pipe(text))

Downstream Use

RetrievaBERT is compatible with Hugging Face's AutoModels. To fine-tune RetrievaBERT for your specific task, use the corresponding AutoModel class. For detailed configuration, refer to the config.json file.
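As one possible starting point for fine-tuning, the sketch below loads the checkpoint with a sequence-classification head through the corresponding AutoModel class. It assumes that the repository's custom code supports AutoModelForSequenceClassification; the helper names load_classifier and encode, the default max_length of 512, and the choice to drop token type IDs at tokenization time are illustrative assumptions, not part of the released code.

```python
MODEL_ID = "retrieva-jp/bert-1.3b"

def load_classifier(num_labels: int):
    """Load RetrievaBERT with a sequence-classification head, plus its tokenizer."""
    # Imported lazily so that defining this helper does not require transformers.
    from transformers import AutoModelForSequenceClassification, AutoTokenizer

    tokenizer = AutoTokenizer.from_pretrained(MODEL_ID, trust_remote_code=True)
    model = AutoModelForSequenceClassification.from_pretrained(
        MODEL_ID,
        num_labels=num_labels,   # number of classes in the downstream task
        trust_remote_code=True,  # needed for the custom model implementation
    )
    return model, tokenizer

def encode(tokenizer, texts, max_length: int = 512):
    """Tokenize a batch of texts into padded PyTorch tensors."""
    return tokenizer(
        texts,
        padding=True,
        truncation=True,
        max_length=max_length,         # hypothetical default; the model accepts up to 2048
        return_token_type_ids=False,   # token type IDs are not used by this model
        return_tensors="pt",
    )

# Example call (downloads the 1.3B-parameter checkpoint):
# model, tokenizer = load_classifier(num_labels=2)
# logits = model(**encode(tokenizer, ["この映画は面白かった。"])).logits
```

The returned model can then be passed to a standard training loop or to the Trainer API in the usual way.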

Training Details

Training Data

The Retrieva BERT model was pre-trained on the union of five datasets:

Japanese CommonCrawl Dataset by LLM-jp.
RefinedWeb.
Chinese Wikipedia dumped on 20240120.
Korean Wikipedia dumped on 20240120.
The Stack.

The model was trained on 180 billion tokens using the above datasets.

Training Procedure

The model was trained on 4 to 32 H100 GPUs with a batch size of 1,024. We adopted a curriculum learning approach similar to Sequence Length Warmup, training with the following sequence lengths and numbers of steps.

Sequence length 128: 31,000 steps.
Sequence length 256: 219,000 steps.
Sequence length 512: 192,000 steps.
Sequence length 2048: 12,000 steps.
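The schedule above can be turned into a rough token budget. The sketch below multiplies sequence length, step count, and batch size for each stage; it assumes that every sequence in a batch is filled to the stage's full length, so the result is an upper bound rather than an exact count.

```python
BATCH_SIZE = 1024  # sequences per step, as stated above

# (sequence length, number of steps) for each curriculum stage
STAGES = [(128, 31_000), (256, 219_000), (512, 192_000), (2048, 12_000)]

def tokens_per_stage(seq_len: int, steps: int, batch_size: int = BATCH_SIZE) -> int:
    """Upper bound on tokens seen in one stage (assumes full-length sequences)."""
    return seq_len * steps * batch_size

total_steps = sum(steps for _, steps in STAGES)
total_tokens = sum(tokens_per_stage(length, steps) for length, steps in STAGES)

for length, steps in STAGES:
    print(f"seq_len={length:>4}: {tokens_per_stage(length, steps) / 1e9:6.1f}B tokens")
print(f"total steps:  {total_steps:,}")
print(f"total tokens: {total_tokens / 1e9:.1f}B")
```

Under this assumption the four stages sum to 454,000 steps and about 187 billion tokens, which is in the same range as the 180 billion tokens reported for pre-training.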

Training Hyperparameters

The model was trained with the following hyperparameters.

Learning rate: 1.5e-4
Learning rate decay style: Linear
Learning rate warmup fraction: 0.01
Minimum learning rate: 1e-6
Floating-point format: BF16
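To make the shape of this schedule concrete, the sketch below computes the learning rate at a given step with linear warmup followed by linear decay. Several details are assumptions: the total of 454,000 steps is the sum of the curriculum stages, the warmup is taken to start from zero, and a single schedule is taken to span all stages. The actual Megatron-LM configuration may differ on each point.

```python
MAX_LR = 1.5e-4
MIN_LR = 1e-6
WARMUP_FRACTION = 0.01
TOTAL_STEPS = 454_000  # assumption: sum of the steps of all curriculum stages

def learning_rate(step: int, total_steps: int = TOTAL_STEPS) -> float:
    """Linear warmup from 0 to MAX_LR, then linear decay down to MIN_LR."""
    warmup_steps = round(WARMUP_FRACTION * total_steps)
    if step < warmup_steps:
        # Warmup phase: scale linearly with the step index.
        return MAX_LR * step / warmup_steps
    if step >= total_steps:
        # After the schedule ends, stay at the floor.
        return MIN_LR
    # Decay phase: interpolate linearly between MAX_LR and MIN_LR.
    progress = (step - warmup_steps) / (total_steps - warmup_steps)
    return MIN_LR + (1.0 - progress) * (MAX_LR - MIN_LR)

for step in (0, 2_270, 4_540, 229_270, 454_000):
    print(f"step {step:>7,}: lr = {learning_rate(step):.3e}")
```

With these assumptions the warmup covers the first 4,540 steps, after which the rate falls steadily toward the minimum.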

Evaluation

We fine-tuned the following models and evaluated them on the JGLUE development set. We adjusted the learning rate and training epochs for each model and task in accordance with the JGLUE paper.

Model	MARC-ja/acc	JSTS/pearson	JSTS/spearman	JNLI/acc	JSQuAD/EM	JSQuAD/F1	JComQA/acc
tohoku-nlp/bert-base-japanese-v3	0.957	0.914	0.876	0.906	0.878	0.946	0.849
tohoku-nlp/bert-large-japanese-v2	0.959	0.916	0.877	0.901	0.884	0.951	0.867
ku-nlp/deberta-v3-base-japanese	0.958	0.925	0.890	0.902	0.925	0.910	0.882
retrieva-jp/bert-1.3b	0.952	0.916	0.877	0.896	0.916	0.879	0.815

Technical Specifications

Model Architectures

The Retrieva BERT model is based on BERT with the following hyperparameters:

Number of layers: 48
Hidden layer size: 1536
FFN hidden layer size: 4096
Number of attention heads: 24
Maximum length of position embeddings: 2048

As mentioned earlier, the main differences from the original BERT are:

PreNorm: Improved stability during training.
SwiGLU: Enhanced activation function for better performance.
Grouped-Query Attention (Multi-Query Attention): Efficient attention mechanism.
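The hyperparameters above, together with SwiGLU and Grouped-Query Attention, largely determine the parameter count. The sketch below estimates the weights in the Transformer layers only; it ignores embeddings, biases, and normalization parameters. The number of key/value heads is not stated in this section, so the values tried in the loop are hypothetical; the actual setting can be read from config.json.

```python
def estimate_layer_stack_params(num_layers, hidden, ffn_hidden, num_heads, num_kv_heads):
    """Approximate weight count of the Transformer layers (no embeddings, biases, norms)."""
    head_dim = hidden // num_heads
    kv_dim = num_kv_heads * head_dim
    # Query and output projections are hidden x hidden; key and value
    # projections shrink to hidden x kv_dim under grouped-query attention.
    attention = 2 * hidden * hidden + 2 * hidden * kv_dim
    # SwiGLU uses three matrices: gate, up, and down projections.
    feed_forward = 3 * hidden * ffn_hidden
    return num_layers * (attention + feed_forward)

# Hypothetical key/value head counts: 24 (standard multi-head), 8, and 1 (multi-query).
for kv_heads in (24, 8, 1):
    total = estimate_layer_stack_params(
        num_layers=48, hidden=1536, ffn_hidden=4096, num_heads=24, num_kv_heads=kv_heads
    )
    print(f"kv_heads={kv_heads:>2}: {total / 1e9:.2f}B parameters in the layer stack")
```

The token embedding matrix adds roughly vocabulary size times 1536 parameters on top of these figures.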

Compute Infrastructure

TSUBAME 4

This model is based on results obtained from the TSUBAME deep-learning mini-camp.

Software

The model was trained using Megatron-LM.

More Information

https://note.com/retrieva/n/n715bea2c2cd1 (in Japanese)

Model Card Authors

Satoru Katsumata, Daisuke Kimura, Jiro Nishitoba

Model Card Contact

[email protected]