ku-nlp/bart-base-japanese


Matttttttt committed on May 9, 2023

Commit f711924 (1 parent: 8e2a93e)

Update README.md

Files changed (1)

README.md +51 -0

@@ -1,3 +1,54 @@
 ---
 license: cc-by-sa-4.0
+language:
+- ja
+library_name: transformers
 ---
+# Model Card for Japanese BART V2 base
+## Model description
+This is a Japanese BART V2 base model pre-trained on Japanese Wikipedia.
+## How to use
+You can use this model as follows:
+```python
+from transformers import AutoTokenizer, AutoModelForMaskedLM
+tokenizer = AutoTokenizer.from_pretrained('ku-nlp/bart-v2-base-japanese')
+model = AutoModelForMaskedLM.from_pretrained('ku-nlp/bart-v2-base-japanese')
+sentence = '京都 大学 で 自然 言語 処理 を 専攻 する 。'  # input should be segmented into words by Juman++ in advance
+encoding = tokenizer(sentence, return_tensors='pt')
+...
+```
+You can fine-tune this model on downstream tasks.
+## Tokenization
+The input text should be segmented into words by [Juman++](https://github.com/ku-nlp/jumanpp) in advance. [Juman++ 2.0.0-rc3](https://github.com/ku-nlp/jumanpp/releases/tag/v2.0.0-rc3) was used for pre-training. Each word is tokenized into subwords by [sentencepiece](https://github.com/google/sentencepiece).
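The segmentation step described above can be sketched as follows. This is a minimal illustration, not the authors' preprocessing script: it assumes a Juman++ 2.x binary named `jumanpp` is available on `PATH` and that its `--segment` option prints the input as space-separated words. The helper name `segment_with_jumanpp` is hypothetical.

```python
import shutil
import subprocess


def segment_with_jumanpp(text: str, jumanpp_bin: str = "jumanpp") -> str:
    """Return `text` as space-separated words produced by Juman++.

    Assumption: `jumanpp_bin` is a Juman++ 2.x executable that accepts
    `--segment` and reads one sentence per line from standard input.
    """
    if shutil.which(jumanpp_bin) is None:
        # Fail early with a clear message instead of a low-level OSError.
        raise RuntimeError(f"Juman++ binary not found on PATH: {jumanpp_bin}")
    result = subprocess.run(
        [jumanpp_bin, "--segment"],
        input=text + "\n",
        capture_output=True,
        text=True,
        check=True,  # raise CalledProcessError if Juman++ exits with an error
    )
    return result.stdout.strip()
```

With Juman++ installed, the string returned by `segment_with_jumanpp('京都大学で自然言語処理を専攻する。')` can be passed to the tokenizer in place of the hand-segmented `sentence` shown under "How to use".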
+## Training data
+We used the following corpora for pre-training:
+- Japanese Wikipedia (18M sentences)
+## Training procedure
+We first segmented texts in the corpora into words using [Juman++](https://github.com/ku-nlp/jumanpp).
+Then, we built a sentencepiece model with 32000 tokens including words ([JumanDIC](https://github.com/ku-nlp/JumanDIC)) and subwords induced by the unigram language model of [sentencepiece](https://github.com/google/sentencepiece).
+We tokenized the segmented corpora into subwords using the sentencepiece model and trained the Japanese BART model using the [transformers](https://github.com/huggingface/transformers) library.
+Training took 2 weeks on 4 Tesla V100 GPUs.
+The following hyperparameters were used during pre-training:
+- distributed_type: multi-GPU
+- num_devices: 4
+- batch_size: 512
+- training_steps: 500,000
+- encoder-decoder layers: 6
+- hidden: 768
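To put the hyperparameters above in perspective, the short calculation below derives the overall training budget from them. It rests on two assumptions that the model card does not state: that `batch_size` is the global batch size summed over all devices, and that each training example corresponds to one corpus sentence. The function name `training_budget` is hypothetical.

```python
# Values copied from the hyperparameter list and the "Training data" section.
BATCH_SIZE = 512                # assumed to be the global batch size
NUM_DEVICES = 4
TRAINING_STEPS = 500_000
CORPUS_SENTENCES = 18_000_000   # Japanese Wikipedia (18M sentences)


def training_budget(batch_size, num_devices, training_steps, corpus_sentences):
    """Derive per-device batch size, total examples seen, and corpus passes."""
    per_device_batch = batch_size // num_devices
    total_examples = batch_size * training_steps
    # Assumption: one training example per corpus sentence.
    corpus_passes = total_examples / corpus_sentences
    return {
        "per_device_batch": per_device_batch,
        "total_examples": total_examples,
        "corpus_passes": corpus_passes,
    }


budget = training_budget(BATCH_SIZE, NUM_DEVICES, TRAINING_STEPS, CORPUS_SENTENCES)
print(budget)
```

Under these assumptions each GPU processes 128 examples per step, and the model sees 256 million examples in total, or roughly 14 passes over the corpus.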