cosmoquester
committed on
Commit • e89f23e
1 Parent(s): d77a2ff
feat: Add Model
- README.md +42 -0
- config.json +46 -0
- pytorch_model.bin +3 -0
- special_tokens_map.json +1 -0
- tf_model.h5 +3 -0
- tokenizer.json +0 -0
- tokenizer_config.json +1 -0
README.md
ADDED
@@ -0,0 +1,42 @@
---
language: ko
---

# Pretrained BART in Korean

This is a BART model pretrained on multiple Korean datasets.

I used multiple datasets to generalize the model to both colloquial and written text.

The training was supported by the [TPU Research Cloud](https://sites.research.google/trc/) program.

The script used to pre-train the model is [here](https://github.com/cosmoquester/transformers-bart-pretrain).

When you use the inference API, you must wrap the sentence with `[BOS]` and `[EOS]` as in the example below.

```
[BOS] 안녕하세요? 반가워요~~ [EOS]
```
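
The same wrapping applies when running the model locally. A minimal sketch with `transformers` (assuming the checkpoint is published as `cosmoquester/bart-ko-base`, which is not stated explicitly in this commit, and that the hosted tokenizer files load through `AutoTokenizer`):

```python
from transformers import AutoTokenizer, AutoModelForSeq2SeqLM

# Assumed repository id; substitute the actual model path if it differs.
MODEL_NAME = "cosmoquester/bart-ko-base"

tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME)
model = AutoModelForSeq2SeqLM.from_pretrained(MODEL_NAME)

# Wrap the sentence with [BOS] and [EOS] by hand, as described above;
# the tokenizer does not add these markers automatically.
text = "[BOS] 안녕하세요? 반가워요~~ [EOS]"
inputs = tokenizer(text, return_tensors="pt")

# BART is a seq2seq model, so generation reconstructs the (unmasked) input.
outputs = model.generate(**inputs, max_length=32)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```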

You can also test mask-filling performance using the `[MASK]` token like this.

```
[BOS] [MASK] 먹었어? [EOS]
```
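
A corresponding mask-filling sketch, under the same assumptions as above (assumed repository id), feeds the masked sentence to the encoder and lets generation fill in the `[MASK]` span:

```python
from transformers import AutoTokenizer, AutoModelForSeq2SeqLM

MODEL_NAME = "cosmoquester/bart-ko-base"  # assumed repository id, as above
tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME)
model = AutoModelForSeq2SeqLM.from_pretrained(MODEL_NAME)

# The decoder is expected to produce the sentence with [MASK] filled in.
masked = "[BOS] [MASK] 먹었어? [EOS]"
inputs = tokenizer(masked, return_tensors="pt")

outputs = model.generate(**inputs, num_beams=4, max_length=32)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```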

## Used Datasets

### [모두의 말뭉치 (Modu Corpus)](https://corpus.korean.go.kr/)
- 일상 대화 말뭉치 2020 (Everyday Conversation Corpus 2020)
- 구어 말뭉치 (Spoken Corpus)
- 문어 말뭉치 (Written Corpus)
- 신문 말뭉치 (Newspaper Corpus)

### AIhub
- [개방데이터 전문분야말뭉치 (Specialized-Field Corpus)](https://aihub.or.kr/aidata/30717)
- [개방데이터 한국어대화요약 (Korean Dialogue Summarization)](https://aihub.or.kr/aidata/30714)
- [개방데이터 감성 대화 말뭉치 (Emotional Dialogue Corpus)](https://aihub.or.kr/aidata/7978)
- [개방데이터 한국어 음성 (Korean Speech)](https://aihub.or.kr/aidata/105)
- [개방데이터 한국어 SNS (Korean SNS)](https://aihub.or.kr/aidata/30718)

### [세종 말뭉치 (Sejong Corpus)](https://ithub.korean.go.kr/)

config.json
ADDED
@@ -0,0 +1,46 @@
{
  "_name_or_path": "bart-ko-base",
  "activation_dropout": 0.1,
  "activation_function": "gelu",
  "architectures": [
    "BartForConditionalGeneration"
  ],
  "attention_dropout": 0.1,
  "bos_token_id": 2,
  "classifier_dropout": 0.0,
  "d_model": 768,
  "decoder_attention_heads": 12,
  "decoder_ffn_dim": 3072,
  "decoder_layerdrop": 0.0,
  "decoder_layers": 6,
  "decoder_start_token_id": 2,
  "dropout": 0.1,
  "encoder_attention_heads": 12,
  "encoder_ffn_dim": 3072,
  "encoder_layerdrop": 0.0,
  "encoder_layers": 6,
  "eos_token_id": 3,
  "forced_eos_token_id": 3,
  "gradient_checkpointing": false,
  "id2label": {
    "0": "LABEL_0",
    "1": "LABEL_1",
    "2": "LABEL_2"
  },
  "init_std": 0.02,
  "is_encoder_decoder": true,
  "label2id": {
    "LABEL_0": 0,
    "LABEL_1": 1,
    "LABEL_2": 2
  },
  "max_position_embeddings": 2048,
  "model_type": "bart",
  "num_hidden_layers": 6,
  "pad_token_id": 0,
  "scale_embedding": false,
  "torch_dtype": "float32",
  "transformers_version": "4.9.2",
  "use_cache": false,
  "vocab_size": 32000
}
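
The architecture described by this config (6 encoder and 6 decoder layers, `d_model` 768, a 32k vocabulary) can be instantiated directly with `transformers`; a minimal sketch using only library calls:

```python
from transformers import BartConfig, BartForConditionalGeneration

# Load the architecture definition shipped in this commit.
config = BartConfig.from_json_file("config.json")
print(config.encoder_layers, config.decoder_layers, config.d_model, config.vocab_size)
# -> 6 6 768 32000

# Building the model from the config alone gives randomly initialized weights;
# use from_pretrained on the repository to load the trained checkpoint instead.
model = BartForConditionalGeneration(config)
print(sum(p.numel() for p in model.parameters()))
```
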
pytorch_model.bin
ADDED
@@ -0,0 +1,3 @@
version https://git-lfs.github.com/spec/v1
oid sha256:8574cd1ecab495b2475b447728761a7f57930342f34d0a20854549a9acacd7fe
size 508087740
special_tokens_map.json
ADDED
@@ -0,0 +1 @@
{"bos_token": "[BOS]", "eos_token": "[EOS]", "unk_token": "[UNK]", "sep_token": "[SEP]", "pad_token": "[PAD]", "mask_token": "[MASK]"}
tf_model.h5
ADDED
@@ -0,0 +1,3 @@
version https://git-lfs.github.com/spec/v1
oid sha256:857a8d5a063f4d3329f3809b1597ae775332b504f8106e5b04c1bd7298edd238
size 508137096
tokenizer.json
ADDED
The diff for this file is too large to render.
See raw diff
tokenizer_config.json
ADDED
@@ -0,0 +1 @@
{"bos_token": "[BOS]", "eos_token": "[EOS]", "unk_token": "[UNK]", "sep_token": "[SEP]", "pad_token": "[PAD]", "mask_token": "[MASK]"}
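
As a quick sanity check, the special tokens declared in `special_tokens_map.json` and `tokenizer_config.json` above should be reflected on the loaded tokenizer; a small sketch, again assuming the `cosmoquester/bart-ko-base` repository id:

```python
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("cosmoquester/bart-ko-base")  # assumed id

# These should match the maps above: [BOS] [EOS] [UNK] [SEP] [PAD] [MASK]
print(tokenizer.bos_token, tokenizer.eos_token, tokenizer.unk_token,
      tokenizer.sep_token, tokenizer.pad_token, tokenizer.mask_token)

# config.json sets bos_token_id=2, eos_token_id=3, pad_token_id=0;
# the tokenizer ids should agree with those values.
print(tokenizer.bos_token_id, tokenizer.eos_token_id, tokenizer.pad_token_id)
```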