letrunglinh committed on
Commit
dfbfb0f
·
1 Parent(s): 2acb23a

Upload 6 files

README.md ADDED
@@ -0,0 +1,98 @@
+ ---
+ language:
+ - vi
+ - vn
+ - en
+ tags:
+ - question-answering
+ - pytorch
+ datasets:
+ - squad
+ license: cc-by-nc-4.0
+ pipeline_tag: question-answering
+ metrics:
+ - squad
+ widget:
+ - text: "Bình là chuyên gia về gì ?"
+   context: "Bình Nguyễn là một người đam mê với lĩnh vực xử lý ngôn ngữ tự nhiên . Anh nhận chứng chỉ Google Developer Expert năm 2020"
+ - text: "Bình được công nhận với danh hiệu gì ?"
+   context: "Bình Nguyễn là một người đam mê với lĩnh vực xử lý ngôn ngữ tự nhiên . Anh nhận chứng chỉ Google Developer Expert năm 2020"
+ ---
+ ## Model Description
+
+ - Language model: [XLM-RoBERTa](https://huggingface.co/transformers/model_doc/xlmroberta.html)
+ - Fine-tuned with: [MRCQuestionAnswering](https://github.com/nguyenvulebinh/extractive-qa-mrc)
+ - Languages: Vietnamese, English
+ - Downstream task: Extractive QA
+ - Datasets (English and Vietnamese combined):
+   - [SQuAD 2.0](https://rajpurkar.github.io/SQuAD-explorer/)
+   - [mailong25](https://github.com/mailong25/bert-vietnamese-question-answering/tree/master/dataset)
+   - [VLSP MRC 2021](https://vlsp.org.vn/vlsp2021/eval/mrc)
+   - [MultiLingual Question Answering](https://github.com/facebookresearch/MLQA)
+
+ This model is intended for question answering in Vietnamese, so the validation set is Vietnamese only (English also works well). The evaluation results below are on the VLSP MRC 2021 test sets. The model ranked first on the leaderboard.
+
+
+ | Model | EM | F1 |
+ | ------------- | ------------- | ------------- |
+ | [large](https://huggingface.co/nguyenvulebinh/vi-mrc-large) public_test_set | 85.847 | 83.826 |
+ | [large](https://huggingface.co/nguyenvulebinh/vi-mrc-large) private_test_set | 82.072 | 78.071 |
+
+ Public leaderboard | Private leaderboard
+ :-------------------------:|:-------------------------:
+ ![](https://i.ibb.co/tJX6V6T/public-leaderboard.jpg) | ![](https://i.ibb.co/nmsX2pG/private-leaderboard.jpg)
+
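
For readers unfamiliar with the EM and F1 columns, the sketch below shows how the two scores are commonly computed for a single prediction, assuming SQuAD-style definitions. It is a simplified illustration and not the VLSP evaluation script: the `normalize` helper is hypothetical, it only lowercases, strips ASCII punctuation, and splits on whitespace, and it skips the English article removal that the official SQuAD script performs.

```python
import string
from collections import Counter


def normalize(text):
    """Lowercase, drop ASCII punctuation, and collapse whitespace."""
    text = text.lower()
    text = "".join(ch for ch in text if ch not in string.punctuation)
    return " ".join(text.split())


def exact_match(prediction, truth):
    """1.0 if the normalized strings are identical, else 0.0."""
    return float(normalize(prediction) == normalize(truth))


def f1_score(prediction, truth):
    """Token-level F1 between the normalized prediction and reference."""
    pred_tokens = normalize(prediction).split()
    true_tokens = normalize(truth).split()
    if not pred_tokens or not true_tokens:
        # Both empty counts as a match; one empty counts as a miss.
        return float(pred_tokens == true_tokens)
    overlap = sum((Counter(pred_tokens) & Counter(true_tokens)).values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(pred_tokens)
    recall = overlap / len(true_tokens)
    return 2 * precision * recall / (precision + recall)


truth = "Google Developer Expert"
partial = "chứng chỉ Google Developer Expert"  # answer span that is too long

em_exact = exact_match("Google Developer Expert", truth)
em_partial = exact_match(partial, truth)
f1_partial = f1_score(partial, truth)

print(em_exact, em_partial, round(f1_partial, 3))  # 1.0 0.0 0.75
```

A span that overshoots the reference gets no EM credit but still earns partial F1 credit, here 3 shared tokens out of 5 predicted and 3 reference tokens.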
+ [MRCQuestionAnswering](https://github.com/nguyenvulebinh/extractive-qa-mrc) uses [XLM-RoBERTa](https://huggingface.co/transformers/model_doc/xlmroberta.html) as its pre-trained language model. By default, XLM-RoBERTa splits words into sub-words. In my implementation, I recombine the sub-word representations (after they are encoded by the BERT layer) into word representations using a sum strategy.
+
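
The sum strategy described above can be sketched in a few lines. This is a minimal illustration, not the repository's implementation, which presumably works on batched tensors: the function name `sum_subwords`, the `word_ids` layout (one word index per sub-word), and the toy vectors are all assumptions made for the example.

```python
def sum_subwords(subword_vectors, word_ids):
    """Sum the sub-word vectors that share a word index into one vector per word.

    subword_vectors: list of equal-length float lists, one per sub-word.
    word_ids: list of ints, word_ids[i] is the word that sub-word i belongs to.
    """
    n_words = max(word_ids) + 1
    dim = len(subword_vectors[0])
    word_vectors = [[0.0] * dim for _ in range(n_words)]
    for vector, word_index in zip(subword_vectors, word_ids):
        for j, value in enumerate(vector):
            word_vectors[word_index][j] += value
    return word_vectors


# Toy example (made-up values): three sub-word vectors, where the last two
# pieces belong to the same word.
subword_vectors = [[1.0, 2.0], [0.5, 0.5], [1.5, -0.5]]
word_ids = [0, 1, 1]

word_vectors = sum_subwords(subword_vectors, word_ids)
print(word_vectors)  # [[1.0, 2.0], [2.0, 0.0]]
```

Summing keeps one representation per word, so the answer span can be predicted over word positions rather than sub-word positions.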
+ ## Using the pre-trained model
+ [![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/drive/1Yqgdfaca7L94OyQVnq5iQq8wRTFvVZjv?usp=sharing)
+
+ - Hugging Face pipeline style (**NOT using the sum features strategy**).
+
+ ```python
+ from transformers import pipeline
+ # model_checkpoint = "nguyenvulebinh/vi-mrc-large"
+ model_checkpoint = "nguyenvulebinh/vi-mrc-base"
+ nlp = pipeline('question-answering', model=model_checkpoint,
+                tokenizer=model_checkpoint)
+ QA_input = {
+     'question': "Bình là chuyên gia về gì ?",
+     'context': "Bình Nguyễn là một người đam mê với lĩnh vực xử lý ngôn ngữ tự nhiên . Anh nhận chứng chỉ Google Developer Expert năm 2020"
+ }
+ res = nlp(QA_input)
+ print('pipeline: {}'.format(res))
+ #{'score': 0.5782045125961304, 'start': 45, 'end': 68, 'answer': 'xử lý ngôn ngữ tự nhiên'}
+ ```
+
+ - More accurate inference process ([**using the sum features strategy**](https://github.com/nguyenvulebinh/extractive-qa-mrc))
+
+ ```python
+ from infer import tokenize_function, data_collator, extract_answer  # modules from the extractive-qa-mrc repository
+ from model.mrc_model import MRCQuestionAnswering
+ from transformers import AutoTokenizer
+
+ model_checkpoint = "nguyenvulebinh/vi-mrc-large"
+ #model_checkpoint = "nguyenvulebinh/vi-mrc-base"
+ tokenizer = AutoTokenizer.from_pretrained(model_checkpoint)
+ model = MRCQuestionAnswering.from_pretrained(model_checkpoint)
+
+ QA_input = {
+     'question': "Bình được công nhận với danh hiệu gì ?",
+     'context': "Bình Nguyễn là một người đam mê với lĩnh vực xử lý ngôn ngữ tự nhiên . Anh nhận chứng chỉ Google Developer Expert năm 2020"
+ }
+
+ inputs = [tokenize_function(**QA_input)]  # ** passes the values; assumes the function takes `question` and `context` keyword arguments
+ inputs_ids = data_collator(inputs)
+ outputs = model(**inputs_ids)
+ answer = extract_answer(inputs, outputs, tokenizer)
+
+ print(answer)
+ # answer: Google Developer Expert. Score start: 0.9926977753639221, Score end: 0.9909810423851013
+ ```
+
+ ## About
+
+ *Built by Binh Nguyen*
+ [![Follow](https://img.shields.io/twitter/follow/nguyenvulebinh?style=social)](https://twitter.com/intent/follow?screen_name=nguyenvulebinh)
+ For more details, visit the project repository.
+ [![GitHub stars](https://img.shields.io/github/stars/nguyenvulebinh/extractive-qa-mrc?style=social)](https://github.com/nguyenvulebinh/extractive-qa-mrc)
config.json ADDED
@@ -0,0 +1,27 @@
+ {
+   "_name_or_path": "xlm-roberta-large",
+   "architectures": [
+     "XLMRobertaForQuestionAnswering"
+   ],
+   "attention_probs_dropout_prob": 0.1,
+   "bos_token_id": 0,
+   "eos_token_id": 2,
+   "gradient_checkpointing": false,
+   "hidden_act": "gelu",
+   "hidden_dropout_prob": 0.1,
+   "hidden_size": 1024,
+   "initializer_range": 0.02,
+   "intermediate_size": 4096,
+   "layer_norm_eps": 1e-05,
+   "max_position_embeddings": 514,
+   "model_type": "roberta",
+   "num_attention_heads": 16,
+   "num_hidden_layers": 24,
+   "output_past": true,
+   "pad_token_id": 1,
+   "position_embedding_type": "absolute",
+   "transformers_version": "4.8.2",
+   "type_vocab_size": 1,
+   "use_cache": true,
+   "vocab_size": 250002
+ }
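
A few quantities follow directly from the config values above. The sketch below uses only Python's `json` module and a hand-copied subset of the fields to derive the per-head dimension, the usable sequence length, and the size of the word-embedding matrix. The sequence-length calculation assumes the RoBERTa convention that position ids start at `pad_token_id + 1`, which would explain why `max_position_embeddings` is 514 while the tokenizer's `model_max_length` is 512.

```python
import json

# Subset of the config.json fields shown above, copied by hand.
config = json.loads("""
{
  "hidden_size": 1024,
  "num_attention_heads": 16,
  "num_hidden_layers": 24,
  "max_position_embeddings": 514,
  "pad_token_id": 1,
  "vocab_size": 250002
}
""")

# Each attention head works on an equal slice of the hidden state.
head_dim = config["hidden_size"] // config["num_attention_heads"]

# Assumption: positions 0..pad_token_id are reserved, so real tokens
# start at position pad_token_id + 1.
usable_positions = config["max_position_embeddings"] - (config["pad_token_id"] + 1)

# Number of weights in the word-embedding matrix alone.
word_embedding_params = config["vocab_size"] * config["hidden_size"]

print(head_dim)               # 64
print(usable_positions)       # 512
print(word_embedding_params)  # 256002048
```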
gitattributes.txt ADDED
@@ -0,0 +1,17 @@
+ *.bin.* filter=lfs diff=lfs merge=lfs -text
+ *.lfs.* filter=lfs diff=lfs merge=lfs -text
+ *.bin filter=lfs diff=lfs merge=lfs -text
+ *.h5 filter=lfs diff=lfs merge=lfs -text
+ *.tflite filter=lfs diff=lfs merge=lfs -text
+ *.tar.gz filter=lfs diff=lfs merge=lfs -text
+ *.ot filter=lfs diff=lfs merge=lfs -text
+ *.onnx filter=lfs diff=lfs merge=lfs -text
+ *.arrow filter=lfs diff=lfs merge=lfs -text
+ *.ftz filter=lfs diff=lfs merge=lfs -text
+ *.joblib filter=lfs diff=lfs merge=lfs -text
+ *.model filter=lfs diff=lfs merge=lfs -text
+ *.msgpack filter=lfs diff=lfs merge=lfs -text
+ *.pb filter=lfs diff=lfs merge=lfs -text
+ *.pt filter=lfs diff=lfs merge=lfs -text
+ *.pth filter=lfs diff=lfs merge=lfs -text
+ *tfevents* filter=lfs diff=lfs merge=lfs -text
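
To see which file names these patterns would route through Git LFS if they were placed in a `.gitattributes` file, the sketch below approximates the matching with Python's `fnmatch.fnmatchcase`. This is an approximation under stated assumptions: real gitattributes matching has extra rules for patterns that contain slashes, which none of the patterns above do, and the helper name `tracked_by_lfs` is made up for the example.

```python
from fnmatch import fnmatchcase

# Patterns copied from the attributes file above.
LFS_PATTERNS = [
    "*.bin.*", "*.lfs.*", "*.bin", "*.h5", "*.tflite", "*.tar.gz",
    "*.ot", "*.onnx", "*.arrow", "*.ftz", "*.joblib", "*.model",
    "*.msgpack", "*.pb", "*.pt", "*.pth", "*tfevents*",
]


def tracked_by_lfs(filename):
    """Return True if the file's base name matches any LFS pattern."""
    return any(fnmatchcase(filename, pattern) for pattern in LFS_PATTERNS)


for name in ["pytorch_model.bin", "config.json", "tokenizer.json",
             "sentencepiece.bpe.model", "events.out.tfevents.123"]:
    print(name, tracked_by_lfs(name))
```

Under this approximation, weight files such as `pytorch_model.bin` match, while the JSON files in this commit do not.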
special_tokens_map.json ADDED
@@ -0,0 +1 @@
+ {"bos_token": "<s>", "eos_token": "</s>", "unk_token": "<unk>", "sep_token": "</s>", "pad_token": "<pad>", "cls_token": "<s>", "mask_token": {"content": "<mask>", "single_word": false, "lstrip": true, "rstrip": false, "normalized": false}}
tokenizer.json ADDED
The diff for this file is too large to render. See raw diff
 
tokenizer_config.json ADDED
@@ -0,0 +1 @@
+ {"bos_token": "<s>", "eos_token": "</s>", "sep_token": "</s>", "cls_token": "<s>", "unk_token": "<unk>", "pad_token": "<pad>", "mask_token": {"content": "<mask>", "single_word": false, "lstrip": true, "rstrip": false, "normalized": true, "__type": "AddedToken"}, "model_max_length": 512, "special_tokens_map_file": null, "name_or_path": "xlm-roberta-base", "tokenizer_class": "XLMRobertaTokenizer"}