VoVanPhuc committed on
Commit
fa3b072
1 Parent(s): 4f958d7

update pretrain

Files changed (2)
  1. README.md +33 -18
  2. pytorch_model.bin +1 -1
README.md CHANGED
@@ -3,12 +3,11 @@
3
  1. [Introduction](#introduction)
4
  2. [Pretrain model](#models)
5
  3. [Using SimeCSE_Vietnamese with `sentences-transformers`](#sentences-transformers)
6
- \t- [Installation](#install1)
7
- \t- [Example usage](#usage1)
8
  4. [Using SimeCSE_Vietnamese with `transformers`](#transformers)
9
- \t- [Installation](#install2)
10
- \t- [Example usage](#usage2)
11
-
12
  # <a name="introduction"></a> SimeCSE_Vietnamese: Simple Contrastive Learning of Sentence Embeddings with Vietnamese
13
 
14
  Pre-trained SimeCSE_Vietnamese models are the state-of-the-art of Sentence Embeddings with Vietnamese :
@@ -20,7 +19,7 @@ Pre-trained SimeCSE_Vietnamese models are the state-of-the-art of Sentence Embed
20
  ## Pre-trained models <a name="models"></a>
21
 
22
 
23
- Model | #params | Arch.\t
24
  ---|---|---
25
  [`VoVanPhuc/sup-SimCSE-VietNamese-phobert-base`](https://huggingface.co/VoVanPhuc/sup-SimCSE-VietNamese-phobert-base) | 135M | base
26
  [`VoVanPhuc/unsup-SimCSE-VietNamese-phobert-base`](https://huggingface.co/VoVanPhuc/unsup-SimCSE-VietNamese-phobert-base) | 135M | base
@@ -31,13 +30,19 @@ Model | #params | Arch.\t
31
 
32
  ### Installation <a name="install1"></a>
33
  - Install `sentence-transformers`:
34
- \t- `pip install -U sentence-transformers`
35
- \t
 
 
 
 
36
 
37
  ### Example usage <a name="usage1"></a>
38
 
39
  ```python
40
  from sentence_transformers import SentenceTransformer
 
 
41
  model = SentenceTransformer('VoVanPhuc/sup-SimCSE-VietNamese-phobert-base')
42
 
43
  sentences = ['Kẻ đánh bom đinh tồi tệ nhất nước Anh.',
@@ -52,6 +57,7 @@ sentences = ['Kẻ đánh bom đinh tồi tệ nhất nước Anh.',
52
  'Bắn chết người trong cuộc rượt đuổi trên sông.'
53
  ]
54
 
 
55
  embeddings = model.encode(sentences)
56
  ```
57
 
@@ -59,16 +65,22 @@ embeddings = model.encode(sentences)
59
 
60
  ### Installation <a name="install2"></a>
61
  - Install `transformers`:
62
- \t- `pip install -U transformers`
63
- \t
 
 
 
 
 
64
 
65
  ### Example usage <a name="usage2"></a>
66
 
67
  ```python
68
  import torch
69
  from transformers import AutoModel, AutoTokenizer
 
70
 
71
- tokenizer = AutoTokenizer.from_pretrained("VoVanPhuc/sup-SimCSE-VietNamese-phobert-base")
72
  model = AutoModel.from_pretrained("VoVanPhuc/sup-SimCSE-VietNamese-phobert-base")
73
 
74
  sentences = ['Kẻ đánh bom đinh tồi tệ nhất nước Anh.',
@@ -82,7 +94,10 @@ sentences = ['Kẻ đánh bom đinh tồi tệ nhất nước Anh.',
82
  'Chủ ki-ốt bị đâm chết trong chợ đầu mối lớn nhất Thanh Hoá.',
83
  'Bắn chết người trong cuộc rượt đuổi trên sông.'
84
  ]
85
- inputs = tokenizer(sentences, padding=True, truncation=True, return_tensors="pt")
 
 
 
86
 
87
  with torch.no_grad():
88
      embeddings = model(**inputs, output_hidden_states=True, return_dict=True).pooler_output
@@ -94,12 +109,12 @@ with torch.no_grad():
94
  ## Citation
95
 
96
 
97
- \t@article{gao2021simcse,
98
- \t title={{SimCSE}: Simple Contrastive Learning of Sentence Embeddings},
99
- \t author={Gao, Tianyu and Yao, Xingcheng and Chen, Danqi},
100
- \t journal={arXiv preprint arXiv:2104.08821},
101
- \t year={2021}
102
- \t}
103
 
104
  @inproceedings{phobert,
105
  title = {{PhoBERT: Pre-trained language models for Vietnamese}},
 
3
  1. [Introduction](#introduction)
4
  2. [Pretrain model](#models)
5
  3. [Using SimeCSE_Vietnamese with `sentences-transformers`](#sentences-transformers)
6
+ - [Installation](#install1)
7
+ - [Example usage](#usage1)
8
  4. [Using SimeCSE_Vietnamese with `transformers`](#transformers)
9
+ - [Installation](#install2)
10
+ - [Example usage](#usage2)
 
11
  # <a name="introduction"></a> SimeCSE_Vietnamese: Simple Contrastive Learning of Sentence Embeddings with Vietnamese
12
 
13
  Pre-trained SimeCSE_Vietnamese models are the state-of-the-art of Sentence Embeddings with Vietnamese :
 
19
  ## Pre-trained models <a name="models"></a>
20
 
21
 
22
+ Model | #params | Arch.
23
  ---|---|---
24
  [`VoVanPhuc/sup-SimCSE-VietNamese-phobert-base`](https://huggingface.co/VoVanPhuc/sup-SimCSE-VietNamese-phobert-base) | 135M | base
25
  [`VoVanPhuc/unsup-SimCSE-VietNamese-phobert-base`](https://huggingface.co/VoVanPhuc/unsup-SimCSE-VietNamese-phobert-base) | 135M | base
 
30
 
31
  ### Installation <a name="install1"></a>
32
  - Install `sentence-transformers`:
33
+
34
+ - `pip install -U sentence-transformers`
35
+
36
+ - Install `pyvi` to word segment:
37
+
38
+ - `pip install pyvi`
39
 
40
  ### Example usage <a name="usage1"></a>
41
 
42
  ```python
43
  from sentence_transformers import SentenceTransformer
44
+ from pyvi.ViTokenizer import tokenize
45
+
46
  model = SentenceTransformer('VoVanPhuc/sup-SimCSE-VietNamese-phobert-base')
47
 
48
  sentences = ['Kẻ đánh bom đinh tồi tệ nhất nước Anh.',
 
57
  'Bắn chết người trong cuộc rượt đuổi trên sông.'
58
  ]
59
 
60
+ sentences = [tokenize(sentence) for sentence in sentences]
61
  embeddings = model.encode(sentences)
62
  ```
63
 
 
65
 
66
  ### Installation <a name="install2"></a>
67
  - Install `transformers`:
68
+
69
+ - `pip install -U transformers`
70
+
71
+
72
+ - Install `pyvi` to word segment:
73
+
74
+ - `pip install pyvi`
75
 
76
  ### Example usage <a name="usage2"></a>
77
 
78
  ```python
79
  import torch
80
  from transformers import AutoModel, AutoTokenizer
81
+ from pyvi.ViTokenizer import tokenize
82
 
83
+ PhobertTokenizer = AutoTokenizer.from_pretrained("VoVanPhuc/sup-SimCSE-VietNamese-phobert-base")
84
  model = AutoModel.from_pretrained("VoVanPhuc/sup-SimCSE-VietNamese-phobert-base")
85
 
86
  sentences = ['Kẻ đánh bom đinh tồi tệ nhất nước Anh.',
 
94
  'Chủ ki-ốt bị đâm chết trong chợ đầu mối lớn nhất Thanh Hoá.',
95
  'Bắn chết người trong cuộc rượt đuổi trên sông.'
96
  ]
97
+
98
+ sentences = [tokenize(sentence) for sentence in sentences]
99
+
100
+ inputs = PhobertTokenizer(sentences, padding=True, truncation=True, return_tensors="pt")
101
 
102
  with torch.no_grad():
103
      embeddings = model(**inputs, output_hidden_states=True, return_dict=True).pooler_output
 
109
  ## Citation
110
 
111
 
112
+ @article{gao2021simcse,
113
+ title={{SimCSE}: Simple Contrastive Learning of Sentence Embeddings},
114
+ author={Gao, Tianyu and Yao, Xingcheng and Chen, Danqi},
115
+ journal={arXiv preprint arXiv:2104.08821},
116
+ year={2021}
117
+ }
118
 
119
  @inproceedings{phobert,
120
  title = {{PhoBERT: Pre-trained language models for Vietnamese}},
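The README change above adds a word-segmentation pass with `pyvi` before the sentences reach either `model.encode` or the PhoBERT tokenizer. The following is a minimal sketch of that step in isolation, assuming `pyvi` is installed. The helper name `segment_sentences` and its `segmenter` parameter are hypothetical and are not part of the README; they exist only so the step can be exercised without the library.

```python
from typing import Callable, List, Optional


def segment_sentences(
    sentences: List[str],
    segmenter: Optional[Callable[[str], str]] = None,
) -> List[str]:
    """Word-segment each sentence before it reaches the model tokenizer.

    `segmenter` is a hypothetical hook (an assumption, not in the README).
    When it is omitted, pyvi's ViTokenizer.tokenize is imported lazily and
    used, matching the updated usage examples.
    """
    if segmenter is None:
        # Lazy import so this helper can be loaded without pyvi installed.
        from pyvi.ViTokenizer import tokenize as segmenter
    # Apply the segmenter to every sentence, preserving order.
    return [segmenter(sentence) for sentence in sentences]
```

The returned list can be passed on in place of the raw sentences, for example `model.encode(segment_sentences(sentences))`.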
pytorch_model.bin CHANGED
@@ -1,3 +1,3 @@
1
  version https://git-lfs.github.com/spec/v1
2
- oid sha256:97ee28100b721e08704cc25ab06d5c6f8afb27d2e713742f1452489586f81905
3
  size 542443775
 
1
  version https://git-lfs.github.com/spec/v1
2
+ oid sha256:eaf34ee269f687df927f23d7c51ae1ef672c9e3efc7b1e2249fef3035f70b70f
3
  size 542443775
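The pointer diff above changes only the `oid`, so the file contents changed while the size stayed at 542443775 bytes. The following is a minimal sketch for checking a downloaded `pytorch_model.bin` against such a pointer, assuming the pointer text follows the three-line `version` / `oid` / `size` layout shown. The function names `parse_lfs_pointer` and `matches_pointer` are hypothetical, as is any local path passed to them.

```python
import hashlib
from pathlib import Path
from typing import Dict


def parse_lfs_pointer(text: str) -> Dict[str, str]:
    """Parse the 'key value' lines of a Git LFS pointer into a dict."""
    fields: Dict[str, str] = {}
    for line in text.strip().splitlines():
        # Each line is a key, a single space, then the value.
        key, _, value = line.strip().partition(" ")
        fields[key] = value
    return fields


def matches_pointer(path: Path, pointer: Dict[str, str]) -> bool:
    """Return True if the file's size and SHA-256 digest match the pointer."""
    # Cheap check first: compare the byte size before hashing.
    if path.stat().st_size != int(pointer["size"]):
        return False
    digest = hashlib.sha256()
    with path.open("rb") as handle:
        # Hash in 1 MiB chunks so large weight files are not read at once.
        for chunk in iter(lambda: handle.read(1 << 20), b""):
            digest.update(chunk)
    algorithm, _, expected = pointer["oid"].partition(":")
    return algorithm == "sha256" and digest.hexdigest() == expected
```

Under these assumptions, calling `matches_pointer` with the new pointer should succeed for the updated weights and fail for a copy downloaded before this commit, since the two `oid` values differ.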