ccdv committed
Commit 9808534
1 Parent(s): b87718d

update files and readme

Files changed (3):
  1. README.md +31 -23
  2. config.json +1 -1
  3. modeling_lsg_camembert.py +9 -16
README.md CHANGED
@@ -2,10 +2,10 @@
 language: fr
 tags:
 - long context
+pipeline_tag: fill-mask
 ---
 
 # LSG model
-
 **Transformers >= 4.18.0**\
 **This model relies on a custom modeling file, you need to add trust_remote_code=True**\
 **See [\#13467](https://github.com/huggingface/transformers/pull/13467)**
@@ -16,16 +16,14 @@ tags:
 * [Tasks](#tasks)
 * [Training global tokens](#training-global-tokens)
 
-This model can handle long sequences faster and more efficiently than Longformer or BigBird (from Transformers) and relies on Local + Sparse + Global attention (LSG).
-
+This model is adapted from [CamemBERT-base](https://huggingface.co/camembert-base) without additional pretraining yet. It uses the same number of parameters/layers and the same tokenizer.
 
-The model requires sequences whose length is a multiple of the block size. The model is "adaptive" and automatically pads the sequences if needed (adaptive=True in config). It is however recommended to truncate the inputs with the tokenizer (truncation=True) and optionally to pad them to a multiple of the block size (pad_to_multiple_of=...). \
 
+This model can handle long sequences faster and more efficiently than Longformer or BigBird (from Transformers) and relies on Local + Sparse + Global attention (LSG).
 
-The model is trained starting from a CamemBERT-base checkpoint on 8 GB of data (French OSCAR) using the same number of parameters/layers and the same tokenizer.
-
+The model requires sequences whose length is a multiple of the block size. The model is "adaptive" and automatically pads the sequences if needed (adaptive=True in config). It is however recommended to truncate the inputs with the tokenizer (truncation=True) and optionally to pad them to a multiple of the block size (pad_to_multiple_of=...). \
 
-Supports encoder-decoder and causal masking, but this has not been tested extensively.\
+Supports encoder-decoder, but this has not been tested extensively.\
 Implemented in PyTorch.
 
 ![attn](attn.png)
@@ -36,8 +34,8 @@ The model relies on a custom modeling file, you need to add trust_remote_code=Tr
 ```python:
 from transformers import AutoModel, AutoTokenizer
 
-model = AutoModel.from_pretrained("ccdv/lsg-base-4096-fr", trust_remote_code=True)
-tokenizer = AutoTokenizer.from_pretrained("ccdv/lsg-base-4096-fr")
+model = AutoModel.from_pretrained("ccdv/lsg-camembert-base-4096", trust_remote_code=True)
+tokenizer = AutoTokenizer.from_pretrained("ccdv/lsg-camembert-base-4096")
 ```
 
 ## Parameters
@@ -54,7 +52,7 @@ Default parameters work well in practice. If you are short on memory, reduce blo
 ```python:
 from transformers import AutoModel
 
-model = AutoModel.from_pretrained("ccdv/lsg-base-4096-fr",
+model = AutoModel.from_pretrained("ccdv/lsg-camembert-base-4096",
     trust_remote_code=True,
     num_global_tokens=16,
     block_size=64,
@@ -66,7 +64,6 @@ model = AutoModel.from_pretrained("ccdv/lsg-base-4096-fr",
 )
 ```
 
-
 ## Sparse selection type
 
 There are 5 different sparse selection patterns. The best type is task dependent. \
@@ -92,21 +89,19 @@ Note that for sequences with length < 2*block_size, the type has no effect.
 * Each head will use block of tokens strided by sparsify_factor
 * Not recommended if sparsify_factor > num_heads
 
-
 ## Tasks
 Fill mask example:
 ```python:
 from transformers import FillMaskPipeline, AutoModelForMaskedLM, AutoTokenizer
 
-model = AutoModelForMaskedLM.from_pretrained("ccdv/lsg-base-4096-fr", trust_remote_code=True, use_auth_token=True)
-tokenizer = AutoTokenizer.from_pretrained("ccdv/lsg-base-4096-fr", use_auth_token=True)
+model = AutoModelForMaskedLM.from_pretrained("ccdv/lsg-camembert-base-4096", trust_remote_code=True)
+tokenizer = AutoTokenizer.from_pretrained("ccdv/lsg-camembert-base-4096")
 
-SENTENCES = ["Paris est la <mask> de la france.", "Le sens de la vie est <mask>."]
+SENTENCES = "Paris est la <mask> de la France."
 pipeline = FillMaskPipeline(model, tokenizer)
-output = pipeline(SENTENCES, top_k=1)
-
-output = [o[0]["sequence"] for o in output]
-> ['Paris est la capitale de la france.', 'Le sens de la vie est simple.']
+output = pipeline(SENTENCES)
+
+> 'Paris est la capitale de la France.'
 ```
 
 
@@ -114,11 +109,11 @@ Classification example:
 ```python:
 from transformers import AutoModelForSequenceClassification, AutoTokenizer
 
-model = AutoModelForSequenceClassification.from_pretrained("ccdv/lsg-base-4096-fr",
+model = AutoModelForSequenceClassification.from_pretrained("ccdv/lsg-camembert-base-4096",
     trust_remote_code=True,
    pool_with_global=True, # pool with a global token instead of first token
 )
-tokenizer = AutoTokenizer.from_pretrained("ccdv/lsg-base-4096-fr", use_auth_token=True)
+tokenizer = AutoTokenizer.from_pretrained("ccdv/lsg-camembert-base-4096")
 
 SENTENCE = "This is a test for sequence classification. " * 300
 token_ids = tokenizer(
@@ -137,16 +132,29 @@ To train global tokens and the classification head only:
 ```python:
 from transformers import AutoModelForSequenceClassification, AutoTokenizer
 
-model = AutoModelForSequenceClassification.from_pretrained("ccdv/lsg-base-4096-fr",
+model = AutoModelForSequenceClassification.from_pretrained("ccdv/lsg-camembert-base-4096",
    trust_remote_code=True,
    pool_with_global=True, # pool with a global token instead of first token
    num_global_tokens=16
 )
-tokenizer = AutoTokenizer.from_pretrained("ccdv/lsg-base-4096-fr")
+tokenizer = AutoTokenizer.from_pretrained("ccdv/lsg-camembert-base-4096")
 
 for name, param in model.named_parameters():
     if "global_embeddings" not in name:
         param.requires_grad = False
     else:
         param.requires_grad = True
+```
+
+**CamemBERT**
+```
+@inproceedings{Martin_2020,
+  doi = {10.18653/v1/2020.acl-main.645},
+  url = {https://doi.org/10.18653%2Fv1%2F2020.acl-main.645},
+  year = 2020,
+  publisher = {Association for Computational Linguistics},
+  author = {Louis Martin and Benjamin Muller and Pedro Javier Ortiz Su{\'{a}}rez and Yoann Dupont and Laurent Romary and {\'{E}}ric de la Clergerie and Djam{\'{e}} Seddah and Beno{\^{\i}}t Sagot},
+  title = {{CamemBERT}: a Tasty French Language Model},
+  booktitle = {Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics}
+}
 ```
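The padding recommendation in the updated README can be made concrete with a minimal sketch. It assumes the `ccdv/lsg-camembert-base-4096` checkpoint named in the diff and a block size of 128; both are configurable, and since `adaptive=True` the model would also pad internally if this step were skipped.

```python
from transformers import AutoModel, AutoTokenizer

# Minimal sketch of the README recommendation: truncate with the tokenizer and
# pad to a multiple of the block size. Assumes the "ccdv/lsg-camembert-base-4096"
# checkpoint and a block size of 128 (both configurable).
BLOCK_SIZE = 128

tokenizer = AutoTokenizer.from_pretrained("ccdv/lsg-camembert-base-4096")
model = AutoModel.from_pretrained("ccdv/lsg-camembert-base-4096", trust_remote_code=True)

inputs = tokenizer(
    "Paris est la capitale de la France. " * 200,
    return_tensors="pt",
    truncation=True,
    max_length=4096,
    padding="longest",              # padding must be enabled for pad_to_multiple_of to apply
    pad_to_multiple_of=BLOCK_SIZE,  # optional: align the input length with the block size
)
outputs = model(**inputs)
print(outputs.last_hidden_state.shape)  # sequence length is a multiple of BLOCK_SIZE
```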
config.json CHANGED
@@ -1,5 +1,5 @@
 {
-  "_name_or_path": "ccdv/lsg-base-4096-fr",
+  "_name_or_path": "ccdv/lsg-camembert-base-4096",
   "adaptive": true,
   "architectures": [
     "LSGCamembertForMaskedLM"
modeling_lsg_camembert.py CHANGED
@@ -1032,33 +1032,26 @@ class LSGCamembertModel(LSGCamembertPreTrainedModel, RobertaModel):
             return_dict=return_dict
         )
 
-        context = encoder_outputs[0]
+        sequence_output = encoder_outputs[0]
         if self.pool_with_global:
-            context[:, self.num_global_tokens] = context[:, 0]
+            sequence_output[:, self.num_global_tokens] = sequence_output[:, 0]
 
         diff = t - t_
-        n, _, d = context.size()
-        context = context[..., self.num_global_tokens:, :]
+        n, _, d = sequence_output.size()
+        sequence_output = sequence_output[..., self.num_global_tokens:, :]
 
         # Adapt sequence to initial shape
         if diff < 0:
-            context = context[:, :t]
+            sequence_output = sequence_output[:, :t]
 
-        encoder_outputs.last_hidden_state = context
-        sequence_output = encoder_outputs[0]
         pooled_output = self.pooler(sequence_output) if self.pooler is not None else None
 
         if not return_dict:
             return (sequence_output, pooled_output) + encoder_outputs[1:]
-
-        return BaseModelOutputWithPoolingAndCrossAttentions(
-            last_hidden_state=sequence_output,
-            pooler_output=pooled_output,
-            past_key_values=encoder_outputs.past_key_values,
-            hidden_states=encoder_outputs.hidden_states,
-            attentions=encoder_outputs.attentions,
-            cross_attentions=encoder_outputs.cross_attentions,
-        )
+
+        encoder_outputs.last_hidden_state = sequence_output
+        encoder_outputs.pooler_output = pooled_output
+        return encoder_outputs
 
     def get_extended_attention_mask(self, attention_mask, input_shape, device=None):
 
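The refactor above renames `context` to `sequence_output` and returns the mutated `encoder_outputs` object instead of rebuilding a `BaseModelOutputWithPoolingAndCrossAttentions`; the pooling trick itself is unchanged. As a toy sketch (made-up shapes, not the model's actual forward pass): when `pool_with_global` is set, the hidden state of the first global token is copied into the slot that becomes index 0 once the global prefix is stripped, so downstream pooling on the "first token" uses a global token.

```python
import torch

# Toy sketch of the tensor manipulation in the diff above (shapes are made up).
batch, num_global_tokens, seq_len, hidden = 2, 4, 10, 8
sequence_output = torch.randn(batch, num_global_tokens + seq_len, hidden)

# pool_with_global: write global token 0 into the slot that becomes position 0
sequence_output[:, num_global_tokens] = sequence_output[:, 0]

# drop the global-token prefix
sequence_output = sequence_output[..., num_global_tokens:, :]

# trim any adaptive padding back to the original length t (a no-op here)
t = seq_len
sequence_output = sequence_output[:, :t]

print(sequence_output.shape)  # torch.Size([2, 10, 8])
```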