ccdv committed
Commit 9808534
1 Parent(s): b87718d

update files and readme

Files changed (3):
  1. README.md +31 -23
  2. config.json +1 -1
  3. modeling_lsg_camembert.py +9 -16
README.md CHANGED
@@ -2,10 +2,10 @@
 language: fr
 tags:
 - long context
+pipeline_tag: fill-mask
 ---
 
 # LSG model
-
 **Transformers >= 4.18.0**\
 **This model relies on a custom modeling file, you need to add trust_remote_code=True**\
 **See [\#13467](https://github.com/huggingface/transformers/pull/13467)**
@@ -16,16 +16,14 @@ tags:
 * [Tasks](#tasks)
 * [Training global tokens](#training-global-tokens)
 
-This model can handle long sequences faster and more efficiently than Longformer or BigBird (from Transformers) and relies on Local + Sparse + Global attention (LSG).
-
+This model is adapted from [CamemBERT-base](https://huggingface.co/camembert-base) without additional pretraining yet. It uses the same number of parameters/layers and the same tokenizer.
 
-The model requires sequences whose length is a multiple of the block size. The model is "adaptive" and automatically pads the sequences if needed (adaptive=True in config). It is however recommended to truncate the inputs with the tokenizer (truncation=True) and optionally to pad them to a multiple of the block size (pad_to_multiple_of=...). \
 
+This model can handle long sequences faster and more efficiently than Longformer or BigBird (from Transformers) and relies on Local + Sparse + Global attention (LSG).
 
-The model is trained starting from a CamemBERT-base checkpoint on 8 GB of data (French OSCAR) using the same number of parameters/layers and the same tokenizer.
-
+The model requires sequences whose length is a multiple of the block size. The model is "adaptive" and automatically pads the sequences if needed (adaptive=True in config). It is however recommended to truncate the inputs with the tokenizer (truncation=True) and optionally to pad them to a multiple of the block size (pad_to_multiple_of=...). \
 
-Supports encoder-decoder and causal masking, but this has not been tested extensively.\
+Supports encoder-decoder, but this has not been tested extensively.\
 Implemented in PyTorch.
 
 ![attn](attn.png)
@@ -36,8 +34,8 @@ The model relies on a custom modeling file, you need to add trust_remote_code=Tr
 ```python:
 from transformers import AutoModel, AutoTokenizer
 
-model = AutoModel.from_pretrained("ccdv/lsg-base-4096-fr", trust_remote_code=True)
-tokenizer = AutoTokenizer.from_pretrained("ccdv/lsg-base-4096-fr")
+model = AutoModel.from_pretrained("ccdv/lsg-camembert-base-4096", trust_remote_code=True)
+tokenizer = AutoTokenizer.from_pretrained("ccdv/lsg-camembert-base-4096")
 ```
 
 ## Parameters
@@ -54,7 +52,7 @@ Default parameters work well in practice. If you are short on memory, reduce blo
 ```python:
 from transformers import AutoModel
 
-model = AutoModel.from_pretrained("ccdv/lsg-base-4096-fr",
+model = AutoModel.from_pretrained("ccdv/lsg-camembert-base-4096",
     trust_remote_code=True,
     num_global_tokens=16,
     block_size=64,
@@ -66,7 +64,6 @@ model = AutoModel.from_pretrained("ccdv/lsg-base-4096-fr",
 )
 ```
 
-
 ## Sparse selection type
 
 There are 5 different sparse selection patterns. The best type is task dependent. \
@@ -92,21 +89,19 @@ Note that for sequences with length < 2*block_size, the type has no effect.
 * Each head will use block of tokens strided by sparsify_factor
 * Not recommended if sparsify_factor > num_heads
 
-
 ## Tasks
 Fill mask example:
 ```python:
 from transformers import FillMaskPipeline, AutoModelForMaskedLM, AutoTokenizer
 
-model = AutoModelForMaskedLM.from_pretrained("ccdv/lsg-base-4096-fr", trust_remote_code=True, use_auth_token=True)
-tokenizer = AutoTokenizer.from_pretrained("ccdv/lsg-base-4096-fr", use_auth_token=True)
+model = AutoModelForMaskedLM.from_pretrained("ccdv/lsg-camembert-base-4096", trust_remote_code=True)
+tokenizer = AutoTokenizer.from_pretrained("ccdv/lsg-camembert-base-4096")
 
-SENTENCES = ["Paris est la <mask> de la france.", "Le sens de la vie est <mask>."]
+SENTENCES = "Paris est la <mask> de la France."
 pipeline = FillMaskPipeline(model, tokenizer)
-output = pipeline(SENTENCES, top_k=1)
-
-output = [o[0]["sequence"] for o in output]
-> ['Paris est la capitale de la france.', 'Le sens de la vie est simple.']
+output = pipeline(SENTENCES)
+
+> 'Paris est la capitale de la France.'
 ```
 
 
@@ -114,11 +109,11 @@ Classification example:
 ```python:
 from transformers import AutoModelForSequenceClassification, AutoTokenizer
 
-model = AutoModelForSequenceClassification.from_pretrained("ccdv/lsg-base-4096-fr",
+model = AutoModelForSequenceClassification.from_pretrained("ccdv/lsg-camembert-base-4096",
     trust_remote_code=True,
    pool_with_global=True, # pool with a global token instead of first token
 )
-tokenizer = AutoTokenizer.from_pretrained("ccdv/lsg-base-4096-fr", use_auth_token=True)
+tokenizer = AutoTokenizer.from_pretrained("ccdv/lsg-camembert-base-4096")
 
 SENTENCE = "This is a test for sequence classification. " * 300
 token_ids = tokenizer(
@@ -137,16 +132,29 @@ To train global tokens and the classification head only:
 ```python:
 from transformers import AutoModelForSequenceClassification, AutoTokenizer
 
-model = AutoModelForSequenceClassification.from_pretrained("ccdv/lsg-base-4096-fr",
+model = AutoModelForSequenceClassification.from_pretrained("ccdv/lsg-camembert-base-4096",
    trust_remote_code=True,
    pool_with_global=True, # pool with a global token instead of first token
    num_global_tokens=16
 )
-tokenizer = AutoTokenizer.from_pretrained("ccdv/lsg-base-4096-fr")
+tokenizer = AutoTokenizer.from_pretrained("ccdv/lsg-camembert-base-4096")
 
 for name, param in model.named_parameters():
     if "global_embeddings" not in name:
         param.requires_grad = False
     else:
         param.requires_grad = True
+```
+
+**CamemBERT**
+```
+@inproceedings{Martin_2020,
+  doi = {10.18653/v1/2020.acl-main.645},
+  url = {https://doi.org/10.18653%2Fv1%2F2020.acl-main.645},
+  year = 2020,
+  publisher = {Association for Computational Linguistics},
+  author = {Louis Martin and Benjamin Muller and Pedro Javier Ortiz Su{\'{a}}rez and Yoann Dupont and Laurent Romary and {\'{E}}ric de la Clergerie and Djam{\'{e}} Seddah and Beno{\^{\i}}t Sagot},
+  title = {{CamemBERT}: a Tasty French Language Model},
+  booktitle = {Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics}
+}
 ```
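The padding recommendation in the updated README can be made concrete with a minimal sketch. It assumes the `ccdv/lsg-camembert-base-4096` checkpoint named in the diff and a block size of 128; both are configurable, and since `adaptive=True` the model would also pad internally if this step were skipped.

```python
from transformers import AutoModel, AutoTokenizer

# Minimal sketch of the README recommendation: truncate with the tokenizer and
# pad to a multiple of the block size. Assumes the "ccdv/lsg-camembert-base-4096"
# checkpoint and a block size of 128 (both configurable).
BLOCK_SIZE = 128

tokenizer = AutoTokenizer.from_pretrained("ccdv/lsg-camembert-base-4096")
model = AutoModel.from_pretrained("ccdv/lsg-camembert-base-4096", trust_remote_code=True)

inputs = tokenizer(
    "Paris est la capitale de la France. " * 200,
    return_tensors="pt",
    truncation=True,
    max_length=4096,
    padding="longest",              # padding must be enabled for pad_to_multiple_of to apply
    pad_to_multiple_of=BLOCK_SIZE,  # optional: align the input length with the block size
)
outputs = model(**inputs)
print(outputs.last_hidden_state.shape)  # sequence length is a multiple of BLOCK_SIZE
```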
config.json CHANGED
@@ -1,5 +1,5 @@
 {
-  "_name_or_path": "ccdv/lsg-base-4096-fr",
+  "_name_or_path": "ccdv/lsg-camembert-base-4096",
   "adaptive": true,
   "architectures": [
     "LSGCamembertForMaskedLM"
modeling_lsg_camembert.py CHANGED
@@ -1032,33 +1032,26 @@ class LSGCamembertModel(LSGCamembertPreTrainedModel, RobertaModel):
             return_dict=return_dict
         )
 
-        context = encoder_outputs[0]
+        sequence_output = encoder_outputs[0]
         if self.pool_with_global:
-            context[:, self.num_global_tokens] = context[:, 0]
+            sequence_output[:, self.num_global_tokens] = sequence_output[:, 0]
 
         diff = t - t_
-        n, _, d = context.size()
-        context = context[..., self.num_global_tokens:, :]
+        n, _, d = sequence_output.size()
+        sequence_output = sequence_output[..., self.num_global_tokens:, :]
 
         # Adapt sequence to initial shape
         if diff < 0:
-            context = context[:, :t]
+            sequence_output = sequence_output[:, :t]
 
-        encoder_outputs.last_hidden_state = context
-        sequence_output = encoder_outputs[0]
         pooled_output = self.pooler(sequence_output) if self.pooler is not None else None
 
         if not return_dict:
             return (sequence_output, pooled_output) + encoder_outputs[1:]
-
-        return BaseModelOutputWithPoolingAndCrossAttentions(
-            last_hidden_state=sequence_output,
-            pooler_output=pooled_output,
-            past_key_values=encoder_outputs.past_key_values,
-            hidden_states=encoder_outputs.hidden_states,
-            attentions=encoder_outputs.attentions,
-            cross_attentions=encoder_outputs.cross_attentions,
-        )
+
+        encoder_outputs.last_hidden_state = sequence_output
+        encoder_outputs.pooler_output = pooled_output
+        return encoder_outputs
 
     def get_extended_attention_mask(self, attention_mask, input_shape, device=None):
 
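The refactor above renames `context` to `sequence_output` and returns the mutated `encoder_outputs` object instead of rebuilding a `BaseModelOutputWithPoolingAndCrossAttentions`; the pooling trick itself is unchanged. As a toy sketch (made-up shapes, not the model's actual forward pass): when `pool_with_global` is set, the hidden state of the first global token is copied into the slot that becomes index 0 once the global prefix is stripped, so downstream pooling on the "first token" uses a global token.

```python
import torch

# Toy sketch of the tensor manipulation in the diff above (shapes are made up).
batch, num_global_tokens, seq_len, hidden = 2, 4, 10, 8
sequence_output = torch.randn(batch, num_global_tokens + seq_len, hidden)

# pool_with_global: write global token 0 into the slot that becomes position 0
sequence_output[:, num_global_tokens] = sequence_output[:, 0]

# drop the global-token prefix
sequence_output = sequence_output[..., num_global_tokens:, :]

# trim any adaptive padding back to the original length t (a no-op here)
t = seq_len
sequence_output = sequence_output[:, :t]

print(sequence_output.shape)  # torch.Size([2, 10, 8])
```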