commit files to HF hub


Files changed (12)

README.md +199 -0
config.json +29 -0
config.yaml +60 -0
configuration_energy.py +58 -0
mlm.py +410 -0
model.safetensors +3 -0
special_tokens_map.json +37 -0
tokenizer.json +0 -0
tokenizer/special_tokens_map.json +37 -0
tokenizer/tokenizer.json +0 -0
tokenizer/tokenizer_config.json +53 -0
tokenizer_config.json +54 -0

README.md ADDED

	@@ -0,0 +1,199 @@

+---
+library_name: transformers
+tags: []
+---
+# Model Card for Model ID
+<!-- Provide a quick summary of what the model is/does. -->
+## Model Details
+### Model Description
+<!-- Provide a longer summary of what this model is. -->
+This is the model card of a 🤗 transformers model that has been pushed on the Hub. This model card has been automatically generated.
+- **Developed by:** [More Information Needed]
+- **Funded by [optional]:** [More Information Needed]
+- **Shared by [optional]:** [More Information Needed]
+- **Model type:** [More Information Needed]
+- **Language(s) (NLP):** [More Information Needed]
+- **License:** [More Information Needed]
+- **Finetuned from model [optional]:** [More Information Needed]
+### Model Sources [optional]
+<!-- Provide the basic links for the model. -->
+- **Repository:** [More Information Needed]
+- **Paper [optional]:** [More Information Needed]
+- **Demo [optional]:** [More Information Needed]
+## Uses
+<!-- Address questions around how the model is intended to be used, including the foreseeable users of the model and those affected by the model. -->
+### Direct Use
+<!-- This section is for the model use without fine-tuning or plugging into a larger ecosystem/app. -->
+[More Information Needed]
+### Downstream Use [optional]
+<!-- This section is for the model use when fine-tuned for a task, or when plugged into a larger ecosystem/app -->
+[More Information Needed]
+### Out-of-Scope Use
+<!-- This section addresses misuse, malicious use, and uses that the model will not work well for. -->
+[More Information Needed]
+## Bias, Risks, and Limitations
+<!-- This section is meant to convey both technical and sociotechnical limitations. -->
+[More Information Needed]
+### Recommendations
+<!-- This section is meant to convey recommendations with respect to the bias, risk, and technical limitations. -->
+Users (both direct and downstream) should be made aware of the risks, biases and limitations of the model. More information needed for further recommendations.
+## How to Get Started with the Model
+Use the code below to get started with the model.
+[More Information Needed]
+## Training Details
+### Training Data
+<!-- This should link to a Dataset Card, perhaps with a short stub of information on what the training data is all about as well as documentation related to data pre-processing or additional filtering. -->
+[More Information Needed]
+### Training Procedure
+<!-- This relates heavily to the Technical Specifications. Content here should link to that section when it is relevant to the training procedure. -->
+#### Preprocessing [optional]
+[More Information Needed]
+#### Training Hyperparameters
+- **Training regime:** [More Information Needed] <!--fp32, fp16 mixed precision, bf16 mixed precision, bf16 non-mixed precision, fp16 non-mixed precision, fp8 mixed precision -->
+#### Speeds, Sizes, Times [optional]
+<!-- This section provides information about throughput, start/end time, checkpoint size if relevant, etc. -->
+[More Information Needed]
+## Evaluation
+<!-- This section describes the evaluation protocols and provides the results. -->
+### Testing Data, Factors & Metrics
+#### Testing Data
+<!-- This should link to a Dataset Card if possible. -->
+[More Information Needed]
+#### Factors
+<!-- These are the things the evaluation is disaggregating by, e.g., subpopulations or domains. -->
+[More Information Needed]
+#### Metrics
+<!-- These are the evaluation metrics being used, ideally with a description of why. -->
+[More Information Needed]
+### Results
+[More Information Needed]
+#### Summary
+## Model Examination [optional]
+<!-- Relevant interpretability work for the model goes here -->
+[More Information Needed]
+## Environmental Impact
+<!-- Total emissions (in grams of CO2eq) and additional considerations, such as electricity usage, go here. Edit the suggested text below accordingly -->
+Carbon emissions can be estimated using the [Machine Learning Impact calculator](https://mlco2.github.io/impact#compute) presented in [Lacoste et al. (2019)](https://arxiv.org/abs/1910.09700).
+- **Hardware Type:** [More Information Needed]
+- **Hours used:** [More Information Needed]
+- **Cloud Provider:** [More Information Needed]
+- **Compute Region:** [More Information Needed]
+- **Carbon Emitted:** [More Information Needed]
+## Technical Specifications [optional]
+### Model Architecture and Objective
+[More Information Needed]
+### Compute Infrastructure
+[More Information Needed]
+#### Hardware
+[More Information Needed]
+#### Software
+[More Information Needed]
+## Citation [optional]
+<!-- If there is a paper or blog post introducing the model, the APA and Bibtex information for that should go in this section. -->
+**BibTeX:**
+[More Information Needed]
+**APA:**
+[More Information Needed]
+## Glossary [optional]
+<!-- If relevant, include terms and calculations in this section that can help readers understand the model or model card. -->
+[More Information Needed]
+## More Information [optional]
+[More Information Needed]
+## Model Card Authors [optional]
+[More Information Needed]
+## Model Card Contact
+[More Information Needed]

config.json ADDED

	@@ -0,0 +1,29 @@

+{
+  "_name_or_path": "NRJ-350",
+  "activation": "softmax",
+  "alpha": 0.1,
+  "architectures": [
+    "BertEnergyModelForMaskedLM"
+  ],
+  "auto_map": {
+    "AutoModel": "mlm.BertEnergyModelForMaskedLM"
+  },
+  "beta": 0.125,
+  "bias": true,
+  "block_size": 512,
+  "compile": false,
+  "dropout": 0.1,
+  "embedding_dim": 768,
+  "forward_memories": 3072,
+  "layer_norm": 1e-12,
+  "model_type": "bert_energy",
+  "num_heads": 12,
+  "num_layers": 12,
+  "pad_idx": null,
+  "positional": true,
+  "share_layers": false,
+  "tie_weights": false,
+  "torch_dtype": "float32",
+  "transformers_version": "4.47.0",
+  "vocabulary_size": 30000
+}
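
A quick sanity check on the values above: with `embedding_dim` 768 and `num_heads` 12, each head is 64 wide, and `beta` 0.125 equals 1/sqrt(64), the usual scaled-dot-product attention scale. Whether `HopfieldMHA` actually uses `beta` this way is an assumption, since `hopfield.py` is not part of this commit. A minimal sketch of the check:

```python
import json
from math import isclose, sqrt

# Subset of the config.json values shown above.
config = json.loads("""
{"embedding_dim": 768, "num_heads": 12, "beta": 0.125,
 "forward_memories": 3072, "vocabulary_size": 30000}
""")


def head_dim(cfg):
    """Per-head width; the embedding must split evenly across heads."""
    dim, heads = cfg["embedding_dim"], cfg["num_heads"]
    if dim % heads:
        raise ValueError("embedding_dim must be divisible by num_heads")
    return dim // heads


d = head_dim(config)
# Assumption: beta plays the role of the 1/sqrt(d_head) attention scale.
beta_matches_scale = isclose(config["beta"], 1 / sqrt(d))
print(d, beta_matches_scale)
```

If `beta` is ever changed independently of `num_heads`, a check like this would flag that the two no longer agree.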

config.yaml ADDED

	@@ -0,0 +1,60 @@

+activation: softmax
+adam_beta1: 0.9
+adam_beta2: 0.99
+adam_epsilon: 1.0e-06
+alpha: 0.1
+attn_implementation: null
+beta: 0.125
+bf16: true
+block_size: 512
+checkpoint_dir: mlruns/896390784617014591/892b97fa0aa6499288906c463545ae00/checkpoints
+compile: false
+config_path: configs/JZ/NRJ_base-wiki-original.yaml
+dataloader_num_workers: 8
+dataset_path: /lustre/fswork/projects/rech/oou/uqh26ve/data/pre_training/en/en_wiki/wiki_20220301-cleaned-valid001/data-bin/wiki_20220301-cleaned-valid001-BPE30K/
+ddp_find_unused_parameters: false
+disable_tqdm: true
+do_eval: true
+dropout: 0.1
+embedding_dim: 768
+eval_steps: 25000
+evaluation_strategy: steps
+forward_memories: 3072
+fp16: false
+gradient_accumulation_steps: 1
+ignore_lines: false
+layer_norm: 1.0e-12
+learning_rate: 0.0007
+log_on_each_node: false
+logging_steps: 1000
+logging_strategy: steps
+lr_scheduler_kwargs: {}
+lr_scheduler_type: cosine
+max_steps: 500000
+model_name: NRJ-V_30000K_bpe-NL12-NH12-EMB768-FFN3072
+model_type: energyBERT
+n_run: 51
+num_heads: 12
+num_layers: 12
+num_params: 50638896
+optimizer: adamw_torch
+output_dir: null
+per_device_eval_batch_size: 8
+per_device_train_batch_size: 64
+remove_unused_columns: false
+report_to: mlflow
+save_steps: 25000
+save_strategy: steps
+seed: 42
+share_layers: false
+test_file: /lustre/fswork/projects/rech/oou/uqh26ve/data/pre_training/en/en_wiki/wiki_20220301-cleaned-valid001/wikipedia.test.txt
+tie_weights: false
+tokenizer_path: /lustre/fswork/projects/rech/oou/uqh26ve/data/pre_training/en/en_wiki/wiki_20220301-cleaned-valid001/data-bin/wiki_20220301-cleaned-valid001-BPE30K/tokenizer
+tokenizer_type: bpe
+total_batch_size: 4096
+training_file: /lustre/fswork/projects/rech/oou/uqh26ve/data/pre_training/en/en_wiki/wiki_20220301-cleaned-valid001/wikipedia.train.txt
+valid_file: /lustre/fswork/projects/rech/oou/uqh26ve/data/pre_training/en/en_wiki/wiki_20220301-cleaned-valid001/wikipedia.valid.txt
+vocabulary_size: 30000
+warmup_ratio: 0.0
+warmup_steps: 24000
+weight_decay: 0.01
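
The schedule settings above (`learning_rate: 0.0007`, `warmup_steps: 24000`, `lr_scheduler_type: cosine`, `max_steps: 500000`) can be pictured with a small stand-alone function. This is a sketch under the assumption that training used a linear warmup followed by a half-cosine decay to zero, as the Hugging Face cosine scheduler does; the actual training script is not part of this commit. `warmup_ratio` is 0.0, so only the step count is modelled here.

```python
import math

LEARNING_RATE = 0.0007
WARMUP_STEPS = 24_000
MAX_STEPS = 500_000


def lr_at(step, base_lr=LEARNING_RATE, warmup=WARMUP_STEPS, total=MAX_STEPS):
    """Linear warmup to base_lr, then half-cosine decay to zero at `total`."""
    if step < warmup:
        return base_lr * step / max(1, warmup)
    progress = (step - warmup) / max(1, total - warmup)
    return base_lr * max(0.0, 0.5 * (1.0 + math.cos(math.pi * progress)))


# Print the learning rate at a few landmark steps.
for step in (0, 12_000, 24_000, 262_000, 500_000):
    print(f"{step:>7d}  {lr_at(step):.6f}")
```

The peak is reached at step 24000, and the rate falls back to half the peak midway through the decay phase (step 262000).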

configuration_energy.py ADDED

	@@ -0,0 +1,58 @@

+from math import sqrt,log
+import sys
+#sys.path.append("../energy") # Messy
+import torch
+import torch.nn as nn
+from torch.nn.functional import softmax,relu,linear
+from common import PositionalEncoding
+from hopfield import HopfieldLayer, HopfieldMHA, HopfieldReLU, HopfieldSoftmax
+from torch.cuda.amp import autocast
+import yaml
+from transformers import PreTrainedModel, PretrainedConfig
+from transformers.modeling_outputs import MaskedLMOutput, BaseModelOutput
+class BertEnergyConfig(PretrainedConfig):
+    model_type = "bert_energy"
+    def __init__(self, config=None, path=None, vocabulary_size=50, num_layers=12, num_heads=12, forward_memories=2048, embedding_dim=768, activation="relu",positional=True,  bias=True, tie_weights=True, alpha=1.0,
+                 beta=1., layer_norm=1e-05, dropout=0.0, block_size=512, share_layers=False, compile=False, pad_idx=None, **kwargs):
+        self.vocabulary_size = vocabulary_size
+        self.num_layers = num_layers
+        self.num_heads = num_heads
+        self.activation = activation
+        self.positional = positional
+        self.tie_weights = tie_weights
+        self.bias = bias
+        self.forward_memories = forward_memories
+        self.embedding_dim = embedding_dim
+        self.share_layers = share_layers
+        self.alpha = alpha
+        self.beta = beta
+        self.layer_norm = float(layer_norm)
+        self.dropout = dropout
+        self.block_size = block_size
+        self.compile = compile
+        self.pad_idx = pad_idx
+        if config is not None:
+            for key,value in config.to_dict().items():
+                if key.lower() in self.__dict__.keys():
+                    print(key, file=sys.stderr)
+                    setattr(self,key.lower(),value)
+        elif path is not None:
+            if path.endswith(".yaml"):
+                with open(path) as istream:
+                    config = yaml.safe_load(istream)
+                    for key,value in config.items():
+                        print(key)
+                        if key.lower() in self.__dict__.keys():
+                            setattr(self,key.lower(),value)
+            else:
+                raise NotImplementedError
+        super().__init__(**kwargs)
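
The override logic in `BertEnergyConfig.__init__` copies a key only when an attribute with the same lower-cased name already exists, so unrelated entries in the YAML file (for example training hyperparameters) are skipped. The sketch below reproduces that pattern with a plain dict and a hypothetical `TinyConfig` class, so it runs without `transformers` or `yaml`; the `ignored` list is an addition for illustration only. Note that iterating a dict directly yields keys only, which is why the YAML branch unpacks `.items()`.

```python
class TinyConfig:
    """Hypothetical stand-in for BertEnergyConfig (illustration only)."""

    def __init__(self, overrides=None, num_layers=12, num_heads=12, dropout=0.0):
        self.num_layers = num_layers
        self.num_heads = num_heads
        self.dropout = dropout
        known = set(vars(self))  # attribute names defined so far
        self.ignored = []        # keys that matched no known attribute
        for key, value in (overrides or {}).items():
            name = key.lower()
            if name in known:
                setattr(self, name, value)
            else:
                self.ignored.append(key)


cfg = TinyConfig({"NUM_LAYERS": 6, "dropout": 0.1, "learning_rate": 0.0007})
print(cfg.num_layers, cfg.num_heads, cfg.dropout, cfg.ignored)
```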

mlm.py ADDED

	@@ -0,0 +1,410 @@

+from math import sqrt,log
+import sys
+import torch
+import torch.nn as nn
+from torch.nn.functional import softmax,relu,linear, gelu
+from common import PositionalEncoding
+from hopfield import HopfieldLayer, HopfieldMHA, HopfieldReLU, HopfieldSoftmax
+from configuration_energy import BertEnergyConfig
+from torch.cuda.amp import autocast
+import yaml
+from torch.nn import BCEWithLogitsLoss, CrossEntropyLoss, MSELoss
+from transformers import PreTrainedModel, PretrainedConfig
+from transformers.modeling_outputs import MaskedLMOutput, BaseModelOutput, SequenceClassifierOutput
+ACT2FN={'relu': relu, 'gelu': gelu, 'softmax': softmax}
+class BertModel(PreTrainedModel):
+    """ Backbone of standard BERT model
+        outputs : last hidden state, history"""
+    config_class = BertEnergyConfig
+    def __init__(self, config, add_pooling_layer=True, pad_idx=None, **kwargs):
+        super().__init__(config)
+        self.Emb_in         = nn.Embedding(config.vocabulary_size, config.embedding_dim, padding_idx=pad_idx)
+        self.posn           = PositionalEncoding(config.embedding_dim, max_len=config.block_size,dropout=config.dropout) if config.positional else None
+        if config.share_layers: # ALBERT config
+            self.embedding_hidden_in = nn.Linear(config.embedding_dim, config.forward_memories) # ALBERT uses two matrices instead of one for embeddings, see Section 3.1 of the ALBERT paper
+            # Albert normalise and penalise embeddings
+            self.embed_norm = nn.LayerNorm(config.embedding_dim, eps=config.layer_norm)
+            self.embed_dropout = nn.Dropout(config.dropout)
+        self.num_layers = config.num_layers
+        self.share_layers = config.share_layers
+        if config.share_layers:
+            layer = nn.TransformerEncoderLayer(config.forward_memories,
+                            config.num_heads,
+                            activation=config.activation,
+                            dim_feedforward=config.forward_memories*4,
+                            dropout=config.dropout,
+                            layer_norm_eps=config.layer_norm,
+                            batch_first=True,
+                            norm_first=True,
+                            )
+            self.layers = nn.ModuleList([layer])
+        else:
+            self.layers  = nn.ModuleList([nn.TransformerEncoderLayer(config.embedding_dim,
+                            config.num_heads,
+                            dim_feedforward=config.forward_memories*4,
+                            dropout=config.dropout,
+                            layer_norm_eps=config.layer_norm,
+                            batch_first=True,
+                            norm_first=True,
+                            ) for _ in range(config.num_layers)])
+    def forward(self,input_ids, attention_mask=None, **kwargs):
+        """ Warning: expects an HF-style attention mask (1 = token, 0 = padding); it is inverted below because PyTorch's src_key_padding_mask marks padding with True"""
+        xbatch = self.Emb_in(input_ids)
+        if self.posn:
+            X      = xbatch + self.posn(xbatch)
+        else:
+            X      = xbatch
+        if self.share_layers:
+            X = self.embed_norm(X)
+            X = self.embed_dropout(X)
+            X = self.embedding_hidden_in(X)
+        history = None if self.training else [X]
+        # WARNING
+        if attention_mask is not None: # the default None would otherwise raise on .bool()
+            attention_mask = ~attention_mask.bool() # Mismatch between HF tokenizer and Torch attention mask https://pytorch.org/docs/stable/generated/torch.nn.Transformer.html#torch.nn.Transformer
+        for i in range(self.num_layers):
+            if self.share_layers:
+                layer = self.layers[0]
+            else:
+                layer = self.layers[i]
+            X   = layer(X, src_key_padding_mask=attention_mask)
+            if not self.training:
+                history.append(X)
+        # TODO add return attention
+        return BaseModelOutput(last_hidden_state=X,
+                                hidden_states=history,
+                                attentions=None)
+class BertModelForMaskedLM(PreTrainedModel):
+    """ Bert model to be trained on the MLM task.
+        Based on the backbone Bert model + dense layer, activation, norm, and projection on the vocabulary (weight tying is currently commented out)
+        outputs: cross entropy loss / logits / hidden states
+    """
+    config_class = BertEnergyConfig
+    ignore_index = -100
+    _tied_weights_keys = ["Emb_out.weight", "Emb_out.bias"]
+    def __init__(self, config, add_pooling_layer=True, pad_idx=None):
+        super().__init__(config)
+        self.config = config
+        self.model = BertModel(config, pad_idx=pad_idx)
+        self.norm = nn.LayerNorm(config.embedding_dim, eps=config.layer_norm)
+        self.dense = nn.Linear(config.forward_memories, config.embedding_dim)
+        self.activation = ACT2FN[config.activation]
+        """
+        if config.tie_weights:
+            self.Emb_out = nn.Linear(config.embedding_dim, config.vocabulary_size, bias=False)
+            self.tie_weights()
+        else:
+            self.Emb_out = nn.Linear(config.embedding_dim, config.vocabulary_size)
+            self.bias = nn.Parameter(torch.zeros(config.vocabulary_size))
+            self.Emb_out.bias = self.bias
+        """
+        self.Emb_out = nn.Linear(config.forward_memories, config.vocabulary_size)
+        self.bias = nn.Parameter(torch.zeros(config.vocabulary_size))
+        self.Emb_out.bias = self.bias
+    def get_input_embeddings(self):
+        return self.model.Emb_in
+    def set_output_embeddings(self, new_embeddings):
+        self.Emb_out = new_embeddings
+    def forward(self,input_ids, attention_mask=None,  labels=None,  **kwargs):
+        outputs = self.model(input_ids, attention_mask, **kwargs)
+        last_hidden_state = outputs.last_hidden_state
+        hidden_states = outputs.hidden_states
+        attentions = outputs.attentions
+        last_hidden_state = self.dense(last_hidden_state)
+        last_hidden_state = self.activation(last_hidden_state)
+        last_hidden_state = self.norm(last_hidden_state)
+        """
+        if self.config.tie_weights:
+            logits = last_hidden_state @ self.Emb_out.weight.transpose(-1,-2)
+        else:
+            logits = self.Emb_out(last_hidden_state)
+        """
+        logits = self.Emb_out(last_hidden_state)
+        loss = None
+        if labels is not None:
+            loss_fct = CrossEntropyLoss()
+            loss = loss_fct(logits.view(-1, self.config.vocabulary_size), labels.view(-1))
+        return MaskedLMOutput(loss=loss,
+                            logits=logits,
+                            hidden_states=hidden_states,
+                            attentions=attentions)
+class BertModelForSequenceClassification(PreTrainedModel):
+    """ Bert model to be trained on Sequence classification tasks.
+        Based on the backbone Bert model + a RoBERTa-style classification head applied to the first token
+        outputs: loss (MSE, cross entropy, or BCE depending on problem_type) / logits / hidden states
+    """
+    config_class = BertEnergyConfig
+    ignore_index = -100
+    def __init__(self, config, add_pooling_layer=True, pad_idx=None,
+                 num_labels=2, classifier_dropout=None, return_dict=True):
+        super().__init__(config)
+        self.config = config
+        self.num_labels = num_labels
+        self.classifier_dropout = classifier_dropout
+        self.return_dict = return_dict
+        self.model = BertModel(config, pad_idx=pad_idx)
+        self.dense = nn.Linear(config.forward_memories, config.forward_memories)
+        classifier_dropout = (
+            classifier_dropout if classifier_dropout is not None else config.dropout
+        )
+        self.dropout = nn.Dropout(classifier_dropout)
+        self.classifier = nn.Linear(config.forward_memories,num_labels)
+        self.norm    = nn.LayerNorm(config.embedding_dim)
+        #self.Emb_out = nn.Linear(config.embedding_dim, config.vocabulary_size, bias=False)
+        #self.Emb_out.weight = self.model.Emb_in.weight  # weight tying
+    def forward(self,input_ids, labels=None, return_dict=False,  **kwargs):
+        outputs = self.model(input_ids, **kwargs)
+        last_hidden_state = self.norm(outputs.last_hidden_state)
+        # Code from roberta : https://github.com/huggingface/transformers/blob/v4.39.3/src/transformers/models/roberta/modeling_roberta.py#L1426
+        x = last_hidden_state[:, 0, :]  # take <s> token (equiv. to [CLS])
+        x = self.dropout(x)
+        x = self.dense(x)
+        x = torch.tanh(x)
+        x = self.dropout(x)
+        logits = self.classifier(x)
+        hidden_states = outputs.hidden_states
+        attentions = outputs.attentions
+        loss = None
+        if labels is not None:
+            # move labels to correct device to enable model parallelism
+            labels = labels.to(logits.device)
+            if self.config.problem_type is None:
+                if self.num_labels == 1:
+                    self.config.problem_type = "regression"
+                elif self.num_labels > 1 and (labels.dtype == torch.long or labels.dtype == torch.int):
+                    self.config.problem_type = "single_label_classification"
+                else:
+                    self.config.problem_type = "multi_label_classification"
+            if self.config.problem_type == "regression":
+                loss_fct = MSELoss()
+                if self.num_labels == 1:
+                    loss = loss_fct(logits.squeeze(), labels.squeeze())
+                else:
+                    loss = loss_fct(logits, labels)
+            elif self.config.problem_type == "single_label_classification":
+                loss_fct = CrossEntropyLoss()
+                loss = loss_fct(logits.view(-1, self.num_labels), labels.view(-1))
+            elif self.config.problem_type == "multi_label_classification":
+                loss_fct = BCEWithLogitsLoss()
+                loss = loss_fct(logits, labels)
+        if not return_dict:
+            output = (logits,) + outputs[2:]
+            return ((loss,) + output) if loss is not None else output
+        return SequenceClassifierOutput(
+            loss=loss,
+            logits=logits,
+            hidden_states=outputs.hidden_states,
+            attentions=outputs.attentions,
+        )
+    def compute_loss(self, logits, labels):
+        # code from https://github.com/huggingface/transformers/blob/main/src/transformers/trainer_pt_utils.py#L494
+        log_probs = -nn.functional.log_softmax(logits, dim=-1)
+        if labels.dim() == log_probs.dim() - 1:
+            labels = labels.unsqueeze(-1)
+        padding_mask = labels.eq(self.ignore_index)
+        # In case the ignore_index is -100, the gather will fail, so we replace labels by 0. The padding_mask
+        # will ignore them in any case.
+        labels = torch.clamp(labels, min=0)
+        nll_loss = log_probs.gather(dim=-1, index=labels)
+        nll_loss.masked_fill_(padding_mask, 0.0)
+        num_active_elements = padding_mask.numel() - padding_mask.long().sum()
+        nll_loss = nll_loss.sum() / num_active_elements
+        return nll_loss
+class BertEnergyModel(PreTrainedModel):
+    config_class = BertEnergyConfig
+    def __init__(self, config, add_pooling_layer=True, pad_idx=None, **kwargs):
+        super().__init__(config)
+        self.Emb_in = nn.Embedding(config.vocabulary_size, config.embedding_dim, padding_idx=pad_idx)
+        self.posn    = PositionalEncoding(config.embedding_dim,max_len=config.block_size,dropout=config.dropout) if config.positional else None
+        self.num_layers = config.num_layers
+        self.layer   = HopfieldLayer(config.embedding_dim,config.num_heads,forward_memories=config.forward_memories,forward_activation=config.activation,bias=config.bias,beta=config.beta,dropout=config.dropout)
+        self.alpha   = config.alpha
+    def forward(self,input_ids, attention_mask=None, **kwargs):
+        xbatch = self.Emb_in(input_ids)
+        if self.posn:
+            X      = xbatch + self.posn(xbatch)
+        else:
+            X      = xbatch
+        history = None if self.training else [X]
+        for _ in range(self.num_layers):
+            #TODO add src_key pad attention mask
+            X = X - self.alpha * self.layer(X, src_key_padding_mask=attention_mask, is_causal=False)
+            if not self.training:
+                history.append(X)
+        return BaseModelOutput(last_hidden_state=X,
+                                hidden_states=history,
+                                attentions=None)
+class BertEnergyModelForMaskedLM(PreTrainedModel):
+    config_class = BertEnergyConfig
+    ignore_index = -100
+    _tied_weights_keys = ["Emb_out.weight", "Emb_out.bias"]
+    def __init__(self, config, add_pooling_layer=True, pad_idx=None):
+        super().__init__(config)
+        self.config = config
+        self.model = BertEnergyModel(config, pad_idx=pad_idx)
+        self.norm = nn.LayerNorm(config.embedding_dim, eps=config.layer_norm)
+        self.dense = nn.Linear(config.embedding_dim, config.embedding_dim)
+        self.activation = ACT2FN[config.activation]
+        self.Emb_out = nn.Linear(config.embedding_dim, config.vocabulary_size)
+        self.bias = nn.Parameter(torch.zeros(config.vocabulary_size))
+        self.Emb_out.bias = self.bias
+    def get_input_embeddings(self):
+        return self.model.Emb_in
+    def set_output_embeddings(self, new_embeddings):
+        self.Emb_out = new_embeddings
+    def forward(self,input_ids, attention_mask=None, labels=None, **kwargs  ):
+        outputs = self.model(input_ids , attention_mask=attention_mask)
+        last_hidden_state = outputs.last_hidden_state
+        hidden_states = outputs.hidden_states
+        attentions = outputs.attentions
+        last_hidden_state = self.dense(last_hidden_state)
+        last_hidden_state = gelu(last_hidden_state) #XXX
+        last_hidden_state = self.norm(last_hidden_state)
+        #logits = self.norm(last_hidden_state) @ self.Emb_out.weight.transpose(-1,-2)
+        if self.config.tie_weights:
+            logits = last_hidden_state @ self.Emb_out.weight.transpose(-1,-2)
+        else:
+            logits = self.Emb_out(last_hidden_state)
+        loss = None
+        hidden_states = hidden_states
+        attentions = None
+        #if labels is not None:
+        #    loss = self.compute_loss(logits, labels)
+        if labels is not None:
+            loss_fct = CrossEntropyLoss()
+            loss = loss_fct(logits.view(-1, self.config.vocabulary_size), labels.view(-1))
+        return MaskedLMOutput(loss=loss,
+                            logits=logits,
+                            hidden_states=hidden_states,
+                            attentions=attentions)
+if __name__ == '__main__':
+    def grads(f, x):
+        """ Autograd used for the energy """
+        return torch.func.jacrev(f)(x)
+    #from test import *
+    x = torch.randn(1,10)
+    input_ids = torch.tensor([[3,12,44, 2]])
+    #test relu
+    #print('relu')
+    #hrelu = HopfieldReLU(10,4,bias=False)
+    #print(hrelu(x),hrelu.energy(x))
+    #print(grads(hrelu.energy,x))
+    #test softmax
+    #print('softmax')
+    #hsoftmax = HopfieldSoftmax(10,4,bias=None)
+    #print(hsoftmax(x),hsoftmax.energy(x))
+    #print(grads(hsoftmax.energy,x))
+    #test MHA
+    #print('mha')
+    #mha = HopfieldMHA(15,3)
+    #X = torch.randn(2,4,15)
+    #causal = True
+    #print(mha(X,is_causal=causal),mha.energy(X,is_causal=causal))
+    #print()
+    #print('=== Ref=== ')
+    #for x in X: #autograd breaks with higher order tensors
+    #    print(grads(lambda y: mha.energy(y,is_causal=causal) ,x))
+    # HopfieldConfig / HFHopfieldModel are not defined in this file; use the classes defined above
+    config = BertEnergyConfig(path="../lmconfig.yaml")
+    print(config)
+    #exit()
+    mdl = BertEnergyModel(config)
+    mdl.eval()
+    #print(mdl)
+    out = mdl(input_ids)
+    print(out[0].mean())
+    mdl.save_pretrained("test_checkpoint")
+    reloaded = BertEnergyModel.from_pretrained("test_checkpoint")
+    out_reloaded = reloaded(input_ids)
+    print(out_reloaded[0].mean())
+    reloaded.to("cuda:0")
+    print(reloaded(input_ids.to("cuda:0"))[0])
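
`BertEnergyModel.forward` applies the same `HopfieldLayer` `num_layers` times with the update `X = X - alpha * layer(X)`. If the layer returns the gradient of an energy function (an assumption: `hopfield.py` is not in this commit, though the commented-out tests compare layer outputs with `grads(...energy, x)`), the loop is plain gradient descent with step size `alpha`. The toy sketch below uses a hypothetical quadratic energy in place of the real layer, with `alpha = 0.1` and 12 steps as in `config.json`.

```python
def energy(x):
    """Toy quadratic energy E(x) = 0.5 * sum(x_i^2)."""
    return 0.5 * sum(v * v for v in x)


def energy_grad(x):
    """Gradient of the toy energy; stands in for the HopfieldLayer output."""
    return list(x)


def descend(x, alpha=0.1, num_layers=12):
    """Apply x <- x - alpha * grad E(x) once per 'layer', recording the energy."""
    trace = [energy(x)]
    for _ in range(num_layers):
        g = energy_grad(x)
        x = [xi - alpha * gi for xi, gi in zip(x, g)]
        trace.append(energy(x))
    return x, trace


x_final, trace = descend([1.0, -2.0, 0.5])
print(trace[0], trace[-1])
```

For this toy energy every step scales the state by `1 - alpha`, so the energy decreases monotonically; the real model gives no such guarantee without inspecting `HopfieldLayer`.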

model.safetensors ADDED

	@@ -0,0 +1,3 @@

+version https://git-lfs.github.com/spec/v1
+oid sha256:84b9d586b1e5d0e9662d1ca8cacd558a323da7d6c30ac29efd6b2678ae51c923
+size 202676920
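
The pointer's `size` can be cross-checked against `num_params: 50638896` in `config.yaml` and `torch_dtype: float32` in `config.json`. The arithmetic below is only a rough consistency check; the explanation for the leftover bytes (file header, tensors not counted in `num_params`) is an assumption, since the file itself is not shown.

```python
SIZE_BYTES = 202_676_920   # "size" from the LFS pointer above
NUM_PARAMS = 50_638_896    # "num_params" from config.yaml
BYTES_PER_FLOAT32 = 4      # config.json declares torch_dtype float32

# Upper bound on the parameter count if every byte held float32 weights.
upper_bound = SIZE_BYTES // BYTES_PER_FLOAT32
# Bytes not accounted for by num_params float32 values.
extra_bytes = SIZE_BYTES - NUM_PARAMS * BYTES_PER_FLOAT32
print(upper_bound, extra_bytes)
```

The two figures agree to within about 0.06%, which suggests the checkpoint stores the weights in float32 as declared.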

special_tokens_map.json ADDED

	@@ -0,0 +1,37 @@

+{
+  "cls_token": {
+    "content": "[CLS]",
+    "lstrip": false,
+    "normalized": false,
+    "rstrip": false,
+    "single_word": false
+  },
+  "eos_token": {
+    "content": "[SEP]",
+    "lstrip": false,
+    "normalized": false,
+    "rstrip": false,
+    "single_word": false
+  },
+  "mask_token": {
+    "content": "[MASK]",
+    "lstrip": false,
+    "normalized": false,
+    "rstrip": false,
+    "single_word": false
+  },
+  "pad_token": {
+    "content": "[PAD]",
+    "lstrip": false,
+    "normalized": false,
+    "rstrip": false,
+    "single_word": false
+  },
+  "unk_token": {
+    "content": "[UNK]",
+    "lstrip": false,
+    "normalized": false,
+    "rstrip": false,
+    "single_word": false
+  }
+}

tokenizer.json ADDED

The diff for this file is too large to render. See raw diff

tokenizer/special_tokens_map.json ADDED

	@@ -0,0 +1,37 @@

+{
+  "cls_token": {
+    "content": "[CLS]",
+    "lstrip": false,
+    "normalized": false,
+    "rstrip": false,
+    "single_word": false
+  },
+  "eos_token": {
+    "content": "[SEP]",
+    "lstrip": false,
+    "normalized": false,
+    "rstrip": false,
+    "single_word": false
+  },
+  "mask_token": {
+    "content": "[MASK]",
+    "lstrip": false,
+    "normalized": false,
+    "rstrip": false,
+    "single_word": false
+  },
+  "pad_token": {
+    "content": "[PAD]",
+    "lstrip": false,
+    "normalized": false,
+    "rstrip": false,
+    "single_word": false
+  },
+  "unk_token": {
+    "content": "[UNK]",
+    "lstrip": false,
+    "normalized": false,
+    "rstrip": false,
+    "single_word": false
+  }
+}

tokenizer/tokenizer.json ADDED

The diff for this file is too large to render. See raw diff

tokenizer/tokenizer_config.json ADDED

	@@ -0,0 +1,53 @@

+{
+  "added_tokens_decoder": {
+    "0": {
+      "content": "[UNK]",
+      "lstrip": false,
+      "normalized": false,
+      "rstrip": false,
+      "single_word": false,
+      "special": true
+    },
+    "1": {
+      "content": "[CLS]",
+      "lstrip": false,
+      "normalized": false,
+      "rstrip": false,
+      "single_word": false,
+      "special": true
+    },
+    "2": {
+      "content": "[SEP]",
+      "lstrip": false,
+      "normalized": false,
+      "rstrip": false,
+      "single_word": false,
+      "special": true
+    },
+    "3": {
+      "content": "[PAD]",
+      "lstrip": false,
+      "normalized": false,
+      "rstrip": false,
+      "single_word": false,
+      "special": true
+    },
+    "4": {
+      "content": "[MASK]",
+      "lstrip": false,
+      "normalized": false,
+      "rstrip": false,
+      "single_word": false,
+      "special": true
+    }
+  },
+  "clean_up_tokenization_spaces": false,
+  "cls_token": "[CLS]",
+  "eos_token": "[SEP]",
+  "mask_token": "[MASK]",
+  "max_length": 512,
+  "model_max_length": 1000000000000000019884624838656,
+  "pad_token": "[PAD]",
+  "tokenizer_class": "PreTrainedTokenizerFast",
+  "unk_token": "[UNK]"
+}

tokenizer_config.json ADDED

	@@ -0,0 +1,54 @@

+{
+  "added_tokens_decoder": {
+    "0": {
+      "content": "[UNK]",
+      "lstrip": false,
+      "normalized": false,
+      "rstrip": false,
+      "single_word": false,
+      "special": true
+    },
+    "1": {
+      "content": "[CLS]",
+      "lstrip": false,
+      "normalized": false,
+      "rstrip": false,
+      "single_word": false,
+      "special": true
+    },
+    "2": {
+      "content": "[SEP]",
+      "lstrip": false,
+      "normalized": false,
+      "rstrip": false,
+      "single_word": false,
+      "special": true
+    },
+    "3": {
+      "content": "[PAD]",
+      "lstrip": false,
+      "normalized": false,
+      "rstrip": false,
+      "single_word": false,
+      "special": true
+    },
+    "4": {
+      "content": "[MASK]",
+      "lstrip": false,
+      "normalized": false,
+      "rstrip": false,
+      "single_word": false,
+      "special": true
+    }
+  },
+  "clean_up_tokenization_spaces": false,
+  "cls_token": "[CLS]",
+  "eos_token": "[SEP]",
+  "extra_special_tokens": {},
+  "mask_token": "[MASK]",
+  "max_length": 512,
+  "model_max_length": 1000000000000000019884624838656,
+  "pad_token": "[PAD]",
+  "tokenizer_class": "PreTrainedTokenizerFast",
+  "unk_token": "[UNK]"
+}
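
The tokenizer config assigns `[PAD]` id 3, while `config.json` leaves `pad_idx` null, and `BertModel.forward` in `mlm.py` inverts the tokenizer's attention mask before handing it to PyTorch. The sketch below shows both conventions side by side in plain Python. Right-padding and the token ids 17, 42, and 99 are assumptions made for illustration; 1 and 2 are the `[CLS]` and `[SEP]` ids listed above.

```python
PAD_ID = 3  # id of "[PAD]" in added_tokens_decoder above


def pad_batch(sequences, pad_id=PAD_ID):
    """Right-pad token-id lists; return ids plus an HF-style mask (1 = real token)."""
    width = max(len(s) for s in sequences)
    input_ids = [s + [pad_id] * (width - len(s)) for s in sequences]
    attention_mask = [[1] * len(s) + [0] * (width - len(s)) for s in sequences]
    return input_ids, attention_mask


def to_key_padding_mask(attention_mask):
    """Invert to the PyTorch convention (True = padding position to ignore)."""
    return [[m == 0 for m in row] for row in attention_mask]


ids, mask = pad_batch([[1, 17, 42, 2], [1, 99, 2]])
key_padding_mask = to_key_padding_mask(mask)
print(ids)
print(mask)
print(key_padding_mask)
```

Passing the HF-style mask to a PyTorch encoder layer without this inversion would mask the real tokens and attend to the padding instead.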