Training in progress, step 250

Files changed (5)

README.md +10 -64
config.json +7 -8
model.safetensors +3 -0
sparsification_sftt.py +12 -5
training_args.bin +2 -2

README.md CHANGED

@@ -1,21 +1,19 @@
 ---
-license: apache-2.0
-base_model: mistralai/Mistral-7B-v0.1
 tags:
 - generated_from_trainer
 model-index:
-- name: Mistral_Sparse_refined_web_50p_graceful_True
   results: []
 ---
 <!-- This model card has been generated automatically according to the information the Trainer had access to. You
 should probably proofread and complete it, then remove this comment. -->
-# Mistral_Sparse_refined_web_50p_graceful_True
-This model is a fine-tuned version of [mistralai/Mistral-7B-v0.1](https://huggingface.co/mistralai/Mistral-7B-v0.1) on the None dataset.
 It achieves the following results on the evaluation set:
-- Loss: 2.3402
 ## Model description
@@ -35,72 +33,20 @@ More information needed
 The following hyperparameters were used during training:
 - learning_rate: 1e-05
-- train_batch_size: 1
-- eval_batch_size: 1
 - seed: 0
 - distributed_type: multi-GPU
-- num_devices: 4
 - gradient_accumulation_steps: 4
-- total_train_batch_size: 16
-- total_eval_batch_size: 4
 - optimizer: Adam with betas=(0.9,0.999) and epsilon=1e-08
 - lr_scheduler_type: linear
-- training_steps: 2500
 ### Training results
-| Training Loss | Epoch | Step | Validation Loss |
-|:-------------:|:-----:|:----:|:---------------:|
-| 3.7252        | 0.01  | 50   | 2.3893          |
-| 2.2531        | 0.02  | 100  | 2.4723          |
-| 2.32          | 0.02  | 150  | 2.4385          |
-| 2.2363        | 0.03  | 200  | 2.4210          |
-| 2.3078        | 0.04  | 250  | 2.4118          |
-| 2.2389        | 0.05  | 300  | 2.4025          |
-| 2.0902        | 0.06  | 350  | 2.3984          |
-| 2.2878        | 0.06  | 400  | 2.3965          |
-| 2.2485        | 0.07  | 450  | 2.3924          |
-| 2.2375        | 0.08  | 500  | 2.3895          |
-| 2.1901        | 0.09  | 550  | 2.3909          |
-| 2.1128        | 0.1   | 600  | 2.3886          |
-| 2.2983        | 0.1   | 650  | 2.3892          |
-| 2.2547        | 0.11  | 700  | 2.3873          |
-| 2.1322        | 0.12  | 750  | 2.3861          |
-| 2.2715        | 0.13  | 800  | 2.3827          |
-| 2.263         | 0.14  | 850  | 2.3845          |
-| 2.2066        | 0.14  | 900  | 2.3836          |
-| 2.2781        | 0.15  | 950  | 2.3837          |
-| 2.2597        | 0.16  | 1000 | 2.3778          |
-| 2.2642        | 0.17  | 1050 | 2.3764          |
-| 2.2296        | 0.18  | 1100 | 2.3805          |
-| 2.2289        | 0.18  | 1150 | 2.3784          |
-| 2.1372        | 0.19  | 1200 | 2.3773          |
-| 2.2059        | 0.2   | 1250 | 2.3732          |
-| 2.2847        | 0.21  | 1300 | 2.3719          |
-| 2.1404        | 0.22  | 1350 | 2.3739          |
-| 2.2261        | 0.22  | 1400 | 2.3752          |
-| 2.1713        | 0.23  | 1450 | 2.3750          |
-| 2.1787        | 0.24  | 1500 | 2.3732          |
-| 2.1866        | 0.25  | 1550 | 2.3759          |
-| 2.2471        | 0.26  | 1600 | 2.3760          |
-| 2.307         | 0.26  | 1650 | 2.3745          |
-| 2.2457        | 0.27  | 1700 | 2.3746          |
-| 2.2265        | 0.28  | 1750 | 2.3775          |
-| 2.163         | 0.29  | 1800 | 2.3797          |
-| 2.2411        | 0.3   | 1850 | 2.3760          |
-| 2.247         | 0.3   | 1900 | 2.3770          |
-| 2.2449        | 0.31  | 1950 | 2.3749          |
-| 2.1884        | 0.32  | 2000 | 2.3728          |
-| 2.1909        | 0.33  | 2050 | 2.3770          |
-| 2.2813        | 0.34  | 2100 | 2.3773          |
-| 2.2306        | 0.34  | 2150 | 2.3755          |
-| 2.2158        | 0.35  | 2200 | 2.3777          |
-| 2.1557        | 0.36  | 2250 | 2.3783          |
-| 2.2715        | 0.37  | 2300 | 2.3704          |
-| 2.2053        | 0.38  | 2350 | 2.3729          |
-| 2.2541        | 0.38  | 2400 | 2.3715          |
-| 2.0971        | 0.39  | 2450 | 2.3747          |
-| 2.2791        | 0.4   | 2500 | 2.3727          |
 ### Framework versions

 ---
 tags:
 - generated_from_trainer
 model-index:
+- name: Mistral_Sparse_refined_web_50p_graceful_False
   results: []
 ---
 <!-- This model card has been generated automatically according to the information the Trainer had access to. You
 should probably proofread and complete it, then remove this comment. -->
+# Mistral_Sparse_refined_web_50p_graceful_False
+This model is a fine-tuned version of [](https://huggingface.co/) on the None dataset.
 It achieves the following results on the evaluation set:
+- Loss: 10.3740
 ## Model description
 The following hyperparameters were used during training:
 - learning_rate: 1e-05
+- train_batch_size: 8
+- eval_batch_size: 16
 - seed: 0
 - distributed_type: multi-GPU
+- num_devices: 2
 - gradient_accumulation_steps: 4
+- total_train_batch_size: 64
+- total_eval_batch_size: 32
 - optimizer: Adam with betas=(0.9,0.999) and epsilon=1e-08
 - lr_scheduler_type: linear
+- training_steps: 200
 ### Training results
 ### Framework versions
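
The `total_train_batch_size` and `total_eval_batch_size` entries on both sides of the model card are consistent with the usual Trainer arithmetic: per-device batch size times device count, times gradient accumulation steps for training. A minimal sketch of that check follows; the function names are illustrative and not part of the repository.

```python
def total_train_batch_size(per_device: int, num_devices: int, grad_accum_steps: int) -> int:
    """Effective training batch size per optimizer step."""
    return per_device * num_devices * grad_accum_steps


def total_eval_batch_size(per_device: int, num_devices: int) -> int:
    """Effective evaluation batch size (no gradient accumulation)."""
    return per_device * num_devices


# Old model card: train_batch_size=1, eval_batch_size=1, num_devices=4, accumulation=4
old_train = total_train_batch_size(1, 4, 4)
old_eval = total_eval_batch_size(1, 4)

# New model card: train_batch_size=8, eval_batch_size=16, num_devices=2, accumulation=4
new_train = total_train_batch_size(8, 2, 4)
new_eval = total_eval_batch_size(16, 2)

print(old_train, old_eval, new_train, new_eval)
```

Both sides reproduce the values listed in the card (16 and 4 before, 64 and 32 after).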

config.json CHANGED

@@ -1,5 +1,4 @@
 {
-  "_name_or_path": "mistralai/Mistral-7B-v0.1",
   "architectures": [
     "SparseMistralforCausalLM"
   ],
@@ -11,23 +10,23 @@
   "bos_token_id": 1,
   "eos_token_id": 2,
   "hidden_act": "silu",
-  "hidden_size": 4096,
   "initializer_range": 0.02,
-  "intermediate_size": 14336,
-  "max_position_embeddings": 32768,
   "model_type": "sparse_mistral",
   "num_attention_heads": 32,
-  "num_hidden_layers": 32,
   "num_key_value_heads": 8,
-  "rms_norm_eps": 1e-05,
   "rope_theta": 10000.0,
   "sliding_window": 4096,
   "thresholds": null,
   "tie_word_embeddings": false,
-  "torch_dtype": "bfloat16",
   "transformers_version": "4.37.2",
   "us_sparse_regularization": true,
-  "use_cache": false,
   "use_sparse_model": true,
   "use_sparse_predictor": false,
   "use_sparse_regularization": false,

 {
   "architectures": [
     "SparseMistralforCausalLM"
   ],
   "bos_token_id": 1,
   "eos_token_id": 2,
   "hidden_act": "silu",
+  "hidden_size": 64,
   "initializer_range": 0.02,
+  "intermediate_size": 64,
+  "max_position_embeddings": 131072,
   "model_type": "sparse_mistral",
   "num_attention_heads": 32,
+  "num_hidden_layers": 2,
   "num_key_value_heads": 8,
+  "rms_norm_eps": 1e-06,
   "rope_theta": 10000.0,
   "sliding_window": 4096,
   "thresholds": null,
   "tie_word_embeddings": false,
+  "torch_dtype": "float32",
   "transformers_version": "4.37.2",
   "us_sparse_regularization": true,
+  "use_cache": true,
   "use_sparse_model": true,
   "use_sparse_predictor": false,
   "use_sparse_regularization": false,

model.safetensors ADDED

@@ -0,0 +1,3 @@
+version https://git-lfs.github.com/spec/v1
+oid sha256:58951f5db2bd738f8fb0a4e130aeca9b378176ee3d6b24bdfa9979b546866a9c
+size 16567728
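
As a rough sanity check, the pointer size above can be compared with a parameter count derived from the new `config.json`. The sketch below assumes a standard Mistral layout (bias-free projections, grouped-query attention, untied embeddings) and a `vocab_size` of 32000, which is the Mistral-7B-v0.1 default but is not shown in this diff. Any extra tensors the sparse model class may store are not counted.

```python
# Values taken from the new config.json in this commit.
hidden_size = 64
intermediate_size = 64
num_hidden_layers = 2
num_attention_heads = 32
num_key_value_heads = 8

# Assumption: vocab_size does not appear in the diff; 32000 is the Mistral-7B-v0.1 default.
vocab_size = 32000

head_dim = hidden_size // num_attention_heads
kv_dim = num_key_value_heads * head_dim

# Attention: q_proj and o_proj are hidden x hidden; k_proj and v_proj are hidden x kv_dim.
attn = 2 * hidden_size * hidden_size + 2 * hidden_size * kv_dim
# MLP: gate_proj, up_proj, down_proj.
mlp = 3 * hidden_size * intermediate_size
# Two RMSNorm weight vectors per decoder layer.
norms = 2 * hidden_size
per_layer = attn + mlp + norms

# Input embeddings plus a separate lm_head (tie_word_embeddings is false).
embeddings = 2 * vocab_size * hidden_size
# The trailing hidden_size term is the final RMSNorm.
total_params = embeddings + num_hidden_layers * per_layer + hidden_size

bytes_fp32 = total_params * 4  # torch_dtype is float32
lfs_size = 16567728            # size from the Git LFS pointer

print(total_params, bytes_fp32, lfs_size - bytes_fp32)
# prints: 4141376 16565504 2224
```

Under these assumptions the weights account for all but about 2 KB of the file, a remainder that would be plausible for the safetensors header, though the diff alone does not confirm this.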

sparsification_sftt.py CHANGED

@@ -4,7 +4,6 @@ from peft import PeftModel
 from datasets import Dataset
 from typing import Any, Dict, Union, Optional, Tuple
 from torch.nn import MSELoss
 import warnings
 import torch
 import torch.nn as nn
@@ -14,6 +13,9 @@ import time
 import os
 import copy
 from transformers.models.mistral.modeling_mistral import (
     MistralMLP,
     MistralModel,
@@ -516,6 +518,7 @@ class GracefulRegularizationScheduler(TrainerCallback):
         test_dataset: Dataset = None,
         targeted_sparsity: float = 0.5,
         keep_regularization_with_kill: bool = False,
     ):
         """Scheduler for regularizing the model first before applying the dead threshold.
@@ -533,6 +536,7 @@ class GracefulRegularizationScheduler(TrainerCallback):
         if self.is_enabled:
             print("GracefulRegularizationScheduler is enabled.")
         self.trainer = None
     def set_trainer(self, trainer):
         self.trainer = trainer
@@ -563,14 +567,17 @@ class GracefulRegularizationScheduler(TrainerCallback):
             # set_layer_specific_regularization(model.get_base_model())
             print_dead_neuron_stats(model.get_base_model())
-        if state.global_step % 2000 == 0:
             if is_mainprocess():
                 ds_print(
-                    f"Saving to /scr/lukeai/{self.model_name}_{state.global_step}.pt",
                 )
                 torch.save(
                     model.state_dict(),
-                    f"/scr/lukeai/{self.model_name}_{state.global_step}.pt",
                 )
@@ -727,7 +734,7 @@ def print_dead_neuron_stats(model):
         if isinstance(layer.mlp, MistralSparseSiluMLP):
             dead_percentage = layer.mlp.dead_percentage * 100
             agg_sparsity = layer.mlp.agg_sparsity * 100
-            ds_print(f"layer {i} threshold: {layer.mlp.dead_threshold:.3f}%")
             ds_print(f"layer {i} sparsity: {dead_percentage:.3f}%")
             ds_print(f"layer {i} agg sparsity: {agg_sparsity:.3f}%")
             total_sparsity += dead_percentage

 from datasets import Dataset
 from typing import Any, Dict, Union, Optional, Tuple
 from torch.nn import MSELoss
 import warnings
 import torch
 import torch.nn as nn
 import os
 import copy
+# from deepspeed.utils import save_state_dict
 from transformers.models.mistral.modeling_mistral import (
     MistralMLP,
     MistralModel,
         test_dataset: Dataset = None,
         targeted_sparsity: float = 0.5,
         keep_regularization_with_kill: bool = False,
+        start_steps: int = 0,
     ):
         """Scheduler for regularizing the model first before applying the dead threshold.
         if self.is_enabled:
             print("GracefulRegularizationScheduler is enabled.")
         self.trainer = None
+        self.start_steps = start_steps
     def set_trainer(self, trainer):
         self.trainer = trainer
             # set_layer_specific_regularization(model.get_base_model())
             print_dead_neuron_stats(model.get_base_model())
+        if state.global_step % 10 == 0:
             if is_mainprocess():
+                current_steps = self.start_steps + state.global_step
                 ds_print(
+                    f"Saving to /scr/lukeai/{self.model_name}_{current_steps}.pt",
                 )
+                # save_state_dict(model, f"/scr/lukeai/{self.model_name}_{state.global_step}.pt")
+                print("Saving a model...")
                 torch.save(
                     model.state_dict(),
+                    f"/scr/lukeai/{self.model_name}_{current_steps}.pt",
                 )
         if isinstance(layer.mlp, MistralSparseSiluMLP):
             dead_percentage = layer.mlp.dead_percentage * 100
             agg_sparsity = layer.mlp.agg_sparsity * 100
+            ds_print(f"layer {i} threshold: {layer.mlp.dead_threshold:.3f}")
             ds_print(f"layer {i} sparsity: {dead_percentage:.3f}%")
             ds_print(f"layer {i} agg sparsity: {agg_sparsity:.3f}%")
             total_sparsity += dead_percentage
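
The checkpoint change above offsets the step used in the file name by `start_steps`, so a resumed run does not overwrite files from an earlier run, and it saves every 10 steps instead of every 2000. A minimal sketch of that naming logic, isolated from the callback, is given below; `should_save` and `checkpoint_path` are hypothetical helper names, not functions in `sparsification_sftt.py`.

```python
def should_save(global_step: int, every: int = 10) -> bool:
    """Mirror of the `state.global_step % 10 == 0` condition in the diff."""
    return global_step % every == 0


def checkpoint_path(model_name: str, start_steps: int, global_step: int,
                    save_dir: str = "/scr/lukeai") -> str:
    """Build the checkpoint file name from the cumulative step count."""
    current_steps = start_steps + global_step
    return f"{save_dir}/{model_name}_{current_steps}.pt"


# A run resumed after 2000 earlier steps, now at its own step 250:
path = checkpoint_path("demo_model", start_steps=2000, global_step=250)
print(should_save(250), path)
```

Note that the modulo condition is also true at step 0, so a callback using it would save once before any training step if it is invoked at that point.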

training_args.bin CHANGED

@@ -1,3 +1,3 @@
 version https://git-lfs.github.com/spec/v1
-oid sha256:cbe7af3ff5bd4495c979a876ffd09e2ba96df0dad9a03fb0cf367d0f5cf1da79
-size 6520

 version https://git-lfs.github.com/spec/v1
+oid sha256:4ab002fb28887f771d81428c91f67235593bf02bce49639941825db88db1c965
+size 4728
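
The `training_args.bin` and `model.safetensors` entries are Git LFS pointer files: short text files of `key value` lines that stand in for the binary content. A small sketch of reading one, using the new `training_args.bin` pointer from this commit as input, could look like this; `parse_lfs_pointer` is an illustrative name.

```python
def parse_lfs_pointer(text: str) -> dict:
    """Parse the `key value` lines of a Git LFS pointer file."""
    fields = {}
    for line in text.strip().splitlines():
        key, _, value = line.partition(" ")
        fields[key] = value
    # The oid has the form "<hash-algorithm>:<hex digest>".
    algo, _, digest = fields["oid"].partition(":")
    return {
        "version": fields["version"],
        "hash_algo": algo,
        "digest": digest,
        "size": int(fields["size"]),
    }


pointer_text = """version https://git-lfs.github.com/spec/v1
oid sha256:4ab002fb28887f771d81428c91f67235593bf02bce49639941825db88db1c965
size 4728
"""

pointer = parse_lfs_pointer(pointer_text)
print(pointer["hash_algo"], pointer["size"])
```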