Add SetFit model

Files changed (13)

1_Pooling/config.json +1 -1
README.md +62 -62
config.json +12 -15
config_sentence_transformers.json +2 -2
config_setfit.json +2 -2
model.safetensors +2 -2
model_head.pkl +2 -2
modules.json +6 -0
sentence_bert_config.json +1 -1
special_tokens_map.json +6 -20
tokenizer.json +2 -2
tokenizer_config.json +23 -20
vocab.txt +0 -0

1_Pooling/config.json CHANGED

@@ -1,5 +1,5 @@
 {
-  "word_embedding_dimension": 768,
   "pooling_mode_cls_token": false,
   "pooling_mode_mean_tokens": true,
   "pooling_mode_max_tokens": false,

 {
+  "word_embedding_dimension": 384,
   "pooling_mode_cls_token": false,
   "pooling_mode_mean_tokens": true,
   "pooling_mode_max_tokens": false,
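
The pooling config keeps mean pooling enabled and only changes the embedding width from 768 to 384. As a rough illustration of what mean pooling does, here is a minimal pure-Python sketch. It assumes the usual attention-mask-weighted average; the function name `mean_pool` and the toy 2-dimensional vectors are hypothetical and are not taken from this repository.

```python
def mean_pool(token_embeddings, attention_mask):
    """Average the token vectors whose attention-mask entry is 1."""
    dim = len(token_embeddings[0])
    total = [0.0] * dim
    count = 0
    for vec, keep in zip(token_embeddings, attention_mask):
        if keep:
            count += 1
            for i, value in enumerate(vec):
                total[i] += value
    if count == 0:
        # No real tokens: return the zero vector instead of dividing by zero.
        return total
    return [t / count for t in total]


# Toy example: the third token is padding and is ignored.
tokens = [[1.0, 2.0], [3.0, 4.0], [100.0, 100.0]]
mask = [1, 1, 0]
pooled = mean_pool(tokens, mask)
```

With the new config, each real token vector would have 384 entries rather than 2, and the pooled sentence vector would have the same width.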

README.md CHANGED

@@ -1,8 +1,8 @@
 ---
-base_model: sentence-transformers/paraphrase-multilingual-mpnet-base-v2
 library_name: setfit
 metrics:
-- f1
 pipeline_tag: text-classification
 tags:
 - setfit
@@ -10,41 +10,27 @@ tags:
 - text-classification
 - generated_from_setfit_trainer
 widget:
-- text: Politically Motivated Murders Increased by 80% in October On November 7, news
-    outlets reported that murders due to political violence in Colombia increased
-    in October by 80%, according to the Resource Center for the Analysis of Conflicts.
-- text: Ils auraient menacé la femme d’un VDP et fouiller leur avant de repartir avec
-    une arme.
-- text: En rappel, cette décision de la réouverture des points de vente de céréales
-    au profit des personnes vulnerables pendant le premier trimestre de 2021, a été
-    prise par le Conseil des ministres du 24 février 2021.
-- text: IRC clinics have seen double the number of patients this month due to increasing
-    pressure on other facilities where there are PPE shortages or a reduction in health
-    staff who have had to self-isolate as a precaution.
-- text: Según los hallazgos de las instituciones que participaron en la misión recientemente,
-    se conoció que las comunidades que continúan en el resguardo están en riesgo de
-    desplazamiento hacia Montería, debido a la continuidad de combates, operaciones
-    militares y presencia activa del GDO.
 inference: true
-model-index:
-- name: SetFit with sentence-transformers/paraphrase-multilingual-mpnet-base-v2
-  results:
-  - task:
-      type: text-classification
-      name: Text Classification
-    dataset:
-      name: Unknown
-      type: unknown
-      split: test
-    metrics:
-    - type: f1
-      value: 0.7804878048780488
-      name: F1
 ---
-# SetFit with sentence-transformers/paraphrase-multilingual-mpnet-base-v2
-This is a [SetFit](https://github.com/huggingface/setfit) model that can be used for Text Classification. This SetFit model uses [sentence-transformers/paraphrase-multilingual-mpnet-base-v2](https://huggingface.co/sentence-transformers/paraphrase-multilingual-mpnet-base-v2) as the Sentence Transformer embedding model. A [LogisticRegression](https://scikit-learn.org/stable/modules/generated/sklearn.linear_model.LogisticRegression.html) instance is used for classification.
 The model has been trained using an efficient few-shot learning technique that involves:
@@ -55,9 +41,9 @@ The model has been trained using an efficient few-shot learning technique that i
 ### Model Description
 - **Model Type:** SetFit
-- **Sentence Transformer body:** [sentence-transformers/paraphrase-multilingual-mpnet-base-v2](https://huggingface.co/sentence-transformers/paraphrase-multilingual-mpnet-base-v2)
 - **Classification head:** a [LogisticRegression](https://scikit-learn.org/stable/modules/generated/sklearn.linear_model.LogisticRegression.html) instance
-- **Maximum Sequence Length:** 128 tokens
 - **Number of Classes:** 2 classes
 <!-- - **Training Dataset:** [Unknown](https://huggingface.co/datasets/unknown) -->
 <!-- - **Language:** Unknown -->
@@ -70,17 +56,10 @@ The model has been trained using an efficient few-shot learning technique that i
 - **Blogpost:** [SetFit: Efficient Few-Shot Learning Without Prompts](https://huggingface.co/blog/setfit)
 ### Model Labels
-| Label | Examples                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                    |
-|:------|:----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------|
-| 1     | <ul><li>'Meanwhile in Mentao camp, access had been cut off for more than a year, following a series of attacks.'</li><li>'La Fundación Desarrollo y Paz (Fundepaz) hizo pública este domingo, la amenaza que recibió el municipio de Cumbales, departamento de Nariño, por un supuesto panfleto atribuido al ELN en el que lanzan varias amenazas en contra de la población civil y ordenan restricciones a la movilidad.'</li><li>'Most focal points (82%) continue to report that living conditions have worsened for their communities since the beginning of the pandemic, more so in low-density areas (96%).All focal points in SDF and 94% in GoS areas say this, but in NSAG/TBAF areas, just over half report little change in people’s ability to meet their needs, and only 42% note a deterioration.'</li></ul> |
-| 0     | <ul><li>'The murders increased from 10 in August to 18 in October.'</li><li>'Cette tendance s’explique par le fait que ces trois (3) régions ont connu des attaques terroristes et en subissent les conséquences de l’insécurité.'</li><li>'In addition to Protection, Child Protection, and GBV focal points, the Protection Sector has also activated its Protection Emergency Response Units to disseminate messages and ensure that the most vulnerable are reached.'</li></ul>                                                                                                                                                                                                                                                                                                                                         |
-## Evaluation
-### Metrics
-| Label   | F1     |
-|:--------|:-------|
-| **all** | 0.7805 |
 ## Uses
@@ -98,9 +77,9 @@ Then you can load this model and run inference.
 from setfit import SetFitModel
 # Download from the 🤗 Hub
-model = SetFitModel.from_pretrained("Sfekih/sentence_independancy_model")
 # Run inference
-preds = model("Ils auraient menacé la femme d’un VDP et fouiller leur avant de repartir avec une arme.")
 ```
 <!--
@@ -132,12 +111,12 @@ preds = model("Ils auraient menacé la femme d’un VDP et fouiller leur avant d
 ### Training Set Metrics
 | Training set | Min | Median  | Max |
 |:-------------|:----|:--------|:----|
-| Word count   | 3   | 24.4407 | 78  |
 | Label | Training Sample Count |
 |:------|:----------------------|
-| 0     | 59                    |
-| 1     | 59                    |
 ### Training Hyperparameters
 - batch_size: (32, 32)
@@ -161,21 +140,42 @@ preds = model("Ils auraient menacé la femme d’un VDP et fouiller leur avant d
 ### Training Results
 | Epoch  | Step | Training Loss | Validation Loss |
 |:------:|:----:|:-------------:|:---------------:|
-| 0.0039 | 1    | 0.2854        | -               |
-| 0.1931 | 50   | 0.2645        | -               |
-| 0.3861 | 100  | 0.0945        | -               |
-| 0.5792 | 150  | 0.0022        | -               |
-| 0.7722 | 200  | 0.0008        | -               |
-| 0.9653 | 250  | 0.0006        | -               |
 ### Framework Versions
-- Python: 3.10.12
 - SetFit: 1.1.0
 - Sentence Transformers: 3.1.1
-- Transformers: 4.44.2
-- PyTorch: 2.4.1+cu121
-- Datasets: 3.0.1
-- Tokenizers: 0.19.1
 ## Citation

 ---
+base_model: sentence-transformers/all-MiniLM-L6-v2
 library_name: setfit
 metrics:
+- accuracy
 pipeline_tag: text-classification
 tags:
 - setfit
 - text-classification
 - generated_from_setfit_trainer
 widget:
+- text: Since the early morning of March 21st, 2021, different armed actions have
+    taken place in Venezuela in the state of Apure (Venezuela) that borders the department
+    of Arauca (Colombia).
+- text: 3 Clear, responsive and inclusive communication and open channels for raising
+    and addressing concerns is paramount at this time, consistent with accountability
+    to affected populations and Age-Gender and Diversity Principles, as is effective
+    and timely tracking and response to rumours.
+- text: Selon ces PDIs, des parents restés ou retournés au village les auraient informées
+    de l’amélioration de la situation sécuritaire.
+- text: Market supply has been impacted by significant deterioration of agricultural
+    service roads due to rainfall erosion, limiting any production delivery to consumption
+    centers.
+- text: Prevention of moderate acute malnutrition activities among children and PLW
+    of households vulnerable to food insecurity during the lean season are also underway
+    and WFP assisted a total of 17,471 children aged 6-23 months and 14,015 PLW.
 inference: true
 ---
+# SetFit with sentence-transformers/all-MiniLM-L6-v2
+This is a [SetFit](https://github.com/huggingface/setfit) model that can be used for Text Classification. This SetFit model uses [sentence-transformers/all-MiniLM-L6-v2](https://huggingface.co/sentence-transformers/all-MiniLM-L6-v2) as the Sentence Transformer embedding model. A [LogisticRegression](https://scikit-learn.org/stable/modules/generated/sklearn.linear_model.LogisticRegression.html) instance is used for classification.
 The model has been trained using an efficient few-shot learning technique that involves:
 ### Model Description
 - **Model Type:** SetFit
+- **Sentence Transformer body:** [sentence-transformers/all-MiniLM-L6-v2](https://huggingface.co/sentence-transformers/all-MiniLM-L6-v2)
 - **Classification head:** a [LogisticRegression](https://scikit-learn.org/stable/modules/generated/sklearn.linear_model.LogisticRegression.html) instance
+- **Maximum Sequence Length:** 256 tokens
 - **Number of Classes:** 2 classes
 <!-- - **Training Dataset:** [Unknown](https://huggingface.co/datasets/unknown) -->
 <!-- - **Language:** Unknown -->
 - **Blogpost:** [SetFit: Efficient Few-Shot Learning Without Prompts](https://huggingface.co/blog/setfit)
 ### Model Labels
+| Label | Examples                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                          |
+|:------|:--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------|
+| 1     | <ul><li>'In January, as part of its advocacy for the protection of civilians and human rights, the United Nations Joint Human Rights Office in the Democratic Republic of the Congo issued two public reports highlighting the upward trend in human rights violations and abuses committed in Ituri and North Kivu by armed groups, as well as by members of the national security and defence forces.'</li><li>'A son indépendance, en 1960, la RDC avait un PIB par habitant de 325 USD et était la deuxième économie la plus industrialisée d’Afrique, après l’Afrique du Sud.'</li><li>"Les populations les plus gravement touchées sont celles qui ont été déplacées, les groupes de réfugiés et de populations rentrées chez elles, les familles d'accueil et les populations victimes de catastrophes naturelles (inondations, glissements de terrain, incendies) ainsi que les ménages dont le chef de famille est une femme."</li></ul> |
+| 0     | <ul><li>'This may be driven by children’s varying levels of education and their different language skills,'</li><li>'Ce sont des travaux très pénibles qui nuisent à leur santé physique.'</li><li>'Screening and treatment of MAM were enabled for 10,184 children aged 6-59 months and 2,613 PLW.'</li></ul>                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                    |
 ## Uses
 from setfit import SetFitModel
 # Download from the 🤗 Hub
+model = SetFitModel.from_pretrained("setfit_model_id")
 # Run inference
+preds = model("Selon ces PDIs, des parents restés ou retournés au village les auraient informées de l’amélioration de la situation sécuritaire.")
 ```
 <!--
 ### Training Set Metrics
 | Training set | Min | Median  | Max |
 |:-------------|:----|:--------|:----|
+| Word count   | 1   | 25.2763 | 95  |
 | Label | Training Sample Count |
 |:------|:----------------------|
+| 0     | 295                   |
+| 1     | 313                   |
 ### Training Hyperparameters
 - batch_size: (32, 32)
 ### Training Results
 | Epoch  | Step | Training Loss | Validation Loss |
 |:------:|:----:|:-------------:|:---------------:|
+| 0.0008 | 1    | 0.4533        | -               |
+| 0.0376 | 50   | 0.3371        | -               |
+| 0.0752 | 100  | 0.2585        | -               |
+| 0.1128 | 150  | 0.2574        | -               |
+| 0.1504 | 200  | 0.2535        | -               |
+| 0.1880 | 250  | 0.2513        | -               |
+| 0.2256 | 300  | 0.2573        | -               |
+| 0.2632 | 350  | 0.246         | -               |
+| 0.3008 | 400  | 0.2471        | -               |
+| 0.3383 | 450  | 0.247         | -               |
+| 0.3759 | 500  | 0.2348        | -               |
+| 0.4135 | 550  | 0.2165        | -               |
+| 0.4511 | 600  | 0.1911        | -               |
+| 0.4887 | 650  | 0.1402        | -               |
+| 0.5263 | 700  | 0.0865        | -               |
+| 0.5639 | 750  | 0.049         | -               |
+| 0.6015 | 800  | 0.0279        | -               |
+| 0.6391 | 850  | 0.0188        | -               |
+| 0.6767 | 900  | 0.0108        | -               |
+| 0.7143 | 950  | 0.0072        | -               |
+| 0.7519 | 1000 | 0.0051        | -               |
+| 0.7895 | 1050 | 0.0039        | -               |
+| 0.8271 | 1100 | 0.0032        | -               |
+| 0.8647 | 1150 | 0.0039        | -               |
+| 0.9023 | 1200 | 0.0025        | -               |
+| 0.9398 | 1250 | 0.0024        | -               |
+| 0.9774 | 1300 | 0.0023        | -               |
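
The Epoch and Step columns can be used to estimate how many optimizer steps make up one epoch in each run. This is a small sketch, not part of the model card: the logged epoch values are rounded to four decimals, so the result is an estimate, and the helper name `steps_per_epoch` is hypothetical.

```python
def steps_per_epoch(step, epoch_fraction):
    """Estimate steps in one full epoch from a logged (epoch, step) pair."""
    return round(step / epoch_fraction)


# Last logged row of the new run and of the previous run.
new_run = steps_per_epoch(1300, 0.9774)
old_run = steps_per_epoch(250, 0.9653)

# Assuming the first entry of batch_size (32, 32) is the embedding-phase batch size.
new_pairs_per_epoch = new_run * 32
```

Under these assumptions the new run covers roughly five times as many steps per epoch as the previous one, which is consistent with the larger training set (608 samples instead of 118).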
 ### Framework Versions
+- Python: 3.11.5
 - SetFit: 1.1.0
 - Sentence Transformers: 3.1.1
+- Transformers: 4.45.1
+- PyTorch: 2.1.0
+- Datasets: 2.17.1
+- Tokenizers: 0.20.0
 ## Citation

config.json CHANGED

@@ -1,29 +1,26 @@
 {
-  "_name_or_path": "sentence-transformers/paraphrase-multilingual-mpnet-base-v2",
   "architectures": [
-    "XLMRobertaModel"
   ],
   "attention_probs_dropout_prob": 0.1,
-  "bos_token_id": 0,
   "classifier_dropout": null,
-  "eos_token_id": 2,
   "gradient_checkpointing": false,
   "hidden_act": "gelu",
   "hidden_dropout_prob": 0.1,
-  "hidden_size": 768,
   "initializer_range": 0.02,
-  "intermediate_size": 3072,
-  "layer_norm_eps": 1e-05,
-  "max_position_embeddings": 514,
-  "model_type": "xlm-roberta",
   "num_attention_heads": 12,
-  "num_hidden_layers": 12,
-  "output_past": true,
-  "pad_token_id": 1,
   "position_embedding_type": "absolute",
   "torch_dtype": "float32",
-  "transformers_version": "4.44.2",
-  "type_vocab_size": 1,
   "use_cache": true,
-  "vocab_size": 250002
 }

 {
+  "_name_or_path": "sentence-transformers/all-MiniLM-L6-v2",
   "architectures": [
+    "BertModel"
   ],
   "attention_probs_dropout_prob": 0.1,
   "classifier_dropout": null,
   "gradient_checkpointing": false,
   "hidden_act": "gelu",
   "hidden_dropout_prob": 0.1,
+  "hidden_size": 384,
   "initializer_range": 0.02,
+  "intermediate_size": 1536,
+  "layer_norm_eps": 1e-12,
+  "max_position_embeddings": 512,
+  "model_type": "bert",
   "num_attention_heads": 12,
+  "num_hidden_layers": 6,
+  "pad_token_id": 0,
   "position_embedding_type": "absolute",
   "torch_dtype": "float32",
+  "transformers_version": "4.45.1",
+  "type_vocab_size": 2,
   "use_cache": true,
+  "vocab_size": 30522
 }
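
The new config values are enough to estimate the parameter count of the encoder. The sketch below assumes the standard BERT layout (word, position, and token-type embeddings with a LayerNorm, six encoder layers, and a pooler); that layout is an assumption about `BertModel` and is not stated in this diff. The function name `bert_param_count` is hypothetical.

```python
def bert_param_count(vocab_size, hidden, intermediate, layers, max_pos, type_vocab):
    """Count weights and biases for a standard BERT encoder with a pooler."""
    layer_norm = 2 * hidden  # scale and bias
    embeddings = (vocab_size + max_pos + type_vocab) * hidden + layer_norm
    attention = 4 * (hidden * hidden + hidden)  # query, key, value, output
    feed_forward = (hidden * intermediate + intermediate) + (intermediate * hidden + hidden)
    per_layer = attention + layer_norm + feed_forward + layer_norm
    pooler = hidden * hidden + hidden
    return embeddings + layers * per_layer + pooler


# Values taken from the new config.json.
params = bert_param_count(
    vocab_size=30522, hidden=384, intermediate=1536,
    layers=6, max_pos=512, type_vocab=2,
)
float32_bytes = params * 4
head_dim = 384 // 12  # hidden_size divided by num_attention_heads
```

Under these assumptions the weights occupy about 90.85 MB in float32, slightly below the 90864192-byte `model.safetensors` recorded in this commit; the small remainder would be consistent with file header metadata.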

config_sentence_transformers.json CHANGED

@@ -1,8 +1,8 @@
 {
   "__version__": {
     "sentence_transformers": "3.1.1",
-    "transformers": "4.44.2",
-    "pytorch": "2.4.1+cu121"
   },
   "prompts": {},
   "default_prompt_name": null,

 {
   "__version__": {
     "sentence_transformers": "3.1.1",
+    "transformers": "4.45.1",
+    "pytorch": "2.1.0"
   },
   "prompts": {},
   "default_prompt_name": null,

config_setfit.json CHANGED

@@ -1,4 +1,4 @@
 {
-  "labels": null,
-  "normalize_embeddings": false
 }

 {
+  "normalize_embeddings": false,
+  "labels": null
 }

model.safetensors CHANGED

@@ -1,3 +1,3 @@
 version https://git-lfs.github.com/spec/v1
-oid sha256:947addd35bd056ba5d341ece63ea1fbd86968e47c186d47750f829202716a775
-size 1112197096

 version https://git-lfs.github.com/spec/v1
+oid sha256:65f71a8359b6303e5d02940c400afd85298b5e4f99c9cb9b263cd1b80a911138
+size 90864192
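
The two LFS pointer sizes give a quick sense of how much smaller the new checkpoint is. This sketch only does arithmetic on the byte counts above; it assumes float32 weights (the `torch_dtype` recorded in config.json) and ignores the small file header, so the parameter figures are upper bounds rather than exact counts.

```python
OLD_BYTES = 1_112_197_096  # previous model.safetensors
NEW_BYTES = 90_864_192     # model.safetensors after this commit
BYTES_PER_FLOAT32 = 4

shrink_factor = OLD_BYTES / NEW_BYTES
old_params_upper_bound = OLD_BYTES // BYTES_PER_FLOAT32
new_params_upper_bound = NEW_BYTES // BYTES_PER_FLOAT32

print(f"{shrink_factor:.1f}x smaller")
print(f"at most {old_params_upper_bound:,} -> {new_params_upper_bound:,} float32 values")
```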

model_head.pkl CHANGED

@@ -1,3 +1,3 @@
 version https://git-lfs.github.com/spec/v1
-oid sha256:64a1325c4ba131c26e47376887312c005bcb3e827e60eadef8ec4ff00e3828bb
-size 7007

 version https://git-lfs.github.com/spec/v1
+oid sha256:ef4b4c7c2956787b60c529eb67e22b32ad749e7ca5673d4abf66dd0d0204a62a
+size 3935

modules.json CHANGED

@@ -10,5 +10,11 @@
     "name": "1",
     "path": "1_Pooling",
     "type": "sentence_transformers.models.Pooling"
   }
 ]

     "name": "1",
     "path": "1_Pooling",
     "type": "sentence_transformers.models.Pooling"
+  },
+  {
+    "idx": 2,
+    "name": "2",
+    "path": "2_Normalize",
+    "type": "sentence_transformers.models.Normalize"
   }
 ]
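
The added `2_Normalize` module rescales each pooled embedding to unit L2 length. A minimal pure-Python sketch of that operation follows; the function name `l2_normalize` and the `eps` guard value are illustrative assumptions, not code from the library.

```python
import math


def l2_normalize(vec, eps=1e-12):
    """Scale a vector so that its Euclidean norm is 1."""
    norm = math.sqrt(sum(v * v for v in vec))
    # Guard against division by zero for an all-zero vector.
    return [v / max(norm, eps) for v in vec]


unit = l2_normalize([3.0, 4.0])
unit_norm = math.sqrt(sum(v * v for v in unit))
```

With unit-length embeddings, the dot product of two vectors equals their cosine similarity.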

sentence_bert_config.json CHANGED

@@ -1,4 +1,4 @@
 {
-  "max_seq_length": 128,
   "do_lower_case": false
 }

 {
+  "max_seq_length": 256,
   "do_lower_case": false
 }

special_tokens_map.json CHANGED

@@ -1,48 +1,34 @@
 {
-  "bos_token": {
-    "content": "<s>",
-    "lstrip": false,
-    "normalized": false,
-    "rstrip": false,
-    "single_word": false
-  },
   "cls_token": {
-    "content": "<s>",
-    "lstrip": false,
-    "normalized": false,
-    "rstrip": false,
-    "single_word": false
-  },
-  "eos_token": {
-    "content": "</s>",
     "lstrip": false,
     "normalized": false,
     "rstrip": false,
     "single_word": false
   },
   "mask_token": {
-    "content": "<mask>",
-    "lstrip": true,
     "normalized": false,
     "rstrip": false,
     "single_word": false
   },
   "pad_token": {
-    "content": "<pad>",
     "lstrip": false,
     "normalized": false,
     "rstrip": false,
     "single_word": false
   },
   "sep_token": {
-    "content": "</s>",
     "lstrip": false,
     "normalized": false,
     "rstrip": false,
     "single_word": false
   },
   "unk_token": {
-    "content": "<unk>",
     "lstrip": false,
     "normalized": false,
     "rstrip": false,

 {
   "cls_token": {
+    "content": "[CLS]",
     "lstrip": false,
     "normalized": false,
     "rstrip": false,
     "single_word": false
   },
   "mask_token": {
+    "content": "[MASK]",
+    "lstrip": false,
     "normalized": false,
     "rstrip": false,
     "single_word": false
   },
   "pad_token": {
+    "content": "[PAD]",
     "lstrip": false,
     "normalized": false,
     "rstrip": false,
     "single_word": false
   },
   "sep_token": {
+    "content": "[SEP]",
     "lstrip": false,
     "normalized": false,
     "rstrip": false,
     "single_word": false
   },
   "unk_token": {
+    "content": "[UNK]",
     "lstrip": false,
     "normalized": false,
     "rstrip": false,

tokenizer.json CHANGED

@@ -1,3 +1,3 @@
 version https://git-lfs.github.com/spec/v1
-oid sha256:cad551d5600a84242d0973327029452a1e3672ba6313c2a3c3d69c4310e12719
-size 17082987

 version https://git-lfs.github.com/spec/v1
+oid sha256:851ca67100d372ca3ae031a6abd168f53489eebfd7d89523f35c5c9b4d372c3c
+size 711649

tokenizer_config.json CHANGED

@@ -1,61 +1,64 @@
 {
   "added_tokens_decoder": {
     "0": {
-      "content": "<s>",
       "lstrip": false,
       "normalized": false,
       "rstrip": false,
       "single_word": false,
       "special": true
     },
-    "1": {
-      "content": "<pad>",
       "lstrip": false,
       "normalized": false,
       "rstrip": false,
       "single_word": false,
       "special": true
     },
-    "2": {
-      "content": "</s>",
       "lstrip": false,
       "normalized": false,
       "rstrip": false,
       "single_word": false,
       "special": true
     },
-    "3": {
-      "content": "<unk>",
       "lstrip": false,
       "normalized": false,
       "rstrip": false,
       "single_word": false,
       "special": true
     },
-    "250001": {
-      "content": "<mask>",
-      "lstrip": true,
       "normalized": false,
       "rstrip": false,
       "single_word": false,
       "special": true
     }
   },
-  "bos_token": "<s>",
-  "clean_up_tokenization_spaces": true,
-  "cls_token": "<s>",
-  "eos_token": "</s>",
-  "mask_token": "<mask>",
   "max_length": 128,
-  "model_max_length": 128,
   "pad_to_multiple_of": null,
-  "pad_token": "<pad>",
   "pad_token_type_id": 0,
   "padding_side": "right",
-  "sep_token": "</s>",
   "stride": 0,
-  "tokenizer_class": "XLMRobertaTokenizer",
   "truncation_side": "right",
   "truncation_strategy": "longest_first",
-  "unk_token": "<unk>"
 }

 {
   "added_tokens_decoder": {
     "0": {
+      "content": "[PAD]",
       "lstrip": false,
       "normalized": false,
       "rstrip": false,
       "single_word": false,
       "special": true
     },
+    "100": {
+      "content": "[UNK]",
       "lstrip": false,
       "normalized": false,
       "rstrip": false,
       "single_word": false,
       "special": true
     },
+    "101": {
+      "content": "[CLS]",
       "lstrip": false,
       "normalized": false,
       "rstrip": false,
       "single_word": false,
       "special": true
     },
+    "102": {
+      "content": "[SEP]",
       "lstrip": false,
       "normalized": false,
       "rstrip": false,
       "single_word": false,
       "special": true
     },
+    "103": {
+      "content": "[MASK]",
+      "lstrip": false,
       "normalized": false,
       "rstrip": false,
       "single_word": false,
       "special": true
     }
   },
+  "clean_up_tokenization_spaces": false,
+  "cls_token": "[CLS]",
+  "do_basic_tokenize": true,
+  "do_lower_case": true,
+  "mask_token": "[MASK]",
   "max_length": 128,
+  "model_max_length": 256,
+  "never_split": null,
   "pad_to_multiple_of": null,
+  "pad_token": "[PAD]",
   "pad_token_type_id": 0,
   "padding_side": "right",
+  "sep_token": "[SEP]",
   "stride": 0,
+  "strip_accents": null,
+  "tokenize_chinese_chars": true,
+  "tokenizer_class": "BertTokenizer",
   "truncation_side": "right",
   "truncation_strategy": "longest_first",
+  "unk_token": "[UNK]"
 }
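
The special-token ids move from the XLM-RoBERTa layout (0, 1, 2, 3, 250001) to the BERT layout (0, 100, 101, 102, 103). As a small sketch of how to read that mapping, the snippet below parses a trimmed copy of the new `added_tokens_decoder` block; the per-token flags are omitted for brevity and the helper name `special_token_ids` is hypothetical.

```python
import json

# Trimmed fragment of the new tokenizer_config.json (flags such as "lstrip" omitted).
NEW_CONFIG = json.loads("""
{
  "added_tokens_decoder": {
    "0": {"content": "[PAD]"},
    "100": {"content": "[UNK]"},
    "101": {"content": "[CLS]"},
    "102": {"content": "[SEP]"},
    "103": {"content": "[MASK]"}
  }
}
""")


def special_token_ids(config):
    """Map each special token string to its integer id."""
    return {
        entry["content"]: int(token_id)
        for token_id, entry in config["added_tokens_decoder"].items()
    }


ids = special_token_ids(NEW_CONFIG)
```

The `[PAD]` id of 0 agrees with the `"pad_token_id": 0` entry in the new config.json.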

vocab.txt ADDED

The diff for this file is too large to render.