dvilasuero HF staff committed on
Commit 9f2d78b · 1 Parent(s): f057b32

Update README.md

Files changed (1)
  1. README.md +35 -5
README.md CHANGED
@@ -10,10 +10,13 @@ datasets:
10
  ---
11
 
12
  # 😵‍💫🦙 Alpaca HalluciHunter
13
- <img src="front-image.png" alt="Alpaca Cleaned" width="200" height="150" >
14
 
 
15
 
16
- This is a cross-lingual [SetFit model](https://github.com/huggingface/setfit) to detect potentially bad instructions from Alpaca (and likely other synthetically generated instruction datasets).
17
 
18
  The model has been fine-tuned with 1,000 labeled examples from the AlpacaCleaned dataset. It leverages a multilingual sentence transformer `paraphrase-multilingual-mpnet-base-v2`, inspired by the findings from the SetFit paper (Section 6. Multilingual experiments.), where they trained models in English that performed well across languages.
19
 
@@ -23,8 +26,6 @@ It's a binary classifier with two labels:
23
  - `BAD INSTRUCTION`, there's an issue with the instruction, and/or input and output.
24
 
25
 
26
- This model can greatly speed up the validation of Alpaca Datasets, flagging examples that need to be fixed or simply discarded.
27
-
28
  ## Usage
29
 
30
  To use this model for inference, first install the SetFit library:
@@ -79,7 +80,7 @@ def get_predictions(texts):
79
  ds = ds.map(lambda batch: {"prediction": list(get_predictions(batch["text"]))}, batched=True)
80
  ```
81
 
82
- Load the data into Argilla for exploration and validation. You [need to launch Argilla](https://www.argilla.io/blog/launching-argilla-huggingface-hub):
83
  ```python
84
  # Replace api_url with the url to your HF Spaces URL if using Spaces
85
  # Replace api_key if you configured a custom API key
@@ -92,8 +93,37 @@ rg_dataset = rg.DatasetForTextClassification().from_datasets(ds)
92
  rg.log(records=rg_dataset, name="alpaca_to_clean")
93
  ```
94
 
95
  ## Examples
96
 
97
 
98
 
99
  ## BibTeX entry and citation info
 
10
  ---
11
 
12
  # 😵‍💫🦙 Alpaca HalluciHunter
 
13
 
14
+ This is a cross-lingual [SetFit model](https://github.com/huggingface/setfit) to detect potentially bad instructions from Alpaca. This model can greatly speed up the validation of Alpaca datasets, flagging examples that need to be fixed or simply discarded.
15
 
16
+
17
+ <div style="text-align:center;width:50%">
18
+ <img src="https://huggingface.co/argilla/alpaca-hallucihunter-multilingual/resolve/main/front-image.png" alt="Alpaca Cleaned">
19
+ </div>
20
 
21
  The model has been fine-tuned with 1,000 labeled examples from the AlpacaCleaned dataset. It leverages a multilingual sentence transformer `paraphrase-multilingual-mpnet-base-v2`, inspired by the findings from the SetFit paper (Section 6. Multilingual experiments.), where they trained models in English that performed well across languages.
22
 
 
26
  - `BAD INSTRUCTION`, there's an issue with the instruction, and/or input and output.
27
 
28
 
 
 
29
  ## Usage
30
 
31
  To use this model for inference, first install the SetFit library:
 
80
  ds = ds.map(lambda batch: {"prediction": list(get_predictions(batch["text"]))}, batched=True)
81
  ```
82
 
83
+ Load the data into Argilla for exploration and validation. First, you [need to launch Argilla](https://www.argilla.io/blog/launching-argilla-huggingface-hub). Then run:
84
  ```python
85
  # Replace api_url with the url to your HF Spaces URL if using Spaces
86
  # Replace api_key if you configured a custom API key
 
93
  rg.log(records=rg_dataset, name="alpaca_to_clean")
94
  ```
95
 
96
+ ## Live demo
97
+
98
+ You can explore the dataset using this Space (credentials: `argilla` / `1234`):
99
+
100
+ [https://huggingface.co/spaces/argilla/alpaca-hallucihunter](https://huggingface.co/spaces/argilla/alpaca-hallucihunter)
101
+
102
  ## Examples
103
 
104
+ This model has been tested with English, German, and Spanish. This approach will be used by ongoing efforts for improving the quality of Alpaca-based datasets, and updates will be reflected here.
105
+
106
+ Here are some of the highest-scored examples of `BAD INSTRUCTION`.
107
+
108
+
109
+ ### English
110
+
111
+ <div style="text-align:center;width:50%">
112
+ <img src="https://huggingface.co/argilla/alpaca-hallucihunter-multilingual/resolve/main/front-image.png" alt="Alpaca Cleaned">
113
+ </div>
114
+
115
+ ### German
116
+
117
+ <div style="text-align:center;width:50%">
118
+ <img src="https://huggingface.co/argilla/alpaca-hallucihunter-multilingual/resolve/main/german-alpaca.png" alt="Alpaca Cleaned">
119
+ </div>
120
+
121
+ ### Spanish
122
+ <div style="text-align:center;width:50%">
123
+ <img src="https://huggingface.co/argilla/alpaca-hallucihunter-multilingual/resolve/main/spanish-alpaca.png" alt="Alpaca Cleaned">
124
+ </div>
125
+
126
+
127
 
128
 
129
  ## BibTeX entry and citation info