---
language:
- en
license: apache-2.0
library_name: transformers
tags:
- classification
- information-extraction
- zero-shot
datasets:
- multi_nli
- xnli
- fancyzhx/dbpedia_14
- SetFit/bbc-news
- squad_v2
- race
metrics:
- accuracy
- f1
pipeline_tag: zero-shot-classification
---

**comprehend_it-base**

This model is based on [DeBERTaV3-base](https://huggingface.co/microsoft/deberta-v3-base) and was trained on natural language inference datasets as well as on multiple text classification datasets.

In a zero-shot setting, it outperforms [Bart-large-mnli](https://huggingface.co/facebook/bart-large-mnli) on a diverse set of text classification datasets while being less than half its size (184M vs. 407M parameters).

The model can also be used for multiple information extraction tasks in a zero-shot setting.

Possible use cases of the model:
* Text classification;
* Reranking of search results;
* Named-entity recognition;
* Relation extraction;
* Entity linking;
* Question-answering.

#### With the zero-shot classification pipeline

The model can be loaded with the `zero-shot-classification` pipeline like so:

```python
from transformers import pipeline
classifier = pipeline("zero-shot-classification",
                      model="knowledgator/comprehend_it-base")
```

You can then use this pipeline to classify sequences into any of the class names you specify.

```python
sequence_to_classify = "one day I will see the world"
candidate_labels = ['travel', 'cooking', 'dancing']
classifier(sequence_to_classify, candidate_labels)
#{'labels': ['travel', 'dancing', 'cooking'],
# 'scores': [0.9938651323318481, 0.0032737774308770895, 0.002861034357920289],
# 'sequence': 'one day I will see the world'}
```

If more than one candidate label can be correct, pass `multi_label=True` to calculate each class independently:

```python
candidate_labels = ['travel', 'cooking', 'dancing', 'exploration']
classifier(sequence_to_classify, candidate_labels, multi_label=True)
#{'labels': ['travel', 'exploration', 'dancing', 'cooking'],
# 'scores': [0.9945111274719238,
#  0.9383890628814697,
#  0.0057061901316046715,
#  0.0018193122232332826],
# 'sequence': 'one day I will see the world'}
```
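
To make the difference between the two modes concrete, the sketch below reproduces the normalization that the `zero-shot-classification` pipeline applies, using made-up logits instead of real model outputs. In the default mode, the entailment logits are passed through a softmax across all candidate labels, so the scores sum to 1. With `multi_label=True`, each label's entailment logit is normalized against its own contradiction logit, so every score is independent. The logit values and helper names are illustrative assumptions.

```python
import math

def softmax(values):
    """Numerically stable softmax over a list of floats."""
    peak = max(values)
    exps = [math.exp(v - peak) for v in values]
    total = sum(exps)
    return [e / total for e in exps]

def single_label_scores(entail_logits):
    """Default mode: labels compete with each other, scores sum to 1."""
    return softmax(entail_logits)

def multi_label_scores(entail_logits, contra_logits):
    """multi_label=True: entailment vs. contradiction, one label at a time."""
    return [softmax([c, e])[1] for e, c in zip(entail_logits, contra_logits)]

# Hypothetical logits for the labels ['travel', 'exploration', 'cooking'].
entail = [4.0, 3.0, -2.0]
contra = [-3.0, -2.0, 3.0]

single = single_label_scores(entail)
multi = multi_label_scores(entail, contra)
print([round(s, 3) for s in single])  # 'exploration' loses mass to 'travel'
print([round(s, 3) for s in multi])   # both plausible labels score high
```

This is why `exploration` can receive a high score next to `travel` in the multi-label output, whereas in the default mode the two labels would share a single probability budget.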


#### With manual PyTorch

```python
# Pose the sequence as an NLI premise and the label as a hypothesis
import torch
from transformers import AutoModelForSequenceClassification, AutoTokenizer

device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')
nli_model = AutoModelForSequenceClassification.from_pretrained('knowledgator/comprehend_it-base').to(device)
tokenizer = AutoTokenizer.from_pretrained('knowledgator/comprehend_it-base')

# Example inputs (illustrative values)
sequence = 'one day I will see the world'
label = 'travel'

premise = sequence
hypothesis = f'This example is {label}.'

# Run the pair through the NLI model, truncating only the premise if needed
x = tokenizer.encode(premise, hypothesis, return_tensors='pt',
                     truncation='only_first')
with torch.no_grad():
    logits = nli_model(x.to(device))[0]

# We throw away "neutral" (dim 1) and take the probability of
# "entailment" (2) as the probability of the label being true.
# Check nli_model.config.id2label to confirm this index order.
entail_contradiction_logits = logits[:, [0, 2]]
probs = entail_contradiction_logits.softmax(dim=1)
prob_label_is_true = probs[:, 1]
```

### Benchmarking
Below are the F1 scores on several text classification datasets. None of the tested models were fine-tuned on these datasets; all were evaluated in a zero-shot setting.

| Model                       | IMDB | AG_NEWS | Emotions |
|-----------------------------|------|---------|----------|
| [Bart-large-mnli (407 M)](https://huggingface.co/facebook/bart-large-mnli)      | 0.89 | 0.6887  | 0.3765   |
| [Deberta-base-v3 (184 M)](https://huggingface.co/cross-encoder/nli-deberta-v3-base)      | 0.85 | 0.6455  | 0.5095   |
| [Comprehendo (184M)](https://huggingface.co/knowledgator/comprehend_it-base)           | 0.90 | 0.7982  | 0.5660   |
| SetFit [BAAI/bge-small-en-v1.5 (33.4M)](https://huggingface.co/BAAI/bge-small-en-v1.5) | 0.86 | 0.5636 | 0.5754 |
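
For readers who want to run a comparable evaluation, here is a minimal sketch of an F1 computation over zero-shot predictions. The card does not say which F1 averaging was used; macro averaging is an assumption here, and the function name and sample labels are illustrative only.

```python
def macro_f1(y_true, y_pred):
    """Unweighted mean of per-class F1 scores (macro averaging, assumed)."""
    labels = sorted(set(y_true) | set(y_pred))
    per_class = []
    for label in labels:
        pairs = list(zip(y_true, y_pred))
        tp = sum(t == label and p == label for t, p in pairs)
        fp = sum(t != label and p == label for t, p in pairs)
        fn = sum(t == label and p != label for t, p in pairs)
        denom = 2 * tp + fp + fn
        # F1 = 2TP / (2TP + FP + FN); defined as 0 when the class never occurs
        per_class.append(2 * tp / denom if denom else 0.0)
    return sum(per_class) / len(per_class)

# In a real run, y_pred would be the top label returned by the classifier
# for each text, e.g. classifier(text, candidate_labels)['labels'][0].
y_true = ["pos", "neg", "pos", "neg"]
y_pred = ["pos", "pos", "pos", "neg"]
score = macro_f1(y_true, y_pred)
print(round(score, 4))
```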

### Few-shot learning
You can effectively fine-tune the model using 💧[LiqFit](https://github.com/Knowledgator/LiqFit). LiqFit is an easy-to-use framework for few-shot learning of cross-encoder models. 

Download and install `LiqFit` by running:

```bash
pip install liqfit
```

For the most up-to-date version, you can build from source code by executing:

```bash
pip install git+https://github.com/knowledgator/LiqFit.git
```

You need to process a dataset, initialize a model, choose a loss function, and set the training arguments. Read more in the quick-start section of the [documentation](https://docs.knowledgator.com/docs/frameworks/liqfit/quick-start).

```python
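from liqfit.modeling import LiqFitModel
from liqfit.losses import FocalLoss
from liqfit.collators import NLICollator
from transformers import (AutoModelForSequenceClassification, AutoTokenizer,
                          TrainingArguments, Trainer)

backbone_model = AutoModelForSequenceClassification.from_pretrained('knowledgator/comprehend_it-base')
tokenizer = AutoTokenizer.from_pretrained('knowledgator/comprehend_it-base')

loss_func = FocalLoss(multi_target=True)

model = LiqFitModel(backbone_model.config, backbone_model, loss_func=loss_func)

data_collator = NLICollator(tokenizer, max_length=128, padding=True, truncation=True)

# nli_train_dataset and nli_test_dataset are assumed to be prepared
# beforehand, as described in the quick-start documentation.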

training_args = TrainingArguments(
    output_dir='comprehendo',
    learning_rate=3e-5,
    per_device_train_batch_size=3,
    per_device_eval_batch_size=3,
    num_train_epochs=9,
    weight_decay=0.01,
    evaluation_strategy="epoch",
    save_steps = 5000,
    save_total_limit=3,
    remove_unused_columns=False,
)

trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=nli_train_dataset,
    eval_dataset=nli_test_dataset,
    tokenizer=tokenizer,
    data_collator=data_collator,
)

trainer.train()
```
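
The snippet above uses `nli_train_dataset` and `nli_test_dataset` without defining them. As a rough illustration of the idea behind few-shot training of a cross-encoder, the sketch below turns labeled examples into premise/hypothesis pairs with a binary target. This is not LiqFit's actual dataset format or API: the function name, the field names, and the negative-sampling scheme are assumptions made for illustration, and the quick-start documentation remains the reference for the real preprocessing.

```python
import random

def to_nli_pairs(texts, labels, classes, template="This example is {}.",
                 negatives_per_example=1, seed=0):
    """Turn (text, label) examples into premise/hypothesis pairs.

    Each example yields one positive pair (true class, target 1.0) and up to
    `negatives_per_example` negative pairs (other classes, target 0.0).
    """
    rng = random.Random(seed)
    pairs = []
    for text, label in zip(texts, labels):
        pairs.append({"premise": text,
                      "hypothesis": template.format(label),
                      "target": 1.0})
        others = [c for c in classes if c != label]
        count = min(negatives_per_example, len(others))
        for wrong in rng.sample(others, count):
            pairs.append({"premise": text,
                          "hypothesis": template.format(wrong),
                          "target": 0.0})
    return pairs

classes = ["joy", "anger", "sadness"]
pairs = to_nli_pairs(["I passed the exam!"], ["joy"], classes,
                     negatives_per_example=2)
for pair in pairs:
    print(pair)
```
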

### Benchmarks

| Model & examples per label | Emotion | AgNews | SST5 |
|-|-|-|-|
| Comprehend-it/0 | 56.60 | 79.82 | 37.9 |  
| Comprehend-it/8 | 63.38 | 85.9 | 46.67 |
| Comprehend-it/64 | 80.7 | 88 | 47 |
| SetFit/0 | 57.54 | 56.36 | 24.11 |
| SetFit/8 | 56.81 | 64.93 | 33.61 |  
| SetFit/64 | 79.03 | 88 | 45.38 |

### Alternative usage
Besides text classification, the model can be used for many other information extraction tasks.

**Question-answering**

The model can handle open question-answering as well as reading comprehension, provided the task can be framed as multiple-choice Q&A.
```python
#open question-answering
question = "What is the capital city of Ukraine?"
candidate_answers = ['Kyiv', 'London', 'Berlin', 'Warsaw']
classifier(question, candidate_answers)

#  'labels': ['Kyiv', 'Warsaw', 'London', 'Berlin'],
#  'scores': [0.8633171916007996,
#   0.11328165978193283,
#   0.012766502797603607,
#   0.010634596459567547]
```

```python
#reading comprehension
question = 'In what country is Normandy located?'
text = 'The Normans (Norman: Nourmands; French: Normands; Latin: Normanni) were the people who in the 10th and 11th centuries gave their name to Normandy, a region in France. They were descended from Norse ("Norman" comes from "Norseman") raiders and pirates from Denmark, Iceland and Norway who, under their leader Rollo, agreed to swear fealty to King Charles III of West Francia. Through generations of assimilation and mixing with the native Frankish and Roman-Gaulish populations, their descendants would gradually merge with the Carolingian-based cultures of West Francia. The distinct cultural and ethnic identity of the Normans emerged initially in the first half of the 10th century, and it continued to evolve over the succeeding centuries.'
input_ = f"{question}\n{text}"

candidate_answers = ['Denmark', 'Iceland', 'France', "Norway"]

classifier(input_, candidate_answers)

#  'labels': ['France', 'Iceland', 'Norway', 'Denmark'],
#  'scores': [0.9102861285209656,
#   0.03861876204609871,
#   0.028696594759821892,
#   0.02239849977195263]
```

```python
#binary question-answering
question = "Does drug development regulation become more aligned with modern technologies and trends, choose yes or no?"
text = "Drug development has become unbearably slow and expensive. A key underlying problem is the clinical prediction challenge: the inability to predict which drug candidates will be safe in the human body and for whom. Recently, a dramatic regulatory change has removed FDA's mandated reliance on antiquated, ineffective animal studies. A new frontier is an integration of several disruptive technologies [machine learning (ML), patient-on-chip, real-time sensing, and stem cells], which, when integrated, have the potential to address this challenge, drastically cutting the time and cost of developing drugs, and tailoring them to individual patients."
input_ = f"{question}\n{text}"

candidate_answers = ['yes', 'no']

classifier(input_, candidate_answers)

# 'labels': ['yes', 'no'],
#  'scores': [0.5876278281211853, 0.4123721718788147]
```
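
The three examples above share one pattern: join the question and any supporting text with a newline, pass the candidate answers as labels, and read off the top-scoring label. A minimal sketch of that pattern follows. The helper names are hypothetical, and the sample result is copied from the reading-comprehension output above so the block runs without loading the model.

```python
def build_qa_input(question, context=None):
    """Join a question and optional context the way the examples above do."""
    return question if context is None else f"{question}\n{context}"

def best_answer(result):
    """Return (label, score) for the highest-scoring candidate answer."""
    scores = result["scores"]
    top = max(range(len(scores)), key=scores.__getitem__)
    return result["labels"][top], scores[top]

# With the real pipeline this would be:
#   result = classifier(build_qa_input(question, text), candidate_answers)
result = {"labels": ['France', 'Iceland', 'Norway', 'Denmark'],
          "scores": [0.9102861285209656, 0.03861876204609871,
                     0.028696594759821892, 0.02239849977195263]}

answer, score = best_answer(result)
print(answer, round(score, 2))  # prints: France 0.91
```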

**Named-entity classification and disambiguation**

The model can be used to classify named entities or to disambiguate similar ones. It can serve as a reranker component in an entity-linking system.
```python
text = """Knowledgator is an open-source ML research organization focused on advancing the information extraction field."""

candidate_labels = ['Knowledgator - company',
 'Knowledgator - product', 
 'Knowledgator - city']

classifier(text, candidate_labels)

# 'labels': ['Knowledgator - company',
#   'Knowledgator - product',
#   'Knowledgator - city'],
#  'scores': [0.887371301651001, 0.097423255443573, 0.015205471776425838]
```
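
One way the reranking role could look inside an entity-linking system is sketched below: candidates retrieved from a knowledge base are formatted as `mention - description` labels, scored, and mapped back to their IDs. The entity IDs, the `rerank_candidates` helper, and the word-overlap `fake_classifier` are all hypothetical. The stand-in only mimics the pipeline's output shape so the example runs offline; in practice you would pass the real `classifier` instead.

```python
import re

def rerank_candidates(classifier, text, mention, candidates):
    """Order knowledge-base candidates for `mention` by classifier score.

    `classifier` is any callable with the zero-shot pipeline's interface:
    classifier(text, labels) -> {"labels": [...], "scores": [...]}.
    `candidates` maps an entity ID to a short, unique description.
    """
    label_to_id = {f"{mention} - {desc}": eid for eid, desc in candidates.items()}
    result = classifier(text, list(label_to_id))
    return [(label_to_id[label], score)
            for label, score in zip(result["labels"], result["scores"])]

def fake_classifier(text, labels):
    """Stand-in for the real pipeline: scores labels by word overlap."""
    words = set(re.findall(r"\w+", text.lower()))
    raw = [len(words & set(re.findall(r"\w+", label.lower()))) for label in labels]
    total = sum(raw) or 1
    ranked = sorted(zip(labels, raw), key=lambda pair: pair[1], reverse=True)
    return {"sequence": text,
            "labels": [label for label, _ in ranked],
            "scores": [value / total for _, value in ranked]}

candidates = {"E1": "organization", "E2": "product", "E3": "city"}
text = "Knowledgator is an open-source ML research organization."
ranking = rerank_candidates(fake_classifier, text, "Knowledgator", candidates)
print(ranking[0][0])  # ID of the best-matching candidate
```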

**Relation classification**

Following the same principle, the model can be used to classify relations in a text.
```python
text = """The FKBP5 gene codifies a co-chaperone protein associated with the modulation of glucocorticoid receptor interaction involved in the adaptive stress response. The FKBP5 intracellular concentration affects the binding affinity of the glucocorticoid receptor (GR) to glucocorticoids (GCs). This gene has glucocorticoid response elements (GRES) located in introns 2, 5 and 7, which affect its expression. Recent studies have examined GRE activity and the effects of genetic variants on transcript efficiency and their contribution to susceptibility to behavioral disorders. Epigenetic changes and environmental factors can influence the effects of these allele-specific variants, impacting the response to GCs of the FKBP5 gene. The main epigenetic mark investigated in FKBP5 intronic regions is DNA methylation, however, few studies have been performed for all GRES located in these regions. One of the major findings was the association of low DNA methylation levels in the intron 7 of FKBP5 in patients with psychiatric disorders. To date, there are no reports of DNA methylation in introns 2 and 5 of the gene associated with diagnoses of psychiatric disorders. This review highlights what has been discovered so far about the relationship between polymorphisms and epigenetic targets in intragenic regions, and reveals the gaps that need to be explored, mainly concerning the role of DNA methylation in these regions and how it acts in psychiatric disease susceptibility."""

candidate_labels = ['FKBP5-associated with -> PTSD',
 'FKBP5 - has no effect on -> PTSD',
 'FKBP5 - is similar to -> PTSD',
 'FKBP5 - inhibitor of-> PTSD',
 'FKBP5 - ancestor of -> PTSD']

classifier(text, candidate_labels)

#  'labels': ['FKBP5-associated with -> PTSD',
#   'FKBP5 - is similar to -> PTSD',
#   'FKBP5 - has no effect on -> PTSD',
#   'FKBP5 - ancestor of -> PTSD',
#   'FKBP5 - inhibitor of-> PTSD'],
#  'scores': [0.5880666971206665,
#   0.17369700968265533,
#   0.14067059755325317,
#   0.05044548586010933,
#   0.04712018370628357]
```

### Further reading
Check out our blog post, ["The new milestone in zero-shot capabilities (it’s not Generative AI)."](https://medium.com/p/9b5a081fbf27), where we highlight possible use cases of the model and explain why next-token prediction is not the only way to achieve amazing zero-shot capabilities.
While most of the AI industry is focused on generative AI and decoder-based models, we are committed to developing encoder-based models.
We aim to achieve the same level of generalization for such models as for their decoder counterparts. Encoders have several wonderful properties, such as bidirectional attention, and they are the best choice for many information extraction tasks in terms of efficiency and controllability.

### Feedback
We value your input! Share your feedback and suggestions to help us improve our models.
Fill out the feedback [form](https://forms.gle/5CPFFuLzNWznjcpL7)

### Join Our Discord
Connect with our community on Discord for news, support, and discussion about our models.
Join [Discord](https://discord.gg/dkyeAgs9DG)