|
--- |
|
language: en |
|
tags: |
|
- text-classification |
|
- onnx |
|
- bge-small-en-v1.5 |
|
- emotions |
|
- multi-class-classification |
|
- multi-label-classification |
|
datasets: |
|
- go_emotions |
|
models: |
|
- BAAI/bge-small-en-v1.5 |
|
license: mit |
|
inference: false |
|
widget: |
|
- text: ONNX is so much faster, it's very handy!
|
--- |
|
|
|
### Overview |
|
|
|
This is a multi-label, multi-class linear classifier for emotions. It works with [BGE-small-en-v1.5 embeddings](https://huggingface.co/BAAI/bge-small-en-v1.5) and was trained on the [go_emotions](https://huggingface.co/datasets/go_emotions) dataset.
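
For example, a minimal sketch of producing suitable embeddings with the sentence-transformers library (any pipeline that yields the 384-dimensional BGE-small-en-v1.5 sentence embeddings should work):

```python
# Sketch: produce BGE-small-en-v1.5 embeddings with sentence-transformers
from sentence_transformers import SentenceTransformer

encoder = SentenceTransformer("BAAI/bge-small-en-v1.5")
embeddings = encoder.encode(["ONNX is so much faster, it's very handy!"])
print(embeddings.shape)  # (1, 384)
```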
|
|
|
### Labels |
|
|
|
The 28 labels from the [go_emotions](https://huggingface.co/datasets/go_emotions) dataset are: |
|
``` |
|
['admiration', 'amusement', 'anger', 'annoyance', 'approval', 'caring', 'confusion', 'curiosity', 'desire', 'disappointment', 'disapproval', 'disgust', 'embarrassment', 'excitement', 'fear', 'gratitude', 'grief', 'joy', 'love', 'nervousness', 'optimism', 'pride', 'realization', 'relief', 'remorse', 'sadness', 'surprise', 'neutral'] |
|
``` |
|
|
|
### Metrics (across all labels)
|
|
|
This is a multi-label, multi-class dataset, so each label is effectively a separate binary classification. The metrics below are evaluated across all labels for each item in the go_emotions test split.
|
|
|
Tuning the threshold per label to maximise the F1 metric, the metrics (evaluated on the go_emotions test split, with each label weighted equally) are:
|
|
|
- Precision: 0.445 |
|
- Recall: 0.476 |
|
- F1: 0.449 |
|
|
|
Weighted by the relative support of each label in the dataset, the metrics are:
|
|
|
- Precision: 0.472 |
|
- Recall: 0.582 |
|
- F1: 0.514 |
|
|
|
Using a fixed threshold of 0.5 to convert the scores to binary predictions for each label, the metrics (evaluated on the go_emotions test split, and unweighted by support) are: |
|
|
|
- Precision: 0.602 |
|
- Recall: 0.250 |
|
- F1: 0.303 |
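
For reference, aggregates like these can be computed with scikit-learn; a sketch with dummy data standing in for the real `(n_items, 28)` prediction arrays:

```python
import numpy as np
from sklearn.metrics import precision_recall_fscore_support

# Dummy stand-ins: in practice these are binary arrays of shape (n_items, 28),
# with y_pred obtained by thresholding each label's positive-class score.
y_true = np.array([[1, 0, 1], [0, 1, 0], [1, 1, 0]])
y_pred = np.array([[1, 0, 0], [0, 1, 1], [1, 0, 0]])

# Unweighted: each label contributes equally to the average
print(precision_recall_fscore_support(y_true, y_pred, average='macro', zero_division=0))

# Weighted by the support of each label
print(precision_recall_fscore_support(y_true, y_pred, average='weighted', zero_division=0))
```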
|
|
|
### Metrics (per-label) |
|
|
|
This is a multi-label, multi-class dataset, so each label is effectively a separate binary classification and metrics are better measured per label. |
|
|
|
Tuning the threshold per label to maximise the F1 metric, the per-label metrics (evaluated on the go_emotions test split) are:
|
| | f1 | precision | recall | support | threshold | |
|
| -------------- | ----- | --------- | ------ | ------- | --------- | |
|
| admiration | 0.583 | 0.574 | 0.593 | 504 | 0.30 | |
|
| amusement | 0.668 | 0.722 | 0.621 | 264 | 0.25 | |
|
| anger | 0.350 | 0.309 | 0.404 | 198 | 0.15 | |
|
| annoyance | 0.299 | 0.318 | 0.281 | 320 | 0.20 | |
|
| approval | 0.338 | 0.281 | 0.425 | 351 | 0.15 | |
|
| caring | 0.321 | 0.323 | 0.319 | 135 | 0.20 | |
|
| confusion | 0.384 | 0.313 | 0.497 | 153 | 0.15 | |
|
| curiosity | 0.467 | 0.432 | 0.507 | 284 | 0.20 | |
|
| desire | 0.426 | 0.381 | 0.482 | 83 | 0.20 | |
|
| disappointment | 0.210 | 0.147 | 0.364 | 151 | 0.10 | |
|
| disapproval | 0.366 | 0.288 | 0.502 | 267 | 0.15 | |
|
| disgust | 0.416 | 0.409 | 0.423 | 123 | 0.20 | |
|
| embarrassment | 0.370 | 0.341 | 0.405 | 37 | 0.30 | |
|
| excitement | 0.313 | 0.368 | 0.272 | 103 | 0.25 | |
|
| fear | 0.615 | 0.677 | 0.564 | 78 | 0.40 | |
|
| gratitude | 0.828 | 0.810 | 0.847 | 352 | 0.25 | |
|
| grief | 0.545 | 0.600 | 0.500 | 6 | 0.85 | |
|
| joy | 0.455 | 0.429 | 0.484 | 161 | 0.20 | |
|
| love | 0.642 | 0.673 | 0.613 | 238 | 0.30 | |
|
| nervousness | 0.350 | 0.412 | 0.304 | 23 | 0.60 | |
|
| optimism | 0.439 | 0.417 | 0.462 | 186 | 0.20 | |
|
| pride | 0.480 | 0.667 | 0.375 | 16 | 0.70 | |
|
| realization | 0.232 | 0.191 | 0.297 | 145 | 0.10 | |
|
| relief | 0.353 | 0.500 | 0.273 | 11 | 0.50 | |
|
| remorse | 0.643 | 0.529 | 0.821 | 56 | 0.20 | |
|
| sadness | 0.526 | 0.497 | 0.558 | 156 | 0.20 | |
|
| surprise | 0.329 | 0.318 | 0.340 | 141 | 0.15 | |
|
| neutral | 0.634 | 0.528 | 0.794 | 1787 | 0.30 | |
|
|
|
The thresholds are stored in `thresholds.json`.
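
Thresholds like these can be found with a simple per-label grid search; a sketch, with toy single-label data standing in for the real scores and ground truth:

```python
import numpy as np
from sklearn.metrics import f1_score

def best_threshold(y_true_col, scores_col, grid=np.arange(5, 100, 5) / 100):
    """Return the threshold from `grid` that maximises F1 for a single label."""
    f1s = [f1_score(y_true_col, scores_col >= t, zero_division=0) for t in grid]
    return float(grid[int(np.argmax(f1s))])

# Toy data for one label; in practice this is run per label over the
# positive-class scores for a whole evaluation split.
y_true = np.array([0, 0, 1, 1, 1])
scores = np.array([0.10, 0.40, 0.35, 0.80, 0.70])
print(best_threshold(y_true, scores))  # 0.15: lowest threshold with the best F1 here
```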
|
|
|
### Use with ONNXRuntime |
|
|
|
The input to the model is called `logits`, and there is one output per label. Each output is a 2D array with one row per input row; within each row, the first column is the probability of the negative case and the second is the probability of the positive case.
|
|
|
```python |
|
# Assuming you have embeddings from BAAI/bge-small-en-v1.5 for the input sentences,
# e.g. produced with sentence-transformers (huggingface.co/BAAI/bge-small-en-v1.5)
# or with an ONNX version (huggingface.co/Xenova/bge-small-en-v1.5)
|
|
|
import onnxruntime as ort

print(embeddings.shape)  # e.g. a batch of 1 sentence
> (1, 384)

sess = ort.InferenceSession("path/to/model.onnx", providers=['CPUExecutionProvider'])
|
|
|
outputs = [o.name for o in sess.get_outputs()]  # list of labels, in the order of the outputs
preds_onnx = sess.run(outputs, {'logits': embeddings})
# preds_onnx is a list with 28 entries, one per label,
# each a numpy array of shape (1, 2) given the input was a batch of 1
|
|
|
print(outputs[0]) |
|
> surprise |
|
print(preds_onnx[0]) |
|
> array([[0.97136074, 0.02863926]], dtype=float32) |
|
|
|
# load thresholds.json and use that (per label) to convert the positive case score to a binary prediction |
|
``` |
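
Continuing the example above, the final step might look like this (a sketch assuming `thresholds.json` is a simple `{label: threshold}` mapping; check the file for its actual structure):

```python
import json

# Assumed structure: a {label: threshold} mapping, one entry per output label
with open("thresholds.json") as f:
    thresholds = json.load(f)

# Take the positive-case score (column 1) for each label and apply its threshold
predicted_labels = [
    label for label, scores in zip(outputs, preds_onnx)
    if scores[0, 1] >= thresholds[label]
]
print(predicted_labels)
```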
|
|
|
### Commentary on the dataset |
|
|
|
Some labels (e.g. gratitude), when considered independently, perform very strongly, whilst others (e.g. relief) perform very poorly.
|
|
|
This is a challenging dataset. Labels such as relief have far fewer examples in the training data (fewer than 100 out of the 40k+, and only 11 in the test split).
|
|
|
But there is also ambiguity and/or labelling error visible in the go_emotions training data, which likely constrains performance. Cleaning the dataset to reduce the mistakes, ambiguity, conflicts and duplication in the labelling would likely produce a higher-performing model.