|
--- |
|
language: en |
|
tags: |
|
- text-classification |
|
- onnx |
|
- bge-small-en-v1.5 |
|
- emotions |
|
- multi-class-classification |
|
- multi-label-classification |
|
datasets: |
|
- go_emotions |
|
models: |
|
- BAAI/bge-small-en-v1.5 |
|
license: mit |
|
inference: false |
|
widget: |
|
- text: ONNX is so much faster, it's very handy!
|
--- |
|
|
|
### Overview |
|
|
|
This is a multi-label, multi-class linear classifier for emotions. It works with [BGE-small-en-v1.5 embeddings](https://huggingface.co/BAAI/bge-small-en-v1.5) and was trained on the [go_emotions](https://huggingface.co/datasets/go_emotions) dataset.
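
For example, a minimal sketch of producing suitable embeddings with the sentence-transformers library (any pipeline that yields the 384-dimensional BGE-small-en-v1.5 sentence embeddings should work):

```python
# Sketch: produce BGE-small-en-v1.5 embeddings with sentence-transformers
from sentence_transformers import SentenceTransformer

encoder = SentenceTransformer("BAAI/bge-small-en-v1.5")
embeddings = encoder.encode(["ONNX is so much faster, it's very handy!"])
print(embeddings.shape)  # (1, 384)
```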
|
|
|
### Labels |
|
|
|
The 28 labels from the [go_emotions](https://huggingface.co/datasets/go_emotions) dataset are: |
|
``` |
|
['admiration', 'amusement', 'anger', 'annoyance', 'approval', 'caring', 'confusion', 'curiosity', 'desire', 'disappointment', 'disapproval', 'disgust', 'embarrassment', 'excitement', 'fear', 'gratitude', 'grief', 'joy', 'love', 'nervousness', 'optimism', 'pride', 'realization', 'relief', 'remorse', 'sadness', 'surprise', 'neutral'] |
|
``` |
|
|
|
### Metrics (across all labels)
|
|
|
This is a multi-label, multi-class dataset, so each label is effectively a separate binary classification. The metrics below are evaluated across all labels for each item in the go_emotions test split.
|
|
|
Tuning the threshold per label to maximise the F1 metric, the metrics (evaluated on the go_emotions test split, with each label weighted equally) are:
|
|
|
- Precision: 0.445 |
|
- Recall: 0.476 |
|
- F1: 0.449 |
|
|
|
Weighted by the relative support of each label in the dataset, the metrics are:
|
|
|
- Precision: 0.472 |
|
- Recall: 0.582 |
|
- F1: 0.514 |
|
|
|
Using a fixed threshold of 0.5 to convert the scores to binary predictions for each label, the metrics (evaluated on the go_emotions test split, and unweighted by support) are: |
|
|
|
- Precision: 0.602 |
|
- Recall: 0.250 |
|
- F1: 0.303 |
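
For reference, aggregates like these can be computed with scikit-learn; a sketch with dummy data standing in for the real `(n_items, 28)` prediction arrays:

```python
import numpy as np
from sklearn.metrics import precision_recall_fscore_support

# Dummy stand-ins: in practice these are binary arrays of shape (n_items, 28),
# with y_pred obtained by thresholding each label's positive-class score.
y_true = np.array([[1, 0, 1], [0, 1, 0], [1, 1, 0]])
y_pred = np.array([[1, 0, 0], [0, 1, 1], [1, 0, 0]])

# Unweighted: each label contributes equally to the average
print(precision_recall_fscore_support(y_true, y_pred, average='macro', zero_division=0))

# Weighted by the support of each label
print(precision_recall_fscore_support(y_true, y_pred, average='weighted', zero_division=0))
```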
|
|
|
### Metrics (per-label) |
|
|
|
This is a multi-label, multi-class dataset, so each label is effectively a separate binary classification and metrics are better measured per label. |
|
|
|
Tuning the threshold per label to maximise the F1 metric, the per-label metrics (evaluated on the go_emotions test split) are:
|
| | f1 | precision | recall | support | threshold | |
|
| -------------- | ----- | --------- | ------ | ------- | --------- | |
|
| admiration | 0.583 | 0.574 | 0.593 | 504 | 0.30 | |
|
| amusement | 0.668 | 0.722 | 0.621 | 264 | 0.25 | |
|
| anger | 0.350 | 0.309 | 0.404 | 198 | 0.15 | |
|
| annoyance | 0.299 | 0.318 | 0.281 | 320 | 0.20 | |
|
| approval | 0.338 | 0.281 | 0.425 | 351 | 0.15 | |
|
| caring | 0.321 | 0.323 | 0.319 | 135 | 0.20 | |
|
| confusion | 0.384 | 0.313 | 0.497 | 153 | 0.15 | |
|
| curiosity | 0.467 | 0.432 | 0.507 | 284 | 0.20 | |
|
| desire | 0.426 | 0.381 | 0.482 | 83 | 0.20 | |
|
| disappointment | 0.210 | 0.147 | 0.364 | 151 | 0.10 | |
|
| disapproval | 0.366 | 0.288 | 0.502 | 267 | 0.15 | |
|
| disgust | 0.416 | 0.409 | 0.423 | 123 | 0.20 | |
|
| embarrassment | 0.370 | 0.341 | 0.405 | 37 | 0.30 | |
|
| excitement | 0.313 | 0.368 | 0.272 | 103 | 0.25 | |
|
| fear | 0.615 | 0.677 | 0.564 | 78 | 0.40 | |
|
| gratitude | 0.828 | 0.810 | 0.847 | 352 | 0.25 | |
|
| grief | 0.545 | 0.600 | 0.500 | 6 | 0.85 | |
|
| joy | 0.455 | 0.429 | 0.484 | 161 | 0.20 | |
|
| love | 0.642 | 0.673 | 0.613 | 238 | 0.30 | |
|
| nervousness | 0.350 | 0.412 | 0.304 | 23 | 0.60 | |
|
| optimism | 0.439 | 0.417 | 0.462 | 186 | 0.20 | |
|
| pride | 0.480 | 0.667 | 0.375 | 16 | 0.70 | |
|
| realization | 0.232 | 0.191 | 0.297 | 145 | 0.10 | |
|
| relief | 0.353 | 0.500 | 0.273 | 11 | 0.50 | |
|
| remorse | 0.643 | 0.529 | 0.821 | 56 | 0.20 | |
|
| sadness | 0.526 | 0.497 | 0.558 | 156 | 0.20 | |
|
| surprise | 0.329 | 0.318 | 0.340 | 141 | 0.15 | |
|
| neutral | 0.634 | 0.528 | 0.794 | 1787 | 0.30 | |
|
|
|
The thresholds are stored in `thresholds.json`.
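
Thresholds like these can be found with a simple per-label grid search; a sketch, with toy single-label data standing in for the real scores and ground truth:

```python
import numpy as np
from sklearn.metrics import f1_score

def best_threshold(y_true_col, scores_col, grid=np.arange(5, 100, 5) / 100):
    """Return the threshold from `grid` that maximises F1 for a single label."""
    f1s = [f1_score(y_true_col, scores_col >= t, zero_division=0) for t in grid]
    return float(grid[int(np.argmax(f1s))])

# Toy data for one label; in practice this is run per label over the
# positive-class scores for a whole evaluation split.
y_true = np.array([0, 0, 1, 1, 1])
scores = np.array([0.10, 0.40, 0.35, 0.80, 0.70])
print(best_threshold(y_true, scores))  # 0.15: lowest threshold with the best F1 here
```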
|
|
|
### Use with ONNXRuntime |
|
|
|
The input to the model is called `logits`, and there is one output per label. Each output is a 2D array with one row per input row; within each row, the first column is the probability of the negative case and the second is the probability of the positive case.
|
|
|
```python |
|
# Assuming you have embeddings from BAAI/bge-small-en-v1.5 for the input sentences,
# e.g. produced with sentence-transformers (huggingface.co/BAAI/bge-small-en-v1.5)
# or with an ONNX version (huggingface.co/Xenova/bge-small-en-v1.5)
|
|
|
import onnxruntime as ort

print(embeddings.shape)  # e.g. a batch of 1 sentence
> (1, 384)

sess = ort.InferenceSession("path/to/model.onnx", providers=['CPUExecutionProvider'])
|
|
|
outputs = [o.name for o in sess.get_outputs()]  # list of labels, in the order of the outputs
preds_onnx = sess.run(outputs, {'logits': embeddings})
# preds_onnx is a list with 28 entries, one per label,
# each a numpy array of shape (1, 2) given the input was a batch of 1
|
|
|
print(outputs[0]) |
|
> surprise |
|
print(preds_onnx[0]) |
|
> array([[0.97136074, 0.02863926]], dtype=float32) |
|
|
|
# load thresholds.json and use that (per label) to convert the positive case score to a binary prediction |
|
``` |
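
Continuing the example above, the final step might look like this (a sketch assuming `thresholds.json` is a simple `{label: threshold}` mapping; check the file for its actual structure):

```python
import json

# Assumed structure: a {label: threshold} mapping, one entry per output label
with open("thresholds.json") as f:
    thresholds = json.load(f)

# Take the positive-case score (column 1) for each label and apply its threshold
predicted_labels = [
    label for label, scores in zip(outputs, preds_onnx)
    if scores[0, 1] >= thresholds[label]
]
print(predicted_labels)
```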
|
|
|
### Commentary on the dataset |
|
|
|
Some labels (e.g. gratitude), when considered independently, perform very strongly, whilst others (e.g. relief) perform very poorly.
|
|
|
This is a challenging dataset. Labels such as relief have far fewer examples in the training data (fewer than 100 out of the 40k+, and only 11 in the test split).
|
|
|
But there is also ambiguity and/or labelling error visible in the go_emotions training data, which likely constrains performance. Cleaning the dataset to reduce the mistakes, ambiguity, conflicts and duplication in the labelling would likely produce a higher-performing model.