initial commit

Files changed (8)

README.md +30 -0
added_tokens.json +1 -0
config.json +46 -0
pytorch_model.bin +3 -0
special_tokens_map.json +1 -0
tokenizer.json +0 -0
tokenizer_config.json +1 -0
vocab.txt +0 -0

README.md CHANGED

@@ -1,3 +1,33 @@
 ---
+tags:
+- bert
 license: mit
+language:
+  - ru
 ---
+# multilabel-context-russian-inapropriate-messages
+[BERT classifier from Skoltech](https://huggingface.co/Skoltech/russian-inappropriate-messages), finetuned on contextual data with 4 labels.
+# Training
+*Skoltech/russian-inappropriate-messages* was finetuned on multiclass data with four classes:
+1) OK label -- the message is OK in context and does not intend to offend or otherwise harm the reputation of the speaker.
+2) Toxic label -- the message might be seen as offensive in the given context.
+3) Severe toxic label -- the message is offensive, full of anger, and was written to provoke a fight or cause other discomfort.
+4) Risks label -- the message touches on sensitive topics and can harm the reputation of the speaker (e.g. religion, politics).
+The model was finetuned on DATASET_LINK.
+# Evaluation results
+The model achieves the following results:
+|                         | OK - Precision | OK - Recall | OK - F1-score | TOXIC - Precision | TOXIC - Recall | TOXIC - F1-score | SEVERE TOXIC - Precision | SEVERE TOXIC - Recall | SEVERE TOXIC - F1-score | RISKS - Precision | RISKS - Recall | RISKS - F1-score |
+|-------------------------|----------------|-------------|---------------|-------------------|----------------|------------------|--------------------------|-----------------------|-------------------------|-------------------|----------------|------------------|
+| DATASET_TWITTER val.csv | 0.883          | 0.913       | 0.896         | 0.368             | 0.330          | 0.348            | 0.515                    | 0.468                 | 0.490                   | 0.659             | 0.535          | 0.591            |
+| DATASET_GENA val.csv    | 0.953          | 0.927       | 0.940         | 0.260             | 0.343          | 0.295            | 0.666                    | 0.806                 | 0.729                   | 0.523             | 0.423          | 0.46             |
+The work was done during an internship at Tinkoff.
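
The F1 columns in the evaluation table can be cross-checked against the precision and recall columns. The sketch below is not part of the commit. It assumes the standard harmonic-mean definition of F1 and uses the DATASET_TWITTER row as input; `f1_score` and `twitter_row` are illustrative names. Because the table values are rounded to three decimals, small differences from the reported F1 are expected.

```python
def f1_score(precision: float, recall: float) -> float:
    """Harmonic mean of precision and recall; 0.0 when both are zero."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# (precision, recall, reported F1) copied from the DATASET_TWITTER row above.
twitter_row = {
    "ok": (0.883, 0.913, 0.896),
    "toxic": (0.368, 0.330, 0.348),
    "severe_toxic": (0.515, 0.468, 0.490),
    "risks": (0.659, 0.535, 0.591),
}

for label, (p, r, reported) in twitter_row.items():
    recomputed = f1_score(p, r)
    # Inputs are already rounded, so the last digit may differ slightly.
    print(f"{label}: recomputed={recomputed:.3f} reported={reported:.3f}")
```

The same helper can be applied to the DATASET_GENA row to see which reported values agree with their precision and recall.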

added_tokens.json ADDED

	@@ -0,0 +1 @@


+ {"[RESPONSE_TOKEN]": 100792}
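
The id assigned to `[RESPONSE_TOKEN]` has to fit inside the embedding matrix, whose row count is the `vocab_size` recorded in config.json in this same commit (100793). A minimal sanity check, with both values copied by hand from the diff; `out_of_range` is a hypothetical helper, not something shipped with the model:

```python
# Values copied from added_tokens.json and from "vocab_size" in config.json.
added_tokens = {"[RESPONSE_TOKEN]": 100792}
vocab_size = 100793


def out_of_range(tokens, size):
    """Return the tokens whose ids do not fit in an embedding matrix with `size` rows."""
    return {tok: idx for tok, idx in tokens.items() if not 0 <= idx < size}


bad = out_of_range(added_tokens, vocab_size)
print("all added tokens fit" if not bad else f"out of range: {bad}")
```

Here the new token takes the last available id (`vocab_size - 1`), which is what one would expect if the embedding matrix was grown by one row when the token was added.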

config.json ADDED

	@@ -0,0 +1,46 @@

+{
+  "_name_or_path": "./context-russian-inappropriate-messages/checkpoint-584/",
+  "architectures": [
+    "BertModel"
+  ],
+  "attention_probs_dropout_prob": 0.1,
+  "classifier_dropout": null,
+  "directionality": "bidi",
+  "gradient_checkpointing": false,
+  "hidden_act": "gelu",
+  "hidden_dropout_prob": 0.1,
+  "hidden_size": 768,
+  "id2label": {
+    "0": "ok",
+    "1": "risks",
+    "2": "severe_toxic",
+    "3": "toxic"
+  },
+  "initializer_range": 0.02,
+  "intermediate_size": 3072,
+  "label2id": {
+    "ok": 0,
+    "risks": 1,
+    "severe_toxic": 2,
+    "toxic": 3
+  },
+  "layer_norm_eps": 1e-12,
+  "max_position_embeddings": 512,
+  "model_type": "bert",
+  "num_attention_heads": 12,
+  "num_hidden_layers": 12,
+  "output_past": true,
+  "pad_token_id": 0,
+  "pooler_fc_size": 768,
+  "pooler_num_attention_heads": 12,
+  "pooler_num_fc_layers": 3,
+  "pooler_size_per_head": 128,
+  "pooler_type": "first_token_transform",
+  "position_embedding_type": "absolute",
+  "problem_type": "single_label_classification",
+  "torch_dtype": "float32",
+  "transformers_version": "4.18.0",
+  "type_vocab_size": 2,
+  "use_cache": true,
+  "vocab_size": 100793
+}
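
Note that the class order in `id2label` (ok, risks, severe_toxic, toxic) differs from the order in which the README lists the labels, so output indices should be decoded through this mapping rather than by README position. The sketch below shows one way to do that in plain Python. It assumes a classification head that emits four logits, which the config implies through `problem_type` but which `"architectures": ["BertModel"]` does not confirm. The logits in the example are made up.

```python
import math

# Copied from the "id2label" field of config.json above.
ID2LABEL = {0: "ok", 1: "risks", 2: "severe_toxic", 3: "toxic"}


def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def decode(logits):
    """Return (label, probability) for the highest-scoring class."""
    if len(logits) != len(ID2LABEL):
        raise ValueError(f"expected {len(ID2LABEL)} logits, got {len(logits)}")
    probs = softmax(logits)
    best = max(range(len(probs)), key=probs.__getitem__)
    return ID2LABEL[best], probs[best]


# Hypothetical logits, for illustration only.
label, prob = decode([0.2, 2.5, -1.0, 0.4])
print(label, round(prob, 3))
```

Taking the argmax matches `"problem_type": "single_label_classification"`, where exactly one class is predicted per input.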

pytorch_model.bin ADDED

	@@ -0,0 +1,3 @@

+version https://git-lfs.github.com/spec/v1
+oid sha256:3f83b623726d48538a077857238102133c33838562cb1b7706c07551ab291392
+size 653868081
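
The three added lines are a Git LFS pointer, not the weights themselves: the pointer records the spec version, the SHA-256 digest of the real file, and its size in bytes. As a rough illustration, the pointer can be parsed like this (`parse_lfs_pointer` is a hypothetical helper written for this note, not part of any Git LFS tooling):

```python
# Pointer text copied from the diff above, without the leading "+" markers.
POINTER = """\
version https://git-lfs.github.com/spec/v1
oid sha256:3f83b623726d48538a077857238102133c33838562cb1b7706c07551ab291392
size 653868081
"""


def parse_lfs_pointer(text):
    """Split each 'key value' line of a Git LFS pointer into a dict."""
    fields = dict(line.split(" ", 1) for line in text.splitlines() if line.strip())
    algo, digest = fields["oid"].split(":", 1)
    return {
        "version": fields["version"],
        "hash_algo": algo,
        "digest": digest,
        "size": int(fields["size"]),
    }


info = parse_lfs_pointer(POINTER)
print(info["hash_algo"], info["digest"][:12], f"{info['size'] / 1e6:.0f} MB")
```

The recorded size, 653868081 bytes, is about 654 MB. After downloading, the file can be verified by comparing its SHA-256 digest against the `oid` value.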

special_tokens_map.json ADDED

	@@ -0,0 +1 @@


+ {"unk_token": "[UNK]", "sep_token": "[SEP]", "pad_token": "[PAD]", "cls_token": "[CLS]", "mask_token": "[MASK]", "additional_special_tokens": ["[RESPONSE_TOKEN]"]}

tokenizer.json ADDED

The diff for this file is too large to render. See raw diff

tokenizer_config.json ADDED

	@@ -0,0 +1 @@

+ {"do_lower_case": false, "unk_token": "[UNK]", "sep_token": "[SEP]", "pad_token": "[PAD]", "cls_token": "[CLS]", "mask_token": "[MASK]", "tokenize_chinese_chars": true, "strip_accents": null, "do_basic_tokenize": true, "never_split": null, "use_fast": true, "special_tokens_map_file": "/root/.cache/huggingface/transformers/1f428acdde727eed5de979d6856ce350a470be2a64e134a1fdae04af78a27301.dd8bd9bfd3664b530ea4e645105f557769387b3da9f79bdb55ed556bdd80611d", "name_or_path": "./context-russian-inappropriate-messages/checkpoint-584/", "tokenizer_class": "BertTokenizer"}

vocab.txt ADDED

The diff for this file is too large to render. See raw diff