# DAEDRA: Determining Adverse Event Disposition for Regulatory Affairs

DAEDRA is a language model that predicts the disposition (outcome) of an adverse event from the text of the event report. It is intended for classifying reports in passive reporting systems and is trained on the [VAERS](https://vaers.hhs.gov/) dataset, which contains reports of adverse events following vaccination in the United States.

In [1]:
%pip install accelerate -U

/bin/bash: /anaconda/envs/azureml_py38_PT_TF/lib/libtinfo.so.6: no version information available (required by /bin/bash)
Note: you may need to restart the kernel to use updated packages.


In [2]:
%pip install transformers datasets shap watermark wandb evaluate codecarbon

Note: you may need to restart the kernel to use updated packages.


In [4]:
import pandas as pd
import numpy as np
import torch
import os
from typing import List, Union
from transformers import AutoTokenizer, Trainer, AutoModelForSequenceClassification, TrainingArguments, DataCollatorWithPadding, pipeline
from datasets import load_dataset, Dataset, DatasetDict
import shap
import wandb
import evaluate
from codecarbon import EmissionsTracker
import logging

wandb.finish()

logging.getLogger('codecarbon').propagate = False

os.environ["TOKENIZERS_PARALLELISM"] = "false"
tracker = EmissionsTracker()

%load_ext watermark

[codecarbon INFO @ 04:20:20] [setup] RAM Tracking...
[codecarbon INFO @ 04:20:20] [setup] GPU Tracking...
[codecarbon INFO @ 04:20:20] Tracking Nvidia GPU via pynvml
[codecarbon INFO @ 04:20:20] [setup] CPU Tracking...
[codecarbon INFO @ 04:20:21] CPU Model on constant consumption mode: Intel(R) Xeon(R) CPU E5-2690 v4 @ 2.60GHz
[codecarbon INFO @ 04:20:21] >>> Tracker's metadata:
[codecarbon INFO @ 04:20:21]   Platform system: Linux-5.15.0-1040-azure-x86_64-with-glibc2.10
[codecarbon INFO @ 04:20:21]   Python version: 3.8.5
[codecarbon INFO @ 04:20:21]   CodeCarbon version: 2.3.3
[codecarbon INFO @ 04:20:21]   Available RAM : 440.883 GB
[codecarbon INFO @ 04:20:21]   CPU count: 24
[codecarbon INFO @ 04:20:21]   CPU model: Intel(R) Xeon(R) CPU E5-2690 v4 @ 2.60GHz
[codecarbon INFO @ 04:20:21]   GPU count: 4
[codecarbon INFO @ 04:20:21]   GPU model: 4 x Tesla V100-PCIE-16GB


In [5]:
device: str = 'cuda' if torch.cuda.is_available() else 'cpu'

SEED: int = 42

BATCH_SIZE: int = 32
EPOCHS: int = 5
model_ckpt: str = "distilbert-base-uncased"

# WandB configuration
os.environ["WANDB_PROJECT"] = "DAEDRA multiclass model training" 
os.environ["WANDB_LOG_MODEL"] = "checkpoint"  # log all model checkpoints
os.environ["WANDB_NOTEBOOK_NAME"] = "DAEDRA.ipynb"

In [6]:
%watermark --iversion

re      : 2.2.1
pandas  : 2.0.2
evaluate: 0.4.1
logging : 0.5.1.2
torch   : 1.12.0
shap    : 0.44.1
wandb   : 0.16.2
numpy   : 1.23.5



In [7]:
!nvidia-smi

Mon Jan 29 04:20:46 2024       
+---------------------------------------------------------------------------------------+
| NVIDIA-SMI 535.129.03             Driver Version: 535.129.03   CUDA Version: 12.2     |
|-----------------------------------------+----------------------+----------------------+
| GPU  Name                 Persistence-M | Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp   Perf          Pwr:Usage/Cap |         Memory-Usage | GPU-Util  Compute M. |
|                                         |                      |               MIG M. |
|   0  Tesla V100-PCIE-16GB           Off | 00000001:00:00.0 Off |                  Off |
| N/A   26C    P0              25W / 250W |      4MiB / 16384MiB |      0%      Default |
|                                         |                      |                  N/A |
+------------------------------------

## Loading the data set

In [8]:
dataset = load_dataset("chrisvoncsefalvay/vaers-outcomes")

In [9]:
dataset

DatasetDict({
    train: Dataset({
        features: ['id', 'text', 'label'],
        num_rows: 1270444
    })
    test: Dataset({
        features: ['id', 'text', 'label'],
        num_rows: 272238
    })
    val: Dataset({
        features: ['id', 'text', 'label'],
        num_rows: 272238
    })
})

In [10]:
SUBSAMPLING = 1.0

if SUBSAMPLING < 1:
    _ = DatasetDict()
    for each in dataset.keys():
        _[each] = dataset[each].shuffle(seed=SEED).select(range(int(len(dataset[each]) * SUBSAMPLING)))

    dataset = _
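
The subsampling cell above shuffles each split with a fixed seed and keeps the first `SUBSAMPLING` fraction of rows. A minimal pure-Python sketch of the same idea follows, assuming a plain list stands in for a `Dataset`. The helper name `subsample` is hypothetical, and `random.Random` will not reproduce the row order that `Dataset.shuffle` produces; only the shape of the operation is the same.

```python
import random
from typing import List, Sequence, TypeVar

T = TypeVar("T")


def subsample(rows: Sequence[T], fraction: float, seed: int = 42) -> List[T]:
    """Shuffle a copy of `rows` deterministically and keep the first `fraction` of it.

    Hypothetical helper for illustration; it is not part of `datasets`.
    """
    if not 0.0 < fraction <= 1.0:
        raise ValueError("fraction must be in (0, 1]")
    shuffled = list(rows)
    # A dedicated Random instance keeps the shuffle reproducible for a given seed.
    random.Random(seed).shuffle(shuffled)
    # int() floors, so the sample never exceeds the requested fraction.
    return shuffled[: int(len(shuffled) * fraction)]


rows = list(range(1000))
sample = subsample(rows, 0.25)
print(len(sample))
```

Because the seed is fixed, calling `subsample` twice with the same arguments returns the same rows, which keeps subsampled runs comparable with each other.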

## Tokenisation and encoding

In [11]:
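def encode_ds(ds: Union[Dataset, DatasetDict], tokenizer_model: str = model_ckpt) -> Union[Dataset, DatasetDict]:
    # Tokenise the "text" column with the checkpoint's tokenizer, truncating to the model's maximum length.
    tokenizer = AutoTokenizer.from_pretrained(tokenizer_model)
    return ds.map(lambda x: tokenizer(x["text"], truncation=True), batched=True)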

## Evaluation metrics

In [12]:
accuracy = evaluate.load("accuracy")
precision, recall = evaluate.load("precision"), evaluate.load("recall")
f1 = evaluate.load("f1")

In [13]:
def compute_metrics(eval_pred):
    predictions, labels = eval_pred
    predictions = np.argmax(predictions, axis=1)
    return {
        'accuracy': accuracy.compute(predictions=predictions, references=labels)["accuracy"],
        'precision_macroaverage': precision.compute(predictions=predictions, references=labels, average='macro')["precision"],
        'precision_microaverage': precision.compute(predictions=predictions, references=labels, average='micro')["precision"],
        'recall_macroaverage': recall.compute(predictions=predictions, references=labels, average='macro')["recall"],
        'recall_microaverage': recall.compute(predictions=predictions, references=labels, average='micro')["recall"],
        'f1_microaverage': f1.compute(predictions=predictions, references=labels, average='micro')["f1"]
    }
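
`compute_metrics` reports both macro- and micro-averaged scores, and the two can diverge sharply on imbalanced data. The sketch below illustrates the difference for recall in pure Python on a made-up toy example. The function names are hypothetical, the data is invented for illustration, and the macro average here is taken over the classes present in the references, which is a simplification of what the `evaluate` metrics do.

```python
from typing import Dict, List


def per_class_recall(predictions: List[int], references: List[int]) -> Dict[int, float]:
    """Recall for every class that occurs in `references`."""
    recalls: Dict[int, float] = {}
    for cls in sorted(set(references)):
        support = sum(1 for r in references if r == cls)
        hits = sum(1 for p, r in zip(predictions, references) if r == cls and p == cls)
        recalls[cls] = hits / support
    return recalls


def macro_recall(predictions: List[int], references: List[int]) -> float:
    """Unweighted mean of per-class recall: every class counts equally."""
    recalls = per_class_recall(predictions, references)
    return sum(recalls.values()) / len(recalls)


def micro_recall(predictions: List[int], references: List[int]) -> float:
    """Recall pooled over all examples: every example counts equally."""
    hits = sum(1 for p, r in zip(predictions, references) if p == r)
    return hits / len(references)


# Toy data: 8 examples of class 0, 2 of class 1, and a model that always predicts 0.
references = [0] * 8 + [1] * 2
predictions = [0] * 10

print(macro_recall(predictions, references))  # 0.5
print(micro_recall(predictions, references))  # 0.8
```

The always-predict-the-majority model looks acceptable under micro-averaging but is penalised under macro-averaging, which is why it is useful to track both when outcome classes are unevenly represented.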

## Training

We specify a label map manually: although `Datasets` has a function for this, `AutoModelForSequenceClassification` requires an object with a length.

In [14]:
label_map = {i: label for i, label in enumerate(dataset["test"].features["label"].names)}
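
As a minimal sketch of what the label map looks like and how its inverse is derived, the example below uses placeholder label names. The names `outcome_a`, `outcome_b` and `outcome_c` are hypothetical; the real ones come from `dataset["test"].features["label"].names`.

```python
from typing import Dict, List

# Hypothetical label names, for illustration only.
names: List[str] = ["outcome_a", "outcome_b", "outcome_c"]

# Integer class ID -> human-readable label, as passed to `id2label`.
id2label: Dict[int, str] = {i: label for i, label in enumerate(names)}

# Human-readable label -> integer class ID, as passed to `label2id`.
label2id: Dict[str, int] = {label: i for i, label in id2label.items()}

# A dictionary has a length, so the number of classes can be read off it directly.
num_labels: int = len(id2label)

print(id2label)
print(label2id)
print(num_labels)
```

The two dictionaries are inverses of each other, so a predicted class ID can be turned back into a label and vice versa.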

In [15]:
tokenizer = AutoTokenizer.from_pretrained(model_ckpt)

cols = dataset["train"].column_names
cols.remove("label")
ds_enc = dataset.map(lambda x: tokenizer(x["text"], truncation=True), batched=True, remove_columns=cols)


Map: 100%|██████████| 1270444/1270444 [08:09<00:00, 2595.90 examples/s]


In [16]:

model = AutoModelForSequenceClassification.from_pretrained(model_ckpt, 
    num_labels=len(dataset["test"].features["label"].names), 
    id2label=label_map, 
    label2id={v:k for k,v in label_map.items()})

args = TrainingArguments(
    output_dir="vaers",
    evaluation_strategy="epoch",
    save_strategy="epoch",
    learning_rate=2e-5,
    per_device_train_batch_size=BATCH_SIZE,
    per_device_eval_batch_size=BATCH_SIZE,
    num_train_epochs=EPOCHS,
    weight_decay=.01,
    logging_steps=1,
    load_best_model_at_end=True,
    run_name="daedra-training",
    report_to=["wandb"])

trainer = Trainer(
        model=model,
        args=args,
        train_dataset=ds_enc["train"],
        eval_dataset=ds_enc["val"],  # select checkpoints on the validation split so "test" stays held out
        tokenizer=tokenizer,
        compute_metrics=compute_metrics)

Some weights of DistilBertForSequenceClassification were not initialized from the model checkpoint at distilbert-base-uncased and are newly initialized: ['classifier.bias', 'classifier.weight', 'pre_classifier.bias', 'pre_classifier.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


In [17]:
if SUBSAMPLING != 1.0:
    wandb_tag: List[str] = [f"subsample-{SUBSAMPLING}"]
else:
    wandb_tag: List[str] = ["full_sample"]

wandb_tag.append(f"batch_size-{BATCH_SIZE}")
wandb_tag.append(f"base:{model_ckpt}")
    
wandb.init(name="daedra_training_run", tags=wandb_tag, magic=True)

[34m[1mwandb[0m: Currently logged in as: [33mchrisvoncsefalvay[0m. Use [1m`wandb login --relogin`[0m to force relogin


In [18]:
tracker.start()
trainer.train()
tracker.stop()


Was asked to gather along dimension 0, but all input tensors were scalars; will instead unsqueeze and return a vector.


Epoch,Training Loss,Validation Loss


[codecarbon INFO @ 04:33:12] Energy consumed for RAM : 0.000689 kWh. RAM Power : 165.33123922348022 W
[codecarbon INFO @ 04:33:12] Energy consumed for all GPUs : 0.001450 kWh. Total GPU Power : 347.66451200921796 W
[codecarbon INFO @ 04:33:12] Energy consumed for all CPUs : 0.000177 kWh. Total CPU Power : 42.5 W
[codecarbon INFO @ 04:33:12] 0.002317 kWh of electricity used since the beginning.
[codecarbon INFO @ 04:33:27] Energy consumed for RAM : 0.001378 kWh. RAM Power : 165.33123922348022 W
[codecarbon INFO @ 04:33:27] Energy consumed for all GPUs : 0.004012 kWh. Total GPU Power : 615.4556826768763 W
[codecarbon INFO @ 04:33:27] Energy consumed for all CPUs : 0.000355 kWh. Total CPU Power : 42.5 W
[codecarbon INFO @ 04:33:27] 0.005745 kWh of electricity used since the beginning.
[codecarbon INFO @ 04:33:42] Energy consumed for RAM : 0.002066 kWh. RAM Power : 165.33123922348022 W
[codecarbon INFO @ 04:33:42] Energy consumed for all GPUs : 0.006596 kWh. Total GPU Power : 620.911021117
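
The CodeCarbon log lines above report cumulative energy per component together with the power draw. Assuming the 15-second gap between consecutive log timestamps is the measurement interval (an inference from the timestamps, not something the log states), the RAM and CPU figures for the first interval are consistent with energy = power × time. The sketch below checks that arithmetic; the helper `energy_kwh` is hypothetical and is not part of CodeCarbon.

```python
JOULES_PER_KWH = 3.6e6  # 1 kWh = 3.6 MJ


def energy_kwh(power_watts: float, seconds: float) -> float:
    """Energy in kWh drawn by a constant load of `power_watts` over `seconds`."""
    return power_watts * seconds / JOULES_PER_KWH


INTERVAL_S = 15.0  # assumed from the spacing of the log timestamps

# Power values copied from the first group of log lines.
ram_kwh = energy_kwh(165.33123922348022, INTERVAL_S)
cpu_kwh = energy_kwh(42.5, INTERVAL_S)

print(f"RAM: {ram_kwh:.6f} kWh")  # RAM: 0.000689 kWh
print(f"CPU: {cpu_kwh:.6f} kWh")  # CPU: 0.000177 kWh
```

The GPU figure does not follow from a single power reading in the same way, since GPU power varies between log lines.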

In [None]:
wandb.finish()

In [None]:
variant = "full_sample" if SUBSAMPLING == 1.0 else f"subsample-{SUBSAMPLING}"
tokenizer._tokenizer.save("tokenizer.json")
tokenizer.push_to_hub("chrisvoncsefalvay/daedra")
sample = "full sample" if SUBSAMPLING == 1.0 else f"{SUBSAMPLING * 100}% of the full sample"

model.push_to_hub("chrisvoncsefalvay/daedra", 
                  variant=variant,
                  commit_message=f"DAEDRA model trained on {sample} of the VAERS dataset (training set size: {dataset['train'].num_rows:,})")
