---
license: openrail
datasets:
- shainaraza/clinical_bias
language:
- en
metrics:
- f1
- accuracy
tags:
- NER
- Named Entity Recognition
- Bias
- Clinical
- Healthcare
---

## Clinical Bias NER Model

This is a Named Entity Recognition (NER) model trained on clinical text to detect biased language. The model identifies mentions of patient groups and conditions and marks them as potentially biased.

## Model Details

The model fine-tunes the distilbert-base-uncased transformer on the clinical notes dataset for 3 epochs with a batch size of 8, trained on Google Colab. It tags each token with one of two labels: O (non-biased) and BIAS (potentially biased). The BIAS labels were annotated manually by reviewing each record and identifying the sentences that contain bias.

## Performance

The model achieved an F1-score of 0.93 on the validation set of the dataset.

## Usage

The model can be used to identify potentially biased language in clinical text and can be integrated into a larger NLP pipeline or used as a standalone tool. To use it, import the `AutoModelForTokenClassification` and `AutoTokenizer` classes from the `transformers` library and load the model and tokenizer with `from_pretrained()`.

```
import torch
from transformers import AutoModelForTokenClassification, AutoTokenizer
from prettytable import PrettyTable

# Load the model and tokenizer from the Hugging Face model hub
model_name = "shainaraza/clinical-bias-ner"
model = AutoModelForTokenClassification.from_pretrained(model_name)
tokenizer = AutoTokenizer.from_pretrained(model_name)

# Define the text to classify
text = "The patient is a 50-year poor, take drugs and has aggressive behavior."

# Tokenize the text (round-tripping through encode/decode keeps the special tokens)
tokens = tokenizer.tokenize(tokenizer.decode(tokenizer.encode(text)))

# Convert tokens to input IDs and build an attention mask
input_ids = tokenizer.convert_tokens_to_ids(tokens)
attention_masks = [1] * len(input_ids)

# Prepare the input tensors (batch dimension of 1)
input_ids = torch.tensor(input_ids).unsqueeze(0)
attention_masks = torch.tensor(attention_masks).unsqueeze(0)

# Run the model and take the highest-scoring label for each token
with torch.no_grad():
    outputs = model(input_ids, attention_mask=attention_masks)
predicted_labels = torch.argmax(outputs.logits, dim=2)

# Convert predicted label IDs back to label names
predicted_labels = predicted_labels.squeeze().tolist()
predicted_labels = [model.config.id2label[label_id] for label_id in predicted_labels]

# Display the tokens and their predicted labels in a table
table = PrettyTable(['Token', 'Label'])
for token, label in zip(tokens, predicted_labels):
    table.add_row([token, label])

print(table)
```

This will output:

```
+------------+-------+
| Token      | Label |
+------------+-------+
| [CLS]      | O     |
| [UNK]      | O     |
| patient    | O     |
| is         | O     |
| a          | O     |
| 50         | O     |
| -          | O     |
| year       | O     |
| poor       | BIAS  |
| ,          | O     |
| take       | O     |
| drugs      | O     |
| and        | O     |
| has        | O     |
| aggressive | BIAS  |
| behavior   | O     |
| .          | O     |
| [SEP]      | O     |
+------------+-------+
```
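For quick experiments, the same model can also be wrapped in the `transformers` token-classification `pipeline`, which handles tokenization and label mapping internally. The following is a minimal sketch assuming the label set is exactly `O`/`BIAS` as described above; the pipeline drops `O`-labelled tokens by default, so only the potentially biased spans are returned.

```
from transformers import pipeline

# Token-classification pipeline around the same model; "simple" aggregation
# merges consecutive BIAS tokens into a single span.
bias_tagger = pipeline(
    "token-classification",
    model="shainaraza/clinical-bias-ner",
    aggregation_strategy="simple",
)

text = "The patient is a 50-year poor, take drugs and has aggressive behavior."

# Each entry contains the flagged span, its label, a confidence score,
# and character offsets into the original text.
for entity in bias_tagger(text):
    print(entity["word"], entity["entity_group"],
          round(float(entity["score"]), 3),
          (entity["start"], entity["end"]))
```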
## Limitations and Future Work

The model is not perfect and may not capture all instances of biased language. Note that it only flags potentially biased language; it makes no judgment about intent or impact. In future work, the model could be fine-tuned on a larger and more diverse dataset to improve its performance, and extended to identify specific types of biased language, such as ageism, racism, or sexism.

## Acknowledgments

This model was developed by Shaina Raza as part of her project.

## Contact

For any questions or comments, please contact shaina.raza@torontomu.ca.