Pathology notes NER Model Example

This script shows how to load our NER model and apply it to pathology notes.

Part 1: Define label list, load model and tokenizer

1.1 Define label list

The label list contains all labels of the IOB scheme: each entity or attribute has a B- (beginning) and an I- (inside) label, and words without a tag are labeled "O".

 ["B-Mutation", "B-ExpressionSignal", "B-PolaritySignal", "I-HematoDiagnosis", "I-MorphologicAbnormality", "B-Infection", "I-Infection", "I-Proliferation", "B-Hematopoiesis", "I-DiagnosisType", "I-CellAssociation", "B-SizeSignal", "I-ShiftSignal", "I-PolaritySignal", "O", "B-AmountSignal", "B-MalignancySignal", "I-SizeSignal", "I-OtherDiagnosis", "I-MalignancySignal", "B-Expression", "B-DiagnosisType", "B-Proliferation", "I-Expression", "B-QuantitySignal", "B-MorphologicAbnormality", "B-ShiftSignal", "B-HematoDiagnosis", "B-CellType", "B-OtherDiagnosis", "B-ClonalitySignal", "B-CellAssociation", "I-QuantitySignal", "I-Mutation", "I-Hematopoiesis", "I-CellType", "I-AmountSignal", "I-ClonalitySignal", "I-ExpressionSignal"]

label_list = ["B-Mutation", "B-ExpressionSignal", "B-Polarity", "I-HematoDiagnosis", "I-MorphologicAbnormality", "B-InfectiousAgent", "I-InfectiousAgent", "I-Proliferation", "B-Hematopoiesis", "I-DiagnosisType", "I-CellAssociation", "B-Size", "I-ShiftSignal", "I-Polarity", "O", "B-Amount", "B-MalignancySignal", "I-Size", "I-OtherDiagnosis", "I-MalignancySignal", "B-Expression", "B-DiagnosisType", "B-Proliferation", "I-Expression", "B-Quantity", "B-MorphologicAbnormality", "B-ShiftSignal", "B-HematoDiagnosis", "B-CellType", "B-OtherDiagnosis", "B-ClonalitySignal", "B-CellAssociation", "I-Quantity", "I-Mutation", "I-Hematopoiesis", "I-CellType", "I-Amount", "I-ClonalitySignal", "I-ExpressionSignal"]
label_list
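As a minimal sketch of the IOB scheme described above, the following helpers build B-/I- labels from entity types and check that every entity has both. The function names `build_iob_labels` and `unpaired_entities` are hypothetical and not part of the original script, and only two entity types from the list are used for brevity.

```python
def build_iob_labels(entity_types):
    """Return "O" plus a B- and an I- label for every entity type."""
    labels = ["O"]
    for entity in entity_types:
        labels.append(f"B-{entity}")
        labels.append(f"I-{entity}")
    return labels


def unpaired_entities(labels):
    """Return entity types that lack either their B- or their I- label."""
    begins = {label[2:] for label in labels if label.startswith("B-")}
    insides = {label[2:] for label in labels if label.startswith("I-")}
    return sorted(begins ^ insides)


example_labels = build_iob_labels(["Mutation", "CellType"])
print(example_labels)
# ['O', 'B-Mutation', 'I-Mutation', 'B-CellType', 'I-CellType']
print(unpaired_entities(example_labels))         # []
print(unpaired_entities(["O", "B-Mutation"]))    # ['Mutation']
```

Such a check is only a sanity test. The order of the real label_list presumably has to match the order used during fine-tuning, so the list should be copied rather than regenerated.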

1.2 Load fine-tuned NER model

# Create the class map between label names and integer ids
from datasets import ClassLabel
classmap = ClassLabel(num_classes=len(label_list), names=label_list)


# Load the fine-tuned model together with the id <-> label mappings
from transformers import AutoModelForTokenClassification
model = AutoModelForTokenClassification.from_pretrained(
    "GerMedBERT-best_model",
    num_labels=len(label_list),
    id2label={i: classmap.int2str(i) for i in range(classmap.num_classes)},
    label2id={c: classmap.str2int(c) for c in classmap.names},
)
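The two mappings passed to the model simply pair each label with its position in the label list. As an illustrative sketch (using a shortened, made-up list called `labels` instead of the full label_list), the same dictionaries can be built with plain Python:

```python
# Shortened, illustrative label list (assumption: not the full list above)
labels = ["O", "B-CellType", "I-CellType"]

# id2label: position in the list -> label name
id2label = dict(enumerate(labels))
# label2id: label name -> position in the list
label2id = {label: idx for idx, label in id2label.items()}

print(id2label)   # {0: 'O', 1: 'B-CellType', 2: 'I-CellType'}
print(label2id)   # {'O': 0, 'B-CellType': 1, 'I-CellType': 2}
```

Because the ids are list positions, reordering the label list changes which name is attached to each model output.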

1.3 Load tokenizer

# %% load tokenizer 
from transformers import AutoTokenizer
tokenizer = AutoTokenizer.from_pretrained("GerMedBERT/medbert-512")

Part 2: Application of the model to an example pathology note

2.1 Create nlp pipeline

# Create pipeline
from transformers import pipeline
import pandas as pd 

nlp = pipeline("ner", model=model, tokenizer=tokenizer)

2.2 First example in English and German

Although the model was trained only on annotated German texts, the following examples show that it also works on English text, though to a lesser extent.

# Example 1 in English and German
english_example1 = "Immunohistochemically, there is a slightly increased amount of plasma cells, which are partly situated in small groups (MUM1, CD138). "
german_example1 = "Immunhistochemisch zeigt sich eine leichte Vermehrung der Plasmazellen, die teils in kleinen Gruppen angeordnet sind (MUM1, CD138)"

# Print results of the English example
eng_results = nlp(english_example1)
df_eng1 = pd.DataFrame(eng_results)
print(df_eng1)

# Print results of the German example
ger_results = nlp(german_example1)
df_ger1 = pd.DataFrame(ger_results)
print(df_ger1)
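The pipeline returns one row per token, so an entity that spans several tokens appears as a B- row followed by I- rows. The sketch below shows one way to merge such rows into entity spans. The helper `merge_iob_tokens` is hypothetical, and the predictions are hand-written for illustration rather than actual model output; only the `entity`, `start` and `end` keys of the pipeline's result dictionaries are used.

```python
def merge_iob_tokens(text, token_predictions):
    """Merge per-token IOB predictions into (label, text) entity spans."""
    spans = []
    current = None  # span being built: [label, start, end]
    for tok in token_predictions:
        prefix, _, label = tok["entity"].partition("-")
        if prefix == "I" and current is not None and current[0] == label:
            current[2] = tok["end"]  # extend the open span
            continue
        if current is not None:
            spans.append(current)  # close the previous span
            current = None
        if prefix in ("B", "I"):
            current = [label, tok["start"], tok["end"]]
    if current is not None:
        spans.append(current)
    return [(label, text[start:end]) for label, start, end in spans]


# Hand-written, illustrative predictions (not real model output)
demo_text = "leichte Vermehrung der Plasmazellen"
demo_predictions = [
    {"entity": "B-ShiftSignal", "start": 8, "end": 18},
    {"entity": "B-CellType", "start": 23, "end": 29},
    {"entity": "I-CellType", "start": 29, "end": 35},
]
merged = merge_iob_tokens(demo_text, demo_predictions)
print(merged)
# [('ShiftSignal', 'Vermehrung'), ('CellType', 'Plasmazellen')]
```

The transformers pipeline also accepts an aggregation_strategy argument (for example "simple") that groups tokens automatically; the manual version here only makes the merging logic explicit.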

2.3 Second example in English and German

english_example2 = "The diffuse infiltrates of blasts show a homogeneous and strong expression of CD20 and CD10 in absence of CD3, BCL-2, and TDT."
german_example2 = "Diffuse Blasteninfiltrate zeigen eine homogene und starke Expression von CD20 und CD10 in Abwesenheit von CD3, BCL-2 und TDT."

# Print results of the English example
eng_results = nlp(english_example2)
df_eng2 = pd.DataFrame(eng_results)
print(df_eng2)

# Print results of the German example
ger_results = nlp(german_example2)
df_ger2 = pd.DataFrame(ger_results)
print(df_ger2)
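To compare the English and German outputs more systematically than by reading the two tables, one option is to tally the predicted entity types in each result list. The sketch below assumes the result format of the pipeline (a list of dictionaries with an `entity` key). The helper `count_entity_types` and the demo lists are hypothetical and do not reflect actual model output.

```python
from collections import Counter


def count_entity_types(token_predictions):
    """Count predicted entities per type, counting only B- tokens."""
    return Counter(
        tok["entity"][2:]
        for tok in token_predictions
        if tok["entity"].startswith("B-")
    )


# Hand-written, illustrative predictions (not real model output)
eng_demo = [
    {"entity": "B-Expression"},
    {"entity": "I-Expression"},
    {"entity": "B-CellType"},
]
ger_demo = [
    {"entity": "B-Expression"},
    {"entity": "B-CellType"},
    {"entity": "B-CellType"},
]

eng_counts = count_entity_types(eng_demo)
ger_counts = count_entity_types(ger_demo)

# Positive values: more entities of that type found in the German text
difference = {
    label: ger_counts[label] - eng_counts[label]
    for label in sorted(set(eng_counts) | set(ger_counts))
}
print(difference)  # {'CellType': 1, 'Expression': 0}
```

In the script above, eng_results and ger_results could be passed to the helper in place of the demo lists.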
