NER on hematologic pathology notes
BERT models for corpus validation, as described in our paper.
This script provides the code for using our NER model.
The label list contains all labels in the IOB scheme: each entity or attribute has a B- (beginning) and an I- (inside) label. Words with no tag are labeled "O".
["B-Mutation", "B-ExpressionSignal", "B-PolaritySignal", "I-HematoDiagnosis", "I-MorphologicAbnormality", "B-Infection", "I-Infection", "I-Proliferation", "B-Hematopoiesis", "I-DiagnosisType", "I-CellAssociation", "B-SizeSignal", "I-ShiftSignal", "I-PolaritySignal", "O", "B-AmountSignal", "B-MalignancySignal", "I-SizeSignal", "I-OtherDiagnosis", "I-MalignancySignal", "B-Expression", "B-DiagnosisType", "B-Proliferation", "I-Expression", "B-QuantitySignal", "B-MorphologicAbnormality", "B-ShiftSignal", "B-HematoDiagnosis", "B-CellType", "B-OtherDiagnosis", "B-ClonalitySignal", "B-CellAssociation", "I-QuantitySignal", "I-Mutation", "I-Hematopoiesis", "I-CellType", "I-AmountSignal", "I-ClonalitySignal", "I-ExpressionSignal"]
label_list = ["B-Mutation", "B-ExpressionSignal", "B-Polarity", "I-HematoDiagnosis", "I-MorphologicAbnormality", "B-InfectiousAgent", "I-InfectiousAgent", "I-Proliferation", "B-Hematopoiesis", "I-DiagnosisType", "I-CellAssociation", "B-Size", "I-ShiftSignal", "I-Polarity", "O", "B-Amount", "B-MalignancySignal", "I-Size", "I-OtherDiagnosis", "I-MalignancySignal", "B-Expression", "B-DiagnosisType", "B-Proliferation", "I-Expression", "B-Quantity", "B-MorphologicAbnormality", "B-ShiftSignal", "B-HematoDiagnosis", "B-CellType", "B-OtherDiagnosis", "B-ClonalitySignal", "B-CellAssociation", "I-Quantity", "I-Mutation", "I-Hematopoiesis", "I-CellType", "I-Amount", "I-ClonalitySignal", "I-ExpressionSignal"]
#create Classmap
from datasets import ClassLabel
classmap = ClassLabel(num_classes=len(label_list), names=label_list)
#load model
from transformers import AutoModelForTokenClassification
model = AutoModelForTokenClassification.from_pretrained(
    "GerMedBERT-best_model",
    num_labels=len(label_list),
    id2label={i: classmap.int2str(i) for i in range(classmap.num_classes)},
    label2id={c: classmap.str2int(c) for c in classmap.names},
)
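The `id2label` and `label2id` arguments are plain dictionaries that translate between integer class ids and label strings, with ids following the order of the label list. As a minimal sketch, the same mappings can be built without `ClassLabel`. The name `example_labels` and its short contents are assumptions made for illustration; in practice the full `label_list` would be used.

```python
# Illustrative subset of the label list (assumption: ids follow list order).
example_labels = ["B-Mutation", "B-ExpressionSignal", "B-Polarity", "O"]

# id -> label, following the order of the list
id2label = dict(enumerate(example_labels))
# label -> id, the inverse mapping
label2id = {label: i for i, label in id2label.items()}

print(id2label[0])
print(label2id["O"])
```

Because the ids depend on list order, the label list must be in the same order at inference time as it was during training.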
# %% load tokenizer
from transformers import AutoTokenizer
tokenizer = AutoTokenizer.from_pretrained("GerMedBERT/medbert-512")
# Create pipeline
from transformers import pipeline
import pandas as pd
nlp = pipeline("ner", model=model, tokenizer=tokenizer)
The following examples show that, although the model was trained only on annotated German texts, it also works on English text, though less well.
# Example 1 in English and German
english_example1 = "Immunohistochemically, there is a slightly increased amount of plasma cells, which are partly situated in small groups (MUM1, CD138). "
german_example1 = "Immunhistochemisch zeigt sich eine leichte Vermehrung der Plasmazellen, die teils in kleinen Gruppen angeordnet sind (MUM1, CD138)"
# Run the pipeline on the English example and print the results
eng_results = nlp(english_example1)
df_eng1 = pd.DataFrame(eng_results)
print(df_eng1)
# Run the pipeline on the German example and print the results
ger_results = nlp(german_example1)
df_ger1 = pd.DataFrame(ger_results)
print(df_ger1)
english_example2 = "The diffuse infiltrates of blasts show a homogeneous and strong expression of CD20 and CD10 in absence of CD3, BCL-2, and TDT."
german_example2 = "Diffuse Blasteninfiltrate zeigen eine homogene und starke Expression von CD20 und CD10 in Abwesenheit von CD3, BCL-2 und TDT."
# Run the pipeline on the English example and print the results
eng_results = nlp(english_example2)
df_eng2 = pd.DataFrame(eng_results)
print(df_eng2)
# Run the pipeline on the German example and print the results
ger_results = nlp(german_example2)
df_ger2 = pd.DataFrame(ger_results)
print(df_ger2)
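Without aggregation, the `ner` pipeline returns one dictionary per tagged token, with keys such as `entity`, `index`, `word`, `start`, and `end`. The sketch below shows one way to merge consecutive B-/I- tokens into entity spans. The function name `merge_iob` is hypothetical, and the token predictions are hand-written for illustration; they are not actual model output. Subword pieces get no special handling here.

```python
def merge_iob(tokens, text):
    """Merge token-level IOB predictions into entity spans.

    `tokens` is a list of dicts with "entity", "index", "start" and "end" keys.
    An I- token extends the previous span only if it has the same entity type
    and directly follows the previous tagged token.
    """
    spans = []
    prev_index = None
    for tok in tokens:
        prefix, _, label = tok["entity"].partition("-")
        if prefix == "O":  # untagged token: breaks any running span
            prev_index = None
            continue
        extends = (
            prefix == "I"
            and bool(spans)
            and spans[-1]["label"] == label
            and prev_index is not None
            and tok["index"] == prev_index + 1
        )
        if extends:
            spans[-1]["end"] = tok["end"]
        else:
            spans.append({"label": label, "start": tok["start"], "end": tok["end"]})
        prev_index = tok["index"]
    # Recover the surface text of each span from the character offsets
    for span in spans:
        span["text"] = text[span["start"]:span["end"]]
    return spans

# Hand-written example input (assumption, NOT real model output).
example_text = "starke Expression von CD20 und CD10"
example_tokens = [
    {"entity": "B-ExpressionSignal", "index": 1, "word": "starke", "start": 0, "end": 6},
    {"entity": "I-ExpressionSignal", "index": 2, "word": "Expression", "start": 7, "end": 17},
    {"entity": "B-Expression", "index": 4, "word": "CD20", "start": 22, "end": 26},
    {"entity": "B-Expression", "index": 6, "word": "CD10", "start": 31, "end": 35},
]

example_spans = merge_iob(example_tokens, example_text)
for span in example_spans:
    print(span["label"], repr(span["text"]))
```

The `transformers` pipeline also accepts an `aggregation_strategy` argument (for example `"simple"`) that groups tokens in a similar way, which may be preferable to a hand-rolled merge.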