---
language:
- de
base_model:
- GerMedBERT/medbert-512
pipeline_tag: token-classification
license: apache-2.0
---

# Pathology notes NER Model Example

This example provides the code needed to use our NER model.

## Part 1: Define label list, load model and tokenizer

#### 1.1 Define label list

The label list contains every label in the IOB scheme: each entity or attribute type has a B- (beginning) and an I- (inside) label. Tokens that belong to no entity are labeled "O".

```python
# NOTE: this bare list literal is not assigned to anything and has no effect;
# the code below uses label_list.
["B-Mutation", "B-ExpressionSignal", "B-PolaritySignal",
 "I-HematoDiagnosis", "I-MorphologicAbnormality", "B-Infection",
 "I-Infection", "I-Proliferation", "B-Hematopoiesis",
 "I-DiagnosisType", "I-CellAssociation", "B-SizeSignal",
 "I-ShiftSignal", "I-PolaritySignal", "O",
 "B-AmountSignal", "B-MalignancySignal", "I-SizeSignal",
 "I-OtherDiagnosis", "I-MalignancySignal", "B-Expression",
 "B-DiagnosisType", "B-Proliferation", "I-Expression",
 "B-QuantitySignal", "B-MorphologicAbnormality", "B-ShiftSignal",
 "B-HematoDiagnosis", "B-CellType", "B-OtherDiagnosis",
 "B-ClonalitySignal", "B-CellAssociation", "I-QuantitySignal",
 "I-Mutation", "I-Hematopoiesis", "I-CellType",
 "I-AmountSignal", "I-ClonalitySignal", "I-ExpressionSignal"]

label_list = [
    "B-Mutation", "B-ExpressionSignal", "B-Polarity",
    "I-HematoDiagnosis", "I-MorphologicAbnormality", "B-InfectiousAgent",
    "I-InfectiousAgent", "I-Proliferation", "B-Hematopoiesis",
    "I-DiagnosisType", "I-CellAssociation", "B-Size",
    "I-ShiftSignal", "I-Polarity", "O",
    "B-Amount", "B-MalignancySignal", "I-Size",
    "I-OtherDiagnosis", "I-MalignancySignal", "B-Expression",
    "B-DiagnosisType", "B-Proliferation", "I-Expression",
    "B-Quantity", "B-MorphologicAbnormality", "B-ShiftSignal",
    "B-HematoDiagnosis", "B-CellType", "B-OtherDiagnosis",
    "B-ClonalitySignal", "B-CellAssociation", "I-Quantity",
    "I-Mutation", "I-Hematopoiesis", "I-CellType",
    "I-Amount", "I-ClonalitySignal", "I-ExpressionSignal",
]
label_list
```

#### 1.2 Load fine-tuned NER model

```python
# Create the class map (label name <-> integer id)
from datasets import ClassLabel
classmap = ClassLabel(num_classes=len(label_list), names=label_list)

# Load the model
from transformers import AutoModelForTokenClassification
model = AutoModelForTokenClassification.from_pretrained(
    "GerMedBERT-best_model",
    num_labels=len(label_list),
    id2label={i: classmap.int2str(i) for i in range(classmap.num_classes)},
    label2id={c: classmap.str2int(c) for c in classmap.names},
)
```

#### 1.3 Load tokenizer

```python
# Load the tokenizer
from transformers import AutoTokenizer
tokenizer = AutoTokenizer.from_pretrained("GerMedBERT/medbert-512")
```

## Part 2: Application of the model to an example pathology note

#### 2.1 Create nlp pipeline

```python
# Create the pipeline
from transformers import pipeline
import pandas as pd
nlp = pipeline("ner", model=model, tokenizer=tokenizer)
```

#### 2.2 First example in English and German

The following examples show that, although the model was trained only on annotated German texts, it also works on English text, though to a lesser extent.

```python
# Example 1 in English and German
english_example1 = "Immunohistochemically, there is a slightly increased amount of plasma cells, which are partly situated in small groups (MUM1, CD138). "
german_example1 = "Immunhistochemisch zeigt sich eine leichte Vermehrung der Plasmazellen, die teils in kleinen Gruppen angeordnet sind (MUM1, CD138)"

# Print the results for the English example
eng_results = nlp(english_example1)
df_eng1 = pd.DataFrame(eng_results)
print(df_eng1)

# Print the results for the German example
ger_results = nlp(german_example1)
df_ger1 = pd.DataFrame(ger_results)
print(df_ger1)
```

#### 2.3 Second example in English and German

```python
# Example 2 in English and German
english_example2 = "The diffuse infiltrates of blasts show a homogeneous and strong expression of CD20 and CD10 in absence of CD3, BCL-2, and TDT."
german_example2 = "Diffuse Blasteninfiltrate zeigen eine homogene und starke Expression von CD20 und CD10 in Abwesenheit von CD3, BCL-2 und TDT."

# Print the results for the English example
eng_results = nlp(english_example2)
df_eng2 = pd.DataFrame(eng_results)
print(df_eng2)

# Print the results for the German example
ger_results = nlp(german_example2)
df_ger2 = pd.DataFrame(ger_results)
print(df_ger2)
```
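
To make the IOB scheme from section 1.1 more concrete, the following minimal sketch merges a sequence of IOB tags into entity spans. The helper name `iob_to_spans` is hypothetical and not part of this model or of any library, and the tokens and tags are hand-written for illustration; they are not output of the model.

```python
from typing import List, Optional, Tuple


def iob_to_spans(tokens: List[str], tags: List[str]) -> List[Tuple[str, str]]:
    """Merge IOB-tagged tokens into (entity_type, text) spans.

    Hypothetical helper for illustration only.
    """
    spans: List[Tuple[str, str]] = []
    current_type: Optional[str] = None
    current_tokens: List[str] = []

    for token, tag in zip(tokens, tags):
        # "B-CellType" -> ("B", "CellType"); "O" -> ("O", "")
        prefix, _, entity_type = tag.partition("-")
        if prefix == "B" or (prefix == "I" and entity_type != current_type):
            # A B- tag (or an I- tag of a different type) starts a new span.
            if current_tokens:
                spans.append((current_type, " ".join(current_tokens)))
            current_type, current_tokens = entity_type, [token]
        elif prefix == "I":
            # An I- tag of the same type continues the open span.
            current_tokens.append(token)
        else:
            # "O" closes any open span.
            if current_tokens:
                spans.append((current_type, " ".join(current_tokens)))
            current_type, current_tokens = None, []

    # Close a span that runs to the end of the sequence.
    if current_tokens:
        spans.append((current_type, " ".join(current_tokens)))
    return spans


# Hand-written example (assumed tags, not model predictions)
example_tokens = ["leichte", "Vermehrung", "der", "Plasmazellen", "(", "MUM1", ")"]
example_tags = ["B-Amount", "I-Amount", "O", "B-CellType", "O", "B-Expression", "O"]
print(iob_to_spans(example_tokens, example_tags))
```

Each B- tag opens a span, matching I- tags extend it, and "O" closes it, so the example yields one span per entity type mentioned.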