adambuttrick committed on
Commit
1d25ba9
1 Parent(s): 1a5eb84

Update README.md

  1. README.md +34 -1
README.md CHANGED
@@ -1,3 +1,36 @@
  ---
- license: mit
+ license: apache-2.0
  ---
+ # BERT NER Organization Name Extraction
+
+ ## Overview
+ A fine-tune of the [`bert-base-uncased`](https://huggingface.co/bert-base-uncased) model, trained to identify and classify named entities in affiliation strings, focusing on organizations and locations.
+
+ ## Training Data
+ Training data comprised approximately 500,000 programmatically annotated items, in which named entities in affiliation strings were tagged with their types (organization or location) and all other text was marked as extraneous. Example annotation format:
+
+ ```
+ O: Internal
+ O: Medicine
+ O: Complex
+ B-ORG: College
+ I-ORG: of
+ I-ORG: Medical
+ I-ORG: Sciences
+ B-LOC: New
+ I-LOC: Delhi
+ B-LOC: India
+ ```
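The annotation format above follows the common BIO convention: `B-` opens an entity, `I-` continues it, and `O` marks text outside any entity. As a minimal sketch (the helper name `group_bio` is illustrative and not part of this repository), tagged tokens can be collapsed back into entity spans like this:

```python
from typing import List, Optional, Tuple


def group_bio(pairs: List[Tuple[str, str]]) -> List[Tuple[str, str]]:
    """Collapse (tag, token) pairs into (entity_type, text) spans."""
    spans: List[Tuple[str, List[str]]] = []
    prev_type: Optional[str] = None
    for tag, token in pairs:
        if tag == "O":
            # Outside any entity: close whatever span was open.
            prev_type = None
            continue
        prefix, _, ent_type = tag.partition("-")
        if prefix == "I" and prev_type == ent_type:
            # Continuation of the currently open span.
            spans[-1][1].append(token)
        else:
            # "B-" opens a new span; a stray "I-" is treated as a start too.
            spans.append((ent_type, [token]))
        prev_type = ent_type
    return [(ent_type, " ".join(tokens)) for ent_type, tokens in spans]


# The example from the README, as (tag, token) pairs.
example = [
    ("O", "Internal"), ("O", "Medicine"), ("O", "Complex"),
    ("B-ORG", "College"), ("I-ORG", "of"), ("I-ORG", "Medical"),
    ("I-ORG", "Sciences"), ("B-LOC", "New"), ("I-LOC", "Delhi"),
    ("B-LOC", "India"),
]
print(group_bio(example))
```

Because `India` carries its own `B-LOC` tag, it is returned as a separate location rather than being merged into `New Delhi`.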
+ The training data was derived from [OpenAlex](https://openalex.org/) affiliation strings and their [ROR ID](https://ror.org) assignments. Tagging was done using the corresponding name and location metadata from the assigned ROR record. Location names were further supplemented with aliases derived from the [Unicode Common Locale Data Repository (CLDR)](https://cldr.unicode.org/).
+
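The README does not describe the tagging procedure in detail, so the following is only an illustrative sketch of how name and location metadata could be projected onto a tokenized affiliation string. It assumes whitespace tokenization and exact, case-insensitive matching; the function `tag_affiliation` and its parameters are hypothetical and are not the pipeline actually used to build the training set.

```python
from typing import List, Sequence


def tag_affiliation(
    tokens: Sequence[str],
    org_names: Sequence[str],
    loc_names: Sequence[str],
) -> List[str]:
    """Assign BIO tags by matching known names against the token sequence."""
    tags = ["O"] * len(tokens)
    lowered = [t.lower() for t in tokens]

    def mark(names: Sequence[str], label: str) -> None:
        # Try longer names first so "New Delhi" is preferred over "Delhi".
        for name in sorted(names, key=lambda n: -len(n.split())):
            parts = name.lower().split()
            n = len(parts)
            if n == 0:
                continue
            for i in range(len(tokens) - n + 1):
                window_free = all(t == "O" for t in tags[i:i + n])
                if lowered[i:i + n] == parts and window_free:
                    tags[i] = f"B-{label}"
                    for j in range(i + 1, i + n):
                        tags[j] = f"I-{label}"

    mark(org_names, "ORG")  # e.g. names from the assigned ROR record (assumed input)
    mark(loc_names, "LOC")  # e.g. ROR locations plus CLDR aliases (assumed input)
    return tags


tokens = "Internal Medicine Complex College of Medical Sciences New Delhi India".split()
tags = tag_affiliation(
    tokens,
    org_names=["College of Medical Sciences"],
    loc_names=["New Delhi", "India"],
)
for tag, token in zip(tags, tokens):
    print(f"{tag}: {token}")
```

Real affiliation strings contain punctuation and spelling variants, so a production tagger would need normalization and fuzzier matching than this sketch provides.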
+ ## Training Details
+ - **Dataset Size**: ~500,000 items
+ - **Number of Epochs**: 3
+ - **Optimizer**: AdamW
+ - **Training Environment**: Google Colab T4 High-RAM instance
+ - **Training Duration**: Approximately 8 hours
+
+
+ ## Usage
+ See https://github.com/ror-community/affiliation-matching-experimental/tree/main/ner_tests/inference for example usage.
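For orientation, here is a minimal inference sketch using the Hugging Face `transformers` token-classification pipeline. It is not taken from the linked repository. The model ID is a placeholder you must supply, and the label names (`ORG`, `LOC`) are assumed to match the annotation format shown under Training Data.

```python
from collections import defaultdict
from typing import Callable, Dict, List


def load_ner(model_id: str):
    """Build a token-classification pipeline (needs `transformers` and a backend such as PyTorch)."""
    from transformers import pipeline  # imported lazily so the helper below has no hard dependency

    # aggregation_strategy="simple" merges word pieces into whole entity spans.
    return pipeline("token-classification", model=model_id, aggregation_strategy="simple")


def extract_entities(ner: Callable[[str], List[dict]], affiliation: str) -> Dict[str, List[str]]:
    """Group the pipeline's aggregated predictions by entity type."""
    grouped: Dict[str, List[str]] = defaultdict(list)
    for ent in ner(affiliation):
        grouped[ent["entity_group"]].append(ent["word"])
    return dict(grouped)


# Example (placeholder model ID; replace with the actual repository ID):
# ner = load_ner("<namespace>/<model-id>")
# extract_entities(ner, "College of Medical Sciences, New Delhi, India")
```

Passing the pipeline in as a callable keeps `extract_entities` easy to test with a stub and independent of how the model is loaded.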