Update README.md
README.md CHANGED
# Model Card for sbb_ner

<!-- Provide a quick summary of what the model is/does. [Optional] -->

A BERT model trained on three German corpora containing contemporary and historical texts for named entity recognition tasks. It predicts the classes `PER`, `LOC` and `ORG`.

The model was developed by the Berlin State Library (SBB) in the [QURATOR](https://staatsbibliothek-berlin.de/die-staatsbibliothek/projekte/project-id-1060-2018) project.
<!-- Provide a longer summary of what this model is/does. -->

A BERT model trained on three German corpora containing contemporary and historical texts for named entity recognition tasks.
It predicts the classes `PER`, `LOC` and `ORG`.

- **Developed by:** [Kai Labusch](https://huggingface.co/labusch), [Clemens Neudecker](https://huggingface.co/cneud), David Zellhöfer
- **Shared by [Optional]:** [Staatsbibliothek zu Berlin / Berlin State Library](https://huggingface.co/SBB)
- **Model type:** Language model
- **Language(s) (NLP):** de
## Direct Use

The model can be used directly to perform NER on historical German texts obtained by Optical Character Recognition (OCR) from digitized documents.
Supported entity types are `PER`, `LOC` and `ORG`.

<!-- This section is for the model use without fine-tuning or plugging into a larger ecosystem/app. -->
<!-- If the user enters content, print that. If not, but they enter a task in the list, use that. If neither, say "more info needed." -->
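For orientation, the following is a minimal sketch of direct use with the Hugging Face `transformers` library. It assumes the checkpoint is published under the Hub id `SBB/sbb_ner` and loads as a standard token-classification model; these assumptions are not confirmed by this card, the published code at [GitHub](https://github.com/qurator-spk/sbb_ner) remains the authoritative reference, and the sample sentence is purely illustrative.

```python
# Hypothetical sketch, not taken from the model card: assumes the checkpoint is
# available as "SBB/sbb_ner" on the Hugging Face Hub and is compatible with
# AutoModelForTokenClassification; see https://github.com/qurator-spk/sbb_ner
# for the authoritative usage.
from transformers import AutoModelForTokenClassification, AutoTokenizer, pipeline

model_id = "SBB/sbb_ner"  # assumed Hub id

tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForTokenClassification.from_pretrained(model_id)

# Merge sub-word predictions into whole entity spans (PER, LOC, ORG).
ner = pipeline(
    "token-classification",
    model=model,
    tokenizer=tokenizer,
    aggregation_strategy="simple",
)

# Illustrative (possibly OCR-noisy) historical German sentence.
text = "Wilhelm von Humboldt gründete 1810 die Universität zu Berlin."
for entity in ner(text):
    print(entity["entity_group"], entity["word"], round(float(entity["score"]), 3))
```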
<!-- This section is for the model use when fine-tuned for a task, or when plugged into a larger ecosystem/app -->
<!-- If the user enters content, print that. If not, but they enter a task in the list, use that. If neither, say "more info needed." -->

The model has been pre-trained on 2,333,647 pages of OCR text from the digitized collections of the Berlin State Library.
It is therefore adapted to OCR-error-prone historical German texts and may be used for applications that involve such text material.
### Preprocessing

The model was pre-trained on 2,333,647 pages of German texts from the digitized collections of the Berlin State Library.
The texts have been obtained by OCR from the page scans of the documents.

### Speeds, Sizes, Times
### Software

See the published code on [GitHub](https://github.com/qurator-spk/sbb_ner).

# Citation
**BibTeX:**

@article{labusch_bert_2019,
    title = {{BERT} for {Named} {Entity} {Recognition} in {Contemporary} and {Historical} {German}},
    volume = {Conference on Natural Language Processing},
    url = {https://konvens.org/proceedings/2019/papers/KONVENS2019_paper_4.pdf},
    abstract = {We apply a pre-trained transformer based representational language model, i.e. BERT (Devlin et al., 2018), to named entity recognition (NER) in contemporary and historical German text and observe state of the art performance for both text categories. We further improve the recognition performance for historical German by unsupervised pre-training on a large corpus of historical German texts of the Berlin State Library and show that best performance for historical German is obtained by unsupervised pre-training on historical German plus supervised pre-training with contemporary NER ground-truth.},
    language = {en},
    author = {Labusch, Kai and Neudecker, Clemens and Zellhöfer, David},
    year = {2019},
    pages = {9},
}

**APA:**
In addition to what has been documented above, it should be noted that there are two NER Ground Truth datasets available:

1) [Data provided for the 2020 HIPE campaign on named entity processing](https://impresso.github.io/CLEF-HIPE-2020/)
2) [Data provided for the 2022 HIPE shared task on named entity processing](https://hipe-eval.github.io/HIPE-2022/)

Furthermore, two papers have been published on NER/NED using BERT:

1) [Entity Linking in Multilingual Newspapers and Classical Commentaries with BERT](http://ceur-ws.org/Vol-3180/paper-85.pdf)
2) [Named Entity Disambiguation and Linking Historic Newspaper OCR with BERT](http://ceur-ws.org/Vol-2696/paper_163.pdf)

# Model Card Authors [optional]