Update README.md
README.md
CHANGED
@@ -8,8 +8,9 @@ license: agpl-3.0
|
|
8 |
|
9 |
Since their beginnings in the 1830s and 1840s, news agencies have played an important role in the national and international news market, aiming to deliver news as quickly and as reliably as possible. While we know that newspapers have long used agency content to produce their stories, the extent to which the agencies shape our news often remains unclear. Although researchers have already addressed this question, recently by using computational methods to assess the present-day influence of news agencies, large-scale studies on the role of news agencies in the past remain rare.
|
10 |
|
11 |
-
This project
|
12 |
|
|
|
13 |
|
14 |
## Research Summary
|
15 |
|
@@ -17,7 +18,32 @@ Results show that ca. 10% of the articles explicitly reference news agencies, wi
|
|
17 |
Differences in the usage of agency content across time, countries and languages as well as between newspapers reveal a complex network of news flows, whose exploration provides many opportunities for future work.
|
18 |
|
19 |
|
20 |
-
## Intended uses
21 |
|
22 |
#### How to use
|
23 |
|
@@ -37,9 +63,6 @@ ner_results = nlp(example)
|
|
37 |
print(ner_results)
|
38 |
```
|
39 |
|
40 |
-
#### Limitations and bias
|
41 |
-
|
42 |
-
|
43 |
## Training data
|
44 |
|
45 |
|
|
|
8 |
|
9 |
Since their beginnings in the 1830s and 1840s, news agencies have played an important role in the national and international news market, aiming to deliver news as quickly and as reliably as possible. While we know that newspapers have long used agency content to produce their stories, the extent to which the agencies shape our news often remains unclear. Although researchers have already addressed this question, recently by using computational methods to assess the present-day influence of news agencies, large-scale studies on the role of news agencies in the past remain rare.
|
10 |
|
11 |
+
This project aimed to bridge this gap by detecting news agencies in a large corpus of Swiss and Luxembourgish newspaper articles (the [impresso](https://impresso-project.ch/) corpus) for the years 1840-2000 using deep learning methods. To this end, we first built and annotated a multilingual dataset of news agency mentions, which we then used to train and evaluate several BERT-based agency detection and classification models. Based on these experiments, we chose two models (one for French and one for German) for inference on the [impresso](https://impresso-project.ch/) corpus.
|
12 |
|
13 |
+
dbmdz/bert-base-french-europeana-cased
|
14 |
|
15 |
## Research Summary
|
16 |
|
|
|
18 |
Differences in the usage of agency content across time, countries and languages as well as between newspapers reveal a complex network of news flows, whose exploration provides many opportunities for future work.
|
19 |
|
20 |
|
21 |
+
## Intended uses
|
22 |
+
|
23 |
+
dbmdz/bert-base-french-europeana-cased
|
24 |
+
|
25 |
+
## Dataset Characteristics
|
26 |
+
The dataset contains 1,133 French and 397 German annotated documents, totalling 1,058,449 tokens and 1,976 annotated agency mentions. Below is an overview of the corpus statistics:
|
27 |
+
The annotated dataset is released on [Zenodo](https://doi.org/10.5281/zenodo.8333933).
|
28 |
+
|
29 |
+
|
30 |
+
Overview of corpus statistics. %noisy gives the percentage of agency mentions with at least one OCR error.
|
31 |
+
|
32 |
| Split | Lg. | Docs | Tokens | Mentions | %noisy |
|-------|-----|------|--------|----------|--------|
| Train | de | 333 | 247,793 | 493 | 9% |
| | fr | 903 | 606,671 | 1,122 | 5% |
| Total | | 1,236 | 854,464 | 1,615 | 6% |
| Dev | de | 32 | 28,745 | 26 | 8% |
| | fr | 110 | 77,746 | 114 | 3% |
| Total | | 142 | 106,491 | 140 | 4% |
| Test | de | 32 | 22,437 | 58 | 3% |
| | fr | 120 | 75,057 | 163 | 7% |
| Total | | 152 | 97,494 | 221 | 6% |
| All | de | 397 | 298,975 | 577 | 9% |
| | fr | 1,133 | 759,474 | 1,399 | 5% |
| Total | | 1,530 | 1,058,449 | 1,976 | 6% |
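
As a quick sanity check on the corpus statistics, the per-language rows can be summed and compared with the reported totals. The sketch below is illustrative only: the `SPLITS` dictionary and the `totals` helper are hypothetical names that are not part of the project's code, and the values were copied by hand from the table.

```python
# Per-split, per-language counts copied by hand from the corpus statistics table.
# Tuple layout: (documents, tokens, mentions). All names here are illustrative.
SPLITS = {
    "train": {"de": (333, 247_793, 493), "fr": (903, 606_671, 1_122)},
    "dev": {"de": (32, 28_745, 26), "fr": (110, 77_746, 114)},
    "test": {"de": (32, 22_437, 58), "fr": (120, 75_057, 163)},
}


def totals(rows):
    """Sum (documents, tokens, mentions) tuples column by column."""
    return tuple(sum(column) for column in zip(*rows))


# Totals per split, across both languages.
split_totals = {name: totals(langs.values()) for name, langs in SPLITS.items()}

# Totals per language, across all splits.
lang_totals = {
    lang: totals(langs[lang] for langs in SPLITS.values()) for lang in ("de", "fr")
}

# Grand total over the whole dataset.
grand_total = totals(split_totals.values())

print(split_totals)
print(lang_totals)
print(grand_total)
```

The computed sums can then be compared against the "Total" and "All" rows of the table.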
|
46 |
+
|
47 |
|
48 |
#### How to use
|
49 |
|
|
|
63 |
print(ner_results)
|
64 |
```
|
65 |
|
66 |
## Training data
|
67 |
|
68 |
|