AntoineBourgois committed
Commit f5fbf09
1 Parent(s): b71d132
Upload 3 files
- JCLS_model_card.md +74 -55
- README.md +124 -79
- final_model +3 -0
JCLS_model_card.md
CHANGED
@@ -1,70 +1,95 @@
|
|
1 |
-
|
2 |
---
|
3 |
language: fr
|
4 |
tags:
|
5 |
-
-
|
6 |
- camembert
|
7 |
- literary-texts
|
8 |
- nested-entities
|
9 |
- BookNLP-fr
|
10 |
license: apache-2.0
|
11 |
metrics:
|
12 |
-
-
|
13 |
-
-
|
14 |
-
-
|
|
|
15 |
base_model:
|
16 |
- almanach/camembert-large
|
17 |
-
pipeline_tag: token-classification
|
18 |
---
|
19 |
|
20 |
## INTRODUCTION:
|
21 |
-
This model, developed as part of the [BookNLP-fr project](https://github.com/lattice-8094/fr-litbank), is a **
|
22 |
|
23 |
-
|
24 |
-
- mentions of characters (PER): pronouns (je, tu, il, ...), possessive pronouns (mon, ton, son, ...), common nouns (le capitaine, la princesse, ...) and proper nouns (Indiana Delmare, Honoré de Pardaillan, ...)
|
25 |
-
- facilities (FAC): château, sentier, chambre, couloir, ...
|
26 |
-
- time (TIME): le règne de Louis XIV, ce matin, en juillet, ...
|
27 |
-
- geo-political entities (GPE): Montrouge, France, le petit hameau, ...
|
28 |
-
- locations (LOC): le sud, Mars, l'océan, le bois, ...
|
29 |
-
- vehicles (VEH): avion, voitures, calèche, vélos, ...
|
30 |
|
31 |
## MODEL PERFORMANCES (LOOCV):
|
32 |
-
|
33 |
-
|
34 |
-
|
35 |
-
|
|
36 |
-
|
37 |
|
38 |
## TRAINING PARAMETERS:
|
39 |
-
- Entities types:
|
40 |
-
- Tagging scheme: BIOES
|
41 |
-
- Nested entities levels: [0, 1]
|
42 |
- Split strategy: Leave-one-out cross-validation (29 files)
|
43 |
- Train/Validation split: 0.85 / 0.15
|
44 |
-
- Batch size: 16
|
45 |
-
- Initial learning rate: 0.
|
46 |
|
47 |
## MODEL ARCHITECTURE:
|
48 |
-
Model Input:
|
49 |
-
|
50 |
-
-
|
51 |
-
|
52 |
-
-
|
53 |
-
-
|
54 |
-
-
|
55 |
-
-
|
56 |
-
|
57 |
-
-
|
58 |
-
|
59 |
-
-
|
60 |
-
|
61 |
-
-
|
62 |
-
-
|
63 |
-
-
|
64 |
-
|
65 |
-
-
|
66 |
-
|
67 |
-
|
68 |
|
69 |
## HOW TO USE:
|
70 |
*** IN CONSTRUCTION ***
|
@@ -81,9 +106,9 @@ Model Output: BIOES labels sequence
|
|
81 |
| 6 | 1863_Gautier-Theophile_Le-capitaine-Fracasse | 11,834 tokens | False |
|
82 |
| 7 | 1873_Zola-Emile_Le-ventre-de-Paris | 12,557 tokens | False |
|
83 |
| 8 | 1881_Flaubert-Gustave_Bouvard-et-Pecuchet | 12,281 tokens | False |
|
84 |
-
| 9 | 1882_Guy-de-Maupassant_Mademoiselle-Fifi-1_1-MADEMOISELLE-FIFI | 5,425 tokens | True
|
85 |
-
| 10 | 1882_Guy-de-Maupassant_Mademoiselle-Fifi-1_2-MADAME-BAPTISTE | 2,554 tokens | True
|
86 |
-
| 11 | 1882_Guy-de-Maupassant_Mademoiselle-Fifi-1_3-LA-ROUILLE | 2,929 tokens | True
|
87 |
| 12 | 1882_Guy-de-Maupassant_Mademoiselle-Fifi-2_1-MARROCA | 4,067 tokens | False |
|
88 |
| 13 | 1882_Guy-de-Maupassant_Mademoiselle-Fifi-2_2-LA-BUCHE | 2,251 tokens | False |
|
89 |
| 14 | 1882_Guy-de-Maupassant_Mademoiselle-Fifi-2_3-LA-RELIQUE | 2,034 tokens | False |
|
@@ -94,20 +119,14 @@ Model Output: BIOES labels sequence
|
|
94 |
| 19 | 1882_Guy-de-Maupassant_Mademoiselle-Fifi-3_5-UN-REVEILLON | 2,343 tokens | False |
|
95 |
| 20 | 1901_Lucie-Achard_Rosalie-de-Constant-sa-famille-et-ses-amis | 12,703 tokens | False |
|
96 |
| 21 | 1903_Conan-Laure_Elisabeth_Seton | 13,023 tokens | False |
|
97 |
-
| 22 | 1904_Rolland-Romain_Jean-Christophe_Tome-I-L-aube | 10,982 tokens | True
|
98 |
| 23 | 1904_Rolland-Romain_Jean-Christophe_Tome-II-Le-matin | 10,305 tokens | False |
|
99 |
| 24 | 1917_Adèle-Bourgeois_Némoville | 12,389 tokens | False |
|
100 |
| 25 | 1923_Radiguet-Raymond_Le-diable-au-corps | 14,637 tokens | False |
|
101 |
-
| 26 | 1926_Audoux-Marguerite_De-la-ville-au-moulin | 11,902 tokens | True
|
102 |
| 27 | 1937_Audoux-Marguerite_Douce-Lumiere | 12,285 tokens | False |
|
103 |
| 28 | Manon_Lescaut_PEDRO | 71,219 tokens | False |
|
104 |
| 29 | TOTAL | 346,579 tokens | 5 files used for cross-validation |
|
105 |
|
106 |
-
## PREDICTIONS CONFUSION MATRIX:
|
107 |
-
| Gold Labels | PER | O | support |
|
108 |
-
|---------------|-------|-----|-----------|
|
109 |
-
| PER | 3,864 | 197 | 4,061 |
|
110 |
-
| O | 370 | 0 | 370 |
|
111 |
-
|
112 |
## CONTACT:
|
113 |
mail: antoine [dot] bourgois [at] protonmail [dot] com
|
|
|
|
|
1 |
---
|
2 |
language: fr
|
3 |
tags:
|
4 |
+
- coreference-resolution
|
5 |
+
- anaphora-resolution
|
6 |
+
- mentions-linking
|
7 |
+
- literary-texts
|
8 |
- camembert
|
9 |
- literary-texts
|
10 |
- nested-entities
|
11 |
- BookNLP-fr
|
12 |
license: apache-2.0
|
13 |
metrics:
|
14 |
+
- MUC
|
15 |
+
- B3
|
16 |
+
- CEAF
|
17 |
+
- CoNLL-F1
|
18 |
base_model:
|
19 |
- almanach/camembert-large
|
|
|
20 |
---
|
21 |
|
22 |
## INTRODUCTION:
|
23 |
+
This model, developed as part of the [BookNLP-fr project](https://github.com/lattice-8094/fr-litbank), is a **coreference resolution model** built on top of [camembert-large](https://huggingface.co/almanach/camembert-large) embeddings. It is trained to link mentions of the same entity across a text, focusing on literary works in French.
|
24 |
|
25 |
+
This specific model has been trained to link entities of the following types: PER.
|
26 |
|
27 |
## MODEL PERFORMANCES (LOOCV):
|
28 |
+
Overall Coreference Resolution Performances for non-overlapping windows of different length:
|
29 |
+
| | Window width (tokens) | Document count | Sample count | MUC F1 | B3 F1 | CEAFe F1 | CONLL F1 |
|
30 |
+
|----|-------------------------|------------------|----------------|----------|---------|------------|------------|
|
31 |
+
| 0 | 500 | 5 | 64 | 93.49% | 86.27% | 77.85% | 85.87% |
|
32 |
+
| 1 | 1,000 | 5 | 30 | 93.68% | 81.32% | 71.92% | 82.31% |
|
33 |
+
| 2 | 2,000 | 5 | 14 | 93.98% | 76.90% | 67.26% | 79.38% |
|
34 |
+
| 3 | 5,000 | 3 | 5 | 94.83% | 68.34% | 59.88% | 74.35% |
|
35 |
+
| 4 | 10,000 | 2 | 2 | 96.16% | 62.22% | 57.12% | 71.84% |
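The document and sample counts in this table are consistent with cutting each evaluated document into consecutive, non-overlapping windows and dropping the final partial window. A minimal sketch under that reading (the function name `window_stats` is hypothetical; the token counts are those of the five documents flagged `True` in the training corpus table):

```python
def window_stats(token_counts, width):
    """Return (document_count, sample_count) for non-overlapping windows.

    Assumption: a trailing window shorter than `width` is discarded, and a
    document is counted only if it yields at least one full window.
    """
    samples_per_doc = [n // width for n in token_counts]
    documents = sum(1 for s in samples_per_doc if s > 0)
    return documents, sum(samples_per_doc)


# Token counts of the five evaluation documents listed in the corpus table.
EVAL_TOKEN_COUNTS = [5425, 2554, 2929, 10982, 11902]

for width in (500, 1000, 2000, 5000, 10000):
    docs, samples = window_stats(EVAL_TOKEN_COUNTS, width)
    print(f"{width:>6} tokens: {docs} documents, {samples} samples")
```

Longer windows leave fewer documents because any text shorter than the window width contributes no sample at all.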
|
36 |
+
|
37 |
+
Coreference Resolution Performances on the fully annotated sample for each document:
|
38 |
+
| | Token count | Mention count | MUC F1 | B3 F1 | CEAFe F1 | CONLL F1 |
|
39 |
+
|----|---------------|-----------------|----------|---------|------------|------------|
|
40 |
+
| 0 | 2,554 | 330 | 90.24% | 65.27% | 72.36% | 75.96% |
|
41 |
+
| 1 | 2,929 | 386 | 95.65% | 78.21% | 64.23% | 79.37% |
|
42 |
+
| 2 | 5,425 | 558 | 90.46% | 53.03% | 59.52% | 67.67% |
|
43 |
+
| 3 | 10,982 | 1,095 | 97.18% | 65.30% | 60.49% | 74.32% |
|
44 |
+
| 4 | 11,902 | 1,692 | 95.03% | 58.83% | 45.59% | 66.49% |
|
45 |
|
46 |
## TRAINING PARAMETERS:
|
47 |
+
- Entities types: PER
|
|
|
|
|
48 |
- Split strategy: Leave-one-out cross-validation (29 files)
|
49 |
- Train/Validation split: 0.85 / 0.15
|
50 |
+
- Batch size: 16,000
|
51 |
+
- Initial learning rate: 0.0004
|
52 |
+
- Focal loss gamma: 1
|
53 |
+
- Focal loss alpha: 0.25
|
54 |
+
- Pronoun lookup antecedents: 30
|
55 |
+
- Common and Proper nouns lookup antecedents: 300
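The card gives only the focal loss hyperparameters (gamma = 1, alpha = 0.25). As an illustration, here is the commonly used binary focal loss for a single mention pair. It is a sketch, not the project's training code: the function name is hypothetical, and it is an assumption that alpha weights the coreferent (positive) class.

```python
import math


def binary_focal_loss(p, target, gamma=1.0, alpha=0.25, eps=1e-12):
    """Focal loss for one mention pair.

    p      -- predicted probability that the pair is coreferent
    target -- 1 for a coreferent pair, 0 otherwise
    """
    # Probability assigned to the true class, and its class weight.
    p_t = p if target == 1 else 1.0 - p
    alpha_t = alpha if target == 1 else 1.0 - alpha
    p_t = min(max(p_t, eps), 1.0)  # avoid log(0)
    # The (1 - p_t) ** gamma factor down-weights pairs already classified well.
    return -alpha_t * (1.0 - p_t) ** gamma * math.log(p_t)


print(binary_focal_loss(0.9, 1))  # confident and correct: small loss
print(binary_focal_loss(0.9, 0))  # confident and wrong: large loss
```

With gamma = 1 the down-weighting is linear in the error `1 - p_t`, which softens the dominance of the many easy non-coreferent pairs.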
|
56 |
|
57 |
## MODEL ARCHITECTURE:
|
58 |
+
Model Input: 2,165 dimensions vector
|
59 |
+
- Concatenated maximum context camembert-large embeddings (2 * 1,024 = 2,048 dimensions)
|
60 |
+
- Additional mentions features (106 dimensions):
|
61 |
+
- Length of mentions
|
62 |
+
- Position of the mention's start token within the sentence
|
63 |
+
- Grammatical category of the mentions (pronoun, common noun, proper noun)
|
64 |
+
- Dependency relation of the mention's head (one-hot encoded)
|
65 |
+
- Gender of the mentions (one-hot encoded)
|
66 |
+
- Number (singular/plural) of the mentions (one-hot encoded)
|
67 |
+
- Grammatical person of the mentions (one-hot encoded)
|
68 |
+
- Additional mention pairs features (11 dimensions):
|
69 |
+
- Distance between mention IDs
|
70 |
+
- Distance between start tokens of mentions
|
71 |
+
- Distance between end tokens of mentions
|
72 |
+
- Distance between sentences containing mentions
|
73 |
+
- Distance between paragraphs containing mentions
|
74 |
+
- Difference in nesting levels of mentions
|
75 |
+
- Ratio of shared tokens between mentions
|
76 |
+
- Exact text match between mentions (binary)
|
77 |
+
- Exact match of mention heads (binary)
|
78 |
+
- Match of syntactic heads between mentions (binary)
|
79 |
+
- Match of entity types between mentions (binary)
|
80 |
+
|
81 |
+
- Hidden Layers:
|
82 |
+
- Number of layers: 3
|
83 |
+
- Units per layer: 1,900 nodes
|
84 |
+
- Activation function: relu
|
85 |
+
- Dropout rate: 0.6
|
86 |
+
|
87 |
+
- Final Layer:
|
88 |
+
- Type: Linear
|
89 |
+
- Input: 1,900 dimensions
|
90 |
+
- Output: 1 dimension (mention pair coreference score)
|
91 |
+
|
92 |
+
Model Output: Continuous prediction between 0 (not coreferent) and 1 (coreferent) indicating the degree of confidence.
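The dimensions listed above can be checked with a little arithmetic. The sketch below assumes plain fully connected layers with biases and nothing else (no normalization layers); the constant and function names are hypothetical.

```python
EMBEDDING_DIM = 1024     # camembert-large embedding size, one per mention
MENTION_FEATURES = 106   # total stated in the card for the mentions features
PAIR_FEATURES = 11       # mention pair features
HIDDEN_UNITS = 1900
HIDDEN_LAYERS = 3

# Two concatenated mention embeddings plus the hand-crafted features.
input_dim = 2 * EMBEDDING_DIM + MENTION_FEATURES + PAIR_FEATURES


def dense_parameters(n_in, n_out):
    """Weights plus biases of one fully connected layer."""
    return n_in * n_out + n_out


layer_sizes = [input_dim] + [HIDDEN_UNITS] * HIDDEN_LAYERS + [1]
total_parameters = sum(
    dense_parameters(a, b) for a, b in zip(layer_sizes, layer_sizes[1:])
)

print(input_dim)
print(total_parameters)
print(total_parameters * 4 / 1e6, "MB if stored as 32-bit floats")
```

Under these assumptions the network has roughly 11.3 million parameters, or about 45.4 MB in float32. That is the same order as the 45,374,764-byte `final_model` file added in this commit, though this is only a plausibility check.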
|
93 |
|
94 |
## HOW TO USE:
|
95 |
*** IN CONSTRUCTION ***
|
|
|
106 |
| 6 | 1863_Gautier-Theophile_Le-capitaine-Fracasse | 11,834 tokens | False |
|
107 |
| 7 | 1873_Zola-Emile_Le-ventre-de-Paris | 12,557 tokens | False |
|
108 |
| 8 | 1881_Flaubert-Gustave_Bouvard-et-Pecuchet | 12,281 tokens | False |
|
109 |
+
| 9 | 1882_Guy-de-Maupassant_Mademoiselle-Fifi-1_1-MADEMOISELLE-FIFI | 5,425 tokens | **True** |
|
110 |
+
| 10 | 1882_Guy-de-Maupassant_Mademoiselle-Fifi-1_2-MADAME-BAPTISTE | 2,554 tokens | **True** |
|
111 |
+
| 11 | 1882_Guy-de-Maupassant_Mademoiselle-Fifi-1_3-LA-ROUILLE | 2,929 tokens | **True** |
|
112 |
| 12 | 1882_Guy-de-Maupassant_Mademoiselle-Fifi-2_1-MARROCA | 4,067 tokens | False |
|
113 |
| 13 | 1882_Guy-de-Maupassant_Mademoiselle-Fifi-2_2-LA-BUCHE | 2,251 tokens | False |
|
114 |
| 14 | 1882_Guy-de-Maupassant_Mademoiselle-Fifi-2_3-LA-RELIQUE | 2,034 tokens | False |
|
|
|
119 |
| 19 | 1882_Guy-de-Maupassant_Mademoiselle-Fifi-3_5-UN-REVEILLON | 2,343 tokens | False |
|
120 |
| 20 | 1901_Lucie-Achard_Rosalie-de-Constant-sa-famille-et-ses-amis | 12,703 tokens | False |
|
121 |
| 21 | 1903_Conan-Laure_Elisabeth_Seton | 13,023 tokens | False |
|
122 |
+
| 22 | 1904_Rolland-Romain_Jean-Christophe_Tome-I-L-aube | 10,982 tokens | **True** |
|
123 |
| 23 | 1904_Rolland-Romain_Jean-Christophe_Tome-II-Le-matin | 10,305 tokens | False |
|
124 |
| 24 | 1917_Adèle-Bourgeois_Némoville | 12,389 tokens | False |
|
125 |
| 25 | 1923_Radiguet-Raymond_Le-diable-au-corps | 14,637 tokens | False |
|
126 |
+
| 26 | 1926_Audoux-Marguerite_De-la-ville-au-moulin | 11,902 tokens | **True** |
|
127 |
| 27 | 1937_Audoux-Marguerite_Douce-Lumiere | 12,285 tokens | False |
|
128 |
| 28 | Manon_Lescaut_PEDRO | 71,219 tokens | False |
|
129 |
| 29 | TOTAL | 346,579 tokens | 5 files used for cross-validation |
|
130 |
|
131 |
## CONTACT:
|
132 |
mail: antoine [dot] bourgois [at] protonmail [dot] com
|
README.md
CHANGED
@@ -1,70 +1,121 @@
|
|
1 |
-
|
2 |
---
|
3 |
language: fr
|
4 |
tags:
|
5 |
-
-
|
|
|
|
|
|
|
6 |
- camembert
|
7 |
- literary-texts
|
8 |
- nested-entities
|
9 |
- BookNLP-fr
|
10 |
license: apache-2.0
|
11 |
metrics:
|
12 |
-
-
|
13 |
-
-
|
14 |
-
-
|
|
|
15 |
base_model:
|
16 |
- almanach/camembert-large
|
17 |
-
pipeline_tag: token-classification
|
18 |
---
|
19 |
|
20 |
## INTRODUCTION:
|
21 |
-
This model, developed as part of the [BookNLP-fr project](https://github.com/lattice-8094/fr-litbank), is a **
|
22 |
|
23 |
-
|
24 |
-
- mentions of characters (PER): pronouns (je, tu, il, ...), possessive pronouns (mon, ton, son, ...), common nouns (le capitaine, la princesse, ...) and proper nouns (Indiana Delmare, Honoré de Pardaillan, ...)
|
25 |
-
- facilities (FAC): château, sentier, chambre, couloir, ...
|
26 |
-
- time (TIME): le règne de Louis XIV, ce matin, en juillet, ...
|
27 |
-
- geo-political entities (GPE): Montrouge, France, le petit hameau, ...
|
28 |
-
- locations (LOC): le sud, Mars, l'océan, le bois, ...
|
29 |
-
- vehicles (VEH): avion, voitures, calèche, vélos, ...
|
30 |
|
31 |
## MODEL PERFORMANCES (LOOCV):
|
32 |
-
|
33 |
-
|
34 |
-
|
35 |
-
|
|
36 |
-
|
37 |
|
38 |
## TRAINING PARAMETERS:
|
39 |
-
- Entities types:
|
40 |
-
- Tagging scheme: BIOES
|
41 |
-
- Nested entities levels: [0, 1]
|
42 |
- Split strategy: Leave-one-out cross-validation (29 files)
|
43 |
- Train/Validation split: 0.85 / 0.15
|
44 |
-
- Batch size: 16
|
45 |
-
- Initial learning rate: 0.
|
46 |
|
47 |
## MODEL ARCHITECTURE:
|
48 |
-
Model Input:
|
49 |
-
|
50 |
-
-
|
51 |
-
|
52 |
-
-
|
53 |
-
-
|
54 |
-
-
|
55 |
-
-
|
56 |
-
|
57 |
-
-
|
58 |
-
|
59 |
-
-
|
60 |
-
|
61 |
-
-
|
62 |
-
-
|
63 |
-
-
|
64 |
-
|
65 |
-
-
|
66 |
-
|
67 |
-
|
68 |
|
69 |
## HOW TO USE:
|
70 |
*** IN CONSTRUCTION ***
|
@@ -72,42 +123,36 @@ Model Output: BIOES labels sequence
|
|
72 |
## TRAINING CORPUS:
|
73 |
| | Document | Tokens Count | Is included in model eval |
|
74 |
|----|----------------------------------------------------------------|----------------|------------------------------------|
|
75 |
-
| 0 | 1836_Gautier-Theophile_La-morte-amoureuse | 14,299 tokens | True
|
76 |
-
| 1 | 1840_Sand-George_Pauline | 12,315 tokens | True
|
77 |
-
| 2 | 1842_Balzac-Honore-de_La-Maison-du-chat-qui-pelote | 24,776 tokens | True
|
78 |
-
| 3 | 1844_Balzac-Honore-de_La-Maison-Nucingen | 30,987 tokens | True
|
79 |
-
| 4 | 1844_Balzac-Honore-de_Sarrasine | 15,408 tokens | True
|
80 |
-
| 5 | 1856_Cousin-Victor_Madame-de-Hautefort | 11,768 tokens | True
|
81 |
-
| 6 | 1863_Gautier-Theophile_Le-capitaine-Fracasse | 11,834 tokens | True
|
82 |
-
| 7 | 1873_Zola-Emile_Le-ventre-de-Paris | 12,557 tokens | True
|
83 |
-
| 8 | 1881_Flaubert-Gustave_Bouvard-et-Pecuchet | 12,281 tokens | True
|
84 |
-
| 9 | 1882_Guy-de-Maupassant_Mademoiselle-Fifi-1_1-MADEMOISELLE-FIFI | 5,425 tokens | True
|
85 |
-
| 10 | 1882_Guy-de-Maupassant_Mademoiselle-Fifi-1_2-MADAME-BAPTISTE | 2,554 tokens | True
|
86 |
-
| 11 | 1882_Guy-de-Maupassant_Mademoiselle-Fifi-1_3-LA-ROUILLE | 2,929 tokens | True
|
87 |
-
| 12 | 1882_Guy-de-Maupassant_Mademoiselle-Fifi-2_1-MARROCA | 4,067 tokens | True
|
88 |
-
| 13 | 1882_Guy-de-Maupassant_Mademoiselle-Fifi-2_2-LA-BUCHE | 2,251 tokens | True
|
89 |
-
| 14 | 1882_Guy-de-Maupassant_Mademoiselle-Fifi-2_3-LA-RELIQUE | 2,034 tokens | True
|
90 |
-
| 15 | 1882_Guy-de-Maupassant_Mademoiselle-Fifi-3_1-FOU | 1,864 tokens | True
|
91 |
-
| 16 | 1882_Guy-de-Maupassant_Mademoiselle-Fifi-3_2-REVEIL | 2,141 tokens | True
|
92 |
-
| 17 | 1882_Guy-de-Maupassant_Mademoiselle-Fifi-3_3-UNE-RUSE | 2,441 tokens | True
|
93 |
-
| 18 | 1882_Guy-de-Maupassant_Mademoiselle-Fifi-3_4-A-CHEVAL | 2,860 tokens | True
|
94 |
-
| 19 | 1882_Guy-de-Maupassant_Mademoiselle-Fifi-3_5-UN-REVEILLON | 2,343 tokens | True
|
95 |
-
| 20 | 1901_Lucie-Achard_Rosalie-de-Constant-sa-famille-et-ses-amis | 12,703 tokens | True
|
96 |
-
| 21 | 1903_Conan-Laure_Elisabeth_Seton | 13,023 tokens | True
|
97 |
-
| 22 | 1904_Rolland-Romain_Jean-Christophe_Tome-I-L-aube | 10,982 tokens | True
|
98 |
-
| 23 | 1904_Rolland-Romain_Jean-Christophe_Tome-II-Le-matin | 10,305 tokens | True
|
99 |
-
| 24 | 1917_Adèle-Bourgeois_Némoville | 12,389 tokens | True
|
100 |
-
| 25 | 1923_Radiguet-Raymond_Le-diable-au-corps | 14,637 tokens | True
|
101 |
-
| 26 | 1926_Audoux-Marguerite_De-la-ville-au-moulin | 11,902 tokens | True
|
102 |
-
| 27 | 1937_Audoux-Marguerite_Douce-Lumiere | 12,285 tokens | True
|
103 |
-
| 28 | Manon_Lescaut_PEDRO | 71,219 tokens | True
|
104 |
| 29 | TOTAL | 346,579 tokens | 29 files used for cross-validation |
|
105 |
|
106 |
-
## PREDICTIONS CONFUSION MATRIX:
|
107 |
-
| Gold Labels | PER | O | support |
|
108 |
-
|---------------|--------|-------|-----------|
|
109 |
-
| PER | 40,227 | 3,200 | 43,427 |
|
110 |
-
| O | 3,401 | 0 | 3,401 |
|
111 |
-
|
112 |
## CONTACT:
|
113 |
mail: antoine [dot] bourgois [at] protonmail [dot] com
|
|
|
|
|
1 |
---
|
2 |
language: fr
|
3 |
tags:
|
4 |
+
- coreference-resolution
|
5 |
+
- anaphora-resolution
|
6 |
+
- mentions-linking
|
7 |
+
- literary-texts
|
8 |
- camembert
|
9 |
- literary-texts
|
10 |
- nested-entities
|
11 |
- BookNLP-fr
|
12 |
license: apache-2.0
|
13 |
metrics:
|
14 |
+
- MUC
|
15 |
+
- B3
|
16 |
+
- CEAF
|
17 |
+
- CoNLL-F1
|
18 |
base_model:
|
19 |
- almanach/camembert-large
|
|
|
20 |
---
|
21 |
|
22 |
## INTRODUCTION:
|
23 |
+
This model, developed as part of the [BookNLP-fr project](https://github.com/lattice-8094/fr-litbank), is a **coreference resolution model** built on top of [camembert-large](https://huggingface.co/almanach/camembert-large) embeddings. It is trained to link mentions of the same entity across a text, focusing on literary works in French.
|
24 |
|
25 |
+
This specific model has been trained to link entities of the following types: PER.
|
26 |
|
27 |
## MODEL PERFORMANCES (LOOCV):
|
28 |
+
Overall Coreference Resolution Performances for non-overlapping windows of different length:
|
29 |
+
| | Window width (tokens) | Document count | Sample count | MUC F1 | B3 F1 | CEAFe F1 | CONLL F1 |
|
30 |
+
|----|-------------------------|------------------|----------------|----------|---------|------------|------------|
|
31 |
+
| 0 | 500 | 29 | 677 | 92.18% | 83.86% | 76.86% | 84.30% |
|
32 |
+
| 1 | 1,000 | 29 | 332 | 92.65% | 79.79% | 71.77% | 81.40% |
|
33 |
+
| 2 | 2,000 | 28 | 162 | 93.29% | 75.85% | 67.34% | 78.83% |
|
34 |
+
| 3 | 5,000 | 19 | 56 | 93.76% | 69.60% | 61.16% | 74.84% |
|
35 |
+
| 4 | 10,000 | 18 | 27 | 94.28% | 65.73% | 58.59% | 72.86% |
|
36 |
+
| 5 | 25,000 | 2 | 3 | 94.76% | 62.48% | 53.33% | 70.19% |
|
37 |
+
| 6 | 50,000 | 1 | 1 | 97.39% | 56.43% | 47.40% | 67.07% |
|
38 |
+
|
39 |
+
Coreference Resolution Performances on the fully annotated sample for each document:
|
40 |
+
| | Token count | Mention count | MUC F1 | B3 F1 | CEAFe F1 | CONLL F1 |
|
41 |
+
|----|---------------|-----------------|----------|---------|------------|------------|
|
42 |
+
| 0 | 1,864 | 253 | 98.16% | 95.39% | 60.34% | 84.63% |
|
43 |
+
| 1 | 2,034 | 321 | 97.47% | 92.79% | 80.04% | 90.10% |
|
44 |
+
| 2 | 2,141 | 297 | 95.06% | 77.99% | 65.08% | 79.38% |
|
45 |
+
| 3 | 2,251 | 235 | 91.95% | 80.47% | 46.56% | 73.00% |
|
46 |
+
| 4 | 2,343 | 239 | 83.87% | 61.95% | 43.58% | 63.13% |
|
47 |
+
| 5 | 2,441 | 314 | 91.85% | 55.70% | 60.82% | 69.46% |
|
48 |
+
| 6 | 2,554 | 330 | 90.24% | 65.27% | 72.36% | 75.96% |
|
49 |
+
| 7 | 2,860 | 369 | 93.65% | 84.89% | 74.93% | 84.49% |
|
50 |
+
| 8 | 2,929 | 386 | 95.65% | 78.21% | 64.23% | 79.37% |
|
51 |
+
| 9 | 4,067 | 429 | 97.46% | 85.20% | 62.52% | 81.73% |
|
52 |
+
| 10 | 5,425 | 558 | 90.46% | 53.03% | 59.52% | 67.67% |
|
53 |
+
| 11 | 10,305 | 1,436 | 96.37% | 74.83% | 59.91% | 77.04% |
|
54 |
+
| 12 | 10,982 | 1,095 | 97.18% | 65.30% | 60.49% | 74.32% |
|
55 |
+
| 13 | 11,768 | 1,734 | 93.30% | 64.14% | 64.12% | 73.85% |
|
56 |
+
| 14 | 11,834 | 600 | 92.21% | 67.51% | 60.74% | 73.49% |
|
57 |
+
| 15 | 11,902 | 1,692 | 95.03% | 58.83% | 45.59% | 66.49% |
|
58 |
+
| 16 | 12,281 | 1,089 | 95.06% | 62.05% | 72.55% | 76.55% |
|
59 |
+
| 17 | 12,285 | 1,489 | 95.28% | 77.84% | 57.43% | 76.85% |
|
60 |
+
| 18 | 12,315 | 1,501 | 95.36% | 57.07% | 64.26% | 72.23% |
|
61 |
+
| 19 | 12,389 | 1,654 | 93.19% | 54.21% | 51.84% | 66.41% |
|
62 |
+
| 20 | 12,557 | 1,085 | 92.30% | 66.97% | 46.65% | 68.64% |
|
63 |
+
| 21 | 12,703 | 1,731 | 90.40% | 53.70% | 61.37% | 68.49% |
|
64 |
+
| 22 | 13,023 | 1,559 | 93.86% | 61.71% | 62.41% | 72.66% |
|
65 |
+
| 23 | 14,299 | 1,582 | 97.23% | 69.25% | 67.04% | 77.84% |
|
66 |
+
| 24 | 14,637 | 2,127 | 95.78% | 71.34% | 63.28% | 76.80% |
|
67 |
+
| 25 | 15,408 | 1,769 | 92.85% | 54.11% | 56.12% | 67.69% |
|
68 |
+
| 26 | 24,776 | 2,716 | 94.31% | 63.51% | 54.12% | 70.65% |
|
69 |
+
| 27 | 30,987 | 2,980 | 89.55% | 54.25% | 59.68% | 67.83% |
|
70 |
+
| 28 | 71,219 | 11,857 | 97.38% | 50.85% | 45.93% | 64.72% |
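The CONLL F1 column is the unweighted mean of the MUC, B3 and CEAFe F1 scores. A small check against row 0 of the per-document table (the function name is hypothetical):

```python
def conll_f1(muc, b3, ceafe):
    """Unweighted mean of the three coreference F1 scores (in percent)."""
    return (muc + b3 + ceafe) / 3.0


# Row 0 of the per-document table: 1,864 tokens, 253 mentions.
print(f"{conll_f1(98.16, 95.39, 60.34):.2f}%")
```

A few rows differ from this recomputation by 0.01 points, presumably because the published average was taken over unrounded scores.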
|
71 |
|
72 |
## TRAINING PARAMETERS:
|
73 |
+
- Entities types: PER
|
|
|
|
|
74 |
- Split strategy: Leave-one-out cross-validation (29 files)
|
75 |
- Train/Validation split: 0.85 / 0.15
|
76 |
+
- Batch size: 16,000
|
77 |
+
- Initial learning rate: 0.0004
|
78 |
+
- Focal loss gamma: 1
|
79 |
+
- Focal loss alpha: 0.25
|
80 |
+
- Pronoun lookup antecedents: 30
|
81 |
+
- Common and Proper nouns lookup antecedents: 300
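One plausible reading of the two lookup parameters is that each mention is paired only with a bounded number of preceding mentions: 30 for pronouns, 300 for common and proper nouns. The sketch below illustrates that reading and is not the project's implementation; the category labels and function name are hypothetical.

```python
# Assumed meaning: how many preceding mentions are considered as antecedents.
LOOKUP = {"pronoun": 30, "common_noun": 300, "proper_noun": 300}


def candidate_pairs(mention_categories):
    """Yield (antecedent_index, mention_index) pairs in document order."""
    for i, category in enumerate(mention_categories):
        start = max(0, i - LOOKUP[category])
        for j in range(start, i):
            yield j, i


# 40 proper nouns followed by one pronoun: the pronoun only looks back 30 mentions.
categories = ["proper_noun"] * 40 + ["pronoun"]
pronoun_pairs = [(j, i) for j, i in candidate_pairs(categories) if i == 40]
print(len(pronoun_pairs), pronoun_pairs[0])
```

Bounding the lookup keeps the number of scored pairs linear in the number of mentions instead of quadratic, which matters for long texts such as the 71,219-token document in the corpus.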
|
82 |
|
83 |
## MODEL ARCHITECTURE:
|
84 |
+
Model Input: 2,165 dimensions vector
|
85 |
+
- Concatenated maximum context camembert-large embeddings (2 * 1,024 = 2,048 dimensions)
|
86 |
+
- Additional mentions features (106 dimensions):
|
87 |
+
- Length of mentions
|
88 |
+
- Position of the mention's start token within the sentence
|
89 |
+
- Grammatical category of the mentions (pronoun, common noun, proper noun)
|
90 |
+
- Dependency relation of the mention's head (one-hot encoded)
|
91 |
+
- Gender of the mentions (one-hot encoded)
|
92 |
+
- Number (singular/plural) of the mentions (one-hot encoded)
|
93 |
+
- Grammatical person of the mentions (one-hot encoded)
|
94 |
+
- Additional mention pairs features (11 dimensions):
|
95 |
+
- Distance between mention IDs
|
96 |
+
- Distance between start tokens of mentions
|
97 |
+
- Distance between end tokens of mentions
|
98 |
+
- Distance between sentences containing mentions
|
99 |
+
- Distance between paragraphs containing mentions
|
100 |
+
- Difference in nesting levels of mentions
|
101 |
+
- Ratio of shared tokens between mentions
|
102 |
+
- Exact text match between mentions (binary)
|
103 |
+
- Exact match of mention heads (binary)
|
104 |
+
- Match of syntactic heads between mentions (binary)
|
105 |
+
- Match of entity types between mentions (binary)
|
106 |
+
|
107 |
+
- Hidden Layers:
|
108 |
+
- Number of layers: 3
|
109 |
+
- Units per layer: 1,900 nodes
|
110 |
+
- Activation function: relu
|
111 |
+
- Dropout rate: 0.6
|
112 |
+
|
113 |
+
- Final Layer:
|
114 |
+
- Type: Linear
|
115 |
+
- Input: 1,900 dimensions
|
116 |
+
- Output: 1 dimension (mention pair coreference score)
|
117 |
+
|
118 |
+
Model Output: Continuous prediction between 0 (not coreferent) and 1 (coreferent) indicating the degree of confidence.
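The card does not say how pairwise scores are turned into coreference chains. One common approach, shown here purely as an assumption, links each mention to its best-scoring antecedent when that score exceeds a threshold. The function name and the 0.5 threshold are hypothetical.

```python
def cluster_mentions(n_mentions, pair_scores, threshold=0.5):
    """Group mentions into entities from pairwise coreference scores.

    pair_scores maps (antecedent_index, mention_index) to a score in [0, 1],
    with antecedent_index < mention_index. Each mention joins the entity of
    its best-scoring antecedent if that score exceeds `threshold`; otherwise
    it starts a new entity. Returns one entity id per mention.
    """
    best = {}
    for (j, i), score in pair_scores.items():
        if score > threshold and score > best.get(i, (-1, 0.0))[1]:
            best[i] = (j, score)

    entity_of = list(range(n_mentions))
    # Antecedents precede their mentions, so entity_of[j] is final when i is reached.
    for i in range(n_mentions):
        if i in best:
            entity_of[i] = entity_of[best[i][0]]
    return entity_of


scores = {(0, 1): 0.9, (0, 2): 0.2, (1, 2): 0.4, (0, 3): 0.6, (2, 3): 0.8}
print(cluster_mentions(4, scores))
```

In this toy input, mention 1 joins mention 0, mention 2 has no antecedent above the threshold and starts its own entity, and mention 3 prefers mention 2 (0.8) over mention 0 (0.6).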
|
119 |
|
120 |
## HOW TO USE:
|
121 |
*** IN CONSTRUCTION ***
|
|
|
123 |
## TRAINING CORPUS:
|
124 |
| | Document | Tokens Count | Is included in model eval |
|
125 |
|----|----------------------------------------------------------------|----------------|------------------------------------|
|
126 |
+
| 0 | 1836_Gautier-Theophile_La-morte-amoureuse | 14,299 tokens | **True** |
|
127 |
+
| 1 | 1840_Sand-George_Pauline | 12,315 tokens | **True** |
|
128 |
+
| 2 | 1842_Balzac-Honore-de_La-Maison-du-chat-qui-pelote | 24,776 tokens | **True** |
|
129 |
+
| 3 | 1844_Balzac-Honore-de_La-Maison-Nucingen | 30,987 tokens | **True** |
|
130 |
+
| 4 | 1844_Balzac-Honore-de_Sarrasine | 15,408 tokens | **True** |
|
131 |
+
| 5 | 1856_Cousin-Victor_Madame-de-Hautefort | 11,768 tokens | **True** |
|
132 |
+
| 6 | 1863_Gautier-Theophile_Le-capitaine-Fracasse | 11,834 tokens | **True** |
|
133 |
+
| 7 | 1873_Zola-Emile_Le-ventre-de-Paris | 12,557 tokens | **True** |
|
134 |
+
| 8 | 1881_Flaubert-Gustave_Bouvard-et-Pecuchet | 12,281 tokens | **True** |
|
135 |
+
| 9 | 1882_Guy-de-Maupassant_Mademoiselle-Fifi-1_1-MADEMOISELLE-FIFI | 5,425 tokens | **True** |
|
136 |
+
| 10 | 1882_Guy-de-Maupassant_Mademoiselle-Fifi-1_2-MADAME-BAPTISTE | 2,554 tokens | **True** |
|
137 |
+
| 11 | 1882_Guy-de-Maupassant_Mademoiselle-Fifi-1_3-LA-ROUILLE | 2,929 tokens | **True** |
|
138 |
+
| 12 | 1882_Guy-de-Maupassant_Mademoiselle-Fifi-2_1-MARROCA | 4,067 tokens | **True** |
|
139 |
+
| 13 | 1882_Guy-de-Maupassant_Mademoiselle-Fifi-2_2-LA-BUCHE | 2,251 tokens | **True** |
|
140 |
+
| 14 | 1882_Guy-de-Maupassant_Mademoiselle-Fifi-2_3-LA-RELIQUE | 2,034 tokens | **True** |
|
141 |
+
| 15 | 1882_Guy-de-Maupassant_Mademoiselle-Fifi-3_1-FOU | 1,864 tokens | **True** |
|
142 |
+
| 16 | 1882_Guy-de-Maupassant_Mademoiselle-Fifi-3_2-REVEIL | 2,141 tokens | **True** |
|
143 |
+
| 17 | 1882_Guy-de-Maupassant_Mademoiselle-Fifi-3_3-UNE-RUSE | 2,441 tokens | **True** |
|
144 |
+
| 18 | 1882_Guy-de-Maupassant_Mademoiselle-Fifi-3_4-A-CHEVAL | 2,860 tokens | **True** |
|
145 |
+
| 19 | 1882_Guy-de-Maupassant_Mademoiselle-Fifi-3_5-UN-REVEILLON | 2,343 tokens | **True** |
|
146 |
+
| 20 | 1901_Lucie-Achard_Rosalie-de-Constant-sa-famille-et-ses-amis | 12,703 tokens | **True** |
|
147 |
+
| 21 | 1903_Conan-Laure_Elisabeth_Seton | 13,023 tokens | **True** |
|
148 |
+
| 22 | 1904_Rolland-Romain_Jean-Christophe_Tome-I-L-aube | 10,982 tokens | **True** |
|
149 |
+
| 23 | 1904_Rolland-Romain_Jean-Christophe_Tome-II-Le-matin | 10,305 tokens | **True** |
|
150 |
+
| 24 | 1917_Adèle-Bourgeois_Némoville | 12,389 tokens | **True** |
|
151 |
+
| 25 | 1923_Radiguet-Raymond_Le-diable-au-corps | 14,637 tokens | **True** |
|
152 |
+
| 26 | 1926_Audoux-Marguerite_De-la-ville-au-moulin | 11,902 tokens | **True** |
|
153 |
+
| 27 | 1937_Audoux-Marguerite_Douce-Lumiere | 12,285 tokens | **True** |
|
154 |
+
| 28 | Manon_Lescaut_PEDRO | 71,219 tokens | **True** |
|
155 |
| 29 | TOTAL | 346,579 tokens | 29 files used for cross-validation |
|
156 |
|
157 |
## CONTACT:
|
158 |
mail: antoine [dot] bourgois [at] protonmail [dot] com
|
final_model
ADDED
@@ -0,0 +1,3 @@
|
1 |
+
version https://git-lfs.github.com/spec/v1
|
2 |
+
oid sha256:ded44d7c40c73481125be81d5312700a92baa8fa2bb3c6255ed6910c1f97b3ae
|
3 |
+
size 45374764
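The three lines added for `final_model` are a Git LFS pointer, not the model weights themselves: they record the spec version, the SHA-256 of the real file, and its size in bytes. A small sketch for reading such a pointer (the function name is hypothetical):

```python
def parse_lfs_pointer(text):
    """Parse a Git LFS pointer file into a dict of its fields."""
    fields = dict(line.split(" ", 1) for line in text.strip().splitlines())
    algorithm, digest = fields["oid"].split(":", 1)
    return {
        "version": fields["version"],
        "hash_algorithm": algorithm,
        "digest": digest,
        "size_bytes": int(fields["size"]),
    }


POINTER = """version https://git-lfs.github.com/spec/v1
oid sha256:ded44d7c40c73481125be81d5312700a92baa8fa2bb3c6255ed6910c1f97b3ae
size 45374764
"""

info = parse_lfs_pointer(POINTER)
print(info["hash_algorithm"], f"{info['size_bytes'] / 1e6:.1f} MB")
```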
|