AntoineBourgois committed on
Commit
f5fbf09
1 Parent(s): b71d132

Upload 3 files

Files changed (3)
  1. JCLS_model_card.md +74 -55
  2. README.md +124 -79
  3. final_model +3 -0
JCLS_model_card.md CHANGED
@@ -1,70 +1,95 @@
1
-
2
  ---
3
  language: fr
4
  tags:
5
- - NER
 
 
 
6
  - camembert
7
  - literary-texts
8
  - nested-entities
9
  - BookNLP-fr
10
  license: apache-2.0
11
  metrics:
12
- - f1
13
- - precision
14
- - recall
 
15
  base_model:
16
  - almanach/camembert-large
17
- pipeline_tag: token-classification
18
  ---
19
 
20
  ## INTRODUCTION:
21
- This model, developed as part of the [BookNLP-fr project](https://github.com/lattice-8094/fr-litbank), is a **NER model** built on top of [camembert-large](https://huggingface.co/almanach/camembert-large) embeddings, trained to predict nested entities in french, specifically for literary texts.
22
 
23
- The predicted entities are:
24
- - mentions of characters (PER): pronouns (je, tu, il, ...), possessive pronouns (mon, ton, son, ...), common nouns (le capitaine, la princesse, ...) and proper nouns (Indiana Delmare, Honoré de Pardaillan, ...)
25
- - facilities (FAC): chatêau, sentier, chambre, couloir, ...
26
- - time (TIME): le règne de Louis XIV, ce matin, en juillet, ...
27
- - geo-political entities (GPE): Montrouge, France, le petit hameau, ...
28
- - locations (LOC): le sud, Mars, l'océan, le bois, ...
29
- - vehicles (VEH): avion, voitures, calèche, vélos, ...
30
 
31
  ## MODEL PERFORMANCES (LOOCV):
32
- | NER_tag | precision | recall | f1_score | support | support % |
33
- |-----------|-------------|----------|------------|-----------|-------------|
34
- | PER | 91.05% | 95.15% | 93.05% | 4,061 | 100.00% |
35
- | micro_avg | 91.05% | 95.15% | 93.05% | 4,061 | 100.00% |
36
- | macro_avg | 91.05% | 95.15% | 93.05% | 4,061 | 100.00% |
 
 
 
 
 
 
 
 
 
 
 
 
37
 
38
  ## TRAINING PARAMETERS:
39
- - Entities types: ['PER']
40
- - Tagging scheme: BIOES
41
- - Nested entities levels: [0, 1]
42
  - Split strategy: Leave-one-out cross-validation (29 files)
43
  - Train/Validation split: 0.85 / 0.15
44
- - Batch size: 16
45
- - Initial learning rate: 0.00014
 
 
 
 
46
 
47
  ## MODEL ARCHITECTURE:
48
- Model Input: Maximum context camembert-large embeddings (1024 dimensions)
49
-
50
- - Locked Dropout: 0.5
51
-
52
- - Projection layer:
53
- - layer type: highway layer
54
- - input: 1024 dimensions
55
- - output: 2048 dimensions
56
-
57
- - BiLSTM layer:
58
- - input: 2048 dimensions
59
- - output: 256 dimensions (hidden state)
60
-
61
- - Linear layer:
62
- - input: 256 dimensions
63
- - output: 5 dimensions (predicted labels with BIOES tagging scheme)
64
-
65
- - CRF layer
66
-
67
- Model Output: BIOES labels sequence
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
68
 
69
  ## HOW TO USE:
70
  *** IN CONSTRUCTION ***
@@ -81,9 +106,9 @@ Model Output: BIOES labels sequence
81
  | 6 | 1863_Gautier-Theophile_Le-capitaine-Fracasse | 11,834 tokens | False |
82
  | 7 | 1873_Zola-Emile_Le-ventre-de-Paris | 12,557 tokens | False |
83
  | 8 | 1881_Flaubert-Gustave_Bouvard-et-Pecuchet | 12,281 tokens | False |
84
- | 9 | 1882_Guy-de-Maupassant_Mademoiselle-Fifi-1_1-MADEMOISELLE-FIFI | 5,425 tokens | True |
85
- | 10 | 1882_Guy-de-Maupassant_Mademoiselle-Fifi-1_2-MADAME-BAPTISTE | 2,554 tokens | True |
86
- | 11 | 1882_Guy-de-Maupassant_Mademoiselle-Fifi-1_3-LA-ROUILLE | 2,929 tokens | True |
87
  | 12 | 1882_Guy-de-Maupassant_Mademoiselle-Fifi-2_1-MARROCA | 4,067 tokens | False |
88
  | 13 | 1882_Guy-de-Maupassant_Mademoiselle-Fifi-2_2-LA-BUCHE | 2,251 tokens | False |
89
  | 14 | 1882_Guy-de-Maupassant_Mademoiselle-Fifi-2_3-LA-RELIQUE | 2,034 tokens | False |
@@ -94,20 +119,14 @@ Model Output: BIOES labels sequence
94
  | 19 | 1882_Guy-de-Maupassant_Mademoiselle-Fifi-3_5-UN-REVEILLON | 2,343 tokens | False |
95
  | 20 | 1901_Lucie-Achard_Rosalie-de-Constant-sa-famille-et-ses-amis | 12,703 tokens | False |
96
  | 21 | 1903_Conan-Laure_Elisabeth_Seton | 13,023 tokens | False |
97
- | 22 | 1904_Rolland-Romain_Jean-Christophe_Tome-I-L-aube | 10,982 tokens | True |
98
  | 23 | 1904_Rolland-Romain_Jean-Christophe_Tome-II-Le-matin | 10,305 tokens | False |
99
  | 24 | 1917_Adèle-Bourgeois_Némoville | 12,389 tokens | False |
100
  | 25 | 1923_Radiguet-Raymond_Le-diable-au-corps | 14,637 tokens | False |
101
- | 26 | 1926_Audoux-Marguerite_De-la-ville-au-moulin | 11,902 tokens | True |
102
  | 27 | 1937_Audoux-Marguerite_Douce-Lumiere | 12,285 tokens | False |
103
  | 28 | Manon_Lescaut_PEDRO | 71,219 tokens | False |
104
  | 29 | TOTAL | 346,579 tokens | 5 files used for cross-validation |
105
 
106
- ## PREDICTIONS CONFUSION MATRIX:
107
- | Gold Labels | PER | O | support |
108
- |---------------|-------|-----|-----------|
109
- | PER | 3,864 | 197 | 4,061 |
110
- | O | 370 | 0 | 370 |
111
-
112
  ## CONTACT:
113
  mail: antoine [dot] bourgois [at] protonmail [dot] com
 
 
1
  ---
2
  language: fr
3
  tags:
4
+ - coreference-resolution
5
+ - anaphora-resolution
6
+ - mentions-linking
7
+ - literary-texts
8
  - camembert
9
  - literary-texts
10
  - nested-entities
11
  - BookNLP-fr
12
  license: apache-2.0
13
  metrics:
14
+ - MUC
15
+ - B3
16
+ - CEAF
17
+ - CoNLL-F1
18
  base_model:
19
  - almanach/camembert-large
 
20
  ---
21
 
22
  ## INTRODUCTION:
23
+ This model, developed as part of the [BookNLP-fr project](https://github.com/lattice-8094/fr-litbank), is a **coreference resolution model** built on top of [camembert-large](https://huggingface.co/almanach/camembert-large) embeddings. It is trained to link mentions of the same entity across a text, focusing on literary works in French.
24
 
25
+ This specific model has been trained to link entities of the following types: PER.
 
 
 
 
 
 
26
 
27
  ## MODEL PERFORMANCES (LOOCV):
28
+ Overall Coreference Resolution Performances for non-overlapping windows of different length:
29
+ | | Window width (tokens) | Document count | Sample count | MUC F1 | B3 F1 | CEAFe F1 | CONLL F1 |
30
+ |----|-------------------------|------------------|----------------|----------|---------|------------|------------|
31
+ | 0 | 500 | 5 | 64 | 93.49% | 86.27% | 77.85% | 85.87% |
32
+ | 1 | 1,000 | 5 | 30 | 93.68% | 81.32% | 71.92% | 82.31% |
33
+ | 2 | 2,000 | 5 | 14 | 93.98% | 76.90% | 67.26% | 79.38% |
34
+ | 3 | 5,000 | 3 | 5 | 94.83% | 68.34% | 59.88% | 74.35% |
35
+ | 4 | 10,000 | 2 | 2 | 96.16% | 62.22% | 57.12% | 71.84% |
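
The document and sample counts in the window table can be reproduced from the token counts of the five evaluation documents. The sketch below assumes that each document is cut into consecutive full-width windows and that any leftover tokens shorter than a full window are dropped; the card does not state this rule, but it is consistent with the counts reported above. The helper name `count_windows` is hypothetical.

```python
def count_windows(token_counts, width):
    """Return (documents with at least one full window, total full windows)."""
    # Assumption: trailing tokens that do not fill a whole window are discarded.
    per_doc = [n // width for n in token_counts]
    return sum(1 for c in per_doc if c > 0), sum(per_doc)


# Token counts of the five evaluation documents, as listed in this card.
EVAL_DOC_TOKENS = [5425, 2554, 2929, 10982, 11902]

for width in (500, 1000, 2000, 5000, 10000):
    docs, samples = count_windows(EVAL_DOC_TOKENS, width)
    print(f"{width:>6} tokens: {docs} documents, {samples} samples")
```

Under this assumption, a document shorter than the window width contributes no sample, which is why the document count falls as the window grows.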
36
+
37
+ Coreference Resolution Performances on the fully annotated sample for each document:
38
+ | | Token count | Mention count | MUC F1 | B3 F1 | CEAFe F1 | CONLL F1 |
39
+ |----|---------------|-----------------|----------|---------|------------|------------|
40
+ | 0 | 2,554 | 330 | 90.24% | 65.27% | 72.36% | 75.96% |
41
+ | 1 | 2,929 | 386 | 95.65% | 78.21% | 64.23% | 79.37% |
42
+ | 2 | 5,425 | 558 | 90.46% | 53.03% | 59.52% | 67.67% |
43
+ | 3 | 10,982 | 1,095 | 97.18% | 65.30% | 60.49% | 74.32% |
44
+ | 4 | 11,902 | 1,692 | 95.03% | 58.83% | 45.59% | 66.49% |
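
The CONLL F1 column is the unweighted mean of the MUC, B3 and CEAFe F1 scores. A minimal sketch of that arithmetic, using the first row of the window table (the function name `conll_f1` is illustrative, not taken from the project code):

```python
def conll_f1(muc_f1, b3_f1, ceafe_f1):
    """Average the three coreference F1 scores (all given in percent)."""
    return (muc_f1 + b3_f1 + ceafe_f1) / 3


# 500-token windows: MUC 93.49, B3 86.27, CEAFe 77.85
print(f"{conll_f1(93.49, 86.27, 77.85):.2f}%")
```

Recomputing a row from the rounded percentages shown in the tables can differ from the reported value by 0.01, since the reported averages were presumably computed before rounding.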
45
 
46
  ## TRAINING PARAMETERS:
47
+ - Entities types: PER
 
 
48
  - Split strategy: Leave-one-out cross-validation (29 files)
49
  - Train/Validation split: 0.85 / 0.15
50
+ - Batch size: 16,000
51
+ - Initial learning rate: 0.0004
52
+ - Focal loss gamma: 1
53
+ - Focal loss alpha: 0.25
54
+ - Pronoun lookup antecedents: 30
55
+ - Common and Proper nouns lookup antecedents: 300
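
The card gives only the focal loss hyperparameters (gamma = 1, alpha = 0.25), not the exact formulation. The sketch below assumes the usual binary focal loss, in which alpha weights the positive (coreferent) class and 1 - alpha the negative class; whether the training code follows this convention is an assumption.

```python
import math


def binary_focal_loss(p, target, gamma=1.0, alpha=0.25, eps=1e-12):
    """Focal loss for one mention pair.

    p      -- predicted probability that the pair is coreferent
    target -- 1 if the pair is coreferent, 0 otherwise
    """
    # Probability assigned to the true class, clamped to avoid log(0).
    p_t = p if target == 1 else 1.0 - p
    p_t = min(max(p_t, eps), 1.0)
    # Assumed convention: alpha for positives, 1 - alpha for negatives.
    alpha_t = alpha if target == 1 else 1.0 - alpha
    return -alpha_t * (1.0 - p_t) ** gamma * math.log(p_t)


# A confident correct prediction is penalised far less than a hesitant one.
print(binary_focal_loss(0.9, 1), binary_focal_loss(0.6, 1))
```

The `(1 - p_t) ** gamma` factor down-weights pairs the model already classifies well, which matters here because most candidate mention pairs are not coreferent.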
56
 
57
  ## MODEL ARCHITECTURE:
58
+ Model Input: 2,165 dimensions vector
59
+ - Concatenated maximum context camembert-large embeddings (2 * 1,024 = 2,048 dimensions)
60
+ - Additional mentions features (106 dimensions):
61
+ - Length of mentions
62
+ - Position of the mention's start token within the sentence
63
+ - Grammatical category of the mentions (pronoun, common noun, proper noun)
64
+ - Dependency relation of the mention's head (one-hot encoded)
65
+ - Gender of the mentions (one-hot encoded)
66
+ - Number (singular/plural) of the mentions (one-hot encoded)
67
+ - Grammatical person of the mentions (one-hot encoded)
68
+ - Additional mention pairs features (11 dimensions):
69
+ - Distance between mention IDs
70
+ - Distance between start tokens of mentions
71
+ - Distance between end tokens of mentions
72
+ - Distance between sentences containing mentions
73
+ - Distance between paragraphs containing mentions
74
+ - Difference in nesting levels of mentions
75
+ - Ratio of shared tokens between mentions
76
+ - Exact text match between mentions (binary)
77
+ - Exact match of mention heads (binary)
78
+ - Match of syntactic heads between mentions (binary)
79
+ - Match of entity types between mentions (binary)
80
+
81
+ - Hidden Layers:
82
+ - Number of layers: 3
83
+ - Units per layer: 1,900 nodes
84
+ - Activation function: relu
85
+ - Dropout rate: 0.6
86
+
87
+ - Final Layer:
88
+ - Type: Linear
89
+ - Input: 1900 dimensions
90
+ - Output: 1 dimension (mention pair coreference score)
91
+
92
+ Model Output: Continuous prediction between 0 (not coreferent) and 1 (coreferent) indicating the degree of confidence.
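
The dimensions listed above can be checked with a few lines of arithmetic. The sketch below rebuilds the layer shapes of the feed-forward scorer from the figures in this section and counts its parameters. It assumes every layer has a bias term, which the card does not state; the constant and function names are illustrative.

```python
EMBEDDING_DIM = 1024      # camembert-large embedding size, one per mention
MENTION_FEATURES = 106    # additional mention features
PAIR_FEATURES = 11        # additional mention pair features
HIDDEN_LAYERS = 3
HIDDEN_UNITS = 1900

# Two mention embeddings are concatenated with both feature groups.
INPUT_DIM = 2 * EMBEDDING_DIM + MENTION_FEATURES + PAIR_FEATURES


def layer_shapes(input_dim=INPUT_DIM, hidden_layers=HIDDEN_LAYERS,
                 hidden_units=HIDDEN_UNITS):
    """Return (in_features, out_features) for each linear layer."""
    shapes = []
    in_dim = input_dim
    for _ in range(hidden_layers):
        shapes.append((in_dim, hidden_units))  # followed by ReLU and dropout 0.6
        in_dim = hidden_units
    shapes.append((in_dim, 1))                 # final layer: one score per pair
    return shapes


def parameter_count(shapes):
    """Weights plus biases (bias terms are an assumption)."""
    return sum(i * o + o for i, o in shapes)


print(INPUT_DIM)
print(layer_shapes())
print(parameter_count(layer_shapes()))
```

Under these assumptions the scorer has about 11.3 million parameters. Stored as 32-bit floats that is roughly 45.4 MB, the same order as the 45,374,764-byte `final_model` file added in this commit, though this is only a plausibility check. Since the final layer is linear, mapping its output to the 0 to 1 range presumably involves a sigmoid, which the card does not mention explicitly.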
93
 
94
  ## HOW TO USE:
95
  *** IN CONSTRUCTION ***
 
106
  | 6 | 1863_Gautier-Theophile_Le-capitaine-Fracasse | 11,834 tokens | False |
107
  | 7 | 1873_Zola-Emile_Le-ventre-de-Paris | 12,557 tokens | False |
108
  | 8 | 1881_Flaubert-Gustave_Bouvard-et-Pecuchet | 12,281 tokens | False |
109
+ | 9 | 1882_Guy-de-Maupassant_Mademoiselle-Fifi-1_1-MADEMOISELLE-FIFI | 5,425 tokens | **True** |
110
+ | 10 | 1882_Guy-de-Maupassant_Mademoiselle-Fifi-1_2-MADAME-BAPTISTE | 2,554 tokens | **True** |
111
+ | 11 | 1882_Guy-de-Maupassant_Mademoiselle-Fifi-1_3-LA-ROUILLE | 2,929 tokens | **True** |
112
  | 12 | 1882_Guy-de-Maupassant_Mademoiselle-Fifi-2_1-MARROCA | 4,067 tokens | False |
113
  | 13 | 1882_Guy-de-Maupassant_Mademoiselle-Fifi-2_2-LA-BUCHE | 2,251 tokens | False |
114
  | 14 | 1882_Guy-de-Maupassant_Mademoiselle-Fifi-2_3-LA-RELIQUE | 2,034 tokens | False |
 
119
  | 19 | 1882_Guy-de-Maupassant_Mademoiselle-Fifi-3_5-UN-REVEILLON | 2,343 tokens | False |
120
  | 20 | 1901_Lucie-Achard_Rosalie-de-Constant-sa-famille-et-ses-amis | 12,703 tokens | False |
121
  | 21 | 1903_Conan-Laure_Elisabeth_Seton | 13,023 tokens | False |
122
+ | 22 | 1904_Rolland-Romain_Jean-Christophe_Tome-I-L-aube | 10,982 tokens | **True** |
123
  | 23 | 1904_Rolland-Romain_Jean-Christophe_Tome-II-Le-matin | 10,305 tokens | False |
124
  | 24 | 1917_Adèle-Bourgeois_Némoville | 12,389 tokens | False |
125
  | 25 | 1923_Radiguet-Raymond_Le-diable-au-corps | 14,637 tokens | False |
126
+ | 26 | 1926_Audoux-Marguerite_De-la-ville-au-moulin | 11,902 tokens | **True** |
127
  | 27 | 1937_Audoux-Marguerite_Douce-Lumiere | 12,285 tokens | False |
128
  | 28 | Manon_Lescaut_PEDRO | 71,219 tokens | False |
129
  | 29 | TOTAL | 346,579 tokens | 5 files used for cross-validation |
130
 
 
 
 
 
 
 
131
  ## CONTACT:
132
  mail: antoine [dot] bourgois [at] protonmail [dot] com
README.md CHANGED
@@ -1,70 +1,121 @@
1
-
2
  ---
3
  language: fr
4
  tags:
5
- - NER
 
 
 
6
  - camembert
7
  - literary-texts
8
  - nested-entities
9
  - BookNLP-fr
10
  license: apache-2.0
11
  metrics:
12
- - f1
13
- - precision
14
- - recall
 
15
  base_model:
16
  - almanach/camembert-large
17
- pipeline_tag: token-classification
18
  ---
19
 
20
  ## INTRODUCTION:
21
- This model, developed as part of the [BookNLP-fr project](https://github.com/lattice-8094/fr-litbank), is a **NER model** built on top of [camembert-large](https://huggingface.co/almanach/camembert-large) embeddings, trained to predict nested entities in french, specifically for literary texts.
22
 
23
- The predicted entities are:
24
- - mentions of characters (PER): pronouns (je, tu, il, ...), possessive pronouns (mon, ton, son, ...), common nouns (le capitaine, la princesse, ...) and proper nouns (Indiana Delmare, Honoré de Pardaillan, ...)
25
- - facilities (FAC): chatêau, sentier, chambre, couloir, ...
26
- - time (TIME): le règne de Louis XIV, ce matin, en juillet, ...
27
- - geo-political entities (GPE): Montrouge, France, le petit hameau, ...
28
- - locations (LOC): le sud, Mars, l'océan, le bois, ...
29
- - vehicles (VEH): avion, voitures, calèche, vélos, ...
30
 
31
  ## MODEL PERFORMANCES (LOOCV):
32
- | NER_tag | precision | recall | f1_score | support | support % |
33
- |-----------|-------------|----------|------------|-----------|-------------|
34
- | PER | 91.87% | 92.63% | 92.25% | 43,427 | 100.00% |
35
- | micro_avg | 91.87% | 92.63% | 92.25% | 43,427 | 100.00% |
36
- | macro_avg | 91.87% | 92.63% | 92.25% | 43,427 | 100.00% |
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
37
 
38
  ## TRAINING PARAMETERS:
39
- - Entities types: ['PER']
40
- - Tagging scheme: BIOES
41
- - Nested entities levels: [0, 1]
42
  - Split strategy: Leave-one-out cross-validation (29 files)
43
  - Train/Validation split: 0.85 / 0.15
44
- - Batch size: 16
45
- - Initial learning rate: 0.00014
 
 
 
 
46
 
47
  ## MODEL ARCHITECTURE:
48
- Model Input: Maximum context camembert-large embeddings (1024 dimensions)
49
-
50
- - Locked Dropout: 0.5
51
-
52
- - Projection layer:
53
- - layer type: highway layer
54
- - input: 1024 dimensions
55
- - output: 2048 dimensions
56
-
57
- - BiLSTM layer:
58
- - input: 2048 dimensions
59
- - output: 256 dimensions (hidden state)
60
-
61
- - Linear layer:
62
- - input: 256 dimensions
63
- - output: 5 dimensions (predicted labels with BIOES tagging scheme)
64
-
65
- - CRF layer
66
-
67
- Model Output: BIOES labels sequence
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
68
 
69
  ## HOW TO USE:
70
  *** IN CONSTRUCTION ***
@@ -72,42 +123,36 @@ Model Output: BIOES labels sequence
72
  ## TRAINING CORPUS:
73
  | | Document | Tokens Count | Is included in model eval |
74
  |----|----------------------------------------------------------------|----------------|------------------------------------|
75
- | 0 | 1836_Gautier-Theophile_La-morte-amoureuse | 14,299 tokens | True |
76
- | 1 | 1840_Sand-George_Pauline | 12,315 tokens | True |
77
- | 2 | 1842_Balzac-Honore-de_La-Maison-du-chat-qui-pelote | 24,776 tokens | True |
78
- | 3 | 1844_Balzac-Honore-de_La-Maison-Nucingen | 30,987 tokens | True |
79
- | 4 | 1844_Balzac-Honore-de_Sarrasine | 15,408 tokens | True |
80
- | 5 | 1856_Cousin-Victor_Madame-de-Hautefort | 11,768 tokens | True |
81
- | 6 | 1863_Gautier-Theophile_Le-capitaine-Fracasse | 11,834 tokens | True |
82
- | 7 | 1873_Zola-Emile_Le-ventre-de-Paris | 12,557 tokens | True |
83
- | 8 | 1881_Flaubert-Gustave_Bouvard-et-Pecuchet | 12,281 tokens | True |
84
- | 9 | 1882_Guy-de-Maupassant_Mademoiselle-Fifi-1_1-MADEMOISELLE-FIFI | 5,425 tokens | True |
85
- | 10 | 1882_Guy-de-Maupassant_Mademoiselle-Fifi-1_2-MADAME-BAPTISTE | 2,554 tokens | True |
86
- | 11 | 1882_Guy-de-Maupassant_Mademoiselle-Fifi-1_3-LA-ROUILLE | 2,929 tokens | True |
87
- | 12 | 1882_Guy-de-Maupassant_Mademoiselle-Fifi-2_1-MARROCA | 4,067 tokens | True |
88
- | 13 | 1882_Guy-de-Maupassant_Mademoiselle-Fifi-2_2-LA-BUCHE | 2,251 tokens | True |
89
- | 14 | 1882_Guy-de-Maupassant_Mademoiselle-Fifi-2_3-LA-RELIQUE | 2,034 tokens | True |
90
- | 15 | 1882_Guy-de-Maupassant_Mademoiselle-Fifi-3_1-FOU | 1,864 tokens | True |
91
- | 16 | 1882_Guy-de-Maupassant_Mademoiselle-Fifi-3_2-REVEIL | 2,141 tokens | True |
92
- | 17 | 1882_Guy-de-Maupassant_Mademoiselle-Fifi-3_3-UNE-RUSE | 2,441 tokens | True |
93
- | 18 | 1882_Guy-de-Maupassant_Mademoiselle-Fifi-3_4-A-CHEVAL | 2,860 tokens | True |
94
- | 19 | 1882_Guy-de-Maupassant_Mademoiselle-Fifi-3_5-UN-REVEILLON | 2,343 tokens | True |
95
- | 20 | 1901_Lucie-Achard_Rosalie-de-Constant-sa-famille-et-ses-amis | 12,703 tokens | True |
96
- | 21 | 1903_Conan-Laure_Elisabeth_Seton | 13,023 tokens | True |
97
- | 22 | 1904_Rolland-Romain_Jean-Christophe_Tome-I-L-aube | 10,982 tokens | True |
98
- | 23 | 1904_Rolland-Romain_Jean-Christophe_Tome-II-Le-matin | 10,305 tokens | True |
99
- | 24 | 1917_Adèle-Bourgeois_Némoville | 12,389 tokens | True |
100
- | 25 | 1923_Radiguet-Raymond_Le-diable-au-corps | 14,637 tokens | True |
101
- | 26 | 1926_Audoux-Marguerite_De-la-ville-au-moulin | 11,902 tokens | True |
102
- | 27 | 1937_Audoux-Marguerite_Douce-Lumiere | 12,285 tokens | True |
103
- | 28 | Manon_Lescaut_PEDRO | 71,219 tokens | True |
104
  | 29 | TOTAL | 346,579 tokens | 29 files used for cross-validation |
105
 
106
- ## PREDICTIONS CONFUSION MATRIX:
107
- | Gold Labels | PER | O | support |
108
- |---------------|--------|-------|-----------|
109
- | PER | 40,227 | 3,200 | 43,427 |
110
- | O | 3,401 | 0 | 3,401 |
111
-
112
  ## CONTACT:
113
  mail: antoine [dot] bourgois [at] protonmail [dot] com
 
 
1
  ---
2
  language: fr
3
  tags:
4
+ - coreference-resolution
5
+ - anaphora-resolution
6
+ - mentions-linking
7
+ - literary-texts
8
  - camembert
9
  - literary-texts
10
  - nested-entities
11
  - BookNLP-fr
12
  license: apache-2.0
13
  metrics:
14
+ - MUC
15
+ - B3
16
+ - CEAF
17
+ - CoNLL-F1
18
  base_model:
19
  - almanach/camembert-large
 
20
  ---
21
 
22
  ## INTRODUCTION:
23
+ This model, developed as part of the [BookNLP-fr project](https://github.com/lattice-8094/fr-litbank), is a **coreference resolution model** built on top of [camembert-large](https://huggingface.co/almanach/camembert-large) embeddings. It is trained to link mentions of the same entity across a text, focusing on literary works in French.
24
 
25
+ This specific model has been trained to link entities of the following types: PER.
 
 
 
 
 
 
26
 
27
  ## MODEL PERFORMANCES (LOOCV):
28
+ Overall Coreference Resolution Performances for non-overlapping windows of different length:
29
+ | | Window width (tokens) | Document count | Sample count | MUC F1 | B3 F1 | CEAFe F1 | CONLL F1 |
30
+ |----|-------------------------|------------------|----------------|----------|---------|------------|------------|
31
+ | 0 | 500 | 29 | 677 | 92.18% | 83.86% | 76.86% | 84.30% |
32
+ | 1 | 1,000 | 29 | 332 | 92.65% | 79.79% | 71.77% | 81.40% |
33
+ | 2 | 2,000 | 28 | 162 | 93.29% | 75.85% | 67.34% | 78.83% |
34
+ | 3 | 5,000 | 19 | 56 | 93.76% | 69.60% | 61.16% | 74.84% |
35
+ | 4 | 10,000 | 18 | 27 | 94.28% | 65.73% | 58.59% | 72.86% |
36
+ | 5 | 25,000 | 2 | 3 | 94.76% | 62.48% | 53.33% | 70.19% |
37
+ | 6 | 50,000 | 1 | 1 | 97.39% | 56.43% | 47.40% | 67.07% |
38
+
39
+ Coreference Resolution Performances on the fully annotated sample for each document:
40
+ | | Token count | Mention count | MUC F1 | B3 F1 | CEAFe F1 | CONLL F1 |
41
+ |----|---------------|-----------------|----------|---------|------------|------------|
42
+ | 0 | 1,864 | 253 | 98.16% | 95.39% | 60.34% | 84.63% |
43
+ | 1 | 2,034 | 321 | 97.47% | 92.79% | 80.04% | 90.10% |
44
+ | 2 | 2,141 | 297 | 95.06% | 77.99% | 65.08% | 79.38% |
45
+ | 3 | 2,251 | 235 | 91.95% | 80.47% | 46.56% | 73.00% |
46
+ | 4 | 2,343 | 239 | 83.87% | 61.95% | 43.58% | 63.13% |
47
+ | 5 | 2,441 | 314 | 91.85% | 55.70% | 60.82% | 69.46% |
48
+ | 6 | 2,554 | 330 | 90.24% | 65.27% | 72.36% | 75.96% |
49
+ | 7 | 2,860 | 369 | 93.65% | 84.89% | 74.93% | 84.49% |
50
+ | 8 | 2,929 | 386 | 95.65% | 78.21% | 64.23% | 79.37% |
51
+ | 9 | 4,067 | 429 | 97.46% | 85.20% | 62.52% | 81.73% |
52
+ | 10 | 5,425 | 558 | 90.46% | 53.03% | 59.52% | 67.67% |
53
+ | 11 | 10,305 | 1,436 | 96.37% | 74.83% | 59.91% | 77.04% |
54
+ | 12 | 10,982 | 1,095 | 97.18% | 65.30% | 60.49% | 74.32% |
55
+ | 13 | 11,768 | 1,734 | 93.30% | 64.14% | 64.12% | 73.85% |
56
+ | 14 | 11,834 | 600 | 92.21% | 67.51% | 60.74% | 73.49% |
57
+ | 15 | 11,902 | 1,692 | 95.03% | 58.83% | 45.59% | 66.49% |
58
+ | 16 | 12,281 | 1,089 | 95.06% | 62.05% | 72.55% | 76.55% |
59
+ | 17 | 12,285 | 1,489 | 95.28% | 77.84% | 57.43% | 76.85% |
60
+ | 18 | 12,315 | 1,501 | 95.36% | 57.07% | 64.26% | 72.23% |
61
+ | 19 | 12,389 | 1,654 | 93.19% | 54.21% | 51.84% | 66.41% |
62
+ | 20 | 12,557 | 1,085 | 92.30% | 66.97% | 46.65% | 68.64% |
63
+ | 21 | 12,703 | 1,731 | 90.40% | 53.70% | 61.37% | 68.49% |
64
+ | 22 | 13,023 | 1,559 | 93.86% | 61.71% | 62.41% | 72.66% |
65
+ | 23 | 14,299 | 1,582 | 97.23% | 69.25% | 67.04% | 77.84% |
66
+ | 24 | 14,637 | 2,127 | 95.78% | 71.34% | 63.28% | 76.80% |
67
+ | 25 | 15,408 | 1,769 | 92.85% | 54.11% | 56.12% | 67.69% |
68
+ | 26 | 24,776 | 2,716 | 94.31% | 63.51% | 54.12% | 70.65% |
69
+ | 27 | 30,987 | 2,980 | 89.55% | 54.25% | 59.68% | 67.83% |
70
+ | 28 | 71,219 | 11,857 | 97.38% | 50.85% | 45.93% | 64.72% |
71
 
72
  ## TRAINING PARAMETERS:
73
+ - Entities types: PER
 
 
74
  - Split strategy: Leave-one-out cross-validation (29 files)
75
  - Train/Validation split: 0.85 / 0.15
76
+ - Batch size: 16,000
77
+ - Initial learning rate: 0.0004
78
+ - Focal loss gamma: 1
79
+ - Focal loss alpha: 0.25
80
+ - Pronoun lookup antecedents: 30
81
+ - Common and Proper nouns lookup antecedents: 300
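
The two "lookup antecedents" values are not defined further in the card. One plausible reading, sketched below purely as an assumption, is that each mention is paired with at most that many preceding mentions as candidate antecedents: 30 for pronouns and 300 for common and proper nouns. The function and category names are hypothetical.

```python
def candidate_pairs(categories, pronoun_lookup=30, noun_lookup=300):
    """Build (antecedent_index, mention_index) candidate pairs.

    categories -- grammatical category of each mention, in document order
                  ("pronoun", "common" or "proper")
    """
    pairs = []
    for i, category in enumerate(categories):
        # Assumed meaning: how many preceding mentions are considered.
        lookup = pronoun_lookup if category == "pronoun" else noun_lookup
        for j in range(max(0, i - lookup), i):
            pairs.append((j, i))
    return pairs


# 40 proper-noun mentions followed by one pronoun.
categories = ["proper"] * 40 + ["pronoun"]
pairs = candidate_pairs(categories)
print(len([p for p in pairs if p[1] == 40]))  # candidates for the pronoun: 30
```

A shorter window for pronouns reflects the intuition that a pronoun's antecedent is usually nearby, while a proper noun can refer back to a mention much earlier in the text.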
82
 
83
  ## MODEL ARCHITECTURE:
84
+ Model Input: 2,165 dimensions vector
85
+ - Concatenated maximum context camembert-large embeddings (2 * 1,024 = 2,048 dimensions)
86
+ - Additional mentions features (106 dimensions):
87
+ - Length of mentions
88
+ - Position of the mention's start token within the sentence
89
+ - Grammatical category of the mentions (pronoun, common noun, proper noun)
90
+ - Dependency relation of the mention's head (one-hot encoded)
91
+ - Gender of the mentions (one-hot encoded)
92
+ - Number (singular/plural) of the mentions (one-hot encoded)
93
+ - Grammatical person of the mentions (one-hot encoded)
94
+ - Additional mention pairs features (11 dimensions):
95
+ - Distance between mention IDs
96
+ - Distance between start tokens of mentions
97
+ - Distance between end tokens of mentions
98
+ - Distance between sentences containing mentions
99
+ - Distance between paragraphs containing mentions
100
+ - Difference in nesting levels of mentions
101
+ - Ratio of shared tokens between mentions
102
+ - Exact text match between mentions (binary)
103
+ - Exact match of mention heads (binary)
104
+ - Match of syntactic heads between mentions (binary)
105
+ - Match of entity types between mentions (binary)
106
+
107
+ - Hidden Layers:
108
+ - Number of layers: 3
109
+ - Units per layer: 1,900 nodes
110
+ - Activation function: relu
111
+ - Dropout rate: 0.6
112
+
113
+ - Final Layer:
114
+ - Type: Linear
115
+ - Input: 1900 dimensions
116
+ - Output: 1 dimension (mention pair coreference score)
117
+
118
+ Model Output: Continuous prediction between 0 (not coreferent) and 1 (coreferent) indicating the degree of confidence.
119
 
120
  ## HOW TO USE:
121
  *** IN CONSTRUCTION ***
 
123
  ## TRAINING CORPUS:
124
  | | Document | Tokens Count | Is included in model eval |
125
  |----|----------------------------------------------------------------|----------------|------------------------------------|
126
+ | 0 | 1836_Gautier-Theophile_La-morte-amoureuse | 14,299 tokens | **True** |
127
+ | 1 | 1840_Sand-George_Pauline | 12,315 tokens | **True** |
128
+ | 2 | 1842_Balzac-Honore-de_La-Maison-du-chat-qui-pelote | 24,776 tokens | **True** |
129
+ | 3 | 1844_Balzac-Honore-de_La-Maison-Nucingen | 30,987 tokens | **True** |
130
+ | 4 | 1844_Balzac-Honore-de_Sarrasine | 15,408 tokens | **True** |
131
+ | 5 | 1856_Cousin-Victor_Madame-de-Hautefort | 11,768 tokens | **True** |
132
+ | 6 | 1863_Gautier-Theophile_Le-capitaine-Fracasse | 11,834 tokens | **True** |
133
+ | 7 | 1873_Zola-Emile_Le-ventre-de-Paris | 12,557 tokens | **True** |
134
+ | 8 | 1881_Flaubert-Gustave_Bouvard-et-Pecuchet | 12,281 tokens | **True** |
135
+ | 9 | 1882_Guy-de-Maupassant_Mademoiselle-Fifi-1_1-MADEMOISELLE-FIFI | 5,425 tokens | **True** |
136
+ | 10 | 1882_Guy-de-Maupassant_Mademoiselle-Fifi-1_2-MADAME-BAPTISTE | 2,554 tokens | **True** |
137
+ | 11 | 1882_Guy-de-Maupassant_Mademoiselle-Fifi-1_3-LA-ROUILLE | 2,929 tokens | **True** |
138
+ | 12 | 1882_Guy-de-Maupassant_Mademoiselle-Fifi-2_1-MARROCA | 4,067 tokens | **True** |
139
+ | 13 | 1882_Guy-de-Maupassant_Mademoiselle-Fifi-2_2-LA-BUCHE | 2,251 tokens | **True** |
140
+ | 14 | 1882_Guy-de-Maupassant_Mademoiselle-Fifi-2_3-LA-RELIQUE | 2,034 tokens | **True** |
141
+ | 15 | 1882_Guy-de-Maupassant_Mademoiselle-Fifi-3_1-FOU | 1,864 tokens | **True** |
142
+ | 16 | 1882_Guy-de-Maupassant_Mademoiselle-Fifi-3_2-REVEIL | 2,141 tokens | **True** |
143
+ | 17 | 1882_Guy-de-Maupassant_Mademoiselle-Fifi-3_3-UNE-RUSE | 2,441 tokens | **True** |
144
+ | 18 | 1882_Guy-de-Maupassant_Mademoiselle-Fifi-3_4-A-CHEVAL | 2,860 tokens | **True** |
145
+ | 19 | 1882_Guy-de-Maupassant_Mademoiselle-Fifi-3_5-UN-REVEILLON | 2,343 tokens | **True** |
146
+ | 20 | 1901_Lucie-Achard_Rosalie-de-Constant-sa-famille-et-ses-amis | 12,703 tokens | **True** |
147
+ | 21 | 1903_Conan-Laure_Elisabeth_Seton | 13,023 tokens | **True** |
148
+ | 22 | 1904_Rolland-Romain_Jean-Christophe_Tome-I-L-aube | 10,982 tokens | **True** |
149
+ | 23 | 1904_Rolland-Romain_Jean-Christophe_Tome-II-Le-matin | 10,305 tokens | **True** |
150
+ | 24 | 1917_Adèle-Bourgeois_Némoville | 12,389 tokens | **True** |
151
+ | 25 | 1923_Radiguet-Raymond_Le-diable-au-corps | 14,637 tokens | **True** |
152
+ | 26 | 1926_Audoux-Marguerite_De-la-ville-au-moulin | 11,902 tokens | **True** |
153
+ | 27 | 1937_Audoux-Marguerite_Douce-Lumiere | 12,285 tokens | **True** |
154
+ | 28 | Manon_Lescaut_PEDRO | 71,219 tokens | **True** |
155
  | 29 | TOTAL | 346,579 tokens | 29 files used for cross-validation |
156
 
 
 
 
 
 
 
157
  ## CONTACT:
158
  mail: antoine [dot] bourgois [at] protonmail [dot] com
final_model ADDED
@@ -0,0 +1,3 @@
 
 
 
 
1
+ version https://git-lfs.github.com/spec/v1
2
+ oid sha256:ded44d7c40c73481125be81d5312700a92baa8fa2bb3c6255ed6910c1f97b3ae
3
+ size 45374764