AntoineBourgois/BookNLP-fr_coreference-resolution_camembert-large_PER

@@ -1,70 +1,95 @@
 ---
 language: fr
 tags:
-- NER
 - camembert
 - literary-texts
 - nested-entities
 - BookNLP-fr
 license: apache-2.0
 metrics:
-- f1
-- precision
-- recall
 base_model:
 - almanach/camembert-large
-pipeline_tag: token-classification
 ---
 ## INTRODUCTION:
-This model, developed as part of the [BookNLP-fr project](https://github.com/lattice-8094/fr-litbank), is a **NER model** built on top of [camembert-large](https://huggingface.co/almanach/camembert-large) embeddings, trained to predict nested entities in french, specifically for literary texts.
-The predicted entities are:
-- mentions of characters (PER): pronouns (je, tu, il, ...), possessive pronouns (mon, ton, son, ...), common nouns (le capitaine, la princesse, ...) and proper nouns (Indiana Delmare, Honoré de Pardaillan, ...)
-- facilities (FAC): chatêau, sentier, chambre, couloir, ...
-- time (TIME): le règne de Louis XIV, ce matin, en juillet, ...
-- geo-political entities (GPE): Montrouge, France, le petit hameau, ...
-- locations (LOC): le sud, Mars, l'océan, le bois, ...
-- vehicles (VEH): avion, voitures, calèche, vélos, ...
 ## MODEL PERFORMANCES (LOOCV):
-| NER_tag   | precision   | recall   | f1_score   | support   | support %   |
-|-----------|-------------|----------|------------|-----------|-------------|
-| PER       | 91.05%      | 95.15%   | 93.05%     | 4,061     | 100.00%     |
-| micro_avg | 91.05%      | 95.15%   | 93.05%     | 4,061     | 100.00%     |
-| macro_avg | 91.05%      | 95.15%   | 93.05%     | 4,061     | 100.00%     |
 ## TRAINING PARAMETERS:
-- Entities types: ['PER']
-- Tagging scheme: BIOES
-- Nested entities levels: [0, 1]
 - Split strategy: Leave-one-out cross-validation (29 files)
 - Train/Validation split: 0.85 / 0.15
-- Batch size: 16
-- Initial learning rate: 0.00014
 ## MODEL ARCHITECTURE:
-Model Input: Maximum context camembert-large embeddings (1024 dimensions)
-- Locked Dropout: 0.5
-- Projection layer:
-  - layer type: highway layer
-  - input: 1024 dimensions
-  - output: 2048 dimensions
-- BiLSTM layer:
-  - input: 2048 dimensions
-  - output: 256 dimensions (hidden state)
-- Linear layer:
-  - input: 256 dimensions
-  - output: 5 dimensions (predicted labels with BIOES tagging scheme)
-- CRF layer
-Model Output: BIOES labels sequence
 ## HOW TO USE:
 *** IN CONSTRUCTION ***
@@ -81,9 +106,9 @@ Model Output: BIOES labels sequence
 |  6 | 1863_Gautier-Theophile_Le-capitaine-Fracasse                   | 11,834 tokens  | False                             |
 |  7 | 1873_Zola-Emile_Le-ventre-de-Paris                             | 12,557 tokens  | False                             |
 |  8 | 1881_Flaubert-Gustave_Bouvard-et-Pecuchet                      | 12,281 tokens  | False                             |
-|  9 | 1882_Guy-de-Maupassant_Mademoiselle-Fifi-1_1-MADEMOISELLE-FIFI | 5,425 tokens   | True                              |
-| 10 | 1882_Guy-de-Maupassant_Mademoiselle-Fifi-1_2-MADAME-BAPTISTE   | 2,554 tokens   | True                              |
-| 11 | 1882_Guy-de-Maupassant_Mademoiselle-Fifi-1_3-LA-ROUILLE        | 2,929 tokens   | True                              |
 | 12 | 1882_Guy-de-Maupassant_Mademoiselle-Fifi-2_1-MARROCA           | 4,067 tokens   | False                             |
 | 13 | 1882_Guy-de-Maupassant_Mademoiselle-Fifi-2_2-LA-BUCHE          | 2,251 tokens   | False                             |
 | 14 | 1882_Guy-de-Maupassant_Mademoiselle-Fifi-2_3-LA-RELIQUE        | 2,034 tokens   | False                             |
@@ -94,20 +119,14 @@ Model Output: BIOES labels sequence
 | 19 | 1882_Guy-de-Maupassant_Mademoiselle-Fifi-3_5-UN-REVEILLON      | 2,343 tokens   | False                             |
 | 20 | 1901_Lucie-Achard_Rosalie-de-Constant-sa-famille-et-ses-amis   | 12,703 tokens  | False                             |
 | 21 | 1903_Conan-Laure_Elisabeth_Seton                               | 13,023 tokens  | False                             |
-| 22 | 1904_Rolland-Romain_Jean-Christophe_Tome-I-L-aube              | 10,982 tokens  | True                              |
 | 23 | 1904_Rolland-Romain_Jean-Christophe_Tome-II-Le-matin           | 10,305 tokens  | False                             |
 | 24 | 1917_Adèle-Bourgeois_Némoville                                 | 12,389 tokens  | False                             |
 | 25 | 1923_Radiguet-Raymond_Le-diable-au-corps                       | 14,637 tokens  | False                             |
-| 26 | 1926_Audoux-Marguerite_De-la-ville-au-moulin                   | 11,902 tokens  | True                              |
 | 27 | 1937_Audoux-Marguerite_Douce-Lumiere                           | 12,285 tokens  | False                             |
 | 28 | Manon_Lescaut_PEDRO                                            | 71,219 tokens  | False                             |
 | 29 | TOTAL                                                          | 346,579 tokens | 5 files used for cross-validation |
-## PREDICTIONS CONFUSION MATRIX:
-| Gold Labels   | PER   |   O | support   |
-|---------------|-------|-----|-----------|
-| PER           | 3,864 | 197 | 4,061     |
-| O             | 370   |   0 | 370       |
 ## CONTACT:
 mail: antoine [dot] bourgois [at] protonmail [dot] com

 ---
 language: fr
 tags:
+- coreference-resolution
+- anaphora-resolution
+- mentions-linking
+- literary-texts
 - camembert
 - literary-texts
 - nested-entities
 - BookNLP-fr
 license: apache-2.0
 metrics:
+- MUC
+- B3
+- CEAF
+- CoNLL-F1
 base_model:
 - almanach/camembert-large
 ---
 ## INTRODUCTION:
+This model, developed as part of the [BookNLP-fr project](https://github.com/lattice-8094/fr-litbank), is a **coreference resolution model** built on top of [camembert-large](https://huggingface.co/almanach/camembert-large) embeddings. It is trained to link mentions of the same entity across a text, focusing on literary works in French.
+This specific model has been trained to link entities of the following types: PER.
 ## MODEL PERFORMANCES (LOOCV):
+Overall Coreference Resolution Performances for non-overlapping windows of different length:
+|    | Window width (tokens)   |   Document count |   Sample count | MUC F1   | B3 F1   | CEAFe F1   | CONLL F1   |
+|----|-------------------------|------------------|----------------|----------|---------|------------|------------|
+|  0 | 500                     |                5 |             64 | 93.49%   | 86.27%  | 77.85%     | 85.87%     |
+|  1 | 1,000                   |                5 |             30 | 93.68%   | 81.32%  | 71.92%     | 82.31%     |
+|  2 | 2,000                   |                5 |             14 | 93.98%   | 76.90%  | 67.26%     | 79.38%     |
+|  3 | 5,000                   |                3 |              5 | 94.83%   | 68.34%  | 59.88%     | 74.35%     |
+|  4 | 10,000                  |                2 |              2 | 96.16%   | 62.22%  | 57.12%     | 71.84%     |
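The CONLL F1 column above is consistent with the usual convention of taking the unweighted mean of the MUC, B3 and CEAFe F1 scores. A minimal sketch that recomputes it from the rows of this table (the helper name `conll_f1` is illustrative; differences of about 0.01 are expected because the table reports rounded values):

```python
def conll_f1(muc: float, b3: float, ceafe: float) -> float:
    """Unweighted mean of the three coreference F1 scores, in percent."""
    return (muc + b3 + ceafe) / 3.0


# (window width, MUC F1, B3 F1, CEAFe F1, reported CONLL F1), copied from the table above
rows = [
    (500, 93.49, 86.27, 77.85, 85.87),
    (1000, 93.68, 81.32, 71.92, 82.31),
    (2000, 93.98, 76.90, 67.26, 79.38),
    (5000, 94.83, 68.34, 59.88, 74.35),
    (10000, 96.16, 62.22, 57.12, 71.84),
]

for width, muc, b3, ceafe, reported in rows:
    recomputed = conll_f1(muc, b3, ceafe)
    print(f"{width:>6} tokens: recomputed {recomputed:.2f}% / reported {reported:.2f}%")
```

Read this way, the drop in CONLL F1 for wider windows comes from B3 and CEAFe, while MUC F1 rises slightly.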
+Coreference Resolution Performances on the fully annotated sample for each document:
+|    | Token count   | Mention count   | MUC F1   | B3 F1   | CEAFe F1   | CONLL F1   |
+|----|---------------|-----------------|----------|---------|------------|------------|
+|  0 | 2,554         | 330             | 90.24%   | 65.27%  | 72.36%     | 75.96%     |
+|  1 | 2,929         | 386             | 95.65%   | 78.21%  | 64.23%     | 79.37%     |
+|  2 | 5,425         | 558             | 90.46%   | 53.03%  | 59.52%     | 67.67%     |
+|  3 | 10,982        | 1,095           | 97.18%   | 65.30%  | 60.49%     | 74.32%     |
+|  4 | 11,902        | 1,692           | 95.03%   | 58.83%  | 45.59%     | 66.49%     |
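The document and sample counts in the window table are consistent with cutting each evaluated document into full, non-overlapping windows and dropping the shorter remainder. The card does not spell out this procedure, so the sketch below is an assumption checked only against the reported counts; the function names are illustrative and the token counts come from the table above.

```python
from typing import List, Tuple


def count_full_windows(token_count: int, width: int) -> int:
    """Number of complete non-overlapping windows of `width` tokens in a document."""
    return token_count // width


def window_stats(token_counts: List[int], width: int) -> Tuple[int, int]:
    """Return (documents with at least one full window, total number of windows)."""
    per_doc = [count_full_windows(n, width) for n in token_counts]
    documents = sum(1 for c in per_doc if c > 0)
    samples = sum(per_doc)
    return documents, samples


# Token counts of the five evaluated documents (table above)
doc_token_counts = [2554, 2929, 5425, 10982, 11902]

for width in (500, 1000, 2000, 5000, 10000):
    documents, samples = window_stats(doc_token_counts, width)
    print(f"width={width:>6}: {documents} documents, {samples} samples")
```

Under this assumption, a document shorter than the window width contributes no sample, which would explain why the document count falls for the 5,000 and 10,000 token windows.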
 ## TRAINING PARAMETERS:
+- Entities types: PER
 - Split strategy: Leave-one-out cross-validation (29 files)
 - Train/Validation split: 0.85 / 0.15
+- Batch size: 16,000
+- Initial learning rate: 0.0004
+- Focal loss gamma: 1
+- Focal loss alpha: 0.25
+- Pronoun lookup antecedents: 30
+- Common and Proper nouns lookup antecedents: 300
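For reference, the two focal loss parameters listed above can be read against the standard binary focal loss, `FL = -alpha_t * (1 - p_t)^gamma * log(p_t)`. The sketch below is a plain-Python illustration of that formula with gamma = 1 and alpha = 0.25, not the project's training code; in particular, applying `alpha` to the positive (coreferent) class is an assumption.

```python
import math


def binary_focal_loss(p: float, y: int, gamma: float = 1.0, alpha: float = 0.25) -> float:
    """Binary focal loss for one mention pair.

    p: predicted probability that the pair is coreferent.
    y: gold label (1 = coreferent, 0 = not coreferent).
    Assumption: alpha weights the positive class and (1 - alpha) the negative class.
    """
    eps = 1e-12
    p = min(max(p, eps), 1.0 - eps)          # avoid log(0)
    p_t = p if y == 1 else 1.0 - p           # probability assigned to the gold class
    alpha_t = alpha if y == 1 else 1.0 - alpha
    return -alpha_t * (1.0 - p_t) ** gamma * math.log(p_t)


# A confident correct prediction is down-weighted relative to an uncertain one.
print(binary_focal_loss(0.9, 1))
print(binary_focal_loss(0.6, 1))
```

With gamma = 0 the modulating factor disappears and the expression reduces to an alpha-weighted cross-entropy.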
 ## MODEL ARCHITECTURE:
+Model Input: 2,165 dimensions vector
+- Concatenated maximum context camembert-large embeddings (2 * 1,024 = 2,048 dimensions)
+- Additional mentions features (106 dimensions):
+  - Length of mentions
+  - Position of the mention's start token within the sentence
+  - Grammatical category of the mentions (pronoun, common noun, proper noun)
+  - Dependency relation of the mention's head (one-hot encoded)
+  - Gender of the mentions (one-hot encoded)
+  - Number (singular/plural) of the mentions (one-hot encoded)
+  - Grammatical person of the mentions (one-hot encoded)
+- Additional mention pairs features (11 dimensions):
+  - Distance between mention IDs
+  - Distance between start tokens of mentions
+  - Distance between end tokens of mentions
+  - Distance between sentences containing mentions
+  - Distance between paragraphs containing mentions
+  - Difference in nesting levels of mentions
+  - Ratio of shared tokens between mentions
+  - Exact text match between mentions (binary)
+  - Exact match of mention heads (binary)
+  - Match of syntactic heads between mentions (binary)
+  - Match of entity types between mentions (binary)
+- Hidden Layers:
+  - Number of layers: 3
+  - Units per layer: 1,900 nodes
+  - Activation function: relu
+  - Dropout rate: 0.6
+- Final Layer:
+  - Type: Linear
+  - Input: 1900 dimensions
+  - Output: 1 dimension (mention pair coreference score)
+Model Output: Continuous prediction between 0 (not coreferent) and 1 (coreferent) indicating the degree of confidence.
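The dimensions listed above fit together as follows: two 1,024-dimension mention embeddings, 106 mention features and 11 pair features give the 2,165-dimension input. The sketch below only does this bookkeeping and derives the shapes of the fully connected layers. The parameter count assumes every layer has a bias term and that there are no other trainable components, which the card does not state.

```python
from typing import List, Tuple

EMBEDDING_DIM = 1024      # camembert-large embedding size, per mention
MENTION_FEATURES = 106    # additional mention features (total listed above)
PAIR_FEATURES = 11        # additional mention pair features
HIDDEN_UNITS = 1900
HIDDEN_LAYERS = 3

input_dim = 2 * EMBEDDING_DIM + MENTION_FEATURES + PAIR_FEATURES


def layer_shapes(input_dim: int, hidden_units: int, hidden_layers: int) -> List[Tuple[int, int]]:
    """(input, output) sizes of each linear layer, ending with the 1-dimension score layer."""
    shapes = []
    in_dim = input_dim
    for _ in range(hidden_layers):
        shapes.append((in_dim, hidden_units))
        in_dim = hidden_units
    shapes.append((in_dim, 1))
    return shapes


def parameter_count(shapes: List[Tuple[int, int]]) -> int:
    """Weights plus one bias per output unit (assumption: all layers have biases)."""
    return sum(i * o + o for i, o in shapes)


shapes = layer_shapes(input_dim, HIDDEN_UNITS, HIDDEN_LAYERS)
print(input_dim)                 # 2165
print(shapes)
print(parameter_count(shapes))   # 11341101
```

At 4 bytes per parameter this would be about 45.36 MB, which is close to the 45,374,764-byte `final_model` file listed at the end of this page, although the card does not confirm the storage format.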
 ## HOW TO USE:
 *** IN CONSTRUCTION ***
 |  6 | 1863_Gautier-Theophile_Le-capitaine-Fracasse                   | 11,834 tokens  | False                             |
 |  7 | 1873_Zola-Emile_Le-ventre-de-Paris                             | 12,557 tokens  | False                             |
 |  8 | 1881_Flaubert-Gustave_Bouvard-et-Pecuchet                      | 12,281 tokens  | False                             |
+|  9 | 1882_Guy-de-Maupassant_Mademoiselle-Fifi-1_1-MADEMOISELLE-FIFI | 5,425 tokens   | **True**                          |
+| 10 | 1882_Guy-de-Maupassant_Mademoiselle-Fifi-1_2-MADAME-BAPTISTE   | 2,554 tokens   | **True**                          |
+| 11 | 1882_Guy-de-Maupassant_Mademoiselle-Fifi-1_3-LA-ROUILLE        | 2,929 tokens   | **True**                          |
 | 12 | 1882_Guy-de-Maupassant_Mademoiselle-Fifi-2_1-MARROCA           | 4,067 tokens   | False                             |
 | 13 | 1882_Guy-de-Maupassant_Mademoiselle-Fifi-2_2-LA-BUCHE          | 2,251 tokens   | False                             |
 | 14 | 1882_Guy-de-Maupassant_Mademoiselle-Fifi-2_3-LA-RELIQUE        | 2,034 tokens   | False                             |
 | 19 | 1882_Guy-de-Maupassant_Mademoiselle-Fifi-3_5-UN-REVEILLON      | 2,343 tokens   | False                             |
 | 20 | 1901_Lucie-Achard_Rosalie-de-Constant-sa-famille-et-ses-amis   | 12,703 tokens  | False                             |
 | 21 | 1903_Conan-Laure_Elisabeth_Seton                               | 13,023 tokens  | False                             |
+| 22 | 1904_Rolland-Romain_Jean-Christophe_Tome-I-L-aube              | 10,982 tokens  | **True**                          |
 | 23 | 1904_Rolland-Romain_Jean-Christophe_Tome-II-Le-matin           | 10,305 tokens  | False                             |
 | 24 | 1917_Adèle-Bourgeois_Némoville                                 | 12,389 tokens  | False                             |
 | 25 | 1923_Radiguet-Raymond_Le-diable-au-corps                       | 14,637 tokens  | False                             |
+| 26 | 1926_Audoux-Marguerite_De-la-ville-au-moulin                   | 11,902 tokens  | **True**                          |
 | 27 | 1937_Audoux-Marguerite_Douce-Lumiere                           | 12,285 tokens  | False                             |
 | 28 | Manon_Lescaut_PEDRO                                            | 71,219 tokens  | False                             |
 | 29 | TOTAL                                                          | 346,579 tokens | 5 files used for cross-validation |
 ## CONTACT:
 mail: antoine [dot] bourgois [at] protonmail [dot] com

README.md CHANGED

@@ -1,70 +1,121 @@
 ---
 language: fr
 tags:
-- NER
 - camembert
 - literary-texts
 - nested-entities
 - BookNLP-fr
 license: apache-2.0
 metrics:
-- f1
-- precision
-- recall
 base_model:
 - almanach/camembert-large
-pipeline_tag: token-classification
 ---
 ## INTRODUCTION:
-This model, developed as part of the [BookNLP-fr project](https://github.com/lattice-8094/fr-litbank), is a **NER model** built on top of [camembert-large](https://huggingface.co/almanach/camembert-large) embeddings, trained to predict nested entities in french, specifically for literary texts.
-The predicted entities are:
-- mentions of characters (PER): pronouns (je, tu, il, ...), possessive pronouns (mon, ton, son, ...), common nouns (le capitaine, la princesse, ...) and proper nouns (Indiana Delmare, Honoré de Pardaillan, ...)
-- facilities (FAC): chatêau, sentier, chambre, couloir, ...
-- time (TIME): le règne de Louis XIV, ce matin, en juillet, ...
-- geo-political entities (GPE): Montrouge, France, le petit hameau, ...
-- locations (LOC): le sud, Mars, l'océan, le bois, ...
-- vehicles (VEH): avion, voitures, calèche, vélos, ...
 ## MODEL PERFORMANCES (LOOCV):
-| NER_tag   | precision   | recall   | f1_score   | support   | support %   |
-|-----------|-------------|----------|------------|-----------|-------------|
-| PER       | 91.87%      | 92.63%   | 92.25%     | 43,427    | 100.00%     |
-| micro_avg | 91.87%      | 92.63%   | 92.25%     | 43,427    | 100.00%     |
-| macro_avg | 91.87%      | 92.63%   | 92.25%     | 43,427    | 100.00%     |
 ## TRAINING PARAMETERS:
-- Entities types: ['PER']
-- Tagging scheme: BIOES
-- Nested entities levels: [0, 1]
 - Split strategy: Leave-one-out cross-validation (29 files)
 - Train/Validation split: 0.85 / 0.15
-- Batch size: 16
-- Initial learning rate: 0.00014
 ## MODEL ARCHITECTURE:
-Model Input: Maximum context camembert-large embeddings (1024 dimensions)
-- Locked Dropout: 0.5
-- Projection layer:
-  - layer type: highway layer
-  - input: 1024 dimensions
-  - output: 2048 dimensions
-- BiLSTM layer:
-  - input: 2048 dimensions
-  - output: 256 dimensions (hidden state)
-- Linear layer:
-  - input: 256 dimensions
-  - output: 5 dimensions (predicted labels with BIOES tagging scheme)
-- CRF layer
-Model Output: BIOES labels sequence
 ## HOW TO USE:
 *** IN CONSTRUCTION ***
@@ -72,42 +123,36 @@ Model Output: BIOES labels sequence
 ## TRAINING CORPUS:
 |    | Document                                                       | Tokens Count   | Is included in model eval          |
 |----|----------------------------------------------------------------|----------------|------------------------------------|
-|  0 | 1836_Gautier-Theophile_La-morte-amoureuse                      | 14,299 tokens  | True                               |
-|  1 | 1840_Sand-George_Pauline                                       | 12,315 tokens  | True                               |
-|  2 | 1842_Balzac-Honore-de_La-Maison-du-chat-qui-pelote             | 24,776 tokens  | True                               |
-|  3 | 1844_Balzac-Honore-de_La-Maison-Nucingen                       | 30,987 tokens  | True                               |
-|  4 | 1844_Balzac-Honore-de_Sarrasine                                | 15,408 tokens  | True                               |
-|  5 | 1856_Cousin-Victor_Madame-de-Hautefort                         | 11,768 tokens  | True                               |
-|  6 | 1863_Gautier-Theophile_Le-capitaine-Fracasse                   | 11,834 tokens  | True                               |
-|  7 | 1873_Zola-Emile_Le-ventre-de-Paris                             | 12,557 tokens  | True                               |
-|  8 | 1881_Flaubert-Gustave_Bouvard-et-Pecuchet                      | 12,281 tokens  | True                               |
-|  9 | 1882_Guy-de-Maupassant_Mademoiselle-Fifi-1_1-MADEMOISELLE-FIFI | 5,425 tokens   | True                               |
-| 10 | 1882_Guy-de-Maupassant_Mademoiselle-Fifi-1_2-MADAME-BAPTISTE   | 2,554 tokens   | True                               |
-| 11 | 1882_Guy-de-Maupassant_Mademoiselle-Fifi-1_3-LA-ROUILLE        | 2,929 tokens   | True                               |
-| 12 | 1882_Guy-de-Maupassant_Mademoiselle-Fifi-2_1-MARROCA           | 4,067 tokens   | True                               |
-| 13 | 1882_Guy-de-Maupassant_Mademoiselle-Fifi-2_2-LA-BUCHE          | 2,251 tokens   | True                               |
-| 14 | 1882_Guy-de-Maupassant_Mademoiselle-Fifi-2_3-LA-RELIQUE        | 2,034 tokens   | True                               |
-| 15 | 1882_Guy-de-Maupassant_Mademoiselle-Fifi-3_1-FOU               | 1,864 tokens   | True                               |
-| 16 | 1882_Guy-de-Maupassant_Mademoiselle-Fifi-3_2-REVEIL            | 2,141 tokens   | True                               |
-| 17 | 1882_Guy-de-Maupassant_Mademoiselle-Fifi-3_3-UNE-RUSE          | 2,441 tokens   | True                               |
-| 18 | 1882_Guy-de-Maupassant_Mademoiselle-Fifi-3_4-A-CHEVAL          | 2,860 tokens   | True                               |
-| 19 | 1882_Guy-de-Maupassant_Mademoiselle-Fifi-3_5-UN-REVEILLON      | 2,343 tokens   | True                               |
-| 20 | 1901_Lucie-Achard_Rosalie-de-Constant-sa-famille-et-ses-amis   | 12,703 tokens  | True                               |
-| 21 | 1903_Conan-Laure_Elisabeth_Seton                               | 13,023 tokens  | True                               |
-| 22 | 1904_Rolland-Romain_Jean-Christophe_Tome-I-L-aube              | 10,982 tokens  | True                               |
-| 23 | 1904_Rolland-Romain_Jean-Christophe_Tome-II-Le-matin           | 10,305 tokens  | True                               |
-| 24 | 1917_Adèle-Bourgeois_Némoville                                 | 12,389 tokens  | True                               |
-| 25 | 1923_Radiguet-Raymond_Le-diable-au-corps                       | 14,637 tokens  | True                               |
-| 26 | 1926_Audoux-Marguerite_De-la-ville-au-moulin                   | 11,902 tokens  | True                               |
-| 27 | 1937_Audoux-Marguerite_Douce-Lumiere                           | 12,285 tokens  | True                               |
-| 28 | Manon_Lescaut_PEDRO                                            | 71,219 tokens  | True                               |
 | 29 | TOTAL                                                          | 346,579 tokens | 29 files used for cross-validation |
-## PREDICTIONS CONFUSION MATRIX:
-| Gold Labels   | PER    | O     | support   |
-|---------------|--------|-------|-----------|
-| PER           | 40,227 | 3,200 | 43,427    |
-| O             | 3,401  | 0     | 3,401     |
 ## CONTACT:
 mail: antoine [dot] bourgois [at] protonmail [dot] com

 ---
 language: fr
 tags:
+- coreference-resolution
+- anaphora-resolution
+- mentions-linking
+- literary-texts
 - camembert
 - literary-texts
 - nested-entities
 - BookNLP-fr
 license: apache-2.0
 metrics:
+- MUC
+- B3
+- CEAF
+- CoNLL-F1
 base_model:
 - almanach/camembert-large
 ---
 ## INTRODUCTION:
+This model, developed as part of the [BookNLP-fr project](https://github.com/lattice-8094/fr-litbank), is a **coreference resolution model** built on top of [camembert-large](https://huggingface.co/almanach/camembert-large) embeddings. It is trained to link mentions of the same entity across a text, focusing on literary works in French.
+This specific model has been trained to link entities of the following types: PER.
 ## MODEL PERFORMANCES (LOOCV):
+Overall Coreference Resolution Performances for non-overlapping windows of different length:
+|    | Window width (tokens)   |   Document count |   Sample count | MUC F1   | B3 F1   | CEAFe F1   | CONLL F1   |
+|----|-------------------------|------------------|----------------|----------|---------|------------|------------|
+|  0 | 500                     |               29 |            677 | 92.18%   | 83.86%  | 76.86%     | 84.30%     |
+|  1 | 1,000                   |               29 |            332 | 92.65%   | 79.79%  | 71.77%     | 81.40%     |
+|  2 | 2,000                   |               28 |            162 | 93.29%   | 75.85%  | 67.34%     | 78.83%     |
+|  3 | 5,000                   |               19 |             56 | 93.76%   | 69.60%  | 61.16%     | 74.84%     |
+|  4 | 10,000                  |               18 |             27 | 94.28%   | 65.73%  | 58.59%     | 72.86%     |
+|  5 | 25,000                  |                2 |              3 | 94.76%   | 62.48%  | 53.33%     | 70.19%     |
+|  6 | 50,000                  |                1 |              1 | 97.39%   | 56.43%  | 47.40%     | 67.07%     |
+Coreference Resolution Performances on the fully annotated sample for each document:
+|    | Token count   | Mention count   | MUC F1   | B3 F1   | CEAFe F1   | CONLL F1   |
+|----|---------------|-----------------|----------|---------|------------|------------|
+|  0 | 1,864         | 253             | 98.16%   | 95.39%  | 60.34%     | 84.63%     |
+|  1 | 2,034         | 321             | 97.47%   | 92.79%  | 80.04%     | 90.10%     |
+|  2 | 2,141         | 297             | 95.06%   | 77.99%  | 65.08%     | 79.38%     |
+|  3 | 2,251         | 235             | 91.95%   | 80.47%  | 46.56%     | 73.00%     |
+|  4 | 2,343         | 239             | 83.87%   | 61.95%  | 43.58%     | 63.13%     |
+|  5 | 2,441         | 314             | 91.85%   | 55.70%  | 60.82%     | 69.46%     |
+|  6 | 2,554         | 330             | 90.24%   | 65.27%  | 72.36%     | 75.96%     |
+|  7 | 2,860         | 369             | 93.65%   | 84.89%  | 74.93%     | 84.49%     |
+|  8 | 2,929         | 386             | 95.65%   | 78.21%  | 64.23%     | 79.37%     |
+|  9 | 4,067         | 429             | 97.46%   | 85.20%  | 62.52%     | 81.73%     |
+| 10 | 5,425         | 558             | 90.46%   | 53.03%  | 59.52%     | 67.67%     |
+| 11 | 10,305        | 1,436           | 96.37%   | 74.83%  | 59.91%     | 77.04%     |
+| 12 | 10,982        | 1,095           | 97.18%   | 65.30%  | 60.49%     | 74.32%     |
+| 13 | 11,768        | 1,734           | 93.30%   | 64.14%  | 64.12%     | 73.85%     |
+| 14 | 11,834        | 600             | 92.21%   | 67.51%  | 60.74%     | 73.49%     |
+| 15 | 11,902        | 1,692           | 95.03%   | 58.83%  | 45.59%     | 66.49%     |
+| 16 | 12,281        | 1,089           | 95.06%   | 62.05%  | 72.55%     | 76.55%     |
+| 17 | 12,285        | 1,489           | 95.28%   | 77.84%  | 57.43%     | 76.85%     |
+| 18 | 12,315        | 1,501           | 95.36%   | 57.07%  | 64.26%     | 72.23%     |
+| 19 | 12,389        | 1,654           | 93.19%   | 54.21%  | 51.84%     | 66.41%     |
+| 20 | 12,557        | 1,085           | 92.30%   | 66.97%  | 46.65%     | 68.64%     |
+| 21 | 12,703        | 1,731           | 90.40%   | 53.70%  | 61.37%     | 68.49%     |
+| 22 | 13,023        | 1,559           | 93.86%   | 61.71%  | 62.41%     | 72.66%     |
+| 23 | 14,299        | 1,582           | 97.23%   | 69.25%  | 67.04%     | 77.84%     |
+| 24 | 14,637        | 2,127           | 95.78%   | 71.34%  | 63.28%     | 76.80%     |
+| 25 | 15,408        | 1,769           | 92.85%   | 54.11%  | 56.12%     | 67.69%     |
+| 26 | 24,776        | 2,716           | 94.31%   | 63.51%  | 54.12%     | 70.65%     |
+| 27 | 30,987        | 2,980           | 89.55%   | 54.25%  | 59.68%     | 67.83%     |
+| 28 | 71,219        | 11,857          | 97.38%   | 50.85%  | 45.93%     | 64.72%     |
 ## TRAINING PARAMETERS:
+- Entities types: PER
 - Split strategy: Leave-one-out cross-validation (29 files)
 - Train/Validation split: 0.85 / 0.15
+- Batch size: 16,000
+- Initial learning rate: 0.0004
+- Focal loss gamma: 1
+- Focal loss alpha: 0.25
+- Pronoun lookup antecedents: 30
+- Common and Proper nouns lookup antecedents: 300
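One plausible reading of the two lookup parameters above is that each mention is paired only with a bounded number of preceding mentions: 30 when the mention is a pronoun, 300 when it is a common or proper noun. The card does not describe the pairing procedure, so the sketch below is an assumption; the category labels and the `candidate_pairs` helper are hypothetical.

```python
from typing import List, Tuple

# Maximum number of preceding mentions considered, by grammatical category of the mention
LOOKUP = {"pronoun": 30, "common_noun": 300, "proper_noun": 300}


def candidate_pairs(categories: List[str]) -> List[Tuple[int, int]]:
    """Return (antecedent_index, mention_index) pairs for mentions given in text order."""
    pairs = []
    for i, category in enumerate(categories):
        start = max(0, i - LOOKUP[category])   # earliest antecedent still inside the lookup range
        pairs.extend((j, i) for j in range(start, i))
    return pairs


# Toy document: one proper noun, 39 common nouns, then a pronoun at index 40
categories = ["proper_noun"] + ["common_noun"] * 39 + ["pronoun"]
pairs = candidate_pairs(categories)
pronoun_antecedents = [j for j, i in pairs if i == 40]
print(len(pronoun_antecedents), min(pronoun_antecedents))
```

Bounding the lookup keeps the number of scored pairs roughly linear in the number of mentions, at the cost of missing antecedents that lie further back in the text.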
 ## MODEL ARCHITECTURE:
+Model Input: 2,165 dimensions vector
+- Concatenated maximum context camembert-large embeddings (2 * 1,024 = 2,048 dimensions)
+- Additional mentions features (106 dimensions):
+  - Length of mentions
+  - Position of the mention's start token within the sentence
+  - Grammatical category of the mentions (pronoun, common noun, proper noun)
+  - Dependency relation of the mention's head (one-hot encoded)
+  - Gender of the mentions (one-hot encoded)
+  - Number (singular/plural) of the mentions (one-hot encoded)
+  - Grammatical person of the mentions (one-hot encoded)
+- Additional mention pairs features (11 dimensions):
+  - Distance between mention IDs
+  - Distance between start tokens of mentions
+  - Distance between end tokens of mentions
+  - Distance between sentences containing mentions
+  - Distance between paragraphs containing mentions
+  - Difference in nesting levels of mentions
+  - Ratio of shared tokens between mentions
+  - Exact text match between mentions (binary)
+  - Exact match of mention heads (binary)
+  - Match of syntactic heads between mentions (binary)
+  - Match of entity types between mentions (binary)
+- Hidden Layers:
+  - Number of layers: 3
+  - Units per layer: 1,900 nodes
+  - Activation function: relu
+  - Dropout rate: 0.6
+- Final Layer:
+  - Type: Linear
+  - Input: 1900 dimensions
+  - Output: 1 dimension (mention pair coreference score)
+Model Output: Continuous prediction between 0 (not coreferent) and 1 (coreferent) indicating the degree of confidence.
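The model scores mention pairs, so a separate step is needed to turn those scores into coreference chains. The card does not describe that step. The sketch below shows one common mention-pair strategy, best-antecedent linking, purely as an illustration: the `link_best_antecedent` helper and the 0.5 threshold are hypothetical and are not taken from the BookNLP-fr code.

```python
from typing import Dict, List, Tuple


def link_best_antecedent(
    n_mentions: int,
    scores: Dict[Tuple[int, int], float],
    threshold: float = 0.5,
) -> List[int]:
    """Assign a cluster id to each mention (mentions are indexed in text order).

    scores maps (antecedent_index, mention_index) to a coreference score in [0, 1].
    Each mention joins the cluster of its highest-scoring antecedent when that
    score reaches the threshold; otherwise it starts a new cluster.
    """
    cluster_of = list(range(n_mentions))
    for i in range(n_mentions):
        best_j, best_score = None, threshold
        for j in range(i):
            score = scores.get((j, i), 0.0)   # unscored pairs count as not coreferent
            if score >= best_score:
                best_j, best_score = j, score
        if best_j is not None:
            cluster_of[i] = cluster_of[best_j]
    return cluster_of


# Toy scores for four mentions
scores = {(0, 1): 0.1, (0, 2): 0.9, (1, 2): 0.2, (1, 3): 0.8, (2, 3): 0.3}
print(link_best_antecedent(4, scores))   # [0, 1, 0, 1]
```

Other strategies, such as linking every pair above the threshold and taking connected components, would produce larger clusters from the same scores.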
 ## HOW TO USE:
 *** IN CONSTRUCTION ***
 ## TRAINING CORPUS:
 |    | Document                                                       | Tokens Count   | Is included in model eval          |
 |----|----------------------------------------------------------------|----------------|------------------------------------|
+|  0 | 1836_Gautier-Theophile_La-morte-amoureuse                      | 14,299 tokens  | **True**                           |
+|  1 | 1840_Sand-George_Pauline                                       | 12,315 tokens  | **True**                           |
+|  2 | 1842_Balzac-Honore-de_La-Maison-du-chat-qui-pelote             | 24,776 tokens  | **True**                           |
+|  3 | 1844_Balzac-Honore-de_La-Maison-Nucingen                       | 30,987 tokens  | **True**                           |
+|  4 | 1844_Balzac-Honore-de_Sarrasine                                | 15,408 tokens  | **True**                           |
+|  5 | 1856_Cousin-Victor_Madame-de-Hautefort                         | 11,768 tokens  | **True**                           |
+|  6 | 1863_Gautier-Theophile_Le-capitaine-Fracasse                   | 11,834 tokens  | **True**                           |
+|  7 | 1873_Zola-Emile_Le-ventre-de-Paris                             | 12,557 tokens  | **True**                           |
+|  8 | 1881_Flaubert-Gustave_Bouvard-et-Pecuchet                      | 12,281 tokens  | **True**                           |
+|  9 | 1882_Guy-de-Maupassant_Mademoiselle-Fifi-1_1-MADEMOISELLE-FIFI | 5,425 tokens   | **True**                           |
+| 10 | 1882_Guy-de-Maupassant_Mademoiselle-Fifi-1_2-MADAME-BAPTISTE   | 2,554 tokens   | **True**                           |
+| 11 | 1882_Guy-de-Maupassant_Mademoiselle-Fifi-1_3-LA-ROUILLE        | 2,929 tokens   | **True**                           |
+| 12 | 1882_Guy-de-Maupassant_Mademoiselle-Fifi-2_1-MARROCA           | 4,067 tokens   | **True**                           |
+| 13 | 1882_Guy-de-Maupassant_Mademoiselle-Fifi-2_2-LA-BUCHE          | 2,251 tokens   | **True**                           |
+| 14 | 1882_Guy-de-Maupassant_Mademoiselle-Fifi-2_3-LA-RELIQUE        | 2,034 tokens   | **True**                           |
+| 15 | 1882_Guy-de-Maupassant_Mademoiselle-Fifi-3_1-FOU               | 1,864 tokens   | **True**                           |
+| 16 | 1882_Guy-de-Maupassant_Mademoiselle-Fifi-3_2-REVEIL            | 2,141 tokens   | **True**                           |
+| 17 | 1882_Guy-de-Maupassant_Mademoiselle-Fifi-3_3-UNE-RUSE          | 2,441 tokens   | **True**                           |
+| 18 | 1882_Guy-de-Maupassant_Mademoiselle-Fifi-3_4-A-CHEVAL          | 2,860 tokens   | **True**                           |
+| 19 | 1882_Guy-de-Maupassant_Mademoiselle-Fifi-3_5-UN-REVEILLON      | 2,343 tokens   | **True**                           |
+| 20 | 1901_Lucie-Achard_Rosalie-de-Constant-sa-famille-et-ses-amis   | 12,703 tokens  | **True**                           |
+| 21 | 1903_Conan-Laure_Elisabeth_Seton                               | 13,023 tokens  | **True**                           |
+| 22 | 1904_Rolland-Romain_Jean-Christophe_Tome-I-L-aube              | 10,982 tokens  | **True**                           |
+| 23 | 1904_Rolland-Romain_Jean-Christophe_Tome-II-Le-matin           | 10,305 tokens  | **True**                           |
+| 24 | 1917_Adèle-Bourgeois_Némoville                                 | 12,389 tokens  | **True**                           |
+| 25 | 1923_Radiguet-Raymond_Le-diable-au-corps                       | 14,637 tokens  | **True**                           |
+| 26 | 1926_Audoux-Marguerite_De-la-ville-au-moulin                   | 11,902 tokens  | **True**                           |
+| 27 | 1937_Audoux-Marguerite_Douce-Lumiere                           | 12,285 tokens  | **True**                           |
+| 28 | Manon_Lescaut_PEDRO                                            | 71,219 tokens  | **True**                           |
 | 29 | TOTAL                                                          | 346,579 tokens | 29 files used for cross-validation |
 ## CONTACT:
 mail: antoine [dot] bourgois [at] protonmail [dot] com

final_model ADDED

@@ -0,0 +1,3 @@
+version https://git-lfs.github.com/spec/v1
+oid sha256:ded44d7c40c73481125be81d5312700a92baa8fa2bb3c6255ed6910c1f97b3ae
+size 45374764