Update README.md
README.md
CHANGED
@@ -50,7 +50,7 @@ The training corpus is composed of several biomedical corpora in Spanish, collec
 - keep the original document boundaries
 
 Then, the biomedical corpora are concatenated and further global deduplication among the biomedical corpora have been applied.
-Eventually, the clinical corpus is concatenated to the cleaned biomedical corpus resulting in a medium-size biomedical-clinical corpus for Spanish composed of about
+Eventually, the clinical corpus is concatenated to the cleaned biomedical corpus resulting in a medium-size biomedical-clinical corpus for Spanish composed of about 968M tokens. The table below shows some basic statistics of the individual cleaned corpora:
 
 
 | Name | No. tokens | Description |