pstroe committed on
Commit
503d71b
1 Parent(s): 7729aa8

Update README.md

Files changed (1)
  1. README.md +1 -2
README.md CHANGED
@@ -16,8 +16,7 @@ I undertook the following preprocessing steps:
  - Language identification with [langid](https://github.com/saffsd/langid.py)
  - Compute the ratio of Latin vocabulary in each sentence (against the digital-born vocab of the corpus)
  - Retain only sentences with a Latin vocabulary ratio of > 85%.
- - Retaining only lines containing letters of the Latin alphabet, numerals, and certain punctuation (--> `grep -P '^[A-z0-9ÄÖÜäöüÆæŒœᵫĀāūōŌ.,;:?!\- Ęę]+$' la.nolorem.tok.txt`
- - deduplication of the corpus
+ - Exclude all lines containing '^' --> hints at the presence of OCR errors.
 
  The result is a corpus of ~390 million tokens.
 
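
For readers who want to reproduce the line-level filters described in the README after this change, the following is a minimal sketch, not the author's actual pipeline. It assumes whitespace-tokenized input, case-insensitive vocabulary lookup, and a reference vocabulary held in a Python set. The names `latin_ratio`, `keep_line`, and `LATIN_RATIO_THRESHOLD` are hypothetical, and the language identification step with langid is left out. The first assertion also illustrates a side effect of the removed `grep` pattern: in ASCII, the range `A-z` covers `[`, `\`, `]`, `^`, `_` and the backtick as well as letters, so that pattern did not reject lines containing `^`.

```python
import re

# Assumption: threshold taken from the README wording ("> 85%").
LATIN_RATIO_THRESHOLD = 0.85


def latin_ratio(sentence: str, vocab: set[str]) -> float:
    """Share of whitespace-separated tokens found in the reference vocabulary."""
    tokens = sentence.lower().split()
    if not tokens:
        return 0.0
    return sum(tok in vocab for tok in tokens) / len(tokens)


def keep_line(line: str, vocab: set[str]) -> bool:
    """Drop lines containing '^', then require a Latin ratio above the threshold."""
    if "^" in line:
        return False
    return latin_ratio(line, vocab) > LATIN_RATIO_THRESHOLD


# The ASCII range A-z is wider than the letters: it also matches '^' and '_'.
assert re.fullmatch(r"[A-z]+", "^_") is not None

# Toy vocabulary and input lines, purely illustrative.
vocab = {"gallia", "est", "omnis", "divisa", "in", "partes", "tres"}
lines = [
    "Gallia est omnis divisa in partes tres",
    "Gallia e^t omnis",
    "the quick brown fox",
]
kept = [ln for ln in lines if keep_line(ln, vocab)]
print(kept)
```

On the toy input, only the first line survives: the second contains `^`, and none of the tokens in the third appear in the vocabulary.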