Perplexity tools
1. Create samples from clean_json_3 sources
Each sample should contain between 1k and 1M documents; see samples/README.md for details. Output files must be prefixed by the doc_type and suffixed by a two-letter language code. For example:
$ cat /nfsmounts/datastore/ncc_corpus/mimir/jsonl_2/nrk/nrk-articles.jsonl | shuf -n 100000 > samples/restricted-newspapers_nrk_no.json
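Note that shuf reads its whole input into memory before shuffling. For very large sources, the same uniform sample can be drawn in a single streaming pass with reservoir sampling. A minimal sketch (the function name and seed handling are illustrative, not part of the tooling):

```python
import random

def reservoir_sample(lines, k, seed=0):
    """Keep a uniform random sample of up to k items from a stream,
    using one pass and O(k) memory (Algorithm R)."""
    rng = random.Random(seed)
    sample = []
    for i, line in enumerate(lines):
        if i < k:
            sample.append(line)
        else:
            # Replace an existing element with probability k / (i + 1).
            j = rng.randrange(i + 1)
            if j < k:
                sample[j] = line
    return sample

# In practice the stream would be the source .jsonl, e.g.:
# with open("nrk-articles.jsonl") as f:
#     docs = reservoir_sample(f, 100_000)
```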
2. Create the perplexity scores for each file
Example of how to create scores only for the doc_type restricted-newspapers_* samples:
$ ls samples/restricted-newspapers_* | parallel --lb --jobs 5 python samples_scores.py {} --output_path scores/ --jobs 15
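If GNU parallel is not available, the same fan-out pattern (score many sample files with a bounded number of concurrent jobs) can be sketched with the standard library. The helper name is illustrative; in practice the worker would shell out to samples_scores.py:

```python
import concurrent.futures as cf

def run_parallel(worker, items, jobs=5):
    """Apply `worker` to each item with at most `jobs` tasks running
    concurrently, returning results in input order."""
    with cf.ThreadPoolExecutor(max_workers=jobs) as pool:
        return list(pool.map(worker, items))

# In practice, `worker` would wrap something like:
# subprocess.run(["python", "samples_scores.py", path,
#                 "--output_path", "scores/", "--jobs", "15"], check=True)
```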
3. Create the quartiles CSV needed for segmenting and downsampling
The different doc_types will be grouped together. By passing the flag --group_by_prefix_lang, the grouping happens on the pair of doc_type prefix and language code, e.g., wikipedia_en.
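The prefix/language grouping can be pictured as deriving a key from each sample filename. A hedged sketch of the idea, assuming the naming convention from step 1 (this is not necessarily how samples_quartiles.py parses names):

```python
import os

def prefix_lang_key(path):
    """Map a sample filename such as
    'samples/restricted-newspapers_nrk_no.json' to the grouping key
    'restricted-newspapers_no' (doc_type prefix + language code)."""
    stem = os.path.basename(path).rsplit(".", 1)[0]
    parts = stem.split("_")
    # Keep the doc_type prefix and the trailing two-letter language code,
    # dropping any middle components (e.g. the source name 'nrk').
    return f"{parts[0]}_{parts[-1]}"
```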
Different downsampling ratios can be specified with the --sampling_ratio_per_lang flag. For mimir-base, the downsampling by language is defined as follows: "da:0.23,en:0.21,sv:0.08,is:0.50".
$ python samples_quartiles.py scores/ --group_by_prefix_lang --sampling_ratio_per_lang "da:0.23,en:0.21,sv:0.08,is:0.50" --output_file csv/base-perplexity_quartiles_sampling.csv
For mimir-extended, the downsampling by language is defined as follows: "da:0.43,en:0.81,sv:0.15,code:0.62".
$ python samples_quartiles.py scores/ --group_by_prefix_lang --sampling_ratio_per_lang "da:0.43,en:0.81,sv:0.15,code:0.62" --output_file csv/extended-perplexity_quartiles_sampling.csv --overwrite_prefix_lang "starcoder_en:starcode_code"
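Conceptually, segmenting by quartiles and then downsampling means bucketing documents by perplexity score and keeping the same fraction of each bucket, so the reduced corpus preserves the original quartile mix. A rough sketch of the idea, not the script's actual implementation:

```python
import random
import statistics

def quartile_downsample(scores, ratio, seed=0):
    """Bucket document indices into perplexity quartiles, then keep
    `ratio` of each bucket so the quartile distribution is preserved."""
    q1, q2, q3 = statistics.quantiles(scores, n=4)
    buckets = [[], [], [], []]
    for i, s in enumerate(scores):
        b = 0 if s <= q1 else 1 if s <= q2 else 2 if s <= q3 else 3
        buckets[b].append(i)
    rng = random.Random(seed)
    kept = []
    for bucket in buckets:
        # Sample the same fraction from every quartile bucket.
        kept.extend(rng.sample(bucket, round(len(bucket) * ratio)))
    return (q1, q2, q3), sorted(kept)
```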
More information is available in the spreadsheet.