Perplexity tools

1. Create samples from clean_json_3 sources

Sample between 1k and 1M documents per source. See samples/README.md. Output filenames must be prefixed with the doc_type and suffixed with the two-letter language code. For example:

$ cat /nfsmounts/datastore/ncc_corpus/mimir/jsonl_2/nrk/nrk-articles.jsonl | shuf -n 100000 > samples/restricted-newspapers_nrk_no.json
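The `shuf` call above needs the whole file in memory. For sources too large for that, the same uniform sample can be drawn in a single streaming pass. A minimal reservoir-sampling sketch in Python (the function name `sample_lines` is hypothetical, not part of this repo):

```python
import random

def sample_lines(lines, k, seed=0):
    """Reservoir-sample k lines from an iterable in one pass.

    Equivalent in spirit to `shuf -n k`, but streams the input, so the
    source file never has to fit in memory.
    """
    rng = random.Random(seed)
    reservoir = []
    for i, line in enumerate(lines):
        if i < k:
            reservoir.append(line)
        else:
            # Keep each seen line with probability k / (i + 1).
            j = rng.randint(0, i)
            if j < k:
                reservoir[j] = line
    return reservoir
```

Usage: pass an open JSONL file handle as `lines` and write the returned list to the samples/ file.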

2. Create the perplexity scores for each file

Example of creating scores only for the restricted-newspapers_* doc_type samples:

$ ls samples/restricted-newspapers_* | parallel --lb --jobs 5 python samples_scores.py {} --output_path scores/ --jobs 15

3. Create the quartiles CSV needed for segmenting and downsampling

By default, the different doc_types are grouped together. With the --group_by_prefix_lang flag, grouping instead happens on the pair of doc_type prefix and language code, e.g., wikipedia_en.
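Following the filename convention from step 1 (doc_type prefix, source, two-letter language code, joined by underscores), the grouping key can be derived like this. A minimal sketch, assuming that convention; `group_key` is a hypothetical helper, not the actual implementation in samples_quartiles.py:

```python
import os

def group_key(path, by_prefix_lang=False):
    """Derive the grouping key from a scores/ or samples/ filename.

    Assumes names like restricted-newspapers_nrk_no.json: the doc_type
    prefix comes first and the two-letter language code last. With
    by_prefix_lang=True the key is the prefix/language pair (e.g.
    restricted-newspapers_no); otherwise it is the prefix alone.
    """
    stem = os.path.splitext(os.path.basename(path))[0]
    parts = stem.split("_")
    prefix, lang = parts[0], parts[-1]
    return f"{prefix}_{lang}" if by_prefix_lang else prefix
```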

Per-language downsampling ratios can be specified with the --sampling_ratio_per_lang flag. For mimir-base, the downsampling by language is defined as follows: "da:0.23,en:0.21,sv:0.08,is:0.50".

$ python samples_quartiles.py scores/ --group_by_prefix_lang --sampling_ratio_per_lang "da:0.23,en:0.21,sv:0.08,is:0.50" --output_file csv/base-perplexity_quartiles_sampling.csv

For mimir-extended, the downsampling by language is defined as follows: "da:0.43,en:0.81,sv:0.15,code:0.62".

$ python samples_quartiles.py scores/ --group_by_prefix_lang --sampling_ratio_per_lang "da:0.43,en:0.81,sv:0.15,code:0.62" --output_file csv/extended-perplexity_quartiles_sampling.csv  --overwrite_prefix_lang "starcoder_en:starcode_code"
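The --sampling_ratio_per_lang values above follow a simple "lang:ratio" comma-separated format. A minimal sketch of parsing it into a dict (the `parse_ratios` helper is hypothetical, shown only to document the format):

```python
def parse_ratios(spec):
    """Parse a "lang:ratio,lang:ratio" spec into {lang: float} pairs.

    Languages missing from the spec are assumed to keep a ratio of 1.0
    (i.e. no downsampling) by the caller.
    """
    ratios = {}
    for pair in spec.split(","):
        lang, ratio = pair.split(":")
        ratios[lang.strip()] = float(ratio)
    return ratios
```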

More information in the spreadsheet.
