kenhktsui/llm-data-textbook-quality-fasttext-classifier-v2

Text Classification

kenhktsui committed on May 19, 2024

Commit e1bfefc · verified · 1 Parent(s): 5dccdbc

Update README.md

Files changed (1)

README.md +12 -1

@@ -88,11 +88,18 @@ predict_educational_value(["Hi"])
 # Output: [3.0000010156072676e-05]
 ```
-## Benchmark
 To make sure this classifier makes sense, it is applied to various datasets.
 Educational Value = 2 point * P(High) + 1 point * P(Mid) + 0 point * P(Low)
 |Dataset | Sampling | Average Educational Value | Type |
 |--------------------------------------|---|-------------------|-------|
@@ -115,7 +122,11 @@ Educational Value = 2 point * P(High) + 1 point * P(Mid) + 0 point * P(Low)
 |[HuggingFaceFW/fineweb](https://huggingface.co/datasets/HuggingFaceFW/fineweb)| First 10,000 | 1.058|Real|
 |[BEE-spoke-data/fineweb-100k_en-med](https://huggingface.co/datasets/BEE-spoke-data/fineweb-100k_en-med)| First 10,000 | 1.017|Real|
 |[JeanKaddour/minipile](https://huggingface.co/datasets/JeanKaddour/minipile)| First 10,000 | 0.994|Real|
 |[mattymchen/refinedweb-3m](https://huggingface.co/datasets/mattymchen/refinedweb-3m)| First 10,000 | 0.853|Real|
 \* I encountered an [issue](https://huggingface.co/datasets/allenai/dolma/discussions/26) and could not process the original [allenai/dolma](https://huggingface.co/datasets/allenai/dolma).

 # Output: [3.0000010156072676e-05]
 ```
+# Benchmark
 To make sure this classifier makes sense, it is applied to various datasets.
 Educational Value = 2 point * P(High) + 1 point * P(Mid) + 0 point * P(Low)
+The score can be interpreted as:
+|Educational Value| Category |
+|--------|----------|
+|2 | High|
+|1 | Mid|
+|0 | Low|
 |Dataset | Sampling | Average Educational Value | Type |
 |--------------------------------------|---|-------------------|-------|
 |[HuggingFaceFW/fineweb](https://huggingface.co/datasets/HuggingFaceFW/fineweb)| First 10,000 | 1.058|Real|
 |[BEE-spoke-data/fineweb-100k_en-med](https://huggingface.co/datasets/BEE-spoke-data/fineweb-100k_en-med)| First 10,000 | 1.017|Real|
 |[JeanKaddour/minipile](https://huggingface.co/datasets/JeanKaddour/minipile)| First 10,000 | 0.994|Real|
+|[togethercomputer/RedPajama-Data-V2](https://huggingface.co/datasets/togethercomputer/RedPajama-Data-V2)| First 10,000 | 0.979|Real|
 |[mattymchen/refinedweb-3m](https://huggingface.co/datasets/mattymchen/refinedweb-3m)| First 10,000 | 0.853|Real|
+|[tiiuae/falcon-refinedweb](https://huggingface.co/datasets/tiiuae/falcon-refinedweb)| First 10,000 | 0.798|Real|
 \* I encountered an [issue](https://huggingface.co/datasets/allenai/dolma/discussions/26) and could not process the original [allenai/dolma](https://huggingface.co/datasets/allenai/dolma).
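
The Educational Value formula in the README above is an expected score over the three categories. The following is a minimal sketch of that arithmetic only, not the model's own `predict_educational_value` code: the function names and the plain `"High"`/`"Mid"`/`"Low"` dictionary keys are assumptions made for illustration, and the classifier's real label strings may differ.

```python
from typing import Iterable, Mapping

# Points per category, taken from the formula in the README:
# Educational Value = 2 * P(High) + 1 * P(Mid) + 0 * P(Low)
# The key names here are hypothetical, chosen for readability.
POINTS = {"High": 2.0, "Mid": 1.0, "Low": 0.0}


def educational_value(probs: Mapping[str, float]) -> float:
    """Return the expected score for one document.

    `probs` maps a category name to its predicted probability.
    A missing category is treated as probability 0.
    """
    return sum(points * probs.get(label, 0.0) for label, points in POINTS.items())


def average_educational_value(batch: Iterable[Mapping[str, float]]) -> float:
    """Average the per-document scores, as in the benchmark table's
    "Average Educational Value" column."""
    scores = [educational_value(probs) for probs in batch]
    if not scores:
        raise ValueError("batch must contain at least one document")
    return sum(scores) / len(scores)


if __name__ == "__main__":
    # Illustrative probabilities only, not real model output.
    sample = [
        {"High": 0.25, "Mid": 0.5, "Low": 0.25},
        {"High": 0.0, "Mid": 0.0, "Low": 1.0},
    ]
    print(average_educational_value(sample))
```

Under this reading, a score near 2 means the probability mass sits on High, a score near 0 means it sits on Low, and a dataset average close to 1 can come either from mostly Mid documents or from a mix of High and Low.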