Include avg char length for tokenizer analysis
README.md
CHANGED
@@ -65,11 +65,13 @@ When comparing the token count statistics to another dataset, OpenWebText (OWT),
### GitHub Dataset:

* Average token length: 1,510.87
+ * Average original text length (characters): 3583.65
* Token to Character Ratio: 0.42

### OpenWebText (OWT) Dataset:

* Average token length: 76.60
+ * Average original text length (characters): 213.18
* Token to Character Ratio: 0.36

The significantly higher average token length and token-to-character ratio for the GitHub dataset compared to OWT indicates the GitHub samples contain much longer and more verbose text. This aligns with the bimodal distribution and long tail observed in the histogram, which suggests the dataset contains a mix of both concise and more complex, lengthier text samples.
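The three statistics listed for each dataset can be reproduced with a few lines of Python. The sketch below is illustrative only: `token_stats` is a hypothetical helper, and `str.split` stands in for the real tokenizer, which the README does not name. It also assumes the ratio is the average token count divided by the average character count; the reported figures are consistent with that reading (1,510.87 / 3583.65 ≈ 0.42 and 76.60 / 213.18 ≈ 0.36), but a mean of per-sample ratios is also possible.

```python
from statistics import mean
from typing import Callable, Dict, Iterable, Sequence


def token_stats(
    texts: Iterable[str],
    tokenize: Callable[[str], Sequence],
) -> Dict[str, float]:
    """Return average token count, average character count, and their ratio.

    `tokenize` is any callable mapping a string to a sequence of tokens.
    The ratio is computed from the two averages (an assumption, see above).
    """
    texts = list(texts)
    if not texts:
        raise ValueError("texts must contain at least one sample")

    token_counts = [len(tokenize(t)) for t in texts]  # tokens per sample
    char_counts = [len(t) for t in texts]             # characters per sample

    avg_tokens = mean(token_counts)
    avg_chars = mean(char_counts)
    if avg_chars == 0:
        raise ValueError("all samples are empty; ratio is undefined")

    return {
        "avg_tokens": avg_tokens,
        "avg_chars": avg_chars,
        "token_to_char_ratio": avg_tokens / avg_chars,
    }


# Whitespace splitting is only a placeholder for the actual tokenizer.
sample = ["def f(x):\n    return x", "hello world"]
stats = token_stats(sample, str.split)
```

To use this with a real tokenizer, pass its encode function as `tokenize`; any callable that returns a sequence of tokens or token ids works.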