DarwinAnim8or committed on
Commit
4aaca45
·
verified ·
1 Parent(s): 6e4988b

Include avg char length for tokenizer analysis

  1. README.md +2 -0
README.md CHANGED
@@ -65,11 +65,13 @@ When comparing the token count statistics to another dataset, OpenWebText (OWT),
  ### GitHub Dataset:

  * Average token length: 1,510.87
+ * Average original text length (characters): 3583.65
  * Token to Character Ratio: 0.42

  ### OpenWebText (OWT) Dataset:

  * Average token length: 76.60
+ * Average original text length (characters): 213.18
  * Token to Character Ratio: 0.36

  The significantly higher average token length and token-to-character ratio for the GitHub dataset compared to OWT indicates the GitHub samples contain much longer and more verbose text. This aligns with the bimodal distribution and long tail observed in the histogram, which suggests the dataset contains a mix of both concise and more complex, lengthier text samples.
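
As a rough consistency check on the figures in this change, the token-to-character ratio should be close to the average token count divided by the average character length. The sketch below assumes the README's ratio is a ratio of averages (it could instead be an average of per-sample ratios, which the diff does not say). The helper name `token_char_ratio` is hypothetical and does not come from the repository.

```python
def token_char_ratio(avg_tokens: float, avg_chars: float) -> float:
    """Return average tokens per character (hypothetical helper, not from the repo)."""
    if avg_chars <= 0:
        raise ValueError("avg_chars must be positive")
    return avg_tokens / avg_chars


# Averages exactly as reported in the README diff.
github_ratio = token_char_ratio(1510.87, 3583.65)
owt_ratio = token_char_ratio(76.60, 213.18)

print(f"GitHub tokens per character: {github_ratio:.2f}")
print(f"OWT tokens per character:    {owt_ratio:.2f}")
```

Rounded to two decimals, both values match the ratios already listed (0.42 and 0.36), so the newly added character lengths are consistent with the existing statistics. Put the other way round, that is roughly 2.4 characters per token for the GitHub samples and roughly 2.8 for OWT.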