Why is merges.txt empty in DeepChem/ChemBERTa-77M-MTR?

#5
by Mafuton - opened

Hi,

I downloaded the tokenizer for DeepChem/ChemBERTa-77M-MTR (or ChemBERTa-77M-MLM) and found that the merges.txt file is empty. Since this tokenizer is supposed to use Byte Pair Encoding (BPE), I expected merges.txt to contain merge rules. Because the file is empty, tokenization does not work as expected: "Cl" is split into "C" and "l" instead of being kept as a single token.
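To illustrate why an empty merges.txt would lead to this, here is a minimal toy sketch of how BPE applies merge rules. It is not the actual Hugging Face implementation; the function name `bpe_tokenize` and the single merge rule `("C", "l")` are hypothetical and chosen only for the example. It skips the byte-level mapping and the vocabulary lookup that a real tokenizer performs.

```python
def bpe_tokenize(word, merges):
    """Toy BPE: apply merge rules (listed in priority order) to one word.

    `merges` is a list of (left, right) symbol pairs, analogous to the
    lines of a merges.txt file. This is an illustrative assumption, not
    the library's real code path.
    """
    # Earlier rules in the list have higher priority (lower rank).
    ranks = {pair: i for i, pair in enumerate(merges)}
    symbols = list(word)  # start from individual characters

    while len(symbols) > 1:
        pairs = [(symbols[i], symbols[i + 1]) for i in range(len(symbols) - 1)]
        best = min(pairs, key=lambda p: ranks.get(p, float("inf")))
        if best not in ranks:
            break  # no applicable merge rule is left

        # Merge every occurrence of the best pair, scanning left to right.
        merged = []
        i = 0
        while i < len(symbols):
            if i < len(symbols) - 1 and (symbols[i], symbols[i + 1]) == best:
                merged.append(symbols[i] + symbols[i + 1])
                i += 2
            else:
                merged.append(symbols[i])
                i += 1
        symbols = merged

    return symbols


# With no merge rules, every character stays a separate symbol.
print(bpe_tokenize("CCl", []))            # ['C', 'C', 'l']
# With a hypothetical rule merging "C" + "l", chlorine stays together.
print(bpe_tokenize("CCl", [("C", "l")]))  # ['C', 'Cl']
```

In this sketch, an empty rule list means the loop exits immediately, so the output is always the character-level split, which matches the behavior I am seeing.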

Could you clarify why merges.txt is empty? Should there be a proper merges.txt, or is this the intended behavior?

Thanks!
