[Query-ISSUE] tokenizer.vocab_size is 128000, but len(tokenizer) is 128256, which prevents me from using those extra tokens.

#34
by HV-Khurdula


@HV-Khurdula The extra 256 are special tokens, with token ids ranging from 128000 to 128255.

These include <|begin_of_text|>, <|end_of_text|>, <|reserved_special_token_0|>, and so on. The first two are already in use as the BOS and EOS tokens.
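If it helps, here is a minimal sketch (assuming the transformers AutoTokenizer and access to the gated meta-llama/Llama-3.2-1B checkpoint) showing where the two numbers come from and how to list the added special tokens:

```python
from transformers import AutoTokenizer

# Assumes access to the gated meta-llama/Llama-3.2-1B repo.
tokenizer = AutoTokenizer.from_pretrained("meta-llama/Llama-3.2-1B")

print(tokenizer.vocab_size)  # 128000 -- base BPE vocabulary only
print(len(tokenizer))        # 128256 -- base vocabulary + 256 added special tokens

# The added special tokens (ids 128000-128255) and their strings:
for token_id, token in sorted(tokenizer.added_tokens_decoder.items()):
    print(token_id, token.content)

print(tokenizer.bos_token, tokenizer.bos_token_id)  # <|begin_of_text|> 128000
print(tokenizer.eos_token, tokenizer.eos_token_id)  # <|end_of_text|> 128001
```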

You can find the complete list in the tokenizer_config.json file.

https://huggingface.co/meta-llama/Llama-3.2-1B/blob/main/tokenizer_config.json
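And since the original question was about using the extra tokens: nothing stops you from doing so. A rough sketch of repurposing one of the reserved slots (for example as a padding token) might look like this; the exact id is whatever convert_tokens_to_ids returns for your checkpoint:

```python
# Look up one of the reserved slots; it already has an id in the 128000-128255 range,
# so no embedding resize is needed.
reserved = "<|reserved_special_token_0|>"
reserved_id = tokenizer.convert_tokens_to_ids(reserved)
print(reserved, reserved_id)

# Example: repurpose it as the pad token. The string already exists in the vocab,
# so this simply maps pad_token_id to the existing id.
tokenizer.pad_token = reserved
print(tokenizer.pad_token_id)
```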
