
Spaces in tokens

#1 by johnowhitaker - opened

I dug through the GLiNER codebase a while back, and while I'm still not sure, I think the default WordSplitter is used, and that it doesn't include a space at the start of each word. Since ModernBERT uses an OLMo-style tokenizer, most of the vocab entries have a leading space before the word! When I was trying out GLiNER as an eval during training, I ended up rolling my own splitter to work around this; it might be worth a look in case it gives even better performance. A minimal sketch of the mismatch is below.
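To make the mismatch concrete, here is a minimal sketch. It assumes the `transformers` library and the `answerdotai/ModernBERT-base` checkpoint, and the token strings in the comments are illustrative rather than exact outputs:

```python
# Sketch (assumes transformers is installed and answerdotai/ModernBERT-base
# is available) of why whitespace-splitting before tokenization can hurt:
# ModernBERT's BPE vocab stores most words *with* a leading space, so
# encoding each word in isolation yields different (often longer) token
# sequences than encoding the original text.
from transformers import AutoTokenizer

tok = AutoTokenizer.from_pretrained("answerdotai/ModernBERT-base")

text = "Paris is the capital of France"

# Tokenizing the full string keeps the leading-space variants of each word.
full_ids = tok(text, add_special_tokens=False)["input_ids"]
print(tok.convert_ids_to_tokens(full_ids))
# e.g. ['Paris', 'Ġis', 'Ġthe', 'Ġcapital', 'Ġof', 'ĠFrance']

# A naive word splitter drops the spaces, so every word is encoded as if
# it started the sentence, giving different IDs and sometimes more pieces.
split_ids = [
    tok(word, add_special_tokens=False)["input_ids"] for word in text.split()
]
print([tok.convert_ids_to_tokens(ids) for ids in split_ids])
# e.g. [['Paris'], ['is'], ['the'], ['capital'], ['of'], ['France']]

# One possible workaround (what a custom splitter might do): re-attach the
# leading space to every word except the first before encoding.
fixed_ids = [
    tok(" " + w if i else w, add_special_tokens=False)["input_ids"]
    for i, w in enumerate(text.split())
]
print([tok.convert_ids_to_tokens(ids) for ids in fixed_ids])
# e.g. [['Paris'], ['Ġis'], ['Ġthe'], ['Ġcapital'], ['Ġof'], ['ĠFrance']]
```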


(It seems to be working well, so perhaps this isn't an issue, but it feels like the kind of thing that might cause mysterious underperformance.)

Knowledgator Engineering org

@johnowhitaker, thank you for pointing out this issue; it could explain why we get poor results with the uni-encoder token-level GLiNER, and why the ModernBERT version in general requires more data. This bi-encoder GLiNER is span-level, so it may mitigate the issue, but it is worth investigating more deeply.
