One second of actions encodes to >500 tokens
Hi, one second of actions in my setup has shape 60x44 (freq x action_dim). When I use your universal tokenizer, and also when I retrain the tokenizer and use it, I get huge outputs: for example, encoding one second of actions yields 517 tokens.
It doesn't seem reasonable to me to make the vocab larger (currently 1024, the default in the code), as I have only ~1000 samples.
(I've normalized the actions and followed your steps.)
I'd be happy for any advice.
Thanks!
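For context, the raw numbers behind the report (just arithmetic on the shapes above; the 517-token count is from my run):

```python
# One chunk is 60 timesteps x 44 dims = 2640 raw action values.
scalars_per_chunk = 60 * 44
tokens_per_chunk = 517
print(scalars_per_chunk / tokens_per_chunk)  # ~5.1 raw action values per token
```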
You can check whether increasing the vocab helps, or try decreasing "scale", which essentially makes the compression more lossy. You can also try training with half-second chunks to make things more practical.
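For example, the half-second chunking is just a reshape of your (already normalized) data; the vocab and "scale" knobs then go into the retraining step. The exact parameter names depend on the tokenizer's fit/constructor signature, so treat the comments below as a sketch rather than the actual API:

```python
import numpy as np

# Placeholder for your ~1000 normalized one-second chunks: (num_samples, 60, 44).
action_data = np.random.uniform(-1.0, 1.0, size=(1000, 60, 44))

# Half-second chunks: each tokenized chunk now covers 30 timesteps instead of 60,
# which shortens the per-chunk encoding and doubles the number of samples
# available for retraining the tokenizer.
half_second_chunks = action_data.reshape(-1, 30, 44)
print(half_second_chunks.shape)  # (2000, 30, 44)

# When retraining on these chunks, the other two knobs are the vocab size
# (try something above the 1024 default) and "scale" (try a lower value:
# coarser quantization, lossier but shorter encodings).
```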
It's possible that for very high-dim robots like yours it's worth trying neural compression again to see whether it gives a better tradeoff (we only tried up to 16-dim).
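If you want to experiment with that, here is a generic sketch (not code from this repo) of a small vector-quantized autoencoder over half-second chunks; all the sizes (8 codes per chunk, a 1024-entry codebook, 64-dim latents) are illustrative assumptions you'd tune against reconstruction error on your ~1000 samples:

```python
import torch
import torch.nn as nn

FREQ, DIM = 30, 44                        # half-second chunk: 30 timesteps x 44 dims
N_CODES, CODEBOOK, LATENT = 8, 1024, 64   # illustrative sizes, not tuned

class ActionVQAutoencoder(nn.Module):
    """Compress an action chunk into N_CODES discrete tokens via a VQ bottleneck."""

    def __init__(self):
        super().__init__()
        self.encoder = nn.Sequential(
            nn.Linear(FREQ * DIM, 512), nn.ReLU(),
            nn.Linear(512, N_CODES * LATENT),
        )
        self.codebook = nn.Embedding(CODEBOOK, LATENT)
        self.decoder = nn.Sequential(
            nn.Linear(N_CODES * LATENT, 512), nn.ReLU(),
            nn.Linear(512, FREQ * DIM),
        )

    def forward(self, actions):
        b = actions.shape[0]
        z = self.encoder(actions.reshape(b, -1)).reshape(b, N_CODES, LATENT)
        # Nearest codebook entry per latent slot -> the discrete action tokens.
        dists = (z.unsqueeze(-2) - self.codebook.weight).pow(2).sum(-1)
        tokens = dists.argmin(dim=-1)                       # (b, N_CODES)
        zq = self.codebook(tokens)
        # Standard VQ losses (codebook + commitment) and straight-through estimator.
        vq_loss = ((zq - z.detach()) ** 2).mean() + 0.25 * ((z - zq.detach()) ** 2).mean()
        zq = z + (zq - z).detach()
        recon = self.decoder(zq.reshape(b, -1)).reshape(actions.shape)
        return recon, tokens, vq_loss

model = ActionVQAutoencoder()
chunk = torch.randn(4, FREQ, DIM)                # stand-in for normalized action chunks
recon, tokens, vq_loss = model(chunk)
loss = ((recon - chunk) ** 2).mean() + vq_loss   # train on this
print(tokens.shape)                              # (4, 8): 8 tokens per half-second chunk
```

With something like this, one second of actions would be two chunks, i.e. 16 tokens instead of ~500, at whatever reconstruction error the autoencoder reaches on your data.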