laion/clap-htsat-unfused · The output dimension of last_hidden_state

Hello, the CLAP documentation says the audio last_hidden_state has shape (batch_size, seq_len, embedding_dim). However, in practice I get the shape (1, 768, 2, 32). I went through modeling_clap.py in transformers, and it turns out the code itself doesn't produce the documented shape.

Can you clarify whether the documentation is incorrect or I'm doing something wrong?
If it's the former, how can I get the audio embedding with shape (batch_size, seq_len, embedding_size)? I need this to combine it with the text embeddings of an LLM. Any help would be appreciated.
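
For reference, here is a minimal sketch of one possible way to get from the 4D shape I'm seeing to a 3D one. It assumes (this is my guess, not something the documentation confirms) that the layout is (batch_size, channels, height, width), so that the two trailing axes can be merged into a single sequence axis. The helper name to_sequence is made up for illustration, and a random tensor stands in for the real model output.

```python
import torch


def to_sequence(last_hidden_state: torch.Tensor) -> torch.Tensor:
    """Reshape (batch, channels, height, width) to (batch, height * width, channels).

    Assumes the input is channels-first with two trailing spatial axes.
    """
    # Merge the two trailing axes: (B, C, H, W) -> (B, C, H * W)
    flat = last_hidden_state.flatten(start_dim=2)
    # Put channels last so each position holds one embedding: (B, H * W, C)
    return flat.transpose(1, 2).contiguous()


# Random tensor with the shape reported above, standing in for the audio output
dummy = torch.randn(1, 768, 2, 32)
seq = to_sequence(dummy)
print(seq.shape)  # torch.Size([1, 64, 768])
```

With this, seq_len would be 2 * 32 = 64 and embedding_dim would be 768. If the LLM's hidden size is not 768, a projection layer would presumably still be needed before the two can be concatenated.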
