The output dimension of last_hidden_state
#5 opened by kabir5297
Hello, the CLAP documentation says that the audio last_hidden_state has shape (batch_size, seq_len, embedding_dim). In practice, however, I get a tensor of shape (1, 768, 2, 32). I went through modeling_clap.py in transformers, and it turns out the code itself does not produce the documented shape.
- Can you clarify whether the documentation is incorrect or whether I'm doing something wrong?
- If the documentation is incorrect, how can I get the audio embedding with shape (batch_size, seq_len, embedding_dim)? I need this to combine it with the text embeddings of an LLM. Any help would be appreciated.
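For reference, here is a minimal sketch of the reshaping I have in mind. It assumes the 4-D output is laid out as (batch_size, channels, freq_patches, time_patches), which is my reading of the code and not something the documentation confirms. The helper name `to_sequence` is hypothetical, and a random tensor stands in for the real model output.

```python
import torch


def to_sequence(last_hidden_state: torch.Tensor) -> torch.Tensor:
    """Turn a (B, C, H, W) tensor into a (B, H*W, C) sequence.

    Assumption: axis 1 is the embedding/channel axis and the last two
    axes are spatial (frequency, time) patch positions.
    """
    # (B, C, H, W) -> (B, C, H*W): merge the two patch axes into one
    flat = last_hidden_state.flatten(2)
    # (B, C, H*W) -> (B, H*W, C): put the sequence axis before the channels
    return flat.transpose(1, 2).contiguous()


# Dummy tensor with the shape reported above, in place of the model output.
dummy = torch.randn(1, 768, 2, 32)
seq = to_sequence(dummy)
print(seq.shape)  # torch.Size([1, 64, 768])
```

If that layout assumption holds, this gives 64 positions of 768 features each. A learned linear projection would probably still be needed if the LLM's hidden size is not 768.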