
The output dimension of last_hidden_state

#5
by kabir5297 - opened

Hello. The CLAP documentation says that the audio model's last_hidden_state has shape (batch_size, seq_len, embedding_dim). In practice, however, I get a tensor of shape (1, 768, 2, 32). I went through modeling_clap.py in transformers, and it turns out the code itself does not produce the documented shape.

  1. Can you clarify whether the documentation is incorrect or whether I'm doing something wrong?
  2. If the documentation is incorrect, how can I get the audio embedding with shape (batch_size, seq_len, embedding_size)? I need this to combine it with the text embeddings of an LLM. Any help would be appreciated.
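
For reference, here is a minimal sketch of the kind of reshaping I have in mind. It assumes that axis 1 (size 768) is the embedding/channel axis and that the trailing 2 x 32 axes are a grid of patches that can be flattened into a sequence; I have not confirmed that this is the intended interpretation. The helper name `to_sequence` is hypothetical, and a random tensor stands in for the real last_hidden_state.

```python
import torch


def to_sequence(last_hidden_state: torch.Tensor) -> torch.Tensor:
    """Reshape (batch, channels, H, W) into (batch, H * W, channels).

    Assumption: axis 1 holds the embedding channels and the last two
    axes form a patch grid, flattened here in row-major order.
    """
    if last_hidden_state.dim() != 4:
        raise ValueError(
            f"expected a 4-D tensor, got shape {tuple(last_hidden_state.shape)}"
        )
    # flatten(2) merges the last two axes: (B, C, H, W) -> (B, C, H * W)
    # transpose(1, 2) swaps channels and sequence: -> (B, H * W, C)
    return last_hidden_state.flatten(2).transpose(1, 2)


# Dummy tensor with the shape observed in practice (not real model output).
dummy = torch.randn(1, 768, 2, 32)
seq = to_sequence(dummy)
print(seq.shape)  # torch.Size([1, 64, 768])
```

With this layout each of the 64 sequence positions carries a 768-dimensional vector. If the LLM's hidden size differs from 768, a learned projection would presumably still be needed before the two sets of embeddings can be combined.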