apple/aimv2-large-patch14-224 · Image and Text features

Nov 26, 2024

Hello, could you please give a simple example of obtaining both text features and image features? Only an image feature example has been added.

michalk8

Apple org Nov 28, 2024

Currently, we only provide this for the LiT-tuned checkpoint here.

maxlun

9 days ago

edited 9 days ago

Doing it with MLX was kind of the point. I'm digging around in the repo; I think you can get the tokenizer from transformers:

from transformers import AutoProcessor

processor = AutoProcessor.from_pretrained(
    "apple/aimv2-large-patch14-224-lit",
)

then just find that in the library code and look around for a way to get the input_ids; those should be what you pass to the call in
.venv/lib/python3.10/site-packages/aim/v2/mlx/models.py
It's this call:

class AIMv2LiT(nn.Module):
...
    def encode_text(
        self,
        input_ids: mx.array,
        mask: Optional[mx.array] = None,
        output_features: bool = False,
    ) -> Union[mx.array, Tuple[mx.array, Tuple[mx.array, ...]]]:
        out = self.text_encoder(input_ids, mask=mask, output_features=output_features)
        out = self.text_projector(out)
        return out

I'm gonna give up here for now, since this is a sidetrack from something else. Please ping me if you find a way to test it out in MLX!
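For anyone picking this up, here is a rough sketch of how the two pieces above might fit together: tokenize with the processor, turn the input_ids into an mx.array, and hand them to encode_text. The helper name encode_texts and its to_array parameter are made up for illustration. It also assumes the processor accepts text= and padding= and returns a mapping with an "input_ids" entry, and that model is an already-loaded MLX AIMv2LiT (loading it is not covered here). It has not been checked against the real checkpoint.

```python
from typing import Any, Callable, Optional, Sequence


def encode_texts(
    model: Any,
    processor: Any,
    texts: Sequence[str],
    to_array: Optional[Callable[[Any], Any]] = None,
) -> Any:
    """Tokenize `texts` and pass the resulting input_ids to model.encode_text.

    Hypothetical helper: `model` is expected to expose the encode_text method
    quoted above, and `processor` is the AutoProcessor from the -lit checkpoint.
    """
    if to_array is None:
        # Imported lazily so the helper can be read and tried without MLX installed.
        import mlx.core as mx

        to_array = mx.array

    # Assumption: the processor takes `text=` and `padding=` and returns a
    # mapping whose "input_ids" entry is a nested list of token ids.
    batch = processor(text=list(texts), padding=True)
    input_ids = to_array(batch["input_ids"])

    # `mask` is left at its default (None); the mask format encode_text
    # expects is not shown in the thread, so padded batches may need more care.
    return model.encode_text(input_ids)


# Hypothetical usage, assuming `model` is an already-loaded MLX AIMv2LiT:
#   from transformers import AutoProcessor
#   processor = AutoProcessor.from_pretrained("apple/aimv2-large-patch14-224-lit")
#   features = encode_texts(model, processor, ["a photo of a cat", "a photo of a dog"])
```

The to_array hook is only there so the conversion step can be swapped out (for example to check the plumbing with stand-in objects); with the default it calls mx.array on the token ids.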