Add viz/explanation feature for image and text activations
- CLIP_Explainability/README.md +43 -0
- CLIP_Explainability/auxilary.py +532 -0
- CLIP_Explainability/bpe_simple_vocab_16e6.txt.gz +3 -0
- CLIP_Explainability/clip_.py +305 -0
- CLIP_Explainability/image_utils.py +22 -0
- CLIP_Explainability/model.py +446 -0
- CLIP_Explainability/simple_tokenizer.py +136 -0
- CLIP_Explainability/vit_cam.py +325 -0
- app.py +247 -10
- requirements.txt +3 -0
- resized_ja_features.npy +3 -0
- resized_ml_features.npy +3 -0
CLIP_Explainability/README.md
ADDED
@@ -0,0 +1,43 @@
# CLIP Explainability

This repo contains the code for the [CLIP Explainability project](CLIP_Explainability.pdf).

In this project, we conduct an in-depth study of CLIP's learned image and text representations using saliency map visualization. We propose a modification to the existing saliency visualization method that improves its performance, as shown by our qualitative evaluations. We then use this method to study CLIP's ability to capture similarities and dissimilarities between an input image and targets belonging to different domains, including image, text, and emotion.

## Setup

To install the required libraries, run the following command:

```
pip install -r requirements.txt
```

## Organization

The [code](code) directory contains

- the implementation of the saliency visualization methods for [ViT](code/vit_cam.py)- and [ResNet](code/rn_cam.py)-based CLIP
- a [GradCAM](code/pytorch-grad-cam) implementation based on [pytorch-grad-cam](https://github.com/jacobgil/pytorch-grad-cam/tree/e93f41104e20134e5feac2a660b343437f601ad0), slightly modified to adapt to CLIP
- a re-implementation of CLIP, taken from the [Transformer-MM-Explainability](https://github.com/hila-chefer/Transformer-MM-Explainability) repo, that keeps track of attention maps and gradients: [clip_.py](code/clip_.py)
- [notebooks](code/notebooks/) for the experiments explained in the report

[Images](Images) contains the images used in the experiments.

[results](results) contains the results obtained from the experiments. Any result generated by the notebooks will be stored in this directory.

## Experiments

| Notebook Name | Experiment | Note |
| ------------- | ------------- | ------------- |
| [vit_block_vis](code/notebooks/vit_block_vis.ipynb) | Layer-wise Attention Visualization | - |
| [saliency_method_compare](code/notebooks/saliency_method_compare.ipynb) | ViT Explainability Method Comparison | Qualitative comparison |
| [affectnet_emotions](code/notebooks/affectnet_emotions.ipynb) | ViT Explainability Method Comparison | Bias comparison; you need to download a sample of the AffectNet dataset [here](https://drive.google.com/drive/u/1/folders/11RusPab71wGw6LTd9pUnY1Gz3JSH-N_N) and place it in [Images](Images). |
| [pos_neg_vis](code/notebooks/pos_neg_vis.ipynb) | Positive vs. Negative Saliency | - |
| [artemis_emotions](code/notebooks/artemis_emotions.ipynb) | Emotion-Image Similarity | You need to download the pre-processed WikiArt images [here](https://drive.google.com/drive/u/1/folders/11RusPab71wGw6LTd9pUnY1Gz3JSH-N_N) and place them in [Images](Images). Note that this notebook chooses images randomly, so the results may differ from the ones in the report. |
| [perword_vis](code/notebooks/perword_vis.ipynb) | Word-Wise Saliency Visualization | - |
| [global_vis](code/notebooks/global_vis.ipynb) | - | Can be used to visualize saliency maps for ViT- and ResNet-based CLIP. |
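For a quick start with the re-implemented CLIP added in this commit, the sketch below loads the non-JIT model and compares an image against two captions. It is a minimal sketch, not part of the commit: the package import path `CLIP_Explainability`, the image path, and the caption strings are placeholder assumptions.

```python
import torch
from PIL import Image
from CLIP_Explainability import clip_  # assumed import path for the files added below

device = "cuda" if torch.cuda.is_available() else "cpu"
# jit=False returns the hackable re-implementation from model.py, which exposes
# the attention hooks used by the saliency visualization code.
model, preprocess = clip_.load("ViT-B/32", device=device, jit=False)

image = preprocess(Image.open("Images/example.jpg")).unsqueeze(0).to(device)  # placeholder path
texts = clip_.tokenize(["a happy dog", "a sad painting"]).to(device)

with torch.no_grad():
    logits_per_image, _ = model(image, texts)
    probs = logits_per_image.softmax(dim=-1)

print(probs.cpu().numpy())  # relative similarity of the image to each caption
```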
CLIP_Explainability/auxilary.py
ADDED
@@ -0,0 +1,532 @@
import torch
import warnings
from typing import Tuple, Optional

import torch
from torch import Tensor
from torch.nn.init import xavier_uniform_
from torch.nn.init import constant_
from torch.nn.init import xavier_normal_
from torch.nn.parameter import Parameter
from torch.nn import functional as F

# We define this function as _pad because it takes an argument
# named pad, which clobbers the recursive reference to the pad
# function needed for __torch_function__ support
pad = F.pad


# This class exists solely for Transformer; it has an annotation stating
# that bias is never None, which appeases TorchScript
class _LinearWithBias(torch.nn.Linear):
    bias: Tensor

    def __init__(self, in_features: int, out_features: int) -> None:
        super().__init__(in_features, out_features, bias=True)


def multi_head_attention_forward(
    query: Tensor,
    key: Tensor,
    value: Tensor,
    embed_dim_to_check: int,
    num_heads: int,
    in_proj_weight: Tensor,
    in_proj_bias: Tensor,
    bias_k: Optional[Tensor],
    bias_v: Optional[Tensor],
    add_zero_attn: bool,
    dropout_p: float,
    out_proj_weight: Tensor,
    out_proj_bias: Tensor,
    training: bool = True,
    key_padding_mask: Optional[Tensor] = None,
    need_weights: bool = True,
    attn_mask: Optional[Tensor] = None,
    use_separate_proj_weight: bool = False,
    q_proj_weight: Optional[Tensor] = None,
    k_proj_weight: Optional[Tensor] = None,
    v_proj_weight: Optional[Tensor] = None,
    static_k: Optional[Tensor] = None,
    static_v: Optional[Tensor] = None,
    attention_probs_forward_hook=None,
    attention_probs_backwards_hook=None,
) -> Tuple[Tensor, Optional[Tensor]]:
    if not torch.jit.is_scripting():
        tens_ops = (
            query,
            key,
            value,
            in_proj_weight,
            in_proj_bias,
            bias_k,
            bias_v,
            out_proj_weight,
            out_proj_bias,
        )
        if any([type(t) is not Tensor for t in tens_ops]) and F.has_torch_function(
            tens_ops
        ):
            return F.handle_torch_function(
                multi_head_attention_forward,
                tens_ops,
                query,
                key,
                value,
                embed_dim_to_check,
                num_heads,
                in_proj_weight,
                in_proj_bias,
                bias_k,
                bias_v,
                add_zero_attn,
                dropout_p,
                out_proj_weight,
                out_proj_bias,
                training=training,
                key_padding_mask=key_padding_mask,
                need_weights=need_weights,
                attn_mask=attn_mask,
                use_separate_proj_weight=use_separate_proj_weight,
                q_proj_weight=q_proj_weight,
                k_proj_weight=k_proj_weight,
                v_proj_weight=v_proj_weight,
                static_k=static_k,
                static_v=static_v,
            )
    tgt_len, bsz, embed_dim = query.size()
    assert embed_dim == embed_dim_to_check
    # allow MHA to have different sizes for the feature dimension
    assert key.size(0) == value.size(0) and key.size(1) == value.size(1)

    head_dim = embed_dim // num_heads
    assert head_dim * num_heads == embed_dim, "embed_dim must be divisible by num_heads"
    scaling = float(head_dim) ** -0.5

    if not use_separate_proj_weight:
        if torch.equal(query, key) and torch.equal(key, value):
            # self-attention
            q, k, v = F.linear(query, in_proj_weight, in_proj_bias).chunk(3, dim=-1)

        elif torch.equal(key, value):
            # encoder-decoder attention
            # This is inline in_proj function with in_proj_weight and in_proj_bias
            _b = in_proj_bias
            _start = 0
            _end = embed_dim
            _w = in_proj_weight[_start:_end, :]
            if _b is not None:
                _b = _b[_start:_end]
            q = F.linear(query, _w, _b)

            if key is None:
                assert value is None
                k = None
                v = None
            else:
                # This is inline in_proj function with in_proj_weight and in_proj_bias
                _b = in_proj_bias
                _start = embed_dim
                _end = None
                _w = in_proj_weight[_start:, :]
                if _b is not None:
                    _b = _b[_start:]
                k, v = F.linear(key, _w, _b).chunk(2, dim=-1)

        else:
            # This is inline in_proj function with in_proj_weight and in_proj_bias
            _b = in_proj_bias
            _start = 0
            _end = embed_dim
            _w = in_proj_weight[_start:_end, :]
            if _b is not None:
                _b = _b[_start:_end]
            q = F.linear(query, _w, _b)

            # This is inline in_proj function with in_proj_weight and in_proj_bias
            _b = in_proj_bias
            _start = embed_dim
            _end = embed_dim * 2
            _w = in_proj_weight[_start:_end, :]
            if _b is not None:
                _b = _b[_start:_end]
            k = F.linear(key, _w, _b)

            # This is inline in_proj function with in_proj_weight and in_proj_bias
            _b = in_proj_bias
            _start = embed_dim * 2
            _end = None
            _w = in_proj_weight[_start:, :]
            if _b is not None:
                _b = _b[_start:]
            v = F.linear(value, _w, _b)
    else:
        q_proj_weight_non_opt = torch.jit._unwrap_optional(q_proj_weight)
        len1, len2 = q_proj_weight_non_opt.size()
        assert len1 == embed_dim and len2 == query.size(-1)

        k_proj_weight_non_opt = torch.jit._unwrap_optional(k_proj_weight)
        len1, len2 = k_proj_weight_non_opt.size()
        assert len1 == embed_dim and len2 == key.size(-1)

        v_proj_weight_non_opt = torch.jit._unwrap_optional(v_proj_weight)
        len1, len2 = v_proj_weight_non_opt.size()
        assert len1 == embed_dim and len2 == value.size(-1)

        if in_proj_bias is not None:
            q = F.linear(query, q_proj_weight_non_opt, in_proj_bias[0:embed_dim])
            k = F.linear(
                key, k_proj_weight_non_opt, in_proj_bias[embed_dim : (embed_dim * 2)]
            )
            v = F.linear(value, v_proj_weight_non_opt, in_proj_bias[(embed_dim * 2) :])
        else:
            q = F.linear(query, q_proj_weight_non_opt, in_proj_bias)
            k = F.linear(key, k_proj_weight_non_opt, in_proj_bias)
            v = F.linear(value, v_proj_weight_non_opt, in_proj_bias)
    q = q * scaling

    if attn_mask is not None:
        assert (
            attn_mask.dtype == torch.float32
            or attn_mask.dtype == torch.float64
            or attn_mask.dtype == torch.float16
            or attn_mask.dtype == torch.uint8
            or attn_mask.dtype == torch.bool
        ), "Only float, byte, and bool types are supported for attn_mask, not {}".format(
            attn_mask.dtype
        )
        if attn_mask.dtype == torch.uint8:
            warnings.warn(
                "Byte tensor for attn_mask in nn.MultiheadAttention is deprecated. Use bool tensor instead."
            )
            attn_mask = attn_mask.to(torch.bool)

        if attn_mask.dim() == 2:
            attn_mask = attn_mask.unsqueeze(0)
            if list(attn_mask.size()) != [1, query.size(0), key.size(0)]:
                raise RuntimeError("The size of the 2D attn_mask is not correct.")
        elif attn_mask.dim() == 3:
            if list(attn_mask.size()) != [bsz * num_heads, query.size(0), key.size(0)]:
                raise RuntimeError("The size of the 3D attn_mask is not correct.")
        else:
            raise RuntimeError(
                "attn_mask's dimension {} is not supported".format(attn_mask.dim())
            )
        # attn_mask's dim is 3 now.

    # convert ByteTensor key_padding_mask to bool
    if key_padding_mask is not None and key_padding_mask.dtype == torch.uint8:
        warnings.warn(
            "Byte tensor for key_padding_mask in nn.MultiheadAttention is deprecated. Use bool tensor instead."
        )
        key_padding_mask = key_padding_mask.to(torch.bool)

    if bias_k is not None and bias_v is not None:
        if static_k is None and static_v is None:
            k = torch.cat([k, bias_k.repeat(1, bsz, 1)])
            v = torch.cat([v, bias_v.repeat(1, bsz, 1)])
            if attn_mask is not None:
                attn_mask = pad(attn_mask, (0, 1))
            if key_padding_mask is not None:
                key_padding_mask = pad(key_padding_mask, (0, 1))
        else:
            assert static_k is None, "bias cannot be added to static key."
            assert static_v is None, "bias cannot be added to static value."
    else:
        assert bias_k is None
        assert bias_v is None

    q = q.contiguous().view(tgt_len, bsz * num_heads, head_dim).transpose(0, 1)
    if k is not None:
        k = k.contiguous().view(-1, bsz * num_heads, head_dim).transpose(0, 1)
    if v is not None:
        v = v.contiguous().view(-1, bsz * num_heads, head_dim).transpose(0, 1)

    if static_k is not None:
        assert static_k.size(0) == bsz * num_heads
        assert static_k.size(2) == head_dim
        k = static_k

    if static_v is not None:
        assert static_v.size(0) == bsz * num_heads
        assert static_v.size(2) == head_dim
        v = static_v

    src_len = k.size(1)

    if key_padding_mask is not None:
        assert key_padding_mask.size(0) == bsz
        assert key_padding_mask.size(1) == src_len

    if add_zero_attn:
        src_len += 1
        k = torch.cat(
            [
                k,
                torch.zeros(
                    (k.size(0), 1) + k.size()[2:], dtype=k.dtype, device=k.device
                ),
            ],
            dim=1,
        )
        v = torch.cat(
            [
                v,
                torch.zeros(
                    (v.size(0), 1) + v.size()[2:], dtype=v.dtype, device=v.device
                ),
            ],
            dim=1,
        )
        if attn_mask is not None:
            attn_mask = pad(attn_mask, (0, 1))
        if key_padding_mask is not None:
            key_padding_mask = pad(key_padding_mask, (0, 1))

    attn_output_weights = torch.bmm(q, k.transpose(1, 2))
    assert list(attn_output_weights.size()) == [bsz * num_heads, tgt_len, src_len]

    if attn_mask is not None:
        if attn_mask.dtype == torch.bool:
            attn_output_weights.masked_fill_(attn_mask, float("-inf"))
        else:
            attn_output_weights += attn_mask

    if key_padding_mask is not None:
        attn_output_weights = attn_output_weights.view(bsz, num_heads, tgt_len, src_len)
        attn_output_weights = attn_output_weights.masked_fill(
            key_padding_mask.unsqueeze(1).unsqueeze(2),
            float("-inf"),
        )
        attn_output_weights = attn_output_weights.view(
            bsz * num_heads, tgt_len, src_len
        )

    attn_output_weights = F.softmax(attn_output_weights, dim=-1)
    attn_output_weights = F.dropout(attn_output_weights, p=dropout_p, training=training)

    # use hooks for the attention weights if necessary
    if (
        attention_probs_forward_hook is not None
        and attention_probs_backwards_hook is not None
    ):
        attention_probs_forward_hook(attn_output_weights)
        attn_output_weights.register_hook(attention_probs_backwards_hook)

    attn_output = torch.bmm(attn_output_weights, v)
    assert list(attn_output.size()) == [bsz * num_heads, tgt_len, head_dim]
    attn_output = attn_output.transpose(0, 1).contiguous().view(tgt_len, bsz, embed_dim)
    attn_output = F.linear(attn_output, out_proj_weight, out_proj_bias)

    if need_weights:
        # average attention weights over heads
        attn_output_weights = attn_output_weights.view(bsz, num_heads, tgt_len, src_len)
        return attn_output, attn_output_weights.sum(dim=1) / num_heads
    else:
        return attn_output, None


class MultiheadAttention(torch.nn.Module):
    r"""Allows the model to jointly attend to information
    from different representation subspaces.
    See reference: Attention Is All You Need

    .. math::
        \text{MultiHead}(Q, K, V) = \text{Concat}(head_1,\dots,head_h)W^O
        \text{where} head_i = \text{Attention}(QW_i^Q, KW_i^K, VW_i^V)

    Args:
        embed_dim: total dimension of the model.
        num_heads: parallel attention heads.
        dropout: a Dropout layer on attn_output_weights. Default: 0.0.
        bias: add bias as module parameter. Default: True.
        add_bias_kv: add bias to the key and value sequences at dim=0.
        add_zero_attn: add a new batch of zeros to the key and
                       value sequences at dim=1.
        kdim: total number of features in key. Default: None.
        vdim: total number of features in value. Default: None.

        Note: if kdim and vdim are None, they will be set to embed_dim such that
        query, key, and value have the same number of features.

    Examples::

        >>> multihead_attn = nn.MultiheadAttention(embed_dim, num_heads)
        >>> attn_output, attn_output_weights = multihead_attn(query, key, value)
    """

    bias_k: Optional[torch.Tensor]
    bias_v: Optional[torch.Tensor]

    def __init__(
        self,
        embed_dim,
        num_heads,
        dropout=0.0,
        bias=True,
        add_bias_kv=False,
        add_zero_attn=False,
        kdim=None,
        vdim=None,
    ):
        super(MultiheadAttention, self).__init__()
        self.embed_dim = embed_dim
        self.kdim = kdim if kdim is not None else embed_dim
        self.vdim = vdim if vdim is not None else embed_dim
        self._qkv_same_embed_dim = self.kdim == embed_dim and self.vdim == embed_dim

        self.num_heads = num_heads
        self.dropout = dropout
        self.head_dim = embed_dim // num_heads
        assert (
            self.head_dim * num_heads == self.embed_dim
        ), "embed_dim must be divisible by num_heads"

        if self._qkv_same_embed_dim is False:
            self.q_proj_weight = Parameter(torch.Tensor(embed_dim, embed_dim))
            self.k_proj_weight = Parameter(torch.Tensor(embed_dim, self.kdim))
            self.v_proj_weight = Parameter(torch.Tensor(embed_dim, self.vdim))
            self.register_parameter("in_proj_weight", None)
        else:
            self.in_proj_weight = Parameter(torch.empty(3 * embed_dim, embed_dim))
            self.register_parameter("q_proj_weight", None)
            self.register_parameter("k_proj_weight", None)
            self.register_parameter("v_proj_weight", None)

        if bias:
            self.in_proj_bias = Parameter(torch.empty(3 * embed_dim))
        else:
            self.register_parameter("in_proj_bias", None)
        self.out_proj = _LinearWithBias(embed_dim, embed_dim)

        if add_bias_kv:
            self.bias_k = Parameter(torch.empty(1, 1, embed_dim))
            self.bias_v = Parameter(torch.empty(1, 1, embed_dim))
        else:
            self.bias_k = self.bias_v = None

        self.add_zero_attn = add_zero_attn

        self._reset_parameters()

    def _reset_parameters(self):
        if self._qkv_same_embed_dim:
            xavier_uniform_(self.in_proj_weight)
        else:
            xavier_uniform_(self.q_proj_weight)
            xavier_uniform_(self.k_proj_weight)
            xavier_uniform_(self.v_proj_weight)

        if self.in_proj_bias is not None:
            constant_(self.in_proj_bias, 0.0)
            constant_(self.out_proj.bias, 0.0)
        if self.bias_k is not None:
            xavier_normal_(self.bias_k)
        if self.bias_v is not None:
            xavier_normal_(self.bias_v)

    def __setstate__(self, state):
        # Support loading old MultiheadAttention checkpoints generated by v1.1.0
        if "_qkv_same_embed_dim" not in state:
            state["_qkv_same_embed_dim"] = True

        super(MultiheadAttention, self).__setstate__(state)

    def forward(
        self,
        query,
        key,
        value,
        key_padding_mask=None,
        need_weights=True,
        attn_mask=None,
        attention_probs_forward_hook=None,
        attention_probs_backwards_hook=None,
    ):
        r"""
        Args:
            query, key, value: map a query and a set of key-value pairs to an output.
                See "Attention Is All You Need" for more details.
            key_padding_mask: if provided, specified padding elements in the key will
                be ignored by the attention. When given a binary mask and a value is True,
                the corresponding value on the attention layer will be ignored. When given
                a byte mask and a value is non-zero, the corresponding value on the attention
                layer will be ignored
            need_weights: output attn_output_weights.
            attn_mask: 2D or 3D mask that prevents attention to certain positions. A 2D mask will be broadcasted for all
                the batches while a 3D mask allows to specify a different mask for the entries of each batch.

        Shape:
            - Inputs:
            - query: :math:`(L, N, E)` where L is the target sequence length, N is the batch size, E is
              the embedding dimension.
            - key: :math:`(S, N, E)`, where S is the source sequence length, N is the batch size, E is
              the embedding dimension.
            - value: :math:`(S, N, E)` where S is the source sequence length, N is the batch size, E is
              the embedding dimension.
            - key_padding_mask: :math:`(N, S)` where N is the batch size, S is the source sequence length.
              If a ByteTensor is provided, the non-zero positions will be ignored while the position
              with the zero positions will be unchanged. If a BoolTensor is provided, the positions with the
              value of ``True`` will be ignored while the position with the value of ``False`` will be unchanged.
            - attn_mask: 2D mask :math:`(L, S)` where L is the target sequence length, S is the source sequence length.
              3D mask :math:`(N*num_heads, L, S)` where N is the batch size, L is the target sequence length,
              S is the source sequence length. attn_mask ensure that position i is allowed to attend the unmasked
              positions. If a ByteTensor is provided, the non-zero positions are not allowed to attend
              while the zero positions will be unchanged. If a BoolTensor is provided, positions with ``True``
              is not allowed to attend while ``False`` values will be unchanged. If a FloatTensor
              is provided, it will be added to the attention weight.

            - Outputs:
            - attn_output: :math:`(L, N, E)` where L is the target sequence length, N is the batch size,
              E is the embedding dimension.
            - attn_output_weights: :math:`(N, L, S)` where N is the batch size,
              L is the target sequence length, S is the source sequence length.
        """
        if not self._qkv_same_embed_dim:
            return multi_head_attention_forward(
                query,
                key,
                value,
                self.embed_dim,
                self.num_heads,
                self.in_proj_weight,
                self.in_proj_bias,
                self.bias_k,
                self.bias_v,
                self.add_zero_attn,
                self.dropout,
                self.out_proj.weight,
                self.out_proj.bias,
                training=self.training,
                key_padding_mask=key_padding_mask,
                need_weights=need_weights,
                attn_mask=attn_mask,
                use_separate_proj_weight=True,
                q_proj_weight=self.q_proj_weight,
                k_proj_weight=self.k_proj_weight,
                v_proj_weight=self.v_proj_weight,
                attention_probs_forward_hook=attention_probs_forward_hook,
                attention_probs_backwards_hook=attention_probs_backwards_hook,
            )
        else:
            return multi_head_attention_forward(
                query,
                key,
                value,
                self.embed_dim,
                self.num_heads,
                self.in_proj_weight,
                self.in_proj_bias,
                self.bias_k,
                self.bias_v,
                self.add_zero_attn,
                self.dropout,
                self.out_proj.weight,
                self.out_proj.bias,
                training=self.training,
                key_padding_mask=key_padding_mask,
                need_weights=need_weights,
                attn_mask=attn_mask,
                attention_probs_forward_hook=attention_probs_forward_hook,
                attention_probs_backwards_hook=attention_probs_backwards_hook,
            )
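The only functional change in this file relative to PyTorch's stock multi-head attention is the pair of `attention_probs_forward_hook` / `attention_probs_backwards_hook` arguments, which hand the post-softmax attention weights and their gradients back to the caller. A minimal sketch with toy dimensions (the dimensions and closures are illustrative, not part of the commit) showing how both tensors are captured:

```python
import torch
from CLIP_Explainability.auxilary import MultiheadAttention  # assumed import path

# Toy module: 8-dim embeddings, 2 heads, a 4-token sequence, batch size 1.
attn = MultiheadAttention(embed_dim=8, num_heads=2)

captured = {}

def save_probs(probs):
    # called inside multi_head_attention_forward with the post-softmax weights
    captured["probs"] = probs

def save_grad(grad):
    # registered as a backward hook on the same tensor
    captured["grad"] = grad

x = torch.randn(4, 1, 8, requires_grad=True)  # (seq_len, batch, embed_dim)
out, _ = attn(
    x, x, x,
    attention_probs_forward_hook=save_probs,
    attention_probs_backwards_hook=save_grad,
)
out.sum().backward()

print(captured["probs"].shape)  # (batch * num_heads, seq_len, seq_len)
print(captured["grad"].shape)   # same shape: gradient w.r.t. the attention weights
```

This is the mechanism `ResidualAttentionBlock` in model.py relies on when it stores `attn_probs` and `attn_grad` for the saliency methods.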
CLIP_Explainability/bpe_simple_vocab_16e6.txt.gz
ADDED
@@ -0,0 +1,3 @@
version https://git-lfs.github.com/spec/v1
oid sha256:924691ac288e54409236115652ad4aa250f48203de50a9e4722a6ecd48d6804a
size 1356917
CLIP_Explainability/clip_.py
ADDED
@@ -0,0 +1,305 @@
"""
taken from https://github.com/hila-chefer/Transformer-MM-Explainability
added similarity_score
"""

import hashlib
import os
import urllib
import warnings
from typing import Union, List
import re
import html

import torch
from PIL import Image
from torchvision.transforms import Compose, Resize, CenterCrop, ToTensor, Normalize
from tqdm import tqdm
import ftfy

from transformers import BatchFeature

from .model import build_model
from .simple_tokenizer import SimpleTokenizer as _Tokenizer

__all__ = ["available_models", "load", "tokenize"]
_tokenizer = _Tokenizer()

_MODELS = {
    "RN50": "https://openaipublic.azureedge.net/clip/models/afeb0e10f9e5a86da6080e35cf09123aca3b358a0c3e3b6c78a7b63bc04b6762/RN50.pt",
    "RN101": "https://openaipublic.azureedge.net/clip/models/8fa8567bab74a42d41c5915025a8e4538c3bdbe8804a470a72f30b0d94fab599/RN101.pt",
    "RN50x4": "https://openaipublic.azureedge.net/clip/models/7e526bd135e493cef0776de27d5f42653e6b4c8bf9e0f653bb11773263205fdd/RN50x4.pt",
    "ViT-B/32": "https://openaipublic.azureedge.net/clip/models/40d365715913c9da98579312b702a82c18be219cc2a73407c4526f58eba950af/ViT-B-32.pt",
}


def _download(url: str, root: str = os.path.expanduser("~/.cache/clip")):
    os.makedirs(root, exist_ok=True)
    filename = os.path.basename(url)

    expected_sha256 = url.split("/")[-2]
    download_target = os.path.join(root, filename)

    if os.path.exists(download_target) and not os.path.isfile(download_target):
        raise RuntimeError(f"{download_target} exists and is not a regular file")

    if os.path.isfile(download_target):
        if (
            hashlib.sha256(open(download_target, "rb").read()).hexdigest()
            == expected_sha256
        ):
            return download_target
        else:
            warnings.warn(
                f"{download_target} exists, but the SHA256 checksum does not match; re-downloading the file"
            )

    with urllib.request.urlopen(url) as source, open(download_target, "wb") as output:
        with tqdm(
            total=int(source.info().get("Content-Length")),
            ncols=80,
            unit="iB",
            unit_scale=True,
        ) as loop:
            while True:
                buffer = source.read(8192)
                if not buffer:
                    break

                output.write(buffer)
                loop.update(len(buffer))

    if (
        hashlib.sha256(open(download_target, "rb").read()).hexdigest()
        != expected_sha256
    ):
        raise RuntimeError(
            f"Model has been downloaded but the SHA256 checksum does not match"
        )

    return download_target


def _transform(n_px):
    return Compose(
        [
            Resize(n_px, interpolation=Image.BICUBIC),
            CenterCrop(n_px),
            lambda image: image.convert("RGB"),
            ToTensor(),
            Normalize(
                (0.48145466, 0.4578275, 0.40821073),
                (0.26862954, 0.26130258, 0.27577711),
            ),
        ]
    )


def available_models() -> List[str]:
    """Returns the names of available CLIP models"""
    return list(_MODELS.keys())


def load(
    name: str,
    device: Union[str, torch.device] = "cuda" if torch.cuda.is_available() else "cpu",
    jit=True,
):
    """Load a CLIP model

    Parameters
    ----------
    name : str
        A model name listed by `clip.available_models()`, or the path to a model checkpoint containing the state_dict

    device : Union[str, torch.device]
        The device to put the loaded model

    jit : bool
        Whether to load the optimized JIT model (default) or the more hackable non-JIT model.

    Returns
    -------
    model : torch.nn.Module
        The CLIP model

    preprocess : Callable[[PIL.Image], torch.Tensor]
        A torchvision transform that converts a PIL image into a tensor that the returned model can take as its input
    """
    if name in _MODELS:
        model_path = _download(_MODELS[name])
    elif os.path.isfile(name):
        model_path = name
    else:
        raise RuntimeError(
            f"Model {name} not found; available models = {available_models()}"
        )

    try:
        # loading JIT archive
        model = torch.jit.load(model_path, map_location=device if jit else "cpu").eval()
        state_dict = None
    except RuntimeError:
        # loading saved state dict
        if jit:
            warnings.warn(
                f"File {model_path} is not a JIT archive. Loading as a state dict instead"
            )
            jit = False
        state_dict = torch.load(model_path, map_location="cpu")

    if not jit:
        model = build_model(state_dict or model.state_dict()).to(device)
        if str(device) == "cpu":
            model.float()
        return model, _transform(model.visual.input_resolution)

    # patch the device names
    device_holder = torch.jit.trace(
        lambda: torch.ones([]).to(torch.device(device)), example_inputs=[]
    )
    device_node = [
        n
        for n in device_holder.graph.findAllNodes("prim::Constant")
        if "Device" in repr(n)
    ][-1]

    def patch_device(module):
        graphs = [module.graph] if hasattr(module, "graph") else []
        if hasattr(module, "forward1"):
            graphs.append(module.forward1.graph)

        for graph in graphs:
            for node in graph.findAllNodes("prim::Constant"):
                if "value" in node.attributeNames() and str(node["value"]).startswith(
                    "cuda"
                ):
                    node.copyAttributes(device_node)

    model.apply(patch_device)
    patch_device(model.encode_image)
    patch_device(model.encode_text)

    # patch dtype to float32 on CPU
    if str(device) == "cpu":
        float_holder = torch.jit.trace(
            lambda: torch.ones([]).float(), example_inputs=[]
        )
        float_input = list(float_holder.graph.findNode("aten::to").inputs())[1]
        float_node = float_input.node()

        def patch_float(module):
            graphs = [module.graph] if hasattr(module, "graph") else []
            if hasattr(module, "forward1"):
                graphs.append(module.forward1.graph)

            for graph in graphs:
                for node in graph.findAllNodes("aten::to"):
                    inputs = list(node.inputs())
                    for i in [
                        1,
                        2,
                    ]:  # dtype can be the second or third argument to aten::to()
                        if inputs[i].node()["value"] == 5:
                            inputs[i].node().copyAttributes(float_node)

        model.apply(patch_float)
        patch_float(model.encode_image)
        patch_float(model.encode_text)

        model.float()

    return model, _transform(model.input_resolution.item())


def tokenize(
    texts: Union[str, List[str]], context_length: int = 77
) -> torch.LongTensor:
    """
    Returns the tokenized representation of given input string(s)

    Parameters
    ----------
    texts : Union[str, List[str]]
        An input string or a list of input strings to tokenize

    context_length : int
        The context length to use; all CLIP models use 77 as the context length

    Returns
    -------
    A two-dimensional tensor containing the resulting tokens, shape = [number of input strings, context_length]
    """
    if isinstance(texts, str):
        texts = [texts]

    sot_token = _tokenizer.encoder["<|startoftext|>"]
    eot_token = _tokenizer.encoder["<|endoftext|>"]
    all_tokens = [[sot_token] + _tokenizer.encode(text) + [eot_token] for text in texts]
    result = torch.zeros(len(all_tokens), context_length, dtype=torch.long)

    for i, tokens in enumerate(all_tokens):
        if len(tokens) > context_length:
            raise RuntimeError(
                f"Input {texts[i]} is too long for context length {context_length}"
            )
        result[i, : len(tokens)] = torch.tensor(tokens)

    return result


def basic_clean(text):
    text = ftfy.fix_text(text)
    text = html.unescape(html.unescape(text))
    return text.strip()


def whitespace_clean(text):
    text = re.sub(r"\s+", " ", text)
    text = text.strip()
    return text


def tokenize_ja(
    tokenizer,
    texts: Union[str, List[str]],
    max_seq_len: int = 77,
):
    """
    This mirrors the tokenize function in the original CLIP code:
    https://github.com/openai/CLIP/blob/main/clip/clip.py#L195
    """
    if isinstance(texts, str):
        texts = [texts]
    texts = [whitespace_clean(basic_clean(text)) for text in texts]

    inputs = tokenizer(
        texts,
        max_length=max_seq_len - 1,
        padding="max_length",
        truncation=True,
        add_special_tokens=False,
    )
    # add bos token at first place
    input_ids = [[tokenizer.bos_token_id] + ids for ids in inputs["input_ids"]]
    attention_mask = [[1] + am for am in inputs["attention_mask"]]
    position_ids = [list(range(0, len(input_ids[0])))] * len(texts)

    return BatchFeature(
        {
            "input_ids": torch.tensor(input_ids, dtype=torch.long),
            "attention_mask": torch.tensor(attention_mask, dtype=torch.long),
            "position_ids": torch.tensor(position_ids, dtype=torch.long),
        }
    )


def similarity_score(clip_model, image, target_features):
    image_features = clip_model.encode_image(image)

    image_features_norm = image_features.norm(dim=-1, keepdim=True)
    image_features_new = image_features / image_features_norm
    target_features_norm = target_features.norm(dim=-1, keepdim=True)
    target_features_new = target_features / target_features_norm

    return image_features_new[0].dot(target_features_new[0]) * 100
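The `similarity_score` helper added above scores one image against a pre-computed target embedding, which is how the project compares an image to targets from the text, image, or emotion domains. A minimal, self-contained sketch; the import path, image path, and caption are placeholder assumptions:

```python
import torch
from PIL import Image
from CLIP_Explainability import clip_  # assumed import path

device = "cuda" if torch.cuda.is_available() else "cpu"
model, preprocess = clip_.load("ViT-B/32", device=device, jit=False)

image = preprocess(Image.open("Images/example.jpg")).unsqueeze(0).to(device)  # placeholder path

with torch.no_grad():
    # Target from the text domain; an image target would use model.encode_image instead.
    target_features = model.encode_text(clip_.tokenize(["a photo of a dog"]).to(device))

# The image branch stays differentiable here, which is what the saliency
# methods rely on when they backpropagate this score to the attention maps.
score = clip_.similarity_score(model, image, target_features)
print(f"cosine similarity x 100: {score.item():.2f}")
```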
CLIP_Explainability/image_utils.py
ADDED
@@ -0,0 +1,22 @@
import numpy as np
import cv2


def show_cam_on_image(img, mask, neg_saliency=False):

    heatmap = cv2.applyColorMap(np.uint8(255 * mask), cv2.COLORMAP_JET)

    heatmap = np.float32(heatmap) / 255
    cam = heatmap + np.float32(img)
    cam = cam / np.max(cam)
    return cam


def show_overlapped_cam(img, neg_mask, pos_mask):
    neg_heatmap = cv2.applyColorMap(np.uint8(255 * neg_mask), cv2.COLORMAP_RAINBOW)
    pos_heatmap = cv2.applyColorMap(np.uint8(255 * pos_mask), cv2.COLORMAP_JET)
    neg_heatmap = np.float32(neg_heatmap) / 255
    pos_heatmap = np.float32(pos_heatmap) / 255
    # try different options: sum, average, ...
    heatmap = neg_heatmap + pos_heatmap
    cam = heatmap + np.float32(img)
    cam = cam / np.max(cam)
    return cam
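`show_cam_on_image` expects a float image in [0, 1] and a saliency mask of the same spatial size, and returns the blended overlay, also scaled to [0, 1]. A minimal usage sketch; the file paths and the random stand-in mask are assumptions (in the notebooks the mask comes from the ViT/ResNet CAM code, which is not part of this commit):

```python
import numpy as np
import cv2
from CLIP_Explainability.image_utils import show_cam_on_image  # assumed import path

img = cv2.imread("Images/example.jpg").astype(np.float32) / 255.0  # placeholder path
img = cv2.resize(img, (224, 224))
mask = np.random.rand(224, 224).astype(np.float32)  # stand-in for a real saliency map

overlay = show_cam_on_image(img, mask)
cv2.imwrite("results/example_cam.jpg", np.uint8(255 * overlay))  # assumes results/ exists
```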
CLIP_Explainability/model.py
ADDED
@@ -0,0 +1,446 @@
"""
taken from https://github.com/hila-chefer/Transformer-MM-Explainability
"""

from collections import OrderedDict
from typing import Tuple, Union

import numpy as np
import torch
import torch.nn.functional as F
from torch import nn
from .auxilary import *


class Bottleneck(nn.Module):
    expansion = 4

    def __init__(self, inplanes, planes, stride=1):
        super().__init__()

        # all conv layers have stride 1. an avgpool is performed after the second convolution when stride > 1
        self.conv1 = nn.Conv2d(inplanes, planes, 1, bias=False)
        self.bn1 = nn.BatchNorm2d(planes)

        self.conv2 = nn.Conv2d(planes, planes, 3, padding=1, bias=False)
        self.bn2 = nn.BatchNorm2d(planes)

        self.avgpool = nn.AvgPool2d(stride) if stride > 1 else nn.Identity()

        self.conv3 = nn.Conv2d(planes, planes * self.expansion, 1, bias=False)
        self.bn3 = nn.BatchNorm2d(planes * self.expansion)

        self.relu = nn.ReLU(inplace=True)
        self.downsample = None
        self.stride = stride

        if stride > 1 or inplanes != planes * Bottleneck.expansion:
            # downsampling layer is prepended with an avgpool, and the subsequent convolution has stride 1
            self.downsample = nn.Sequential(OrderedDict([
                ("-1", nn.AvgPool2d(stride)),
                ("0", nn.Conv2d(inplanes, planes * self.expansion, 1, stride=1, bias=False)),
                ("1", nn.BatchNorm2d(planes * self.expansion))
            ]))

    def forward(self, x: torch.Tensor):
        identity = x

        out = self.relu(self.bn1(self.conv1(x)))
        out = self.relu(self.bn2(self.conv2(out)))
        out = self.avgpool(out)
        out = self.bn3(self.conv3(out))

        if self.downsample is not None:
            identity = self.downsample(x)

        out += identity
        out = self.relu(out)
        return out


class AttentionPool2d(nn.Module):
    def __init__(self, spacial_dim: int, embed_dim: int, num_heads: int, output_dim: int = None):
        super().__init__()
        self.positional_embedding = nn.Parameter(torch.randn(spacial_dim ** 2 + 1, embed_dim) / embed_dim ** 0.5)
        self.k_proj = nn.Linear(embed_dim, embed_dim)
        self.q_proj = nn.Linear(embed_dim, embed_dim)
        self.v_proj = nn.Linear(embed_dim, embed_dim)
        self.c_proj = nn.Linear(embed_dim, output_dim or embed_dim)
        self.num_heads = num_heads

    def forward(self, x):
        x = x.reshape(x.shape[0], x.shape[1], x.shape[2] * x.shape[3]).permute(2, 0, 1)  # NCHW -> (HW)NC
        x = torch.cat([x.mean(dim=0, keepdim=True), x], dim=0)  # (HW+1)NC
        x = x + self.positional_embedding[:, None, :].to(x.dtype)  # (HW+1)NC
        x, _ = multi_head_attention_forward(
            query=x, key=x, value=x,
            embed_dim_to_check=x.shape[-1],
            num_heads=self.num_heads,
            q_proj_weight=self.q_proj.weight,
            k_proj_weight=self.k_proj.weight,
            v_proj_weight=self.v_proj.weight,
            in_proj_weight=None,
            in_proj_bias=torch.cat([self.q_proj.bias, self.k_proj.bias, self.v_proj.bias]),
            bias_k=None,
            bias_v=None,
            add_zero_attn=False,
            dropout_p=0,
            out_proj_weight=self.c_proj.weight,
            out_proj_bias=self.c_proj.bias,
            use_separate_proj_weight=True,
            training=self.training,
            need_weights=False
        )

        return x[0]


class ModifiedResNet(nn.Module):
    """
    A ResNet class that is similar to torchvision's but contains the following changes:
    - There are now 3 "stem" convolutions as opposed to 1, with an average pool instead of a max pool.
    - Performs anti-aliasing strided convolutions, where an avgpool is prepended to convolutions with stride > 1
    - The final pooling layer is a QKV attention instead of an average pool
    """

    def __init__(self, layers, output_dim, heads, input_resolution=224, width=64):
        super().__init__()
        self.output_dim = output_dim
        self.input_resolution = input_resolution

        # the 3-layer stem
        self.conv1 = nn.Conv2d(3, width // 2, kernel_size=3, stride=2, padding=1, bias=False)
        self.bn1 = nn.BatchNorm2d(width // 2)
        self.conv2 = nn.Conv2d(width // 2, width // 2, kernel_size=3, padding=1, bias=False)
        self.bn2 = nn.BatchNorm2d(width // 2)
        self.conv3 = nn.Conv2d(width // 2, width, kernel_size=3, padding=1, bias=False)
        self.bn3 = nn.BatchNorm2d(width)
        self.avgpool = nn.AvgPool2d(2)
        self.relu = nn.ReLU(inplace=True)

        # residual layers
        self._inplanes = width  # this is a *mutable* variable used during construction
        self.layer1 = self._make_layer(width, layers[0])
        self.layer2 = self._make_layer(width * 2, layers[1], stride=2)
        self.layer3 = self._make_layer(width * 4, layers[2], stride=2)
        self.layer4 = self._make_layer(width * 8, layers[3], stride=2)

        embed_dim = width * 32  # the ResNet feature dimension
        self.attnpool = AttentionPool2d(input_resolution // 32, embed_dim, heads, output_dim)

    def _make_layer(self, planes, blocks, stride=1):
        layers = [Bottleneck(self._inplanes, planes, stride)]

        self._inplanes = planes * Bottleneck.expansion
        for _ in range(1, blocks):
            layers.append(Bottleneck(self._inplanes, planes))

        return nn.Sequential(*layers)

    def forward(self, x):
        def stem(x):
            for conv, bn in [(self.conv1, self.bn1), (self.conv2, self.bn2), (self.conv3, self.bn3)]:
                x = self.relu(bn(conv(x)))
            x = self.avgpool(x)
            return x

        x = x.type(self.conv1.weight.dtype)
        x = stem(x)
        x = self.layer1(x)
        x = self.layer2(x)
        x = self.layer3(x)
        x = self.layer4(x)
        x = self.attnpool(x)

        return x


class LayerNorm(nn.LayerNorm):
    """Subclass torch's LayerNorm to handle fp16."""

    def forward(self, x: torch.Tensor):
        orig_type = x.dtype
        ret = super().forward(x.type(torch.float32))
        return ret.type(orig_type)


class QuickGELU(nn.Module):
    def forward(self, x: torch.Tensor):
        return x * torch.sigmoid(1.702 * x)


class ResidualAttentionBlock(nn.Module):
    def __init__(self, d_model: int, n_head: int, attn_mask: torch.Tensor = None):
        super().__init__()

        self.attn = MultiheadAttention(d_model, n_head)
        self.ln_1 = LayerNorm(d_model)
        self.mlp = nn.Sequential(OrderedDict([
            ("c_fc", nn.Linear(d_model, d_model * 4)),
            ("gelu", QuickGELU()),
            ("c_proj", nn.Linear(d_model * 4, d_model))
        ]))
        self.ln_2 = LayerNorm(d_model)
        self.attn_mask = attn_mask

        self.attn_probs = None
        self.attn_grad = None

    def set_attn_probs(self, attn_probs):
        self.attn_probs = attn_probs

    def set_attn_grad(self, attn_grad):
        self.attn_grad = attn_grad

    def attention(self, x: torch.Tensor):
        self.attn_mask = self.attn_mask.to(dtype=x.dtype, device=x.device) if self.attn_mask is not None else None
        return self.attn(x, x, x, need_weights=False, attn_mask=self.attn_mask,
                         attention_probs_forward_hook=self.set_attn_probs,
                         attention_probs_backwards_hook=self.set_attn_grad)[0]

    def forward(self, x: torch.Tensor):
        x = x + self.attention(self.ln_1(x))
        x = x + self.mlp(self.ln_2(x))
        return x


class Transformer(nn.Module):
    def __init__(self, width: int, layers: int, heads: int, attn_mask: torch.Tensor = None):
        super().__init__()
        self.width = width
        self.layers = layers
        self.resblocks = nn.Sequential(*[ResidualAttentionBlock(width, heads, attn_mask) for _ in range(layers)])

    def forward(self, x: torch.Tensor):
        return self.resblocks(x)


class VisualTransformer(nn.Module):
    def __init__(self, input_resolution: int, patch_size: int, width: int, layers: int, heads: int, output_dim: int):
        super().__init__()
        self.input_resolution = input_resolution
        self.output_dim = output_dim
        self.conv1 = nn.Conv2d(in_channels=3, out_channels=width, kernel_size=patch_size, stride=patch_size, bias=False)

        scale = width ** -0.5
        self.class_embedding = nn.Parameter(scale * torch.randn(width))
        self.positional_embedding = nn.Parameter(scale * torch.randn((input_resolution // patch_size) ** 2 + 1, width))
        self.ln_pre = LayerNorm(width)

        self.transformer = Transformer(width, layers, heads)

        self.ln_post = LayerNorm(width)
        self.proj = nn.Parameter(scale * torch.randn(width, output_dim))

    def forward(self, x: torch.Tensor):
        x = self.conv1(x)  # shape = [*, width, grid, grid]
        x = x.reshape(x.shape[0], x.shape[1], -1)  # shape = [*, width, grid ** 2]
        x = x.permute(0, 2, 1)  # shape = [*, grid ** 2, width]
        x = torch.cat([self.class_embedding.to(x.dtype) + torch.zeros(x.shape[0], 1, x.shape[-1], dtype=x.dtype, device=x.device), x], dim=1)  # shape = [*, grid ** 2 + 1, width]
        x = x + self.positional_embedding.to(x.dtype)
        x = self.ln_pre(x)

        x = x.permute(1, 0, 2)  # NLD -> LND
        x = self.transformer(x)
        x = x.permute(1, 0, 2)  # LND -> NLD

        x = self.ln_post(x[:, 0, :])

        if self.proj is not None:
            x = x @ self.proj

        return x


class CLIP(nn.Module):
    def __init__(self,
                 embed_dim: int,
                 # vision
                 image_resolution: int,
                 vision_layers: Union[Tuple[int, int, int, int], int],
                 vision_width: int,
                 vision_patch_size: int,
                 # text
                 context_length: int,
                 vocab_size: int,
                 transformer_width: int,
                 transformer_heads: int,
                 transformer_layers: int
                 ):
        super().__init__()

        self.context_length = context_length

        if isinstance(vision_layers, (tuple, list)):
            vision_heads = vision_width * 32 // 64
            self.visual = ModifiedResNet(
                layers=vision_layers,
                output_dim=embed_dim,
                heads=vision_heads,
                input_resolution=image_resolution,
                width=vision_width
            )
        else:
            vision_heads = vision_width // 64
            self.visual = VisualTransformer(
                input_resolution=image_resolution,
                patch_size=vision_patch_size,
                width=vision_width,
                layers=vision_layers,
                heads=vision_heads,
                output_dim=embed_dim
            )

        self.transformer = Transformer(
            width=transformer_width,
            layers=transformer_layers,
            heads=transformer_heads,
            attn_mask=self.build_attention_mask()
        )

        self.vocab_size = vocab_size
        self.token_embedding = nn.Embedding(vocab_size, transformer_width)
        self.positional_embedding = nn.Parameter(torch.empty(self.context_length, transformer_width))
        self.ln_final = LayerNorm(transformer_width)

        self.text_projection = nn.Parameter(torch.empty(transformer_width, embed_dim))
        self.logit_scale = nn.Parameter(torch.ones([]) * np.log(1 / 0.07))

        self.initialize_parameters()

    def initialize_parameters(self):
        nn.init.normal_(self.token_embedding.weight, std=0.02)
        nn.init.normal_(self.positional_embedding, std=0.01)

        if isinstance(self.visual, ModifiedResNet):
            if self.visual.attnpool is not None:
                std = self.visual.attnpool.c_proj.in_features ** -0.5
                nn.init.normal_(self.visual.attnpool.q_proj.weight, std=std)
                nn.init.normal_(self.visual.attnpool.k_proj.weight, std=std)
                nn.init.normal_(self.visual.attnpool.v_proj.weight, std=std)
                nn.init.normal_(self.visual.attnpool.c_proj.weight, std=std)

            for resnet_block in [self.visual.layer1, self.visual.layer2, self.visual.layer3, self.visual.layer4]:
                for name, param in resnet_block.named_parameters():
                    if name.endswith("bn3.weight"):
                        nn.init.zeros_(param)

        proj_std = (self.transformer.width ** -0.5) * ((2 * self.transformer.layers) ** -0.5)
        attn_std = self.transformer.width ** -0.5
        fc_std = (2 * self.transformer.width) ** -0.5
        for block in self.transformer.resblocks:
            nn.init.normal_(block.attn.in_proj_weight, std=attn_std)
            nn.init.normal_(block.attn.out_proj.weight, std=proj_std)
            nn.init.normal_(block.mlp.c_fc.weight, std=fc_std)
            nn.init.normal_(block.mlp.c_proj.weight, std=proj_std)

        if self.text_projection is not None:
            nn.init.normal_(self.text_projection, std=self.transformer.width ** -0.5)

    def build_attention_mask(self):
        # lazily create causal attention mask, with full attention between the vision tokens
        # pytorch uses additive attention mask; fill with -inf
        mask = torch.empty(self.context_length, self.context_length)
        mask.fill_(float("-inf"))
        mask.triu_(1)  # zero out the lower diagonal
        return mask

    @property
    def dtype(self):
        return self.visual.conv1.weight.dtype

    def encode_image(self, image):
        return self.visual(image.type(self.dtype))

    def encode_text(self, text):
        x = self.token_embedding(text).type(self.dtype)  # [batch_size, n_ctx, d_model]

        x = x + self.positional_embedding.type(self.dtype)
        x = x.permute(1, 0, 2)  # NLD -> LND
        x = self.transformer(x)
        x = x.permute(1, 0, 2)  # LND -> NLD
        x = self.ln_final(x).type(self.dtype)

        # x.shape = [batch_size, n_ctx, transformer.width]
        # take features from the eot embedding (eot_token is the highest number in each sequence)
        x = x[torch.arange(x.shape[0]), text.argmax(dim=-1)] @ self.text_projection

        return x

    def forward(self, image, text):
        image_features = self.encode_image(image)
        text_features = self.encode_text(text)

        # normalized features
        image_features = image_features / image_features.norm(dim=-1, keepdim=True)
        text_features = text_features / text_features.norm(dim=-1, keepdim=True)

        # cosine similarity as logits
        logit_scale = self.logit_scale.exp()
        logits_per_image = logit_scale * image_features @ text_features.t()
        logits_per_text = logit_scale * text_features @ image_features.t()

        # shape = [global_batch_size, global_batch_size]
        return logits_per_image, logits_per_text


def convert_weights(model: nn.Module):
    """Convert applicable model parameters to fp16"""

    def _convert_weights_to_fp16(l):
        if isinstance(l, (nn.Conv1d, nn.Conv2d, nn.Linear)):
            l.weight.data = l.weight.data.half()
            if l.bias is not None:
                l.bias.data = l.bias.data.half()

        if isinstance(l, MultiheadAttention):
            for attr in [*[f"{s}_proj_weight" for s in ["in", "q", "k", "v"]], "in_proj_bias", "bias_k", "bias_v"]:
                tensor = getattr(l, attr)
                if tensor is not None:
                    tensor.data = tensor.data.half()

        for name in ["text_projection", "proj"]:
            if hasattr(l, name):
                attr = getattr(l, name)
                if attr is not None:
                    attr.data = attr.data.half()

    model.apply(_convert_weights_to_fp16)


def build_model(state_dict: dict):
    vit = "visual.proj" in state_dict

    if vit:
        vision_width = state_dict["visual.conv1.weight"].shape[0]
        vision_layers = len([k for k in state_dict.keys() if k.startswith("visual.") and k.endswith(".attn.in_proj_weight")])
        vision_patch_size = state_dict["visual.conv1.weight"].shape[-1]
        grid_size = round((state_dict["visual.positional_embedding"].shape[0] - 1) ** 0.5)
        image_resolution = vision_patch_size * grid_size
    else:
        counts: list = [len(set(k.split(".")[2] for k in state_dict if k.startswith(f"visual.layer{b}"))) for b in [1, 2, 3, 4]]
        vision_layers = tuple(counts)
        vision_width = state_dict["visual.layer1.0.conv1.weight"].shape[0]
|
422 |
+
output_width = round((state_dict["visual.attnpool.positional_embedding"].shape[0] - 1) ** 0.5)
|
423 |
+
vision_patch_size = None
|
424 |
+
assert output_width ** 2 + 1 == state_dict["visual.attnpool.positional_embedding"].shape[0]
|
425 |
+
image_resolution = output_width * 32
|
426 |
+
|
427 |
+
embed_dim = state_dict["text_projection"].shape[1]
|
428 |
+
context_length = state_dict["positional_embedding"].shape[0]
|
429 |
+
vocab_size = state_dict["token_embedding.weight"].shape[0]
|
430 |
+
transformer_width = state_dict["ln_final.weight"].shape[0]
|
431 |
+
transformer_heads = transformer_width // 64
|
432 |
+
transformer_layers = len(set(k.split(".")[2] for k in state_dict if k.startswith(f"transformer.resblocks")))
|
433 |
+
|
434 |
+
model = CLIP(
|
435 |
+
embed_dim,
|
436 |
+
image_resolution, vision_layers, vision_width, vision_patch_size,
|
437 |
+
context_length, vocab_size, transformer_width, transformer_heads, transformer_layers
|
438 |
+
)
|
439 |
+
|
440 |
+
for key in ["input_resolution", "context_length", "vocab_size"]:
|
441 |
+
if key in state_dict:
|
442 |
+
del state_dict[key]
|
443 |
+
|
444 |
+
convert_weights(model)
|
445 |
+
model.load_state_dict(state_dict)
|
446 |
+
return model.eval()
|
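For orientation, this is roughly how the `build_model` entry point above is driven; a minimal sketch, assuming the checkpoint that app.py points at is available locally (in the app itself, `clip_.load` performs these steps and also returns the matching preprocessing transform):

```python
import torch

from CLIP_Explainability.model import build_model

# Checkpoint path taken from app.py; adjust to wherever the weights actually live.
checkpoint = "./models/vit_b_16_plus_240-laion400m_e32-699c4b84.pt"
try:
    # OpenAI-style checkpoints ship as TorchScript archives
    state_dict = torch.jit.load(checkpoint, map_location="cpu").state_dict()
except RuntimeError:
    # otherwise assume a plain torch.save()d state dict
    state_dict = torch.load(checkpoint, map_location="cpu")
    state_dict = state_dict.get("state_dict", state_dict)  # some checkpoints nest the weights

model = build_model(state_dict)  # infers ViT vs. ResNet hyperparameters from the keys
model.float()                    # convert_weights() casts to fp16; use fp32 on CPU

with torch.no_grad():
    dummy = torch.randn(1, 3, 240, 240)         # 240 px matches the ViT-B/16+ 240 checkpoint
    image_features = model.encode_image(dummy)  # shape [1, embed_dim]
```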
CLIP_Explainability/simple_tokenizer.py
ADDED
@@ -0,0 +1,136 @@
"""
taken from https://github.com/hila-chefer/Transformer-MM-Explainability
"""

import gzip
import html
import os
from functools import lru_cache

import ftfy
import regex as re


@lru_cache()
def default_bpe():
    return os.path.join(os.path.dirname(os.path.abspath(__file__)), "bpe_simple_vocab_16e6.txt.gz")


@lru_cache()
def bytes_to_unicode():
    """
    Returns list of utf-8 byte and a corresponding list of unicode strings.
    The reversible bpe codes work on unicode strings.
    This means you need a large # of unicode characters in your vocab if you want to avoid UNKs.
    When you're at something like a 10B token dataset you end up needing around 5K for decent coverage.
    This is a significant percentage of your normal, say, 32K bpe vocab.
    To avoid that, we want lookup tables between utf-8 bytes and unicode strings.
    And avoids mapping to whitespace/control characters the bpe code barfs on.
    """
    bs = list(range(ord("!"), ord("~")+1))+list(range(ord("¡"), ord("¬")+1))+list(range(ord("®"), ord("ÿ")+1))
    cs = bs[:]
    n = 0
    for b in range(2**8):
        if b not in bs:
            bs.append(b)
            cs.append(2**8+n)
            n += 1
    cs = [chr(n) for n in cs]
    return dict(zip(bs, cs))


def get_pairs(word):
    """Return set of symbol pairs in a word.
    Word is represented as tuple of symbols (symbols being variable-length strings).
    """
    pairs = set()
    prev_char = word[0]
    for char in word[1:]:
        pairs.add((prev_char, char))
        prev_char = char
    return pairs


def basic_clean(text):
    text = ftfy.fix_text(text)
    text = html.unescape(html.unescape(text))
    return text.strip()


def whitespace_clean(text):
    text = re.sub(r'\s+', ' ', text)
    text = text.strip()
    return text


class SimpleTokenizer(object):
    def __init__(self, bpe_path: str = default_bpe()):
        self.byte_encoder = bytes_to_unicode()
        self.byte_decoder = {v: k for k, v in self.byte_encoder.items()}
        merges = gzip.open(bpe_path).read().decode("utf-8").split('\n')
        merges = merges[1:49152-256-2+1]
        merges = [tuple(merge.split()) for merge in merges]
        vocab = list(bytes_to_unicode().values())
        vocab = vocab + [v+'</w>' for v in vocab]
        for merge in merges:
            vocab.append(''.join(merge))
        vocab.extend(['<|startoftext|>', '<|endoftext|>'])
        self.encoder = dict(zip(vocab, range(len(vocab))))
        self.decoder = {v: k for k, v in self.encoder.items()}
        self.bpe_ranks = dict(zip(merges, range(len(merges))))
        self.cache = {'<|startoftext|>': '<|startoftext|>', '<|endoftext|>': '<|endoftext|>'}
        self.pat = re.compile(r"""<\|startoftext\|>|<\|endoftext\|>|'s|'t|'re|'ve|'m|'ll|'d|[\p{L}]+|[\p{N}]|[^\s\p{L}\p{N}]+""", re.IGNORECASE)

    def bpe(self, token):
        if token in self.cache:
            return self.cache[token]
        word = tuple(token[:-1]) + ( token[-1] + '</w>',)
        pairs = get_pairs(word)

        if not pairs:
            return token+'</w>'

        while True:
            bigram = min(pairs, key = lambda pair: self.bpe_ranks.get(pair, float('inf')))
            if bigram not in self.bpe_ranks:
                break
            first, second = bigram
            new_word = []
            i = 0
            while i < len(word):
                try:
                    j = word.index(first, i)
                    new_word.extend(word[i:j])
                    i = j
                except:
                    new_word.extend(word[i:])
                    break

                if word[i] == first and i < len(word)-1 and word[i+1] == second:
                    new_word.append(first+second)
                    i += 2
                else:
                    new_word.append(word[i])
                    i += 1
            new_word = tuple(new_word)
            word = new_word
            if len(word) == 1:
                break
            else:
                pairs = get_pairs(word)
        word = ' '.join(word)
        self.cache[token] = word
        return word

    def encode(self, text):
        bpe_tokens = []
        text = whitespace_clean(basic_clean(text)).lower()
        for token in re.findall(self.pat, text):
            token = ''.join(self.byte_encoder[b] for b in token.encode('utf-8'))
            bpe_tokens.extend(self.encoder[bpe_token] for bpe_token in self.bpe(token).split(' '))
        return bpe_tokens

    def decode(self, tokens):
        text = ''.join([self.decoder[token] for token in tokens])
        text = bytearray([self.byte_decoder[c] for c in text]).decode('utf-8', errors="replace").replace('</w>', ' ')
        return text
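A quick round trip through the tokenizer shows what `encode`/`decode` produce; the `tokenize` helper in `clip_.py` builds on `encode` and additionally adds the start/end-of-text tokens and pads to the model's context length:

```python
from CLIP_Explainability.simple_tokenizer import SimpleTokenizer

tokenizer = SimpleTokenizer()  # reads the bundled bpe_simple_vocab_16e6.txt.gz

ids = tokenizer.encode("a painting of cherry blossoms")
print(ids)                     # raw BPE ids, without <|startoftext|>/<|endoftext|>
print(tokenizer.decode(ids))   # "a painting of cherry blossoms " (decode re-inserts word boundaries)
```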
CLIP_Explainability/vit_cam.py
ADDED
@@ -0,0 +1,325 @@
import torch
import numpy as np
from PIL import Image
import matplotlib.pyplot as plt
import cv2
import regex as re

from .image_utils import show_cam_on_image, show_overlapped_cam


def vit_block_vis(
    image,
    target_features,
    img_encoder,
    block,
    device,
    grad=False,
    neg_saliency=False,
    img_dim=224,
):
    img_encoder.eval()
    image_features = img_encoder(image)

    image_features_norm = image_features.norm(dim=-1, keepdim=True)
    image_features_new = image_features / image_features_norm
    target_features_norm = target_features.norm(dim=-1, keepdim=True)
    target_features_new = target_features / target_features_norm

    similarity = image_features_new[0].dot(target_features_new[0])
    image = (image - image.min()) / (image.max() - image.min())

    img_encoder.zero_grad()
    similarity.backward(retain_graph=True)

    image_attn_blocks = list(
        dict(img_encoder.transformer.resblocks.named_children()).values()
    )

    if grad:
        cam = image_attn_blocks[block].attn_grad.detach()
    else:
        cam = image_attn_blocks[block].attn_probs.detach()

    cam = cam.mean(dim=0)
    image_relevance = cam[0, 1:]

    resize_dim = int(np.sqrt(list(image_relevance.shape)[0]))

    # image_relevance = image_relevance.reshape(1, 1, 7, 7)
    image_relevance = image_relevance.reshape(1, 1, resize_dim, resize_dim)

    image_relevance = torch.nn.functional.interpolate(
        image_relevance, size=img_dim, mode="bilinear"
    )
    image_relevance = image_relevance.reshape(img_dim, img_dim)
    image_relevance = (image_relevance - image_relevance.min()) / (
        image_relevance.max() - image_relevance.min()
    )

    cam = image_relevance * image
    cam = cam / torch.max(cam)

    # TODO: maybe we can ignore this...
    ####
    masked_image_features = img_encoder(cam)
    masked_image_features_norm = masked_image_features.norm(dim=-1, keepdim=True)
    masked_image_features_new = masked_image_features / masked_image_features_norm
    new_score = masked_image_features_new[0].dot(target_features_new[0])
    ####

    cam = cam[0].permute(1, 2, 0).data.cpu().numpy()
    cam = np.float32(cam)

    plt.imshow(cam)

    return new_score


def vit_relevance(
    image,
    target_features,
    img_encoder,
    device,
    method="last grad",
    neg_saliency=False,
    img_dim=224,
):
    img_encoder.eval()
    image_features = img_encoder(image)

    image_features_norm = image_features.norm(dim=-1, keepdim=True)
    image_features_new = image_features / image_features_norm
    target_features_norm = target_features.norm(dim=-1, keepdim=True)
    target_features_new = target_features / target_features_norm
    similarity = image_features_new[0].dot(target_features_new[0])
    if neg_saliency:
        objective = 1 - similarity
    else:
        objective = similarity
    img_encoder.zero_grad()
    objective.backward(retain_graph=True)
    image_attn_blocks = list(
        dict(img_encoder.transformer.resblocks.named_children()).values()
    )
    num_tokens = image_attn_blocks[0].attn_probs.shape[-1]

    last_attn = image_attn_blocks[-1].attn_probs.detach()
    last_attn = last_attn.reshape(-1, last_attn.shape[-1], last_attn.shape[-1])

    last_grad = image_attn_blocks[-1].attn_grad.detach()
    last_grad = last_grad.reshape(-1, last_grad.shape[-1], last_grad.shape[-1])

    if method == "gradcam":
        cam = last_grad * last_attn
        cam = cam.clamp(min=0).mean(dim=0)
        image_relevance = cam[0, 1:]

    else:
        R = torch.eye(
            num_tokens, num_tokens, dtype=image_attn_blocks[0].attn_probs.dtype
        ).to(device)
        for blk in image_attn_blocks:
            cam = blk.attn_probs.detach()
            cam = cam.reshape(-1, cam.shape[-1], cam.shape[-1])

            if method == "last grad":
                grad = last_grad
            elif method == "all grads":
                grad = blk.attn_grad.detach()
            else:
                print(
                    "The available visualization methods are: 'gradcam', 'last grad', 'all grads'."
                )
                return

            cam = grad * cam
            cam = cam.clamp(min=0).mean(dim=0)
            R += torch.matmul(cam, R)

        image_relevance = R[0, 1:]

    resize_dim = int(np.sqrt(list(image_relevance.shape)[0]))

    # image_relevance = image_relevance.reshape(1, 1, 7, 7)
    image_relevance = image_relevance.reshape(1, 1, resize_dim, resize_dim)

    image_relevance = torch.nn.functional.interpolate(
        image_relevance, size=img_dim, mode="bilinear"
    )
    image_relevance = image_relevance.reshape(img_dim, img_dim).data.cpu().numpy()
    image_relevance = (image_relevance - image_relevance.min()) / (
        image_relevance.max() - image_relevance.min()
    )
    image = image[0].permute(1, 2, 0).data.cpu().numpy()
    image = (image - image.min()) / (image.max() - image.min())

    return image_relevance, image


def interpret_vit(
    image,
    target_features,
    img_encoder,
    device,
    method="last grad",
    neg_saliency=False,
    img_dim=224,
):
    image_relevance, image = vit_relevance(
        image,
        target_features,
        img_encoder,
        device,
        method=method,
        neg_saliency=neg_saliency,
        img_dim=img_dim,
    )

    vis = show_cam_on_image(image, image_relevance, neg_saliency=neg_saliency)
    vis = np.uint8(255 * vis)
    vis = cv2.cvtColor(np.array(vis), cv2.COLOR_RGB2BGR)

    return vis
    # plt.imshow(vis)


def interpret_vit_overlapped(
    image, target_features, img_encoder, device, method="last grad", img_dim=224
):
    pos_image_relevance, _ = vit_relevance(
        image,
        target_features,
        img_encoder,
        device,
        method=method,
        neg_saliency=False,
        img_dim=img_dim,
    )
    neg_image_relevance, image = vit_relevance(
        image,
        target_features,
        img_encoder,
        device,
        method=method,
        neg_saliency=True,
        img_dim=img_dim,
    )

    vis = show_overlapped_cam(image, neg_image_relevance, pos_image_relevance)
    vis = np.uint8(255 * vis)
    vis = cv2.cvtColor(np.array(vis), cv2.COLOR_RGB2BGR)

    plt.imshow(vis)


def vit_perword_relevance(
    image,
    text,
    clip_model,
    clip_tokenizer,
    device,
    masked_word="",
    use_last_grad=True,
    data_only=False,
    img_dim=224,
):
    clip_model.eval()

    main_text = clip_tokenizer(text).to(device)
    # remove the word for which you want to visualize the saliency
    masked_text = re.sub(masked_word, "", text)
    masked_text = clip_tokenizer(masked_text).to(device)

    image_features = clip_model.encode_image(image)
    main_text_features = clip_model.encode_text(main_text)
    masked_text_features = clip_model.encode_text(masked_text)

    image_features_norm = image_features.norm(dim=-1, keepdim=True)
    image_features_new = image_features / image_features_norm
    main_text_features_norm = main_text_features.norm(dim=-1, keepdim=True)
    main_text_features_new = main_text_features / main_text_features_norm

    masked_text_features_norm = masked_text_features.norm(dim=-1, keepdim=True)
    masked_text_features_new = masked_text_features / masked_text_features_norm

    objective = image_features_new[0].dot(
        main_text_features_new[0] - masked_text_features_new[0]
    )

    clip_model.visual.zero_grad()
    objective.backward(retain_graph=True)

    image_attn_blocks = list(
        dict(clip_model.visual.transformer.resblocks.named_children()).values()
    )
    num_tokens = image_attn_blocks[0].attn_probs.shape[-1]

    R = torch.eye(
        num_tokens, num_tokens, dtype=image_attn_blocks[0].attn_probs.dtype
    ).to(device)

    last_grad = image_attn_blocks[-1].attn_grad.detach()
    last_grad = last_grad.reshape(-1, last_grad.shape[-1], last_grad.shape[-1])

    for blk in image_attn_blocks:
        cam = blk.attn_probs.detach()
        cam = cam.reshape(-1, cam.shape[-1], cam.shape[-1])

        if use_last_grad:
            grad = last_grad
        else:
            grad = blk.attn_grad.detach()

        cam = grad * cam
        cam = cam.clamp(min=0).mean(dim=0)
        R += torch.matmul(cam, R)

    image_relevance = R[0, 1:]

    resize_dim = int(np.sqrt(list(image_relevance.shape)[0]))

    image_relevance = image_relevance.reshape(1, 1, resize_dim, resize_dim)

    image_relevance = torch.nn.functional.interpolate(
        image_relevance, size=img_dim, mode="bilinear"
    )
    image_relevance = image_relevance.reshape(img_dim, img_dim).data.cpu().numpy()
    image_relevance = (image_relevance - image_relevance.min()) / (
        image_relevance.max() - image_relevance.min()
    )

    if data_only:
        return image_relevance

    image = image[0].permute(1, 2, 0).data.cpu().numpy()
    image = (image - image.min()) / (image.max() - image.min())

    return image_relevance, image


def interpret_perword_vit(
    image,
    text,
    clip_model,
    clip_tokenizer,
    device,
    masked_word="",
    use_last_grad=True,
    img_dim=224,
):
    image_relevance, image = vit_perword_relevance(
        image,
        text,
        clip_model,
        clip_tokenizer,
        device,
        masked_word,
        use_last_grad,
        img_dim=img_dim,
    )
    vis = show_cam_on_image(image, image_relevance)
    vis = np.uint8(255 * vis)
    vis = cv2.cvtColor(np.array(vis), cv2.COLOR_RGB2BGR)

    plt.imshow(vis)
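To tie the pieces together, a minimal sketch of calling these functions outside the Streamlit app; the image path and query string are placeholders, and the checkpoint path is the one app.py loads for the M-CLIP image encoder (visualize_gradcam in app.py follows the same pattern):

```python
import torch
from PIL import Image

from CLIP_Explainability.clip_ import load, tokenize
from CLIP_Explainability.vit_cam import interpret_vit, vit_perword_relevance

device = "cuda" if torch.cuda.is_available() else "cpu"
model, preprocess = load(
    "./models/vit_b_16_plus_240-laion400m_e32-699c4b84.pt", device=device, jit=False
)

image = preprocess(Image.open("example.jpg")).unsqueeze(0).to(device)  # placeholder image
query = "a woman walking under a red umbrella"                         # placeholder query

text_features = model.encode_text(tokenize([query]).to(device))

# Heatmap of the image regions most relevant to the query (BGR uint8 overlay).
vis = interpret_vit(
    image.type(model.dtype), text_features, model.visual, device, img_dim=240
)

# Relevance map attributable to one word, via the full-text vs. masked-text objective.
word_map = vit_perword_relevance(
    image.type(model.dtype), query, model, tokenize, device,
    masked_word="umbrella", data_only=True, img_dim=240,
)
```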
app.py
CHANGED
@@ -1,12 +1,26 @@
+from base64 import b64encode
+from io import BytesIO
 from math import ceil

+import matplotlib.pyplot as plt
 from multilingual_clip import pt_multilingual_clip
 import numpy as np
 import pandas as pd
+from PIL import Image
+import requests
 import streamlit as st
 import torch
+from torchvision.transforms import ToPILImage
 from transformers import AutoTokenizer, AutoModel

+from CLIP_Explainability.clip_ import load, tokenize
+from CLIP_Explainability.vit_cam import (
+    interpret_vit,
+    vit_perword_relevance,
+)  # , interpret_vit_overlapped
+
+MAX_IMG_WIDTH = 450  # For small dialog
+MAX_IMG_HEIGHT = 800

 st.set_page_config(layout="wide")

@@ -15,16 +29,28 @@ def init():
     st.session_state.current_page = 1

     device = "cuda" if torch.cuda.is_available() else "cpu"
+    st.session_state.device = device

     # Load the open CLIP models
     ml_model_name = "M-CLIP/XLM-Roberta-Large-Vit-B-16Plus"
-
+    ml_model_path = "./models/vit_b_16_plus_240-laion400m_e32-699c4b84.pt"
+
+    st.session_state.ml_image_model, st.session_state.ml_image_preprocess = load(
+        ml_model_path, device=device, jit=False
+    )

     st.session_state.ml_model = pt_multilingual_clip.MultilingualCLIP.from_pretrained(
         ml_model_name
     )
     st.session_state.ml_tokenizer = AutoTokenizer.from_pretrained(ml_model_name)

+    ja_model_name = "hakuhodo-tech/japanese-clip-vit-h-14-bert-wider"
+    ja_model_path = "./models/ViT-H-14-laion2B-s32B-b79K.bin"
+
+    st.session_state.ja_image_model, st.session_state.ja_image_preprocess = load(
+        ja_model_path, device=device, jit=False
+    )
+
     st.session_state.ja_model = AutoModel.from_pretrained(
         ja_model_name, trust_remote_code=True
     ).to(device)

@@ -32,7 +58,12 @@ def init():
         ja_model_name, trust_remote_code=True
     )

+    st.session_state.active_model = "M-CLIP (multiple languages)"
+
     st.session_state.search_image_ids = []
+    st.session_state.search_image_scores = {}
+    st.session_state.activations_image = None
+    st.session_state.text_table_df = None

     # Load the image IDs
     st.session_state.images_info = pd.read_csv("./metadata.csv")

@@ -43,8 +74,10 @@ def init():
     )

     # Load the image feature vectors
-    ml_image_features = np.load("./multilingual_features.npy")
-    ja_image_features = np.load("./hakuhodo_features.npy")
+    # ml_image_features = np.load("./multilingual_features.npy")
+    # ja_image_features = np.load("./hakuhodo_features.npy")
+    ml_image_features = np.load("./resized_ml_features.npy")
+    ja_image_features = np.load("./resized_ja_features.npy")

     # Convert features to Tensors: Float32 on CPU and Float16 on GPU
     if device == "cpu":

@@ -128,16 +161,207 @@ def clip_search(search_query):
         st.session_state.image_ids,
     )

-
-    st.session_state.
+    st.session_state.search_image_ids = [match[0] for match in matches]
+    st.session_state.search_image_scores = {match[0]: match[1] for match in matches}


 def string_search():
     clip_search(st.session_state.search_field_value)


+def visualize_gradcam(viz_image_id):
+    if not st.session_state.search_field_value:
+        return
+
+    header_cols = st.columns([80, 20], vertical_alignment="bottom")
+    with header_cols[0]:
+        st.title("Image + query details")
+    with header_cols[1]:
+        if st.button("Close"):
+            st.rerun()
+
+    st.markdown(
+        f"**Query text:** {st.session_state.search_field_value} | **Image relevance:** {round(st.session_state.search_image_scores[viz_image_id], 3)}"
+    )
+
+    # with st.spinner("Calculating..."):
+    info_text = st.text("Calculating activation regions...")
+
+    image_url = st.session_state.images_info.loc[viz_image_id]["image_url"]
+    image_response = requests.get(image_url)
+    image = Image.open(BytesIO(image_response.content), formats=["JPEG"])
+
+    img_dim = 224
+    if st.session_state.active_model == "M-CLIP (multiple languages)":
+        img_dim = 240
+
+    orig_img_dims = image.size
+
+    altered_image = image.resize((img_dim, img_dim), Image.LANCZOS)
+
+    if st.session_state.active_model == "M-CLIP (multiple languages)":
+        p_image = (
+            st.session_state.ml_image_preprocess(altered_image)
+            .unsqueeze(0)
+            .to(st.session_state.device)
+        )
+
+        # Sometimes used for token importance viz
+        tokenized_text = st.session_state.ml_tokenizer.tokenize(
+            st.session_state.search_field_value
+        )
+        image_model = st.session_state.ml_image_model
+        # tokenize = st.session_state.ml_tokenizer.tokenize
+
+        text_features = st.session_state.ml_model.forward(
+            st.session_state.search_field_value, st.session_state.ml_tokenizer
+        )
+
+        vis_t = interpret_vit(
+            p_image.type(st.session_state.ml_image_model.dtype),
+            text_features,
+            st.session_state.ml_image_model.visual,
+            st.session_state.device,
+            img_dim=img_dim,
+        )
+
+    else:
+        p_image = (
+            st.session_state.ja_image_preprocess(altered_image)
+            .unsqueeze(0)
+            .to(st.session_state.device)
+        )
+
+        # Sometimes used for token importance viz
+        tokenized_text = st.session_state.ja_tokenizer.tokenize(
+            st.session_state.search_field_value
+        )
+        image_model = st.session_state.ja_image_model
+
+        t_text = st.session_state.ja_tokenizer(
+            st.session_state.search_field_value, return_tensors="pt"
+        )
+        text_features = st.session_state.ja_model.get_text_features(**t_text)
+
+        vis_t = interpret_vit(
+            p_image.type(st.session_state.ja_image_model.dtype),
+            text_features,
+            st.session_state.ja_image_model.visual,
+            st.session_state.device,
+            img_dim=img_dim,
+        )
+
+    transform = ToPILImage()
+    vis_img = transform(vis_t)
+
+    if orig_img_dims[0] > orig_img_dims[1]:
+        scale_factor = MAX_IMG_WIDTH / orig_img_dims[0]
+        scaled_dims = [MAX_IMG_WIDTH, int(orig_img_dims[1] * scale_factor)]
+    else:
+        scale_factor = MAX_IMG_HEIGHT / orig_img_dims[1]
+        scaled_dims = [int(orig_img_dims[0] * scale_factor), MAX_IMG_HEIGHT]
+
+    st.session_state.activations_image = vis_img.resize(scaled_dims)
+
+    image_io = BytesIO()
+    st.session_state.activations_image.save(image_io, "PNG")
+    dataurl = "data:image/png;base64," + b64encode(image_io.getvalue()).decode("ascii")
+
+    st.html(
+        f"""<div style="display: flex; flex-direction: column; align-items: center">
+                <img src="{dataurl}" />
+            </div>"""
+    )
+
+    info_text.empty()
+
+    tokenized_text = [tok for tok in tokenized_text if tok != "▁"]
+
+    if (
+        len(tokenized_text) > 1
+        and len(tokenized_text) < 15
+        and st.button(
+            "Calculate text importance (may take some time)",
+        )
+    ):
+        search_tokens = []
+        token_scores = []
+
+        progress_text = f"Processing {len(tokenized_text)} text tokens"
+        progress_bar = st.progress(0.0, text=progress_text)
+
+        for t, tok in enumerate(tokenized_text):
+            token = tok.replace("▁", "")
+            word_rel = vit_perword_relevance(
+                p_image,
+                st.session_state.search_field_value,
+                image_model,
+                tokenize,
+                st.session_state.device,
+                token,
+                data_only=True,
+                img_dim=img_dim,
+            )
+            avg_score = np.mean(word_rel)
+            if avg_score == 0 or np.isnan(avg_score):
+                continue
+            search_tokens.append(token)
+            token_scores.append(1 / avg_score)
+
+            progress_bar.progress(
+                (t + 1) / len(tokenized_text),
+                text=f"Processing token {t+1} of {len(tokenized_text)} tokens",
+            )
+        progress_bar.empty()
+
+        normed_scores = torch.softmax(torch.tensor(token_scores), dim=0)
+
+        token_scores = [f"{round(score.item() * 100, 3)}%" for score in normed_scores]
+        st.session_state.text_table_df = pd.DataFrame(
+            {"token": search_tokens, "importance": token_scores}
+        )
+
+        st.markdown("**Importance of each text token to relevance score**")
+        st.table(st.session_state.text_table_df)
+
+
+@st.dialog(" ", width="small")
+def image_modal(vis_image_id):
+    visualize_gradcam(vis_image_id)
+
+
 st.title("Explore Japanese visual aesthetics with CLIP models")

+st.markdown(
+    """
+    <style>
+    [data-testid=stImageCaption] {
+        padding: 0 0 0 0;
+    }
+    [data-testid=stVerticalBlockBorderWrapper] {
+        line-height: 1.2;
+    }
+    [data-testid=stVerticalBlock] {
+        gap: .75rem;
+    }
+    [data-testid=baseButton-secondary] {
+        min-height: 1rem;
+        padding: 0 0.75rem;
+        margin: 0 0 1rem 0;
+    }
+    div[aria-label="dialog"]>button[aria-label="Close"] {
+        display: none;
+    }
+    [data-testid=stFullScreenFrame] {
+        display: flex;
+        flex-direction: column;
+        align-items: center;
+    }
+    </style>
+    """,
+    unsafe_allow_html=True,
+)
+
 search_row = st.columns([45, 10, 13, 7, 25], vertical_alignment="center")
 with search_row[0]:
     search_field = st.text_input(

@@ -148,7 +372,9 @@ with search_row[0]:
         key="search_field_value",
     )
 with search_row[1]:
-    st.button(
+    st.button(
+        "Search", on_click=string_search, use_container_width=True, type="primary"
+    )
 with search_row[2]:
     st.empty()
 with search_row[3]:

@@ -163,7 +389,7 @@ with search_row[4]:
         label_visibility="collapsed",
    )

-canned_searches = st.columns([12, 22, 22, 22, 22], vertical_alignment="
+canned_searches = st.columns([12, 22, 22, 22, 22], vertical_alignment="top")
 with canned_searches[0]:
     st.markdown("**Suggested searches:**")
 if st.session_state.active_model == "M-CLIP (multiple languages)":

@@ -257,16 +483,27 @@ for image_id in batch:
     link_text = st.session_state.images_info.loc[image_id]["permalink"].split("/")[
         2
     ]
+    # st.image(
+    #     st.session_state.images_info.loc[image_id]["image_url"],
+    #     caption=st.session_state.images_info.loc[image_id]["caption"],
+    # )
     st.html(
         f"""<div style="display: flex; flex-direction: column; align-items: center">
-            <img src="{st.session_state.images_info.loc[image_id]['image_url']}" style="max-width: 100%; max-height:
-            <div>{st.session_state.images_info.loc[image_id]['caption']}</div>
+            <img src="{st.session_state.images_info.loc[image_id]['image_url']}" style="max-width: 100%; max-height: {MAX_IMG_HEIGHT}px" />
+            <div>{st.session_state.images_info.loc[image_id]['caption']} <b>[{round(st.session_state.search_image_scores[image_id], 3)}]</b></div>
         </div>"""
     )
     st.caption(
-        f"""<div style="display: flex; flex-direction: column; align-items: center; position: relative; top: -
+        f"""<div style="display: flex; flex-direction: column; align-items: center; position: relative; top: -12px">
         <a href="{st.session_state.images_info.loc[image_id]['permalink']}">{link_text}</a>
         <div>""",
         unsafe_allow_html=True,
     )
+    st.button(
+        "Explain this",
+        on_click=image_modal,
+        args=[image_id],
+        use_container_width=True,
+        key=image_id,
+    )
     col = (col + 1) % row_size
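The token-importance table produced in `visualize_gradcam` reduces to a simple scoring rule: each token's raw score is the inverse of the mean of its per-word relevance map, and the scores are softmax-normalized into percentages. A standalone sketch with hypothetical relevance maps (in the app each map comes from `vit_perword_relevance(..., data_only=True)`):

```python
import numpy as np
import torch

# Hypothetical per-token relevance maps; shapes match the interpolated saliency maps.
relevance_maps = {"red": np.random.rand(240, 240), "umbrella": np.random.rand(240, 240)}

tokens, scores = [], []
for token, rel_map in relevance_maps.items():
    avg = rel_map.mean()
    if avg == 0 or np.isnan(avg):
        continue               # skip degenerate maps, as app.py does
    tokens.append(token)
    scores.append(1 / avg)     # inverse mean relevance is the raw importance score

normed = torch.softmax(torch.tensor(scores), dim=0)
for token, score in zip(tokens, normed):
    print(f"{token}: {score.item() * 100:.1f}%")
```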
requirements.txt
CHANGED
@@ -1,6 +1,9 @@
 multilingual_clip==1.0.10
 numpy==1.26
 pandas==2.1.2
+pillow==10.1.0
+requests==2.31.0
 sentencepiece==0.2.0
 torch==2.4.0
+torchvision==0.19.0
 transformers==4.35.0
resized_ja_features.npy
ADDED
@@ -0,0 +1,3 @@
version https://git-lfs.github.com/spec/v1
oid sha256:5ec1ba33ef7ffe1236ce4adbfae3d785e89ab7ce98cbc1e99ff74c2391a8a657
size 25903232
resized_ml_features.npy
ADDED
@@ -0,0 +1,3 @@
version https://git-lfs.github.com/spec/v1
oid sha256:0b13a2171ead017721de26fe8c250b871ff4917dc573fbbe9da6b24cc348b156
size 16189568
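The two `resized_*_features.npy` files are what app.py now loads with `np.load`. They are presumably produced offline by encoding the resized collection images with the corresponding image encoders; a rough sketch of that preprocessing under this assumption, with illustrative image paths:

```python
import numpy as np
import torch
from PIL import Image

from CLIP_Explainability.clip_ import load

device = "cuda" if torch.cuda.is_available() else "cpu"
model, preprocess = load(
    "./models/vit_b_16_plus_240-laion400m_e32-699c4b84.pt", device=device, jit=False
)

image_paths = ["images/001.jpg", "images/002.jpg"]  # illustrative; the real list comes from metadata.csv

features = []
with torch.no_grad():
    for path in image_paths:
        img = preprocess(Image.open(path).convert("RGB")).unsqueeze(0).to(device)
        feat = model.encode_image(img)
        features.append(feat.squeeze(0).float().cpu().numpy())

np.save("./resized_ml_features.npy", np.stack(features))  # analogous for the Japanese CLIP features
```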