fffiloni committed
Commit 9c9498f · verified · 1 Parent(s): f047660

Migrated from GitHub

.gitattributes CHANGED
@@ -33,3 +33,7 @@ saved_model/**/* filter=lfs diff=lfs merge=lfs -text
  *.zip filter=lfs diff=lfs merge=lfs -text
  *.zst filter=lfs diff=lfs merge=lfs -text
  *tfevents* filter=lfs diff=lfs merge=lfs -text
+ assets/CBS_2.gif filter=lfs diff=lfs merge=lfs -text
+ assets/MLB_1.gif filter=lfs diff=lfs merge=lfs -text
+ assets/Sunny_1.gif filter=lfs diff=lfs merge=lfs -text
+ assets/Titanic_1.gif filter=lfs diff=lfs merge=lfs -text
LICENSE ADDED
@@ -0,0 +1,21 @@
+ MIT License
+
+ Copyright (c) 2024 Fiona Ryan
+
+ Permission is hereby granted, free of charge, to any person obtaining a copy
+ of this software and associated documentation files (the "Software"), to deal
+ in the Software without restriction, including without limitation the rights
+ to use, copy, modify, merge, publish, distribute, sublicense, and/or sell
+ copies of the Software, and to permit persons to whom the Software is
+ furnished to do so, subject to the following conditions:
+
+ The above copyright notice and this permission notice shall be included in all
+ copies or substantial portions of the Software.
+
+ THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
+ IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
+ FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE
+ AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER
+ LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM,
+ OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE
+ SOFTWARE.
ORIGINAL_README.md ADDED
@@ -0,0 +1,153 @@
+ # Gaze-LLE
+ <div style="text-align:center;">
+ <img src="./assets/the_office.png" height="100"/>
+ <img src="./assets/MLB_1.gif" height="100"/>
+ <img src="./assets/succession.png" height="100"/>
+ <img src="./assets/CBS_2.gif" height="100"/>
+ </div>
+
+ [Gaze-LLE: Gaze Target Estimation via Large-Scale Learned Encoders](https://arxiv.org/abs/2412.09586) \
+ [Fiona Ryan](https://fkryan.github.io/), Ajay Bati, [Sangmin Lee](https://sites.google.com/view/sangmin-lee), [Daniel Bolya](https://dbolya.github.io/), [Judy Hoffman](https://faculty.cc.gatech.edu/~judy/)\*, [James M. Rehg](https://rehg.org/)\*
+
+ This is the official implementation of Gaze-LLE, a transformer-based approach to gaze target estimation that leverages the power of pretrained visual foundation models. Gaze-LLE provides a streamlined gaze architecture that learns only a lightweight gaze decoder on top of a frozen, pretrained visual encoder (DINOv2). Gaze-LLE learns 1-2 orders of magnitude fewer parameters than prior works and doesn't require any extra input modalities such as depth or pose!
+
+ <div style="text-align:center;">
+ <img src="./assets/gazelle_arch.png" height="200"/>
+ </div>
+
+ ## Installation
+
+ Clone this repo, then create the virtual environment.
+ ```
+ conda env create -f environment.yml
+ conda activate gazelle
+ pip install -e .
+ ```
+ If your system supports it, consider installing [xformers](https://github.com/facebookresearch/xformers) to speed up attention computation.
+ ```
+ pip3 install -U xformers --index-url https://download.pytorch.org/whl/cu118
+ ```
+
+ ## Pretrained Models
+
+ We provide the following pretrained models for download.
+ | Name | Backbone type | Backbone name | Training data | Checkpoint |
+ | ---- | ------------- | ------------- | ------------- | ---------- |
+ | ```gazelle_dinov2_vitb14``` | DINOv2 ViT-B | ```dinov2_vitb14``` | GazeFollow | [Download](https://github.com/fkryan/gazelle/releases/download/v1.0.0/gazelle_dinov2_vitb14.pt) |
+ | ```gazelle_dinov2_vitl14``` | DINOv2 ViT-L | ```dinov2_vitl14``` | GazeFollow | [Download](https://github.com/fkryan/gazelle/releases/download/v1.0.0/gazelle_dinov2_vitl14.pt) |
+ | ```gazelle_dinov2_vitb14_inout``` | DINOv2 ViT-B | ```dinov2_vitb14``` | GazeFollow -> VideoAttentionTarget | [Download](https://github.com/fkryan/gazelle/releases/download/v1.0.0/gazelle_dinov2_vitb14_inout.pt) |
+ | ```gazelle_dinov2_vitl14_inout``` | DINOv2 ViT-L | ```dinov2_vitl14``` | GazeFollow -> VideoAttentionTarget | [Download](https://github.com/fkryan/gazelle/releases/download/v1.0.0/gazelle_dinov2_vitl14_inout.pt) |
+
+ Note that our Gaze-LLE checkpoints contain only the gaze decoder weights - the DINOv2 backbone weights are downloaded from ```facebookresearch/dinov2``` on PyTorch Hub when the Gaze-LLE model is created in our code.
+
+ The GazeFollow-trained models output a spatial heatmap of gaze locations over the scene with values in the range ```[0,1]```, where 1 represents the highest probability of the location being a gaze target. The models that are additionally finetuned on VideoAttentionTarget also predict an in/out-of-frame gaze score in the range ```[0,1]```, where 1 means the person's gaze target is in the frame.
+
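+ For reference, here is a minimal sketch of converting a predicted ```[64, 64]``` heatmap (like ```predicted_heatmap``` in the usage example below) into a normalized gaze point by taking its argmax, mirroring how the evaluation scripts compute the L2 error:
+ ```
+ # illustrative sketch: turn a [64, 64] heatmap into a normalized (x, y) gaze point
+ flat_idx = predicted_heatmap.argmax().item()
+ row, col = divmod(flat_idx, predicted_heatmap.shape[1])
+ gaze_x = col / predicted_heatmap.shape[1]  # normalized x in [0, 1]
+ gaze_y = row / predicted_heatmap.shape[0]  # normalized y in [0, 1]
+ ```
+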
+ ### PyTorch Hub
+
+ The models are also available on PyTorch Hub for easy use without installing from source.
+ ```
+ model, transform = torch.hub.load('fkryan/gazelle', 'gazelle_dinov2_vitb14')
+ model, transform = torch.hub.load('fkryan/gazelle', 'gazelle_dinov2_vitl14')
+ model, transform = torch.hub.load('fkryan/gazelle', 'gazelle_dinov2_vitb14_inout')
+ model, transform = torch.hub.load('fkryan/gazelle', 'gazelle_dinov2_vitl14_inout')
+ ```
+
+
+ ## Usage
+ ### Colab Demo Notebook
+ Check out our [Demo Notebook](https://colab.research.google.com/drive/1TSoyFvNs1-au9kjOZN_fo5ebdzngSPDq?usp=sharing) on Google Colab for how to detect gaze for all people in an image.
+
+ ### Gaze Prediction
+ Gaze-LLE is set up for multi-person inference: for a single image, Gaze-LLE encodes the scene only once and then reuses the features to predict the gaze of multiple people in that image. The input is a batch of image tensors and, for each image, a list of bounding boxes for the heads of the people whose gaze should be predicted. The bounding boxes are tuples of the form ```(xmin, ymin, xmax, ymax)``` in ```[0,1]```-normalized image coordinates. Below we show how to perform inference for a single person in a single image.
+ ```
+ from PIL import Image
+ import torch
+ from gazelle.model import get_gazelle_model
+
+ model, transform = get_gazelle_model("gazelle_dinov2_vitl14_inout")
+ model.load_gazelle_state_dict(torch.load("/path/to/checkpoint.pt", weights_only=True))
+ model.eval()
+
+ device = "cuda" if torch.cuda.is_available() else "cpu"
+ model.to(device)
+
+ image = Image.open("path/to/image.png").convert("RGB")
+ input = {
+     "images": transform(image).unsqueeze(dim=0).to(device),  # tensor of shape [1, 3, 448, 448]
+     "bboxes": [[(0.1, 0.2, 0.5, 0.7)]]  # list of lists of bbox tuples
+ }
+
+ with torch.no_grad():
+     output = model(input)
+
+ predicted_heatmap = output["heatmap"][0][0]  # prediction for the first person in the first image; tensor of size [64, 64]
+ predicted_inout = output["inout"][0][0]  # in/out-of-frame score (1 = in frame) (output["inout"] is None for non-inout models)
+ ```
+ We empirically find that Gaze-LLE is effective without a bounding box input for scenes with just one person. However, providing a bounding box can improve results, and it is necessary for scenes with multiple people in order to specify whose gaze to estimate. To run inference without a bounding box, use None in place of a bounding box tuple in the bbox list (e.g. ```input["bboxes"] = [[None]]``` in the example above), as shown in the multi-person sketch below.
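+
+ As an illustrative sketch of the multi-person case (hypothetical bbox values), the example above can be extended to two heads in the same image, one of them left unspecified with None:
+ ```
+ # two people in the same image: person 0 has a head bbox, person 1 is unspecified (None)
+ input = {
+     "images": transform(image).unsqueeze(dim=0).to(device),
+     "bboxes": [[(0.1, 0.2, 0.5, 0.7), None]]
+ }
+
+ with torch.no_grad():
+     output = model(input)
+
+ heatmap_person0 = output["heatmap"][0][0]  # [64, 64] heatmap for the first head
+ heatmap_person1 = output["heatmap"][0][1]  # [64, 64] heatmap for the second head
+ ```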
+
+
+ We also provide a function to visualize the predicted heatmap for an image.
+ ```
+ import matplotlib.pyplot as plt
+ from gazelle.utils import visualize_heatmap
+
+ viz = visualize_heatmap(image, predicted_heatmap)
+ plt.imshow(viz)
+ plt.show()
+ ```
+
+
+ ## Evaluate
+ We provide evaluation scripts for GazeFollow and VideoAttentionTarget below to reproduce our results from our checkpoints.
+ ### GazeFollow
+ Download the GazeFollow dataset [here](https://github.com/ejcgt/attention-target-detection?tab=readme-ov-file#dataset). We provide a preprocessing script, ```data_prep/preprocess_gazefollow.py```, which preprocesses the annotations and compiles them into a JSON file for each split within the dataset folder. Run the preprocessing script as
+ ```
+ python data_prep/preprocess_gazefollow.py --data_path /path/to/gazefollow/data_new
+ ```
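+ For reference, each frame entry in the resulting JSON files roughly follows the structure below (the values are illustrative placeholders, not real annotations):
+ ```
+ # one entry of test_preprocessed.json (placeholder values)
+ {
+     "path": "path/to/test/image.jpg", "width": 640, "height": 480, "num_heads": 1,
+     "crop_region": [120.0, 60.0, 350.0, 250.0],
+     "heads": [{
+         "bbox": [120.0, 60.0, 200.0, 150.0],
+         "bbox_norm": [0.19, 0.12, 0.31, 0.31],
+         "gazex": [350.0], "gazey": [250.0], "gazex_norm": [0.55], "gazey_norm": [0.52],
+         "inout": 1, "num_annot": 1,
+         "crop_region": [120.0, 60.0, 350.0, 250.0],
+         "crop_region_norm": [0.19, 0.12, 0.55, 0.52],
+         "head_id": 0
+     }]
+ }
+ ```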
+ Download the pretrained model checkpoints above and use ```--model_name``` and ```--ckpt_path``` to specify the model type and checkpoint for evaluation.
+
+ ```
+ python scripts/eval_gazefollow.py \
+     --data_path /path/to/gazefollow/data_new \
+     --model_name gazelle_dinov2_vitl14 \
+     --ckpt_path /path/to/checkpoint.pt \
+     --batch_size 128
+ ```
+
+
+ ### VideoAttentionTarget
+ Download the VideoAttentionTarget dataset [here](https://github.com/ejcgt/attention-target-detection?tab=readme-ov-file#dataset-1). We provide a preprocessing script, ```data_prep/preprocess_vat.py```, which preprocesses the annotations and compiles them into a JSON file for each split within the dataset folder. Run the preprocessing script as
+ ```
+ python data_prep/preprocess_vat.py --data_path /path/to/videoattentiontarget
+ ```
+ Download the pretrained model checkpoints above and use ```--model_name``` and ```--ckpt_path``` to specify the model type and checkpoint for evaluation.
+ ```
+ python scripts/eval_vat.py \
+     --data_path /path/to/videoattentiontarget \
+     --model_name gazelle_dinov2_vitl14_inout \
+     --ckpt_path /path/to/checkpoint.pt \
+     --batch_size 64
+ ```
+
+ ## Citation
+
+ ```
+ @article{ryan2024gazelle,
+   author = {Ryan, Fiona and Bati, Ajay and Lee, Sangmin and Bolya, Daniel and Hoffman, Judy and Rehg, James M},
+   title = {Gaze-LLE: Gaze Target Estimation via Large-Scale Learned Encoders},
+   journal = {arXiv preprint arXiv:2412.09586},
+   year = {2024},
+ }
+ ```
+
+ ## References
+
+ - Our models are built on top of pretrained DINOv2 models from PyTorch Hub ([Github repo](https://github.com/facebookresearch/dinov2)).
+
+ - Our GazeFollow and VideoAttentionTarget preprocessing code is based on [Detecting Attended Targets in Video](https://github.com/ejcgt/attention-target-detection).
+
+ - We use [PyTorch Image Models (timm)](https://github.com/huggingface/pytorch-image-models) for our transformer implementation.
+
+ - We use [xFormers](https://github.com/facebookresearch/xformers) for efficient multi-head attention.
assets/CBS_2.gif ADDED

Git LFS Details

  • SHA256: 350d577f58dc36b436cd0e900d92228b5a88c4cc35d2e4527f6262dc886dcb96
  • Pointer size: 134 Bytes
  • Size of remote file: 101 MB
assets/MLB_1.gif ADDED

Git LFS Details

  • SHA256: 39ae696054f546f13ae72cc72ceb88404a76a550a34f2aaedc19a7206a2bdfbb
  • Pointer size: 133 Bytes
  • Size of remote file: 26.7 MB
assets/Sunny_1.gif ADDED

Git LFS Details

  • SHA256: 4364a39dedd8d92f8a08ff08c3a4d80c1e03cd0bd2e22bf3e9a2782f9d9f74e1
  • Pointer size: 132 Bytes
  • Size of remote file: 7.58 MB
assets/Titanic_1.gif ADDED

Git LFS Details

  • SHA256: ad54e4362747c94225fe29a2beb2e9fcb10bf92b0c58f0a54dfdf007146064ff
  • Pointer size: 133 Bytes
  • Size of remote file: 18.2 MB
assets/gazelle_arch.png ADDED
assets/succession.png ADDED
assets/the_office.png ADDED
data_prep/preprocess_gazefollow.py ADDED
@@ -0,0 +1,188 @@
+ import os
+ import pandas as pd
+ import json
+ from PIL import Image
+ import argparse
+
+ # preprocessing adapted from https://github.com/ejcgt/attention-target-detection/blob/master/dataset.py
+
+ parser = argparse.ArgumentParser()
+ parser.add_argument("--data_path", type=str, default="./data/gazefollow")
+ args = parser.parse_args()
+
+
+ def main(DATA_PATH):
+
+     # TRAIN
+     train_csv_path = os.path.join(DATA_PATH, "train_annotations_release.txt")
+     column_names = ['path', 'idx', 'body_bbox_x', 'body_bbox_y', 'body_bbox_w', 'body_bbox_h', 'eye_x', 'eye_y',
+                     'gaze_x', 'gaze_y', 'bbox_x_min', 'bbox_y_min', 'bbox_x_max', 'bbox_y_max', 'inout', 'source', 'meta']
+     df = pd.read_csv(train_csv_path, header=None, names=column_names, index_col=False)
+     df = df[df['inout'] != -1]
+     df = df.groupby("path").agg(list)  # aggregate annotations by frame
+
+     multiperson_ex = 0
+     TRAIN_FRAMES = []
+     for path, row in df.iterrows():
+         img_path = os.path.join(DATA_PATH, path)
+         img = Image.open(img_path)
+         width, height = img.size
+
+         num_people = len(row['idx'])
+         if num_people > 1:
+             multiperson_ex += 1
+         heads = []
+         crop_constraint_xs = []
+         crop_constraint_ys = []
+
+         for i in range(num_people):
+             xmin, ymin, xmax, ymax = row['bbox_x_min'][i], row['bbox_y_min'][i], row['bbox_x_max'][i], row['bbox_y_max'][i]
+             gazex = row['gaze_x'][i] * float(width)
+             gazey = row['gaze_y'][i] * float(height)
+             gazex_norm = row['gaze_x'][i]
+             gazey_norm = row['gaze_y'][i]
+
+             if xmin > xmax:
+                 xmin, xmax = xmax, xmin
+             if ymin > ymax:
+                 ymin, ymax = ymax, ymin
+
+             # clamp out-of-frame bbox annotations back inside the image
+             xmin = max(xmin, 0)
+             ymin = max(ymin, 0)
+             xmax = min(xmax, width)
+             ymax = min(ymax, height)
+
+             # precalculate feasible crop region (containing bbox and gaze target)
+             crop_xmin = min(xmin, gazex)
+             crop_ymin = min(ymin, gazey)
+             crop_xmax = max(xmax, gazex)
+             crop_ymax = max(ymax, gazey)
+             crop_constraint_xs.extend([crop_xmin, crop_xmax])
+             crop_constraint_ys.extend([crop_ymin, crop_ymax])
+
+             heads.append({
+                 'bbox': [xmin, ymin, xmax, ymax],
+                 'bbox_norm': [xmin / float(width), ymin / float(height), xmax / float(width), ymax / float(height)],
+                 'inout': row['inout'][i],
+                 'gazex': [gazex],  # convert to list for consistency with multi-annotation format
+                 'gazey': [gazey],
+                 'gazex_norm': [gazex_norm],
+                 'gazey_norm': [gazey_norm],
+                 'crop_region': [crop_xmin, crop_ymin, crop_xmax, crop_ymax],
+                 'crop_region_norm': [crop_xmin / float(width), crop_ymin / float(height), crop_xmax / float(width), crop_ymax / float(height)],
+                 'head_id': i
+             })
+         TRAIN_FRAMES.append({
+             'path': path,
+             'heads': heads,
+             'num_heads': num_people,
+             'width': width,
+             'height': height,
+             'crop_region': [min(crop_constraint_xs), min(crop_constraint_ys), max(crop_constraint_xs), max(crop_constraint_ys)],
+         })
+
+     print("Train set: {} frames, {} multi-person".format(len(TRAIN_FRAMES), multiperson_ex))
+     out_file = open(os.path.join(DATA_PATH, "train_preprocessed.json"), "w")
+     json.dump(TRAIN_FRAMES, out_file)
+
+     # TEST
+     test_csv_path = os.path.join(DATA_PATH, "test_annotations_release.txt")
+     column_names = ['path', 'idx', 'body_bbox_x', 'body_bbox_y', 'body_bbox_w', 'body_bbox_h', 'eye_x', 'eye_y',
+                     'gaze_x', 'gaze_y', 'bbox_x_min', 'bbox_y_min', 'bbox_x_max', 'bbox_y_max', 'source', 'meta']
+     df = pd.read_csv(test_csv_path, header=None, names=column_names, index_col=False)
+
+     TEST_FRAME_DICT = {}
+     df = df.groupby(["path", "eye_x"]).agg(list)  # aggregate the multiple annotations per person
+     for id, row in df.iterrows():  # aggregate by frame
+         path, _ = id
+         if path in TEST_FRAME_DICT.keys():
+             TEST_FRAME_DICT[path].append(row)
+         else:
+             TEST_FRAME_DICT[path] = [row]
+
+     multiperson_ex = 0
+     TEST_FRAMES = []
+     for path in TEST_FRAME_DICT.keys():
+         img_path = os.path.join(DATA_PATH, path)
+         img = Image.open(img_path)
+         width, height = img.size
+
+         item = TEST_FRAME_DICT[path]
+         num_people = len(item)
+         heads = []
+         crop_constraint_xs = []
+         crop_constraint_ys = []
+
+         for i in range(num_people):
+             row = item[i]
+             assert row['bbox_x_min'].count(row['bbox_x_min'][0]) == len(row['bbox_x_min'])  # quick check that all bboxes are equivalent
+             xmin, ymin, xmax, ymax = row['bbox_x_min'][0], row['bbox_y_min'][0], row['bbox_x_max'][0], row['bbox_y_max'][0]
+
+             if xmin > xmax:
+                 xmin, xmax = xmax, xmin
+             if ymin > ymax:
+                 ymin, ymax = ymax, ymin
+
+             # clamp out-of-frame bbox annotations back inside the image
+             xmin = max(xmin, 0)
+             ymin = max(ymin, 0)
+             xmax = min(xmax, width)
+             ymax = min(ymax, height)
+
+             gazex_norm = [x for x in row['gaze_x']]
+             gazey_norm = [y for y in row['gaze_y']]
+             gazex = [x * float(width) for x in row['gaze_x']]
+             gazey = [y * float(height) for y in row['gaze_y']]
+
+             # precalculate feasible crop region (containing bbox and gaze targets)
+             crop_xmin = min(xmin, *gazex)
+             crop_ymin = min(ymin, *gazey)
+             crop_xmax = max(xmax, *gazex)
+             crop_ymax = max(ymax, *gazey)
+             crop_constraint_xs.extend([crop_xmin, crop_xmax])
+             crop_constraint_ys.extend([crop_ymin, crop_ymax])
+
+             heads.append({
+                 'bbox': [xmin, ymin, xmax, ymax],
+                 'bbox_norm': [xmin / float(width), ymin / float(height), xmax / float(width), ymax / float(height)],
+                 'gazex': gazex,
+                 'gazey': gazey,
+                 'gazex_norm': gazex_norm,
+                 'gazey_norm': gazey_norm,
+                 'inout': 1,  # all test frames are in frame
+                 'num_annot': len(gazex),
+                 'crop_region': [crop_xmin, crop_ymin, crop_xmax, crop_ymax],
+                 'crop_region_norm': [crop_xmin / float(width), crop_ymin / float(height), crop_xmax / float(width), crop_ymax / float(height)],
+                 'head_id': i
+             })
+
+         # visualize_heads(img_path, heads)
+         TEST_FRAMES.append({
+             'path': path,
+             'heads': heads,
+             'num_heads': num_people,
+             'width': width,
+             'height': height,
+             'crop_region': [min(crop_constraint_xs), min(crop_constraint_ys), max(crop_constraint_xs), max(crop_constraint_ys)],
+         })
+         if num_people > 1:
+             multiperson_ex += 1
+
+     print("Test set: {} frames, {} multi-person".format(len(TEST_FRAMES), multiperson_ex))
+     out_file = open(os.path.join(DATA_PATH, "test_preprocessed.json"), "w")
+     json.dump(TEST_FRAMES, out_file)
+
+
+ if __name__ == "__main__":
+     main(args.data_path)
data_prep/preprocess_vat.py ADDED
@@ -0,0 +1,116 @@
+ import argparse
+ import glob
+ from functools import reduce
+ import os
+ import pandas as pd
+ import json
+ import numpy as np
+ from PIL import Image
+
+ parser = argparse.ArgumentParser()
+ parser.add_argument("--data_path", type=str, default="./data/videoattentiontarget")
+ args = parser.parse_args()
+
+ # preprocessing adapted from https://github.com/ejcgt/attention-target-detection/blob/master/dataset.py
+
+ def merge_dfs(ls):
+     for i, df in enumerate(ls):  # give columns unique names
+         df.columns = [col if col == "path" else f"{col}_df{i}" for col in df.columns]
+     merged_df = reduce(
+         lambda left, right: pd.merge(left, right, on=["path"], how="outer"), ls
+     )
+     merged_df = merged_df.sort_values(by=["path"])
+     merged_df = merged_df.reset_index(drop=True)
+     return merged_df
+
+ def smooth_by_conv(window_size, df, col):
+     """Temporal smoothing on labels to match the original VideoAttentionTarget evaluation.
+     Adapted from https://github.com/ejcgt/attention-target-detection/blob/acd264a3c9e6002b71244dea8c1873e5c5818500/utils/myutils.py"""
+     values = df[col].values
+     padded_track = np.concatenate([values[0].repeat(window_size // 2), values, values[-1].repeat(window_size // 2)])
+     smoothed_signals = np.convolve(
+         padded_track.squeeze(), np.ones(window_size) / window_size, mode="valid"
+     )
+     return smoothed_signals
+
+ def smooth_df(window_size, df):
+     df["xmin"] = smooth_by_conv(window_size, df, "xmin")
+     df["ymin"] = smooth_by_conv(window_size, df, "ymin")
+     df["xmax"] = smooth_by_conv(window_size, df, "xmax")
+     df["ymax"] = smooth_by_conv(window_size, df, "ymax")
+     return df
+
+
+ def main(PATH):
+     # preprocess by sequence and person track
+     splits = ["train", "test"]
+
+     for split in splits:
+         sequences = []
+         max_num_ppl = 0
+         seq_idx = 0
+         for seq_path in glob.glob(
+             os.path.join(PATH, "annotations", split, "*", "*")
+         ):
+             seq_img_path = os.path.join("images", *seq_path.split("/")[-2:])
+             sample_image = os.path.join(PATH, seq_img_path, os.listdir(os.path.join(PATH, seq_img_path))[0])
+             width, height = Image.open(sample_image).size
+             seq_dict = {"path": seq_img_path, "width": width, "height": height}
+             frames = []
+             person_files = glob.glob(os.path.join(seq_path, "*"))
+             num_ppl = len(person_files)
+             if num_ppl > max_num_ppl:
+                 max_num_ppl = num_ppl
+             person_dfs = [
+                 pd.read_csv(
+                     file,
+                     header=None,
+                     index_col=False,
+                     names=["path", "xmin", "ymin", "xmax", "ymax", "gazex", "gazey"],
+                 )
+                 for file in person_files
+             ]
+             # moving-average smoothing to match the original benchmark's evaluation
+             window_size = 11
+             person_dfs = [smooth_df(window_size, df) for df in person_dfs]
+             merged_df = merge_dfs(person_dfs)  # merge per-person annotations for the same frames
+             for frame_idx, row in merged_df.iterrows():
+                 frame_dict = {
+                     "path": os.path.join(seq_img_path, row["path"]),
+                     "heads": [],
+                 }
+                 p_idx = 0
+                 for i in range(1, num_ppl * 6 + 1, 6):
+                     if not np.isnan(row.iloc[i]):  # nan indicates a lack of continuity (a person leaving the frame for a period of time)
+                         xmin, ymin, xmax, ymax, gazex, gazey = row.iloc[i:i + 6].values.tolist()
+                         # match the original benchmark's preprocessing of annotations
+                         if gazex >= 0 and gazey < 0:
+                             gazey = 0
+                         elif gazey >= 0 and gazex < 0:
+                             gazex = 0
+                         inout = int(gazex >= 0 and gazey >= 0)
+                         frame_dict["heads"].append({
+                             "bbox": [xmin, ymin, xmax, ymax],
+                             "bbox_norm": [xmin / float(width), ymin / float(height), xmax / float(width), ymax / float(height)],
+                             "gazex": [gazex],
+                             "gazex_norm": [gazex / float(width)],
+                             "gazey": [gazey],
+                             "gazey_norm": [gazey / float(height)],
+                             "inout": inout
+                         })
+                     p_idx = p_idx + 1
+
+                 frames.append(frame_dict)
+             seq_dict["frames"] = frames
+             sequences.append(seq_dict)
+             seq_idx += 1
+
+         print("{} max people per image {}".format(split, max_num_ppl))
+         print("{} num unique video sequences {}".format(split, len(sequences)))
+
+         out_file = open(os.path.join(PATH, "{}_preprocessed.json".format(split)), "w")
+         json.dump(sequences, out_file)
+
+ if __name__ == "__main__":
+     main(args.data_path)
environment.yml ADDED
@@ -0,0 +1,16 @@
+ name: gazelle
+ channels:
+   - nvidia
+   - pytorch
+   - conda-forge
+   - defaults
+ dependencies:
+   - python=3.9
+   - pytorch=2.5.1
+   - torchvision=0.20.1
+   - torchaudio=2.5.1
+   - pytorch-cuda=11.8
+   - timm
+   - scikit-learn
+   - matplotlib
+   - pandas
gazelle/backbone.py ADDED
@@ -0,0 +1,55 @@
+ from abc import ABC, abstractmethod
+ import torch
+ import torch.nn as nn
+ import torchvision.transforms as transforms
+
+ # Abstract Backbone class
+ class Backbone(nn.Module, ABC):
+     def __init__(self):
+         super(Backbone, self).__init__()
+
+     @abstractmethod
+     def forward(self, x):
+         pass
+
+     @abstractmethod
+     def get_dimension(self):
+         pass
+
+     @abstractmethod
+     def get_out_size(self, in_size):
+         pass
+
+     def get_transform(self):
+         pass
+
+
+ # Official DINOv2 backbones from torch hub (https://github.com/facebookresearch/dinov2#pretrained-backbones-via-pytorch-hub)
+ class DinoV2Backbone(Backbone):
+     def __init__(self, model_name):
+         super(DinoV2Backbone, self).__init__()
+         self.model = torch.hub.load('facebookresearch/dinov2', model_name)
+
+     def forward(self, x):
+         b, c, h, w = x.shape
+         out_h, out_w = self.get_out_size((h, w))
+         x = self.model.forward_features(x)['x_norm_patchtokens']
+         x = x.view(x.size(0), out_h, out_w, -1).permute(0, 3, 1, 2)  # "b (out_h out_w) c -> b c out_h out_w"
+         return x
+
+     def get_dimension(self):
+         return self.model.embed_dim
+
+     def get_out_size(self, in_size):
+         h, w = in_size
+         return (h // self.model.patch_size, w // self.model.patch_size)
+
+     def get_transform(self, in_size):
+         return transforms.Compose([
+             transforms.ToTensor(),
+             transforms.Normalize(
+                 mean=[0.485, 0.456, 0.406],
+                 std=[0.229, 0.224, 0.225]
+             ),
+             transforms.Resize(in_size),
+         ])
gazelle/model.py ADDED
@@ -0,0 +1,189 @@
+ import torch
+ import torch.nn as nn
+ import torchvision
+ from timm.models.vision_transformer import Block
+ import math
+
+ import gazelle.utils as utils
+ from gazelle.backbone import DinoV2Backbone
+
+
+ class GazeLLE(nn.Module):
+     def __init__(self, backbone, inout=False, dim=256, num_layers=3, in_size=(448, 448), out_size=(64, 64)):
+         super().__init__()
+         self.backbone = backbone
+         self.dim = dim
+         self.num_layers = num_layers
+         self.featmap_h, self.featmap_w = backbone.get_out_size(in_size)
+         self.in_size = in_size
+         self.out_size = out_size
+         self.inout = inout
+
+         self.linear = nn.Conv2d(backbone.get_dimension(), self.dim, 1)
+         self.register_buffer("pos_embed", positionalencoding2d(self.dim, self.featmap_h, self.featmap_w).squeeze(dim=0).squeeze(dim=0))
+         self.transformer = nn.Sequential(*[
+             Block(
+                 dim=self.dim,
+                 num_heads=8,
+                 mlp_ratio=4,
+                 drop_path=0.1)
+             for i in range(num_layers)
+         ])
+         self.heatmap_head = nn.Sequential(
+             nn.ConvTranspose2d(dim, dim, kernel_size=2, stride=2),
+             nn.Conv2d(dim, 1, kernel_size=1, bias=False),
+             nn.Sigmoid()
+         )
+         self.head_token = nn.Embedding(1, self.dim)
+         if self.inout:
+             self.inout_head = nn.Sequential(
+                 nn.Linear(self.dim, 128),
+                 nn.ReLU(),
+                 nn.Dropout(0.1),
+                 nn.Linear(128, 1),
+                 nn.Sigmoid()
+             )
+             self.inout_token = nn.Embedding(1, self.dim)
+
+     def forward(self, input):
+         # input["images"]: [B, 3, H, W] tensor of images
+         # input["bboxes"]: list of lists of bbox tuples [[(xmin, ymin, xmax, ymax)]] per image in normalized image coords
+
+         num_ppl_per_img = [len(bbox_list) for bbox_list in input["bboxes"]]
+         x = self.backbone.forward(input["images"])
+         x = self.linear(x)
+         x = x + self.pos_embed
+         x = utils.repeat_tensors(x, num_ppl_per_img)  # repeat image features along people dimension per image
+         head_maps = torch.cat(self.get_input_head_maps(input["bboxes"]), dim=0).to(x.device)  # [sum(N_p), 32, 32]
+         head_map_embeddings = head_maps.unsqueeze(dim=1) * self.head_token.weight.unsqueeze(-1).unsqueeze(-1)
+         x = x + head_map_embeddings
+         x = x.flatten(start_dim=2).permute(0, 2, 1)  # "b c h w -> b (h w) c"
+
+         if self.inout:
+             x = torch.cat([self.inout_token.weight.unsqueeze(dim=0).repeat(x.shape[0], 1, 1), x], dim=1)
+
+         x = self.transformer(x)
+
+         if self.inout:
+             inout_tokens = x[:, 0, :]
+             inout_preds = self.inout_head(inout_tokens).squeeze(dim=-1)
+             inout_preds = utils.split_tensors(inout_preds, num_ppl_per_img)
+             x = x[:, 1:, :]  # slice off inout tokens from scene tokens
+
+         x = x.reshape(x.shape[0], self.featmap_h, self.featmap_w, x.shape[2]).permute(0, 3, 1, 2)  # "b (h w) c -> b c h w"
+         x = self.heatmap_head(x).squeeze(dim=1)
+         x = torchvision.transforms.functional.resize(x, self.out_size)
+         heatmap_preds = utils.split_tensors(x, num_ppl_per_img)  # resplit per image
+
+         return {"heatmap": heatmap_preds, "inout": inout_preds if self.inout else None}
+
+     def get_input_head_maps(self, bboxes):
+         # bboxes: [[(xmin, ymin, xmax, ymax)]] - list of list of head bboxes per image
+         head_maps = []
+         for bbox_list in bboxes:
+             img_head_maps = []
+             for bbox in bbox_list:
+                 if bbox is None:  # no bbox provided, use empty head map
+                     img_head_maps.append(torch.zeros(self.featmap_h, self.featmap_w))
+                 else:
+                     xmin, ymin, xmax, ymax = bbox
+                     width, height = self.featmap_w, self.featmap_h
+                     xmin = round(xmin * width)
+                     ymin = round(ymin * height)
+                     xmax = round(xmax * width)
+                     ymax = round(ymax * height)
+                     head_map = torch.zeros((height, width))
+                     head_map[ymin:ymax, xmin:xmax] = 1
+                     img_head_maps.append(head_map)
+             head_maps.append(torch.stack(img_head_maps))
+         return head_maps
+
+     def get_gazelle_state_dict(self, include_backbone=False):
+         if include_backbone:
+             return self.state_dict()
+         else:
+             return {k: v for k, v in self.state_dict().items() if not k.startswith("backbone")}
+
+     def load_gazelle_state_dict(self, ckpt_state_dict, include_backbone=False):
+         current_state_dict = self.state_dict()
+         keys1 = current_state_dict.keys()
+         keys2 = ckpt_state_dict.keys()
+
+         if not include_backbone:
+             keys1 = set([k for k in keys1 if not k.startswith("backbone")])
+             keys2 = set([k for k in keys2 if not k.startswith("backbone")])
+         else:
+             keys1 = set(keys1)
+             keys2 = set(keys2)
+
+         if len(keys2 - keys1) > 0:
+             print("WARNING unused keys in provided state dict: ", keys2 - keys1)
+         if len(keys1 - keys2) > 0:
+             print("WARNING provided state dict does not have values for keys: ", keys1 - keys2)
+
+         for k in list(keys1 & keys2):
+             current_state_dict[k] = ckpt_state_dict[k]
+
+         self.load_state_dict(current_state_dict, strict=False)
+
+
+ # From https://github.com/wzlxjtu/PositionalEncoding2D/blob/master/positionalembedding2d.py
+ def positionalencoding2d(d_model, height, width):
+     """
+     :param d_model: dimension of the model
+     :param height: height of the positions
+     :param width: width of the positions
+     :return: d_model*height*width position matrix
+     """
+     if d_model % 4 != 0:
+         raise ValueError("Cannot use sin/cos positional encoding with "
+                          "odd dimension (got dim={:d})".format(d_model))
+     pe = torch.zeros(d_model, height, width)
+     # Each dimension uses half of d_model
+     d_model = int(d_model / 2)
+     div_term = torch.exp(torch.arange(0., d_model, 2) *
+                          -(math.log(10000.0) / d_model))
+     pos_w = torch.arange(0., width).unsqueeze(1)
+     pos_h = torch.arange(0., height).unsqueeze(1)
+     pe[0:d_model:2, :, :] = torch.sin(pos_w * div_term).transpose(0, 1).unsqueeze(1).repeat(1, height, 1)
+     pe[1:d_model:2, :, :] = torch.cos(pos_w * div_term).transpose(0, 1).unsqueeze(1).repeat(1, height, 1)
+     pe[d_model::2, :, :] = torch.sin(pos_h * div_term).transpose(0, 1).unsqueeze(2).repeat(1, 1, width)
+     pe[d_model + 1::2, :, :] = torch.cos(pos_h * div_term).transpose(0, 1).unsqueeze(2).repeat(1, 1, width)
+
+     return pe
+
+
+ # models
+ def get_gazelle_model(model_name):
+     factory = {
+         "gazelle_dinov2_vitb14": gazelle_dinov2_vitb14,
+         "gazelle_dinov2_vitl14": gazelle_dinov2_vitl14,
+         "gazelle_dinov2_vitb14_inout": gazelle_dinov2_vitb14_inout,
+         "gazelle_dinov2_vitl14_inout": gazelle_dinov2_vitl14_inout,
+     }
+     assert model_name in factory.keys(), "invalid model name"
+     return factory[model_name]()
+
+ def gazelle_dinov2_vitb14():
+     backbone = DinoV2Backbone('dinov2_vitb14')
+     transform = backbone.get_transform((448, 448))
+     model = GazeLLE(backbone)
+     return model, transform
+
+ def gazelle_dinov2_vitl14():
+     backbone = DinoV2Backbone('dinov2_vitl14')
+     transform = backbone.get_transform((448, 448))
+     model = GazeLLE(backbone)
+     return model, transform
+
+ def gazelle_dinov2_vitb14_inout():
+     backbone = DinoV2Backbone('dinov2_vitb14')
+     transform = backbone.get_transform((448, 448))
+     model = GazeLLE(backbone, inout=True)
+     return model, transform
+
+ def gazelle_dinov2_vitl14_inout():
+     backbone = DinoV2Backbone('dinov2_vitl14')
+     transform = backbone.get_transform((448, 448))
+     model = GazeLLE(backbone, inout=True)
+     return model, transform
gazelle/utils.py ADDED
@@ -0,0 +1,39 @@
+ import torch
+ from PIL import Image, ImageDraw
+ import numpy as np
+ import matplotlib.pyplot as plt
+
+ def repeat_tensors(tensor, repeat_counts):
+     repeated_tensors = [tensor[i:i+1].repeat(repeat, *[1] * (tensor.ndim - 1)) for i, repeat in enumerate(repeat_counts)]
+     return torch.cat(repeated_tensors, dim=0)
+
+ def split_tensors(tensor, split_counts):
+     indices = torch.cumsum(torch.tensor([0] + split_counts), dim=0)
+     return [tensor[indices[i]:indices[i+1]] for i in range(len(split_counts))]
+
+ def visualize_heatmap(pil_image, heatmap, bbox=None):
+     if isinstance(heatmap, torch.Tensor):
+         heatmap = heatmap.detach().cpu().numpy()
+     heatmap = Image.fromarray((heatmap * 255).astype(np.uint8)).resize(pil_image.size, Image.Resampling.BILINEAR)
+     heatmap = plt.cm.jet(np.array(heatmap) / 255.)
+     heatmap = (heatmap[:, :, :3] * 255).astype(np.uint8)
+     heatmap = Image.fromarray(heatmap).convert("RGBA")
+     heatmap.putalpha(128)
+     overlay_image = Image.alpha_composite(pil_image.convert("RGBA"), heatmap)
+
+     if bbox is not None:
+         width, height = pil_image.size
+         xmin, ymin, xmax, ymax = bbox
+         draw = ImageDraw.Draw(overlay_image)
+         draw.rectangle([xmin * width, ymin * height, xmax * width, ymax * height], outline="green", width=3)
+     return overlay_image
+
+ def stack_and_pad(tensor_list):
+     max_size = max([t.shape[0] for t in tensor_list])
+     padded_list = []
+     for t in tensor_list:
+         if t.shape[0] == max_size:
+             padded_list.append(t)
+         else:
+             padded_list.append(torch.cat([t, torch.zeros(max_size - t.shape[0], *t.shape[1:])], dim=0))
+     return torch.stack(padded_list)
hubconf.py ADDED
@@ -0,0 +1,28 @@
+ dependencies = ['torch', 'timm']
+
+ import torch
+ from gazelle.model import get_gazelle_model
+
+ def gazelle_dinov2_vitb14():
+     model, transform = get_gazelle_model('gazelle_dinov2_vitb14')
+     ckpt_path = "https://github.com/fkryan/gazelle/releases/download/v1.0.0/gazelle_dinov2_vitb14_hub.pt"
+     model.load_gazelle_state_dict(torch.hub.load_state_dict_from_url(ckpt_path))
+     return model, transform
+
+ def gazelle_dinov2_vitl14():
+     model, transform = get_gazelle_model('gazelle_dinov2_vitl14')
+     ckpt_path = "https://github.com/fkryan/gazelle/releases/download/v1.0.0/gazelle_dinov2_vitl14.pt"
+     model.load_gazelle_state_dict(torch.hub.load_state_dict_from_url(ckpt_path))
+     return model, transform
+
+ def gazelle_dinov2_vitb14_inout():
+     model, transform = get_gazelle_model('gazelle_dinov2_vitb14_inout')
+     ckpt_path = "https://github.com/fkryan/gazelle/releases/download/v1.0.0/gazelle_dinov2_vitb14_inout.pt"
+     model.load_gazelle_state_dict(torch.hub.load_state_dict_from_url(ckpt_path))
+     return model, transform
+
+ def gazelle_dinov2_vitl14_inout():
+     model, transform = get_gazelle_model('gazelle_dinov2_vitl14_inout')
+     ckpt_path = "https://github.com/fkryan/gazelle/releases/download/v1.0.0/gazelle_dinov2_vitl14_inout.pt"
+     model.load_gazelle_state_dict(torch.hub.load_state_dict_from_url(ckpt_path))
+     return model, transform
scripts/eval_gazefollow.py ADDED
@@ -0,0 +1,115 @@
+ import argparse
+ import torch
+ from PIL import Image
+ import json
+ import os
+ import numpy as np
+ from sklearn.metrics import roc_auc_score
+ from tqdm import tqdm
+
+ from gazelle.model import get_gazelle_model
+ from gazelle.model import GazeLLE
+ from gazelle.backbone import DinoV2Backbone
+
+ parser = argparse.ArgumentParser()
+ parser.add_argument("--data_path", type=str, default="./data/gazefollow")
+ parser.add_argument("--model_name", type=str, default="gazelle_dinov2_vitl14_inout")
+ parser.add_argument("--ckpt_path", type=str, default="./checkpoints/gazelle_dinov2_vitl14_inout.pt")
+ parser.add_argument("--batch_size", type=int, default=128)
+ args = parser.parse_args()
+
+ class GazeFollow(torch.utils.data.Dataset):
+     def __init__(self, path, img_transform):
+         self.images = json.load(open(os.path.join(path, "test_preprocessed.json"), "rb"))
+         self.path = path
+         self.transform = img_transform
+
+     def __getitem__(self, idx):
+         item = self.images[idx]
+         image = self.transform(Image.open(os.path.join(self.path, item['path'])).convert("RGB"))
+         height = item['height']
+         width = item['width']
+         bboxes = [head['bbox_norm'] for head in item['heads']]
+         gazex = [head['gazex_norm'] for head in item['heads']]
+         gazey = [head['gazey_norm'] for head in item['heads']]
+
+         return image, bboxes, gazex, gazey, height, width
+
+     def __len__(self):
+         return len(self.images)
+
+ def collate(batch):
+     images, bboxes, gazex, gazey, height, width = zip(*batch)
+     return torch.stack(images), list(bboxes), list(gazex), list(gazey), list(height), list(width)
+
+ # GazeFollow calculates AUC using the original image size, with the GT (x,y) coordinates set to 1 and everything else set to 0
+ # References:
+ # https://github.com/ejcgt/attention-target-detection/blob/acd264a3c9e6002b71244dea8c1873e5c5818500/eval_on_gazefollow.py#L78
+ # https://github.com/ejcgt/attention-target-detection/blob/acd264a3c9e6002b71244dea8c1873e5c5818500/utils/imutils.py#L67
+ # https://github.com/ejcgt/attention-target-detection/blob/acd264a3c9e6002b71244dea8c1873e5c5818500/utils/evaluation.py#L7
+ def gazefollow_auc(heatmap, gt_gazex, gt_gazey, height, width):
+     target_map = np.zeros((height, width))
+     for point in zip(gt_gazex, gt_gazey):
+         if point[0] >= 0:
+             x, y = map(int, [point[0] * float(width), point[1] * float(height)])
+             x = min(x, width - 1)
+             y = min(y, height - 1)
+             target_map[y, x] = 1
+     resized_heatmap = torch.nn.functional.interpolate(heatmap.unsqueeze(dim=0).unsqueeze(dim=0), (height, width), mode='bilinear').squeeze()
+     auc = roc_auc_score(target_map.flatten(), resized_heatmap.cpu().flatten())
+
+     return auc
+
+ # Reference: https://github.com/ejcgt/attention-target-detection/blob/acd264a3c9e6002b71244dea8c1873e5c5818500/eval_on_gazefollow.py#L81
+ def gazefollow_l2(heatmap, gt_gazex, gt_gazey):
+     argmax = heatmap.flatten().argmax().item()
+     pred_y, pred_x = np.unravel_index(argmax, (64, 64))
+     pred_x = pred_x / 64.
+     pred_y = pred_y / 64.
+
+     gazex = np.array(gt_gazex)
+     gazey = np.array(gt_gazey)
+
+     avg_l2 = np.sqrt((pred_x - gazex.mean())**2 + (pred_y - gazey.mean())**2)
+     all_l2s = np.sqrt((pred_x - gazex)**2 + (pred_y - gazey)**2)
+     min_l2 = all_l2s.min().item()
+
+     return avg_l2, min_l2
+
+
+ @torch.no_grad()
+ def main():
+     device = "cuda" if torch.cuda.is_available() else "cpu"
+     print("Running on {}".format(device))
+
+     model, transform = get_gazelle_model(args.model_name)
+     model.load_gazelle_state_dict(torch.load(args.ckpt_path, weights_only=True))
+     model.to(device)
+     model.eval()
+
+     dataset = GazeFollow(args.data_path, transform)
+     dataloader = torch.utils.data.DataLoader(dataset, batch_size=args.batch_size, collate_fn=collate)
+
+     aucs = []
+     min_l2s = []
+     avg_l2s = []
+
+     for _, (images, bboxes, gazex, gazey, height, width) in tqdm(enumerate(dataloader), desc="Evaluating", total=len(dataloader)):
+         preds = model.forward({"images": images.to(device), "bboxes": bboxes})
+
+         # eval each instance (head)
+         for i in range(images.shape[0]):  # per image
+             for j in range(len(bboxes[i])):  # per head
+                 auc = gazefollow_auc(preds['heatmap'][i][j], gazex[i][j], gazey[i][j], height[i], width[i])
+                 avg_l2, min_l2 = gazefollow_l2(preds['heatmap'][i][j], gazex[i][j], gazey[i][j])
+                 aucs.append(auc)
+                 avg_l2s.append(avg_l2)
+                 min_l2s.append(min_l2)
+
+     print("AUC: {}".format(np.array(aucs).mean()))
+     print("Avg L2: {}".format(np.array(avg_l2s).mean()))
+     print("Min L2: {}".format(np.array(min_l2s).mean()))
+
+
+ if __name__ == "__main__":
+     main()
scripts/eval_vat.py ADDED
@@ -0,0 +1,116 @@
+ import argparse
+ import torch
+ from PIL import Image
+ import json
+ import os
+ import numpy as np
+ from sklearn.metrics import roc_auc_score, average_precision_score
+ from tqdm import tqdm
+
+ from gazelle.model import get_gazelle_model
+
+ parser = argparse.ArgumentParser()
+ parser.add_argument("--data_path", type=str, default="./data/videoattentiontarget")
+ parser.add_argument("--model_name", type=str, default="gazelle_dinov2_vitl14_inout")
+ parser.add_argument("--ckpt_path", type=str, default="./checkpoints/gazelle_dinov2_vitl14_inout.pt")
+ parser.add_argument("--batch_size", type=int, default=64)
+ args = parser.parse_args()
+
+ class VideoAttentionTarget(torch.utils.data.Dataset):
+     def __init__(self, path, img_transform):
+         self.sequences = json.load(open(os.path.join(path, "test_preprocessed.json"), "rb"))
+         self.frames = []
+         for i in range(len(self.sequences)):
+             for j in range(len(self.sequences[i]['frames'])):
+                 self.frames.append((i, j))
+         self.path = path
+         self.transform = img_transform
+
+     def __getitem__(self, idx):
+         seq_idx, frame_idx = self.frames[idx]
+         seq = self.sequences[seq_idx]
+         frame = seq['frames'][frame_idx]
+         image = self.transform(Image.open(os.path.join(self.path, frame['path'])).convert("RGB"))
+         bboxes = [head['bbox_norm'] for head in frame['heads']]
+         gazex = [head['gazex_norm'] for head in frame['heads']]
+         gazey = [head['gazey_norm'] for head in frame['heads']]
+         inout = [head['inout'] for head in frame['heads']]
+
+         return image, bboxes, gazex, gazey, inout
+
+     def __len__(self):
+         return len(self.frames)
+
+ def collate(batch):
+     images, bboxes, gazex, gazey, inout = zip(*batch)
+     return torch.stack(images), list(bboxes), list(gazex), list(gazey), list(inout)
+
+ # VideoAttentionTarget calculates AUC on the 64x64 heatmap, defining a rectangular tolerance region of 6*(sigma=3) + 1 pixels
+ # (it uses 2D Gaussian code but binary-thresholds > 0, resulting in a rectangle)
+ # References:
+ # https://github.com/ejcgt/attention-target-detection/blob/acd264a3c9e6002b71244dea8c1873e5c5818500/eval_on_videoatttarget.py#L106
+ # https://github.com/ejcgt/attention-target-detection/blob/acd264a3c9e6002b71244dea8c1873e5c5818500/utils/imutils.py#L31
+ def vat_auc(heatmap, gt_gazex, gt_gazey):
+     res = 64
+     sigma = 3
+     assert heatmap.shape[0] == res and heatmap.shape[1] == res
+     target_map = np.zeros((res, res))
+     gazex = gt_gazex * res
+     gazey = gt_gazey * res
+     ul = [max(0, int(gazex - 3 * sigma)), max(0, int(gazey - 3 * sigma))]
+     br = [min(int(gazex + 3 * sigma + 1), res - 1), min(int(gazey + 3 * sigma + 1), res - 1)]
+     target_map[ul[1]:br[1], ul[0]:br[0]] = 1
+     auc = roc_auc_score(target_map.flatten(), heatmap.cpu().flatten())
+     return auc
+
+ # Reference: https://github.com/ejcgt/attention-target-detection/blob/acd264a3c9e6002b71244dea8c1873e5c5818500/eval_on_videoatttarget.py#L118
+ def vat_l2(heatmap, gt_gazex, gt_gazey):
+     argmax = heatmap.flatten().argmax().item()
+     pred_y, pred_x = np.unravel_index(argmax, (64, 64))
+     pred_x = pred_x / 64.
+     pred_y = pred_y / 64.
+
+     l2 = np.sqrt((pred_x - gt_gazex)**2 + (pred_y - gt_gazey)**2)
+
+     return l2
+
+
+ @torch.no_grad()
+ def main():
+     device = "cuda" if torch.cuda.is_available() else "cpu"
+     print("Running on {}".format(device))
+
+     model, transform = get_gazelle_model(args.model_name)
+     model.load_gazelle_state_dict(torch.load(args.ckpt_path, weights_only=True))
+     model.to(device)
+     model.eval()
+
+     dataset = VideoAttentionTarget(args.data_path, transform)
+     dataloader = torch.utils.data.DataLoader(dataset, batch_size=args.batch_size, collate_fn=collate)
+
+     aucs = []
+     l2s = []
+     inout_preds = []
+     inout_gts = []
+
+     for _, (images, bboxes, gazex, gazey, inout) in tqdm(enumerate(dataloader), desc="Evaluating", total=len(dataloader)):
+         preds = model.forward({"images": images.to(device), "bboxes": bboxes})
+
+         # eval each instance (head)
+         for i in range(images.shape[0]):  # per image
+             for j in range(len(bboxes[i])):  # per head
+                 if inout[i][j] == 1:  # in frame
+                     auc = vat_auc(preds['heatmap'][i][j], gazex[i][j][0], gazey[i][j][0])
+                     l2 = vat_l2(preds['heatmap'][i][j], gazex[i][j][0], gazey[i][j][0])
+                     aucs.append(auc)
+                     l2s.append(l2)
+                 inout_preds.append(preds['inout'][i][j].item())
+                 inout_gts.append(inout[i][j])
+
+     print("AUC: {}".format(np.array(aucs).mean()))
+     print("Avg L2: {}".format(np.array(l2s).mean()))
+     print("Inout AP: {}".format(average_precision_score(inout_gts, inout_preds)))
+
+
+ if __name__ == "__main__":
+     main()
setup.py ADDED
@@ -0,0 +1,9 @@
+ import setuptools
+
+ setuptools.setup(
+     name="gazelle",
+     version="0.0.1",
+     author="Fiona Ryan",
+     description="Gaze-LLE: Gaze Target Estimation via Large-Scale Learned Encoders",
+     packages=setuptools.find_packages()
+ )