jfang/mars-vit-base-ctx2m

Model Card for Mars ViT Base Model

Model Architecture

Architecture: Vision Transformer (ViT) Base
Input Channels: 1 (grayscale images)
Number of Classes: 0 (feature extraction only)

Training Method

Method: Masked Autoencoder (MAE)
Dataset: 2 million CTX images
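
To illustrate what MAE pre-training does with an image, here is a minimal sketch of the random patch masking step. In MAE the encoder only sees the visible patches, and a decoder reconstructs the masked ones. The mask ratio of 0.75 is an assumption taken from the common MAE default; this card does not state the ratio used for this model. The function name `random_masking` and the `seed` argument are illustrative only.

```python
import random

def random_masking(num_patches, mask_ratio=0.75, seed=None):
    """Split patch indices into a visible set and a masked set."""
    rng = random.Random(seed)
    num_keep = int(num_patches * (1 - mask_ratio))
    order = list(range(num_patches))
    rng.shuffle(order)                   # random permutation of patch indices
    keep = sorted(order[:num_keep])      # patches passed to the encoder
    masked = sorted(order[num_keep:])    # patches the decoder must reconstruct
    return keep, masked

# A 224x224 input with 16x16 patches gives 14 * 14 = 196 patches.
# mask_ratio=0.75 is an assumed value, not taken from this model card.
keep, masked = random_masking(196, mask_ratio=0.75, seed=0)
print(len(keep), len(masked))  # 49 147
```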

Usage Examples

Using timm (currently recommended)

First, download checkpoint-1199.pth (backbone only).

import timm
import torch

model = timm.create_model(
    'vit_base_patch16_224',
    in_chans=1,
    num_classes=0,
    global_pool='',
    checkpoint_path="./checkpoint-1199.pth" # must use local path
)

model.eval()

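# Real images must be converted to a single channel, resized to 224x224, and normalized.

# Example transform (requires: from torchvision import transforms):
# transform = transforms.Compose([
#     transforms.ToTensor(),
#     transforms.Resize((224, 224)),
#     transforms.Grayscale(num_output_channels=1),
#     transforms.Normalize(mean=[0.5], std=[0.5])
# ])

# Random tensor standing in for one preprocessed image: [batch, channels, height, width]
x = torch.randn(1, 1, 224, 224)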
with torch.no_grad():
    features = model.forward_features(x)  # shape [1, tokens, embed_dim]
print(features.shape)

cls_token = features[:, 0]
patch_tokens = features[:, 1:]
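
The shape printed above can be worked out by hand. The sketch below assumes the standard ViT-Base settings behind `vit_base_patch16_224` (16x16 patches, embedding dimension 768) and one class token in front of the patch tokens, as the `features[:, 0]` indexing above implies. The helper name `vit_feature_shape` is hypothetical.

```python
def vit_feature_shape(batch=1, img_size=224, patch_size=16, embed_dim=768):
    """Return the expected forward_features shape and the patch grid size."""
    grid = img_size // patch_size   # patches per side
    num_patches = grid * grid       # patches per image
    tokens = num_patches + 1        # plus one class token
    return (batch, tokens, embed_dim), grid

shape, grid = vit_feature_shape()
print(shape)  # (1, 197, 768)
print(grid)   # 14
```

Under these assumptions, `patch_tokens` has shape [1, 196, 768] and can be reshaped to a 14x14 spatial grid, for example with `patch_tokens.reshape(1, 14, 14, 768)`.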

Using transformers

from transformers import AutoModel, AutoImageProcessor

model = AutoModel.from_pretrained("jfang/mars-vit-base-ctx2m")
image_processor = AutoImageProcessor.from_pretrained("jfang/mars-vit-base-ctx2m")

# Example usage
from PIL import Image
image = Image.open("some_image.png").convert("L")  # convert to single-channel grayscale
inputs = image_processor(image, return_tensors="pt")

outputs = model(**inputs)

MAE reconstruction

The ./mae folder contains the full encoder-decoder MAE model and a notebook for visualization.

Limitations

The model was trained only on CTX images and may not generalize well to other types of images without further fine-tuning. It is designed for feature extraction and does not include a classification head.
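
Since no classification head is included, one common way to use the backbone for a downstream task is to train a small linear layer on top of its features. The following is only a sketch under stated assumptions: `LinearProbe` is a hypothetical class, `num_classes=5` is an arbitrary placeholder, and a random tensor stands in for the output of `model.forward_features(x)` with the ViT-Base shape [batch, 197, 768].

```python
import torch
import torch.nn as nn

class LinearProbe(nn.Module):
    """Hypothetical linear head applied to the class token of ViT features."""

    def __init__(self, embed_dim=768, num_classes=5):
        super().__init__()
        self.head = nn.Linear(embed_dim, num_classes)

    def forward(self, features):
        cls_token = features[:, 0]   # [batch, embed_dim]
        return self.head(cls_token)  # [batch, num_classes]

# Stand-in for backbone output; in practice use model.forward_features(x).
features = torch.randn(4, 197, 768)

probe = LinearProbe(embed_dim=768, num_classes=5)
logits = probe(features)
print(tuple(logits.shape))  # (4, 5)
```

Whether to keep the backbone frozen or fine-tune it together with the head depends on the task and the amount of labeled data.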