Model Card for Mars ViT Base Model
Model Architecture
- Architecture: Vision Transformer (ViT) Base
- Input Channels: 1 (grayscale images)
- Number of Classes: 0 (feature extraction only; no classification head)
Training Method
- Method: Masked Autoencoder (MAE)
- Dataset: 2 million CTX (Mars Reconnaissance Orbiter Context Camera) images
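As a rough illustration of the masking step in MAE pretraining, the sketch below randomly splits the patch indices of one image into visible and masked sets. The patch count follows from the 224x224 input and 16x16 patches implied by the timm model name used in this card. The 75% mask ratio and the helper name `random_patch_mask` are assumptions for illustration; this card does not state the ratio used for training.

```python
import random

def random_patch_mask(num_patches, mask_ratio, seed=None):
    """Split patch indices into (visible, masked) lists, MAE-style."""
    rng = random.Random(seed)
    indices = list(range(num_patches))
    rng.shuffle(indices)
    num_masked = int(num_patches * mask_ratio)
    masked = sorted(indices[:num_masked])    # hidden from the encoder
    visible = sorted(indices[num_masked:])   # the only patches the encoder sees
    return visible, masked

# A 224x224 image cut into 16x16 patches gives a 14x14 grid.
NUM_PATCHES = (224 // 16) ** 2   # 196
MASK_RATIO = 0.75                # assumption: not stated in this card

visible, masked = random_patch_mask(NUM_PATCHES, MASK_RATIO, seed=0)
print(len(visible), len(masked))
```

In MAE, the encoder processes only the visible patches, and a lightweight decoder reconstructs the masked ones; the backbone checkpoint contains the encoder alone.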
Usage Examples
Using timm (recommended)
First, download checkpoint-1199.pth (backbone only).
import timm
import torch
model = timm.create_model(
    'vit_base_patch16_224',
    in_chans=1,       # grayscale input
    num_classes=0,    # no classification head
    global_pool='',   # no pooling: keep all tokens
    checkpoint_path="./checkpoint-1199.pth"  # must be a local path
)
model.eval()
# Input images must be converted to a single channel, resized to 224x224, and normalized.
# Example transform (requires `from torchvision import transforms`):
# transform = transforms.Compose([
#     transforms.ToTensor(),
#     transforms.Resize((224, 224)),
#     transforms.Grayscale(num_output_channels=1),
#     transforms.Normalize(mean=[0.5], std=[0.5])
# ])
x = torch.randn(1, 1, 224, 224)  # dummy batch: one grayscale 224x224 image
with torch.no_grad():
    features = model.forward_features(x)  # shape [1, tokens, embed_dim]
print(features.shape)
cls_token = features[:, 0]      # class token
patch_tokens = features[:, 1:]  # one token per image patch
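For dense downstream tasks it can help to arrange the patch tokens as a 2D feature map. The sketch below shows one way to do this, along with a mean-pooled image descriptor. It uses a random tensor as a stand-in for the backbone output, and it assumes the standard ViT-Base/16 layout at 224x224 (one class token plus 14x14 patch tokens, embedding width 768) rather than reading the shape from the checkpoint.

```python
import torch

# Stand-in for the output of model.forward_features(x).
# Assumed shape: 1 class token + 14*14 patch tokens, embedding width 768.
features = torch.randn(1, 197, 768)

cls_token = features[:, 0]        # [1, 768]
patch_tokens = features[:, 1:]    # [1, 196, 768]

batch, num_patches, embed_dim = patch_tokens.shape
grid = int(num_patches ** 0.5)    # 14 patches per side

# [B, N, C] -> [B, C, H, W]; patch tokens are in row-major order
feature_map = patch_tokens.reshape(batch, grid, grid, embed_dim).permute(0, 3, 1, 2)

# Mean over patch tokens gives one vector per image
pooled = patch_tokens.mean(dim=1)  # [1, 768]

print(feature_map.shape, pooled.shape)
```

Whether the class token or the mean-pooled patch tokens works better as an image embedding depends on the downstream task and is worth checking empirically.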
Using transformers
from transformers import AutoModel, AutoImageProcessor
model = AutoModel.from_pretrained("jfang/mars-vit-base-ctx2m")
image_processor = AutoImageProcessor.from_pretrained("jfang/mars-vit-base-ctx2m")
# Example usage
from PIL import Image
image = Image.open("some_image.png").convert("L") # 1-channel
inputs = image_processor(image, return_tensors="pt")
outputs = model(**inputs)
MAE reconstruction
The ./mae folder contains the full encoder-decoder MAE model and a notebook for visualizing reconstructions.
Limitations
The model is trained specifically on CTX images and may not generalize well to other types of images without further fine-tuning. The model is designed for feature extraction and does not include a classification head.
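Because the backbone ships without a classification head, a downstream classifier has to be added separately. The following is a minimal sketch of a linear probe on the class token, not part of the released model: the `LinearProbe` class and the number of classes are hypothetical, and a random tensor stands in for the backbone features.

```python
import torch
import torch.nn as nn

EMBED_DIM = 768     # ViT-Base embedding width
NUM_CLASSES = 5     # hypothetical: depends on the downstream task

class LinearProbe(nn.Module):
    """Single linear layer applied to the class token of frozen features."""

    def __init__(self, embed_dim, num_classes):
        super().__init__()
        self.head = nn.Linear(embed_dim, num_classes)

    def forward(self, features):
        # features: [B, tokens, embed_dim]; index 0 is the class token
        return self.head(features[:, 0])

probe = LinearProbe(EMBED_DIM, NUM_CLASSES)

# Stand-in for backbone output on a batch of 4 images
features = torch.randn(4, 197, EMBED_DIM)
logits = probe(features)
print(logits.shape)
```

In practice the features would be computed under `torch.no_grad()` with the backbone frozen, so that only the probe's weights are trained.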