Synchformer Hugging Face Model

This repository contains a Synchformer model for audio-visual synchronization. Given a video clip, the model predicts the temporal offset between its audio and visual tracks.

Usage

import torch
import os
import sys

# Make the repo's auto_factory module importable (assumes this script
# is run from inside a local clone of this repository)
current_dir = os.path.dirname(os.path.abspath(__file__))
if current_dir not in sys.path:
    sys.path.insert(0, current_dir)

# Import the auto factory to register the model
import auto_factory

# Now you can use the model with transformers
from transformers import AutoConfig, AutoModel

# Load the model and move it to GPU if available
model = AutoModel.from_pretrained("AmrMKayid/synchformer-hf")
model = model.to("cuda" if torch.cuda.is_available() else "cpu")
model.eval()  # inference mode

# Predict offset for a video
results = model.predict_offset(
    "path/to/your/video.mp4",
    offset_sec=0.0,  # Ground truth offset (if known)
    v_start_i_sec=0.0  # Start time in seconds for video
)

# Print results
print("\nPrediction Results:")
for pred in results["predictions"]:
    print(f'  p={pred["probability"]:.4f}, offset={pred["offset_sec"]:.2f}s (class {pred["class_idx"]})')
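
To turn the full distribution into a single estimate, take the highest-probability entry. This sketch assumes only the predictions list shown above, with its probability and offset_sec keys:

# Pick the most confident prediction as the final offset estimate
best = max(results["predictions"], key=lambda p: p["probability"])
print(f'Estimated offset: {best["offset_sec"]:.2f}s (p={best["probability"]:.4f})')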

Model Details

This model is based on the Synchformer architecture, which extracts audio and visual features from a clip and feeds them to a transformer that predicts the offset between the two tracks. Prediction is framed as classification over a discrete grid of candidate offsets, which is why each result in the usage example carries a class_idx alongside its offset_sec.
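
As an illustration of this classification framing, the sketch below converts per-class probabilities into a single expected offset. The grid used here (21 classes from -2.0s to +2.0s) is an assumption for illustration only; the actual grid is defined by the model's configuration:

import torch

# Hypothetical offset grid: 21 candidate offsets from -2.0s to +2.0s in 0.2s steps.
# The real grid comes from the model config; these values are illustrative only.
offsets_sec = torch.linspace(-2.0, 2.0, steps=21)

logits = torch.randn(21)  # stand-in for the model's per-class offset logits
probs = torch.softmax(logits, dim=-1)

# A probability-weighted mean turns the discrete classes into a continuous estimate
expected_offset = (probs * offsets_sec).sum().item()
print(f"Expected offset: {expected_offset:.2f}s")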

Requirements

  • torch
  • torchaudio
  • torchvision
  • omegaconf
  • transformers (used to load the model in the usage example)
  • ffmpeg (for video processing)
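
The Python packages can be installed with pip (package names are assumed to match their PyPI distributions); ffmpeg is a system tool and must be installed separately, e.g. through your OS package manager:

pip install torch torchaudio torchvision omegaconf transformers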