Speed Benchmarks with MPS Backend
I'm running a few speed benchmarks with an MPS backend. The task is zero-shot classification through the pipeline on a set of 5,000 documents, repeated 30 times with the average taken.
I know ModernBERT was explicitly optimized for specific hardware, and that MPS currently can't use Flash Attention. So this is less a problem/request than me trying to figure out expected behavior.
The basic benchmark is simple:
model = "mlburnham/Political_DEBATE_ModernBERT_base_v1.0"
pipe = pipeline("zero-shot-classification", model = model,
device = torch.device("mps"), batch_size = 8, torch_dtype = torch.bfloat16)
start_time = time.time()
results = pipe(list(test['premise']), 'This text is about politics.', hypothesis_template='{}', multi_label=False)
# Stop timer
end_time = time.time()
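The full benchmark just wraps that timed call in a loop. Roughly, as a sketch (the 30-run averaging and throughput math spelled out; `test` is my dataframe of 5,000 documents):

```python
# Sketch of the full benchmark: repeat the timed call 30 times and
# average documents/second across runs.
n_runs, n_docs = 30, len(test)
elapsed = []
for _ in range(n_runs):
    start = time.time()
    pipe(list(test['premise']), 'This text is about politics.',
         hypothesis_template='{}', multi_label=False)
    elapsed.append(time.time() - start)

docs_per_sec = n_docs / (sum(elapsed) / n_runs)
print(f"{docs_per_sec:.1f} documents/second")
```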
ModernBERT is noticeably slower than DeBERTa-v3 on MPS: about 120 documents per second vs. 150. Is this at all surprising?
ModernBERT is about 2x as fast with a batch size of 8 as with a batch size of 64. I expect this from DeBERTa, since each batch is padded to its longest sequence, so larger batches mean more padding tokens. But ModernBERT doesn't pad as I understand it, and the pipe doesn't add any padding by default, so I would expect ModernBERT to speed up with larger batches. That doesn't seem to be the case, though. Maybe I'm missing something here?
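One way to sanity-check the padding intuition (a rough sketch, not part of the timed benchmark; assumes `model` and `test` from above): tokenize batches with padding enabled and count what fraction of the resulting tokens is padding. For a model that can't unpad, that fraction is wasted compute and grows with batch size:

```python
from transformers import AutoTokenizer

# Hypothetical check: fraction of padding tokens per padded batch.
# attention_mask is 1 for real tokens and 0 for padding, so
# 1 - mean(mask) is the padded (wasted) fraction.
tok = AutoTokenizer.from_pretrained(model)
docs = list(test['premise'])

for bs in (8, 64):
    pad_frac = []
    for i in range(0, len(docs), bs):
        enc = tok(docs[i:i + bs], padding="longest", truncation=True,
                  return_tensors="pt")
        pad_frac.append(1 - enc["attention_mask"].float().mean().item())
    print(f"batch_size={bs}: ~{100 * sum(pad_frac) / len(pad_frac):.0f}% padding tokens")
```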
> But ModernBERT doesn't pad as I understand it, and the pipe doesn't add any padding by default
ModernBERT uses Flash Attention to support full model unpadding, so any backend which doesn't have a Flash Attention equivalent will require padding tokens and won't benefit from those speedups.
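If it helps, you can verify which attention path you're actually getting by loading the model directly and inspecting the resolved implementation. A rough sketch, assuming a recent transformers version where `attn_implementation` accepts `"eager"`, `"sdpa"`, or `"flash_attention_2"`:

```python
import torch
from transformers import AutoModelForSequenceClassification

# Sketch: check which attention implementation transformers resolved.
# Requesting "flash_attention_2" needs the CUDA-only flash-attn kernels,
# so on MPS you end up on a padded path like "sdpa" or "eager".
m = AutoModelForSequenceClassification.from_pretrained(
    "mlburnham/Political_DEBATE_ModernBERT_base_v1.0",
    torch_dtype=torch.bfloat16,
    attn_implementation="sdpa",  # or "eager"
)
print(m.config._attn_implementation)  # e.g. "sdpa" -> padded attention path
```

Landing on the padded path would be consistent with the batching behavior you're seeing.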
As for your other performance questions, I will have to defer to someone who has an Apple Silicon device.