Speed Benchmarks with MPS Backend

#47 opened by mlburnham

I'm running a few speed benchmarks with an MPS backend. The task is zero-shot classification through the pipeline on a set of 5,000 documents; I repeat the run 30 times and take the average.

I know ModernBERT was explicitly optimized for different hardware, and that MPS cannot currently use Flash Attention. So this is less a problem/feature request and more me trying to figure out expected behavior.

The basic benchmark is simple:

model = "mlburnham/Political_DEBATE_ModernBERT_base_v1.0"
pipe = pipeline("zero-shot-classification", model = model, 
                device = torch.device("mps"), batch_size = 8, torch_dtype = torch.bfloat16)

start_time = time.time()
results = pipe(list(test['premise']), 'This text is about politics.', hypothesis_template='{}', multi_label=False)
# Stop timer
end_time = time.time()
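
The snippet above times a single pass; here is a minimal sketch of the repeat-and-average protocol I described (the benchmark helper and its parameter names are mine, not part of the pipeline API):

import time

def benchmark(pipe, docs, n_runs=30):
    # Average wall-clock time over n_runs full passes, then convert to throughput
    per_run = []
    for _ in range(n_runs):
        start = time.time()
        pipe(docs, 'This text is about politics.', hypothesis_template='{}', multi_label=False)
        per_run.append(time.time() - start)
    mean_seconds = sum(per_run) / n_runs
    return len(docs) / mean_seconds  # documents per second

print(f"{benchmark(pipe, list(test['premise'])):.1f} docs/sec")
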
  1. ModernBERT is noticeably slower than DeBERTa-v3 on MPS: about 120 documents per second vs. 150. Is this at all surprising?

  2. ModernBERT is about 2x as fast with a batch size of 8 as with a batch size of 64. I expect this from DeBERTa because of padding to the longest sequence in a batch, but ModernBERT doesn't pad as I understand it, and the pipeline doesn't add any padding by default. So I would expect ModernBERT to speed up with larger batches, which doesn't seem to be the case. Maybe I'm missing something here?
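
For reference, a rough sketch of how I compare the two batch sizes, rebuilding the pipeline each time since batch_size is set at construction (everything else matches the benchmark above):

import time

import torch
from transformers import pipeline

docs = list(test['premise'])  # same 5,000-document set
for bs in (8, 64):
    pipe = pipeline("zero-shot-classification", model=model,
                    device=torch.device("mps"), batch_size=bs,
                    torch_dtype=torch.bfloat16)
    start = time.time()
    pipe(docs, 'This text is about politics.', hypothesis_template='{}', multi_label=False)
    elapsed = time.time() - start
    print(f"batch_size={bs}: {len(docs) / elapsed:.1f} docs/sec")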

Answer.AI org

"But ModernBERT doesn't pad as I understand it, and the pipe doesn't add any padding by default"

ModernBERT uses Flash Attention to support full model unpadding, so any backend which doesn't have a Flash Attention equivalent will require padding tokens and won't benefit from those speedups.
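
One way to see that overhead concretely is to tokenize the documents in batches and count pad tokens. This is only a sketch that counts the premise side (the zero-shot pipeline actually encodes premise-hypothesis pairs), but the trend with batch size is the same: larger batches pad every sequence to a longer maximum, so more of the compute is spent on padding.

from transformers import AutoTokenizer

tok = AutoTokenizer.from_pretrained("mlburnham/Political_DEBATE_ModernBERT_base_v1.0")
docs = list(test['premise'])

for bs in (8, 64):
    pad, total = 0, 0
    for i in range(0, len(docs), bs):
        # Dynamic padding: every sequence is padded to the longest one in its batch
        enc = tok(docs[i:i + bs], padding=True)
        for ids in enc["input_ids"]:
            pad += ids.count(tok.pad_token_id)
            total += len(ids)
    print(f"batch_size={bs}: {pad / total:.1%} of tokens are padding")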

As for your other performance questions, I will have to defer to someone who has an Apple Silicon device.
