Memory usage spiking abnormally high during load test
The model is deployed on an AWS EC2 instance with the following configuration:
- Instance type: r5.8xlarge
- Resources:
  - Limits:
    - CPU: 16
    - Memory: 60Gi
  - Requests:
    - CPU: 8
    - Memory: 8Gi
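For reference, here is a minimal sketch of how the effective memory limit can be double-checked from inside the container (assuming a cgroup-based deployment; the file paths differ between cgroup v1 and v2, so treat this as illustrative):

# sketch: read the memory limit the process actually sees (cgroup v2, with a v1 fallback)
from pathlib import Path

def effective_memory_limit_bytes():
    candidates = (
        "/sys/fs/cgroup/memory.max",                    # cgroup v2
        "/sys/fs/cgroup/memory/memory.limit_in_bytes",  # cgroup v1
    )
    for path in candidates:
        p = Path(path)
        if p.exists():
            value = p.read_text().strip()
            return None if value == "max" else int(value)
    return None  # no cgroup limit visible

print(effective_memory_limit_bytes())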
During load testing with Locust and the number of users set to 1, the model responds at 1 RPS with an average response time of around 970 ms (quite good). The metrics usage snapshot is attached below.
(Note: load testing was done from 17:00 to around 18:00.)
However, when the number of users is set to 2, almost 100% of the responses fail and the application ultimately shuts down. The memory and CPU usage snapshot is provided below.
(Note: load testing was done around 20:00.)
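For completeness, the load is generated with a Locust user along these lines (sketch only; the endpoint path, host, and audio file name are placeholders, but the payload fields match what the server code below expects):

# sketch of the Locust user driving the load test (endpoint path and file name are placeholders)
import base64
from locust import HttpUser, task, between

with open("sample_16khz.wav", "rb") as f:
    AUDIO_B64 = base64.b64encode(f.read()).decode("utf-8")

class TranscriptionUser(HttpUser):
    wait_time = between(1, 2)

    @task
    def transcribe(self):
        payload = {
            "requestId": "load-test",
            "audioRecords": [
                {
                    "audioId": "audio-1",
                    "base64Encoding": AUDIO_B64,
                    "sampleRate": 16000,
                }
            ],
        }
        self.client.post("/transcribe", json=payload)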
The audio files are sent base64-encoded in the request. On the server side, they are decoded back to raw audio data and then run through the model. Here is the code snippet.
Note: throughout the load testing, the batch size was 1, i.e. only a single base64-encoded audio file was sent per request.
Code snippet:
import base64
import io
import threading
import time

import librosa
import torch
import torchaudio
from transformers import AutoProcessor, SeamlessM4TModel

seamless_processor = AutoProcessor.from_pretrained("facebook/hf-seamless-m4t-medium")
seamless_model = SeamlessM4TModel.from_pretrained("facebook/hf-seamless-m4t-medium")
model_lock = threading.Lock()  # to prevent errors due to concurrent access to the model

# for transcribing audios
def seamless(dec_audio, orig_freq=16_000, new_freq=16_000):
    # librosa returns a numpy array, so convert to a tensor before resampling to the model's expected rate
    audio = torchaudio.functional.resample(torch.as_tensor(dec_audio), orig_freq=orig_freq, new_freq=new_freq)
    audio_inputs = seamless_processor(audios=audio, return_tensors="pt")
    output_tokens = seamless_model.generate(**audio_inputs, tgt_lang="eng", generate_speech=False)
    translated_text_from_audio = seamless_processor.decode(output_tokens[0].tolist()[0], skip_special_tokens=True)
    return translated_text_from_audio

def get_transcriptions(requestId, audio_records, model_type, modelHash):
    batch_size = len(audio_records)
    audio_ids = []
    decoded_audios = []
    transcriptions = []
    audio_decoding_times = []
    transcription_times = []
    # base64 decoding starts here
    for record in audio_records:
        audio_ids.append(record['audioId'])
        decode_start_time = time.time()
        decoded_audio = base64.b64decode(record['base64Encoding'])
        audio_data = io.BytesIO(decoded_audio)
        data, sr = librosa.load(audio_data)
        if sr != record['sampleRate']:
            resampled_audio = librosa.resample(data, orig_sr=sr, target_sr=record['sampleRate'])
            decoded_audios.append(resampled_audio)
        else:
            decoded_audios.append(data)
        audio_decoding_times.append(round(time.time() - decode_start_time, 2))
    # transcription starts here
    for decoded_audio in decoded_audios:
        transcribe_start_time = time.time()
        with model_lock:
            transcriptions.append(seamless(decoded_audio))
        transcription_times.append(round(time.time() - transcribe_start_time, 2))
    return audio_decoding_times, transcription_times, transcriptions
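To put the memory numbers in context, a quick check like this gives the baseline footprint of the loaded model weights alone (sketch; actual resident memory during generate will be higher because of activations, the decoding cache, and the decoded audio buffers):

# sketch: rough in-memory size of the model weights (weights and buffers only, excludes runtime allocations)
param_bytes = sum(p.numel() * p.element_size() for p in seamless_model.parameters())
buffer_bytes = sum(b.numel() * b.element_size() for b in seamless_model.buffers())
print(f"model weights ~ {(param_bytes + buffer_bytes) / 1024**3:.2f} GiB")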
Can anyone explain the reason behind this high memory usage, and suggest a possible fix? Thanks.