Cardionix/cardionet-v2 · Hugging Face

CardioNetV2

This is the latest multimodal model in the Cardio Sonix line. It builds on architectures originally developed for computer vision (a modified ResNet) and for natural language processing (an LSTM). The model takes two kinds of input: an audio signal and tabular data.

The model treats the audio signal as a sequence of tokens. First, a mel-cepstrogram is extracted from the audio. This is a sequence of time frames, and each frame is described by N mel-cepstral coefficients. The LSTM reads this mel-cepstrogram and produces an output tensor, which is passed to the ResNet (residual neural network). The ResNet here is a residual network adapted for audio signal processing, and this implementation uses pre-activation residual blocks.

The data then goes to the DenseMixer. The DenseMixer processes the audio features and the tabular features in separate branches. It then concatenates their outputs into a single dense feature vector and processes that vector further. The result is a prediction based on both the audio and the tabular data.
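The data flow described above (mel-cepstrogram tokens → LSTM → pre-activation ResNet → DenseMixer fusion with tabular features) can be illustrated with the minimal PyTorch sketch below. It is not the released CardioNetV2 code. All class names, layer sizes, the choice of 1D convolutions over the LSTM output, global average pooling over time, and the DenseMixer branch widths are hypothetical assumptions made for illustration. Mel-cepstral feature extraction is assumed to have already happened, so the input is a tensor of shape (batch, time, n_mfcc).

```python
import torch
import torch.nn as nn


class PreActResBlock1d(nn.Module):
    """Hypothetical pre-activation residual block: (BN -> ReLU -> Conv) twice, plus identity skip."""

    def __init__(self, channels: int, kernel_size: int = 3):
        super().__init__()
        padding = kernel_size // 2  # keeps the time length unchanged for odd kernel sizes
        self.bn1 = nn.BatchNorm1d(channels)
        self.conv1 = nn.Conv1d(channels, channels, kernel_size, padding=padding, bias=False)
        self.bn2 = nn.BatchNorm1d(channels)
        self.conv2 = nn.Conv1d(channels, channels, kernel_size, padding=padding, bias=False)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        out = self.conv1(torch.relu(self.bn1(x)))
        out = self.conv2(torch.relu(self.bn2(out)))
        return x + out  # identity shortcut; no activation after the addition


class DenseMixer(nn.Module):
    """Hypothetical fusion head: separate dense branches for audio and tabular features."""

    def __init__(self, audio_dim: int, tab_dim: int, hidden: int, n_classes: int):
        super().__init__()
        self.audio_branch = nn.Sequential(nn.Linear(audio_dim, hidden), nn.ReLU())
        self.tab_branch = nn.Sequential(nn.Linear(tab_dim, hidden), nn.ReLU())
        self.head = nn.Sequential(
            nn.Linear(2 * hidden, hidden), nn.ReLU(), nn.Linear(hidden, n_classes)
        )

    def forward(self, audio_feat: torch.Tensor, tab_feat: torch.Tensor) -> torch.Tensor:
        # Process each modality separately, then concatenate into one dense vector.
        fused = torch.cat([self.audio_branch(audio_feat), self.tab_branch(tab_feat)], dim=1)
        return self.head(fused)


class CardioNetV2Sketch(nn.Module):
    """Hypothetical end-to-end sketch: LSTM -> pre-activation ResNet -> DenseMixer."""

    def __init__(self, n_mfcc=40, lstm_hidden=64, n_blocks=2, tab_dim=8,
                 mixer_hidden=32, n_classes=2):
        super().__init__()
        self.lstm = nn.LSTM(input_size=n_mfcc, hidden_size=lstm_hidden, batch_first=True)
        self.resnet = nn.Sequential(*[PreActResBlock1d(lstm_hidden) for _ in range(n_blocks)])
        self.final_bn = nn.BatchNorm1d(lstm_hidden)  # final BN+ReLU, as in pre-activation ResNets
        self.mixer = DenseMixer(lstm_hidden, tab_dim, mixer_hidden, n_classes)

    def forward(self, mel: torch.Tensor, tabular: torch.Tensor) -> torch.Tensor:
        # mel: (batch, time, n_mfcc); each time frame acts as one "token"
        seq, _ = self.lstm(mel)                 # (batch, time, lstm_hidden)
        x = self.resnet(seq.transpose(1, 2))    # Conv1d expects (batch, channels, time)
        x = torch.relu(self.final_bn(x))
        audio_feat = x.mean(dim=2)              # global average pooling over time
        return self.mixer(audio_feat, tabular)  # (batch, n_classes) logits


model = CardioNetV2Sketch().eval()
with torch.no_grad():
    mel = torch.randn(4, 100, 40)   # 4 recordings, 100 frames, 40 mel-cepstral coefficients
    tab = torch.randn(4, 8)         # 4 rows of 8 tabular features
    logits = model(mel, tab)
print(logits.shape)  # torch.Size([4, 2])
```

Running the LSTM first lets the ResNet operate on sequence-aware features rather than raw coefficients. Keeping the audio and tabular branches separate until the concatenation lets each modality be projected to a common width before they are mixed.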