{"cells":[{"cell_type":"markdown","id":"63b2fd12-a1a3-45a7-949e-64b903c5d2d5","metadata":{"id":"63b2fd12-a1a3-45a7-949e-64b903c5d2d5"},"source":["# Automatic speech recognition using Distil-Whisper and OpenVINO\n","\n","[Distil-Whisper](https://huggingface.co/distil-whisper/distil-large-v2) is a distilled variant of the [Whisper](https://huggingface.co/openai/whisper-large-v2) model by OpenAI. The Distil-Whisper is proposed in the paper [Robust Knowledge Distillation via Large-Scale Pseudo Labelling](https://arxiv.org/abs/2311.00430). According to authors, compared to Whisper, Distil-Whisper runs in several times faster with 50% fewer parameters, while performing to within 1% word error rate (WER) on out-of-distribution evaluation data.\n","\n","Whisper is a Transformer based encoder-decoder model, also referred to as a sequence-to-sequence model. It maps a sequence of audio spectrogram features to a sequence of text tokens. First, the raw audio inputs are converted to a log-Mel spectrogram by action of the feature extractor. Then, the Transformer encoder encodes the spectrogram to form a sequence of encoder hidden states. Finally, the decoder autoregressively predicts text tokens, conditional on both the previous tokens and the encoder hidden states.\n","\n","You can see the model architecture in the diagram below:\n","\n","![whisper_architecture.svg](https://user-images.githubusercontent.com/29454499/204536571-8f6d8d77-5fbd-4c6d-8e29-14e734837860.svg)\n","\n","In this tutorial, we consider how to run Distil-Whisper using OpenVINO. We will use the pre-trained model from the [Hugging Face Transformers](https://huggingface.co/docs/transformers/index) library. To simplify the user experience, the [Hugging Face Optimum](https://huggingface.co/docs/optimum) library is used to convert the model to OpenVINO™ IR format. 
To further improve the performance of the OpenVINO Distil-Whisper model, `INT8` post-training quantization from [NNCF](https://github.com/openvinotoolkit/nncf/) is applied.\n","\n","#### Table of contents:\n","\n","- [Prerequisites](#Prerequisites)\n","- [Load PyTorch model](#Load-PyTorch-model)\n"," - [Prepare input sample](#Prepare-input-sample)\n"," - [Run model inference](#Run-model-inference)\n","- [Load OpenVINO model using Optimum library](#Load-OpenVINO-model-using-Optimum-library)\n"," - [Select Inference device](#Select-Inference-device)\n"," - [Compile OpenVINO model](#Compile-OpenVINO-model)\n"," - [Run OpenVINO model inference](#Run-OpenVINO-model-inference)\n","- [Compare performance PyTorch vs OpenVINO](#Compare-performance-PyTorch-vs-OpenVINO)\n","- [Usage OpenVINO model with HuggingFace pipelines](#Usage-OpenVINO-model-with-HuggingFace-pipelines)\n","- [Quantization](#Quantization)\n"," - [Prepare calibration datasets](#Prepare-calibration-datasets)\n"," - [Quantize Distil-Whisper encoder and decoder models](#Quantize-Distil-Whisper-encoder-and-decoder-models)\n"," - [Run quantized model inference](#Run-quantized-model-inference)\n"," - [Compare performance and accuracy of the original and quantized models](#Compare-performance-and-accuracy-of-the-original-and-quantized-models)\n","- [Interactive demo](#Interactive-demo)\n","\n"]},{"cell_type":"markdown","id":"22bf06fc-5988-4e3d-9d81-7fe23ff18131","metadata":{"id":"22bf06fc-5988-4e3d-9d81-7fe23ff18131"},"source":["## Prerequisites\n","[back to top ⬆️](#Table-of-contents:)\n"]},{"cell_type":"code","execution_count":null,"id":"bb9fc7f3-cea0-4adf-9ee6-4a3d15931db7","metadata":{"ExecuteTime":{"end_time":"2023-11-08T15:05:38.239262600Z","start_time":"2023-11-08T15:05:38.138403800Z"},"id":"bb9fc7f3-cea0-4adf-9ee6-4a3d15931db7","outputId":"6afab297-3c02-494d-aef2-0f5183b2f61b"},"outputs":[{"name":"stdout","output_type":"stream","text":["Note: you may need to restart the kernel to use updated packages.\n","Note: you may need to restart the kernel to use updated packages.\n","Note: you may need to restart the kernel to use updated packages.\n"]}],"source":["%pip install -q \"transformers>=4.35\" \"torch>=2.1\" onnx \"git+https://github.com/huggingface/optimum-intel.git\" \"peft==0.6.2\" --extra-index-url https://download.pytorch.org/whl/cpu\n","%pip install -q \"openvino>=2023.2.0\" datasets \"gradio>=4.0\" \"librosa\" \"soundfile\"\n","%pip install -q \"nncf>=2.6.0\" \"jiwer\""]},{"cell_type":"markdown","id":"34bbdf5e-0e4c-482c-a08a-395972c8b56f","metadata":{"id":"34bbdf5e-0e4c-482c-a08a-395972c8b56f"},"source":["## Load PyTorch model\n","[back to top ⬆️](#Table-of-contents:)\n","\n","The `AutoModelForSpeechSeq2Seq.from_pretrained` method is used to initialize the PyTorch Whisper model from the Transformers library. By default, we use the `distil-whisper/distil-large-v2` model as an example in this tutorial. The model is downloaded once during the first run, which may take some time.\n","\n","You may also choose other models from the [Distil-Whisper Hugging Face collection](https://huggingface.co/collections/distil-whisper/distil-whisper-models-65411987e6727569748d2eb6), such as `distil-whisper/distil-medium.en` or `distil-whisper/distil-small.en`. Models of the original Whisper architecture are also available; read more about them [here](https://huggingface.co/openai).\n","\n","Preprocessing and post-processing are important when using this model. 
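As a reference, a minimal sketch of initializing the model together with its processor (assuming the default `distil-whisper/distil-large-v2` checkpoint; the cells below let you pick a checkpoint interactively) could look as follows:\n","\n","```python\n","from transformers import AutoModelForSpeechSeq2Seq, AutoProcessor\n","\n","model_id = 'distil-whisper/distil-large-v2'  # assumed checkpoint; selected via a widget in this notebook\n","\n","# the processor bundles the feature extractor and the tokenizer\n","processor = AutoProcessor.from_pretrained(model_id)\n","pt_model = AutoModelForSpeechSeq2Seq.from_pretrained(model_id)\n","pt_model.eval()\n","```\n","\n","The 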
`AutoProcessor` class used for initialization `WhisperProcessor` is responsible for preparing audio input data for the model, converting it to Mel-spectrogram and decoding predicted output token_ids into string using tokenizer."]},{"cell_type":"code","execution_count":null,"id":"756cd923","metadata":{"jupyter":{"outputs_hidden":false},"id":"756cd923","outputId":"0e488fb2-861b-4875-a0c7-ebfdfe9e34d8","colab":{"referenced_widgets":["c6fb7194b8a646818f053dad155e482f"]}},"outputs":[{"data":{"application/vnd.jupyter.widget-view+json":{"model_id":"c6fb7194b8a646818f053dad155e482f","version_major":2,"version_minor":0},"text/plain":["Dropdown(description='Model type:', options=('Distil-Whisper', 'Whisper'), value='Distil-Whisper')"]},"execution_count":2,"metadata":{},"output_type":"execute_result"}],"source":["import ipywidgets as widgets\n","\n","model_ids = {\n"," \"Distil-Whisper\": [\n"," \"distil-whisper/distil-large-v2\",\n"," \"distil-whisper/distil-medium.en\",\n"," \"distil-whisper/distil-small.en\",\n"," ],\n"," \"Whisper\": [\n"," \"openai/whisper-large-v3\",\n"," \"openai/whisper-large-v2\",\n"," \"openai/whisper-large\",\n"," \"openai/whisper-medium\",\n"," \"openai/whisper-small\",\n"," \"openai/whisper-base\",\n"," \"openai/whisper-tiny\",\n"," \"openai/whisper-medium.en\",\n"," \"openai/whisper-small.en\",\n"," \"openai/whisper-base.en\",\n"," \"openai/whisper-tiny.en\",\n"," ],\n","}\n","\n","model_type = widgets.Dropdown(\n"," options=model_ids.keys(),\n"," value=\"Distil-Whisper\",\n"," description=\"Model type:\",\n"," disabled=False,\n",")\n","\n","model_type"]},{"cell_type":"code","execution_count":null,"id":"de1107e4","metadata":{"jupyter":{"outputs_hidden":false},"id":"de1107e4","outputId":"2464f082-9850-4c39-ab39-7636f8858513","colab":{"referenced_widgets":["984534f124ac4de0abc76ccb34514ef6"]}},"outputs":[{"data":{"application/vnd.jupyter.widget-view+json":{"model_id":"984534f124ac4de0abc76ccb34514ef6","version_major":2,"version_minor":0},"text/plain":["Dropdown(description='Model:', options=('distil-whisper/distil-large-v2', 'distil-whisper/distil-medium.en', '…"]},"execution_count":59,"metadata":{},"output_type":"execute_result"}],"source":["model_id = widgets.Dropdown(\n"," options=model_ids[model_type.value],\n"," value=model_ids[model_type.value][0],\n"," description=\"Model:\",\n"," disabled=False,\n",")\n","\n","model_id"]},{"cell_type":"code","execution_count":null,"id":"e5382431-497e-4688-b4ec-8958a92163e7","metadata":{"ExecuteTime":{"end_time":"2023-11-08T15:05:45.226409400Z","start_time":"2023-11-08T15:05:38.138403800Z"},"id":"e5382431-497e-4688-b4ec-8958a92163e7","outputId":"96b0a01c-9716-4c23-e9ff-c49b4ccd840b","colab":{"referenced_widgets":["42046e72986e433e986263033f103363","df303ca75b99467fa70dbde06c2e49db","8fe0747ac4bc40308a8ce54035e1e9dd","ea1ca0665b3b4ccc977f4ff5d57ff857","e3b4bcddfd094b7da472ef42c26e2f67","04103cd904b647109d82f2aac2eb720a","e9e103d9ac03499db274cc6f6c16f277","1ea798a2d66f437a91332b21188cdada","265fff8085184d4a8e7cea6f7970ad58","8273e654cd89491d966fd902c917c583","fb7a086baddb4f4caeb086a8ecb1ce3e"]}},"outputs":[{"name":"stderr","output_type":"stream","text":["C:\\Users\\intelaipc\\miniforge3\\envs\\ov\\Lib\\site-packages\\huggingface_hub\\file_download.py:1132: FutureWarning: `resume_download` is deprecated and will be removed in version 1.0.0. Downloads always resume when possible. 
If you want to force a new download, use `force_download=True`.\n"," warnings.warn(\n"]},{"data":{"application/vnd.jupyter.widget-view+json":{"model_id":"42046e72986e433e986263033f103363","version_major":2,"version_minor":0},"text/plain":["preprocessor_config.json: 0%| | 0.00/339 [00:00\n"," \n"," Your browser does not support the audio element.\n"," \n"," "],"text/plain":[""]},"metadata":{},"output_type":"display_data"},{"name":"stdout","output_type":"stream","text":["Reference: MISTER QUILTER IS THE APOSTLE OF THE MIDDLE CLASSES AND WE ARE GLAD TO WELCOME HIS GOSPEL\n","Result: Mr. Quilter is the Apostle of the Middle Classes, and we are glad to welcome his Gospel.\n"]}],"source":["import IPython.display as ipd\n","\n","predicted_ids = pt_model.generate(input_features)\n","transcription = processor.batch_decode(predicted_ids, skip_special_tokens=True)\n","\n","display(ipd.Audio(sample[\"audio\"][\"array\"], rate=sample[\"audio\"][\"sampling_rate\"]))\n","print(f\"Reference: {sample['text']}\")\n","print(f\"Result: {transcription[0]}\")"]},{"cell_type":"markdown","id":"219ed303-d323-4a07-8a92-66a2e96e1ec5","metadata":{"id":"219ed303-d323-4a07-8a92-66a2e96e1ec5"},"source":["## Load OpenVINO model using Optimum library\n","[back to top ⬆️](#Table-of-contents:)\n","\n","The Hugging Face Optimum API is a high-level API that enables us to convert and quantize models from the Hugging Face Transformers library to the OpenVINO™ IR format. For more details, refer to the [Hugging Face Optimum documentation](https://huggingface.co/docs/optimum/intel/inference).\n","\n","Optimum Intel can be used to load optimized models from the [Hugging Face Hub](https://huggingface.co/docs/optimum/intel/hf.co/models) and create pipelines to run an inference with OpenVINO Runtime using Hugging Face APIs. The Optimum Inference models are API compatible with Hugging Face Transformers models. This means we just need to replace the `AutoModelForXxx` class with the corresponding `OVModelForXxx` class.\n","\n","Below is an example of the distil-whisper model\n","\n","```diff\n","-from transformers import AutoModelForSpeechSeq2Seq\n","+from optimum.intel.openvino import OVModelForSpeechSeq2Seq\n","from transformers import AutoTokenizer, pipeline\n","\n","-model = AutoModelForSpeechSeq2Seq.from_pretrained(model_id)\n","+model = OVModelForSpeechSeq2Seq.from_pretrained(model_id, export=True)\n","```\n","\n","Model class initialization starts with calling the `from_pretrained` method. When downloading and converting the Transformers model, the parameter `export=True` should be added. We can save the converted model for the next usage with the `save_pretrained` method.\n","Tokenizers and Processors are distributed with models also compatible with the OpenVINO model. 
It means that we can reuse initialized early processor."]},{"cell_type":"code","execution_count":null,"id":"7ef523e8-b70f-4d86-a7d1-81f761c3eac0","metadata":{"ExecuteTime":{"end_time":"2023-11-08T15:05:52.840159900Z","start_time":"2023-11-08T15:05:51.638930400Z"},"id":"7ef523e8-b70f-4d86-a7d1-81f761c3eac0","outputId":"5b5aa940-4607-477d-d130-f8706a14802b"},"outputs":[{"name":"stderr","output_type":"stream","text":["Exception ignored in: \n","Traceback (most recent call last):\n"," File \"C:\\Users\\intelaipc\\miniforge3\\envs\\ov\\Lib\\weakref.py\", line 590, in __call__\n"," return info.func(*info.args, **(info.kwargs or {}))\n"," ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^\n"," File \"C:\\Users\\intelaipc\\miniforge3\\envs\\ov\\Lib\\tempfile.py\", line 933, in _cleanup\n"," cls._rmtree(name, ignore_errors=ignore_errors)\n"," File \"C:\\Users\\intelaipc\\miniforge3\\envs\\ov\\Lib\\tempfile.py\", line 929, in _rmtree\n"," _shutil.rmtree(name, onerror=onerror)\n"," File \"C:\\Users\\intelaipc\\miniforge3\\envs\\ov\\Lib\\shutil.py\", line 787, in rmtree\n"," return _rmtree_unsafe(path, onerror)\n"," ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^\n"," File \"C:\\Users\\intelaipc\\miniforge3\\envs\\ov\\Lib\\shutil.py\", line 634, in _rmtree_unsafe\n"," onerror(os.unlink, fullname, sys.exc_info())\n"," File \"C:\\Users\\intelaipc\\miniforge3\\envs\\ov\\Lib\\tempfile.py\", line 893, in onerror\n"," _os.unlink(path)\n","PermissionError: [WinError 32] The process cannot access the file because it is being used by another process: 'C:\\\\Users\\\\INTELA~1\\\\AppData\\\\Local\\\\Temp\\\\tmp29ao2f8l\\\\openvino_decoder_model.bin'\n","Special tokens have been added in the vocabulary, make sure the associated word embeddings are fine-tuned or trained.\n","Special tokens have been added in the vocabulary, make sure the associated word embeddings are fine-tuned or trained.\n","Some non-default generation parameters are set in the model config. These should go into a GenerationConfig file (https://huggingface.co/docs/transformers/generation_strategies#save-a-custom-decoding-strategy-with-your-model) instead. This warning will be raised to an exception in v4.41.\n","Non-default generation parameters: {'max_length': 448, 'suppress_tokens': [1, 2, 7, 8, 9, 10, 14, 25, 26, 27, 28, 29, 31, 58, 59, 60, 61, 62, 63, 90, 91, 92, 93, 357, 366, 438, 532, 685, 705, 796, 930, 1058, 1220, 1267, 1279, 1303, 1343, 1377, 1391, 1635, 1782, 1875, 2162, 2361, 2488, 3467, 4008, 4211, 4600, 4808, 5299, 5855, 6329, 7203, 9609, 9959, 10563, 10786, 11420, 11709, 11907, 13163, 13697, 13700, 14808, 15306, 16410, 16791, 17992, 19203, 19510, 20724, 22305, 22935, 27007, 30109, 30420, 33409, 34949, 40283, 40493, 40549, 47282, 49146, 50257, 50357, 50358, 50359, 50360, 50361], 'begin_suppress_tokens': [220, 50256]}\n","Special tokens have been added in the vocabulary, make sure the associated word embeddings are fine-tuned or trained.\n","Special tokens have been added in the vocabulary, make sure the associated word embeddings are fine-tuned or trained.\n","C:\\Users\\intelaipc\\miniforge3\\envs\\ov\\Lib\\site-packages\\transformers\\models\\whisper\\modeling_whisper.py:1162: TracerWarning: Converting a tensor to a Python boolean might cause the trace to be incorrect. We can't record the data flow of Python values, so this value will be treated as a constant in the future. 
This means that the trace might not generalize to other inputs!\n"," if input_features.shape[-1] != expected_seq_length:\n","C:\\Users\\intelaipc\\miniforge3\\envs\\ov\\Lib\\site-packages\\transformers\\models\\whisper\\modeling_whisper.py:341: TracerWarning: Converting a tensor to a Python boolean might cause the trace to be incorrect. We can't record the data flow of Python values, so this value will be treated as a constant in the future. This means that the trace might not generalize to other inputs!\n"," if attn_weights.size() != (bsz * self.num_heads, tgt_len, src_len):\n","C:\\Users\\intelaipc\\miniforge3\\envs\\ov\\Lib\\site-packages\\transformers\\models\\whisper\\modeling_whisper.py:380: TracerWarning: Converting a tensor to a Python boolean might cause the trace to be incorrect. We can't record the data flow of Python values, so this value will be treated as a constant in the future. This means that the trace might not generalize to other inputs!\n"," if attn_output.size() != (bsz * self.num_heads, tgt_len, self.head_dim):\n","C:\\Users\\intelaipc\\miniforge3\\envs\\ov\\Lib\\site-packages\\transformers\\modeling_attn_mask_utils.py:86: TracerWarning: Converting a tensor to a Python boolean might cause the trace to be incorrect. We can't record the data flow of Python values, so this value will be treated as a constant in the future. This means that the trace might not generalize to other inputs!\n"," if input_shape[-1] > 1 or self.sliding_window is not None:\n","C:\\Users\\intelaipc\\miniforge3\\envs\\ov\\Lib\\site-packages\\transformers\\modeling_attn_mask_utils.py:162: TracerWarning: Converting a tensor to a Python boolean might cause the trace to be incorrect. We can't record the data flow of Python values, so this value will be treated as a constant in the future. This means that the trace might not generalize to other inputs!\n"," if past_key_values_length > 0:\n","C:\\Users\\intelaipc\\miniforge3\\envs\\ov\\Lib\\site-packages\\transformers\\models\\whisper\\modeling_whisper.py:348: TracerWarning: Converting a tensor to a Python boolean might cause the trace to be incorrect. We can't record the data flow of Python values, so this value will be treated as a constant in the future. This means that the trace might not generalize to other inputs!\n"," if attention_mask.size() != (bsz, 1, tgt_len, src_len):\n","Export model to OpenVINO directly failed with: \n","Requested input shape 3 rank is not equal to provided example_input rank 0.\n","Model will be exported to ONNX\n","C:\\Users\\intelaipc\\miniforge3\\envs\\ov\\Lib\\site-packages\\transformers\\models\\whisper\\modeling_whisper.py:303: TracerWarning: Converting a tensor to a Python boolean might cause the trace to be incorrect. We can't record the data flow of Python values, so this value will be treated as a constant in the future. This means that the trace might not generalize to other inputs!\n"," and past_key_value[0].shape[2] == key_value_states.shape[1]\n","Export model to OpenVINO directly failed with: \n","Requested input shape 3 rank is not equal to provided example_input rank 0.\n","Model will be exported to ONNX\n","Some non-default generation parameters are set in the model config. These should go into a GenerationConfig file (https://huggingface.co/docs/transformers/generation_strategies#save-a-custom-decoding-strategy-with-your-model) instead. 
This warning will be raised to an exception in v4.41.\n","Non-default generation parameters: {'max_length': 448, 'suppress_tokens': [1, 2, 7, 8, 9, 10, 14, 25, 26, 27, 28, 29, 31, 58, 59, 60, 61, 62, 63, 90, 91, 92, 93, 357, 366, 438, 532, 685, 705, 796, 930, 1058, 1220, 1267, 1279, 1303, 1343, 1377, 1391, 1635, 1782, 1875, 2162, 2361, 2488, 3467, 4008, 4211, 4600, 4808, 5299, 5855, 6329, 7203, 9609, 9959, 10563, 10786, 11420, 11709, 11907, 13163, 13697, 13700, 14808, 15306, 16410, 16791, 17992, 19203, 19510, 20724, 22305, 22935, 27007, 30109, 30420, 33409, 34949, 40283, 40493, 40549, 47282, 49146, 50257, 50357, 50358, 50359, 50360, 50361], 'begin_suppress_tokens': [220, 50256]}\n"]}],"source":["from pathlib import Path\n","from optimum.intel.openvino import OVModelForSpeechSeq2Seq\n","\n","model_path = Path(model_id.value.replace(\"/\", \"_\"))\n","ov_config = {\"CACHE_DIR\": \"\"}\n","\n","if not model_path.exists():\n"," ov_model = OVModelForSpeechSeq2Seq.from_pretrained(\n"," model_id.value,\n"," ov_config=ov_config,\n"," export=True,\n"," compile=False,\n"," load_in_8bit=False,\n"," )\n"," ov_model.half()\n"," ov_model.save_pretrained(model_path)\n","else:\n"," ov_model = OVModelForSpeechSeq2Seq.from_pretrained(model_path, ov_config=ov_config, compile=False)"]},{"cell_type":"markdown","id":"99a3dffb-5476-4a5d-843f-c7a7cbbf2154","metadata":{"id":"99a3dffb-5476-4a5d-843f-c7a7cbbf2154"},"source":["### Select Inference device\n","[back to top ⬆️](#Table-of-contents:)\n"]},{"cell_type":"code","execution_count":null,"id":"4b1bd73b-bcc8-4f72-b896-63e11f33f607","metadata":{"ExecuteTime":{"end_time":"2023-11-08T15:05:53.210751600Z","start_time":"2023-11-08T15:05:53.179009600Z"},"id":"4b1bd73b-bcc8-4f72-b896-63e11f33f607","outputId":"4d8a2eef-478a-4075-d24d-81281a1f4c8a","colab":{"referenced_widgets":["813eaa70e834487c9c8e4b733a63d377"]}},"outputs":[{"data":{"application/vnd.jupyter.widget-view+json":{"model_id":"813eaa70e834487c9c8e4b733a63d377","version_major":2,"version_minor":0},"text/plain":["Dropdown(description='Device:', index=3, options=('CPU', 'GPU', 'NPU', 'AUTO'), value='AUTO')"]},"execution_count":61,"metadata":{},"output_type":"execute_result"}],"source":["import openvino as ov\n","import ipywidgets as widgets\n","\n","core = ov.Core()\n","\n","device = widgets.Dropdown(\n"," options=core.available_devices + [\"AUTO\"],\n"," value=\"AUTO\",\n"," description=\"Device:\",\n"," disabled=False,\n",")\n","\n","device"]},{"cell_type":"markdown","id":"45cd85e8-63e4-402c-86bc-2023ed5775a8","metadata":{"id":"45cd85e8-63e4-402c-86bc-2023ed5775a8"},"source":["### Compile OpenVINO model\n","[back to top ⬆️](#Table-of-contents:)\n"]},{"cell_type":"code","execution_count":null,"id":"057328a7-dc25-4c54-a438-85467e0076de","metadata":{"ExecuteTime":{"end_time":"2023-11-08T15:05:55.485452300Z","start_time":"2023-11-08T15:05:53.211466100Z"},"id":"057328a7-dc25-4c54-a438-85467e0076de","outputId":"74955d8e-e82d-445f-ad01-3aed29a08197"},"outputs":[{"name":"stdout","output_type":"stream","text":["AUTO\n"]}],"source":["ov_model.to(device.value)\n","ov_model.compile()\n"]},{"cell_type":"code","execution_count":null,"id":"74e2a224-6305-47a7-a961-d841312e422f","metadata":{"id":"74e2a224-6305-47a7-a961-d841312e422f","outputId":"e3e148ce-d395-4870-eab2-d512766e4598"},"outputs":[{"ename":"AttributeError","evalue":"'_OVModelForWhisper' object has no attribute 
'_modules'","output_type":"error","traceback":["\u001b[1;31m---------------------------------------------------------------------------\u001b[0m","\u001b[1;31mAttributeError\u001b[0m Traceback (most recent call last)","Cell \u001b[1;32mIn[71], line 1\u001b[0m\n\u001b[1;32m----> 1\u001b[0m \u001b[38;5;28;43mprint\u001b[39;49m\u001b[43m(\u001b[49m\u001b[43mov_model\u001b[49m\u001b[43m)\u001b[49m\n","File \u001b[1;32m~\\miniforge3\\envs\\ov\\Lib\\site-packages\\torch\\nn\\modules\\module.py:2525\u001b[0m, in \u001b[0;36mModule.__repr__\u001b[1;34m(self)\u001b[0m\n\u001b[0;32m 2523\u001b[0m extra_lines \u001b[38;5;241m=\u001b[39m extra_repr\u001b[38;5;241m.\u001b[39msplit(\u001b[38;5;124m'\u001b[39m\u001b[38;5;130;01m\\n\u001b[39;00m\u001b[38;5;124m'\u001b[39m)\n\u001b[0;32m 2524\u001b[0m child_lines \u001b[38;5;241m=\u001b[39m []\n\u001b[1;32m-> 2525\u001b[0m \u001b[38;5;28;01mfor\u001b[39;00m key, module \u001b[38;5;129;01min\u001b[39;00m \u001b[38;5;28;43mself\u001b[39;49m\u001b[38;5;241;43m.\u001b[39;49m\u001b[43m_modules\u001b[49m\u001b[38;5;241m.\u001b[39mitems():\n\u001b[0;32m 2526\u001b[0m mod_str \u001b[38;5;241m=\u001b[39m \u001b[38;5;28mrepr\u001b[39m(module)\n\u001b[0;32m 2527\u001b[0m mod_str \u001b[38;5;241m=\u001b[39m _addindent(mod_str, \u001b[38;5;241m2\u001b[39m)\n","File \u001b[1;32m~\\miniforge3\\envs\\ov\\Lib\\site-packages\\torch\\nn\\modules\\module.py:1709\u001b[0m, in \u001b[0;36mModule.__getattr__\u001b[1;34m(self, name)\u001b[0m\n\u001b[0;32m 1707\u001b[0m \u001b[38;5;28;01mif\u001b[39;00m name \u001b[38;5;129;01min\u001b[39;00m modules:\n\u001b[0;32m 1708\u001b[0m \u001b[38;5;28;01mreturn\u001b[39;00m modules[name]\n\u001b[1;32m-> 1709\u001b[0m \u001b[38;5;28;01mraise\u001b[39;00m \u001b[38;5;167;01mAttributeError\u001b[39;00m(\u001b[38;5;124mf\u001b[39m\u001b[38;5;124m\"\u001b[39m\u001b[38;5;124m'\u001b[39m\u001b[38;5;132;01m{\u001b[39;00m\u001b[38;5;28mtype\u001b[39m(\u001b[38;5;28mself\u001b[39m)\u001b[38;5;241m.\u001b[39m\u001b[38;5;18m__name__\u001b[39m\u001b[38;5;132;01m}\u001b[39;00m\u001b[38;5;124m'\u001b[39m\u001b[38;5;124m object has no attribute \u001b[39m\u001b[38;5;124m'\u001b[39m\u001b[38;5;132;01m{\u001b[39;00mname\u001b[38;5;132;01m}\u001b[39;00m\u001b[38;5;124m'\u001b[39m\u001b[38;5;124m\"\u001b[39m)\n","\u001b[1;31mAttributeError\u001b[0m: '_OVModelForWhisper' object has no attribute '_modules'"]}],"source":["print(ov_model)"]},{"cell_type":"markdown","id":"3590030d-5149-4f83-9e78-f6a582e1511a","metadata":{"id":"3590030d-5149-4f83-9e78-f6a582e1511a"},"source":["### Run OpenVINO model inference\n","[back to top ⬆️](#Table-of-contents:)\n"]},{"cell_type":"code","execution_count":null,"id":"68a94f38-09e9-48fc-9df0-6c954a82f2fb","metadata":{"ExecuteTime":{"end_time":"2023-11-08T15:05:57.755227300Z","start_time":"2023-11-08T15:05:55.486017200Z"},"id":"68a94f38-09e9-48fc-9df0-6c954a82f2fb","outputId":"3ed6a371-185d-4f8d-d651-d1b82f472e2f"},"outputs":[{"data":{"text/html":["\n"," \n"," "],"text/plain":[""]},"metadata":{},"output_type":"display_data"},{"name":"stdout","output_type":"stream","text":["Reference: MISTER QUILTER IS THE APOSTLE OF THE MIDDLE CLASSES AND WE ARE GLAD TO WELCOME HIS GOSPEL\n","Result: Mr. 
Quilter is the Apostle of the Middle Classes, and we are glad to welcome his Gospel.\n"]}],"source":["predicted_ids = ov_model.generate(input_features)\n","transcription = processor.batch_decode(predicted_ids, skip_special_tokens=True)\n","\n","display(ipd.Audio(sample[\"audio\"][\"array\"], rate=sample[\"audio\"][\"sampling_rate\"]))\n","print(f\"Reference: {sample['text']}\")\n","print(f\"Result: {transcription[0]}\")"]},{"cell_type":"code","execution_count":null,"id":"6d6dff19-8829-406f-96cd-8c2a9e4979f4","metadata":{"id":"6d6dff19-8829-406f-96cd-8c2a9e4979f4","outputId":"d4ce7271-4098-4e95-d483-7e1544e2c54e"},"outputs":[{"name":"stderr","output_type":"stream","text":["Measuring accuracy: 100%|█████████████████████████████████████████████| 50/50 [02:15<00:00, 2.70s/it]"]},{"name":"stdout","output_type":"stream","text":["Original model transcription word accuracy: 84.33%.\n"]},{"name":"stderr","output_type":"stream","text":["\n"]}],"source":["from contextlib import contextmanager\n","from jiwer import wer, wer_standardize\n","from tqdm import tqdm\n","\n","TEST_DATASET_SIZE = 50\n","\n","def calculate_accuracy(ov_model, test_samples):\n"," ground_truths = []\n"," predictions = []\n","\n"," for data_item in tqdm(test_samples, desc=\"Measuring accuracy\"):\n"," input_features = extract_input_features(data_item)\n","\n"," predicted_ids = ov_model.generate(input_features)\n"," transcription = processor.batch_decode(predicted_ids, skip_special_tokens=True)\n","\n"," ground_truths.append(data_item[\"text\"])\n"," predictions.append(transcription[0])\n","\n"," word_accuracy = (1 - wer(ground_truths, predictions, reference_transform=wer_standardize,\n"," hypothesis_transform=wer_standardize)) * 100\n"," return word_accuracy\n","\n","# Load test dataset\n","test_dataset = load_dataset(\"librispeech_asr\", \"clean\", split=\"test\", streaming=True)\n","test_dataset = test_dataset.shuffle(seed=42).take(TEST_DATASET_SIZE)\n","test_samples = [sample for sample in test_dataset]\n","\n","# Calculate accuracy for the original model\n","accuracy_original = calculate_accuracy(ov_model, test_samples)\n","\n","# Print original model accuracy\n","print(f\"Original model transcription word accuracy: {accuracy_original:.2f}%.\")\n"]},{"cell_type":"markdown","id":"8d69046e-707f-4c07-af24-389b125b3abd","metadata":{"id":"8d69046e-707f-4c07-af24-389b125b3abd"},"source":["## Perform FineTuning\n","[back to top ⬆️](#Table-of-contents:)\n"]},{"cell_type":"code","execution_count":null,"id":"93adb473-95ca-41ba-ba8e-f5df4bb1ec00","metadata":{"scrolled":true,"id":"93adb473-95ca-41ba-ba8e-f5df4bb1ec00","outputId":"8c20d654-1465-4d60-ba75-cba07acb5f2e","colab":{"referenced_widgets":["57f384ffa868456b861618daa3f42439"]}},"outputs":[{"data":{"application/vnd.jupyter.widget-view+json":{"model_id":"57f384ffa868456b861618daa3f42439","version_major":2,"version_minor":0},"text/plain":["VBox(children=(HTML(value='
\n"," \n"," \n"," [100/100 17:15, Epoch 1/9223372036854775807]\n"," \n"," \n"," \n"," \n"," \n"," \n"," \n"," \n"," \n"," \n","
StepTraining Loss

"],"text/plain":[""]},"metadata":{},"output_type":"display_data"},{"data":{"text/plain":["TrainOutput(global_step=100, training_loss=2.028716278076172, metrics={'train_runtime': 1048.7083, 'train_samples_per_second': 0.763, 'train_steps_per_second': 0.095, 'total_flos': 1.9695108096e+16, 'train_loss': 2.028716278076172, 'epoch': 1.0})"]},"execution_count":88,"metadata":{},"output_type":"execute_result"}],"source":["trainer.train()"]},{"cell_type":"code","execution_count":null,"id":"e0c5be81-4702-4f50-9158-09a0c709f0e0","metadata":{"id":"e0c5be81-4702-4f50-9158-09a0c709f0e0","outputId":"195d3b93-c964-4735-8aa9-835d2a47cc8d","colab":{"referenced_widgets":["056676797b7741f5bba271372586f647"]}},"outputs":[{"data":{"application/vnd.jupyter.widget-view+json":{"model_id":"056676797b7741f5bba271372586f647","version_major":2,"version_minor":0},"text/plain":["Measuring accuracy: 0%| | 0/50 [00:00= 0, \"non-negative timestamp expected\"\n"," milliseconds = round(seconds * 1000.0)\n","\n"," hours = milliseconds // 3_600_000\n"," milliseconds -= hours * 3_600_000\n","\n"," minutes = milliseconds // 60_000\n"," milliseconds -= minutes * 60_000\n","\n"," seconds = milliseconds // 1_000\n"," milliseconds -= seconds * 1_000\n","\n"," return (f\"{hours}:\" if hours > 0 else \"00:\") + f\"{minutes:02d}:{seconds:02d},{milliseconds:03d}\"\n","\n","\n","def prepare_srt(transcription):\n"," \"\"\"\n"," Format transcription into srt file format\n"," \"\"\"\n"," segment_lines = []\n"," for idx, segment in enumerate(transcription[\"chunks\"]):\n"," segment_lines.append(str(idx + 1) + \"\\n\")\n"," timestamps = segment[\"timestamp\"]\n"," time_start = format_timestamp(timestamps[0])\n"," time_end = format_timestamp(timestamps[1])\n"," time_str = f\"{time_start} --> {time_end}\\n\"\n"," segment_lines.append(time_str)\n"," segment_lines.append(segment[\"text\"] + \"\\n\\n\")\n"," return segment_lines"]},{"cell_type":"markdown","id":"4fdb0ad8-e083-4e63-aeb1-c15566d945a7","metadata":{"id":"4fdb0ad8-e083-4e63-aeb1-c15566d945a7"},"source":["`return_timestamps` argument allows getting timestamps of start and end of speech associated with each processed chunk. It could be useful in tasks like speech separation or generation of video subtitles. In this example, we provide output formatting in SRT format, one of the popular subtitles format."]},{"cell_type":"code","execution_count":null,"id":"3219bb35-955a-4032-acf1-d83e5dab09bd","metadata":{"ExecuteTime":{"end_time":"2023-11-08T15:08:40.300007100Z","start_time":"2023-11-08T15:08:28.543311300Z"},"id":"3219bb35-955a-4032-acf1-d83e5dab09bd"},"outputs":[],"source":["result = pipe(sample_long[\"audio\"].copy(), return_timestamps=True)"]},{"cell_type":"code","execution_count":null,"id":"bd7ef03b-3c71-4f3a-9a9c-40549256b447","metadata":{"ExecuteTime":{"end_time":"2023-11-08T15:08:42.981577100Z","start_time":"2023-11-08T15:08:40.300007100Z"},"id":"bd7ef03b-3c71-4f3a-9a9c-40549256b447","outputId":"fcfafe87-2da0-43fb-8e29-805e38e96455"},"outputs":[{"data":{"text/html":["\n"," \n"," "],"text/plain":[""]},"metadata":{},"output_type":"display_data"},{"name":"stdout","output_type":"stream","text":["1\n","00:00:00,000 --> 00:00:06,560\n"," Mr. Quilter is the apostle of the middle classes, and we are glad to welcome his gospel.\n","\n","2\n","00:00:06,560 --> 00:00:11,280\n"," Nor is Mr. 
Quilter's manner less interesting than his matter.\n","\n","3\n","00:00:11,280 --> 00:00:16,840\n"," He tells us that at this festive season of the year, with Christmas and roast beef looming\n","\n","4\n","00:00:16,840 --> 00:00:23,760\n"," before us, similes drawn from eating and its results occur most readily to the mind.\n","\n","5\n","00:00:23,760 --> 00:00:29,360\n"," He has grave doubts whether Sir Frederick Leighton's work is really Greek after all, and\n","\n","6\n","00:00:29,360 --> 00:00:33,640\n"," can discover in it but little of Rocky Ithaca.\n","\n","7\n","00:00:33,640 --> 00:00:39,760\n"," Lennel's pictures are a sort of upgards and Adam paintings, and Mason's exquisite\n","\n","8\n","00:00:39,760 --> 00:00:44,720\n"," idles are as national as a jingo poem.\n","\n","9\n","00:00:44,720 --> 00:00:50,320\n"," Mr. Burkett Foster's landscapes smile at one much in the same way that Mr. Carker used\n","\n","10\n","00:00:50,320 --> 00:00:52,920\n"," to flash his teeth.\n","\n","11\n","00:00:52,920 --> 00:00:58,680\n"," And Mr. John Collier gives his sitter a cheerful slap on the back, before he says, like\n","\n","12\n","00:00:58,680 --> 00:01:01,120\n"," a shampooer and a Turkish bath,\n","\n","13\n","00:01:01,120 --> 00:01:02,000\n"," Next man!\n","\n","\n"]}],"source":["srt_lines = prepare_srt(result)\n","\n","display(ipd.Audio(sample_long[\"audio\"][\"array\"], rate=sample_long[\"audio\"][\"sampling_rate\"]))\n","print(\"\".join(srt_lines))"]},{"cell_type":"markdown","id":"b36d31bc","metadata":{"collapsed":false,"jupyter":{"outputs_hidden":false},"id":"b36d31bc"},"source":["## Quantization\n","[back to top ⬆️](#Table-of-contents:)\n","\n","[NNCF](https://github.com/openvinotoolkit/nncf/) enables post-training quantization by adding the quantization layers into the model graph and then using a subset of the training dataset to initialize the parameters of these additional quantization layers. The framework is designed so that modifications to your original training code are minor.\n","\n","The optimization process contains the following steps:\n","\n","1. Create a calibration dataset for quantization.\n","2. Run `nncf.quantize` to obtain quantized encoder and decoder models.\n","3. Serialize the `INT8` model using `openvino.save_model` function.\n","\n",">**Note**: Quantization is time and memory consuming operation. 
Running the quantization code below may take some time.\n","\n","Please select below whether you would like to run Distil-Whisper quantization."]},{"cell_type":"code","execution_count":null,"id":"58d361c3","metadata":{"ExecuteTime":{"end_time":"2023-11-08T15:08:42.981577100Z","start_time":"2023-11-08T15:08:42.981577100Z"},"jupyter":{"outputs_hidden":false},"id":"58d361c3","outputId":"5a410784-ce3e-4484-89d5-dc08c30e38c8","colab":{"referenced_widgets":["df9429d236094d6fa1a8a9c7a8734626"]}},"outputs":[{"data":{"application/vnd.jupyter.widget-view+json":{"model_id":"df9429d236094d6fa1a8a9c7a8734626","version_major":2,"version_minor":0},"text/plain":["Checkbox(value=True, description='Quantization')"]},"execution_count":21,"metadata":{},"output_type":"execute_result"}],"source":["to_quantize = widgets.Checkbox(\n"," value=True,\n"," description=\"Quantization\",\n"," disabled=False,\n",")\n","\n","to_quantize"]},{"cell_type":"code","execution_count":null,"id":"46cc97d3","metadata":{"ExecuteTime":{"end_time":"2023-11-08T15:08:42.981577100Z","start_time":"2023-11-08T15:08:42.981577100Z"},"jupyter":{"outputs_hidden":false},"id":"46cc97d3"},"outputs":[],"source":["# Fetch `skip_kernel_extension` module\n","import requests\n","\n","r = requests.get(\n"," url=\"https://raw.githubusercontent.com/openvinotoolkit/openvino_notebooks/latest/utils/skip_kernel_extension.py\",\n",")\n","open(\"skip_kernel_extension.py\", \"w\").write(r.text)\n","\n","%load_ext skip_kernel_extension"]},{"cell_type":"markdown","id":"cfb2c2a7","metadata":{"collapsed":false,"jupyter":{"outputs_hidden":false},"id":"cfb2c2a7"},"source":["### Prepare calibration datasets\n","[back to top ⬆️](#Table-of-contents:)\n","\n","The first step is to prepare calibration datasets for quantization. Since we quantize the Whisper encoder and decoder separately, we need to prepare a calibration dataset for each of the models. We import an `InferRequestWrapper` class that intercepts model inputs and collects them into a list. Then we run model inference on a small number of audio samples. 
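Conceptually, the wrapper behaves like the simplified, hypothetical sketch below; the real `InferRequestWrapper` shipped with optimum-intel is more involved (for example, `apply_caching=True` enables caching of repeated inputs):\n","\n","```python\n","import copy\n","\n","# Illustration only, not the actual optimum-intel implementation:\n","# record every input passed to an inference request, then delegate to it.\n","class SimpleInferRequestWrapper:\n","    def __init__(self, request, collected_inputs):\n","        self.request = request\n","        self.collected_inputs = collected_inputs\n","\n","    def __call__(self, *args, **kwargs):\n","        # keep a copy of the inputs for calibration, then run the real request\n","        self.collected_inputs.append(copy.deepcopy(args))\n","        return self.request(*args, **kwargs)\n","```\n","\n","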
Generally, increasing the calibration dataset size improves quantization quality."]},{"cell_type":"code","execution_count":null,"id":"96d6b01e","metadata":{"ExecuteTime":{"end_time":"2023-11-08T16:08:47.608131500Z","start_time":"2023-11-08T16:08:47.567321700Z"},"jupyter":{"outputs_hidden":false},"id":"96d6b01e"},"outputs":[],"source":["%%skip not $to_quantize.value\n","\n","from itertools import islice\n","from optimum.intel.openvino.quantization import InferRequestWrapper\n","\n","\n","def collect_calibration_dataset(ov_model: OVModelForSpeechSeq2Seq, calibration_dataset_size: int):\n"," # Overwrite model request properties, saving the original ones for restoring later\n"," encoder_calibration_data = []\n"," decoder_calibration_data = []\n"," ov_model.encoder.request = InferRequestWrapper(ov_model.encoder.request, encoder_calibration_data, apply_caching=True)\n"," ov_model.decoder_with_past.request = InferRequestWrapper(ov_model.decoder_with_past.request,\n"," decoder_calibration_data,\n"," apply_caching=True)\n","\n"," try:\n"," calibration_dataset = load_dataset(\"librispeech_asr\", \"clean\", split=\"validation\", streaming=True)\n"," for sample in tqdm(islice(calibration_dataset, calibration_dataset_size), desc=\"Collecting calibration data\",\n"," total=calibration_dataset_size):\n"," input_features = extract_input_features(sample)\n"," ov_model.generate(input_features)\n"," finally:\n"," ov_model.encoder.request = ov_model.encoder.request.request\n"," ov_model.decoder_with_past.request = ov_model.decoder_with_past.request.request\n","\n"," return encoder_calibration_data, decoder_calibration_data"]},{"cell_type":"markdown","id":"023f2eff","metadata":{"collapsed":false,"jupyter":{"outputs_hidden":false},"id":"023f2eff"},"source":["### Quantize Distil-Whisper encoder and decoder models\n","[back to top ⬆️](#Table-of-contents:)\n","\n","Below we run the `quantize` function which calls `nncf.quantize` on Distil-Whisper encoder and decoder-with-past models. We don't quantize first-step-decoder because its share in whole inference time is negligible."]},{"cell_type":"code","execution_count":null,"id":"0de8bd26","metadata":{"ExecuteTime":{"end_time":"2023-11-08T16:20:21.666837100Z","start_time":"2023-11-08T16:20:19.667042200Z"},"jupyter":{"outputs_hidden":false},"test_replace":{"CALIBRATION_DATASET_SIZE = 50":"CALIBRATION_DATASET_SIZE = 1"},"id":"0de8bd26","outputId":"33f3248e-ee67-4623-a2a5-9289c1d66c03","colab":{"referenced_widgets":["7b6c8bad9f71453f99aed8d59298ae7a","88066e3fd2d4400cbe7df8ed0e5fd08e","d634c038189245f78a9ac8944813e25a","a7a671fba9284e95b2fcd56d7af91701","b93b44da700141828a35741b9fd8b83d"]}},"outputs":[{"name":"stdout","output_type":"stream","text":["distil-whisper_distil-large-v2_quantized\n"]},{"data":{"application/vnd.jupyter.widget-view+json":{"model_id":"7b6c8bad9f71453f99aed8d59298ae7a","version_major":2,"version_minor":0},"text/plain":["Collecting calibration data: 0%| | 0/20 [00:00\n"],"text/plain":[]},"metadata":{},"output_type":"display_data"},{"data":{"text/html":["

\n","
\n"],"text/plain":["\n"]},"metadata":{},"output_type":"display_data"},{"data":{"application/vnd.jupyter.widget-view+json":{"model_id":"d634c038189245f78a9ac8944813e25a","version_major":2,"version_minor":0},"text/plain":["Output()"]},"metadata":{},"output_type":"display_data"},{"data":{"text/html":["
\n"],"text/plain":[]},"metadata":{},"output_type":"display_data"},{"data":{"text/html":["
\n","
\n"],"text/plain":["\n"]},"metadata":{},"output_type":"display_data"},{"name":"stdout","output_type":"stream","text":["INFO:nncf:96 ignored nodes were found by name in the NNCFGraph\n","INFO:nncf:128 ignored nodes were found by name in the NNCFGraph\n"]},{"data":{"application/vnd.jupyter.widget-view+json":{"model_id":"a7a671fba9284e95b2fcd56d7af91701","version_major":2,"version_minor":0},"text/plain":["Output()"]},"metadata":{},"output_type":"display_data"},{"data":{"text/html":["
\n"],"text/plain":[]},"metadata":{},"output_type":"display_data"},{"data":{"text/html":["
\n","
\n"],"text/plain":["\n"]},"metadata":{},"output_type":"display_data"},{"data":{"application/vnd.jupyter.widget-view+json":{"model_id":"b93b44da700141828a35741b9fd8b83d","version_major":2,"version_minor":0},"text/plain":["Output()"]},"metadata":{},"output_type":"display_data"},{"data":{"text/html":["
\n"],"text/plain":[]},"metadata":{},"output_type":"display_data"},{"data":{"text/html":["
\n","
\n"],"text/plain":["\n"]},"metadata":{},"output_type":"display_data"},{"ename":"RuntimeError","evalue":"Check 'bin_file' failed at src\\core\\src\\pass\\serialize.cpp:1210:\nCan't open bin file: \"distil-whisper_distil-large-v2_quantized\\openvino_encoder_model.bin\"\n","output_type":"error","traceback":["\u001b[1;31m---------------------------------------------------------------------------\u001b[0m","\u001b[1;31mRuntimeError\u001b[0m Traceback (most recent call last)","Cell \u001b[1;32mIn[32], line 1\u001b[0m\n\u001b[1;32m----> 1\u001b[0m \u001b[43mget_ipython\u001b[49m\u001b[43m(\u001b[49m\u001b[43m)\u001b[49m\u001b[38;5;241;43m.\u001b[39;49m\u001b[43mrun_cell_magic\u001b[49m\u001b[43m(\u001b[49m\u001b[38;5;124;43m'\u001b[39;49m\u001b[38;5;124;43mskip\u001b[39;49m\u001b[38;5;124;43m'\u001b[39;49m\u001b[43m,\u001b[49m\u001b[43m \u001b[49m\u001b[38;5;124;43m'\u001b[39;49m\u001b[38;5;124;43mnot $to_quantize.value\u001b[39;49m\u001b[38;5;124;43m'\u001b[39;49m\u001b[43m,\u001b[49m\u001b[43m \u001b[49m\u001b[38;5;124;43m'\u001b[39;49m\u001b[38;5;130;43;01m\\n\u001b[39;49;00m\u001b[38;5;124;43mimport gc\u001b[39;49m\u001b[38;5;130;43;01m\\n\u001b[39;49;00m\u001b[38;5;124;43mimport shutil\u001b[39;49m\u001b[38;5;130;43;01m\\n\u001b[39;49;00m\u001b[38;5;124;43mimport nncf\u001b[39;49m\u001b[38;5;130;43;01m\\n\u001b[39;49;00m\u001b[38;5;130;43;01m\\n\u001b[39;49;00m\u001b[38;5;124;43mCALIBRATION_DATASET_SIZE = 20\u001b[39;49m\u001b[38;5;130;43;01m\\n\u001b[39;49;00m\u001b[38;5;124;43mquantized_model_path = Path(f\u001b[39;49m\u001b[38;5;124;43m\"\u001b[39;49m\u001b[38;5;132;43;01m{model_path}\u001b[39;49;00m\u001b[38;5;124;43m_quantized\u001b[39;49m\u001b[38;5;124;43m\"\u001b[39;49m\u001b[38;5;124;43m)\u001b[39;49m\u001b[38;5;130;43;01m\\n\u001b[39;49;00m\u001b[38;5;124;43mprint(quantized_model_path)\u001b[39;49m\u001b[38;5;130;43;01m\\n\u001b[39;49;00m\u001b[38;5;130;43;01m\\n\u001b[39;49;00m\u001b[38;5;124;43mdef quantize(ov_model: OVModelForSpeechSeq2Seq, calibration_dataset_size: int):\u001b[39;49m\u001b[38;5;130;43;01m\\n\u001b[39;49;00m\u001b[38;5;124;43m # if not quantized_model_path.exists():\u001b[39;49m\u001b[38;5;130;43;01m\\n\u001b[39;49;00m\u001b[38;5;124;43m encoder_calibration_data, decoder_calibration_data = collect_calibration_dataset(\u001b[39;49m\u001b[38;5;130;43;01m\\n\u001b[39;49;00m\u001b[38;5;124;43m ov_model, calibration_dataset_size\u001b[39;49m\u001b[38;5;130;43;01m\\n\u001b[39;49;00m\u001b[38;5;124;43m )\u001b[39;49m\u001b[38;5;130;43;01m\\n\u001b[39;49;00m\u001b[38;5;124;43m print(\u001b[39;49m\u001b[38;5;124;43m\"\u001b[39;49m\u001b[38;5;124;43mQuantizing encoder\u001b[39;49m\u001b[38;5;124;43m\"\u001b[39;49m\u001b[38;5;124;43m)\u001b[39;49m\u001b[38;5;130;43;01m\\n\u001b[39;49;00m\u001b[38;5;124;43m quantized_encoder = nncf.quantize(\u001b[39;49m\u001b[38;5;130;43;01m\\n\u001b[39;49;00m\u001b[38;5;124;43m ov_model.encoder.model,\u001b[39;49m\u001b[38;5;130;43;01m\\n\u001b[39;49;00m\u001b[38;5;124;43m nncf.Dataset(encoder_calibration_data),\u001b[39;49m\u001b[38;5;130;43;01m\\n\u001b[39;49;00m\u001b[38;5;124;43m subset_size=len(encoder_calibration_data),\u001b[39;49m\u001b[38;5;130;43;01m\\n\u001b[39;49;00m\u001b[38;5;124;43m model_type=nncf.ModelType.TRANSFORMER,\u001b[39;49m\u001b[38;5;130;43;01m\\n\u001b[39;49;00m\u001b[38;5;124;43m # Smooth Quant algorithm reduces activation quantization error; optimal alpha value was obtained through grid search\u001b[39;49m\u001b[38;5;130;43;01m\\n\u001b[39;49;00m\u001b[38;5;124;43m 
advanced_parameters=nncf.AdvancedQuantizationParameters(smooth_quant_alpha=0.50)\u001b[39;49m\u001b[38;5;130;43;01m\\n\u001b[39;49;00m\u001b[38;5;124;43m )\u001b[39;49m\u001b[38;5;130;43;01m\\n\u001b[39;49;00m\u001b[38;5;124;43m ov.save_model(quantized_encoder, quantized_model_path / \u001b[39;49m\u001b[38;5;124;43m\"\u001b[39;49m\u001b[38;5;124;43mopenvino_encoder_model.xml\u001b[39;49m\u001b[38;5;124;43m\"\u001b[39;49m\u001b[38;5;124;43m)\u001b[39;49m\u001b[38;5;130;43;01m\\n\u001b[39;49;00m\u001b[38;5;124;43m del quantized_encoder\u001b[39;49m\u001b[38;5;130;43;01m\\n\u001b[39;49;00m\u001b[38;5;124;43m del encoder_calibration_data\u001b[39;49m\u001b[38;5;130;43;01m\\n\u001b[39;49;00m\u001b[38;5;124;43m gc.collect()\u001b[39;49m\u001b[38;5;130;43;01m\\n\u001b[39;49;00m\u001b[38;5;130;43;01m\\n\u001b[39;49;00m\u001b[38;5;124;43m print(\u001b[39;49m\u001b[38;5;124;43m\"\u001b[39;49m\u001b[38;5;124;43mQuantizing decoder with past\u001b[39;49m\u001b[38;5;124;43m\"\u001b[39;49m\u001b[38;5;124;43m)\u001b[39;49m\u001b[38;5;130;43;01m\\n\u001b[39;49;00m\u001b[38;5;124;43m quantized_decoder_with_past = nncf.quantize(\u001b[39;49m\u001b[38;5;130;43;01m\\n\u001b[39;49;00m\u001b[38;5;124;43m ov_model.decoder_with_past.model,\u001b[39;49m\u001b[38;5;130;43;01m\\n\u001b[39;49;00m\u001b[38;5;124;43m nncf.Dataset(decoder_calibration_data),\u001b[39;49m\u001b[38;5;130;43;01m\\n\u001b[39;49;00m\u001b[38;5;124;43m subset_size=len(decoder_calibration_data),\u001b[39;49m\u001b[38;5;130;43;01m\\n\u001b[39;49;00m\u001b[38;5;124;43m model_type=nncf.ModelType.TRANSFORMER,\u001b[39;49m\u001b[38;5;130;43;01m\\n\u001b[39;49;00m\u001b[38;5;124;43m # Smooth Quant algorithm reduces activation quantization error; optimal alpha value was obtained through grid search\u001b[39;49m\u001b[38;5;130;43;01m\\n\u001b[39;49;00m\u001b[38;5;124;43m advanced_parameters=nncf.AdvancedQuantizationParameters(smooth_quant_alpha=0.95)\u001b[39;49m\u001b[38;5;130;43;01m\\n\u001b[39;49;00m\u001b[38;5;124;43m )\u001b[39;49m\u001b[38;5;130;43;01m\\n\u001b[39;49;00m\u001b[38;5;124;43m ov.save_model(quantized_decoder_with_past, quantized_model_path / \u001b[39;49m\u001b[38;5;124;43m\"\u001b[39;49m\u001b[38;5;124;43mopenvino_decoder_with_past_model.xml\u001b[39;49m\u001b[38;5;124;43m\"\u001b[39;49m\u001b[38;5;124;43m)\u001b[39;49m\u001b[38;5;130;43;01m\\n\u001b[39;49;00m\u001b[38;5;124;43m del quantized_decoder_with_past\u001b[39;49m\u001b[38;5;130;43;01m\\n\u001b[39;49;00m\u001b[38;5;124;43m del decoder_calibration_data\u001b[39;49m\u001b[38;5;130;43;01m\\n\u001b[39;49;00m\u001b[38;5;124;43m gc.collect()\u001b[39;49m\u001b[38;5;130;43;01m\\n\u001b[39;49;00m\u001b[38;5;130;43;01m\\n\u001b[39;49;00m\u001b[38;5;124;43m # Copy the config file and the first-step-decoder manually\u001b[39;49m\u001b[38;5;130;43;01m\\n\u001b[39;49;00m\u001b[38;5;124;43m shutil.copy(model_path / \u001b[39;49m\u001b[38;5;124;43m\"\u001b[39;49m\u001b[38;5;124;43mconfig.json\u001b[39;49m\u001b[38;5;124;43m\"\u001b[39;49m\u001b[38;5;124;43m, quantized_model_path / \u001b[39;49m\u001b[38;5;124;43m\"\u001b[39;49m\u001b[38;5;124;43mconfig.json\u001b[39;49m\u001b[38;5;124;43m\"\u001b[39;49m\u001b[38;5;124;43m)\u001b[39;49m\u001b[38;5;130;43;01m\\n\u001b[39;49;00m\u001b[38;5;124;43m shutil.copy(model_path / \u001b[39;49m\u001b[38;5;124;43m\"\u001b[39;49m\u001b[38;5;124;43mopenvino_decoder_model.xml\u001b[39;49m\u001b[38;5;124;43m\"\u001b[39;49m\u001b[38;5;124;43m, quantized_model_path / 
\u001b[39;49m\u001b[38;5;124;43m\"\u001b[39;49m\u001b[38;5;124;43mopenvino_decoder_model.xml\u001b[39;49m\u001b[38;5;124;43m\"\u001b[39;49m\u001b[38;5;124;43m)\u001b[39;49m\u001b[38;5;130;43;01m\\n\u001b[39;49;00m\u001b[38;5;124;43m shutil.copy(model_path / \u001b[39;49m\u001b[38;5;124;43m\"\u001b[39;49m\u001b[38;5;124;43mopenvino_decoder_model.bin\u001b[39;49m\u001b[38;5;124;43m\"\u001b[39;49m\u001b[38;5;124;43m, quantized_model_path / \u001b[39;49m\u001b[38;5;124;43m\"\u001b[39;49m\u001b[38;5;124;43mopenvino_decoder_model.bin\u001b[39;49m\u001b[38;5;124;43m\"\u001b[39;49m\u001b[38;5;124;43m)\u001b[39;49m\u001b[38;5;130;43;01m\\n\u001b[39;49;00m\u001b[38;5;130;43;01m\\n\u001b[39;49;00m\u001b[38;5;124;43m quantized_ov_model = OVModelForSpeechSeq2Seq.from_pretrained(quantized_model_path, ov_config=ov_config, compile=False)\u001b[39;49m\u001b[38;5;130;43;01m\\n\u001b[39;49;00m\u001b[38;5;124;43m quantized_ov_model.to(device.value)\u001b[39;49m\u001b[38;5;130;43;01m\\n\u001b[39;49;00m\u001b[38;5;124;43m quantized_ov_model.compile()\u001b[39;49m\u001b[38;5;130;43;01m\\n\u001b[39;49;00m\u001b[38;5;124;43m return quantized_ov_model\u001b[39;49m\u001b[38;5;130;43;01m\\n\u001b[39;49;00m\u001b[38;5;130;43;01m\\n\u001b[39;49;00m\u001b[38;5;130;43;01m\\n\u001b[39;49;00m\u001b[38;5;124;43mov_quantized_model = quantize(ov_model, CALIBRATION_DATASET_SIZE)\u001b[39;49m\u001b[38;5;130;43;01m\\n\u001b[39;49;00m\u001b[38;5;124;43m'\u001b[39;49m\u001b[43m)\u001b[49m\n","File \u001b[1;32m~\\miniforge3\\envs\\ov\\Lib\\site-packages\\IPython\\core\\interactiveshell.py:2541\u001b[0m, in \u001b[0;36mInteractiveShell.run_cell_magic\u001b[1;34m(self, magic_name, line, cell)\u001b[0m\n\u001b[0;32m 2539\u001b[0m \u001b[38;5;28;01mwith\u001b[39;00m \u001b[38;5;28mself\u001b[39m\u001b[38;5;241m.\u001b[39mbuiltin_trap:\n\u001b[0;32m 2540\u001b[0m args \u001b[38;5;241m=\u001b[39m (magic_arg_s, cell)\n\u001b[1;32m-> 2541\u001b[0m result \u001b[38;5;241m=\u001b[39m \u001b[43mfn\u001b[49m\u001b[43m(\u001b[49m\u001b[38;5;241;43m*\u001b[39;49m\u001b[43margs\u001b[49m\u001b[43m,\u001b[49m\u001b[43m \u001b[49m\u001b[38;5;241;43m*\u001b[39;49m\u001b[38;5;241;43m*\u001b[39;49m\u001b[43mkwargs\u001b[49m\u001b[43m)\u001b[49m\n\u001b[0;32m 2543\u001b[0m \u001b[38;5;66;03m# The code below prevents the output from being displayed\u001b[39;00m\n\u001b[0;32m 2544\u001b[0m \u001b[38;5;66;03m# when using magics with decorator @output_can_be_silenced\u001b[39;00m\n\u001b[0;32m 2545\u001b[0m \u001b[38;5;66;03m# when the last Python token in the expression is a ';'.\u001b[39;00m\n\u001b[0;32m 2546\u001b[0m \u001b[38;5;28;01mif\u001b[39;00m \u001b[38;5;28mgetattr\u001b[39m(fn, magic\u001b[38;5;241m.\u001b[39mMAGIC_OUTPUT_CAN_BE_SILENCED, \u001b[38;5;28;01mFalse\u001b[39;00m):\n","File \u001b[1;32mC:\\workshops\\samples\\openvino_notebooks\\notebooks\\distil-whisper-asr\\skip_kernel_extension.py:17\u001b[0m, in \u001b[0;36mskip\u001b[1;34m(line, cell)\u001b[0m\n\u001b[0;32m 11\u001b[0m \u001b[38;5;28;01mif\u001b[39;00m \u001b[38;5;28meval\u001b[39m(line):\n\u001b[0;32m 13\u001b[0m \u001b[38;5;28;01mreturn\u001b[39;00m\n\u001b[1;32m---> 17\u001b[0m \u001b[43mget_ipython\u001b[49m\u001b[43m(\u001b[49m\u001b[43m)\u001b[49m\u001b[38;5;241;43m.\u001b[39;49m\u001b[43mex\u001b[49m\u001b[43m(\u001b[49m\u001b[43mcell\u001b[49m\u001b[43m)\u001b[49m\n","File \u001b[1;32m~\\miniforge3\\envs\\ov\\Lib\\site-packages\\IPython\\core\\interactiveshell.py:2878\u001b[0m, in \u001b[0;36mInteractiveShell.ex\u001b[1;34m(self, cmd)\u001b[0m\n\u001b[0;32m 
2876\u001b[0m \u001b[38;5;250m\u001b[39m\u001b[38;5;124;03m\"\"\"Execute a normal python statement in user namespace.\"\"\"\u001b[39;00m\n\u001b[0;32m 2877\u001b[0m \u001b[38;5;28;01mwith\u001b[39;00m \u001b[38;5;28mself\u001b[39m\u001b[38;5;241m.\u001b[39mbuiltin_trap:\n\u001b[1;32m-> 2878\u001b[0m exec(cmd, \u001b[38;5;28mself\u001b[39m\u001b[38;5;241m.\u001b[39muser_global_ns, \u001b[38;5;28mself\u001b[39m\u001b[38;5;241m.\u001b[39muser_ns)\n","File \u001b[1;32m:54\u001b[0m\n","File \u001b[1;32m:24\u001b[0m, in \u001b[0;36mquantize\u001b[1;34m(ov_model, calibration_dataset_size)\u001b[0m\n","\u001b[1;31mRuntimeError\u001b[0m: Check 'bin_file' failed at src\\core\\src\\pass\\serialize.cpp:1210:\nCan't open bin file: \"distil-whisper_distil-large-v2_quantized\\openvino_encoder_model.bin\"\n"]}],"source":["%%skip not $to_quantize.value\n","\n","import gc\n","import shutil\n","import nncf\n","\n","CALIBRATION_DATASET_SIZE = 20\n","quantized_model_path = Path(f\"{model_path}_quantized\")\n","print(quantized_model_path)\n","\n","def quantize(ov_model: OVModelForSpeechSeq2Seq, calibration_dataset_size: int):\n"," # if not quantized_model_path.exists():\n"," encoder_calibration_data, decoder_calibration_data = collect_calibration_dataset(\n"," ov_model, calibration_dataset_size\n"," )\n"," print(\"Quantizing encoder\")\n"," quantized_encoder = nncf.quantize(\n"," ov_model.encoder.model,\n"," nncf.Dataset(encoder_calibration_data),\n"," subset_size=len(encoder_calibration_data),\n"," model_type=nncf.ModelType.TRANSFORMER,\n"," # Smooth Quant algorithm reduces activation quantization error; optimal alpha value was obtained through grid search\n"," advanced_parameters=nncf.AdvancedQuantizationParameters(smooth_quant_alpha=0.50)\n"," )\n"," ov.save_model(quantized_encoder, quantized_model_path / \"openvino_encoder_model.xml\")\n"," del quantized_encoder\n"," del encoder_calibration_data\n"," gc.collect()\n","\n"," print(\"Quantizing decoder with past\")\n"," quantized_decoder_with_past = nncf.quantize(\n"," ov_model.decoder_with_past.model,\n"," nncf.Dataset(decoder_calibration_data),\n"," subset_size=len(decoder_calibration_data),\n"," model_type=nncf.ModelType.TRANSFORMER,\n"," # Smooth Quant algorithm reduces activation quantization error; optimal alpha value was obtained through grid search\n"," advanced_parameters=nncf.AdvancedQuantizationParameters(smooth_quant_alpha=0.95)\n"," )\n"," ov.save_model(quantized_decoder_with_past, quantized_model_path / \"openvino_decoder_with_past_model.xml\")\n"," del quantized_decoder_with_past\n"," del decoder_calibration_data\n"," gc.collect()\n","\n"," # Copy the config file and the first-step-decoder manually\n"," shutil.copy(model_path / \"config.json\", quantized_model_path / \"config.json\")\n"," shutil.copy(model_path / \"openvino_decoder_model.xml\", quantized_model_path / \"openvino_decoder_model.xml\")\n"," shutil.copy(model_path / \"openvino_decoder_model.bin\", quantized_model_path / \"openvino_decoder_model.bin\")\n","\n"," quantized_ov_model = OVModelForSpeechSeq2Seq.from_pretrained(quantized_model_path, ov_config=ov_config, compile=False)\n"," quantized_ov_model.to(device.value)\n"," quantized_ov_model.compile()\n"," return quantized_ov_model\n","\n","\n","ov_quantized_model = quantize(ov_model, CALIBRATION_DATASET_SIZE)"]},{"cell_type":"markdown","id":"b06ca107","metadata":{"collapsed":false,"jupyter":{"outputs_hidden":false},"id":"b06ca107"},"source":["### Run quantized model inference\n","[back to top ⬆️](#Table-of-contents:)\n","\n","Let's 
compare the transcription results for original and quantized models."]},{"cell_type":"code","execution_count":null,"id":"7b6eed2a","metadata":{"ExecuteTime":{"end_time":"2023-11-08T16:12:18.722391800Z","start_time":"2023-11-08T16:12:13.634791500Z"},"jupyter":{"outputs_hidden":false},"id":"7b6eed2a","outputId":"e85f5ef5-42c4-4df0-cd7c-05d1b73b468b"},"outputs":[{"data":{"text/html":["\n"," \n"," "],"text/plain":[""]},"metadata":{},"output_type":"display_data"},{"name":"stdout","output_type":"stream","text":["Original : Mr. Quilter is the apostle of the middle classes, and we are glad to welcome his gospel.\n","Quantized: Mr. Quilter is the apostle of the middle classes, and we are glad to welcome his gospel.\n"]}],"source":["%%skip not $to_quantize.value\n","\n","dataset = load_dataset(\n"," \"hf-internal-testing/librispeech_asr_dummy\", \"clean\", split=\"validation\"\n",")\n","sample = dataset[0]\n","input_features = extract_input_features(sample)\n","\n","predicted_ids = ov_model.generate(input_features)\n","transcription_original = processor.batch_decode(predicted_ids, skip_special_tokens=True)\n","\n","predicted_ids = ov_quantized_model.generate(input_features)\n","transcription_quantized = processor.batch_decode(predicted_ids, skip_special_tokens=True)\n","\n","display(ipd.Audio(sample[\"audio\"][\"array\"], rate=sample[\"audio\"][\"sampling_rate\"]))\n","print(f\"Original : {transcription_original[0]}\")\n","print(f\"Quantized: {transcription_quantized[0]}\")"]},{"cell_type":"markdown","id":"3228cf53","metadata":{"collapsed":false,"jupyter":{"outputs_hidden":false},"id":"3228cf53"},"source":["Results are the same!"]},{"cell_type":"markdown","id":"c68cb960","metadata":{"collapsed":false,"jupyter":{"outputs_hidden":false},"id":"c68cb960"},"source":["### Compare performance and accuracy of the original and quantized models\n","[back to top ⬆️](#Table-of-contents:)\n","\n","Finally, we compare original and quantized Distil-Whisper models from accuracy and performance stand-points.\n","\n","To measure accuracy, we use `1 - WER` as a metric, where WER stands for Word Error Rate.\n","\n","When measuring inference time, we do it separately for encoder and decoder-with-past model forwards, and for the whole model inference too."]},{"cell_type":"code","execution_count":null,"id":"7133f52f","metadata":{"ExecuteTime":{"end_time":"2023-11-08T16:15:20.910568900Z","start_time":"2023-11-08T16:12:18.721295800Z"},"jupyter":{"outputs_hidden":false},"test_replace":{"TEST_DATASET_SIZE = 50":"TEST_DATASET_SIZE = 1"},"id":"7133f52f","outputId":"1825f347-2137-4bbe-f6b6-b4f9979e4ef4","colab":{"referenced_widgets":["3660265ca7ce40838bfda651d40aa42e","62cf5d99b7e24c9abeb7d0615dd564ee"]}},"outputs":[{"data":{"application/vnd.jupyter.widget-view+json":{"model_id":"3660265ca7ce40838bfda651d40aa42e","version_major":2,"version_minor":0},"text/plain":["Measuring performance and accuracy: 0%| | 0/50 [00:00 MAX_AUDIO_MINS:\n"," raise gr.Error(\n"," f\"To ensure fair usage of the Space, the maximum audio length permitted is {MAX_AUDIO_MINS} minutes.\"\n"," f\"Got an audio of length {round(audio_length_mins, 3)} minutes.\"\n"," )\n","\n"," inputs = {\"array\": inputs, \"sampling_rate\": pipe.feature_extractor.sampling_rate}\n","\n"," def _forward_ov_time(*args, **kwargs):\n"," global ov_time\n"," start_time = time.time()\n"," result = pipe_forward(*args, **kwargs)\n"," ov_time = time.time() - start_time\n"," ov_time = round(ov_time, 2)\n"," return result\n","\n"," pipe._forward = _forward_ov_time\n"," ov_text = 
pipe(inputs.copy(), batch_size=BATCH_SIZE)[\"text\"]\n"," return ov_text, ov_time\n","\n","\n","with gr.Blocks() as demo:\n"," gr.HTML(\n"," \"\"\"\n","
\n"," \n","

\n"," OpenVINO Distil-Whisper demo\n","

\n","
\n"," \n"," \"\"\"\n"," )\n"," audio = gr.components.Audio(type=\"filepath\", label=\"Audio input\")\n"," with gr.Row():\n"," button = gr.Button(\"Transcribe\")\n"," if to_quantize.value:\n"," button_q = gr.Button(\"Transcribe quantized\")\n"," with gr.Row():\n"," infer_time = gr.components.Textbox(label=\"OpenVINO Distil-Whisper Transcription Time (s)\")\n"," if to_quantize.value:\n"," infer_time_q = gr.components.Textbox(label=\"OpenVINO Quantized Distil-Whisper Transcription Time (s)\")\n"," with gr.Row():\n"," transcription = gr.components.Textbox(label=\"OpenVINO Distil-Whisper Transcription\", show_copy_button=True)\n"," if to_quantize.value:\n"," transcription_q = gr.components.Textbox(\n"," label=\"OpenVINO Quantized Distil-Whisper Transcription\",\n"," show_copy_button=True,\n"," )\n"," button.click(\n"," fn=transcribe,\n"," inputs=audio,\n"," outputs=[transcription, infer_time],\n"," )\n"," if to_quantize.value:\n"," button_q.click(\n"," fn=transcribe,\n"," inputs=[audio, gr.Number(value=1, visible=False)],\n"," outputs=[transcription_q, infer_time_q],\n"," )\n"," gr.Markdown(\"## Examples\")\n"," gr.Examples(\n"," [[\"./example_1.wav\"]],\n"," audio,\n"," outputs=[transcription, infer_time],\n"," fn=transcribe,\n"," cache_examples=False,\n"," )\n","# if you are launching remotely, specify server_name and server_port\n","# demo.launch(server_name='your server name', server_port='server port in int')\n","# Read more in the docs: https://gradio.app/docs/\n","try:\n"," demo.launch(debug=True)\n","except Exception:\n"," demo.launch(share=True, debug=True)"]},{"cell_type":"code","execution_count":null,"id":"d8529f39-6cdc-477a-82aa-ee53b8549f02","metadata":{"id":"d8529f39-6cdc-477a-82aa-ee53b8549f02"},"outputs":[],"source":[]}],"metadata":{"kernelspec":{"display_name":"Python 3 (ipykernel)","language":"python","name":"python3"},"language_info":{"codemirror_mode":{"name":"ipython","version":3},"file_extension":".py","mimetype":"text/x-python","name":"python","nbconvert_exporter":"python","pygments_lexer":"ipython3","version":"3.11.9"},"openvino_notebooks":{"imageUrl":"https://github.com/openvinotoolkit/openvino_notebooks/assets/29454499/52c58b58-7730-48d2-803d-4af0b6115499","tags":{"categories":["Model Demos","AI Trends"],"libraries":[],"other":[],"tasks":["Speech Recognition"]}},"widgets":{"application/vnd.jupyter.widget-state+json":{}},"colab":{"provenance":[]}},"nbformat":4,"nbformat_minor":5}