Initializing audio transcriptor

- README.md +190 -13
- demo.py +47 -0
- pyscript/__init__.py +3 -0
- pyscript/audio_processing.py +193 -0
- pyscript/audio_recording.py +77 -0
- pyscript/transcription.py +110 -0
- pyscript/transcriptor.py +163 -0
- requirements.txt +15 -0
README.md CHANGED
@@ -1,13 +1,190 @@
# Audio Transcription and Diarization Tool

## Overview

This project provides a robust set of tools for transcribing audio files using the Whisper model and performing speaker diarization with PyAnnote. Users can process audio files, record audio, and save transcriptions with speaker identification.

## Table of Contents

- [Features](#features)
- [Requirements](#requirements)
- [Setup](#setup)
- [Usage](#usage)
  - [Basic Example](#basic-example)
  - [Audio Processing Example](#audio-processing-example)
  - [Transcribing an Existing Audio File or Recording](#transcribing-an-existing-audio-file-or-recording)
- [Key Components](#key-components)
  - [Transcriptor](#transcriptor)
  - [AudioProcessor](#audioprocessor)
  - [AudioRecording](#audiorecording)
- [Contributing](#contributing)
- [Acknowledgments](#acknowledgments)

## Features

- **Transcription**: Convert audio files in various formats to text (automatically converts to WAV).
- **Speaker Diarization**: Identify different speakers in the audio.
- **Speaker Retrieval**: Name speakers during transcription.
- **Audio Recording**: Record audio directly from a microphone.
- **Audio Preprocessing**: Includes resampling, format conversion, and audio enhancement.
- **Multiple Model Support**: Choose from various Whisper model sizes.

## Supported Whisper Models

This tool supports various Whisper model sizes, allowing you to balance accuracy and computational resources:

- **`tiny`**: Fastest, lowest accuracy
- **`base`**: Fast, good accuracy
- **`small`**: Balanced speed and accuracy
- **`medium`**: High accuracy, slower
- **`large`**: High accuracy, resource-intensive
- **`large-v1`**: Improved large model
- **`large-v2`**: Further improved large model
- **`large-v3`**: Latest and most accurate
- **`large-v3-turbo`**: Optimized for faster processing

Specify the model size when initializing the Transcriptor:

```python
transcriptor = Transcriptor(model_size="base")
```

The default model size is `"base"` if not specified.

## Requirements

To run this project, you need Python 3.7+ and the following packages:

```plaintext
- openai-whisper
- pyannote.audio
- librosa
- tqdm
- python-dotenv
- termcolor
- pydub
- SpeechRecognition
- pyaudio
- tabulate
- soundfile
- torch
- numpy
- transformers
- gradio
```

Install the required packages using:

```bash
pip install -r requirements.txt
```

## Setup

1. **Clone the repository**:
   ```bash
   git clone https://github.com/your-username/audio-transcription-tool.git
   cd audio-transcription-tool
   ```

2. **Install the required packages**:
   ```bash
   pip install -r requirements.txt
   ```

3. **Set up your environment variables**:
   - Create a `.env` file in the root directory.
   - Add your Hugging Face token:
     ```plaintext
     HF_TOKEN=your_hugging_face_token_here
     ```

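To confirm that the token is actually picked up before running the tool, a quick check along the following lines can help (a minimal sketch; it assumes the `.env` file sits in the directory you launch from, matching how `load_dotenv()` is used in `pyscript/transcriptor.py`):

```python
# Minimal sanity check for the Hugging Face token (illustrative).
import os
from dotenv import load_dotenv

load_dotenv()  # reads .env from the current working directory
if os.getenv("HF_TOKEN"):
    print("HF_TOKEN loaded successfully.")
else:
    raise SystemExit("HF_TOKEN not found - check your .env file.")
```
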
## Usage

### Basic Example

Here's how to use the Transcriptor class to transcribe an audio file:

```python
from pyscript import Transcriptor

# Initialize the Transcriptor
transcriptor = Transcriptor()

# Transcribe an audio file
transcription = transcriptor.transcribe_audio("/path/to/audio")

# Interactively name speakers
transcription.get_name_speakers()

# Save the transcription
transcription.save()
```

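By default, `save()` writes the transcript under a `transcripts/` folder; per `pyscript/transcription.py` it also accepts an output directory, so a call like the following should work:

```python
# Save the transcript to a custom folder instead of the default "transcripts"
transcription.save(directory="my_transcripts")
```
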
### Audio Processing Example

Use the AudioProcessor class to preprocess your audio files:

```python
from pyscript import AudioProcessor

# Load an audio file
audio = AudioProcessor("/path/to/audio.mp3")

# Display audio details
audio.display_details()

# Convert to WAV format and resample to 16000 Hz
audio.convert_to_wav()

# Display updated audio details
audio.display_changes()
```

### Transcribing an Existing Audio File or Recording

To transcribe an audio file or record and transcribe audio, use the demo application provided in `demo.py`:

```bash
python demo.py
```

## Key Components

### Transcriptor

The `Transcriptor` class (in `pyscript/transcriptor.py`) is the core of the transcription process. It handles:

- Loading the Whisper model
- Setting up the diarization pipeline
- Processing audio files
- Performing transcription and diarization

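As a rough end-to-end sketch, `transcribe_audio` also exposes the `enhanced` flag used by `demo.py`, which runs the audio-enhancement step before diarization and transcription (the model size and path below are illustrative):

```python
from pyscript import Transcriptor

# Larger models are more accurate but slower; see the model list above
transcriptor = Transcriptor(model_size="small")

# enhanced=True applies noise reduction, voice enhancement and volume boosting first
transcription = transcriptor.transcribe_audio("/path/to/audio.wav", enhanced=True)
print(transcription)
```
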
### AudioProcessor

The `AudioProcessor` class (in `pyscript/audio_processing.py`) manages audio file preprocessing, including:

- Loading audio files
- Resampling
- Converting to WAV format
- Displaying audio file details and changes
- Audio enhancement (noise reduction, voice enhancement, volume boost)

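The enhancement step can also be driven directly. The sketch below follows the methods defined in `pyscript/audio_processing.py`; note that `optimize_enhancement_parameters()` runs a grid search over a short sample of the file and can take a while:

```python
from pyscript import AudioProcessor

audio = AudioProcessor("/path/to/audio.wav")

# Search for enhancement parameters on a sample of the file, then apply them
noise_strength, voice_strength, volume_boost = audio.optimize_enhancement_parameters()
enhanced_path = audio.enhance_audio(
    noise_reduce_strength=noise_strength,
    voice_enhance_strength=voice_strength,
    volume_boost=volume_boost,
)
print(f"Enhanced file written to {enhanced_path}")
```
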
### AudioRecording

The `audio_recording.py` module provides functions for recording audio from a microphone, checking input devices, and saving audio files.

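These helpers are not re-exported from the package `__init__.py`, so the sketch below imports them from the module directly; the device index is an assumption, and `check_input_device()` can be run first to see which microphones work on your machine:

```python
from pyscript.audio_recording import check_input_device, micro_recording

# List working and non-working input devices
check_input_device()

# Record from the default device (index 0) and save under audio_files/
recorded_path = micro_recording(save_folder_path="audio_files", device_index=0)
print(f"Recording saved to {recorded_path}")
```
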
## Contributing

Contributions are welcome! Please follow these steps:

1. Fork the repository
2. Create a new branch: `git checkout -b feature-branch-name`
3. Make your changes and commit them: `git commit -m 'Add some feature'`
4. Push to the branch: `git push origin feature-branch-name`
5. Submit a pull request

## Acknowledgments

- OpenAI for the Whisper model
- PyAnnote for the speaker diarization pipeline
- All contributors and users of this project
demo.py ADDED
@@ -0,0 +1,47 @@
import gradio as gr
from pyscript import Transcriptor

demo = gr.Blocks()
transcriptor = Transcriptor(model_size="large-v3-turbo")

microphone_transcribe = gr.Interface(
    fn=transcriptor.transcribe_audio,
    inputs=[
        gr.Audio(sources="microphone", type="filepath", label="Microphone"),
        gr.Radio([True, False], value=True, label="Enable audio enhancement"),
    ],
    outputs=[
        gr.Textbox(label="Transcription"),
        # gr.File(label="Download Transcription"),
        # gr.Textbox(label="Console Output", lines=10)
    ],
    title="Audio-Transcription leveraging Whisper Model",
    description=(
        "Transcribe microphone recording or audio inputs and return the transcription with speaker diarization."
    ),
    allow_flagging="never",
)

file_transcribe = gr.Interface(
    fn=transcriptor.transcribe_audio,
    inputs=[
        gr.Audio(sources="upload", type="filepath", label="Audio file"),
        gr.Radio([True, False], value=True, label="Enable audio enhancement"),
    ],
    outputs=[
        gr.Textbox(label="Transcription"),
        # gr.File(label="Download Transcription"),
        # gr.Textbox(label="Console Output", lines=10)
    ],
    title="Audio-Transcription leveraging Whisper Model",
    description=(
        "Transcribe microphone recording or audio inputs and return the transcription with speaker diarization."
    ),
    allow_flagging="never",
)


with demo:
    gr.TabbedInterface([microphone_transcribe, file_transcribe], ["Microphone", "Audio file"])

demo.queue().launch()
pyscript/__init__.py ADDED
@@ -0,0 +1,3 @@
from .transcriptor import Transcriptor
from .audio_processing import AudioProcessor
__all__ = ["Transcriptor", "AudioProcessor"]
pyscript/audio_processing.py ADDED
@@ -0,0 +1,193 @@
import os
import librosa
import numpy as np
from tabulate import tabulate
import soundfile as sf
import scipy.ndimage
import itertools
from scipy.stats import pearsonr
from tqdm import tqdm

class AudioProcessor:

    def __init__(self, audio_file):
        self.path = audio_file
        self.name = os.path.splitext(os.path.basename(audio_file))[0]
        self.format = os.path.splitext(os.path.basename(audio_file))[1]
        self.duration = librosa.get_duration(path=audio_file)
        self.sample_rate = librosa.get_samplerate(audio_file)
        self.changes = []
        self.optimized_params = None
        self.load_details()

    # File information methods
    def load_details(self):
        """Save the attributes of the audio file."""
        data = [
            ["File Name", self.name],
            ["File Format", self.format],
            ["Duration", f"{self.duration} seconds"],
            ["Sample Rate", f"{self.sample_rate} Hz"]
        ]
        table = tabulate(data, headers=["Attribute", "Value"], tablefmt="outline")
        self.changes.append(table)
        return table

    def display_details(self):
        """Display the details of the audio file."""
        print(self.changes[-1])

    def display_changes(self):
        """Display the changes made to the audio file side by side."""
        self._clean_duplicates_changes()
        if len(self.changes) == 1:
            self.display_details()
        else:
            table1 = self.changes[0].split('\n')
            table2 = self.changes[-1].split('\n')

            combined_table = []
            for line1, line2 in zip(table1, table2):
                combined_table.append([line1, '===>', line2])

            print(tabulate(combined_table, tablefmt="plain"))

    def _clean_duplicates_changes(self):
        """Remove duplicate consecutive changes from the audio file."""
        self.changes = [change for i, change in enumerate(self.changes)
                        if i == 0 or change != self.changes[i-1]]

    # Audio processing methods
    def load_as_array(self, sample_rate: int = 16000) -> np.ndarray:
        """
        Load an audio file and convert it into a NumPy array.

        Parameters
        ----------
        sample_rate : int, optional
            The sample rate to which the audio will be resampled (default is 16000 Hz).

        Returns
        -------
        np.ndarray
            A NumPy array containing the audio data.
        """
        try:
            audio, sr = librosa.load(self.path, sr=sample_rate)
            self.sample_rate = sr
            return audio
        except Exception as e:
            raise RuntimeError(f"Failed to load audio file: {e}")

    def resample_wav(self) -> str:
        output_path = os.path.join('resampled_files', f'{self.name}.wav')
        try:
            audio, sr = librosa.load(self.path)
            resampled_audio = librosa.resample(y=audio, orig_sr=sr, target_sr=16000)
            os.makedirs(os.path.dirname(output_path), exist_ok=True)
            sf.write(output_path, resampled_audio, 16000)
            self._update_file_info(output_path)
            return output_path
        except Exception as e:
            raise RuntimeError(f"Failed to resample audio file: {e}")

    def convert_to_wav(self):
        """
        Converts an audio file to WAV format.

        Returns
        -------
        str
            The path to the converted audio file.
        """
        output_path = os.path.join('converted_files', f'{self.name}.wav')
        try:
            os.makedirs(os.path.dirname(output_path), exist_ok=True)
            audio, sr = librosa.load(self.path, sr=16000)
            sf.write(output_path, audio, 16000)
            self._update_file_info(output_path)
            return output_path
        except Exception as e:
            raise RuntimeError(f"Failed to convert audio file to WAV: {e}")

    def enhance_audio(self, noise_reduce_strength=0.5, voice_enhance_strength=1.5, volume_boost=1.2):
        """
        Enhance audio quality by reducing noise and clarifying voices.
        """
        try:
            y, sr = librosa.load(self.path, sr=16000)
            y_enhanced = self._enhance_audio_sample(y, noise_reduce_strength, voice_enhance_strength, volume_boost)

            output_path = os.path.join('enhanced_files', f'{self.name}_enhanced.wav')
            os.makedirs(os.path.dirname(output_path), exist_ok=True)
            sf.write(output_path, y_enhanced, sr)

            self._update_file_info(output_path)
            return output_path
        except Exception as e:
            raise RuntimeError(f"Failed to enhance audio: {e}")

    def optimize_enhancement_parameters(self, step=0.25, max_iterations=50, sample_duration=30):
        """
        Find optimal parameters for audio enhancement using grid search on a sample.
        """
        y_orig, sr = librosa.load(self.path, duration=sample_duration)

        param_ranges = [
            np.arange(0.25, 1.5, step),  # noise_reduce_strength
            np.arange(1.0, 3.0, step),   # voice_enhance_strength
            np.arange(1.0, 2.0, step)    # volume_boost
        ]

        best_score = float('-inf')
        best_params = None

        total_iterations = min(max_iterations, len(list(itertools.product(*param_ranges))))

        for params in tqdm(itertools.islice(itertools.product(*param_ranges), max_iterations),
                           total=total_iterations,
                           desc="Searching for optimal parameters"):
            y_enhanced = self._enhance_audio_sample(y_orig, *params)

            min_length = min(len(y_orig), len(y_enhanced))
            y_orig_trimmed = y_orig[:min_length]
            y_enhanced_trimmed = y_enhanced[:min_length]

            correlation, _ = pearsonr(y_orig_trimmed, y_enhanced_trimmed)

            S_orig = np.abs(librosa.stft(y_orig_trimmed))
            S_enhanced = np.abs(librosa.stft(y_enhanced_trimmed))
            contrast_improvement = np.mean(librosa.feature.spectral_contrast(S=S_enhanced)) - np.mean(librosa.feature.spectral_contrast(S=S_orig))

            score = correlation + 0.5 * contrast_improvement

            if score > best_score:
                best_score = score
                best_params = params

        self.optimized_params = best_params
        return best_params

    def _enhance_audio_sample(self, y, noise_reduce_strength=0.5, voice_enhance_strength=1.5, volume_boost=1.2):
        S = librosa.stft(y)
        S_mag, S_phase = np.abs(S), np.angle(S)
        S_filtered = scipy.ndimage.median_filter(S_mag, size=(1, 31))

        mask = np.clip((S_mag - S_filtered) / (S_mag + 1e-10), 0, 1) ** noise_reduce_strength
        S_denoised = S_mag * mask * np.exp(1j * S_phase)

        y_denoised = librosa.istft(S_denoised)

        y_harmonic, y_percussive = librosa.effects.hpss(y_denoised)
        y_enhanced = (y_harmonic * voice_enhance_strength + y_percussive) * volume_boost

        return librosa.util.normalize(y_enhanced, norm=np.inf, threshold=1.0)

    # Helper method
    def _update_file_info(self, new_path):
        """Update file information after processing."""
        self.path = new_path
        self.sample_rate = librosa.get_samplerate(new_path)
        self.format = os.path.splitext(new_path)[1]
        self.duration = librosa.get_duration(path=new_path)
        self.load_details()
pyscript/audio_recording.py ADDED
@@ -0,0 +1,77 @@
import speech_recognition as sr
import os
import datetime
from termcolor import colored
from tabulate import tabulate

def micro_recording(save_folder_path: str = "audio_files", file_name: str = None, device_index: int = 0) -> str:
    """Records audio from a microphone and saves it to a designated file."""
    r = sr.Recognizer()
    mic = sr.Microphone(device_index=device_index)

    print_colored_separator("Starting microphone recording...", "green")

    with mic as source:
        print_colored("Recording...", "yellow")
        audio = r.listen(source)
        print_colored("Recording finished.", "green")

    saved_path = save_audio_file(audio, save_folder_path, file_name)

    print_colored_separator(f"Audio file saved to: {saved_path}", "green")
    return saved_path

def check_input_device(test_duration: int = 1) -> dict:
    """Checks the available microphone devices."""
    devices = sr.Microphone.list_microphone_names()
    available_devices, non_working_devices = [], []

    for i, device in enumerate(devices):
        try:
            with sr.Microphone(device_index=i) as source:
                sr.Recognizer().listen(source, timeout=test_duration)
            available_devices.append(device)
        except sr.WaitTimeoutError:
            non_working_devices.append(device)
        except Exception as e:
            print(f"An error occurred while testing device {device}: {e}")

    print_device_table("Available Devices", available_devices)
    print_device_table("Non-Working Devices", non_working_devices)

    return {'available_devices': available_devices, 'non_working_devices': non_working_devices}

def save_audio_file(audio, save_folder_path: str, file_name: str = None) -> str:
    """Saves the audio file to the specified path."""
    os.makedirs(save_folder_path, exist_ok=True)

    if not file_name:
        timestamp = datetime.datetime.now().strftime("%Y%m%d-%H%M%S")
        file_name = f"recording_{timestamp}.wav"
    else:
        file_name = f"{file_name}.wav"

    saved_path = os.path.join(save_folder_path, file_name)

    with open(saved_path, "wb") as f:
        f.write(audio.get_wav_data())

    print_colored("Saving audio file...", "yellow")
    return saved_path

def print_colored(message: str, color: str):
    """Prints a colored message."""
    print(colored(message, color))

def print_colored_separator(message: str, color: str):
    """Prints a colored message with separators."""
    print("--------------------------------")
    print_colored(message, color)
    print("--------------------------------")

def print_device_table(title: str, devices: list):
    """Prints a table of devices."""
    device_table = [[i+1, device] for i, device in enumerate(devices)]
    print(f"\n{title}:")
    print(tabulate(device_table, headers=["Index", "Device Name"]))
pyscript/transcription.py ADDED
@@ -0,0 +1,110 @@
import os
from itertools import cycle
from termcolor import colored

class Transcription:
    """
    A class for storing and saving transcriptions.

    Attributes:
    -----------
    audio_file_path : str
        The path to the audio file that was transcribed.
    filename : str
        The name of the audio file, without the extension.
    transcriptions : list[str]
        A list of tuples containing the speaker's label and their corresponding transcription, grouped by speaker.
    speaker_names : dict
        A dictionary mapping speaker labels to their assigned names.
    segments : list
        A list of segments from diarization.
    """

    def __init__(self, audio_file_path: str, transcriptions: list[str], segments: list[str]):
        self.audio_file_path = audio_file_path
        self.filename = os.path.splitext(os.path.basename(audio_file_path))[0]
        self.transcriptions = self.group_by_speaker(transcriptions)
        self.speaker_names = {}
        self.segments = segments
        self.colors = cycle(['red', 'green', 'blue', 'magenta', 'cyan', 'yellow'])

    def __repr__(self) -> str:
        result = []
        for speaker, text in self.transcriptions:
            speaker_name = self.speaker_names.get(speaker, speaker)
            result.append(f"{speaker_name}:\n{text}")
        return "\n\n".join(result)

    def group_by_speaker(self, transcriptions: list[str]) -> list[str]:
        """
        Groups transcriptions by speaker.

        Parameters
        ----------
        transcriptions : list[str]
            A list of tuples containing the speaker's label and their corresponding transcription.

        Returns
        -------
        list[str]
            A list of tuples containing the speaker's label and their corresponding transcription, grouped by speaker.
        """
        speaker_transcriptions = []
        previous_speaker = transcriptions[0][0]
        speaker_text = ""
        for speaker, text in transcriptions:
            if speaker == previous_speaker:
                speaker_text += text
            else:
                speaker_transcriptions.append((previous_speaker, speaker_text))
                speaker_text = text
            previous_speaker = speaker
        speaker_transcriptions.append((previous_speaker, speaker_text))
        return speaker_transcriptions

    def save(self, directory: str = "transcripts") -> None:
        """
        Saves the transcription to a text file.

        Parameters
        ----------
        directory : str, optional
            The directory to save the transcription to. Defaults to "transcripts".
        """
        if not self.transcriptions:
            raise ValueError("No transcriptions available to save.")

        os.makedirs(directory, exist_ok=True)
        saving_path = os.path.join(directory, f"{self.filename}_transcript.txt")

        with open(saving_path, 'w', encoding='utf-8') as f:
            for speaker, text in self.transcriptions:
                if text:
                    speaker_name = self.speaker_names.get(speaker, speaker)
                    f.write(f"{speaker_name}: {text}\n")

        print(f"Transcription saved to {saving_path}")

    def get_name_speakers(self) -> None:
        """
        Interactively assigns names to speakers in the transcriptions and retrieves the name of the speaker.
        Provides a preview of one sentence for each speaker to help recognize who is speaking.
        """
        for speaker, full_text in self.transcriptions:
            if speaker in self.speaker_names:
                continue

            preview = full_text.split('.')[0] + '.'
            print(f"\nCurrent speaker: {speaker}")
            print(f"Preview: {preview}")

            new_name = input(f"Enter a name for {speaker} (or press Enter to skip): ").strip()
            if new_name:
                self.speaker_names[speaker] = new_name
                print(f"Speaker {speaker} renamed to {new_name}")
            else:
                print(f"Skipped renaming {speaker}")

        print("\nSpeaker naming completed.")
        print(f"Updated speaker names: {self.speaker_names}")
pyscript/transcriptor.py ADDED
@@ -0,0 +1,163 @@
import os
from dotenv import load_dotenv
import whisper
from pyannote.audio import Pipeline
import torch
from tqdm import tqdm
from time import time
from transformers import pipeline
from .transcription import Transcription
from .audio_processing import AudioProcessor

load_dotenv()

class Transcriptor:
    """
    A class for transcribing and diarizing audio files.

    This class uses the Whisper model for transcription and the PyAnnote speaker diarization pipeline for speaker identification.

    Attributes
    ----------
    model_size : str
        The size of the Whisper model to use for transcription. Available options are:
        - 'tiny': Fastest, lowest accuracy
        - 'base': Fast, good accuracy for many use cases
        - 'small': Balanced speed and accuracy
        - 'medium': High accuracy, slower than smaller models
        - 'large': High accuracy, slower and more resource-intensive
        - 'large-v1': Improved version of the large model
        - 'large-v2': Further improved version of the large model
        - 'large-v3': Latest and most accurate version of the large model
        - 'large-v3-turbo': Optimized version of the large-v3 model for faster processing
    model : whisper.model.Whisper
        The Whisper model for transcription.
    pipeline : pyannote.audio.pipelines.SpeakerDiarization
        The PyAnnote speaker diarization pipeline.

    Usage:
    >>> transcript = Transcriptor(model_size="large-v3")
    >>> transcription = transcript.transcribe_audio("/path/to/audio.wav")
    >>> transcription.get_name_speakers()
    >>> transcription.save("/path/to/transcripts")

    Note:
    Larger models, especially 'large-v3', provide higher accuracy but require more
    computational resources and may be slower to process audio.
    """

    def __init__(self, model_size: str = "base"):
        self.model_size = model_size
        self.HF_TOKEN = os.getenv("HF_TOKEN")
        if not self.HF_TOKEN:
            raise ValueError("HF_TOKEN not found. Please store token in .env")
        self._setup()

    def _setup(self):
        """Initialize the Whisper model and diarization pipeline."""
        self.device = "cuda" if torch.cuda.is_available() else "cpu"
        print("Initializing Whisper model...")
        if self.model_size == "large-v3-turbo":
            self.model = pipeline(
                task="automatic-speech-recognition",
                model="ylacombe/whisper-large-v3-turbo",
                chunk_length_s=30,
                device=self.device,
            )
        else:
            self.model = whisper.load_model(self.model_size, device=self.device)
        print("Building diarization pipeline...")
        self.pipeline = Pipeline.from_pretrained(
            "pyannote/speaker-diarization-3.1",
            use_auth_token=self.HF_TOKEN
        ).to(torch.device(self.device))
        print("Setup completed successfully!")

    def transcribe_audio(self, audio_file_path: str, enhanced: bool = False) -> Transcription:
        """
        Transcribe an audio file.

        Parameters:
        -----------
        audio_file_path : str
            Path to the audio file to be transcribed.
        enhanced : bool, optional
            If True, applies audio enhancement techniques to improve transcription quality.
            This includes noise reduction, voice enhancement, and volume boosting.

        Returns:
        --------
        Transcription
            A Transcription object containing the transcribed text and speaker segments.
        """
        try:
            print("Processing audio file...")
            processed_audio = self.process_audio(audio_file_path, enhanced)
            audio_file_path = processed_audio.path
            audio, sr, duration = processed_audio.load_as_array(), processed_audio.sample_rate, processed_audio.duration

            print("Diarization in progress...")
            start_time = time()
            diarization = self.perform_diarization(audio_file_path)
            print(f"Diarization completed in {time() - start_time:.2f} seconds.")
            segments = list(diarization.itertracks(yield_label=True))

            transcriptions = self.transcribe_segments(audio, sr, duration, segments)
            return Transcription(audio_file_path, transcriptions, segments)
        except Exception as e:
            raise RuntimeError(f"Failed to process the audio file: {e}")

    def process_audio(self, audio_file_path: str, enhanced: bool = False) -> AudioProcessor:
        """
        Process the audio file to ensure it meets the requirements for transcription.

        Parameters:
        -----------
        audio_file_path : str
            Path to the audio file to be processed.
        enhanced : bool, optional
            If True, applies audio enhancement techniques to improve audio quality.
            This includes optimizing noise reduction, voice enhancement, and volume boosting
            parameters based on the audio characteristics.

        Returns:
        --------
        AudioProcessor
            An AudioProcessor object containing the processed audio file.
        """
        processed_audio = AudioProcessor(audio_file_path)
        if processed_audio.format != ".wav":
            processed_audio.convert_to_wav()

        if processed_audio.sample_rate != 16000:
            processed_audio.resample_wav()

        if enhanced:
            parameters = processed_audio.optimize_enhancement_parameters()
            processed_audio.enhance_audio(noise_reduce_strength=parameters[0],
                                          voice_enhance_strength=parameters[1],
                                          volume_boost=parameters[2])

        processed_audio.display_changes()
        return processed_audio

    def perform_diarization(self, audio_file_path: str):
        """Perform speaker diarization on the audio file."""
        with torch.no_grad():
            return self.pipeline(audio_file_path)

    def transcribe_segments(self, audio, sr, duration, segments):
        """Transcribe audio segments based on diarization."""
        transcriptions = []

        for turn, _, speaker in tqdm(segments, desc="Transcribing segments", unit="segment", ncols=100, colour="green"):
            start = turn.start
            end = min(turn.end, duration)
            segment = audio[int(start * sr):int(end * sr)]
            if self.model_size == "large-v3-turbo":
                result = self.model(segment)
            else:
                result = self.model.transcribe(segment, fp16=self.device == "cuda")
            transcriptions.append((speaker, result['text'].strip()))

        return transcriptions
requirements.txt ADDED
@@ -0,0 +1,15 @@
openai-whisper
pyannote.audio
librosa
tqdm
python-dotenv
termcolor
pydub
SpeechRecognition
pyaudio
tabulate
soundfile
torch
numpy
transformers
gradio