Tingusto committed · verified · Commit 02d76b7 · 1 Parent(s): 82403ce

Initializing audio transcriptor

README.md CHANGED
@@ -1,13 +1,190 @@
- ---
- title: Audio Transcriptor
- emoji: 🐢
- colorFrom: gray
- colorTo: red
- sdk: gradio
- sdk_version: 5.3.0
- app_file: app.py
- pinned: false
- short_description: This project involves a Python-based audio transcription
- ---
-
- Check out the configuration reference at https://huggingface.co/docs/hub/spaces-config-reference
+ # Audio Transcription and Diarization Tool
+
+ ## Overview
+
+ This project provides a robust set of tools for transcribing audio files using the Whisper model and performing speaker diarization with PyAnnote. Users can process audio files, record audio, and save transcriptions with speaker identification.
+
+ ## Table of Contents
+ - [Features](#features)
+ - [Requirements](#requirements)
+ - [Setup](#setup)
+ - [Usage](#usage)
+   - [Basic Example](#basic-example)
+   - [Audio Processing Example](#audio-processing-example)
+   - [Transcribing an Existing Audio File or Recording](#transcribing-an-existing-audio-file-or-recording)
+ - [Key Components](#key-components)
+   - [Transcriptor](#transcriptor)
+   - [AudioProcessor](#audioprocessor)
+   - [AudioRecording](#audiorecording)
+ - [Contributing](#contributing)
+ - [Acknowledgments](#acknowledgments)
+
+ ## Features
+
+ - **Transcription**: Convert audio files in various formats to text (files are automatically converted to WAV).
+ - **Speaker Diarization**: Identify the different speakers in the audio.
+ - **Speaker Retrieval**: Interactively assign names to the identified speakers.
+ - **Audio Recording**: Record audio directly from a microphone.
+ - **Audio Preprocessing**: Resampling, format conversion, and audio enhancement.
+ - **Multiple Model Support**: Choose from various Whisper model sizes.
+
+ ## Supported Whisper Models
+
+ This tool supports several Whisper model sizes, letting you balance accuracy against computational cost:
+
+ - **`tiny`**: Fastest, lowest accuracy
+ - **`base`**: Fast, good accuracy
+ - **`small`**: Balanced speed and accuracy
+ - **`medium`**: High accuracy, slower
+ - **`large`**: High accuracy, resource-intensive
+ - **`large-v1`**: Improved large model
+ - **`large-v2`**: Further improved large model
+ - **`large-v3`**: Latest and most accurate
+ - **`large-v3-turbo`**: Optimized for faster processing
+
+ Specify the model size when initializing the Transcriptor:
+
+ ```python
+ transcriptor = Transcriptor(model_size="base")
+ ```
+
+ The default model size is `base` if none is specified.
+
+ ## Requirements
+
+ To run this project, you need Python 3.7+ and the following packages:
+
+ ```plaintext
+ - openai-whisper
+ - pyannote.audio
+ - librosa
+ - tqdm
+ - python-dotenv
+ - termcolor
+ - pydub
+ - SpeechRecognition
+ - pyaudio
+ - tabulate
+ - soundfile
+ - torch
+ - numpy
+ - transformers
+ - gradio
+ ```
+
+ Install the required packages using:
+
+ ```bash
+ pip install -r requirements.txt
+ ```
+
+ ## Setup
+
+ 1. **Clone the repository**:
+    ```bash
+    git clone https://github.com/your-username/audio-transcription-tool.git
+    cd audio-transcription-tool
+    ```
+
+ 2. **Install the required packages**:
+    ```bash
+    pip install -r requirements.txt
+    ```
+
+ 3. **Set up your environment variables**:
+    - Create a `.env` file in the root directory.
+    - Add your Hugging Face token:
+      ```plaintext
+      HF_TOKEN=your_hugging_face_token_here
+      ```
+
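+ The `Transcriptor` reads this token with `python-dotenv` (`load_dotenv()` plus `os.getenv("HF_TOKEN")`) and raises an error if it is missing. As a quick sanity check (a minimal sketch, not part of the tool itself), you can confirm the token is visible before loading any models:
+
+ ```python
+ import os
+ from dotenv import load_dotenv
+
+ load_dotenv()  # reads the .env file in the current working directory
+ assert os.getenv("HF_TOKEN"), "HF_TOKEN is missing - add it to your .env file"
+ ```
+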
+ ## Usage
+
+ ### Basic Example
+
+ Here's how to use the `Transcriptor` class to transcribe an audio file:
+
+ ```python
+ from pyscript import Transcriptor
+
+ # Initialize the Transcriptor
+ transcriptor = Transcriptor()
+
+ # Transcribe an audio file
+ transcription = transcriptor.transcribe_audio("/path/to/audio")
+
+ # Interactively name speakers
+ transcription.get_name_speakers()
+
+ # Save the transcription
+ transcription.save()
+ ```
+
+ ### Audio Processing Example
+
+ Use the `AudioProcessor` class to preprocess your audio files:
+
+ ```python
+ from pyscript import AudioProcessor
+
+ # Load an audio file
+ audio = AudioProcessor("/path/to/audio.mp3")
+
+ # Display audio details
+ audio.display_details()
+
+ # Convert to WAV format and resample to 16000 Hz
+ audio.convert_to_wav()
+
+ # Display updated audio details
+ audio.display_changes()
+ ```
+
+ ### Transcribing an Existing Audio File or Recording
+
+ To transcribe an uploaded audio file or a fresh microphone recording, launch the Gradio demo application provided in `demo.py`:
+
+ ```bash
+ python demo.py
+ ```
+
+ ## Key Components
+
+ ### Transcriptor
+
+ The `Transcriptor` class (in `pyscript/transcriptor.py`) is the core of the transcription process (see the sketch after the list below). It handles:
+
+ - Loading the Whisper model
+ - Setting up the diarization pipeline
+ - Processing audio files
+ - Performing transcription and diarization
+
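+ Beyond the basic example above, `transcribe_audio` also accepts an `enhanced` flag that runs the audio-enhancement pipeline before transcription. A minimal sketch for noisy recordings (the path is a placeholder):
+
+ ```python
+ from pyscript import Transcriptor
+
+ transcriptor = Transcriptor(model_size="base")
+
+ # Enhance the audio (noise reduction, voice enhancement, volume boost) before transcribing
+ transcription = transcriptor.transcribe_audio("/path/to/noisy_audio.wav", enhanced=True)
+ transcription.save()
+ ```
+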
+ ### AudioProcessor
+
+ The `AudioProcessor` class (in `pyscript/audio_processing.py`) manages audio file preprocessing (see the sketch after the list below), including:
+
+ - Loading audio files
+ - Resampling
+ - Converting to WAV format
+ - Displaying audio file details and changes
+ - Audio enhancement (noise reduction, voice enhancement, volume boost)
+
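+ The enhancement step can also be run on its own. The sketch below (paths are placeholders) grid-searches enhancement parameters on a short sample of the file and then writes an enhanced copy under `enhanced_files/`:
+
+ ```python
+ from pyscript import AudioProcessor
+
+ audio = AudioProcessor("/path/to/audio.wav")
+
+ # Grid-search noise-reduction / voice-enhancement / volume-boost values on a 30-second sample
+ params = audio.optimize_enhancement_parameters()
+
+ # Apply the best parameters; returns the path of the enhanced WAV file
+ enhanced_path = audio.enhance_audio(noise_reduce_strength=params[0],
+                                     voice_enhance_strength=params[1],
+                                     volume_boost=params[2])
+
+ # Compare the original and enhanced file attributes
+ audio.display_changes()
+ ```
+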
+ ### AudioRecording
+
+ The `audio_recording.py` module provides functions for recording audio from a microphone, checking input devices, and saving audio files.
+
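+ A minimal sketch of recording and then transcribing from a microphone (the module is not re-exported in `pyscript/__init__.py`, so import it by its full path; `device_index=0` is an assumption, so run `check_input_device()` first to see which devices actually work):
+
+ ```python
+ from pyscript.audio_recording import check_input_device, micro_recording
+ from pyscript import Transcriptor
+
+ check_input_device()                        # prints tables of working and non-working microphones
+ wav_path = micro_recording(device_index=0)  # records one phrase and saves it under audio_files/
+
+ transcription = Transcriptor().transcribe_audio(wav_path)
+ print(transcription)
+ ```
+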
+ ## Contributing
+
+ Contributions are welcome! Please follow these steps:
+
+ 1. Fork the repository
+ 2. Create a new branch: `git checkout -b feature-branch-name`
+ 3. Make your changes and commit them: `git commit -m 'Add some feature'`
+ 4. Push to the branch: `git push origin feature-branch-name`
+ 5. Submit a pull request
+
+ ## Acknowledgments
+
+ - OpenAI for the Whisper model
+ - PyAnnote for the speaker diarization pipeline
+ - All contributors and users of this project
demo.py ADDED
@@ -0,0 +1,47 @@
+ import gradio as gr
+ from pyscript import Transcriptor
+
+ demo = gr.Blocks()
+ transcriptor = Transcriptor(model_size="large-v3-turbo")
+
+ microphone_transcribe = gr.Interface(
+     fn=transcriptor.transcribe_audio,
+     inputs=[
+         gr.Audio(sources="microphone", type="filepath", label="Microphone"),
+         gr.Radio([True, False], value=True, label="Enable audio enhancement"),
+     ],
+     outputs=[
+         gr.Textbox(label="Transcription"),
+         # gr.File(label="Download Transcription"),
+         # gr.Textbox(label="Console Output", lines=10)
+     ],
+     title="Audio-Transcription leveraging Whisper Model",
+     description=(
+         "Transcribe microphone recording or audio inputs and return the transcription with speaker diarization."
+     ),
+     allow_flagging="never",
+ )
+
+ file_transcribe = gr.Interface(
+     fn=transcriptor.transcribe_audio,
+     inputs=[
+         gr.Audio(sources="upload", type="filepath", label="Audio file"),
+         gr.Radio([True, False], value=True, label="Enable audio enhancement"),
+     ],
+     outputs=[
+         gr.Textbox(label="Transcription"),
+         # gr.File(label="Download Transcription"),
+         # gr.Textbox(label="Console Output", lines=10)
+     ],
+     title="Audio-Transcription leveraging Whisper Model",
+     description=(
+         "Transcribe microphone recording or audio inputs and return the transcription with speaker diarization."
+     ),
+     allow_flagging="never",
+ )
+
+
+ with demo:
+     gr.TabbedInterface([microphone_transcribe, file_transcribe], ["Microphone", "Audio file"])
+
+ demo.queue().launch()
pyscript/__init__.py ADDED
@@ -0,0 +1,3 @@
+ from .transcriptor import Transcriptor
+ from .audio_processing import AudioProcessor
+ __all__ = ["Transcriptor", "AudioProcessor"]
pyscript/audio_processing.py ADDED
@@ -0,0 +1,193 @@
+ import os
+ import librosa
+ import numpy as np
+ from tabulate import tabulate
+ import soundfile as sf
+ import scipy.ndimage
+ import itertools
+ from scipy.stats import pearsonr
+ from tqdm import tqdm
+
+ class AudioProcessor:
+
+     def __init__(self, audio_file):
+         self.path = audio_file
+         self.name = os.path.splitext(os.path.basename(audio_file))[0]
+         self.format = os.path.splitext(os.path.basename(audio_file))[1]
+         self.duration = librosa.get_duration(path=audio_file)
+         self.sample_rate = librosa.get_samplerate(audio_file)
+         self.changes = []
+         self.optimized_params = None
+         self.load_details()
+
+     # File information methods
+     def load_details(self):
+         """Tabulate the audio file's attributes and append the table to the change log."""
+         data = [
+             ["File Name", self.name],
+             ["File Format", self.format],
+             ["Duration", f"{self.duration} seconds"],
+             ["Sample Rate", f"{self.sample_rate} Hz"]
+         ]
+         table = tabulate(data, headers=["Attribute", "Value"], tablefmt="outline")
+         self.changes.append(table)
+         return table
+
+     def display_details(self):
+         """Display the details of the audio file."""
+         print(self.changes[-1])
+
+     def display_changes(self):
+         """Display the changes made to the audio file side by side."""
+         self._clean_duplicates_changes()
+         if len(self.changes) == 1:
+             self.display_details()
+         else:
+             table1 = self.changes[0].split('\n')
+             table2 = self.changes[-1].split('\n')
+
+             combined_table = []
+             for line1, line2 in zip(table1, table2):
+                 combined_table.append([line1, '===>', line2])
+
+             print(tabulate(combined_table, tablefmt="plain"))
+
+     def _clean_duplicates_changes(self):
+         """Remove duplicate consecutive changes from the audio file."""
+         self.changes = [change for i, change in enumerate(self.changes)
+                         if i == 0 or change != self.changes[i-1]]
+
+     # Audio processing methods
+     def load_as_array(self, sample_rate: int = 16000) -> np.ndarray:
+         """
+         Load an audio file and convert it into a NumPy array.
+
+         Parameters
+         ----------
+         sample_rate : int, optional
+             The sample rate to which the audio will be resampled (default is 16000 Hz).
+
+         Returns
+         -------
+         np.ndarray
+             A NumPy array containing the audio data.
+         """
+         try:
+             audio, sr = librosa.load(self.path, sr=sample_rate)
+             self.sample_rate = sr
+             return audio
+         except Exception as e:
+             raise RuntimeError(f"Failed to load audio file: {e}")
+
+     def resample_wav(self) -> str:
+         """Resample the audio to 16 kHz and save it as a WAV file under 'resampled_files/'."""
+         output_path = os.path.join('resampled_files', f'{self.name}.wav')
+         try:
+             audio, sr = librosa.load(self.path)
+             resampled_audio = librosa.resample(y=audio, orig_sr=sr, target_sr=16000)
+             os.makedirs(os.path.dirname(output_path), exist_ok=True)
+             sf.write(output_path, resampled_audio, 16000)
+             self._update_file_info(output_path)
+             return output_path
+         except Exception as e:
+             raise RuntimeError(f"Failed to resample audio file: {e}")
+
+     def convert_to_wav(self):
+         """
+         Converts an audio file to WAV format.
+
+         Returns
+         -------
+         str
+             The path to the converted audio file.
+         """
+         output_path = os.path.join('converted_files', f'{self.name}.wav')
+         try:
+             os.makedirs(os.path.dirname(output_path), exist_ok=True)
+             audio, sr = librosa.load(self.path, sr=16000)
+             sf.write(output_path, audio, 16000)
+             self._update_file_info(output_path)
+             return output_path
+         except Exception as e:
+             raise RuntimeError(f"Failed to convert audio file to WAV: {e}")
+
+     def enhance_audio(self, noise_reduce_strength=0.5, voice_enhance_strength=1.5, volume_boost=1.2):
+         """
+         Enhance audio quality by reducing noise and clarifying voices.
+         """
+         try:
+             y, sr = librosa.load(self.path, sr=16000)
+             y_enhanced = self._enhance_audio_sample(y, noise_reduce_strength, voice_enhance_strength, volume_boost)
+
+             output_path = os.path.join('enhanced_files', f'{self.name}_enhanced.wav')
+             os.makedirs(os.path.dirname(output_path), exist_ok=True)
+             sf.write(output_path, y_enhanced, sr)
+
+             self._update_file_info(output_path)
+             return output_path
+         except Exception as e:
+             raise RuntimeError(f"Failed to enhance audio: {e}")
+
+     def optimize_enhancement_parameters(self, step=0.25, max_iterations=50, sample_duration=30):
+         """
+         Find optimal parameters for audio enhancement using grid search on a sample.
+         """
+         y_orig, sr = librosa.load(self.path, duration=sample_duration)
+
+         param_ranges = [
+             np.arange(0.25, 1.5, step),  # noise_reduce_strength
+             np.arange(1.0, 3.0, step),   # voice_enhance_strength
+             np.arange(1.0, 2.0, step)    # volume_boost
+         ]
+
+         best_score = float('-inf')
+         best_params = None
+
+         total_iterations = min(max_iterations, len(list(itertools.product(*param_ranges))))
+
+         for params in tqdm(itertools.islice(itertools.product(*param_ranges), max_iterations),
+                            total=total_iterations,
+                            desc="Searching for optimal parameters"):
+             y_enhanced = self._enhance_audio_sample(y_orig, *params)
+
+             min_length = min(len(y_orig), len(y_enhanced))
+             y_orig_trimmed = y_orig[:min_length]
+             y_enhanced_trimmed = y_enhanced[:min_length]
+
+             correlation, _ = pearsonr(y_orig_trimmed, y_enhanced_trimmed)
+
+             S_orig = np.abs(librosa.stft(y_orig_trimmed))
+             S_enhanced = np.abs(librosa.stft(y_enhanced_trimmed))
+             contrast_improvement = np.mean(librosa.feature.spectral_contrast(S=S_enhanced)) - np.mean(librosa.feature.spectral_contrast(S=S_orig))
+
+             score = correlation + 0.5 * contrast_improvement
+
+             if score > best_score:
+                 best_score = score
+                 best_params = params
+
+         self.optimized_params = best_params
+         return best_params
+
+     def _enhance_audio_sample(self, y, noise_reduce_strength=0.5, voice_enhance_strength=1.5, volume_boost=1.2):
+         """Denoise via spectral median-filter masking, then emphasize harmonics and boost volume."""
+         S = librosa.stft(y)
+         S_mag, S_phase = np.abs(S), np.angle(S)
+         S_filtered = scipy.ndimage.median_filter(S_mag, size=(1, 31))
+
+         mask = np.clip((S_mag - S_filtered) / (S_mag + 1e-10), 0, 1) ** noise_reduce_strength
+         S_denoised = S_mag * mask * np.exp(1j * S_phase)
+
+         y_denoised = librosa.istft(S_denoised)
+
+         y_harmonic, y_percussive = librosa.effects.hpss(y_denoised)
+         y_enhanced = (y_harmonic * voice_enhance_strength + y_percussive) * volume_boost
+
+         return librosa.util.normalize(y_enhanced, norm=np.inf, threshold=1.0)
+
+     # Helper method
+     def _update_file_info(self, new_path):
+         """Update file information after processing."""
+         self.path = new_path
+         self.sample_rate = librosa.get_samplerate(new_path)
+         self.format = os.path.splitext(new_path)[1]
+         self.duration = librosa.get_duration(path=new_path)
+         self.load_details()
pyscript/audio_recording.py ADDED
@@ -0,0 +1,77 @@
+ import speech_recognition as sr
+ import os
+ import datetime
+ from termcolor import colored
+ from tabulate import tabulate
+
+ def micro_recording(save_folder_path: str = "audio_files", file_name: str = None, device_index: int = 0) -> str:
+     """Records audio from a microphone and saves it to a designated file."""
+     r = sr.Recognizer()
+     mic = sr.Microphone(device_index=device_index)
+
+     print_colored_separator("Starting microphone recording...", "green")
+
+     with mic as source:
+         print_colored("Recording...", "yellow")
+         audio = r.listen(source)
+         print_colored("Recording finished.", "green")
+
+     saved_path = save_audio_file(audio, save_folder_path, file_name)
+
+     print_colored_separator(f"Audio file saved to: {saved_path}", "green")
+     return saved_path
+
+ def check_input_device(test_duration: int = 1) -> dict:
+     """Checks the available microphone devices."""
+     devices = sr.Microphone.list_microphone_names()
+     available_devices, non_working_devices = [], []
+
+     for i, device in enumerate(devices):
+         try:
+             with sr.Microphone(device_index=i) as source:
+                 sr.Recognizer().listen(source, timeout=test_duration)
+             available_devices.append(device)
+         except sr.WaitTimeoutError:
+             non_working_devices.append(device)
+         except Exception as e:
+             print(f"An error occurred while testing device {device}: {e}")
+
+     print_device_table("Available Devices", available_devices)
+     print_device_table("Non-Working Devices", non_working_devices)
+
+     return {'available_devices': available_devices, 'non_working_devices': non_working_devices}
+
+ def save_audio_file(audio, save_folder_path: str, file_name: str = None) -> str:
+     """Saves the audio file to the specified path."""
+     os.makedirs(save_folder_path, exist_ok=True)
+
+     if not file_name:
+         timestamp = datetime.datetime.now().strftime("%Y%m%d-%H%M%S")
+         file_name = f"recording_{timestamp}.wav"
+     else:
+         file_name = f"{file_name}.wav"
+
+     saved_path = os.path.join(save_folder_path, file_name)
+
+     with open(saved_path, "wb") as f:
+         f.write(audio.get_wav_data())
+
+     print_colored("Saving audio file...", "yellow")
+     return saved_path
+
+ def print_colored(message: str, color: str):
+     """Prints a colored message."""
+     print(colored(message, color))
+
+ def print_colored_separator(message: str, color: str):
+     """Prints a colored message with separators."""
+     print("--------------------------------")
+     print_colored(message, color)
+     print("--------------------------------")
+
+ def print_device_table(title: str, devices: list):
+     """Prints a table of devices."""
+     device_table = [[i+1, device] for i, device in enumerate(devices)]
+     print(f"\n{title}:")
+     print(tabulate(device_table, headers=["Index", "Device Name"]))
+
pyscript/transcription.py ADDED
@@ -0,0 +1,110 @@
+ import os
+ from itertools import cycle
+ from termcolor import colored
+
+ class Transcription:
+     """
+     A class for storing and saving transcriptions.
+
+     Attributes:
+     -----------
+     audio_file_path : str
+         The path to the audio file that was transcribed.
+     filename : str
+         The name of the audio file, without the extension.
+     transcriptions : list[tuple[str, str]]
+         A list of tuples containing the speaker's label and their corresponding transcription, grouped by speaker.
+     speaker_names : dict
+         A dictionary mapping speaker labels to their assigned names.
+     segments : list
+         A list of segments from diarization.
+
+     """
+
+     def __init__(self, audio_file_path: str, transcriptions: list[tuple[str, str]], segments: list):
+         self.audio_file_path = audio_file_path
+         self.filename = os.path.splitext(os.path.basename(audio_file_path))[0]
+         self.transcriptions = self.group_by_speaker(transcriptions)
+         self.speaker_names = {}
+         self.segments = segments
+         self.colors = cycle(['red', 'green', 'blue', 'magenta', 'cyan', 'yellow'])
+
+     def __repr__(self) -> str:
+         result = []
+         for speaker, text in self.transcriptions:
+             speaker_name = self.speaker_names.get(speaker, speaker)
+             result.append(f"{speaker_name}:\n{text}")
+         return "\n\n".join(result)
+
+     def group_by_speaker(self, transcriptions: list[tuple[str, str]]) -> list[tuple[str, str]]:
+         """
+         Groups transcriptions by speaker.
+
+         Parameters
+         ----------
+         transcriptions : list[tuple[str, str]]
+             A list of tuples containing the speaker's label and their corresponding transcription.
+
+         Returns
+         -------
+         list[tuple[str, str]]
+             A list of tuples containing the speaker's label and their corresponding transcription, grouped by speaker.
+         """
+         speaker_transcriptions = []
+         previous_speaker = transcriptions[0][0]
+         speaker_text = ""
+         for speaker, text in transcriptions:
+             if speaker == previous_speaker:
+                 speaker_text += text
+             else:
+                 speaker_transcriptions.append((previous_speaker, speaker_text))
+                 speaker_text = text
+             previous_speaker = speaker
+         speaker_transcriptions.append((previous_speaker, speaker_text))
+         return speaker_transcriptions
+
+     def save(self, directory: str = "transcripts") -> None:
+         """
+         Saves the transcription to a text file.
+
+         Parameters
+         ----------
+         directory : str, optional
+             The directory to save the transcription to. Defaults to "transcripts".
+         """
+         if not self.transcriptions:
+             raise ValueError("No transcriptions available to save.")
+
+         os.makedirs(directory, exist_ok=True)
+         saving_path = os.path.join(directory, f"{self.filename}_transcript.txt")
+
+         with open(saving_path, 'w', encoding='utf-8') as f:
+             for speaker, text in self.transcriptions:
+                 if text:
+                     speaker_name = self.speaker_names.get(speaker, speaker)
+                     f.write(f"{speaker_name}: {text}\n")
+
+         print(f"Transcription saved to {saving_path}")
+
+     def get_name_speakers(self) -> None:
+         """
+         Interactively assigns names to the speakers in the transcription.
+         Provides a one-sentence preview for each speaker to help recognize who is speaking.
+         """
+         for speaker, full_text in self.transcriptions:
+             if speaker in self.speaker_names:
+                 continue
+
+             preview = full_text.split('.')[0] + '.'
+             print(f"\nCurrent speaker: {speaker}")
+             print(f"Preview: {preview}")
+
+             new_name = input(f"Enter a name for {speaker} (or press Enter to skip): ").strip()
+             if new_name:
+                 self.speaker_names[speaker] = new_name
+                 print(f"Speaker {speaker} renamed to {new_name}")
+             else:
+                 print(f"Skipped renaming {speaker}")
+
+         print("\nSpeaker naming completed.")
+         print(f"Updated speaker names: {self.speaker_names}")
pyscript/transcriptor.py ADDED
@@ -0,0 +1,163 @@
+ import os
+ from dotenv import load_dotenv
+ import whisper
+ from pyannote.audio import Pipeline
+ import torch
+ from tqdm import tqdm
+ from time import time
+ from transformers import pipeline
+ from .transcription import Transcription
+ from .audio_processing import AudioProcessor
+
+ load_dotenv()
+
+ class Transcriptor:
+     """
+     A class for transcribing and diarizing audio files.
+
+     This class uses the Whisper model for transcription and the PyAnnote speaker diarization pipeline for speaker identification.
+
+     Attributes
+     ----------
+     model_size : str
+         The size of the Whisper model to use for transcription. Available options are:
+         - 'tiny': Fastest, lowest accuracy
+         - 'base': Fast, good accuracy for many use cases
+         - 'small': Balanced speed and accuracy
+         - 'medium': High accuracy, slower than smaller models
+         - 'large': High accuracy, slower and more resource-intensive
+         - 'large-v1': Improved version of the large model
+         - 'large-v2': Further improved version of the large model
+         - 'large-v3': Latest and most accurate version of the large model
+         - 'large-v3-turbo': Optimized version of the large-v3 model for faster processing
+     model : whisper.model.Whisper or transformers.Pipeline
+         The Whisper model for transcription (a transformers ASR pipeline when model_size is 'large-v3-turbo').
+     pipeline : pyannote.audio.pipelines.SpeakerDiarization
+         The PyAnnote speaker diarization pipeline.
+
+     Usage:
+     >>> transcript = Transcriptor(model_size="large-v3")
+     >>> transcription = transcript.transcribe_audio("/path/to/audio.wav")
+     >>> transcription.get_name_speakers()
+     >>> transcription.save("/path/to/transcripts")
+
+     Note:
+     Larger models, especially 'large-v3', provide higher accuracy but require more
+     computational resources and may be slower to process audio.
+     """
+
+     def __init__(self, model_size: str = "base"):
+         self.model_size = model_size
+         self.HF_TOKEN = os.getenv("HF_TOKEN")
+         if not self.HF_TOKEN:
+             raise ValueError("HF_TOKEN not found. Please store token in .env")
+         self._setup()
+
+     def _setup(self):
+         """Initialize the Whisper model and diarization pipeline."""
+         self.device = "cuda" if torch.cuda.is_available() else "cpu"
+         print("Initializing Whisper model...")
+         if self.model_size == "large-v3-turbo":
+             self.model = pipeline(
+                 task="automatic-speech-recognition",
+                 model="ylacombe/whisper-large-v3-turbo",
+                 chunk_length_s=30,
+                 device=self.device,
+             )
+         else:
+             self.model = whisper.load_model(self.model_size, device=self.device)
+         print("Building diarization pipeline...")
+         self.pipeline = Pipeline.from_pretrained(
+             "pyannote/speaker-diarization-3.1",
+             use_auth_token=self.HF_TOKEN
+         ).to(torch.device(self.device))
+         print("Setup completed successfully!")
+
+     def transcribe_audio(self, audio_file_path: str, enhanced: bool = False) -> Transcription:
+         """
+         Transcribe an audio file.
+
+         Parameters:
+         -----------
+         audio_file_path : str
+             Path to the audio file to be transcribed.
+         enhanced : bool, optional
+             If True, applies audio enhancement techniques to improve transcription quality.
+             This includes noise reduction, voice enhancement, and volume boosting.
+
+         Returns:
+         --------
+         Transcription
+             A Transcription object containing the transcribed text and speaker segments.
+         """
+         try:
+             print("Processing audio file...")
+             processed_audio = self.process_audio(audio_file_path, enhanced)
+             audio_file_path = processed_audio.path
+             audio, sr, duration = processed_audio.load_as_array(), processed_audio.sample_rate, processed_audio.duration
+
+             print("Diarization in progress...")
+             start_time = time()
+             diarization = self.perform_diarization(audio_file_path)
+             print(f"Diarization completed in {time() - start_time:.2f} seconds.")
+             segments = list(diarization.itertracks(yield_label=True))
+
+             transcriptions = self.transcribe_segments(audio, sr, duration, segments)
+             return Transcription(audio_file_path, transcriptions, segments)
+         except Exception as e:
+             raise RuntimeError(f"Failed to process the audio file: {e}")
+
+     def process_audio(self, audio_file_path: str, enhanced: bool = False) -> AudioProcessor:
+         """
+         Process the audio file to ensure it meets the requirements for transcription.
+
+         Parameters:
+         -----------
+         audio_file_path : str
+             Path to the audio file to be processed.
+         enhanced : bool, optional
+             If True, applies audio enhancement techniques to improve audio quality.
+             This includes optimizing noise reduction, voice enhancement, and volume boosting
+             parameters based on the audio characteristics.
+
+         Returns:
+         --------
+         AudioProcessor
+             An AudioProcessor object containing the processed audio file.
+         """
+         processed_audio = AudioProcessor(audio_file_path)
+         if processed_audio.format != ".wav":
+             processed_audio.convert_to_wav()
+
+         if processed_audio.sample_rate != 16000:
+             processed_audio.resample_wav()
+
+         if enhanced:
+             parameters = processed_audio.optimize_enhancement_parameters()
+             processed_audio.enhance_audio(noise_reduce_strength=parameters[0],
+                                           voice_enhance_strength=parameters[1],
+                                           volume_boost=parameters[2])
+
+         processed_audio.display_changes()
+         return processed_audio
+
+     def perform_diarization(self, audio_file_path: str):
+         """Perform speaker diarization on the audio file."""
+         with torch.no_grad():
+             return self.pipeline(audio_file_path)
+
+     def transcribe_segments(self, audio, sr, duration, segments):
+         """Transcribe audio segments based on diarization."""
+         transcriptions = []
+
+         for turn, _, speaker in tqdm(segments, desc="Transcribing segments", unit="segment", ncols=100, colour="green"):
+             start = turn.start
+             end = min(turn.end, duration)
+             segment = audio[int(start * sr):int(end * sr)]
+             if self.model_size == "large-v3-turbo":
+                 result = self.model(segment)
+             else:
+                 result = self.model.transcribe(segment, fp16=self.device == "cuda")
+             transcriptions.append((speaker, result['text'].strip()))
+
+         return transcriptions
requirements.txt ADDED
@@ -0,0 +1,15 @@
+ openai-whisper
+ pyannote.audio
+ librosa
+ tqdm
+ python-dotenv
+ termcolor
+ pydub
+ SpeechRecognition
+ pyaudio
+ tabulate
+ soundfile
+ torch
+ numpy
+ transformers
+ gradio