Luigi committed
Commit 2d9c975 · 1 Parent(s): 4ae7310

switch to gradio

Files changed (3)
  1. README.md +42 -18
  2. app.py +111 -120
  3. requirements.txt +1 -3
README.md CHANGED
@@ -3,36 +3,36 @@ title: Dinercall Intent Demo
  emoji: 🏆
  colorFrom: red
  colorTo: gray
- sdk: streamlit
- sdk_version: 1.44.1
  app_file: app.py
  pinned: false
  license: apache-2.0
  short_description: restaurant reservation intent detector
  ---

-
  # 🍽️ 餐廳訂位意圖識別系統 (Mandarin Reservation Intent Classifier)

- 🎙️ 本系統讓使用者可以透過**語音錄音**或**文字輸入**,自動判斷是否具有「訂位意圖」,是語音助理或自動客服前端的理想元件之一。

  ---

  ## 🔍 功能介紹

  - 🧠 **語音辨識**:使用 fine-tuned Whisper 模型 [`Jingmiao/whisper-small-zh_tw`](https://huggingface.co/Jingmiao/whisper-small-zh_tw) 將語音轉為繁體中文文字。
- - 🤖 **意圖分類**:使用微調的 ALBERT 中文模型判斷輸入是否包含訂位意圖。
  - 📱 **支援手機與桌機**:介面具備良好響應性,適用於各類瀏覽器與行動裝置。
- - 🔊 **瀏覽器錄音**:可直接錄音並即時進行語音辨識與意圖分類。

  ---

  ## 🚀 使用方式

- 1. 點擊「▶️ 開始錄音」按鈕開始說話。
- 2. 點擊「⏹️ 停止錄音」完成語音輸入。
- 3. 系統會自動轉文字,並進行「是否為訂位」意圖判斷。
- 4. 或者也可以直接手動輸入文字,再點擊送出按鈕。

  ---

@@ -44,29 +44,53 @@ short_description: restaurant reservation intent detector
  ### 中文意圖分類模型:
  - [`Luigi/albert-tiny-chinese-dinercall-intent`](https://huggingface.co/Luigi/albert-tiny-chinese-dinercall-intent)
  - [`Luigi/albert-base-chinese-dinercall-intent`](https://huggingface.co/Luigi/albert-base-chinese-dinercall-intent)

  ---

  ## 📦 依賴環境

  ```txt
- streamlit
- transformers>=4.30.0
  torch
- torchaudio
  ```

  ---

  ## 🛠️ 開發者備註

- - 本應用為 Streamlit App,支援 Hugging Face Spaces 部署。
- - 使用 JavaScript 客製化錄音介面,透過 Web Audio API 進行錄音與 POST 回傳。
  - 若需延伸本系統至其他語言或多輪對話,歡迎 fork 本專案進行改造!

  ---

- © 2024 by [Your Name or Team]. Made with ❤️ using Hugging Face + Streamlit.
- ```

- ---
  emoji: 🏆
  colorFrom: red
  colorTo: gray
+ sdk: gradio
+ sdk_version: 5+
  app_file: app.py
  pinned: false
  license: apache-2.0
  short_description: restaurant reservation intent detector
  ---

  # 🍽️ 餐廳訂位意圖識別系統 (Mandarin Reservation Intent Classifier)

+ 🎙️ 本系統讓使用者可以透過**語音錄音**或**文字輸入**,自動判斷是否具有「訂位意圖」,是語音助理或自動客服前端的理想元件之一。這個版本基於 **Gradio** 建構,具有簡單直觀的分頁式輸入模式切換(「麥克風」或「文字」)。

  ---

  ## 🔍 功能介紹

  - 🧠 **語音辨識**:使用 fine-tuned Whisper 模型 [`Jingmiao/whisper-small-zh_tw`](https://huggingface.co/Jingmiao/whisper-small-zh_tw) 將語音轉為繁體中文文字。
+ - 🤖 **意圖分類**:使用微調的 ALBERT 中文模型或 Qwen 模型判斷輸入是否包含訂位意圖。
  - 📱 **支援手機與桌機**:介面具備良好響應性,適用於各類瀏覽器與行動裝置。
+ - 🔊 **雙重輸入模式**:使用者可在「麥克風」和「文字」兩種模式間切換,以提供語音或手動輸入。

  ---

  ## 🚀 使用方式

+ 1. 選擇輸入模式:
+    - 「麥克風」:點擊錄音按鈕開始錄音,錄製完成後自動轉文字並判斷意圖。
+    - 「文字」:直接在文字框中輸入語句,再點擊「執行辨識」按鈕。
+ 2. 從下拉選單選擇使用的模型(例如 ALBERT-tiny、ALBERT-base 或 Qwen)。
+ 3. 按下「執行辨識」後,系統將顯示轉換後的文字、意圖判斷結果,並以 TTS(語音合成)的方式回應。

  ---

  ### 中文意圖分類模型:
  - [`Luigi/albert-tiny-chinese-dinercall-intent`](https://huggingface.co/Luigi/albert-tiny-chinese-dinercall-intent)
  - [`Luigi/albert-base-chinese-dinercall-intent`](https://huggingface.co/Luigi/albert-base-chinese-dinercall-intent)
+ - 或使用 [`Qwen/Qwen2.5-0.5B-Instruct`](https://huggingface.co/Qwen/Qwen2.5-0.5B-Instruct)(透過 Outlines 整合)

  ---

  ## 📦 依賴環境

  ```txt
+ llama-cpp-python
+ gradio>=5.0.0
+ transformers
  torch
+ soundfile
+ outlines
+ numpy>=1.24,<2.0
+ kokoro
+ huggingface-hub
+ jieba
+ docopt
+ ordered-set
+ cn2an
+ pypinyin
+ sentencepiece
  ```

  ---

  ## 🛠️ 開發者備註

+ - 本應用現改為 Gradio App,適合在 Hugging Face Spaces 上部署,並支援 Gradio V5 的最新功能。
+ - 採用雙重輸入模式(麥克風與文字)讓使用者能靈活切換輸入方式。
  - 若需延伸本系統至其他語言或多輪對話,歡迎 fork 本專案進行改造!

  ---

+ © 2024 by [Your Name or Team]. Made with ❤️ using Hugging Face + Gradio.
+ ---
+
+ ### Explanation
+
+ - **README.md:**
+   - The SDK and app_file information has been updated to indicate a Gradio-based application.
+   - The features have been revised to highlight the dual-input mode (麥克風 vs. 文字).
+   - The installation instructions and usage steps now reflect the updated Gradio interface.
+
+ - **requirements.txt:**
+   - The dependencies for Streamlit and streamlit-mic-recorder have been removed.
+   - Gradio (version 5.0.0 or higher) has been added as the primary UI framework.
+   - The remaining dependencies support the models and other processing components.
+
+ Feel free to customize further as needed for your deployment or additional features!
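To try the two published checkpoints named in the README outside this Space, a minimal sketch with the Hugging Face `transformers` pipelines is shown below (an illustration only, not part of this commit; `sample.wav` is a placeholder path and the printed label format depends on the checkpoint's config):

```python
from transformers import pipeline

# ASR: Whisper fine-tuned for Taiwanese Mandarin (model named in the README)
asr = pipeline("automatic-speech-recognition", model="Jingmiao/whisper-small-zh_tw")
# Intent classifier: ALBERT fine-tuned on diner-call reservation intents
clf = pipeline("text-classification", model="Luigi/albert-tiny-chinese-dinercall-intent")

text = asr("sample.wav")["text"]  # sample.wav: any short Mandarin clip (placeholder)
print(text)
print(clf(text))                  # e.g. [{'label': ..., 'score': ...}]
```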
app.py CHANGED
@@ -1,56 +1,58 @@
- import streamlit as st
- from streamlit_mic_recorder import mic_recorder
- from transformers import AutoTokenizer, AutoModelForSequenceClassification, pipeline
- import outlines  # Use outlines with transformers integration
- from torch.nn.functional import softmax
  import torch
  import tempfile
  import re
  from pathlib import Path
- import io
- import base64
- import numpy as np
- import soundfile as sf
- from kokoro import KPipeline

- # ------------------ Model Identifiers ------------------

- # Whisper ASR model identifier (using Hugging Face Transformers pipeline)
  whisper_model_id = "Jingmiao/whisper-small-zh_tw"
-
- # Qwen LLM model identifier (using outlines transformers integration)
  qwen_model_id = "Qwen/Qwen2.5-0.5B-Instruct"

- # Available models for text classification (intent detection) via Transformers
  available_models = {
      "ALBERT-tiny (Chinese)": "Luigi/albert-tiny-chinese-dinercall-intent",
      "ALBERT-base (Chinese)": "Luigi/albert-base-chinese-dinercall-intent",
-     "Qwen (via Transformers - outlines)": "qwen"  # special keyword to use Qwen below
  }

- # ------------------ Load Functions ------------------
-
- @st.cache_resource
  def load_whisper_pipeline():
-     return pipeline("automatic-speech-recognition", model=whisper_model_id)
-
- @st.cache_resource
- def load_transformers_model(model_id):
-     # Load ALBERT-based classification model via Transformers.
      tokenizer = AutoTokenizer.from_pretrained(model_id, use_fast=True)
      model = AutoModelForSequenceClassification.from_pretrained(model_id)
      return tokenizer, model

- @st.cache_resource
  def load_qwen_model():
-     # Load Qwen using the outlines transformers integration.
-     # Note that the prompt-based interaction requires proper chat tokens.
      return outlines.models.transformers(qwen_model_id)

- # ------------------ Prediction Functions ------------------

- def predict_with_qwen(text):
-     # Use Qwen via outlines for intent classification with a prompt.
      model = load_qwen_model()
      prompt = f"""
  <|im_start|>system
@@ -76,10 +78,11 @@ Classify the following message: "{text}"
      else:
          return f"未知回應: {prediction}"

- def predict_intent(text, model_id):
-     # Use ALBERT-based Transformers for intent detection.
      tokenizer, model = load_transformers_model(model_id)
      inputs = tokenizer(text, return_tensors="pt")
      with torch.no_grad():
          logits = model(**inputs).logits
          probs = softmax(logits, dim=-1)
@@ -89,20 +92,7 @@ def predict_intent(text, model_id):
      else:
          return f"❌ 無訂位意圖 (Not Reservation intent)(訂位信心度 Confidence: {confidence:.2%})"

- def load_clean_readme(path="README.md"):
-     text = Path(path).read_text(encoding="utf-8")
-     text = re.sub(r"(?s)^---.*?---", "", text).strip()
-     text = re.sub(r"^# .*?\n+", "", text)
-     return text
-
- # ------------------ TTS Integration via kokoro ------------------
-
- @st.cache_resource
- def get_tts_pipeline():
-     # Instantiate and cache the KPipeline for TTS; setting language code to Chinese.
-     return KPipeline(lang_code="z")
-
- def get_tts_message(intent_result):
      if intent_result and "訂位意圖" in intent_result and "無" not in intent_result:
          return "稍後您將會從簡訊收到訂位連結"
      elif intent_result:
@@ -110,7 +100,7 @@ def get_tts_message(intent_result):
      else:
          return "未能判斷意圖"

- def play_tts_message(message, voice='af_heart'):
      pipeline_tts = get_tts_pipeline()
      generator = pipeline_tts(message, voice=voice)
      audio_chunks = []
@@ -118,78 +108,79 @@ def play_tts_message(message, voice='af_heart'):
          audio_chunks.append(audio)
      if audio_chunks:
          audio_concat = np.concatenate(audio_chunks)
      else:
-         audio_concat = np.array([])
-     wav_buffer = io.BytesIO()
-     sf.write(wav_buffer, audio_concat, 24000, format="WAV")
-     wav_buffer.seek(0)
-     return wav_buffer.read()
-
- def play_audio_auto(audio_data, mime="audio/wav"):
-     audio_base64 = base64.b64encode(audio_data).decode()
-     audio_html = f'''
-     <audio controls autoplay style="width: 100%;">
-         <source src="data:{mime};base64,{audio_base64}" type="{mime}">
-         Your browser does not support the audio element.
-     </audio>
-     '''
-     st.markdown(audio_html, unsafe_allow_html=True)
-
- # ------------------ App UI ------------------
-
- st.title("🍽️ 餐廳訂位意圖識別")
- st.markdown("錄音或輸入文字,自動判斷是否具有訂位意圖。")
-
- model_label = st.selectbox("選擇模型", list(available_models.keys()))
- model_id = available_models[model_label]
-
- st.markdown("### 🎙️ 點擊錄音(支援瀏覽器)")
- audio = mic_recorder(start_prompt="開始錄音", stop_prompt="停止錄音", just_once=True, use_container_width=True, format="wav", key="recorder")
-
- # Process audio recording input
- if audio:
-     st.success("錄音完成!")
-     st.audio(audio["bytes"], format="audio/wav")
-     with tempfile.NamedTemporaryFile(delete=False, suffix=".wav") as tmpfile:
-         tmpfile.write(audio["bytes"])
-         tmpfile_path = tmpfile.name
-
-     with st.spinner("🧠 Whisper 處理語音中..."):
-         try:
-             whisper_pipe = load_whisper_pipeline()
-             result = whisper_pipe(tmpfile_path)
-             transcription = result["text"]
-             st.success(f"📝 語音轉文字:{transcription}")
-         except Exception as e:
-             st.error(f"❌ Whisper 錯誤:{str(e)}")
-             transcription = ""
-
-     if transcription:
-         with st.spinner("預測中..."):
-             if model_id == "qwen":
-                 result_text = predict_with_qwen(transcription)
-             else:
-                 result_text = predict_intent(transcription, model_id)
-             st.success(result_text)
-             tts_text = get_tts_message(result_text)
-             st.info(f"TTS 語音內容: {tts_text}")
-             audio_message = play_tts_message(tts_text)
-             play_audio_auto(audio_message, mime="audio/wav")
-
- # Process text input for intent classification
- text_input = st.text_input("✍️ 或手動輸入語句")
- if text_input and st.button("🚀 送出"):
-     with st.spinner("預測中..."):
-         if model_id == "qwen":
-             result_text = predict_with_qwen(text_input)
          else:
-             result_text = predict_intent(text_input, model_id)
-         st.success(result_text)
-         tts_text = get_tts_message(result_text)
-         st.info(f"TTS 語音內容: {tts_text}")
-         audio_message = play_tts_message(tts_text)
-         play_audio_auto(audio_message, mime="audio/wav")
-
- with st.expander("ℹ️ 說明文件 / 使用說明 (README)", expanded=False):
-     readme_md = load_clean_readme()
-     st.markdown(readme_md, unsafe_allow_html=True)
+ import gradio as gr
+ from transformers import pipeline, AutoTokenizer, AutoModelForSequenceClassification
  import torch
+ from torch.nn.functional import softmax
+ import numpy as np
+ import soundfile as sf
+ import io
  import tempfile
+ import outlines  # For Qwen integration via outlines
+ import kokoro  # For TTS synthesis
  import re
  from pathlib import Path
+ from functools import lru_cache
+ import warnings

+ # Suppress FutureWarnings (e.g. about using `inputs` vs. `input_features`)
+ warnings.filterwarnings("ignore", category=FutureWarning)

+ # ------------------- Model Identifiers -------------------
  whisper_model_id = "Jingmiao/whisper-small-zh_tw"
  qwen_model_id = "Qwen/Qwen2.5-0.5B-Instruct"

  available_models = {
      "ALBERT-tiny (Chinese)": "Luigi/albert-tiny-chinese-dinercall-intent",
      "ALBERT-base (Chinese)": "Luigi/albert-base-chinese-dinercall-intent",
+     "Qwen (via Transformers - outlines)": "qwen"
  }

+ # ------------------- Caching and Loading Functions -------------------
+ @lru_cache(maxsize=1)
  def load_whisper_pipeline():
+     pipe = pipeline("automatic-speech-recognition", model=whisper_model_id)
+     # Move model to GPU if available for faster inference
+     if torch.cuda.is_available():
+         pipe.model.to("cuda")
+     return pipe
+
+ @lru_cache(maxsize=2)
+ def load_transformers_model(model_id: str):
      tokenizer = AutoTokenizer.from_pretrained(model_id, use_fast=True)
      model = AutoModelForSequenceClassification.from_pretrained(model_id)
+     if torch.cuda.is_available():
+         model.to("cuda")
      return tokenizer, model

+ @lru_cache(maxsize=1)
  def load_qwen_model():
      return outlines.models.transformers(qwen_model_id)

+ @lru_cache(maxsize=1)
+ def get_tts_pipeline():
+     return kokoro.KPipeline(lang_code="z")

+ # ------------------- Inference Functions -------------------
+ def predict_with_qwen(text: str):
      model = load_qwen_model()
      prompt = f"""
  <|im_start|>system

      else:
          return f"未知回應: {prediction}"

+ def predict_intent(text: str, model_id: str):
      tokenizer, model = load_transformers_model(model_id)
      inputs = tokenizer(text, return_tensors="pt")
+     if torch.cuda.is_available():
+         inputs = {k: v.to("cuda") for k, v in inputs.items()}
      with torch.no_grad():
          logits = model(**inputs).logits
          probs = softmax(logits, dim=-1)

      else:
          return f"❌ 無訂位意圖 (Not Reservation intent)(訂位信心度 Confidence: {confidence:.2%})"

+ def get_tts_message(intent_result: str):
      if intent_result and "訂位意圖" in intent_result and "無" not in intent_result:
          return "稍後您將會從簡訊收到訂位連結"
      elif intent_result:

      else:
          return "未能判斷意圖"

+ def tts_audio_output(message: str, voice: str = 'af_heart'):
      pipeline_tts = get_tts_pipeline()
      generator = pipeline_tts(message, voice=voice)
      audio_chunks = []

          audio_chunks.append(audio)
      if audio_chunks:
          audio_concat = np.concatenate(audio_chunks)
+         # Return as tuple (sample_rate, numpy_array) for gr.Audio (sample rate used: 24000 Hz)
+         return (24000, audio_concat)
+     else:
+         return None
+
+ def transcribe_audio(audio_file):
+     whisper_pipe = load_whisper_pipeline()
+     # audio_file is the file path from gr.Audio (with type="filepath")
+     result = whisper_pipe(audio_file)
+     return result["text"]
+
+ # ------------------- Main Processing Function -------------------
+ def classify_intent(mode, audio_file, text_input, model_choice):
+     # Determine input based on explicit mode.
+     if mode == "Microphone" and audio_file is not None:
+         transcription = transcribe_audio(audio_file)
+     elif mode == "Text" and text_input:
+         transcription = text_input
+     else:
+         return "請提供語音或文字輸入", "", None
+
+     # Classify the transcribed or provided text.
+     if available_models[model_choice] == "qwen":
+         classification = predict_with_qwen(transcription)
      else:
+         classification = predict_intent(transcription, available_models[model_choice])
+     # Generate TTS message and audio.
+     tts_msg = get_tts_message(classification)
+     tts_audio = tts_audio_output(tts_msg)
+     return transcription, classification, tts_audio
+
+ # ------------------- Gradio Blocks Interface Setup -------------------
+ with gr.Blocks() as demo:
+     gr.Markdown("## 🍽️ 餐廳訂位意圖識別")
+     gr.Markdown("錄音或輸入文字,自動判斷是否具有訂位意圖。")
+
+     with gr.Row():
+         # Input Mode Selector
+         mode = gr.Radio(choices=["Microphone", "Text"], label="選擇輸入模式", value="Microphone")
+
+     with gr.Row():
+         # Audio and Text inputs – only one will be visible based on mode selection.
+         audio_input = gr.Audio(sources=["microphone"], type="filepath", label="語音輸入 (點擊錄音)")
+         text_input = gr.Textbox(lines=2, placeholder="請輸入文字", label="文字輸入")
+
+     # Initially, only the microphone input is visible.
+     text_input.visible = False
+
+     # Change event for mode selection to toggle visibility.
+     def update_visibility(selected_mode):
+         if selected_mode == "Microphone":
+             return gr.update(visible=True), gr.update(visible=False)
          else:
+             return gr.update(visible=False), gr.update(visible=True)
+     mode.change(fn=update_visibility, inputs=mode, outputs=[audio_input, text_input])
+
+     with gr.Row():
+         model_dropdown = gr.Dropdown(choices=list(available_models.keys()),
+                                      value="ALBERT-tiny (Chinese)", label="選擇模型")
+
+     with gr.Row():
+         classify_btn = gr.Button("執行辨識")
+
+     with gr.Row():
+         transcription_output = gr.Textbox(label="轉換文字")
+     with gr.Row():
+         classification_output = gr.Textbox(label="意圖判斷結果")
+     with gr.Row():
+         tts_output = gr.Audio(type="numpy", label="TTS 語音輸出")
+
+     # Button event triggers the classification. Gradio will show a spinner during processing.
+     classify_btn.click(fn=classify_intent,
+                        inputs=[mode, audio_input, text_input, model_dropdown],
+                        outputs=[transcription_output, classification_output, tts_output])
+
+ demo.launch()
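For a quick sanity check of the text-only path, the new `classify_intent` function can be called directly (for example before `demo.launch()` or from a Python shell after importing the module); a minimal sketch, with an illustrative sample sentence:

```python
# Bypass the UI and call the same function the 「執行辨識」 button triggers.
transcription, classification, tts_audio = classify_intent(
    "Text", None, "我想訂明天晚上七點,兩位", "ALBERT-tiny (Chinese)")
print(classification)          # intent label with confidence
print(tts_audio is not None)   # True when TTS synthesis returned (24000, ndarray)
```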
requirements.txt CHANGED
@@ -3,11 +3,9 @@
  --extra-index-url https://abetlen.github.io/llama-cpp-python/whl/cpu

  llama-cpp-python
- streamlit
- streamlit-mic-recorder
  transformers
  torch
- faster-whisper
  soundfile
  outlines
  numpy>=1.24,<2.0

  --extra-index-url https://abetlen.github.io/llama-cpp-python/whl/cpu

  llama-cpp-python
+ gradio>=5.0.0
  transformers
  torch
  soundfile
  outlines
  numpy>=1.24,<2.0
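Because the `--extra-index-url` line for the CPU wheel of `llama-cpp-python` lives inside requirements.txt itself, a plain `pip install -r requirements.txt` should be enough to pull the new Gradio-based dependency set; GPU deployments would likely swap in a CUDA build of `llama-cpp-python` instead.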