Fedir Zadniprovskyi committed
Commit 313814b · 0 Parent(s)
.dockerignore ADDED
@@ -0,0 +1,12 @@
+ __pycache__
+ tests/data
+ .pytest_cache
+ .git
+ flake.nix
+ flake.lock
+ .envrc
+ .gitignore
+ .direnv
+ .task
+ Taskfile.yaml
+ README.md
.envrc ADDED
@@ -0,0 +1 @@
+ use flake
.gitignore ADDED
@@ -0,0 +1,5 @@
+ __pycache__
+ .pytest_cache
+ tests/data
+ .direnv
+ .task
.pre-commit-config.yaml ADDED
@@ -0,0 +1,25 @@
+ # See https://pre-commit.com for more information
+ # See https://pre-commit.com/hooks.html for more hooks
+ repos:
+   - repo: https://github.com/pre-commit/pre-commit-hooks
+     rev: v3.2.0
+     hooks:
+       - id: trailing-whitespace
+       - id: end-of-file-fixer
+       - id: check-yaml
+       - id: check-added-large-files
+
+   - repo: https://github.com/pre-commit/mirrors-mypy
+     rev: v1.10.0
+     hooks:
+       - id: mypy
+
+   # - repo: https://github.com/PyCQA/isort
+   #   rev: 5.13.2
+   #   hooks:
+   #     - id: isort
+   #
+   # - repo: https://github.com/psf/black
+   #   rev: 24.4.2
+   #   hooks:
+   #     - id: black
Dockerfile.cpu ADDED
@@ -0,0 +1,14 @@
+ FROM ubuntu:22.04
+ RUN apt-get update && \
+     apt-get install -y curl software-properties-common && \
+     add-apt-repository ppa:deadsnakes/ppa && \
+     apt-get update && \
+     DEBIAN_FRONTEND=noninteractive apt-get -y install python3.11 python3.11-distutils && \
+     curl -sS https://bootstrap.pypa.io/get-pip.py | python3.11
+ RUN pip install --no-cache-dir poetry==1.8.2
+ WORKDIR /root/speaches
+ COPY pyproject.toml poetry.lock ./
+ RUN poetry install
+ COPY ./speaches ./speaches
+ ENTRYPOINT ["poetry", "run"]
+ CMD ["uvicorn", "speaches.main:app"]
Dockerfile.cuda ADDED
@@ -0,0 +1,14 @@
+ FROM nvidia/cuda:12.2.2-cudnn8-runtime-ubuntu22.04
+ RUN apt-get update && \
+     apt-get install -y curl software-properties-common && \
+     add-apt-repository ppa:deadsnakes/ppa && \
+     apt-get update && \
+     DEBIAN_FRONTEND=noninteractive apt-get -y install python3.11 python3.11-distutils && \
+     curl -sS https://bootstrap.pypa.io/get-pip.py | python3.11
+ RUN pip install --no-cache-dir poetry==1.8.2
+ WORKDIR /root/speaches
+ COPY pyproject.toml poetry.lock ./
+ RUN poetry install
+ COPY ./speaches ./speaches
+ ENTRYPOINT ["poetry", "run"]
+ CMD ["uvicorn", "speaches.main:app"]
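Both Dockerfiles follow the same layering: install Python 3.11 from the deadsnakes PPA, install Poetry, copy the dependency manifests before the source so the `poetry install` layer stays cached, then copy the package. A minimal sketch of building the two images locally (the tags mirror the ones declared in compose.yaml):

```bash
docker build -f Dockerfile.cpu -t fedirz/speaches:cpu .
docker build -f Dockerfile.cuda -t fedirz/speaches:cuda .
```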
LICENSE ADDED
@@ -0,0 +1,21 @@
+ MIT License
+
+ Copyright (c) 2024 Fedir Zadniprovskyi
+
+ Permission is hereby granted, free of charge, to any person obtaining a copy
+ of this software and associated documentation files (the "Software"), to deal
+ in the Software without restriction, including without limitation the rights
+ to use, copy, modify, merge, publish, distribute, sublicense, and/or sell
+ copies of the Software, and to permit persons to whom the Software is
+ furnished to do so, subject to the following conditions:
+
+ The above copyright notice and this permission notice shall be included in all
+ copies or substantial portions of the Software.
+
+ THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
+ IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
+ FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE
+ AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER
+ LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM,
+ OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE
+ SOFTWARE.
README.md ADDED
@@ -0,0 +1,24 @@
+ # Intro
+ `speaches` is a web server that supports real-time transcription using WebSockets.
+ - [faster-whisper](https://github.com/SYSTRAN/faster-whisper) is used as the backend. Both GPU and CPU inference are supported.
+ - The LocalAgreement2 ([paper](https://aclanthology.org/2023.ijcnlp-demo.3.pdf) | [original implementation](https://github.com/ufal/whisper_streaming)) algorithm is used for real-time transcription.
+ - Can be deployed using Docker (a Compose configuration can be found in [compose.yaml](./compose.yaml)).
+ - All configuration is done through environment variables. See [config.py](./speaches/config.py) and the configuration sketch after this README's diff.
+ - NOTE: only transcription of single-channel, 16000 Hz, raw, 16-bit little-endian audio is supported.
+ - NOTE: this isn't really meant to be used as a standalone tool but rather to add transcription features to other applications.
+ Please create an issue if you find a bug, have a question, or have a feature suggestion.
+ # Quick Start
+ NOTE: You'll need to install [websocat](https://github.com/vi/websocat?tab=readme-ov-file#installation) or an alternative.
+ Spinning up a `speaches` web server:
+ ```bash
+ docker run --detach --gpus=all --publish 8000:8000 --volume ~/.cache/huggingface:/root/.cache/huggingface --name speaches fedirz/speaches:cuda
+ # or
+ docker run --detach --publish 8000:8000 --volume ~/.cache/huggingface:/root/.cache/huggingface --name speaches fedirz/speaches:cpu
+ ```
+ Sending audio data via WebSocket:
+ ```bash
+ arecord -f S16_LE -c1 -r 16000 -t raw -D default | websocat --binary ws://localhost:8000/v1/audio/transcriptions
+ # or
+ ffmpeg -f alsa -i default -ac 1 -ar 16000 -f s16le - | websocat --binary ws://localhost:8000/v1/audio/transcriptions
+ ```
+ # Example
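Since all configuration comes from environment variables and `Config` in config.py uses pydantic-settings with `env_nested_delimiter="_"`, nested fields such as `whisper.model` should map onto flat variable names. A hedged sketch of overriding the model when starting the container (the `WHISPER_MODEL` name is inferred from config.py, not documented in the README):

```bash
docker run --detach --publish 8000:8000 \
  --env WHISPER_MODEL=distil-medium.en \
  --volume ~/.cache/huggingface:/root/.cache/huggingface \
  --name speaches fedirz/speaches:cpu
```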
Taskfile.yaml ADDED
@@ -0,0 +1,15 @@
+ version: "3"
+ tasks:
+   speaches: poetry run uvicorn speaches.main:app {{.CLI_ARGS}}
+   test:
+     cmds:
+       - poetry run pytest -o log_cli=true -o log_cli_level=DEBUG {{.CLI_ARGS}}
+     sources:
+       - "**/*.py"
+   build-and-push:
+     cmds:
+       - docker compose build --push speaches
+     sources:
+       - Dockerfile
+       - speaches/*.py
+   sync: lsyncd -nodaemon -delay 0 -rsyncssh . gpu-box speaches
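`{{.CLI_ARGS}}` in the `speaches` and `test` tasks forwards anything passed after `--` on the task command line. A usage sketch (assumes go-task, which the dev shell in flake.nix provides):

```bash
task speaches -- --host 0.0.0.0 --port 8000
task test -- -k ws_audio
```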
compose.yaml ADDED
@@ -0,0 +1,34 @@
+ services:
+   speaches-cuda:
+     image: fedirz/speaches:cuda
+     build:
+       dockerfile: Dockerfile.cuda
+       context: .
+       tags:
+         - fedirz/speaches:cuda
+     volumes:
+       - ~/.cache/huggingface:/root/.cache/huggingface
+     restart: unless-stopped
+     ports:
+       - 8000:8000
+     environment:
+       - INFERENCE_DEVICE=cuda
+     deploy:
+       resources:
+         reservations:
+           devices:
+             - capabilities: ["gpu"]
+   speaches-cpu:
+     image: fedirz/speaches:cpu
+     build:
+       dockerfile: Dockerfile.cpu
+       context: .
+       tags:
+         - fedirz/speaches:cpu
+     volumes:
+       - ~/.cache/huggingface:/root/.cache/huggingface
+     restart: unless-stopped
+     ports:
+       - 8000:8000
+     environment:
+       - INFERENCE_DEVICE=cpu
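The two services are the same application built from different Dockerfiles, so they are meant to be started individually rather than together (both bind port 8000). A usage sketch:

```bash
docker compose up --detach speaches-cuda
# or, on a machine without an NVIDIA GPU:
docker compose up --detach speaches-cpu
```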
flake.lock ADDED
@@ -0,0 +1,61 @@
+ {
+   "nodes": {
+     "flake-utils": {
+       "inputs": {
+         "systems": "systems"
+       },
+       "locked": {
+         "lastModified": 1710146030,
+         "narHash": "sha256-SZ5L6eA7HJ/nmkzGG7/ISclqe6oZdOZTNoesiInkXPQ=",
+         "owner": "numtide",
+         "repo": "flake-utils",
+         "rev": "b1d9ab70662946ef0850d488da1c9019f3a9752a",
+         "type": "github"
+       },
+       "original": {
+         "owner": "numtide",
+         "repo": "flake-utils",
+         "type": "github"
+       }
+     },
+     "nixpkgs": {
+       "locked": {
+         "lastModified": 1716073433,
+         "narHash": "sha256-9G0BS7I/5z0n35Vx1d+TLxaIKQ93rEf5VLXNLWu7/44=",
+         "owner": "NixOS",
+         "repo": "nixpkgs",
+         "rev": "b7d845292c304e026d86097e6d07409070e80dcc",
+         "type": "github"
+       },
+       "original": {
+         "owner": "NixOS",
+         "ref": "master",
+         "repo": "nixpkgs",
+         "type": "github"
+       }
+     },
+     "root": {
+       "inputs": {
+         "flake-utils": "flake-utils",
+         "nixpkgs": "nixpkgs"
+       }
+     },
+     "systems": {
+       "locked": {
+         "lastModified": 1681028828,
+         "narHash": "sha256-Vy1rq5AaRuLzOxct8nz4T6wlgyUR7zLU309k9mBC768=",
+         "owner": "nix-systems",
+         "repo": "default",
+         "rev": "da67096a3b9bf56a91d16901293e51ba5b49a27e",
+         "type": "github"
+       },
+       "original": {
+         "owner": "nix-systems",
+         "repo": "default",
+         "type": "github"
+       }
+     }
+   },
+   "root": "root",
+   "version": 7
+ }
flake.nix ADDED
@@ -0,0 +1,40 @@
+ {
+   inputs = {
+     nixpkgs.url = "github:NixOS/nixpkgs/master";
+     flake-utils.url = "github:numtide/flake-utils";
+   };
+   outputs =
+     { nixpkgs, flake-utils, ... }:
+     flake-utils.lib.eachDefaultSystem (
+       system:
+       let
+         pkgs = import nixpkgs {
+           inherit system;
+           config.allowUnfree = true;
+         };
+       in
+       {
+         devShells = {
+           default = pkgs.mkShell {
+             nativeBuildInputs = with pkgs; [
+               (with python311Packages; huggingface-hub)
+               ffmpeg-full
+               go-task
+               lsyncd
+               poetry
+               pre-commit
+               pyright
+               python311
+               websocat
+             ];
+             shellHook = ''
+               source $(poetry env info --path)/bin/activate
+               export LD_LIBRARY_PATH=${pkgs.stdenv.cc.cc.lib}/lib:$LD_LIBRARY_PATH
+               export LD_LIBRARY_PATH=${pkgs.zlib}/lib:$LD_LIBRARY_PATH
+             '';
+           };
+         };
+         formatter = pkgs.nixfmt;
+       }
+     );
+ }
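The flake's dev shell carries every tool referenced elsewhere in this commit (poetry, go-task, websocat, lsyncd, pre-commit). With direnv, the `use flake` line in .envrc loads it automatically; without direnv it can be entered manually (assuming flakes are enabled in your Nix configuration):

```bash
nix develop
```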
poetry.lock ADDED
The diff for this file is too large to render. See raw diff.
pyproject.toml ADDED
@@ -0,0 +1,32 @@
+ [tool.poetry]
+ package-mode = false
+
+ [tool.poetry.dependencies]
+ python = "^3.11"
+ faster-whisper = "^1.0.2"
+ pydantic = "^2.7.1"
+ fastapi = "^0.111.0"
+ uvicorn = "^0.29.0"
+ python-multipart = "^0.0.9"
+ soundfile = "^0.12.1"
+ pydantic-settings = "^2.2.1"
+ websockets = "^12.0"
+ numpy = "^1.26.4"
+
+ [tool.poetry.group.dev.dependencies]
+ pytest = "^8.2.0"
+ pytest-asyncio = "^0.23.6"
+ httpx = "^0.27.0"
+ httpx-ws = "^0.6.0"
+ pytest-xdist = "^3.6.1"
+
+ [tool.poetry.group.client.dependencies]
+ httpx = "^0.27.0"
+ httpx-ws = "^0.6.0"
+
+ [build-system]
+ requires = ["poetry-core"]
+ build-backend = "poetry.core.masonry.api"
speaches/__init__.py ADDED
File without changes
speaches/asr.py ADDED
@@ -0,0 +1,80 @@
+ import asyncio
+ import time
+ from typing import Iterable
+
+ from faster_whisper import transcribe
+ from pydantic import BaseModel
+
+ from speaches.audio import Audio
+ from speaches.config import Language
+ from speaches.core import Transcription, Word
+ from speaches.logger import logger
+
+
+ class TranscribeOpts(BaseModel):
+     language: Language | None
+     vad_filter: bool
+     condition_on_previous_text: bool
+
+
+ class FasterWhisperASR:
+     def __init__(
+         self,
+         whisper: transcribe.WhisperModel,
+         transcribe_opts: TranscribeOpts,
+     ) -> None:
+         self.whisper = whisper
+         self.transcribe_opts = transcribe_opts
+
+     def _transcribe(
+         self,
+         audio: Audio,
+         prompt: str | None = None,
+     ) -> tuple[Transcription, transcribe.TranscriptionInfo]:
+         start = time.perf_counter()
+         segments, transcription_info = self.whisper.transcribe(
+             audio.data,
+             initial_prompt=prompt,
+             word_timestamps=True,
+             **self.transcribe_opts.model_dump(),
+         )
+         words = words_from_whisper_segments(segments)
+         for word in words:
+             word.offset(audio.start)
+         transcription = Transcription(words)
+         end = time.perf_counter()
+         logger.info(
+             f"Transcribed {audio} in {end - start:.2f} seconds. Prompt: {prompt}. Transcription: {transcription.text}"
+         )
+         return (transcription, transcription_info)
+
+     async def transcribe(
+         self,
+         audio: Audio,
+         prompt: str | None = None,
+     ) -> tuple[Transcription, transcribe.TranscriptionInfo]:
+         """Wrapper around `_transcribe` so it can be used in an async context."""
+         # Is this the optimal way to execute a blocking call in an async context?
+         # TODO: verify performance when running inference on a CPU
+         return await asyncio.get_running_loop().run_in_executor(
+             None,
+             self._transcribe,
+             audio,
+             prompt,
+         )
+
+
+ def words_from_whisper_segments(segments: Iterable[transcribe.Segment]) -> list[Word]:
+     words: list[Word] = []
+     for segment in segments:
+         assert segment.words is not None
+         words.extend(
+             Word(
+                 start=word.start,
+                 end=word.end,
+                 text=word.word,
+                 probability=word.probability,
+             )
+             for word in segment.words
+         )
+     return words
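The comment in `transcribe` asks whether `run_in_executor` is the optimal way to run the blocking call. `asyncio.to_thread` (Python ≥ 3.9) is equivalent shorthand that also propagates context variables; a minimal standalone sketch of the pattern, with a stand-in for the blocking Whisper call:

```python
import asyncio
import time


def blocking_transcribe(duration: float) -> str:
    # Stand-in for the CPU/GPU-bound WhisperModel.transcribe call.
    time.sleep(duration)
    return "hello world"


async def main() -> None:
    # Shorthand for run_in_executor(None, ...); the event loop stays
    # responsive while the blocking function runs on a worker thread.
    text = await asyncio.to_thread(blocking_transcribe, 0.1)
    print(text)


asyncio.run(main())
```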
speaches/audio.py ADDED
@@ -0,0 +1,96 @@
+ from __future__ import annotations
+
+ import asyncio
+ from typing import Any, AsyncGenerator, BinaryIO
+
+ import numpy as np
+ import soundfile as sf
+ from numpy.typing import NDArray
+
+ from speaches.config import SAMPLES_PER_SECOND
+ from speaches.logger import logger
+
+
+ def audio_samples_from_file(file: BinaryIO) -> NDArray[np.float32]:
+     audio_and_sample_rate: tuple[NDArray[np.float32], Any] = sf.read(  # type: ignore
+         file,
+         format="RAW",
+         channels=1,
+         samplerate=SAMPLES_PER_SECOND,
+         subtype="PCM_16",
+         dtype="float32",
+         endian="LITTLE",
+     )
+     audio = audio_and_sample_rate[0]
+     return audio
+
+
+ class Audio:
+     def __init__(
+         self,
+         data: NDArray[np.float32] = np.array([], dtype=np.float32),
+         start: float = 0.0,
+     ) -> None:
+         self.data = data
+         self.start = start
+
+     def __repr__(self) -> str:
+         return f"Audio(start={self.start:.2f}, end={self.end:.2f})"
+
+     @property
+     def end(self) -> float:
+         return self.start + self.duration
+
+     @property
+     def duration(self) -> float:
+         return len(self.data) / SAMPLES_PER_SECOND
+
+     def after(self, ts: float) -> Audio:
+         assert ts <= self.duration
+         return Audio(self.data[int(ts * SAMPLES_PER_SECOND) :], start=ts)
+
+     def extend(self, data: NDArray[np.float32]) -> None:
+         # logger.debug(f"Extending audio by {len(data) / SAMPLES_PER_SECOND:.2f}s")
+         self.data = np.append(self.data, data)
+         # logger.debug(f"Audio duration: {self.duration:.2f}s")
+
+
+ # TODO: trim data longer than x
+ class AudioStream(Audio):
+     def __init__(
+         self,
+         data: NDArray[np.float32] = np.array([], dtype=np.float32),
+         start: float = 0.0,
+     ) -> None:
+         super().__init__(data, start)
+         self.closed = False
+
+         self.modify_event = asyncio.Event()
+
+     def extend(self, data: NDArray[np.float32]) -> None:
+         assert not self.closed
+         super().extend(data)
+         self.modify_event.set()
+
+     def close(self) -> None:
+         assert not self.closed
+         self.closed = True
+         self.modify_event.set()
+         logger.info("AudioStream closed")
+
+     async def chunks(
+         self, min_duration: float
+     ) -> AsyncGenerator[NDArray[np.float32], None]:
+         i = 0.0  # end time of the last chunk
+         while True:
+             await self.modify_event.wait()
+             self.modify_event.clear()
+             if self.closed or self.duration - i >= min_duration:
+                 # `i` shouldn't be set to `duration` after the yield,
+                 # because by the time the assignment would happen more data might have been added
+                 i_ = i
+                 i = self.duration
+                 # NOTE: probably better to just do a slice
+                 yield self.after(i_).data
+             if self.closed:
+                 return
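A minimal sketch of how a producer and a consumer interact with `AudioStream` (assuming the `speaches` package is importable, e.g. after `poetry install`): the consumer only wakes up when at least `min_duration` seconds have accumulated or the stream is closed, which is exactly how `audio_transcriber` consumes it.

```python
import asyncio

import numpy as np

from speaches.audio import AudioStream
from speaches.config import SAMPLES_PER_SECOND


async def main() -> None:
    stream = AudioStream()

    async def producer() -> None:
        for _ in range(3):
            # One second of silence per write.
            stream.extend(np.zeros(SAMPLES_PER_SECOND, dtype=np.float32))
            await asyncio.sleep(0)
        stream.close()

    async def consumer() -> None:
        # Yields once at least 2 seconds have accumulated, then flushes the
        # remainder when the stream is closed.
        async for chunk in stream.chunks(min_duration=2.0):
            print(f"chunk of {len(chunk) / SAMPLES_PER_SECOND:.1f}s")


    await asyncio.gather(producer(), consumer())
    # Expected output: "chunk of 2.0s" followed by "chunk of 1.0s"


asyncio.run(main())
```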
speaches/client.py ADDED
@@ -0,0 +1,94 @@
+ # TODO: move out of `speaches` package
+ import asyncio
+ import signal
+
+ import httpx
+ from httpx_ws import AsyncWebSocketSession, WebSocketDisconnect, aconnect_ws
+ from wsproto.connection import ConnectionState
+
+ CHUNK = 1024 * 4
+ AUDIO_RECORD_CMD = "arecord -D default -f S16_LE -r 16000 -c 1 -t raw"
+ COPY_TO_CLIPBOARD_CMD = "wl-copy"
+ NOTIFY_CMD = "notify-desktop"
+
+ client = httpx.AsyncClient(base_url="ws://localhost:8000")
+
+
+ async def audio_sender(ws: AsyncWebSocketSession) -> None:
+     process = await asyncio.create_subprocess_shell(
+         AUDIO_RECORD_CMD,
+         stdout=asyncio.subprocess.PIPE,
+         stderr=asyncio.subprocess.DEVNULL,
+     )
+     assert process.stdout is not None
+     try:
+         while not process.stdout.at_eof():
+             data = await process.stdout.read(CHUNK)
+             if ws.connection.state != ConnectionState.OPEN:
+                 break
+             await ws.send_bytes(data)
+     except Exception as e:
+         print(e)
+     finally:
+         process.kill()
+
+
+ async def transcription_receiver(ws: AsyncWebSocketSession) -> None:
+     transcription = ""
+     notification_id: int | None = None
+     try:
+         while True:
+             data = await ws.receive_text()
+             if not data:
+                 break
+             transcription += data
+             await copy_to_clipboard(transcription)
+             notification_id = await notify(transcription, replaces_id=notification_id)
+     except WebSocketDisconnect:
+         pass
+     print(transcription)
+
+
+ async def copy_to_clipboard(text: str) -> None:
+     process = await asyncio.create_subprocess_shell(
+         COPY_TO_CLIPBOARD_CMD, stdin=asyncio.subprocess.PIPE
+     )
+     await process.communicate(input=text.encode("utf-8"))
+     await process.wait()
+
+
+ async def notify(text: str, replaces_id: int | None = None) -> int:
+     cmd = ["notify-desktop", "--app-name", "Speaches"]
+     if replaces_id is not None:
+         cmd.extend(["--replaces-id", str(replaces_id)])
+     cmd.append("'Speaches'")
+     cmd.append(f"'{text}'")
+     process = await asyncio.create_subprocess_shell(
+         " ".join(cmd),
+         stdout=asyncio.subprocess.PIPE,
+     )
+     await process.wait()
+     assert process.stdout is not None
+     notification_id = (await process.stdout.read()).decode("utf-8")
+     return int(notification_id)
+
+
+ async def main() -> None:
+     async with aconnect_ws("/v1/audio/transcriptions", client) as ws:
+         async with asyncio.TaskGroup() as tg:
+             sender_task = tg.create_task(audio_sender(ws))
+             receiver_task = tg.create_task(transcription_receiver(ws))
+
+             async def on_interrupt():
+                 sender_task.cancel()
+                 receiver_task.cancel()
+                 await asyncio.gather(sender_task, receiver_task)
+
+             asyncio.get_running_loop().add_signal_handler(
+                 signal.SIGINT,
+                 lambda: asyncio.create_task(on_interrupt()),
+             )
+
+
+ asyncio.run(main())
+ # poetry --directory /home/nixos/code/speaches run python /home/nixos/code/speaches/speaches/client.py
speaches/config.py ADDED
@@ -0,0 +1,176 @@
+ import enum
+
+ from pydantic import BaseModel, Field
+ from pydantic_settings import BaseSettings, SettingsConfigDict
+
+ SAMPLES_PER_SECOND = 16000
+ BYTES_PER_SAMPLE = 2
+ BYTES_PER_SECOND = SAMPLES_PER_SECOND * BYTES_PER_SAMPLE
+ # 2 BYTES = 16 BITS = 1 SAMPLE
+ # 1 SECOND OF AUDIO = 32000 BYTES = 16000 SAMPLES
+
+
+ # TODO: confirm names
+ class Model(enum.StrEnum):
+     TINY_EN = "tiny.en"
+     TINY = "tiny"
+     BASE_EN = "base.en"
+     BASE = "base"
+     SMALL_EN = "small.en"
+     SMALL = "small"
+     MEDIUM_EN = "medium.en"
+     MEDIUM = "medium"
+     LARGE = "large"
+     LARGE_V1 = "large-v1"
+     LARGE_V2 = "large-v2"
+     LARGE_V3 = "large-v3"
+     DISTIL_SMALL_EN = "distil-small.en"
+     DISTIL_MEDIUM_EN = "distil-medium.en"
+     DISTIL_LARGE_V2 = "distil-large-v2"
+     DISTIL_LARGE_V3 = "distil-large-v3"
+
+
+ class Device(enum.StrEnum):
+     CPU = "cpu"
+     CUDA = "cuda"
+     AUTO = "auto"
+
+
+ # https://github.com/OpenNMT/CTranslate2/blob/master/docs/quantization.md
+ class Quantization(enum.StrEnum):
+     INT8 = "int8"
+     INT8_FLOAT16 = "int8_float16"
+     INT8_BFLOAT16 = "int8_bfloat16"
+     INT8_FLOAT32 = "int8_float32"
+     INT16 = "int16"
+     FLOAT16 = "float16"
+     BFLOAT16 = "bfloat16"
+     FLOAT32 = "float32"
+     DEFAULT = "default"
+
+
+ class Language(enum.StrEnum):
+     AF = "af"
+     AM = "am"
+     AR = "ar"
+     AS = "as"
+     AZ = "az"
+     BA = "ba"
+     BE = "be"
+     BG = "bg"
+     BN = "bn"
+     BO = "bo"
+     BR = "br"
+     BS = "bs"
+     CA = "ca"
+     CS = "cs"
+     CY = "cy"
+     DA = "da"
+     DE = "de"
+     EL = "el"
+     EN = "en"
+     ES = "es"
+     ET = "et"
+     EU = "eu"
+     FA = "fa"
+     FI = "fi"
+     FO = "fo"
+     FR = "fr"
+     GL = "gl"
+     GU = "gu"
+     HA = "ha"
+     HAW = "haw"
+     HE = "he"
+     HI = "hi"
+     HR = "hr"
+     HT = "ht"
+     HU = "hu"
+     HY = "hy"
+     ID = "id"
+     IS = "is"
+     IT = "it"
+     JA = "ja"
+     JW = "jw"
+     KA = "ka"
+     KK = "kk"
+     KM = "km"
+     KN = "kn"
+     KO = "ko"
+     LA = "la"
+     LB = "lb"
+     LN = "ln"
+     LO = "lo"
+     LT = "lt"
+     LV = "lv"
+     MG = "mg"
+     MI = "mi"
+     MK = "mk"
+     ML = "ml"
+     MN = "mn"
+     MR = "mr"
+     MS = "ms"
+     MT = "mt"
+     MY = "my"
+     NE = "ne"
+     NL = "nl"
+     NN = "nn"
+     NO = "no"
+     OC = "oc"
+     PA = "pa"
+     PL = "pl"
+     PS = "ps"
+     PT = "pt"
+     RO = "ro"
+     RU = "ru"
+     SA = "sa"
+     SD = "sd"
+     SI = "si"
+     SK = "sk"
+     SL = "sl"
+     SN = "sn"
+     SO = "so"
+     SQ = "sq"
+     SR = "sr"
+     SU = "su"
+     SV = "sv"
+     SW = "sw"
+     TA = "ta"
+     TE = "te"
+     TG = "tg"
+     TH = "th"
+     TK = "tk"
+     TL = "tl"
+     TR = "tr"
+     TT = "tt"
+     UK = "uk"
+     UR = "ur"
+     UZ = "uz"
+     VI = "vi"
+     YI = "yi"
+     YO = "yo"
+     YUE = "yue"
+     ZH = "zh"
+
+
+ class WhisperConfig(BaseModel):
+     model: Model = Field(default=Model.DISTIL_SMALL_EN)
+     inference_device: Device = Field(default=Device.AUTO)
+     compute_type: Quantization = Field(default=Quantization.DEFAULT)
+
+
+ class Config(BaseSettings):
+     model_config = SettingsConfigDict(env_nested_delimiter="_")
+
+     log_level: str = "info"
+     whisper: WhisperConfig = WhisperConfig()
+     max_no_data_seconds: float = 1.0
+     """
+     Max duration to wait for the next audio chunk before finalizing the transcription and closing the connection.
+     """
+     min_duration: float = 1.0
+     word_timestamp_error_margin: float = 0.2
+     inactivity_window_seconds: float = 3.0
+     max_inactivity_seconds: float = 1.5
+
+
+ config = Config()
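Because `Config` sets `env_nested_delimiter="_"`, pydantic-settings should resolve nested fields from flat, case-insensitive variable names. A hedged sketch of the expected mapping (assumed pydantic-settings behavior, not verified against this app):

```python
import os

# Top-level field: Config.log_level <- LOG_LEVEL
os.environ["LOG_LEVEL"] = "debug"
# Nested field: Config.whisper.model <- WHISPER_MODEL ("whisper" + "_" + "model")
os.environ["WHISPER_MODEL"] = "distil-large-v3"

from speaches.config import Config

config = Config()
assert config.log_level == "debug"
assert config.whisper.model == "distil-large-v3"
```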
speaches/core.py ADDED
@@ -0,0 +1,207 @@
+ # TODO: rename module
+ from __future__ import annotations
+
+ import re
+ from dataclasses import dataclass
+
+ from speaches.config import config
+
+
+ # TODO: use the `Segment` from `faster-whisper.transcribe` instead
+ @dataclass
+ class Segment:
+     text: str
+     start: float = 0.0
+     end: float = 0.0
+
+     @property
+     def is_eos(self) -> bool:
+         if self.text.endswith("..."):
+             return False
+         for punctuation_symbol in ".?!":
+             if self.text.endswith(punctuation_symbol):
+                 return True
+         return False
+
+     def offset(self, seconds: float) -> None:
+         self.start += seconds
+         self.end += seconds
+
+
+ # TODO: use the `Word` from `faster-whisper.transcribe` instead
+ @dataclass
+ class Word(Segment):
+     probability: float = 0.0
+
+     @classmethod
+     def common_prefix(cls, a: list[Word], b: list[Word]) -> list[Word]:
+         i = 0
+         while (
+             i < len(a)
+             and i < len(b)
+             and canonicalize_word(a[i].text) == canonicalize_word(b[i].text)
+         ):
+             i += 1
+         return a[:i]
+
+
+ class Transcription:
+     def __init__(self, words: list[Word] = []) -> None:
+         self.words: list[Word] = []
+         self.extend(words)
+
+     @property
+     def text(self) -> str:
+         return " ".join(word.text for word in self.words).strip()
+
+     @property
+     def start(self) -> float:
+         return self.words[0].start if len(self.words) > 0 else 0.0
+
+     @property
+     def end(self) -> float:
+         return self.words[-1].end if len(self.words) > 0 else 0.0
+
+     @property
+     def duration(self) -> float:
+         return self.end - self.start
+
+     def after(self, seconds: float) -> Transcription:
+         return Transcription(
+             words=[word for word in self.words if word.start > seconds]
+         )
+
+     def extend(self, words: list[Word]) -> None:
+         self._ensure_no_word_overlap(words)
+         self.words.extend(words)
+
+     def _ensure_no_word_overlap(self, words: list[Word]) -> None:
+         if len(self.words) > 0 and len(words) > 0:
+             if (
+                 words[0].start + config.word_timestamp_error_margin
+                 <= self.words[-1].end
+             ):
+                 raise ValueError(
+                     f"Words overlap: {self.words[-1]} and {words[0]}. Error margin: {config.word_timestamp_error_margin}"
+                 )
+         for i in range(1, len(words)):
+             if words[i].start + config.word_timestamp_error_margin <= words[i - 1].end:
+                 raise ValueError(
+                     f"Words overlap: {words[i - 1]} and {words[i]}. All words: {words}"
+                 )
+
+
+ def test_segment_is_eos():
+     assert not Segment("Hello").is_eos
+     assert not Segment("Hello...").is_eos
+     assert Segment("Hello.").is_eos
+     assert Segment("Hello!").is_eos
+     assert Segment("Hello?").is_eos
+     assert not Segment("Hello. Yo").is_eos
+     assert not Segment("Hello. Yo...").is_eos
+     assert Segment("Hello. Yo.").is_eos
+
+
+ def to_full_sentences(words: list[Word]) -> list[Segment]:
+     sentences: list[Segment] = [Segment("")]
+     for word in words:
+         sentences[-1] = Segment(
+             start=sentences[-1].start,
+             end=word.end,
+             text=sentences[-1].text + word.text,
+         )
+         if word.is_eos:
+             sentences.append(Segment(""))
+     if len(sentences) > 0 and not sentences[-1].is_eos:
+         sentences.pop()
+     return sentences
+
+
+ def test_to_full_sentences():
+     assert to_full_sentences([]) == []
+     assert to_full_sentences([Word(text="Hello")]) == []
+     assert to_full_sentences([Word(text="Hello..."), Word(" world")]) == []
+     assert to_full_sentences([Word(text="Hello..."), Word(" world.")]) == [
+         Segment(text="Hello... world.")
+     ]
+     assert to_full_sentences(
+         [Word(text="Hello..."), Word(" world."), Word(" How")]
+     ) == [Segment(text="Hello... world.")]
+
+
+ def to_text(words: list[Word]) -> str:
+     return "".join(word.text for word in words)
+
+
+ def to_text_w_ts(words: list[Word]) -> str:
+     return "".join(f"{word.text}({word.start:.2f}-{word.end:.2f})" for word in words)
+
+
+ def canonicalize_word(text: str) -> str:
+     text = text.lower()
+     # Remove non-alphabetic characters using a regular expression
+     text = re.sub(r"[^a-z]", "", text)
+     return text.lower().strip().strip(".,?!")
+
+
+ def test_canonicalize_word():
+     assert canonicalize_word("ABC") == "abc"
+     assert canonicalize_word("...ABC?") == "abc"
+     assert canonicalize_word("... AbC ...") == "abc"
+
+
+ def common_prefix(a: list[Word], b: list[Word]) -> list[Word]:
+     i = 0
+     while (
+         i < len(a)
+         and i < len(b)
+         and canonicalize_word(a[i].text) == canonicalize_word(b[i].text)
+     ):
+         i += 1
+     return a[:i]
+
+
+ def test_common_prefix():
+     def word(text: str) -> Word:
+         return Word(text=text, start=0.0, end=0.0, probability=0.0)
+
+     a = [word("a"), word("b"), word("c")]
+     b = [word("a"), word("b"), word("c")]
+     assert common_prefix(a, b) == [word("a"), word("b"), word("c")]
+
+     a = [word("a"), word("b"), word("c")]
+     b = [word("a"), word("b"), word("d")]
+     assert common_prefix(a, b) == [word("a"), word("b")]
+
+     a = [word("a"), word("b"), word("c")]
+     b = [word("a")]
+     assert common_prefix(a, b) == [word("a")]
+
+     a = [word("a")]
+     b = [word("a"), word("b"), word("c")]
+     assert common_prefix(a, b) == [word("a")]
+
+     a = [word("a")]
+     b = []
+     assert common_prefix(a, b) == []
+
+     a = []
+     b = [word("a")]
+     assert common_prefix(a, b) == []
+
+     a = [word("a"), word("b"), word("c")]
+     b = [word("b"), word("c")]
+     assert common_prefix(a, b) == []
+
+
+ def test_common_prefix_and_canonicalization():
+     def word(text: str) -> Word:
+         return Word(text=text, start=0.0, end=0.0, probability=0.0)
+
+     a = [word("A...")]
+     b = [word("a?"), word("b"), word("c")]
+     assert common_prefix(a, b) == [word("A...")]
+
+     a = [word("A..."), word("B?"), word("C,")]
+     b = [word("a??"), word(" b"), word(" ,c")]
+     assert common_prefix(a, b) == [word("A..."), word("B?"), word("C,")]
speaches/logger.py ADDED
@@ -0,0 +1,13 @@
+ import logging
+
+ from speaches.config import config
+
+ # Disables all but the `speaches` logger
+
+ root_logger = logging.getLogger()
+ root_logger.setLevel(logging.CRITICAL)
+ logger = logging.getLogger(__name__)
+ logger.setLevel(config.log_level.upper())
+ logging.basicConfig(
+     format="%(asctime)s:%(levelname)s:%(name)s:%(funcName)s:%(message)s"
+ )
speaches/main.py ADDED
@@ -0,0 +1,162 @@
+ from __future__ import annotations
+
+ import asyncio
+ import time
+ from contextlib import asynccontextmanager
+ from io import BytesIO
+ from typing import Annotated
+
+ from fastapi import (
+     Depends,
+     FastAPI,
+     Response,
+     UploadFile,
+     WebSocket,
+     WebSocketDisconnect,
+ )
+ from fastapi.websockets import WebSocketState
+ from faster_whisper import WhisperModel
+ from faster_whisper.vad import VadOptions, get_speech_timestamps
+
+ from speaches.asr import FasterWhisperASR, TranscribeOpts
+ from speaches.audio import AudioStream, audio_samples_from_file
+ from speaches.config import SAMPLES_PER_SECOND, Language, config
+ from speaches.core import Transcription
+ from speaches.logger import logger
+ from speaches.server_models import (
+     ResponseFormat,
+     TranscriptionResponse,
+     TranscriptionVerboseResponse,
+ )
+ from speaches.transcriber import audio_transcriber
+
+ whisper: WhisperModel = None  # type: ignore
+
+
+ @asynccontextmanager
+ async def lifespan(_: FastAPI):
+     global whisper
+     logger.debug(f"Loading {config.whisper.model}")
+     start = time.perf_counter()
+     whisper = WhisperModel(
+         config.whisper.model,
+         device=config.whisper.inference_device,
+         compute_type=config.whisper.compute_type,
+     )
+     end = time.perf_counter()
+     logger.debug(f"Loaded {config.whisper.model} in {end - start:.2f} seconds")
+     yield
+
+
+ app = FastAPI(lifespan=lifespan)
+
+
+ @app.get("/health")
+ def health() -> Response:
+     return Response(status_code=200, content="Everything is peachy!")
+
+
+ async def transcription_parameters(
+     language: Language = Language.EN,
+     vad_filter: bool = True,
+     condition_on_previous_text: bool = False,
+ ) -> TranscribeOpts:
+     return TranscribeOpts(
+         language=language,
+         vad_filter=vad_filter,
+         condition_on_previous_text=condition_on_previous_text,
+     )
+
+
+ TranscribeParams = Annotated[TranscribeOpts, Depends(transcription_parameters)]
+
+
+ @app.post("/v1/audio/transcriptions")
+ async def transcribe_file(
+     file: UploadFile,
+     transcription_opts: TranscribeParams,
+     response_format: ResponseFormat = ResponseFormat.JSON,
+ ) -> str:
+     asr = FasterWhisperASR(whisper, transcription_opts)
+     audio_samples = audio_samples_from_file(file.file)
+     audio = AudioStream(audio_samples)
+     transcription, _ = await asr.transcribe(audio)
+     return format_transcription(transcription, response_format)
+
+
+ async def audio_receiver(ws: WebSocket, audio_stream: AudioStream) -> None:
+     try:
+         while True:
+             bytes_ = await asyncio.wait_for(
+                 ws.receive_bytes(), timeout=config.max_no_data_seconds
+             )
+             logger.debug(f"Received {len(bytes_)} bytes of audio data")
+             audio_samples = audio_samples_from_file(BytesIO(bytes_))
+             audio_stream.extend(audio_samples)
+             if audio_stream.duration - config.inactivity_window_seconds >= 0:
+                 audio = audio_stream.after(
+                     audio_stream.duration - config.inactivity_window_seconds
+                 )
+                 vad_opts = VadOptions(min_silence_duration_ms=500, speech_pad_ms=0)
+                 timestamps = get_speech_timestamps(audio.data, vad_opts)
+                 if len(timestamps) == 0:
+                     logger.info(
+                         f"No speech detected in the last {config.inactivity_window_seconds} seconds."
+                     )
+                     break
+                 elif (
+                     # last speech end time
+                     config.inactivity_window_seconds
+                     - timestamps[-1]["end"] / SAMPLES_PER_SECOND
+                     >= config.max_inactivity_seconds
+                 ):
+                     logger.info(
+                         f"Not enough speech in the last {config.inactivity_window_seconds} seconds."
+                     )
+                     break
+     except asyncio.TimeoutError:
+         logger.info(
+             f"No data received in {config.max_no_data_seconds} seconds. Closing the connection."
+         )
+     except WebSocketDisconnect as e:
+         logger.info(f"Client disconnected: {e}")
+     audio_stream.close()
+
+
+ def format_transcription(
+     transcription: Transcription, response_format: ResponseFormat
+ ) -> str:
+     if response_format == ResponseFormat.TEXT:
+         return transcription.text
+     elif response_format == ResponseFormat.JSON:
+         return TranscriptionResponse(text=transcription.text).model_dump_json()
+     elif response_format == ResponseFormat.VERBOSE_JSON:
+         return TranscriptionVerboseResponse(
+             duration=transcription.duration,
+             text=transcription.text,
+             words=transcription.words,
+         ).model_dump_json()
+
+
+ @app.websocket("/v1/audio/transcriptions")
+ async def transcribe_stream(
+     ws: WebSocket,
+     transcription_opts: TranscribeParams,
+     response_format: ResponseFormat = ResponseFormat.JSON,
+ ) -> None:
+     await ws.accept()
+     asr = FasterWhisperASR(whisper, transcription_opts)
+     audio_stream = AudioStream()
+     async with asyncio.TaskGroup() as tg:
+         tg.create_task(audio_receiver(ws, audio_stream))
+         async for transcription in audio_transcriber(asr, audio_stream):
+             logger.debug(f"Sending transcription: {transcription.text}")
+             if ws.client_state == WebSocketState.DISCONNECTED:
+                 break
+             await ws.send_text(format_transcription(transcription, response_format))
+
+     if ws.client_state != WebSocketState.DISCONNECTED:
+         # the client hasn't disconnected yet, so close the connection from our side
+         await ws.close()
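Both routes share the `/v1/audio/transcriptions` path: POST for whole files, WebSocket for streams. A sketch of exercising the POST route with curl, assuming `audio.raw` already contains 16 kHz mono s16le samples as required by `audio_samples_from_file`:

```bash
curl -s "http://localhost:8000/v1/audio/transcriptions?response_format=text" \
  -F "file=@audio.raw"
```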
speaches/server_models.py ADDED
@@ -0,0 +1,26 @@
+ import enum
+
+ from pydantic import BaseModel
+
+ from speaches.core import Word
+
+
+ class ResponseFormat(enum.StrEnum):
+     JSON = "json"
+     TEXT = "text"
+     VERBOSE_JSON = "verbose_json"
+
+
+ # https://platform.openai.com/docs/api-reference/audio/json-object
+ class TranscriptionResponse(BaseModel):
+     text: str
+
+
+ # Subset of https://platform.openai.com/docs/api-reference/audio/verbose-json-object
+ class TranscriptionVerboseResponse(BaseModel):
+     task: str = "transcribe"
+     duration: float
+     text: str
+     words: list[Word]  # Different from OpenAI's `words`: `Word.text` instead of `Word.word`
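A sketch of what a `verbose_json` payload should look like, given the fields above and the `Word` dataclass from core.py (the words and timestamps here are made up):

```python
from speaches.core import Word
from speaches.server_models import TranscriptionVerboseResponse

response = TranscriptionVerboseResponse(
    duration=1.2,
    text="Hello world.",
    words=[
        Word(text="Hello", start=0.1, end=0.5, probability=0.98),
        Word(text=" world.", start=0.6, end=1.2, probability=0.95),
    ],
)
# {"task": "transcribe", "duration": 1.2, "text": "Hello world.", "words": [...]}
print(response.model_dump_json())
```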
speaches/transcriber.py ADDED
@@ -0,0 +1,75 @@
+ from __future__ import annotations
+
+ from typing import AsyncGenerator
+
+ from speaches.asr import FasterWhisperASR
+ from speaches.audio import Audio, AudioStream
+ from speaches.config import config
+ from speaches.core import Transcription, Word, common_prefix, to_full_sentences
+ from speaches.logger import logger
+
+
+ class LocalAgreement:
+     def __init__(self) -> None:
+         self.unconfirmed = Transcription()
+
+     def merge(self, confirmed: Transcription, incoming: Transcription) -> list[Word]:
+         # https://github.com/ufal/whisper_streaming/blob/main/whisper_online.py#L264
+         incoming = incoming.after(confirmed.end - 0.1)
+         prefix = common_prefix(incoming.words, self.unconfirmed.words)
+         logger.debug(f"Confirmed: {confirmed.text}")
+         logger.debug(f"Unconfirmed: {self.unconfirmed.text}")
+         logger.debug(f"Incoming: {incoming.text}")
+
+         if len(incoming.words) > len(prefix):
+             self.unconfirmed = Transcription(incoming.words[len(prefix) :])
+         else:
+             self.unconfirmed = Transcription()
+
+         return prefix
+
+     @classmethod
+     def prompt(cls, confirmed: Transcription) -> str | None:
+         sentences = to_full_sentences(confirmed.words)
+         if len(sentences) == 0:
+             return None
+         return sentences[-1].text
+
+     # TODO: better name
+     @classmethod
+     def needs_audio_after(cls, confirmed: Transcription) -> float:
+         full_sentences = to_full_sentences(confirmed.words)
+         return full_sentences[-1].end if len(full_sentences) > 0 else 0.0
+
+
+ def needs_audio_after(confirmed: Transcription) -> float:
+     full_sentences = to_full_sentences(confirmed.words)
+     return full_sentences[-1].end if len(full_sentences) > 0 else 0.0
+
+
+ def prompt(confirmed: Transcription) -> str | None:
+     sentences = to_full_sentences(confirmed.words)
+     if len(sentences) == 0:
+         return None
+     return sentences[-1].text
+
+
+ async def audio_transcriber(
+     asr: FasterWhisperASR,
+     audio_stream: AudioStream,
+ ) -> AsyncGenerator[Transcription, None]:
+     local_agreement = LocalAgreement()
+     full_audio = Audio()
+     confirmed = Transcription()
+     async for chunk in audio_stream.chunks(config.min_duration):
+         full_audio.extend(chunk)
+         audio = full_audio.after(needs_audio_after(confirmed))
+         transcription, _ = await asr.transcribe(audio, prompt(confirmed))
+         new_words = local_agreement.merge(confirmed, transcription)
+         if len(new_words) > 0:
+             confirmed.extend(new_words)
+             yield confirmed
+     logger.debug("Flushing...")
+     confirmed.extend(local_agreement.unconfirmed.words)
+     yield confirmed
+     logger.info("Audio transcriber finished")
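A minimal sketch of the LocalAgreement policy in isolation: a word becomes confirmed only once two consecutive hypotheses agree on it (the words and timestamps below are made up; assumes the `speaches` package is importable):

```python
from speaches.core import Transcription, Word
from speaches.transcriber import LocalAgreement

agreement = LocalAgreement()
confirmed = Transcription()


def hypothesis(*texts: str) -> Transcription:
    # Fabricated timestamps: one word per half second.
    return Transcription(
        [Word(text=t, start=i * 0.5, end=i * 0.5 + 0.4) for i, t in enumerate(texts)]
    )


# First pass: nothing is unconfirmed yet, so nothing gets confirmed.
confirmed.extend(agreement.merge(confirmed, hypothesis("hello", "world")))
print(repr(confirmed.text))  # ""

# Second pass agrees with the first on the first two words, so they get confirmed.
confirmed.extend(agreement.merge(confirmed, hypothesis("hello", "world", "how")))
print(repr(confirmed.text))  # "hello world"
```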
tests/__init__.py ADDED
File without changes
tests/app_test.py ADDED
@@ -0,0 +1,84 @@
+ import json
+ import os
+ import threading
+ import time
+ from difflib import SequenceMatcher
+ from typing import Generator
+
+ import pytest
+ from fastapi import WebSocketDisconnect
+ from fastapi.testclient import TestClient
+ from starlette.testclient import WebSocketTestSession
+
+ from speaches.config import BYTES_PER_SECOND
+ from speaches.main import app
+ from speaches.server_models import TranscriptionVerboseResponse
+
+ SIMILARITY_THRESHOLD = 0.97
+
+
+ @pytest.fixture()
+ def client() -> Generator[TestClient, None, None]:
+     with TestClient(app) as client:
+         yield client
+
+
+ def get_audio_file_paths():
+     file_paths = []
+     directory = "tests/data"
+     for filename in reversed(os.listdir(directory)[5:6]):
+         if filename.endswith(".raw"):
+             file_paths.append(os.path.join(directory, filename))
+     return file_paths
+
+
+ file_paths = get_audio_file_paths()
+
+
+ def stream_audio_data(
+     ws: WebSocketTestSession, data: bytes, *, chunk_size: int = 4000, speed: float = 1.0
+ ):
+     for i in range(0, len(data), chunk_size):
+         ws.send_bytes(data[i : i + chunk_size])
+         delay = len(data[i : i + chunk_size]) / BYTES_PER_SECOND / speed
+         time.sleep(delay)
+
+
+ def transcribe_audio_data(
+     client: TestClient, data: bytes
+ ) -> TranscriptionVerboseResponse:
+     response = client.post(
+         "/v1/audio/transcriptions?response_format=verbose_json",
+         files={"file": ("audio.raw", data, "audio/raw")},
+     )
+     data = json.loads(response.json())  # TODO: figure this out
+     return TranscriptionVerboseResponse(**data)  # type: ignore
+
+
+ @pytest.mark.parametrize("file_path", file_paths)
+ def test_ws_audio_transcriptions(client: TestClient, file_path: str):
+     with open(file_path, "rb") as file:
+         data = file.read()
+     streaming_transcription: TranscriptionVerboseResponse = None  # type: ignore
+     with client.websocket_connect(
+         "/v1/audio/transcriptions?response_format=verbose_json"
+     ) as ws:
+         thread = threading.Thread(
+             target=stream_audio_data, args=(ws, data), kwargs={"speed": 4.0}
+         )
+         thread.start()
+         while True:
+             try:
+                 streaming_transcription = TranscriptionVerboseResponse(
+                     **ws.receive_json()
+                 )
+             except WebSocketDisconnect:
+                 break
+         ws.close()
+     file_transcription = transcribe_audio_data(client, data)
+     s = SequenceMatcher(
+         lambda x: x == " ", file_transcription.text, streaming_transcription.text
+     )
+     assert (
+         s.ratio() > SIMILARITY_THRESHOLD
+     ), f"\nExpected: {file_transcription.text}\nReceived: {streaming_transcription.text}"
tests/conftest.py ADDED
@@ -0,0 +1,9 @@
+ import logging
+
+ disable_loggers = ["multipart.multipart", "faster_whisper"]
+
+
+ def pytest_configure():
+     for logger_name in disable_loggers:
+         logger = logging.getLogger(logger_name)
+         logger.disabled = True