(Speech recognition) Implement sampling (#1)

* (Speech recognition) Implement sampling

* Self-review

Files changed (13)

.readthedocs.yaml +4 -1
docs/source/img/hello-sampling.png +0 -0
docs/source/img/hello-sound.png +0 -0
docs/source/img/real-vs-sampling.png +0 -0
docs/source/img/sampling-sound-wave.gif +0 -0
docs/source/img/sound-wave.png +0 -0
docs/source/img/speech-processing.png +0 -0
docs/source/lamassu.rst +1 -4
docs/source/speech/sampling.rst +68 -0
lamassu/__init__.py +0 -0
lamassu/speech/__init__.py +0 -0
lamassu/speech/sampling.py +16 -0
setup.py +1 -1

.readthedocs.yaml CHANGED

@@ -23,4 +23,7 @@ sphinx:
 python:
   install:
-  - requirements: docs/source/requirements.txt

 python:
   install:
+    - method: pip
+      path: .
+    - requirements: requirements.txt
+    - requirements: docs/source/requirements.txt

docs/source/img/hello-sampling.png ADDED

docs/source/img/hello-sound.png ADDED

docs/source/img/real-vs-sampling.png ADDED

docs/source/img/sampling-sound-wave.gif ADDED

docs/source/img/sound-wave.png ADDED

docs/source/img/speech-processing.png ADDED

docs/source/lamassu.rst CHANGED

@@ -2,11 +2,8 @@
 Lamassu
 =======
-Machine Learning
-================
 .. toctree::
    :maxdepth: 100
    machine_learning/rnn

 Lamassu
 =======
 .. toctree::
    :maxdepth: 100
    machine_learning/rnn
+   speech/sampling.rst

docs/source/speech/sampling.rst ADDED

	@@ -0,0 +1,68 @@

+===============================
+Speech Recognition with Lamassu
+===============================
+.. contents:: Table of Contents
+    :depth: 2
+Speech recognition will become a primary way that we interact with computers.
+One might guess that we could simply feed sound recordings into a neural network and train it to produce text:
+.. figure:: ../img/speech-processing.png
+    :align: center
+That's the holy grail of speech recognition with deep learning, but we aren't quite there yet. The big problem is that
+speech varies in speed. One person might say "hello!" very quickly and another person might say
+"heeeelllllllllllllooooo!" very slowly, producing a much longer sound file with much more data. Both sounds should be
+recognized as exactly the same text - "hello!" Automatically aligning audio files of various lengths to a fixed-length
+piece of text turns out to be pretty hard. To work around this, we have to use some special tricks and extra processing.
+Turning Sounds into Bits
+========================
+The first step in speech recognition is obvious — we need to feed sound waves into a computer. Sound is transmitted as
+waves. A sound clip of someone saying "Hello" looks like this:
+.. figure:: ../img/hello-sound.png
+    :align: center
+Sound waves are one-dimensional. At every moment in time, they have a single value based on the height of the wave.
+Let's zoom in on one tiny part of the sound wave and take a look:
+.. figure:: ../img/sound-wave.png
+    :align: center
+To turn this sound wave into numbers, we just record the height of the wave at equally-spaced points:
+.. figure:: ../img/sampling-sound-wave.gif
+    :align: center
+This is called *sampling*. We are taking a reading thousands of times a second and recording a number representing the
+height of the sound wave at that point in time. That's basically all an uncompressed .wav audio file is.
+"CD Quality" audio is sampled at 44.1 kHz (44,100 readings per second). But for speech recognition, a sampling rate of
+16 kHz (16,000 samples per second) is enough to cover the frequency range of human speech.
+Let's sample our "Hello" sound wave 16,000 times per second. Here are the first 100 samples:
+.. figure:: ../img/hello-sampling.png
+    :align: center
+.. note:: Can digital samples perfectly recreate the original analog sound wave? What about those gaps?
+   You might be thinking that sampling is only creating a rough approximation of the original sound wave because it's
+   only taking occasional readings. There are gaps in between our readings so we must be losing data, right?
+   .. figure:: ../img/real-vs-sampling.png
+    :align: center
+   But thanks to the `Nyquist theorem`_, we know that we can use math to perfectly reconstruct the original sound wave
+   from the spaced-out samples — as long as we sample at least twice as fast as the highest frequency we want to record.
+.. automodule:: lamassu.speech.sampling
+   :members:
+   :undoc-members:
+   :show-inheritance:
+.. _`Nyquist theorem`: https://en.wikipedia.org/wiki/Nyquist%E2%80%93Shannon_sampling_theorem
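
A minimal sketch of the sampling idea described in `sampling.rst`, assuming a pure cosine tone in place of recorded speech and using only the standard library. `sample_tone` and `nyquist_limit_hz` are hypothetical helper names introduced for illustration; they are not part of `lamassu`. The snippet takes readings at equally spaced points in time, then compares a 7 kHz tone with a 9 kHz tone at a 16 kHz sampling rate:

```python
import math


def sample_tone(frequency_hz: float, sample_rate_hz: int, n_samples: int) -> list:
    """Read a cosine tone at equally spaced points in time (hypothetical helper)."""
    return [
        math.cos(2 * math.pi * frequency_hz * n / sample_rate_hz)
        for n in range(n_samples)
    ]


def nyquist_limit_hz(sample_rate_hz: int) -> float:
    """Highest frequency that a given sampling rate can represent."""
    return sample_rate_hz / 2


SAMPLE_RATE_HZ = 16_000  # the speech-oriented rate mentioned in the docs

# One second of audio at 16 kHz is 16,000 readings.
one_second = sample_tone(440.0, SAMPLE_RATE_HZ, SAMPLE_RATE_HZ)

# A tone below the limit and a tone above it, sampled at the same points.
below = sample_tone(7_000.0, SAMPLE_RATE_HZ, 160)
above = sample_tone(9_000.0, SAMPLE_RATE_HZ, 160)

# True when every pair of readings matches to within floating-point noise.
aliased = all(math.isclose(a, b, abs_tol=1e-9) for a, b in zip(below, above))
```

At 16,000 samples per second the limit is 8,000 Hz. The 9 kHz tone lies above it, and its readings coincide with those of the 7 kHz tone, so the two cannot be told apart after sampling. This is the condition the Nyquist note refers to: reconstruction only works for frequencies below half the sampling rate.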

lamassu/__init__.py ADDED

File without changes

lamassu/speech/__init__.py ADDED

File without changes

lamassu/speech/sampling.py ADDED

	@@ -0,0 +1,16 @@

+import wave
+import numpy as np
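+def sample_wav(file_path: str):
+    """
+    Reads every frame of a .wav file and returns the samples as an array.
+    The frames are decoded as 16-bit integers, so the file is assumed to contain 16-bit PCM audio.
+    :param file_path:  The absolute path to the .wav file to be sampled
+    :return: an array of sampled points
+    """
+    with wave.open(file_path, "rb") as f:
+        # Read all frames in one call; the raw bytes are decoded as int16 below
+        frames = f.readframes(f.getnframes())
+        return np.frombuffer(frames, dtype=np.int16)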
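
`sample_wav` decodes frames as `np.int16`, which matches 16-bit PCM audio. The following is a rough sketch of how that assumption could be checked, using only the standard-library `wave` and `struct` modules so that it runs without NumPy. `write_test_tone`, `describe_wav`, and `read_int16_samples` are hypothetical names introduced for illustration and are not part of this change:

```python
import math
import os
import struct
import tempfile
import wave

SAMPLE_RATE_HZ = 16_000  # the speech-oriented rate mentioned in the docs


def write_test_tone(path: str, frequency_hz: float = 440.0, duration_s: float = 0.01) -> int:
    """Write a mono, 16-bit PCM sine tone and return the number of frames written."""
    n_frames = int(round(SAMPLE_RATE_HZ * duration_s))
    amplitude = 0.5 * 32767  # half of the int16 range, to stay clear of clipping
    pcm = [
        int(round(amplitude * math.sin(2 * math.pi * frequency_hz * n / SAMPLE_RATE_HZ)))
        for n in range(n_frames)
    ]
    with wave.open(path, "wb") as f:
        f.setnchannels(1)   # mono
        f.setsampwidth(2)   # 2 bytes per sample, i.e. 16-bit
        f.setframerate(SAMPLE_RATE_HZ)
        f.writeframes(struct.pack(f"<{n_frames}h", *pcm))
    return n_frames


def describe_wav(path: str) -> dict:
    """Return the header fields that decide how the frames must be decoded."""
    with wave.open(path, "rb") as f:
        return {
            "channels": f.getnchannels(),
            "sample_width_bytes": f.getsampwidth(),
            "sample_rate_hz": f.getframerate(),
            "frames": f.getnframes(),
        }


def read_int16_samples(path: str) -> list:
    """Decode frames as little-endian int16, refusing any other sample width."""
    with wave.open(path, "rb") as f:
        if f.getsampwidth() != 2:
            raise ValueError("expected 16-bit PCM audio")
        raw = f.readframes(f.getnframes())
    return list(struct.unpack(f"<{len(raw) // 2}h", raw))


with tempfile.TemporaryDirectory() as tmp:
    wav_path = os.path.join(tmp, "tone.wav")
    written = write_test_tone(wav_path)
    info = describe_wav(wav_path)
    samples = read_int16_samples(wav_path)
```

For a mono 16-bit file, the number of decoded samples equals the number of frames. For a stereo file the frames interleave the channels, so a flat int16 array would alternate between left and right readings; a caller would likely want to check `getnchannels()` before treating the result as a single wave.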

setup.py CHANGED

@@ -2,7 +2,7 @@ from setuptools import setup, find_packages
 setup(
     name="lamassu",
-    version="0.0.8",
     description="Empowering individual to agnostically run machine learning algorithms to produce ad-hoc AI features",
     url="https://github.com/QubitPi/lamassu",
     author="Jiaqi liu",

 setup(
     name="lamassu",
+    version="0.0.9",
     description="Empowering individual to agnostically run machine learning algorithms to produce ad-hoc AI features",
     url="https://github.com/QubitPi/lamassu",
     author="Jiaqi liu",