QubitPi commited on
Commit
5e6608a
·
unverified ·
1 Parent(s): b7258ac

(Speech recognition) Implement sampling (#1)

Browse files

* (Speech recognition) Implement sampling

* Self-review

.readthedocs.yaml CHANGED
@@ -23,4 +23,7 @@ sphinx:
23
 
24
  python:
25
  install:
26
- - requirements: docs/source/requirements.txt
 
 
 
 
23
 
24
  python:
25
  install:
26
+ - method: pip
27
+ path: .
28
+ - requirements: requirements.txt
29
+ - requirements: docs/source/requirements.txt
docs/source/img/hello-sampling.png ADDED
docs/source/img/hello-sound.png ADDED
docs/source/img/real-vs-sampling.png ADDED
docs/source/img/sampling-sound-wave.gif ADDED
docs/source/img/sound-wave.png ADDED
docs/source/img/speech-processing.png ADDED
docs/source/lamassu.rst CHANGED
@@ -2,11 +2,8 @@
2
  Lamassu
3
  =======
4
 
5
-
6
- Machine Learning
7
- ================
8
-
9
  .. toctree::
10
  :maxdepth: 100
11
 
12
  machine_learning/rnn
 
 
2
  Lamassu
3
  =======
4
 
 
 
 
 
5
  .. toctree::
6
  :maxdepth: 100
7
 
8
  machine_learning/rnn
9
+ speech/sampling.rst
docs/source/speech/sampling.rst ADDED
@@ -0,0 +1,68 @@
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
1
+ ===============================
2
+ Speech Recognition with Lamassu
3
+ ===============================
4
+
5
+ .. contents:: Table of Contents
6
+ :depth: 2
7
+
8
+ Speech recognition will become a primary way that we interact with computers.
9
+
10
+ One might guess that we could simply feed sound recordings into a neural network and train it to produce text:
11
+
12
+ .. figure:: ../img/speech-processing.png
13
+ :align: center
14
+
15
+ That's the holy grail of speech recognition with deep learning, but we aren't quite there yet. The big problem is that
16
+ speech varies in speed. One person might say "hello!" very quickly and another person might say
17
+ "heeeelllllllllllllooooo!" very slowly, producing a much longer sound file with much more data. Both sounds should be
18
+ recognized as exactly the same text - "hello!" Automatically aligning audio files of various lengths to a fixed-length
19
+ piece of text turns out to be pretty hard. To work around this, we have to use some special tricks and extra precessing.
20
+
21
+ Turning Sounds into Bits
22
+ ========================
23
+
24
+ The first step in speech recognition is obvious — we need to feed sound waves into a computer. Sound is transmitted as
25
+ waves. A sound clip of someone saying "Hello" looks like
26
+
27
+ .. figure:: ../img/hello-sound.png
28
+ :align: center
29
+
30
+ Sound waves are one-dimensional. At every moment in time, they have a single value based on the height of the wave.
31
+ Let's zoom in on one tiny part of the sound wave and take a look:
32
+
33
+ .. figure:: ../img/sound-wave.png
34
+ :align: center
35
+
36
+ To turn this sound wave into numbers, we just record of the height of the wave at equally-spaced points:
37
+
38
+ .. figure:: ../img/sampling-sound-wave.gif
39
+ :align: center
40
+
41
+ This is called *sampling*. We are taking a reading thousands of times a second and recording a number representing the
42
+ height of the sound wave at that point in time. That's basically all an uncompressed .wav audio file is.
43
+
44
+ "CD Quality" audio is sampled at 44.1khz (44,100 readings per second). But for speech recognition, a sampling rate of
45
+ 16khz (16,000 samples per second) is enough to cover the frequency range of human speech.
46
+
47
+ Lets sample our "Hello" sound wave 16,000 times per second. Here's the first 100 samples:
48
+
49
+ .. figure:: ../img/hello-sampling.png
50
+ :align: center
51
+
52
+ .. note:: Can digital samples perfectly recreate the original analog sound wave? What about those gaps?
53
+
54
+ You might be thinking that sampling is only creating a rough approximation of the original sound wave because it's
55
+ only taking occasional readings. There's gaps in between our readings so we must be losing data, right?
56
+
57
+ .. figure:: ../img/real-vs-sampling.png
58
+ :align: center
59
+
60
+ But thanks to the `Nyquist theorem`_, we know that we can use math to perfectly reconstruct the original sound wave
61
+ from the spaced-out samples — as long as we sample at least twice as fast as the highest frequency we want to record.
62
+
63
+ .. automodule:: lamassu.speech.sampling
64
+ :members:
65
+ :undoc-members:
66
+ :show-inheritance:
67
+
68
+ .. _`Nyquist theorem`: https://en.wikipedia.org/wiki/Nyquist%E2%80%93Shannon_sampling_theorem
lamassu/__init__.py ADDED
File without changes
lamassu/speech/__init__.py ADDED
File without changes
lamassu/speech/sampling.py ADDED
@@ -0,0 +1,16 @@
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
1
+ import wave
2
+
3
+ import numpy as np
4
+
5
+
6
+ def sample_wav(file_path: str):
7
+ """
8
+ Sampling a .wav file
9
+
10
+ :param file_path: The absolute path to the .wav file to be sampled
11
+
12
+ :return: an array of sampled points
13
+ """
14
+ with wave.open(file_path, "rb") as f:
15
+ frames = f.readframes(f.getnframes())
16
+ return np.frombuffer(frames, dtype=np.int16)
setup.py CHANGED
@@ -2,7 +2,7 @@ from setuptools import setup, find_packages
2
 
3
  setup(
4
  name="lamassu",
5
- version="0.0.8",
6
  description="Empowering individual to agnostically run machine learning algorithms to produce ad-hoc AI features",
7
  url="https://github.com/QubitPi/lamassu",
8
  author="Jiaqi liu",
 
2
 
3
  setup(
4
  name="lamassu",
5
+ version="0.0.9",
6
  description="Empowering individual to agnostically run machine learning algorithms to produce ad-hoc AI features",
7
  url="https://github.com/QubitPi/lamassu",
8
  author="Jiaqi liu",