In Part 3, we learned how to take an image and treat it as an array of numbers so that we can feed it directly into a neural network for image recognition. You might hope we could do the same with speech - just feed sound recordings into a neural network and have it learn to produce text. That's the holy grail of speech recognition with deep learning, but we aren't quite there yet (at least at the time that I wrote this - I bet that we will be in a couple of years). The big problem is that speech varies in speed. One person might say "hello!" very quickly and another person might say "heeeelllllllllllllooooo!" very slowly, producing a much longer sound file with much more data. Both sound files should be recognized as exactly the same text - "hello!" Automatically aligning audio files of various lengths to a fixed-length piece of text turns out to be pretty hard. To work around this, we have to use some special tricks and extra processing in addition to a deep neural network. Let's see how it works!

Turning Sounds into Bits

The first step in speech recognition is obvious - we need to feed sound waves into a computer. Sound is digitized by sampling it - recording the height of the wave at regular intervals, which leaves tiny gaps between readings. Can digital samples perfectly recreate the original analog sound wave? What about those gaps? Thanks to the Nyquist theorem, we know that we can use math to perfectly reconstruct the original sound wave from the spaced-out samples - as long as we sample at least twice as fast as the highest frequency we want to record.
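To make this concrete, here is a minimal sketch (not from the original article - the sample rate, tone frequency, and helper function are illustrative choices) that samples a 440 Hz sine wave at 16,000 samples per second and then uses Whittaker-Shannon sinc interpolation to rebuild the wave's value in one of the gaps between samples:

```python
import numpy as np

# Illustrative parameters (assumed, not from the article):
sample_rate = 16000   # samples per second - far above 2 * 440 Hz
freq = 440.0          # tone frequency in Hz (an "A" note)
duration = 0.05       # seconds of audio

# Take discrete samples of the analog wave, like a sound card would.
n = np.arange(int(sample_rate * duration))
samples = np.sin(2 * np.pi * freq * n / sample_rate)

def reconstruct(samples, sample_rate, t):
    """Estimate the analog wave at an arbitrary time t from the samples.

    Each sample contributes a shifted sinc pulse; summing them fills
    in the gaps between samples (Whittaker-Shannon interpolation).
    """
    idx = np.arange(len(samples))
    return np.sum(samples * np.sinc(t * sample_rate - idx))

# Evaluate the reconstruction exactly halfway between two samples -
# a point we never actually recorded.
t_mid = 100.5 / sample_rate
true_value = np.sin(2 * np.pi * freq * t_mid)
approx = reconstruct(samples, sample_rate, t_mid)
print(abs(true_value - approx))  # error is tiny despite the gap
```

Because the tone's frequency (440 Hz) is far below half the sample rate (8,000 Hz), the reconstructed value lands almost exactly on the original wave - only truncation of the finite sample window keeps the error from being exactly zero.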