Pattern based NLP & ASR

MikeB · « **Reply #30 on:** December 10, 2021, 11:00:14 am »

Update.

I changed the Resonator/Peak filter's bandwidth to 250hz (Q = 3) to better detect some voice formants, but in doing that the SNR dropped to 15% (compared to 40-100% before), so now it needs noise reduction.

Most of the voice formants that don't move in pitch (and are spaced out) are easy to detect, but not the more complicated ones.

Voice formants for the long vowel "a-":
"a-". F1: ~500hz (+/-200), F2: 1500 transition to 2250hz, F3: ~2750hz (+/- 100), F4: ~3500hz (+/- 100)

F1 and F3 can be locked in for this phoneme (F4 is just air), but F2 needs to be clearer, then drawn as a vector.

MikeB · « **Reply #31 on:** December 17, 2021, 09:54:28 am »

Update.

Remade all the Resonator/Peak Filters in a program called EasyFilter which uses 64-bit doubles to do calculations instead of cheap 16-bit sums, and it's a lot better.

Added a Low-pass filter before doing the Res/Peak filtering. I have access to the data buffer so I was able to use a Zero-phase IIR filter (running it forwards and backwards over the data)...
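
A minimal sketch of the forwards-and-backwards idea, assuming a simple one-pole low-pass rather than the actual filter designed in EasyFilter. The function names and the `alpha` coefficient are illustrative only.

```python
def one_pole_lowpass(samples, alpha):
    """Causal one-pole IIR low-pass: y[n] = y[n-1] + alpha * (x[n] - y[n-1])."""
    out = []
    y = samples[0] if samples else 0.0
    for x in samples:
        y += alpha * (x - y)
        out.append(y)
    return out


def zero_phase_lowpass(samples, alpha):
    """Filter forwards, then filter the reversed result, then reverse back.

    The delay added by the second pass cancels the delay of the first,
    which is only possible when the whole buffer is available up front.
    """
    forward = one_pole_lowpass(samples, alpha)
    backward = one_pole_lowpass(forward[::-1], alpha)
    return backward[::-1]


# A single pulse at index 20: the causal pass only smears it to the right,
# the zero-phase pass smears it equally to both sides.
pulse = [0.0] * 20 + [1.0] + [0.0] * 20
causal = one_pole_lowpass(pulse, 0.3)
zero_phase = zero_phase_lowpass(pulse, 0.3)
print(causal[17], causal[23])
print(zero_phase[17], zero_phase[23])
```

The symmetric output is what keeps formant onsets from being shifted in time by the filter.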

Now the noise/non-signal is very low and consistent, and the voice/signal has good, consistent SNR.

Vowels can't be understood very well this way, but it does give a good starting point for matched filtering against a pre-recorded artificial/"perfect" vowel, i.e. skipping manual cross-correlation of all 44 phonemes, as you can reduce the number of possibilities first.

Praat allows making and saving custom sounds/vowels in 16-bit sample points, so at this point (after seeing a particular frequency is busy) I can straight away reduce the number of possible phonemes then test them. As vowel phonemes are 200ms of 1-3 frequency tones that gradually go up/down/straight, being out by 10% may not make much difference... So will be testing this next week.

MikeB · « **Reply #32 on:** January 05, 2022, 08:28:57 am »

Update.

Changed the main LP filter to a 2750hz Chebyshev zero-phase, and changed all sound data to use 32-bit floats while processing.

Wave formants 1-4 (up to 2750hz) are now 200-300% SNR over background noise (spoken close to mic).

Tone analysis (sine wave matching) didn't work over 100ms as it only gives 3.5hz bandwidth before being 75% or more out of phase and indistinguishable from noise... Tone analysis using a short window won't work very well either, as vowels "a-", "o-" (long 'a', and long 'o') transition from one frequency to another over 100ms-1000ms, and need to be +/-10% anyway.

So I'll be looking at more traditional frequency analysis like FFT, DCT, DTT. With the fast pre-calculated wave formants, many of the phonemes can be counted in/out before doing deeper signal analysis anyway so it should be faster.

MikeB · « **Reply #33 on:** February 07, 2022, 11:57:13 am »

Deleted all IIR filters as they seemed to be adding noise. Now only using pre-calculated FIR filters, which also have a built-in Hamming Window for the FFT.
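
A rough sketch of what a pre-calculated low-pass FIR with a Hamming window can look like, using the textbook windowed-sinc design. This is not the filter set from the post: the 8khz sample rate, the 63-tap length and the function names are assumptions for illustration.

```python
import cmath
import math


def hamming_lowpass_taps(cutoff_hz, rate_hz, num_taps):
    """Windowed-sinc low-pass taps, normalised to unity gain at 0 hz."""
    fc = cutoff_hz / rate_hz              # cutoff as a fraction of the sample rate
    mid = (num_taps - 1) / 2
    taps = []
    for n in range(num_taps):
        x = n - mid
        # Ideal (infinite) low-pass impulse response, sampled at x.
        ideal = 2 * fc if x == 0 else math.sin(2 * math.pi * fc * x) / (math.pi * x)
        # The Hamming window tapers the truncated response to cut stop-band ripple.
        window = 0.54 - 0.46 * math.cos(2 * math.pi * n / (num_taps - 1))
        taps.append(ideal * window)
    total = sum(taps)
    return [t / total for t in taps]


def gain_at(taps, freq_hz, rate_hz):
    """Magnitude of the filter's frequency response at freq_hz."""
    w = 2 * math.pi * freq_hz / rate_hz
    return abs(sum(t * cmath.exp(-1j * w * n) for n, t in enumerate(taps)))


# Assumed example: 1100hz cutoff at an 8khz sample rate.
taps = hamming_lowpass_taps(cutoff_hz=1100, rate_hz=8000, num_taps=63)
print(round(gain_at(taps, 300, 8000), 3), round(gain_at(taps, 3000, 8000), 4))
```

Because the taps are fixed numbers, they can be computed once offline and stored as a table, which is what makes a pre-calculated FIR cheap at run time.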

There are 3 FIR filters (low-passes) at 1100, 2200, 4200hz, with the sample data split into 0-250hz, 250-1000hz, 1000-1500hz, 1500-3000hz. This reduces noise as much as possible, as the processing mostly depends on how "busy" the wave data is.

"Busyness" is calculated for the 250-1000hz range. (Sample data difference / RMS power). If busy then the FFT is processed.

The FFT runs on each of the four frequency ranges separately. If the four ranges are merged with only one FIR, the output is too noisy for the lower ranges.

After this the FFT output data is smoothed with the top frequencies averaged together but I will be changing this to a custom Mel Filter Bank setup (not the algorithm, using 12 fixed ranges to suit phoneme ranges).

There's still a lot to change but thought I would post this as it's completely different, and still only takes ~1.5ms to process (from a 102.4ms sample buffer). The NLP only takes 0.1-0.2ms on top. So it is 98.5% idling in a thread::yield().

Normal Speech Recognition takes 0.5-0.7 seconds per sample. The problem seems to be in the autoregression/prediction modelling after the FFT and MFCC/Mel Filter algorithms. It is precise, but too slow...

In the picture "a-" and "e-" long vowels are detected.

MikeB · « **Reply #34 on:** March 09, 2022, 10:08:07 am »

Rewrote everything to better frame the data around noise at the beginning and end.

Using RMS derived 'Busyness' and Volume Peak to auto adjust the noise floor - You can gradually raise your voice and it will only output "*noise*", but when you suddenly speak from quiet it will then process voice.
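
One possible reading of the RMS-derived 'Busyness' measure, given earlier in the thread as "sample data difference / RMS power". The exact formula isn't stated, so this sketch assumes the mean absolute sample-to-sample difference divided by the RMS level; the function name and the 8khz test rate are made up for the example.

```python
import math


def busyness(samples):
    """Mean absolute sample-to-sample difference divided by RMS level.

    Dividing by RMS makes the measure independent of overall volume:
    it grows with how fast the wave moves, not how loud it is.
    """
    if len(samples) < 2:
        return 0.0
    rms = math.sqrt(sum(s * s for s in samples) / len(samples))
    if rms == 0.0:
        return 0.0  # pure silence: nothing to measure
    diff = sum(abs(b - a) for a, b in zip(samples, samples[1:])) / (len(samples) - 1)
    return diff / rms


rate = 8000  # assumed sample rate for the demo
slow = [math.sin(2 * math.pi * 100 * n / rate) for n in range(800)]
fast = [math.sin(2 * math.pi * 800 * n / rate) for n in range(800)]
loud_fast = [4.0 * s for s in fast]

print(busyness(slow) < busyness(fast))
print(abs(busyness(fast) - busyness(loud_fast)) < 1e-9)
```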

The main differences now are in framing, as this and pitch detection are the most important.

Because I'm not using autoregression/prediction, inverse FFT, etc. (the main time consumers), all I have is basically one FFT output. This makes it difficult to detect any short lasting sound, which is most consonants, as FFTs need many samples for reliability.

So to detect consonants and vowels together, I'm using framing, resolution, and sound transition. Frame lengths are 16ms, 32ms, 64ms, 64ms (for 8khz: 128, 256, 512, 512 samples) fixed length starting from when the voice starts. Consonants are a one or two part short burst (16ms, and 32ms frames). Vowels are two part (64ms, and 64ms). The theory is to use ~5 group resolution in hz for Consonants, and 12 for vowels. The 12 groups for vowels are based on general vowel speaking range. Consonants aren't always at the beginning of something you say, but most are involved with plosives so will always have silence before them, creating a new speech frame with the consonant at the beginning.

Another problem was a low hz bias in the FFT output (linearly stronger at lower frequencies), so levelling those out made a big difference.

Getting good results from vowels, but framing, FFT quality, and post processing still need work.

MikeB · « **Reply #35 on:** April 08, 2022, 01:44:25 pm »

This month was mostly theory and expanding on Framing, Resolution, and sound Transitions.

An FIR Hilbert filter (at 2500hz) was added for finding the initial Framing/Noise Floor level. This replaced the entire 'Busyness' code, as the Hilbert filter removes bass, top end, and 2500hz breath noise in one go. Busyness calculations that relied on those to work now show no difference, so Volume Peaks are relied on instead.

Framing is now expanded to cover both short & long voice transitions. At idle/noise a standard 1408 samples are captured (~0.7% duty cycle, ~1ms/140.8ms frame at 10khz). During work, up to 10 frames or 3456 samples are captured (at ~25% duty cycle, including waiting for samples).

Frames are: G1: 12.8, 25.6, 25.6, 25.6, 25.6, 25.6ms. G2: 51.2, 51.2, 51.2, 51.2ms.
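
The frame lengths above can be checked against the sample counts quoted in this post with a few lines of arithmetic. The variable and function names are just for the example; the numbers are the ones given here (10khz sample rate).

```python
SAMPLE_RATE_HZ = 10_000
G1_MS = [12.8, 25.6, 25.6, 25.6, 25.6, 25.6]
G2_MS = [51.2, 51.2, 51.2, 51.2]


def ms_to_samples(ms, rate_hz=SAMPLE_RATE_HZ):
    """Convert a frame length in milliseconds to a whole number of samples."""
    return round(ms * rate_hz / 1000)


g1 = [ms_to_samples(ms) for ms in G1_MS]
g2 = [ms_to_samples(ms) for ms in G2_MS]

print(g1, sum(g1))                        # G1 frames and their total
print(g2, sum(g2))                        # G2 frames and their total
print(len(g1) + len(g2), sum(g1) + sum(g2))
```

G1 adds up to the 1408 samples captured at idle, and G1 plus G2 gives the 10 frames / 3456 samples captured during work. The 25.6ms frames are 256 samples each, which matches the FFT size mentioned in this post.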

The FFT still works well at 256 samples.

Mel filter bank style filter groups are now 13 for vowels: 250, 350, 450, 550, 650, 750, 1000, 1333, 1666, 2000, 2500, 3500, 5000hz.
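
A sketch of how FFT output could be collapsed into the 13 fixed groups. It assumes the listed frequencies are the upper edges of each group and that bins inside a group are simply averaged; neither is stated in the post. The naive DFT is only there to keep the example self-contained, and all names are illustrative.

```python
import cmath
import math

# Assumption: each value is the upper edge of a group, in hz.
BAND_EDGES_HZ = [250, 350, 450, 550, 650, 750, 1000, 1333, 1666, 2000, 2500, 3500, 5000]


def dft_magnitudes(frame):
    """Naive DFT magnitudes for bins 0..N/2 (slow, but fine for a 256-sample demo)."""
    n = len(frame)
    return [
        abs(sum(x * cmath.exp(-2j * math.pi * k * i / n) for i, x in enumerate(frame)))
        for k in range(n // 2 + 1)
    ]


def band_energies(mags, rate_hz, fft_size, edges_hz=BAND_EDGES_HZ):
    """Average the bin magnitudes that fall in (previous edge, edge]."""
    bin_hz = rate_hz / fft_size
    out = []
    lower = 0.0
    for upper in edges_hz:
        bins = [m for k, m in enumerate(mags) if lower < k * bin_hz <= upper]
        out.append(sum(bins) / len(bins) if bins else 0.0)
        lower = upper
    return out


# 256-sample frame at 10khz holding a 700hz test tone.
rate_hz, fft_size = 10_000, 256
tone = [math.sin(2 * math.pi * 700 * n / rate_hz) for n in range(fft_size)]
energies = band_energies(dft_magnitudes(tone), rate_hz, fft_size)
loudest = energies.index(max(energies))
print(BAND_EDGES_HZ[loudest])
```

With 256 samples at 10khz each FFT bin is about 39hz wide, so the narrow 100hz groups below 750hz only hold two or three bins each.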

The Low Hz Removal filter was upgraded to reflect HRTF/true ear resonance values as our ears naturally remove bass and boost 2-5khz, but mics don't.

Consonant & Vowel detection model is still: Silence-ConsonantGroup-VowelGroup-Silence ... or Silence-VowelGroup-Silence.
This meant adding 15 new consonants as combinations, and a lot of vowel groups. Vowel groups are a WIP as there are potentially hundreds. The longest and most difficult connected vowels without silence are words/phrases like "who are you", "are you 'aight"/"are you alright", "oh yeah", "you wait"/"you ate". They contain consonants but if they're not pronounced with spaces then the only way to detect them is as one long vowel with multiple sound transitions.

So by being able to identify frequency transitions clearly, patterns of vowels become much easier to detect...

Overall it's a nightmare, but I'm close to word output now. If I can detect "k" regularly I can output "cake", fast, and be reliably different to "tree".

MagnusWootton · « **Reply #36 on:** April 08, 2022, 04:17:09 pm »

nice 1, but could it trick someone into thinking the bot is someone, cause I need to get out of some annoying interviews I'm currently having.

MikeB · « **Reply #37 on:** April 09, 2022, 07:00:24 am »

Response speed, including being able to choose from instant to a well timed response, adds to affinity/immersion so that increases believability.

EG. An instant response in tense circumstances, or a 1-2 second delay in casual circumstances.

It's more intended for real-time applications than a "cover all" approach.

MikeB · « **Reply #38 on:** June 03, 2022, 11:43:33 am »

Last month I intended to output "cake" and "tree" but it was still too inconsistent.

This month is half-way there.

Framing and Resolution were updated to support 16khz instead of 10khz, to better detect consonants at 4000-5500hz.

Sample sizes are double but there are fewer frames, so the speed is roughly the same.

Sample framing is: (G1) 16, 32, 32, 32, 32ms. (G2) 64, 64, 64ms.

Resolution groups have been updated slightly. 15 instead of 13, starting from 400hz-5500hz.

No changes to vowel Transition model, but now also outputs which consonants/vowels are detected.

Consonant detection is mostly fine with a clear voice. Only "k" tested.

Vowel detection is inconsistent. As vowels rely on multiple frames, changing to 16khz as well as reducing frames didn't help, so will be testing 12khz in the future with shorter/more frames.

Saying "cake" (below) with phoneme output.

MikeB · « **Reply #39 on:** July 06, 2022, 02:03:46 pm »

The frequencies for "cake" and "tree" are reasonably consistent now.

Response time is only 5-10ms from the end of recording the last sample to the end of processing the data.

"K-ay" (in "cake") and "t-r-ee" (in "tree") are 90-100% consistent with good volume. The second "k" sound in 'cake' is often a low-volume miss.

Robust in *mild* accent change, vagueness/clarity change, volume change, pitch change. Low volume is the worst.

16khz fixed input frequency.

FIR Hilbert and FIR Low pass filters have been modified to work with factors of 8 to fit the frame sizes. Don't know why I didn't do that before. One Low Pass at 1000hz, the other Low Pass at 8000hz (dual samples feed the FFT).

Frame sizes are one 16ms frame, the rest are 32ms each.
16khz/368ms max. 5888 samples: (G1) 256-512, (G2) 512-512-512-512-512, (G3) 512-512-512-512-512.

Added a Plosive Detector to determine whether G1 samples are a plosive consonant.

Non-plosive consonants are now checked with vowels.

Seeing all the frequencies I need to for the consonants & vowels I want. I just need to add more and detect low volume samples better.

I'm aiming for a small list of words such as "up, down, left, right, start, stop, hi, bye..." so it can become useful straight away. It's also easier to handle strong accents and missing sounds if the word list is smaller. I have many ideas for solving it, but it'll at least be as good as current command based systems. If this works out then more is possible...

MagnusWootton · « **Reply #40 on:** July 06, 2022, 03:18:05 pm »

I've done a lot of computer vision (in the form of a corner tracker), and I can extend the theory to some 1D system for sound.

Ever thought of using k-nearest neighbours on sound bytes to do voice recognition?

MikeB · « **Reply #41 on:** July 07, 2022, 08:40:16 am »

K-nearest neighbour is a statistical analysis/learning method, and this is the reason why speech recognition always takes at least 0.5 seconds to process, and why speech recognition has never progressed since the 1950's (except for different types of guessing algorithms)...

My approach is halfway between IBM Shoebox and modern speech recognition.

Frequencies above & below 1000hz are split, ultimately run through an FFT, then fast frequency analysis performed. Nothing else. The NLP (for homophones, missing words...) uses Compression Pattern Matching.

Speech recognition that can handle all words in all languages is one thing, but an "as fast as possible" approach is still needed in society...

MikeB · « **Reply #42 on:** August 09, 2022, 11:52:25 am »

This month was mostly fine tuning and an efficiency pass.

Processing takes approx 1ms per 32ms frame.

Framing is the same. One 16ms frame, the rest 32ms.

Resolution has been increased to the below values:
438, 469, 500, 532, 563, 594, 625, 657, 688, 719, 750, 782, 1000, 1250, 1500, 1625, 1750, 1875, 2000, 2125, 2250, 2500, 3000, 3500, 4000, 5500, 8000

HRTF/Fletcher-Munson calibration values for the above are better.

Frequency Transition detection is the same.

More phonemes added (26 unique patterns: 23 vowels, 1 non-plosive consonant, 2 consonants).

That is basically all the vowels I need to add, which include some duplicates for accents. I have more patterns for consonants but I'm not testing those just yet.

Finally added Peak Volume Normalisation which doubled the reliability/robustness.
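
For readers unfamiliar with the term, peak volume normalisation in its simplest form scales a frame so that its largest absolute sample hits a fixed target. This is a generic sketch, not the code used here; `peak_normalise` and `target_peak` are names chosen for the example.

```python
def peak_normalise(samples, target_peak=1.0):
    """Scale samples so the largest absolute value equals target_peak."""
    peak = max((abs(s) for s in samples), default=0.0)
    if peak == 0.0:
        return list(samples)  # silence: leave untouched, avoid divide-by-zero
    gain = target_peak / peak
    return [s * gain for s in samples]


# The same shape spoken quietly and loudly ends up at the same level,
# so later frequency comparisons no longer depend on speaking volume.
quiet = [0.0, 0.05, -0.1, 0.02]
loud = [0.0, 0.4, -0.8, 0.16]
print(peak_normalise(quiet))
print(peak_normalise(loud))
```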

"Cake" and "Tree" are IMO comfortably robust for a command based system. I think most people would get them in 1-3 shots. That might seem like it has little value, but this is no data, no training, no learning, handles mild accents, vagueness, pitch & volume change, and is instant.

Phoneme Animations:

This month I also planned a total redesign to suit realtime phoneme animations. IE. To send an Animation ID & Length along with audio chunk data, for 3d experiences and games.

Sound waves travel one metre in ~2.9ms, so to be as real as possible, at 10 metres phoneme animations should be 29ms faster to start than the audio chunk data, and be seamless.

So I calculated with 4kb data chunks (128ms total @ 16khz), the initial plosive consonant (48ms) can process and send up to 78ms early (which is 27 metres in sound travel). 3kb data chunks work as well (up to 46ms early or 16 metres). Animations will be 50-100ms late if you look at your own character while you speak.
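
The lead-time figures above can be reproduced with the arithmetic below. Two values are assumptions: 2 bytes per sample (16-bit mono, implied by 4kb = 128ms at 16khz) and about 2ms of processing time (inferred from 128 - 48 - 78). The names are illustrative.

```python
SAMPLE_RATE_HZ = 16_000
BYTES_PER_SAMPLE = 2    # assumption: 16-bit mono samples
PLOSIVE_MS = 48         # initial plosive consonant length from the post
PROCESSING_MS = 2       # assumption: inferred from the 78ms figure
MS_PER_METRE = 2.9      # sound travel time per metre from the post


def chunk_ms(chunk_bytes):
    """Duration of an audio chunk in milliseconds."""
    return chunk_bytes / BYTES_PER_SAMPLE / SAMPLE_RATE_HZ * 1000


def lead(chunk_bytes):
    """How early the animation can be sent, in ms and in metres of sound travel."""
    lead_ms = chunk_ms(chunk_bytes) - PLOSIVE_MS - PROCESSING_MS
    return lead_ms, lead_ms / MS_PER_METRE


for size in (4096, 3072):
    ms, metres = lead(size)
    print(size, round(chunk_ms(size), 1), round(ms, 1), round(metres, 1))
```

Under these assumptions the 4kb chunk gives roughly 78ms (about 27 metres) and the 3kb chunk roughly 46ms (about 16 metres), matching the figures quoted.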

90% of the quality factor in this will be initial plosive detection (regardless of actual consonant) and this is just sample volume related. Only a general plosive mouth animation needs to be played if the actual consonant was missed. So initial mouth movement will start at the right time, look 75%+ right, and end at the right time.

Current phoneme animations in games are either pre-calculated to pre-recorded audio, or random mouth movements.

Saying "tree":

infurl · « **Reply #43 on:** August 09, 2022, 12:25:30 pm »

Is this something that you could run on a single board computer like a Raspberry Pi? A lot of people would have a use for it in that situation.

MikeB · « **Reply #44 on:** August 09, 2022, 01:58:02 pm »

Yes it's all single thread. It should run on an ARM Cortex M4F as well. Just need a MEMS microphone.

The FFT is the most complex thing, but it's low res and there are FFTs optimised for the Cortex M series & Raspberry Pi...
