Pattern based NLP

  • 38 Replies
  • 92413 Views

MikeB

  • Mechanical Turk
  • *****
  • 156
Re: Pattern based NLP
« Reply #30 on: December 10, 2021, 11:00:14 am »
Update.

I changed the Resonator/Peak filter's bandwidth to 250 Hz (Q = 3) to better detect some voice formants, but in doing that the SNR dropped to 15% (compared to 40-100% before), so now it needs noise reduction.

Most of the voice formants that don't move in pitch (and are spaced out) are easy to detect, but not the more complicated ones.

Voice formants for the long vowel "a-":
"a-". F1: ~500hz (+/-200), F2: 1500 transition to 2250hz, F3: ~2750hz (+/- 100), F4: ~3500hz (+/- 100)

F1 and F3 can be locked in for this phoneme (F4 is just air), but F2 needs to be made clearer before it can be drawn as a vector.
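For illustration, a resonator stage of this kind can be a second-order band-pass biquad (RBJ cookbook form). This is only a sketch; the sample rate and centre frequency below are illustrative values, not the project's actual settings.

[code]
// Sketch only: one band-pass biquad (RBJ cookbook form) used as a resonator/peak filter.
#include <cmath>
#include <vector>

struct Biquad {
    double b0, b1, b2, a1, a2;           // normalised coefficients (a0 == 1)
    double x1 = 0, x2 = 0, y1 = 0, y2 = 0;

    // Band-pass resonator: centre frequency f0, quality factor Q (bandwidth ~= f0 / Q).
    static Biquad resonator(double f0, double Q, double fs) {
        constexpr double kPi = 3.14159265358979323846;
        const double w0 = 2.0 * kPi * f0 / fs;
        const double alpha = std::sin(w0) / (2.0 * Q);
        const double a0 = 1.0 + alpha;
        return { alpha / a0, 0.0, -alpha / a0,
                 -2.0 * std::cos(w0) / a0, (1.0 - alpha) / a0 };
    }

    double process(double x) {
        const double y = b0 * x + b1 * x1 + b2 * x2 - a1 * y1 - a2 * y2;
        x2 = x1; x1 = x;
        y2 = y1; y1 = y;
        return y;
    }
};

// Example: a Q = 3 resonator near the F1 of "a-" (~500 Hz), 16 kHz sample rate assumed.
double formantEnergy(const std::vector<double>& samples) {
    Biquad f1 = Biquad::resonator(500.0, 3.0, 16000.0);
    double energy = 0.0;
    for (double s : samples) {
        const double y = f1.process(s);
        energy += y * y;
    }
    return energy;
}
[/code]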


MikeB

  • Mechanical Turk
  • *****
  • 156
Re: Pattern based NLP
« Reply #31 on: December 17, 2021, 09:54:28 am »
Update.

Remade all the Resonator/Peak filters in a program called EasyFilter, which uses 64-bit doubles for the calculations instead of cheap 16-bit sums, and it's a lot better.

Added a low-pass filter before the Res/Peak filtering. I have access to the data buffer, so I was able to use a zero-phase IIR filter (running it forwards and backwards over the data)...
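Running the filter forwards and then backwards over a buffer you already hold is the filtfilt trick. A minimal sketch (edge padding omitted, filter type left generic):

[code]
// Zero-phase filtering sketch: forward pass, reverse, second pass, reverse back.
#include <algorithm>
#include <vector>

// FilterT needs: a copy constructor (to reset state) and double process(double).
template <typename FilterT>
std::vector<double> zeroPhase(std::vector<double> data, const FilterT& proto) {
    auto runOnce = [&](std::vector<double>& buf) {
        FilterT f = proto;                       // fresh filter state each pass
        for (double& s : buf) s = f.process(s);
    };
    runOnce(data);                               // forward pass (adds phase lag)
    std::reverse(data.begin(), data.end());
    runOnce(data);                               // backward pass cancels the lag
    std::reverse(data.begin(), data.end());
    return data;
}
[/code]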

Now the noise/non-signal is very low and consistent, and the voice/signal has good, consistent SNR.

Vowels can't be understood very well this way, but it does give a good starting point for match filtering against a pre-recorded artificial/"perfect" vowel, i.e. skipping manual cross-correlation of all 44 phonemes, as you can reduce the number of possibilities first.

Praat allows making and saving custom sounds/vowels as 16-bit sample points, so at this point (after seeing that a particular frequency is busy) I can straight away reduce the number of possible phonemes, then test them. As vowel phonemes are ~200 ms of 1-3 frequency tones that gradually go up/down/stay level, being out by 10% may not make much difference... So I will be testing this next week.
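The match-filtering step described above can be as simple as a normalised cross-correlation of the live frame against the stored "perfect" vowel. A sketch only; the function name and usage are mine, not the project's code:

[code]
// Matched-filter sketch: best normalised cross-correlation over all alignments.
#include <cmath>
#include <cstddef>
#include <vector>

double normalisedCrossCorrelation(const std::vector<double>& frame,
                                  const std::vector<double>& templ) {
    double best = 0.0;
    if (templ.empty() || frame.size() < templ.size()) return best;
    for (std::size_t lag = 0; lag + templ.size() <= frame.size(); ++lag) {
        double dot = 0.0, frameEnergy = 0.0, templEnergy = 0.0;
        for (std::size_t i = 0; i < templ.size(); ++i) {
            dot         += frame[lag + i] * templ[i];
            frameEnergy += frame[lag + i] * frame[lag + i];
            templEnergy += templ[i] * templ[i];
        }
        if (frameEnergy > 0.0 && templEnergy > 0.0)
            best = std::max(best, dot / std::sqrt(frameEnergy * templEnergy));
    }
    return best;   // close to 1.0 means the frame closely matches the template vowel
}
[/code]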


MikeB

  • Mechanical Turk
  • *****
  • 156
Re: Pattern based NLP
« Reply #32 on: January 05, 2022, 08:28:57 am »
Update.

Changed the main LP filter to a 2750 Hz Chebyshev zero-phase filter, and changed all sound data to 32-bit floats during processing.

Wave formants 1-4 (up to 2750 Hz) now show 200-300% SNR over background noise (spoken close to the mic).

Tone analysis (sine wave matching) didn't work over 100 ms, as it only gives ~3.5 Hz of bandwidth before the tones are 75% or more out of phase and indistinguishable from noise... Tone analysis using a short window won't work very well either, as the vowels "a-" and "o-" (long 'a' and long 'o') transition from one frequency to another over 100-1000 ms, and need to be within +/-10% anyway.
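As a rough sanity check on that bandwidth figure (my own arithmetic, not a measurement): two tones Δf apart drift by 360·Δf·T degrees over a window of length T, so over 100 ms a 3.5 Hz mismatch is already ~126°, roughly in line with the "75% out of phase" point above.

[code]
// Rough sanity check (assumed arithmetic): phase drift between two tones
// deltaHz apart over a window of windowSec seconds.
#include <cstdio>

int main() {
    const double windowSec = 0.100;                        // 100 ms window
    const double deltaHz   = 3.5;                          // frequency mismatch
    const double driftDeg  = 360.0 * deltaHz * windowSec;  // ~126 degrees
    std::printf("phase drift: %.0f degrees (%.0f%% of a half cycle)\n",
                driftDeg, 100.0 * driftDeg / 180.0);
    return 0;
}
[/code]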

So I'll be looking at more traditional frequency analysis like FFT, DCT, DTT. With the fast pre-calculated wave formants, many of the phonemes can be counted in/out before doing deeper signal analysis anyway, so it should be faster.


MikeB

  • Mechanical Turk
  • *****
  • 156
Re: Pattern based NLP
« Reply #33 on: February 07, 2022, 11:57:13 am »
Deleted all IIR filters as they seemed to be adding noise. Now only using pre-calculated FIR filters, which also have a built-in Hamming Window for the FFT.

There are 3 FIR filters (low-passes) at 1100, 2200, and 4200 Hz, with the sample data split into 0-250 Hz, 250-1000 Hz, 1000-1500 Hz, and 1500-3000 Hz. This reduces noise as much as possible, as the processing is mostly about how busy the wave data is.

"Busyness" is calculated for the 250-1000hz range. (Sample data difference / RMS power). If busy then the FFT is processed.

The FFT runs on each of the four frequency ranges separately - if the four ranges are merged with only one FIR filter, the output is too noisy for the lower ranges.

After this, the FFT output data is smoothed with the top frequencies averaged together, but I will be changing this to a custom Mel-filter-bank-style setup (not the standard algorithm; 12 fixed ranges chosen to suit phoneme ranges).
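The "fixed ranges" idea amounts to summing FFT magnitudes into hand-picked bands rather than a standard Mel filter bank. A sketch; the band edges and values here are illustrative, not the real ones:

[code]
// Fixed-band grouping sketch: sum FFT magnitude bins into hand-picked bands.
#include <cstddef>
#include <vector>

std::vector<double> bandEnergies(const std::vector<double>& magnitudes, // FFT magnitude bins
                                 double binHz,                          // Hz per FFT bin
                                 const std::vector<double>& edgesHz)    // ascending band edges
{
    if (edgesHz.size() < 2) return {};
    std::vector<double> bands(edgesHz.size() - 1, 0.0);
    for (std::size_t bin = 0; bin < magnitudes.size(); ++bin) {
        const double f = bin * binHz;
        for (std::size_t b = 0; b + 1 < edgesHz.size(); ++b)
            if (f >= edgesHz[b] && f < edgesHz[b + 1])
                bands[b] += magnitudes[bin];
    }
    return bands;
}

// Usage sketch: bandEnergies(fftMagnitudes, 8000.0 / 1024, { 250, 350, 450, /* ... */ 3000 });
[/code]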

There's still a lot to change, but I thought I would post this as it's completely different, and it still only takes ~1.5 ms to process (from a 102.4 ms sample buffer). The NLP only takes 0.1-0.2 ms on top, so it is ~98.5% idling in a thread::yield().
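For illustration, the shape of that loop; this is a sketch with the capture/processing calls as stand-ins, and it assumes the yield is std::this_thread::yield():

[code]
// Duty-cycle sketch: process a full buffer when it arrives, otherwise yield.
#include <functional>
#include <thread>
#include <vector>

// fill should return true once a full (e.g. 102.4 ms) buffer is available.
void audioLoop(const std::function<bool(std::vector<float>&)>& fill,
               const std::function<void(const std::vector<float>&)>& process,
               const bool& running) {
    std::vector<float> buffer;
    while (running) {
        if (fill(buffer)) {
            process(buffer);            // ~1.5 ms filtering/FFT plus ~0.2 ms NLP
        } else {
            std::this_thread::yield();  // the vast majority of the time is spent here
        }
    }
}
[/code]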

Normal speech recognition takes 0.5-0.7 seconds per sample. The problem seems to be in the autoregression/prediction modelling after the FFT and MFCC/Mel filter algorithms. The precision is there, but the speed is too slow...

In the picture, the "a-" and "e-" long vowels are detected.


MikeB

  • Mechanical Turk
  • *****
  • 156
Re: Pattern based NLP
« Reply #34 on: March 09, 2022, 10:08:07 am »
Rewrote everything to better frame the data around noise at the beginning and end.

Using RMS-derived 'busyness' and the volume peak to auto-adjust the noise floor - you can gradually raise your voice and it will only output "*noise*", but when you suddenly speak from quiet it will process voice.
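A minimal sketch of that auto-adjusting floor; the thresholds and smoothing constant below are placeholders, not the project's values:

[code]
// Adaptive noise-floor sketch: a slowly tracking floor absorbs gradual rises,
// while a sudden jump well above the floor is treated as voice.
#include <algorithm>
#include <cmath>
#include <vector>

struct NoiseGate {
    double floor = 0.0;                       // adaptive noise-floor estimate

    // Returns true when this buffer should be processed as voice.
    bool isVoice(const std::vector<double>& buf) {
        if (buf.empty()) return false;
        double power = 0.0, peak = 0.0;
        for (double s : buf) {
            power += s * s;
            peak = std::max(peak, std::fabs(s));
        }
        const double rms = std::sqrt(power / buf.size());
        const bool voice = rms > 3.0 * floor && peak > 6.0 * floor;  // sudden jump
        // Slow tracking: a gradual rise drags the floor up with it, so it
        // never clears the "sudden jump" test above and stays "*noise*".
        floor = 0.95 * floor + 0.05 * rms;
        return voice;
    }
};
[/code]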

The main differences now are in framing, as this and pitch detection are the most important.

So, by not using autoregression/prediction, inverse FFT, etc. (the main time consumers), all I have is basically one FFT output. This makes it difficult to detect any short-lived sound, which is most consonants, as FFTs need many samples for reliability.

So to detect consonants and vowels together, I'm using framing, resolution, and sound transition. Frame lengths are 16 ms, 32 ms, 64 ms, 64 ms (for 8 kHz: 128, 256, 512, 512 samples), fixed length, starting from when the voice starts. Consonants are a one- or two-part short burst (the 16 ms and 32 ms frames). Vowels are two-part (the 64 ms and 64 ms frames). The theory is to use a ~5-group resolution in Hz for consonants, and 12 groups for vowels; the 12 vowel groups are based on the general vowel speaking range. Consonants aren't always at the beginning of something you say, but most are involved with plosives, so they will always have silence before them, creating a new speech frame with the consonant at the beginning.
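A sketch of that frame schedule; the struct and names are illustrative, only the lengths and roles come from the description above:

[code]
// Fixed frame schedule from voice onset: two short consonant frames, then two
// longer vowel frames.
#include <cstddef>
#include <vector>

struct Frame {
    std::size_t begin, length;          // offsets/lengths in samples from voice onset
    bool consonant;                     // short frames -> consonant analysis
};

std::vector<Frame> buildFrameSchedule(std::size_t sampleRate /* e.g. 8000 */) {
    const double lengthsMs[] = { 16.0, 32.0, 64.0, 64.0 };
    const bool consonant[]   = { true, true, false, false };
    std::vector<Frame> frames;
    std::size_t offset = 0;
    for (int i = 0; i < 4; ++i) {
        const std::size_t len =
            static_cast<std::size_t>(lengthsMs[i] * sampleRate / 1000.0);
        frames.push_back({ offset, len, consonant[i] });
        offset += len;
    }
    return frames;                      // at 8 kHz: 128, 256, 512, 512 samples
}
[/code]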

Another problem was a low-Hz bias in the FFT output (linearly stronger at lower frequencies), so levelling that out made a big difference.

Getting good results from vowels, but framing, FFT quality, and post processing still need work.



MikeB

  • Mechanical Turk
  • *****
  • 156
Re: Pattern based NLP
« Reply #35 on: April 08, 2022, 01:44:25 pm »
This month was mostly theory and expanding on Framing, Resolution, and sound Transitions.

An FIR Hilbert filter (at 2500 Hz) was added for finding the initial framing/noise-floor level; this replaced the entire 'busyness' code, as the Hilbert filter removes bass, top end, and 2500 Hz breath noise in one go. The busyness calculation relied on those components to work and now shows no difference, so volume peaks are relied on instead.
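For reference, a standard windowed FIR Hilbert transformer looks like the sketch below. How it is applied "at 2500 Hz" isn't spelled out here, so only the usual coefficient generation is shown, with a Hamming window:

[code]
// Windowed FIR Hilbert transformer sketch (type III, odd tap count).
#include <cmath>
#include <vector>

std::vector<double> hilbertFir(int taps /* odd, e.g. 101 */) {
    constexpr double kPi = 3.14159265358979323846;
    std::vector<double> h(taps, 0.0);
    const int mid = taps / 2;
    for (int n = 0; n < taps; ++n) {
        const int k = n - mid;
        if (k % 2 != 0)                       // ideal response: 2/(pi*k) on odd taps, 0 elsewhere
            h[n] = 2.0 / (kPi * k);
        h[n] *= 0.54 - 0.46 * std::cos(2.0 * kPi * n / (taps - 1));   // Hamming window
    }
    return h;
}
[/code]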

Framing is now expanded to cover both short and long voice transitions. At idle/noise, a standard 1408 samples are captured (~0.7% duty cycle; ~1 ms of work per 140.8 ms frame at 10 kHz). During work, up to 10 frames or 3456 samples are captured (~25% duty cycle, including waiting for samples).

Frames are: G1: 12.8, 25.6, 25.6, 25.6, 25.6, 25.6ms. G2: 51.2, 51.2, 51.2, 51.2ms.

The FFT still works well at 256 samples.

Mel-bank-style filter groups are now 13 for vowels: 250, 350, 450, 550, 650, 750, 1000, 1333, 1666, 2000, 2500, 3500, 5000 Hz.

The low-Hz removal filter was upgraded to reflect HRTF/true ear-resonance values, as our ears naturally remove bass and boost 2-5 kHz, but mics don't.

Consonant & Vowel detection model is still: Silence-ConsonantGroup-VowelGroup-Silence ... or Silence-VowelGroup-Silence.
This meant adding 15 new consonants as combinations, and a lot of vowel groups. Vowel groups are a WIP as there's potentially hundreds. The longest and most difficult connected vowels without silence are words/phrases like "who are you", "are you 'aight"/"are you alright", "oh yeah", "you wait"/"you ate". They contain consonants but if they're not pronounced with spaces then the only way to detect them is as one long vowel with multiple sound transitions.

So if frequency transitions can be identified clearly, patterns of vowels become much easier to detect...
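A minimal sketch of that Silence-Consonant-Vowel-Silence sequencing; classifying the individual frames is assumed to happen elsewhere, and the names here are illustrative:

[code]
// Burst-model sketch: a speech burst between two silences is read as either
// Consonant+Vowel or Vowel-only, depending on how it starts.
#include <cstddef>
#include <vector>

enum class FrameKind { Silence, Consonant, Vowel };
enum class BurstType { None, ConsonantVowel, VowelOnly };

BurstType classifyBurst(const std::vector<FrameKind>& frames) {
    std::size_t i = 0;
    while (i < frames.size() && frames[i] == FrameKind::Silence) ++i;   // leading silence
    if (i == frames.size()) return BurstType::None;
    const bool startsWithConsonant = (frames[i] == FrameKind::Consonant);
    while (i < frames.size() && frames[i] != FrameKind::Silence) ++i;   // consume the burst
    return startsWithConsonant ? BurstType::ConsonantVowel : BurstType::VowelOnly;
}
[/code]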

Overall it's a nightmare, but I'm close to word output now. If I can detect "k" regularly I can output "cake", fast, and be reliably different to "tree".

« Last Edit: April 11, 2022, 07:40:17 am by MikeB »


MagnusWootton

  • Starship Trooper
  • *******
  • 499
Re: Pattern based NLP
« Reply #36 on: April 08, 2022, 04:17:09 pm »
Nice one, but could it trick someone into thinking the bot is a person? Because I need to get out of some annoying interviews I'm currently having.   :2funny:


MikeB

  • Mechanical Turk
  • *****
  • 156
Re: Pattern based NLP
« Reply #37 on: April 09, 2022, 07:00:24 am »
Response speed, including being able to choose anything from an instant to a well-timed response, adds to affinity/immersion, so that increases believability.

E.g. an instant response in tense circumstances, or a 1-2 second delay in casual circumstances.

It's more intended for real-time applications than a "cover all" approach.


MikeB

  • Mechanical Turk
  • *****
  • 156
Re: Pattern based NLP
« Reply #38 on: June 03, 2022, 11:43:33 am »
Last month I intended to output "cake" and "tree", but it was still too inconsistent.

This month is half-way there.

Framing and resolution were updated to support 16 kHz instead of 10 kHz, to better detect consonants at 4000-5500 Hz.

Sample sizes are doubled but there are fewer frames, so the speed is roughly the same.

Sample framing is: (G1) 16, 32, 32, 32, 32ms. (G2) 64, 64, 64ms.

Resolution groups have been updated slightly: 15 instead of 13, now spanning 400-5500 Hz.

No changes to the vowel transition model, but it now also outputs which consonants/vowels are detected.

Consonant detection is mostly fine with a clear voice. Only "k" tested.

Vowel detection is inconsistent. As vowels rely on multiple frames, changing to 16 kHz as well as reducing the frame count didn't help, so I will be testing 12 kHz in the future with shorter/more frames.

Saying "cake" (below) with phoneme output.