Ai Dreams Forum

Member's Experiments & Projects => General Project Discussion => Topic started by: MikeB on May 24, 2020, 12:16:50 pm

Title: Pattern based NLP & ASR
Post by: MikeB on May 24, 2020, 12:16:50 pm
This is a project I've been working on for a few years. In 2019, I tested it on Pandora Bots. This year I'm converting it to C/C++.

The main goal is to be small, fast, and white-box, rather than relying on heavy algorithms, knowledge bases, or self-learning.

Broadly it works as a word and sentence compressor.

Words are matched to a pre-defined list of words in a lookup table. Matches return one 8-bit character symbol for sentence matching, and two other 8-bit symbols for context and uniqueness. [March 2023] There are 49 total word groups for the first symbol, and up to 256 for the second and third individually.

Sentences are made of 4-10 one-character symbols, where each symbol is one of 49 options, each covering hundreds to thousands of words. This means each sentence pattern can detect hundreds of millions of sentences with similar meaning. These patterns are grouped and stored in a pre-defined list which compresses the sentence to an intention symbol.

For a chatbot, the developer can combine the one-character intentions with one-character word-context symbols in pre-defined lists to cover practically all spoken interactions, with a good level of attentiveness to the original sentence and a deliberate, white-box response. Output can be varied further with another lookup table that randomises the text/audio response. A good number of literal responses to cover a broad range of sentence intentions is 50-100, which makes changing the chatbot's personality very easy.
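As a rough illustration of the word-and-sentence compression described above, here is a minimal C++ sketch. The dictionary, symbol values, and class letters are all hypothetical; the real tables hold thousands of words across 49 first-symbol groups.

```cpp
#include <cassert>
#include <cstdint>
#include <map>
#include <sstream>
#include <string>

// Hypothetical sketch of the word compression described above: each known
// word maps to three 8-bit symbols (sentence-class, context, uniqueness).
// Words not in the table are dropped -- which is also why private
// information is non-recoverable.
struct WordSymbols { uint8_t cls, ctx, uniq; };

// Tiny illustrative dictionary; all entries and class letters are invented.
static const std::map<std::string, WordSymbols> kDict = {
    {"do",      {'Q', 1, 1}},   // question-class
    {"you",     {'P', 2, 1}},   // person-class
    {"speak",   {'V', 3, 1}},   // action-class
    {"english", {'N', 4, 1}},   // noun-class
};

// Compress a sentence to its first-symbol string, ready to be matched
// against a pre-defined list of 4-10 symbol "intention" patterns.
std::string compress(const std::string& sentence) {
    std::istringstream in(sentence);
    std::string word, out;
    while (in >> word) {
        auto it = kDict.find(word);
        if (it != kDict.end()) out.push_back(static_cast<char>(it->second.cls));
    }
    return out;
}
```

With this sketch, "do you speak english" and "do you secretly speak english" both compress to "QPVN", so a single pattern entry picks up every sentence whose words fall in the same groups.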

Size and Response times:

In 2019, on Pandorabots (1000 words, 200 sentences): size ~500kb, ~1 second per response.
In 2020, in C on an Arm Cortex-M4 @ 120MHz: total size ~159kb, 15-100ms per sentence.
In 2021, in C on a PC @ 2.6GHz with binary searching (set-up time of 70ms): ~1ms per sentence.

Other chatbot features:

Total size is < 500kb including word databases.
One sentence generally takes less than 1 ms to process.
Limited chatbot responses make it easy to record an actor's voice and change personality.
Private information is stripped during word compression (words that aren't in the pre-defined list are lost and non-recoverable).

Fine differentiation of intentions, e.g. between Wondering, Questions, and Directions: "can you speak english", "do you speak english", "speak english".

Can count occurrences of emotional words, logical words, burning-analyser words, and light-sense words to better reply in kind.

For chatbots in experiences/games and/or CPU-restricted platforms, it solves:
Too much data or processing power required.
Cannot change the personality / no personality.
Cannot change the language / only one language.
Chatbot escaping the topic due to bad intention reading.
Chatbot returning bad views / knowledge calculations (only pre-determined responses).
Terrible voice synthesis (a voice actor can record all lines, including random alternates).
Title: Re: Pattern based NLP for games
Post by: MikeB on May 24, 2020, 12:48:43 pm
I will be working on it throughout this year and aim to make it into an API at least by the end of the year...

So far I have tested the Plain C version on the 2017/2018 Loebner prize questions...

All the words in each question can be changed for slightly different words and still return the same response (randomised differently).. If the same test is run again the responses could change to their random alternate versions...

Quote
Loebner 2018.
   Good afternoon.   - "greetings"
   Do you consent to having this conversation recorded? - "for me, not a problem"
   Have we met before? - "cannot say, good to meet you"
   Do you know how to make toast? - "no, not sure about this" [toast]
   Which languages can you use? - "many many, english, espanol, italiano... what about you?" [only english]
   How many letters are in the word "abracadabra"? - "who am I? Pietro Bembo?"
   Where should one look for love? - "interesting! i don't know" [doesn't know "love"]
   How are you feeling right now? - "I'm very good, and how are you?"
   Who said "I have a dream"? - "no no quotes"
   How old are you? - "thousands... cannot say"
   When might I need to know how many times a wheel has rotated? - "who am I? Fibonacci?"
   What will you do later today? - "just came back from surfing, may go out again!"
   What is the third angle in a triangle with internal angles of 90 degrees and 30 degrees? - "no no maths"
   Who is your favourite artist? - "interesting! many many..."
   What do you hold when you shake hands? - "words are not my profession" [hands]
   Do you have any legs? - "no, words elude me" [legs]
   Can you recommend me a film? - "yes, thousands..." [doesn't know "film"]
   Do you understand Winograd Schemas? - "nay, confusing words" [doesn't know winograd schemas]
   If a chicken roosts with a fox they may be eaten. What may be eaten? - "OK. from memory... somebody..."
   I had to go to the toilet during the film because it was too long. What was too long? - "alright. alright. from memory... that thing..."
Title: Re: Pattern based NLP for games
Post by: ivan.moony on May 24, 2020, 09:01:48 pm
Sounds like a great improvement over current chatbot technology like AIML. What do you plan to do with it?
Title: Re: Pattern based NLP for games
Post by: 8pla.net on May 25, 2020, 12:13:13 am
C Language is a good choice, I think.
Title: Re: Pattern based NLP for games
Post by: MikeB on August 07, 2020, 06:49:29 am
Sounds like a great improvement over current chatbot technology like AIML. What do you plan to do with it?

I'll be trying to integrate it as an Unreal Asset and/or approach a few different people who already do chat interfaces... In some ways it's better than AIML (you don't have to choose between a menu reply system or 10,000 custom responses)... but in other ways it's not very flexible. You have the ~100 fixed phrases, but they must be an alternative of one of the preprogrammed ones... and there's a section for custom responses, but the input is one of the fixed intentions/topics/perspectives and the output is one of the fixed ~100 phrases.

So you couldn't talk specifically about a product or idea. You'd use a secondary bot that has a list of all the keywords you're looking for, then you could join the intention with those.
Title: Re: Pattern based NLP for games
Post by: MikeB on August 07, 2020, 06:56:27 am
C Language is a good choice, I think.

It compiled tiny in C, but I had to move to C++ now to make a windows DLL and get 16-bit wide chars. 400kb  :(
Title: Re: Pattern based NLP for games
Post by: squarebear on August 07, 2020, 08:35:57 am
The size and speed in pandora bots (1000 individual words with sentences) is ~500kb, and 1 to 2 seconds response time.
I've not found such a delay. I have a bot with over 350,000 categories and it responds almost instantly. www.kuki.bot
Perhaps you are using AIML in a non standard way?
Title: Re: Pattern based NLP for games
Post by: 8pla.net on August 07, 2020, 01:14:40 pm
C Language is a good choice, I think.

It compiled tiny in C, but I had to move to C++ now to make a windows DLL and get 16-bit wide chars. 400kb  :(

Do both then,  C Language and C++...  You may as well.  They are compatible.

And, I would suggest making a Linux version, too, like ChatScript has.


Title: Re: Pattern based NLP for games
Post by: MikeB on September 15, 2020, 09:22:31 am
The size and speed in pandora bots (1000 individual words with sentences) is ~500kb, and 1 to 2 seconds response time.
I've not found such a delay. I have a bot with over 350,000 categories and it responds almost instantly. www.kuki.bot
Perhaps you are using AIML in a non standard way?

I used about 2000 categories, but it re-searches several times. So 10 words can be 2000 x 5 x 10. If it's only 5 words or less it's instant....
Title: Re: Pattern based NLP for games
Post by: MikeB on September 15, 2020, 10:13:19 am
Recompiled as a C++ DLL and a C++ Windows console app (8-bit standard English characters). 250kb

Approx 500 spellcheck words, 1200 words, 100 symbolic sentences, 50 chatbot-recognised intentions, and 50 fixed English chatbot phrases.

1ms response time.

In the image below, the chatbot response is wrong (picking up general "how is your *" instead of "how are you"), but this is what it's like as a demo.

"explain is I/you motion-moving logic-direct" are the uncompressed symbols. One per word...

It's still basically an I-Don't-Know Bot, but the instant intention pickup is useful. You can still talk ON the topic/intention... and the ~50 fixed output phrases means it can all be voice recorded...

(https://i.ibb.co/hZ8HRN8/debug.jpg)
Title: Re: Pattern based NLP
Post by: MikeB on September 25, 2020, 09:29:25 am
Updated the word searching in the Misspelled Words and Tokenise Word lists to a faster method.

The old way scanned through every character in the input sentence for each of the words in the 500-1300-word lists.

The new way is basically how people do it:
First: Look at the start character.
Second: Look at the length of the word.
Third: Look at the last character.
Fourth: Is it only one character long?
Fifth: Check every character from 2nd to the last.

You break out of (or continue;) the loop if any one of those fails. On average it's something like a 1-in-26 shot for the first, 1 in 5 for the second, 1 in 5 for the third, and 1 in 5 for the fourth...

Seemed to double or triple the speed. A 20 word sentence (2-3ms) now takes 0-1 ms.
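The cheap-checks-first comparison above can be sketched as follows (the function layout and names are assumed; the point is that each early return skips the expensive middle comparison):

```cpp
#include <cassert>
#include <cstring>

// Sketch of the cheap-checks-first word match described above: reject on
// the first character, then length, then last character, before comparing
// the middle characters at all.
bool fast_match(const char* input, size_t in_len,
                const char* entry, size_t en_len) {
    if (input[0] != entry[0]) return false;    // 1) start character
    if (in_len != en_len)     return false;    // 2) word length
    if (input[in_len - 1] != entry[en_len - 1])
        return false;                          // 3) last character
    if (in_len == 1)          return true;     // 4) one-character word
    // 5) everything from the 2nd character to the second-to-last
    return std::memcmp(input + 1, entry + 1, in_len - 2) == 0;
}
```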

Can't have spaces in the words though so will have to make a short "Catchphrase" word list.
Title: Re: Pattern based NLP
Post by: MikeB on October 12, 2020, 09:16:23 am
Decided to make it into a full NLP including Thesaurus, Sentiment (like/dislike), Email Spam, Aggressive language detection as well as the Chatbot.

Here's the Thesaurus. Everything takes 0-1ms.

It's fast because the words are already categorised in groups with each other, so it's just a reverse look-up. However it does still need some topic searching because some groups have over 50 words.
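The reverse look-up can be sketched in a few lines (group names and members here are invented; the real groups can hold 50+ words, hence the extra topic search):

```cpp
#include <cassert>
#include <map>
#include <string>
#include <vector>

// Sketch of the reverse look-up described above: because every word already
// lives in a group with its synonyms, the thesaurus answer is simply the
// rest of the word's group.
static const std::map<std::string, std::vector<std::string>> kGroups = {
    {"motion", {"run", "walk", "move", "travel"}},
    {"speech", {"say", "speak", "talk", "tell"}},
};

std::vector<std::string> synonyms(const std::string& word) {
    for (const auto& [name, members] : kGroups)
        for (const auto& m : members)
            if (m == word) return members;   // the whole group is the answer
    return {};
}
```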

(https://i.ibb.co/W6RX1Gg/debug2.jpg)
Title: Re: Pattern based NLP
Post by: MikeB on October 26, 2020, 05:50:21 am
Here's an example of the Spam detection and differentiation.

The differentiation is between the phrases:
"Do not miss out"
"Do not miss out on great fun"
"Do not miss out on great offers"

The Thesaurus also shows all the alternative words (max 8) that could have been used to output the same thing.

The Chatbot's responses are:
"For what purpose?"
"Ok. Not a problem."
"Gah, no selling. I'm not buying."

Still shifting the words around into different categories. There are now 1500 words (+300).

The word searching has also been changed, to "quick search" just the first letter (using as few instructions as possible in a tight loop, so it can move on to the next word fast) before searching the rest of the word, and a rebellious "goto" command is used to get to the next iteration faster.

Next: Chatbot (Alternate Language output), Language Translate, Tone/Harassment identification.

(https://i.ibb.co/Kzrdqdn/debug3-Spam-Thesaurus.jpg)
Title: Re: Pattern based NLP
Post by: MikeB on November 26, 2020, 08:28:32 am
Working on a new utility to handle entry into the Chatbot Decisions file (Handles input from the NLP as tokens I T T P, and outputs S S S speech tokens).

(https://i.ibb.co/TqQZVDx/debug-Utility.png)
Title: Re: Pattern based NLP
Post by: MikeB on December 14, 2020, 05:29:14 am
Added a Start Page/Test Page to the utility.

The NLP processing/debug itself isn't changeable in the utility (word symbolising), that's still left to the console app. The utility is for setting up Chatbots, and some separate Spam and Tone options not related to the chatbot.

Spam detection is symbol based not literal, so synonyms of the word "offers" are all detected together, not just single words. This is multi-language as well.

The Thesaurus is a simple reverse lookup on word and a secondary word-topic so there's no setup apart from how many words to return.

Tone detection is an output of approx 10 levels, from light patronising/grooming/objectifying up to "i hate everything, all x's are x". Tested in an early alpha version, but not implemented in the NLP and utility yet.

(https://i.ibb.co/GtqQqDh/debug-Utility2.png)
Title: Re: Pattern based NLP
Post by: MikeB on January 06, 2021, 08:02:43 am
Recently added both:
-Tone (9 levels - 3 negative, 3 positive, 3 grooming behaviour/patronising)
-WiC Challenge test (Words in Context - https://pilehvar.github.io/wic/)

The WiC test is one of the few NLP tests that can actually be done on this pattern based NLP, as it's not specifically prediction or knowledge based.

The WiC test (training data & results) is ~5500 lines. It completes in only 2 seconds (1980-2000ms). However, many of the lines rely on deep knowledge or some other non-literal meaning to trick everyone, including people, so it'll also trick this NLP... The human score is only 80%. Most NLPs get 60-75%.

Most of the NLP set up is complete, so this year I'll be adding words & sentences in order to get through this test... 

O0

(https://i.ibb.co/txrxM0Q/debug-Utility3.png)
Title: Re: Pattern based NLP
Post by: ivan.moony on January 06, 2021, 09:42:45 am
Hi MikeB :)

May I ask, how do you derive answers to the tests?
Title: Re: Pattern based NLP
Post by: MikeB on January 07, 2021, 03:35:29 pm
Hi MikeB :)

May I ask, how do you derive answers to the tests?

Hi Ivan, I ignore the selected word that the test says to match, altogether, and just look to see if the underlying intention is the same.

In the line "He wore a jock strap with a metal cup. Bees filled the waxen cups with honey."... the word "cup" means the same. A traditional NLP would see if "metal cup" and "waxen cup" means the same based on knowledge linking, but in the pattern matching NLP I just look to see if the basic underlying intention is the same. So both of these sentences would come under "Person describing" with sub tags "clothing, material,..." and some others. If one sentence was a catchphrase or greatly different then it would return not a match.

Another example... "I try to avoid the company of gamblers. We avoided the ball."...the word "avoid" means the same. Both have the intention "Person explain", so this would return true.

It should get at least 60% doing it this way. There is a way to add catchphrases to get a few more, and some other things I can do with tags. Trying to keep real knowledge linking and deducing as far away as possible...
Title: Re: Pattern based NLP
Post by: MikeB on January 14, 2021, 02:53:06 pm
Just comparing the two Intentions isn't working out too well. Going to start a specialised way of doing it (still without knowledge) by looking at the words before & after the selected word.
Title: Re: Pattern based NLP
Post by: MikeB on February 01, 2021, 07:27:24 am
Restructured the WiC / Word in Context test to look at the words before & after the indicated word, similar to how we do it.

A brief overview...

1) Both sentences are formatted (look for odd symbols, double spaces, spelling, words spaced out like "h e l l o", extended laughing "hehehehe...").
2) Pattern-match each word to a predefined symbol from a list (only ~20 different symbols total, out of ~2300 english words. No stemming.).
3) Analyse WiC:
 a) Input: Both sentences, the 'lookup word', and both locations of the word.
 b) WiC function: Check the 'lookup word' (now a symbol shared with ~100 similar words) exists in the WiC / sentence compatibility table (~50-100 entries).
 c) WiC function: If at least one match, check all other words. Highest word count (3-5 words) is selected as a match. Remember compatibility ID. Now check second sentence for a match. Return match true/false.

This is much more detailed than just checking the intention, as it can pick up the same context even if one sentence is an "instruction" and the other is a "person describing". EG. "come/came" (1) "Come out of the closet" (2) "He came singing down the road".
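Steps (a)-(c) above might be sketched like this (the table contents, symbol letters, and scoring details are invented for illustration; only the shape of the check follows the post):

```cpp
#include <cassert>
#include <string>
#include <vector>

// Hypothetical sketch of the WiC check: each compatibility entry holds a
// lookup-word symbol plus a few context symbols; a sentence matches the
// entry with the highest context-word count.
struct WicEntry { char lookup; std::string context; int id; };

static const std::vector<WicEntry> kTable = {
    {'V', "PDM", 1},   // a verb symbol in one context
    {'V', "QTA", 2},   // the same verb symbol in a different context
};

// Returns the id of the best-matching entry for a sentence, or -1.
int best_entry(char lookup, const std::string& symbols) {
    int best = -1, best_count = 0;
    for (const auto& e : kTable) {
        if (e.lookup != lookup) continue;    // step (b): symbol must exist
        int count = 0;                       // step (c): count context hits
        for (char c : e.context)
            if (symbols.find(c) != std::string::npos) ++count;
        if (count > best_count) { best_count = count; best = e.id; }
    }
    return best;
}

// Two sentences "match" when they resolve to the same compatibility id.
bool wic_match(char lookup, const std::string& s1, const std::string& s2) {
    int a = best_entry(lookup, s1);
    return a != -1 && a == best_entry(lookup, s2);
}
```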

I got the time down from 2000-2600ms, to ~1400ms by removing most of the pre-formatting and only keeping 'Double Space' check as the test is already formatted...

Score is not worthy of publishing because I've only checked about 100 of the ~5500 records! A lot of sentences are reused though so shouldn't have to check all of them.

(https://i.ibb.co/6Xdr188/debug-Utility4.png)
Title: Re: Pattern based NLP
Post by: MikeB on March 01, 2021, 09:49:25 am
Still working on the WIC test.

Making progress of about 0.1% per day. (20-30% to go)

There's now 3700 words (+1400). 900 WIC pattern sentences (+800). Re-added spell-checking, so the full WIC test takes about 2.5 seconds to complete.

The scale and pickup is actually immense. Each of the 900 WIC pattern sentences has 3-6 "Symbolic Words". Each Symbolic Word represents 10-500 words. So each of the 900 WIC sentences actually picks up 500,000 - 20,000,000 variations.

Many times I add 10-20 WIC patterns (~100,000,000 word-sentence variations) and it only picks up one solitary record in the 5428 record WIC test... So the test is basic... but the word formatting is still broad enough that you can't just cheese the test.

Another problem is lack of words... I'm estimating I'll need at least 5000-7000 total to get a good result, and all these are hand entered in specific categories , so it's going to take some months...

One side effect is that I'm probably going to drop the old "Intention" categories I used to use for the chatbot and use these new WIC categories instead as it picks up an interesting variety. There are about 50 different groups (will be merging some) along the lines of:
"person or thing started to move / person or thing has him..."
"the object/concept of a had-thing"
"had the concept when..."
"a motion was taken / apply a rule / have-take the concept-chance to..."
"i play/avoid the / objects moved/ordered/fell to the
"logic-action an object"
"moving-action the object"
"an object of objects / vivid objects/objectives of"

So these will be better in chatbot programming.
Title: Re: Pattern based NLP
Post by: MikeB on March 05, 2021, 08:51:40 am
Sped up the processing thanks to Infurl's suggestion of adding Binary Searches.

Huge results.

Added to Spell Checking (800 words), and Word-token assignment (3700 words).

The original lists are unsorted, so they are hashed & sorted in program. (Hashed by ASCII adding.) There are typically 0-5 duplicate hash ids/collisions so the correct matches are checked letter-by-letter as well.

Processing 5428 lots of two sentences:
Before: 2600ms
After: 76ms of preparing. Hashing & sorting spelling and word list.
After: linear searching the hash lists: 1700ms (900ms faster)
After: binary searching the hash lists: 930ms (1670ms faster)

There are other processes, but for the spell/word search alone, Hashed/Linear seems to make it ~50% faster, and Hashed/Binary seems to make it ~90% faster.
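The hash-and-binary-search scheme above can be sketched like so (ASCII-sum hashing, sort once at start-up, binary-search the hash, then confirm collisions letter-by-letter; all names are illustrative):

```cpp
#include <algorithm>
#include <cassert>
#include <cstdint>
#include <string>
#include <vector>

// One table entry: the ASCII-sum hash plus the original word, kept so that
// duplicate hashes (collisions) can be resolved by a full string compare.
struct Entry { uint32_t hash; std::string word; };

uint32_t ascii_sum(const std::string& w) {
    uint32_t h = 0;
    for (unsigned char c : w) h += c;
    return h;
}

// Run-time preparation: hash every word, then sort by hash.
std::vector<Entry> build(const std::vector<std::string>& words) {
    std::vector<Entry> t;
    for (const auto& w : words) t.push_back({ascii_sum(w), w});
    std::sort(t.begin(), t.end(),
              [](const Entry& a, const Entry& b) { return a.hash < b.hash; });
    return t;
}

bool contains(const std::vector<Entry>& t, const std::string& w) {
    uint32_t h = ascii_sum(w);
    auto it = std::lower_bound(t.begin(), t.end(), h,
        [](const Entry& e, uint32_t v) { return e.hash < v; });
    for (; it != t.end() && it->hash == h; ++it)   // walk duplicate hashes
        if (it->word == w) return true;            // letter-by-letter confirm
    return false;
}
```

Note that "cat", "act", and "tac" all share the hash 312, which is exactly the 0-5 collision case the letter-by-letter confirmation handles.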
Title: Re: Pattern based NLP
Post by: ivan.moony on March 05, 2021, 11:12:46 am
Great speedup! O0

And the good thing is that, using binary search, growing the search set doesn't slow it down linearly, it slows down logarithmically (almost as good as constant time). The bigger the search set, the more you see the difference between linear search and binary search.
Title: Re: Pattern based NLP
Post by: infurl on March 06, 2021, 02:17:16 am
The original lists are unsorted, so they are hashed & sorted in program. (Hashed by ASCII adding.) There are typically 0-5 duplicate hash ids/collisions so the correct matches are checked letter-by-letter as well.
...
After: 76ms of preparing. Hashing & sorting spelling and word list.

Pro-tip #2. There is no reason that you would have to do the preparation such as hashing and sorting at run-time. You could break out the portion of the code that does that preparation into a separate program which you run at compile time. This program does all the necessary preparation and then prints out all the data structures in a format that can be included by your final program and compiled in place into its final form. That will save you a chunk of time every time you run the actual program.

In my case I am parsing and processing millions of grammar rules which can take a considerable amount of time just to prepare. Although small grammars can be processed from start to finish at run-time, I have found it much faster to compile the different files that make up the grammar into intermediate partially processed files; these files in turn get loaded and merged into a final grammar definition which is then saved in source files that can be compiled and linked directly into my parser software, as well as a database format which can be loaded as a binary file at run-time.

That last feature has lots of advantages. The preprocessed files were so large that it was taking a long time just to compile them, but the best thing is that by separating the data files from the software, I can choose completely different processing options on the command line.
Title: Re: Pattern based NLP
Post by: MikeB on March 15, 2021, 07:08:02 am
That might be an idea.. If it gets longer than 500ms to load then I might do that...  Only expecting about 5000-7000 words, but if I add more languages it could take a while.

I'm trying to keep the data slightly linked-in to the software so that it's harder to work out how it works, but it seems like it's just written in the DLL in plain English anyway... so I may end up separating them.
Title: Re: Pattern based NLP
Post by: MikeB on March 23, 2021, 08:38:35 am
I finally ran into a problem with the word-token grouping not being separated enough, so I'm redoing all the groups.

Originally I was using 6 Logical groups, 6 Emotive groups, 1 generic 'Possessive/having' group, and a bunch of others including Person (1st/2nd/3rd person). The past/present/future tense is included in the logical/emotive groups as well, and this leads to double-ups in the WIC entries to cover bad grammar ("I run away", "I running away", "I runned away", "I ran away")... I'd rather have them in the same group and use a past/present/future tag on the word to analyse later... The context of "one person running" is the same (it's not "running a fridge"/operating - and if it is, it's easy to pick up the extra words...).

The original theory is for 24 word groups, but 16 seems to be the best after I laid all the main keywords out. No prefix/suffix separation anymore...
4 Logical (concepts),
4 Emotive (everything that moves),
4 Burning (analytical/possessive/having/working),
4 Light (romantic/sense/pose/art).

The WIC entries should go down from ~800 to 200-400 with the same pickups and have more range... especially around the 'Burning' and 'Light' categories.
Title: Re: Pattern based NLP
Post by: MikeB on November 08, 2021, 08:33:01 am
I'm still working on this. I recently completed the concept for English words into the new Grammar categories.

There are:

Four main groups for Nouns, verbs (actions), and adjectives (modifiers). The groups are: Moving/living things, Analytical/laws/concepts, Logical subparts/binary actions, Light sense/stories/beautiful terms. (IE. The Four elemental/original groups each contain three sub groups - Noun, Verb, Adjective).

Seven other groups: Articles/Quantifiers, Person/Agent, Question/Interrogative, Time-spatial, Direction-spatial, Conjunctions/sentence breakers, Exclamation/grunt/hi/bye.

All Eleven groups are also encoded with present/future/past and precise/optimistic/explaining at the same time. IE All Present things are Precise, all Future things are Optimistic, all Past things are Explaining. In the Four main/elemental groups: Nouns are Precise/Present, Verbs are Future/Optimistic, Adjectives are Past/Explaining.

The last thing that breaks all grammar common sense is that each word is only permitted to be in one category only. So words like "brush" must either be an action (verb) or the name of a thing (noun). The default is to be an action (verb) as nouns aren't heavily relied on in sentence matching.

The whole concept is speed over quality... but as there's nothing for 3d environments between hard-written dialogue trees and GPT-3, this will sit right in between...

There are 3500 words to individually convert over so will take until the new year to do as I'm also looking at Speech Recognition software.

Speech Recognition software today uses algorithms/n-grams/NNs and is really slow (1-3 seconds response time) and uses a lot of power... The speed of my FSM/FST/binary NLP is 0.1ms to process a sentence (all words & intention)... So if the speech rec software is fast as well, then it's more suited to 3d environments even if not as good...

Combining the NLP with speech rec is as simple as writing phonemes next to each word in the dictionary... If the user is speaking via voice then the word-text searching can be skipped altogether... it can go straight from voice phoneme -> synonym symbols -> pattern sentence pickup -> intention grouping.

For audio processing I'm looking at OpenAL Soft (Open Audio Library) right now. There's nothing in the libraries for voice recognition, or even microphone low/high/band-pass filtering, but it's low-level enough to work on, with both speed and cross compatibility with other OS's.

The fastest approach I've seen is to take about 50ms of audio (shortest phonemes), generalising the pitch then associating it with a phoneme (tuned to your accent). This is about 1ms fast... but again, sits inbetween the best and something hard-coded.

One of the benefits if it works is that a responding chatbot can completely vary the response time to suit the situation... including interrupting the user, which adds another layer of humanness, but depends on how well words & intentions are picked up.
Title: Re: Pattern based NLP
Post by: MikeB on November 24, 2021, 03:06:58 pm
Update on audio speech recognition.

Traditional speech recognition uses FFT/DCT/DTT transforms (Fast Fourier Transform and related transforms) to decode audio into voice phonemes. These capture 3 voice formants (frequency ranges specific to a phoneme) from one 'signature'. However they use nested loops and are slow to process. DTT is the fastest, but I want to try it another way...

Most spoken phonemes have a range of different frequency areas combined to make the sound - bass/warmness, middle range, high range. EG. "oh" is mostly bass. "ee" bass-middle. "ss" high.

The way I want to try is separating common frequency ranges first initially, then measure the power & complexity afterwards to tell if one range is loud/complex versus the others.

Separating the frequency ranges (band-passing) can be done in real time using just a few instructions, using pre-calculated IIR filters (http://www.schwietering.com/jayduino/filtuino/index.php). FIR filters are better quality but slow.

There is 20ms in between recorded audio frames to process the data, so I'm aiming to get both phoneme and NLP processing out of the way in 1-10ms, using the same thread as the one capturing the data.
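To illustrate why a pre-calculated IIR filter is so cheap per sample, here is a textbook two-pole resonator sketch. This is not necessarily the exact Filtuino design linked above; the coefficient formulas and all values are assumptions.

```cpp
#include <cassert>
#include <cmath>

// A pre-calculated two-pole resonator: one sample in, one sample out,
// three multiply-adds per step. Textbook form, values illustrative.
struct Resonator {
    double b0, a1, a2;       // coefficients, calculated once
    double y1 = 0, y2 = 0;   // the two previous outputs

    Resonator(double centre_hz, double r, double sample_hz) {
        const double kPi = 3.14159265358979323846;
        double theta = 2.0 * kPi * centre_hz / sample_hz;
        a1 = 2.0 * r * std::cos(theta);
        a2 = -r * r;
        b0 = (1.0 - r * r) * std::sin(theta);  // rough gain normalisation
    }
    double step(double x) {                    // runs once per audio sample
        double y = b0 * x + a1 * y1 + a2 * y2;
        y2 = y1;
        y1 = y;
        return y;
    }
};
```

The pole radius r (0 < r < 1) sets the bandwidth: closer to 1 gives a narrower, higher-Q peak around centre_hz, which matches the Q-tightening discussed in the later posts.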

This is some captured data for the word "wikipedia". The asterisks (*) represent good power & complexity levels versus background noise.

Currently there's little noise filtering and the band-pass filters need tightening up, but eventually, if the results are strongly reproducible, they can be added to tables as base values...

(https://i.ibb.co/D4N3wWg/waveform-wikipedia.png)
Title: Re: Pattern based NLP
Post by: MikeB on November 26, 2021, 06:17:06 am
An update.

I changed the IIR filters to Resonators centered around 600hz, 1250hz, and 3150hz and now have double the signal-to-noise with more stable numbers.

This amplifies the signal for certain types of sounds, but in order for this to work I feel like I need about 10 filters, centered around different frequencies.

One FIR filter or a Fast Fourier Transform (normal approaches to speech rec) are approx 50-100x slower than one pre-calculated IIR resonator filter, so there's plenty of room...

Signal is only 12.5%-25% over noise background, and you need to speak close to the mic, so SNR needs to be improved by at least twice again to work...

(https://i.ibb.co/RBSH63F/waveform-wikipedia.png)
Title: Re: Pattern based NLP
Post by: MikeB on December 03, 2021, 11:19:32 am
Update.

I changed the three IIR Resonator filters to eight, and now have 50-100% more SNR with more stable reproducible numbers, compared to the 20% SNR before.

The numbers in the picture below represent Complexity (busyness) of the signal at specific frequencies. For different vowels/consonants the busyness should change greatly depending on frequency, however the frequencies are overlapping so they're mostly all equally busy. The Q of the IIR Resonator filters can be increased though to produce more isolation so I will be changing that.

cF1 is 300hz ... cF8 is 2750hz.

Frame size is 20ms, and duty cycle is only 1-3ms of that 20ms. So the thread goes to sleep 85-95% of the time.

Next, after sharpening the IIR filters, I also downloaded a program called Praat to analyse my voice so I can see which frequencies should be the most busy depending on vowel/consonant.

There are a couple of other things I found out about accents as well to make vowels/consonants switching possible, but that depends on a clearer output.


(https://i.ibb.co/5vK3ktk/waveform-wikipedia.png)
Title: Re: Pattern based NLP
Post by: MikeB on December 10, 2021, 11:00:14 am
Update.

I changed the Resonator/Peak filter's bandwidth to 250hz (Q = 3) to better detect some voice formants.. but in doing that the SNR dropped to 15% (compared to 40-100% before), so now it needs noise reduction.

Most of the voice formants that don't move in pitch (and are spaced out) are easy to detect, but not the more complicated ones.

Voice formants for the long vowel "a-":
"a-". F1: ~500hz (+/-200), F2: 1500 transition to 2250hz, F3: ~2750hz (+/- 100), F4: ~3500hz (+/- 100)

F1 and F3 can be locked in for this phoneme (F4 is just air), but F2 needs to be clearer, then drawn as a vector.
Title: Re: Pattern based NLP
Post by: MikeB on December 17, 2021, 09:54:28 am
Update.

Remade all the Resonator/Peak Filters in a program called EasyFilter which uses 64-bit doubles to do calculations instead of cheap 16-bit sums, and it's a lot better.

Added a Low-pass filter before doing the Res/Peak filtering. I have access to the data buffer so I was able to use a Zero-phase IIR filter (running it forwards and backwards over the data)...

Now the noise/non-signal is very low and consistent, and the voice/signal has good SNR and is consistent too.

Vowels can't be understood very well this way, but it does give a good starting point for Match Filtering against a pre-recorded artificial/"perfect" vowel - i.e. skipping manual cross-correlation of all 44 phonemes, as you can reduce the number of possibilities first.

Praat allows making and saving custom sounds/vowels in 16-bit sample points, so at this point (after seeing a particular frequency is busy) I can straight away reduce the number of possible phonemes then test them. As vowel phonemes are 200ms of 1-3 frequency tones that gradually go up/down/straight, being out by 10% may not make much difference... So will be testing this next week.
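The zero-phase trick mentioned above (running an IIR forwards and then backwards over the buffer, which is possible because the whole buffer is in hand) can be sketched like this. A one-pole low-pass stands in for the real higher-order filter; alpha and the filter choice are assumptions.

```cpp
#include <algorithm>
#include <cassert>
#include <cmath>
#include <vector>

// Zero-phase filtering: a causal filter shifts phase, but running it
// forwards and then backwards over the same buffer cancels the shift.
std::vector<double> zero_phase_lowpass(std::vector<double> x, double alpha) {
    auto pass = [alpha](std::vector<double>& v) {
        double y = v.front();                // simple one-pole low-pass
        for (double& s : v) { y += alpha * (s - y); s = y; }
    };
    pass(x);                                 // forward pass
    std::reverse(x.begin(), x.end());
    pass(x);                                 // backward pass (cancels phase)
    std::reverse(x.begin(), x.end());
    return x;
}
```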
Title: Re: Pattern based NLP
Post by: MikeB on January 05, 2022, 08:28:57 am
Update.

Changed the main LP filter to a 2750hz Chebyshev zero-phase, and changed all sound data to use 32-bit floats while processing.

Wave formants 1-4 (up to 2750hz) are now 200-300% SNR over background noise (spoken close to mic).

Tone analysis (sine wave matching) didn't work over 100ms, as it only gives 3.5hz bandwidth before being 75% or more out of phase and indistinguishable from noise... Tone analysis using a short window won't work very well either, as the vowels "a-" and "o-" (long 'a' and long 'o') transition from one frequency to another over 100ms-1000ms, and need to be +/-10% anyway.

So I'll be looking at more traditional frequency analysis like FFT, DCT, DTT. With the fast pre-calculated wave formants, many of the phonemes can be counted in/out before doing deeper signal analysis anyway so it should be faster.
Title: Re: Pattern based NLP
Post by: MikeB on February 07, 2022, 11:57:13 am
Deleted all IIR filters as they seemed to be adding noise. Now only using pre-calculated FIR filters, which also have a built-in Hamming Window for the FFT.

There are 3 FIR filters (low-passes) at 1100, 2200, 4200, with the sample data split into 0-250hz, 250-1000hz, 1000-1500hz, 1500-3000hz. This reduces noise as much as possible, as the processing is mostly how Busy the wave data is.

"Busyness" is calculated for the 250-1000hz range. (Sample data difference / RMS power). If busy then the FFT is processed.

The FFT runs on each of the four frequency ranges separately. - If the four ranges are merged with only one FIR the output is too noisy for the lower ranges.

After this the FFT output data is smoothed with the top frequencies averaged together but I will be changing this to a custom Mel Filter Bank setup (not the algorithm, using 12 fixed ranges to suit phoneme ranges).

There's still a lot to change but thought I would post this as it's completely different, and still only takes ~1.5ms to process (from a 102.4ms sample buffer). The NLP only takes 0.1-0.2ms on top. So it is 98.5% idling in a thread::yield().

Normal Speech Recognition takes 0.5-0.7 seconds per sample. The problem seems to be in the autoregression/prediction modelling after the FFT and MFCC/Mel Filter algorithms. Precise, but the speed is too slow...

In the picture "a-" and "e-" long vowels are detected.
(https://i.ibb.co/f0qxR7L/waveform-longa-e.png)
Title: Re: Pattern based NLP
Post by: MikeB on March 09, 2022, 10:08:07 am
Rewrote everything to better frame the data around noise at the beginning and end.

Using RMS derived 'Busyness' and Volume Peak to auto adjust the noise floor - You can gradually raise your voice and it will only output "*noise*", but when you suddenly speak from quiet it will then process voice.

The main differences now are in framing, as this and pitch detection are the most important.

So, in not using autoregression/prediction and inverse FFT etc. (the main time consumers), all I have is basically one FFT output. This makes it difficult to detect any short-lasting sound, which is most consonants, as FFTs need many samples for reliability.

So to detect consonants and vowels together, I'm using framing, resolution, and sound transition. Frame lengths are 16ms, 32ms, 64ms, 64ms (for 8khz: 128, 256, 512, 512 samples) fixed length starting from when the voice starts. Consonants are a one or two part short burst (16ms, and 32ms frames). Vowels are two part (64ms, and 64ms). The theory is to use ~5 group resolution in hz for Consonants, and 12 for vowels. The 12 groups for vowels are based on general vowel speaking range. Consonants aren't always at the beginning of something you say, but most are involved with plosives so will always have silence before them, creating a new speech frame with the consonant at the beginning.

Another problem was a low hz bias in the FFT output (linearly stronger at lower frequencies), so levelling those out made a big difference.

Getting good results from vowels, but framing, FFT quality, and post processing still need work.

(https://i.ibb.co/yk5r9wd/2022-mar-waveform-longa-e.png)
Title: Re: Pattern based NLP
Post by: MikeB on April 08, 2022, 01:44:25 pm
This month was mostly theory and expanding on Framing, Resolution, and sound Transitions.

An FIR Hilbert filter (at 2500hz) was added for finding the initial Framing/Noise Floor level. This replaced the entire 'Busyness' code, as the Hilbert filter removes bass, top end, and 2500hz breath noise in one go; Busyness calculations that relied on those now show no difference. So Volume Peaks are relied on instead.

Framing is now expanded to cover both short & long voice transitions. At idle/noise a standard 1408 samples are captured (~0.7% duty cycle, ~1ms/140.8ms frame at 10khz). During work, up to 10 frames or 3456 samples are captured (at ~25% duty cycle, including waiting for samples).

Frames are: G1: 12.8, 25.6, 25.6, 25.6, 25.6, 25.6ms. G2: 51.2, 51.2, 51.2, 51.2ms.

The FFT still works well at 256 samples.

Mels Bank style filter groups are now 13 for vowels: 250, 350, 450, 550, 650, 750, 1000, 1333, 1666, 2000, 2500, 3500, 5000hz.

The Low Hz Removal filter was upgraded to reflect HRTF/true ear resonance values as our ears naturally remove bass and boost 2-5khz, but mics don't.

Consonant & Vowel detection model is still: Silence-ConsonantGroup-VowelGroup-Silence ... or Silence-VowelGroup-Silence.
This meant adding 15 new consonants as combinations, and a lot of vowel groups. Vowel groups are a WIP as there's potentially hundreds. The longest and most difficult connected vowels without silence are words/phrases like "who are you", "are you 'aight"/"are you alright", "oh yeah", "you wait"/"you ate". They contain consonants but if they're not pronounced with spaces then the only way to detect them is as one long vowel with multiple sound transitions.

So if frequency transitions can be identified clearly, patterns of vowels are much easier to detect...

Overall it's a nightmare, but I'm close to word output now. If I can detect "k" regularly I can output "cake", fast, and be reliably different to "tree".

(https://i.ibb.co/LZgvytm/2022-04-waveform-longa.png)
Title: Re: Pattern based NLP
Post by: MagnusWootton on April 08, 2022, 04:17:09 pm
nice 1,  but could it trick someone into thinking the bot is someone,  cause I need to get out of some annoying interviews I'm currently having.   :2funny:
Title: Re: Pattern based NLP
Post by: MikeB on April 09, 2022, 07:00:24 am
Response speed, including being able to choose from instant to a well timed response, adds to affinity/immersion so that increases believability.

EG. An instant response in tense circumstances, or a 1-2 second delay in casual circumstances.

It's more intended for real-time applications than a "cover all" approach.
Title: Re: Pattern based NLP
Post by: MikeB on June 03, 2022, 11:43:33 am
Last month I intended to output "cake" and "tree" but it was still too inconsistent.

This month is half-way there.

Framing and Resolution was updated to support 16khz instead of 10khz, to better detect consonants at 4000-5500hz.

Sample sizes are doubled but there are fewer frames, so the speed is roughly the same.

Sample framing is: (G1) 16, 32, 32, 32, 32ms. (G2) 64, 64, 64ms.

Resolution groups have been updated slightly. 15 instead of 13, starting from 400hz-5500hz.

No changes to vowel Transition model, but now also outputs which consonants/vowels are detected.

Consonant detection is mostly fine with a clear voice. Only "k" tested.

Vowel detection is inconsistent. As vowels rely on multiple frames, changing to 16khz as well as reducing frames didn't help, so I will be testing 12khz in the future with shorter/more frames.

Saying "cake" (below) with phoneme output.

(https://i.ibb.co/Vp2H1XW/2022-06-waveform-k-ay.png)
Title: Re: Pattern based NLP
Post by: MikeB on July 06, 2022, 02:03:46 pm
The frequencies for "cake" and "tree" are reasonably consistent now.

Response time is only 5-10ms from the end of recording the last sample to the end of processing the data.

"K-ay" (in "cake") and "t-r-ee" (in "tree") are 90-100% consistant with good volume. The second "k" sound in 'cake' is often a low-volume miss.

Robust in *mild* accent change, vagueness/clarity change, volume change, pitch change. Low volume is the worst.

16khz fixed input frequency.

The FIR Hilbert and FIR low-pass filters have been modified to work with factors of 8 to fit the frame sizes. Don't know why I didn't do that before. One low pass at 1000hz, the other low pass at 8000hz (dual samples feed the FFT).

Frame sizes are one 16ms frame, the rest are 32ms each.
16khz/368ms max. 5888 samples: (G1) 256-512, (G2) 512-512-512-512-512, (G3) 512-512-512-512-512.

Added a Plosive Detector to determine whether G1 samples are a plosive consonant.

Non-plosive consonants are now checked with vowels.

Seeing all the frequencies I need to for the consonants & vowels I want. I just need to add more and detect low volume samples better.

I'm aiming for a small list of words such as "up, down, left, right, start, stop, hi, bye..." so it can become useful straight away. It's also easier to handle strong accents and missing sounds if the word list is smaller. I have many ideas for solving it, but it'll at least be as good as current command based systems. If this works out then more is possible...
Title: Re: Pattern based NLP
Post by: MagnusWootton on July 06, 2022, 03:18:05 pm
I've done a lot of computer vision (in the form of a corner tracker), and I can extend the theory to a 1D system for sound.

Ever thought of using k-nearest neighbours on sound bytes to do voice recognition?
Title: Re: Pattern based NLP
Post by: MikeB on July 07, 2022, 08:40:16 am
K-nearest neighbour is a statistical analysis/learning method, and this is the reason why speech recognition always takes at least 0.5 seconds to process, and why speech recognition has never progressed since the 1950's (except for different types of guessing algorithms)...

My approach is halfway between IBM shoebox and modern speech recognition.

Frequencies above & below 1000hz are split, ultimately run through an FFT, then fast frequency analysis performed. Nothing else. The NLP (for homophones, missing words...) uses Compression Pattern Matching.

Speech recognition that can handle all words in all languages are one thing, but an "as fast as possible" approach is still needed in society...
Title: Re: Pattern based NLP
Post by: MikeB on August 09, 2022, 11:52:25 am
This month was mostly fine tuning and an efficiency pass.

Processing takes approx 1ms per 32ms frame.

Framing is the same. One 16ms frame, the rest 32ms.

Resolution has been increased to the below values:
438, 469, 500, 532, 563, 594, 625, 657, 688, 719, 750, 782, 1000, 1250, 1500, 1625, 1750, 1875, 2000, 2125, 2250, 2500, 3000, 3500, 4000, 5500, 8000

HRTF/Fletcher-Munson calibration values for the above are better.

Frequency Transition detection is the same.

More phonemes added (26 unique patterns: 23 vowels. 1 non-plosive consonant. 2 consonants)

That is basically all the vowels I need to add, which include some duplicates for accents. I have more patterns for consonants but I'm not testing those just yet.

Finally added Peak Volume Normalisation which doubled the reliability/robustness.

"Cake" and "Tree" are IMO comfortably robust for a command based system. I think most people would get them in 1-3 shots. That might seem like it has little value, but this is no data, no training, no learning, handles mild accents, vagueness, pitch & volume change, and is instant.

Phoneme Animations:

This month I also planned a total redesign to suit realtime phoneme animations. IE. To send an Animation ID & Length along with audio chunk data, for 3d experiences and games.

Sound waves travel one metre in ~2.9ms, so to be as real as possible, at 10 metres phoneme animations should be 29ms faster to start than the audio chunk data, and be seamless.

So I calculated with 4kb data chunks (128ms total @ 16khz), the initial plosive consonant (48ms) can process and send up to 78ms early (which is 27 metres in sound travel). 3kb data chunks work as well (up to 46ms early or 16 metres). Animations will be 50-100ms late if you look at your own character while you speak.

90% of the quality factor in this will be initial plosive detection (regardless of actual consonant) and this is just sample volume related. Only a general plosive mouth animation needs to be played if the actual consonant was missed. So initial mouth movement will start at the right time, look 75%+ right, and end at the right time.

Current phoneme animations in games are either pre-calculated to pre-recorded audio, or random mouth movements.

Saying "tree":
(https://i.ibb.co/j4WcfZG/2022-08-waveform-t-r-ee.png)
Title: Re: Pattern based NLP
Post by: infurl on August 09, 2022, 12:25:30 pm
Is this something that you could run on a single board computer like a Raspberry Pi? A lot of people would have a use for it in that situation.
Title: Re: Pattern based NLP
Post by: MikeB on August 09, 2022, 01:58:02 pm
Yes it's all single thread. It should run on an ARM Cortex M4F as well. Just need a MEMS microphone.

The FFT is the most complex thing, but it's low res and there are FFT's optimised for the Cortex M series & Rasp Pi's...
Title: Re: Pattern based NLP
Post by: 8pla.net on August 15, 2022, 12:57:58 pm
Quote
Current phoneme animations in games are either pre-calculated to pre-recorded audio, or random mouth movements.

What is a phoneme animation?   Do you mean a mouth posture animation?   
My guess is, yes, since you mentioned these are game animations.

Phonemes and visemes are closely related.  Visemes are graphics, phonemes are audio
of the same speech synthesis.   However, one difference is that a single mouth posture
(a viseme) may look the same ( be reused ) for a few different phoneme sounds.



Title: Re: Pattern based NLP
Post by: MikeB on August 16, 2022, 08:55:10 am
Yes I mean mouth pose animation.

I have a few written, but it's a little bit down the to-do list.

The tongue can technically be animated as well (to individualise animations to each of the 20-40 phonemes) but I'm not sure how visible that is for the effort spent....
Title: Re: Pattern based NLP
Post by: 8pla.net on August 16, 2022, 03:22:19 pm
Some are so similar, you may design a set with fewer visemes that still works well.
 
https://www.youtube.com/watch?v=6c1WsMuhpFo

Of course, there is nothing wrong with doing all of the visemes, if that is your goal.
Title: Re: Pattern based NLP
Post by: MikeB on August 17, 2022, 07:31:59 am
I plan to do it minimalistically at first, but I'm also planning for max quality...

I actually have two K's and two T's recorded right now - one for highs, one for lows. They both make the same basic mouth shape, but the higher toned version also activates the cheek muscles.

EG.
"K" as in "kay" = 3000hz.
"K" as in "kee" = 4000hz = cheek muscles.

I think vowels can be a little muddled, but plosive consonants (k, t, p, d, f...) can improve quality a lot if done well.

Great video
Title: Re: Pattern based NLP
Post by: MikeB on September 09, 2022, 10:02:23 am
This month was Word Lists, Word Searching, converting to a library format, and running the main speech processor in a separate thread.

Word Lists are small. As it doesn't use prediction, there's no way to tell the difference between homophones using singular words. "Close" and "claws" may also be the same word in two different accents, so it's better if the user knows which word they want (deliberately picks one). If the list contains "close" but not "claws" then vowel swapping ("aw" to "oh") can correct it. Another issue is silent/quiet secondary syllables (eg. the "s" on the end of "close") - in this case, as words are written as syllables, "clo" can be saved as "close" for a boost in detection. In future, user filtering should trim the list beforehand. I think that's better, as when you're facing a door, "open" or "close" are the two main words you're looking for, not "clone" and similar.

Word Searching uses binary searching. Each word may take one to ten spaces (high/low versions of the same consonant, accented vowels), so a list of 10 words may be 100 records.

Library format, and running the main speech processor in a separate thread.
Now using pThread to run the main processing loop in a new thread. Much more user friendly as main() now only contains new Phoneme checking (for mouth animations), new Words checking, new Phrases checking.

Next month I want to get a solid bunch of action words working such as: Left, right, up, down, start, stop, open, close, yes, no, hi, hello, bye. one, two, three, four...

So that will allow me to test/refine frequencies & word lists further.
Title: Re: Pattern based NLP
Post by: MagnusWootton on September 09, 2022, 04:29:27 pm
Quote
K-nearest neighbour is a statistical analysis/learning method, and this is the reason why speech recognition always takes at least 0.5 seconds to process, and why speech recognition has never progressed since the 1950's (except for different types of guessing algorithms)...

My approach is halfway between IBM shoebox and modern speech recognition.

Frequencies above & below 1000hz are split, ultimately run through an FFT, then fast frequency analysis performed. Nothing else. The NLP (for homophones, missing words...) uses Compression Pattern Matching.

Speech recognition that can handle all words in all languages are one thing, but an "as fast as possible" approach is still needed in society...

Sorry for being a bit late, but k-nearest is only slow if the database of ids is too large; if you keep it under 100 it goes extremely quickly.

Just do a full linear test of every one in the database; it's really easy to code as well, a no-skill job that's for sure.

The FFT actually goes a little slow. Back in the olden days too many FFT effects in Fruity used to slow you down; these days it's a lot different though, it goes a lot faster with the 4 cores on the CPU, I think that's why.
Doing it fast would mean staying in the time domain, or at least making it a sparse FFT call, not every frame.
Title: Re: Pattern based NLP
Post by: MikeB on September 10, 2022, 08:47:51 am
It may help. After I finish everything and need more accuracy...

All the speech recognition I've seen looks for ultra-precise frequencies (a large FFT data set) and uses large lookup tables. This takes the longest, as I've never seen speech rec faster than ~0.5sec.

I've already done it the hard way using low res FFT and watching transitions. I have a table of 23 vowels plus accented versions, so the only place I would add it is in helping "alias" the input frequency & transition data to the vowel/consonant tables better.

It's weak to nasal/vague and wavering voices... but I'd rather not guess at accents etc; I want to use a methodical approach for that.
Title: Re: Pattern based NLP
Post by: MagnusWootton on September 10, 2022, 06:00:00 pm
Maybe you can sum up the adjacent samples, kinda like a low-pass filter, but don't divide it back down, and it will speed it up a huge amount. That hint was given to me on Goertzels singularity channel by a person there.

Your project is cool,  I like all recognition projects, really good and practical. really works, kinda magic watching the computer do it.
Title: Re: Pattern based NLP
Post by: MikeB on November 06, 2022, 07:00:55 am
Taking a few months to learn Unity.

A few small updates:

Future:
Title: Re: Pattern based NLP
Post by: MikeB on January 14, 2023, 08:22:30 am
I returned to the NLP as it's a major part of the Speech Recognition, and the Chatbot component as a whole.

Previously in the NLP, there were ~3500 words and ~1000 pattern sentences. This only solved 550-1000 sentences in the WiC benchmark as the words weren't set up correctly. Word grouping was very basically split.

Now the Grammar interpretation/word-compression groups are set up as originally intended. 39 total groups now for all words to compress into (3-4x).

So currently, with ~2200 words and only 16 pattern sentences, 165(+) sentences are solved in the WiC benchmark. So already, there are far fewer false positives and the patterns better target the sentences they're intended for.

Max pattern sentences will be around 1000, and the goal is to solve around 70-75% of the 5500 sentence benchmark. So hopefully that's possible.
Title: Re: Pattern based NLP
Post by: MikeB on February 06, 2023, 08:48:17 am
Up to 2900 words and 700 WiC sentences and ran into a problem.

The WiC test itself isn't specific about what "context" means. I.e. for the same word selected in two sentences, do they have the same "context"?

Before, I matched for different intentions: the words before and after the highlighted word. Eg. (1) "for one person to do"; (2) "each person is unique". Different intentions, but the same word meaning.

Now it appears to mean Homophones.. So in any case where the word is the same it's a match.

So I'm switching to two lists. One for Homophones. One for Intentions. The Intentions list is for chatbots / phrase meaning comparison.
Title: Re: Pattern based NLP
Post by: MikeB on March 07, 2023, 10:22:39 am
NLP:

Only testing for a good WiC score right now. It solves a lot for the short amount of groups/sentences, and it's all "as intended" now, so I just need to go through the sentences (with homophones in mind) for a good score.

Speech rec:

Loading from a .wav file substantially improved testing reliability, so I'm now refreshing everything.

Formant groups will be changed from three to seven: 0-375hz, 375-1000hz, 800-2500hz, 2500-3500hz, 3500-4500hz, 4500-6000hz, 6000-8000hz.

Most vowels only use up to 2500hz. This is perfect, except for "a", "e", "i" short vowels which use up to 3500hz. The difference (when spoken) is whether your cheeks are activated or not (showing teeth). So accounting for both (under and over 2500hz separately), "a", "e", "i" can now have low and high versions represented in viseme animations. One with cheeks activated, one without. There are also a few long vowels which can have the same low & high versions. "ee", "ay", "ew", "ow", "oi", "ier", "uah".

Definitely will also be using some kind of MFCC to un-fuzz low frequencies post FFT.
Title: Re: Pattern based NLP
Post by: MikeB on April 13, 2023, 02:55:51 pm
Speech Rec:

Combined vowels "i" and "ee":
It's unable to tell the difference between the short vowel "i" and long vowel "ee", as the FFT decimates frequencies too much to reliably detect a 20hz frequency drop. An MFC setup only for low frequencies may help,... but the real solution is an FFT with higher resolution in that area (and less resolution in upper frequency areas).

Other errors:
There is an error with the hamming window, sine wave sync/frame, or FFT. (Vertical grey bars in the spectrogram). Otherwise it's looking good.

Testing sounds "kay", "key","kai".

(https://i.ibb.co/QYrX6DR/2023-04-phoneme-Extractor.png)
Title: Re: Pattern based NLP
Post by: MikeB on May 12, 2023, 04:16:07 pm
Speech Rec / Phoneme Extraction Tool:
Title: Re: Pattern based NLP
Post by: MikeB on June 20, 2023, 11:40:38 am
Speech Recognition / Phoneme Extraction Tool:

Conversion of main processing code (including FFT) from C to C# - 100%.
GUI/code: Loudness Curve
GUI/code: Phoneme output
GUI
Updated frequency group definitions (312, 468, 562, 656, 750, 875, 1000, 1333, 1666, 2000, 2500, 3000, 4000, 5313, 8000)
Updated equal-loudness
New equal-loudness calibration for the first 16ms frame (+50%)
Removed last FIR filter @ 8000hz. Did improve frequencies (loudness) slightly but not enough to justify the processing time.
Combined Find Zero-Crossing function with Noise Find. Speed & quality improvement.
Replaced Hann window with a custom wide lobe cosine window/Inverse Blackman window.
Improved syllable finding
Updated Wav file loading

Satisfied now with the "Kay, Kee, Kai" test.

Next target is finding "yes" & "no". This is part of the Google Speech Commands (Keyword Spotting) benchmark.

(https://i.ibb.co/CVLrD7N/2023-05-phoneme-Extractor.png)
Title: Re: Pattern based NLP
Post by: MikeB on July 10, 2023, 10:20:33 am
Speech Recognition / Phoneme Extraction Tool:

Updated equal-loudness
Reduced noise floor dynamic step up/down amount
Lowered initial volume pick up
Replaced Inverse Blackman window with a 2x amplified Hann window. Similar quality, large speed gain.
Improved consonant framing (plosive find)
Redesigned consonant identification
Added Vowel formant 1 Focusing. Frequencies starting at or above 656hz are now solely detected to transition at or under this amount (to avoid interference from strong nasal tone).
Reduced nasal tone volume (875-1000hz) by half

Working:
"kay, kee, kai"
"no" x3 variations
Title: Re: Pattern based NLP
Post by: MikeB on August 07, 2023, 02:45:27 pm
Speech Recognition / Phoneme Extraction Tool:

Redesigned consonant plosive-find
Redesigned consonant identification
Updated consonant frequency definitions. Added 2000 (512, 1000, 1500, 2000, 3000, 4000, 5500, 8000)
Consonant viseme upper tone/cheek emphasis changed from boolean to variable.
Updated vowel frequency definitions. Added 2333, 2666, 3500.
Updated vowel identification
Added Vowel formant 2 focusing:
    1) F2 frequencies starting above 1000hz cannot transition to 1000hz, to avoid interference
     from nasal tone, and
    2) F2 frequencies where the last two values are the same can only transition up/down
    by one.
Improved Vowel F1 transition analysis to include power & average centre, and for flat transitions to re-alias centre frequency to a defined frequency group. This improves fine frequency transition detection, and for flat frequencies resilience to errors.
Updated Phoneme Emphasis
Updated equal-loudness
Updated GUI

Working:
"kay, kee, kai"
"no" x3 variations
"yes" x3 variations ("ye")

A lot of range in "no" and "yes" (and by extension all other vowels & consonants) is supported now, and I will be setting up a benchmark to test it more.

Auto Phoneme Emphasis depends upon the peak volume in a range of frames. There are three ranges (2 - 5 - 5 frames). Volume detection in the first two frames is important as many loud plosives (eg "k", "t", "p") exist there and need to be represented accurately, as well as account for mouth-open, meaning emphasis travels high in advance. They are all precise to the audio and don't use general curves, so if they're wrong they're wrong, but when they're right they're really right. An emphasis multiplier determines max emphasis.

"Yes" and "No":

(https://i.ibb.co/DwQLKFy/2023-08-phoneme-Extractor.png)

(https://i.ibb.co/sPc5HK6/2023-08-phoneme-Extractor2.png)
Title: Re: Pattern based NLP & ASR
Post by: MikeB on September 14, 2023, 08:14:34 am
Lipsync:

Creating a Unity script for Face, Blink, and Viseme morphs to use with Daz3D models.
Face blend shapes (happy, sad/lost, angry, shock), plus four others through combination.
Blink blend shapes. Auto-blink with ranged values. Min, max, stop after open/close (stare/sleep), blink once now, squint.
Viseme blend shapes. Emphasis modifier.

Speech Recognition:

Added Visemes to Speech Recognition engine.
Added/Updated Word Hashing and Word Search.

Working: Viseme lipsync "no" x 3, "yes" x 3. (simulated from speech recognition data)

Lipsync is already as good as, if not better than, current SOTA lip syncing.

The default viseme blendshapes from Daz3D work well. However they're single frames and some of the phonemes need two (start and end). So for "oh", "ah" and "w" are used.

Both the timing and the two-frame "oh" in "no" boost quality by quite a lot.

The audio used is from a few samples of the Google Speech Commands dataset.

"no" x 3, "yes" x 3.

https://youtu.be/9K1yiI-Z-38
Title: Re: Pattern based NLP & ASR
Post by: MikeB on October 15, 2023, 09:59:11 am
Speech Recognition (benchmark of "yes"/"no"):

C version/Console:
Custom wave file loader for benchmarking
Benchmarking a target word
Noise Floor now based on RMS instead of Volume Peak.
Noise Floor Raise 'step' changed from fixed value to 2x current noise floor RMS.
Voice volume minimum changed to 5x current noise floor RMS (range: 3x - 6x).
Removed one-frame "click/pop" noise.
Combined Vol Peak Normalisation with FFT loading for a speed improvement.
Updated Vowel F1 & F2 transition analysis
Updated Vowel frequency group definitions. First three (281, 375, 468hz...).
Updated Equal Loudness
Updated Consonant identification
Other fixes

Working:
"no" (405 records), before to after:
"no" = 9.14% to 36.30% (goal 50%)
"n" or "oh" = 36.5% to 86.9% (goal 90%)

Time: 0-1ms each.

Very tough benchmark. Many samples are fraught with noise (blips, clicks/pops, static, paper sounds). There was a large change in detection from only changing the noise floor from Peak to RMS, and another large change after redesigning consonant identification.

Work on noise elimination is needed, and testing "yes".
Title: Re: Pattern based NLP & ASR
Post by: MikeB on November 07, 2023, 12:51:46 pm
Speech Recognition (benchmark of "yes"/"no"):

C version/Console:
Added filtering for "knock/tap" noise
Changed Vol Peak Normalisation to use RMS x 2.4 instead of highest Vol Peak. Slightly more accurate versus background noise, faster, and phoneme/viseme loudness is represented better.
Updated framing. Now using 256 samples up to the first plosive in a syllable, then 512 samples after. This helps in Consonant framing.
Replaced fixed framing 256-512-512-512-[...] with dynamic method mentioned above.
Updated Consonant Plosive Find step values. Initial min step, and F1 group power from last frame.
Updated Consonant identification
Updated Vowel Formant 1 & 2 Focusing
Updated Equal Loudness (increased/doubled some 4khz resonance numbers to better match spectrograms in Praat. 656hz (550-650), 1333hz (1000-1333), 2000-2333hz (1666-2333), 8000hz was already double).

Speech Commands Benchmark:
NO (405 records):
"no" = 44.69% (goal 50%) . "n" or "oh" = 81.7% (goal 90%)
Error | "yes" = 0.99%

YES (419 records):
"ye"/"yeah" = 22.43% (goal 50%). "y",  "e", or "air" = 67.5%  (goal 90%)
Error | "no" = 2.63%

Time: 0-2ms each.

A lot of changes were made, especially in Consonant framing to tell the difference between "y" and "n" better. Small improvements in vowel detection.

Many of the problems in detecting more "yes" (the "y") are noise or volume related. Other issues, are in vowel transition detection, and finding the trailing "s" in "yes".

If "yes" improves to >40% I'll then test "on" & "off".

Title: Re: Pattern based NLP & ASR
Post by: MikeB on December 21, 2023, 08:59:52 am
Speech Recognition (benchmark of "yes"/"no"):

C version/Console:
Redesigned Volume Normalisation to improve clarity above 500hz, and per-frame clarity. Samples are now run through a Hilbert filter to reduce bass < 500hz, and Peak Volume Normalisation is at the end of each frame instead of all frames.
The new Hilbert filter and frame-based Peak Vol Normalisation improve these issues:
  1: Deep voices no longer interfere with/reduce volume normalisation of frequencies above 500hz.
  2: Initial frame is now normalised to itself. Other frames are normalised to the peak volume of current & previous frames (whichever is louder - peak volume does not adjust lower). As Vowels are detected to follow Consonants and are generally louder, both are now maximally normalised.
Deleted first frame 1.5x boost.
Updated Consonant Formants and identification.
Redesigned Plosive detection. Frame one plosive ("k"/"p"/"t") loudness minimum must match a fixed value. Frame two(+) plosive ("n"/"m"/"y"/"r") loudness minimum must match ~50% of the last plosive minimum + ~50% of the power of the last frame.
Updated Vowel F1 & F2 transition detection to gradually favour frequencies towards the end of the sound.
Range tuning. Better detection of plosives in consonants "k", "n", "y", detecting both sudden sound increases and rolling increases at any input sound volume range, noise, etc.
Other fixes.

Speech Commands Benchmark:
NO (405 records):
"no" = 48.40% (goal 50%) . "n", "oh", "o", "uh" = 82% (goal 90%).
Error | "yes" = 0.99%

YES (419 records):
"ye"/"yer"/"yeah" = 28.64% (goal 50%). "y",  "e", "er", "air", "ah", "a", "uh" = 70-80%  (goal 90%).
words.
Error | "no" = 2.15%

Time: 1-2ms each

+4% increase to "no", and +6% increase to "yes". Consonant identification needs to be reworked as the definitions for "y" and "n" are too close. "kay-key-kai" works.

The benchmark is more of a test of noise rejection and volume normalisation than anything else. Very happy with the robustness of this now.

New consonant identification should see at least a 10%+ improvement to "yes" with less error. If error stays around <= 1% then more alternate vowels can be used and a higher result is possible.
Title: Re: Pattern based NLP & ASR
Post by: MikeB on March 07, 2024, 07:36:41 am
I have pages of more work on this but I'm not happy with the results just yet. Both "yes" and "no" are down by 10%.

I'm not yet at the peak of how far pure frequency analysis can go. Currently, spectrogram analysis is improved for frequencies under 1000hz to a high degree, which improves vowel quality/reliability. There is also some improvement to consonant quality/reliability, but there are still problems separating "n" from "y".

With "n" and "y" improved there will be a definite jump in benchmark results.

I still have one more method to try to improve spectrogram quality.

Apart from the spectrogram, only the rules for identifying consonants & vowels need to be improved.

I don't get much done over summer in Australia, but within the next three months I may return to it.

There's a lot of value in keyword spotting that's so efficient. I also have a lipsync video test in mind that uses hundreds of "yes" and "no" from the Google Speech Commands dataset, all live processed and lipsynced, which will be interesting.