Ai Dreams Forum

Member's Experiments & Projects => General Project Discussion => Topic started by: MikeB on May 24, 2020, 12:16:50 pm

Title: Pattern based NLP
Post by: MikeB on May 24, 2020, 12:16:50 pm
This is a project I've been working on for a few years. In 2019 I was testing out the theory on Pandora Bots, and this year I'm converting it to C/C++.

The main goal is to be small, fast, solid-state. Not algorithms, deducing/knowledge, learning...

It works by matching single words, first through a spell check, then against another full list of words which assigns each word a token (one physical character/byte symbol). All words map into approx 20-30 groups of symbols. These 20-30 symbols are then used in pattern sentences, of which there are a few hundred. From each pattern sentence a broad intention can be gathered, and this is used to pick the chatbot response.
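A minimal sketch of that pipeline (word -> symbol -> pattern sentence -> intention), not the actual project code. The symbol letters, `demoWordList`, `demoPatterns`, and the intention labels are all made up for illustration, and the spell check step is left out.

```cpp
#include <sstream>
#include <string>
#include <unordered_map>
#include <vector>

// Hypothetical one-byte symbols; the real project uses ~20-30 groups.
constexpr char SYM_WONDER   = 'W';  // "can", "could" ...
constexpr char SYM_QUESTION = 'Q';  // "do" ...
constexpr char SYM_PERSON   = 'P';  // "you" ...
constexpr char SYM_ACTION   = 'A';  // "speak", "talk" ...
constexpr char SYM_TOPIC    = 'T';  // "english", "spanish" ...
constexpr char SYM_UNKNOWN  = '?';  // anything not in the word list

using WordList = std::unordered_map<std::string, char>;

struct Pattern {
    std::string symbols;    // one symbol per word
    std::string intention;  // broad intention label
};

// Illustrative data only.
WordList demoWordList() {
    return {{"can", SYM_WONDER},   {"could", SYM_WONDER},
            {"do", SYM_QUESTION},  {"you", SYM_PERSON},
            {"speak", SYM_ACTION}, {"talk", SYM_ACTION},
            {"english", SYM_TOPIC}, {"spanish", SYM_TOPIC}};
}

std::vector<Pattern> demoPatterns() {
    return {{"WPAT", "wondering"}, {"QPAT", "question"}, {"AT", "direction"}};
}

// Compress a lower-case, space-separated sentence to one symbol per word.
std::string tokenise(const std::string& sentence, const WordList& words) {
    std::istringstream in(sentence);
    std::string word, symbols;
    while (in >> word) {
        auto it = words.find(word);
        symbols += (it != words.end()) ? it->second : SYM_UNKNOWN;
    }
    return symbols;
}

// Exact match of the symbol string against the pattern sentences.
std::string findIntention(const std::string& symbols,
                          const std::vector<Pattern>& patterns) {
    for (const auto& p : patterns)
        if (p.symbols == symbols) return p.intention;
    return "unknown";
}
```

With this toy data, "can you speak english" and "could you talk spanish" both compress to `WPAT`, so swapping in similar words lands on the same intention. Words that are not in the list (names, places) collapse to `?`.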

The chatbot responses are fixed at around 50-100, each represented as a one character/byte symbol. These symbols are used to create duplicate / cross-language responses. The responses can also be voice recorded, with a recording assigned to each symbol.

It's aimed at time/space restricted Chatbots, for example in games, where processing needs to happen in milliseconds, but also take into account broad user input.

Size and Response times:
In 2019 in pandora bots (1000 words, 200 sentences) the size is ~500kb, and ~1 second response time.
In 2020 in C/C++ on an Arm Cortex M4 @ 120mhz it's ~159kb and 15-100 milliseconds.
In 2021 in C/C++ on a modern PC @ 2.6ghz with Binary Searching (set up time of 70ms), processing is ~1ms / sentence.

A few features:
 < 500kb including word databases.
 One sentence generally takes less than 1 ms to process.
 Limited chatbot responses make it easy to voice record and/or change personality.
 Private information stripped during word compression (no names/places).
 Fine differentiation of intentions, eg between: Wondering, Questions, and Directions - "can you speak english" "do you speak english" "speak english".
 Can count occurrences of emotional words, logical words, burning-analyser words, light-sense words to reply in kind better.

For the problem of chatbots in games or time/space restricted platforms, it solves:
 Too much data or processing power required.
 Cannot change the personality/no personality.
 Cannot change the language/only one language.
 Chatbot escaping the topic due to bad intention reading.
 Chatbot returning bad views / knowledge calculation (only fixed responses allowed).
 Chatbot terrible voice synthesis (can pre-record all fixed responses including randomised duplicate recordings).
 Not being white box/solid state/predictable.
Title: Re: Pattern based NLP for games
Post by: MikeB on May 24, 2020, 12:48:43 pm
I will be working on it throughout this year and aim to make it into an API at least by the end of the year...

So far I have tested the Plain C version on the 2017/2018 Loebner prize questions...

All the words in each question can be changed for slightly different words and still return the same response (randomised differently).. If the same test is run again the responses could change to their random alternate versions...

Quote
Loebner 2018.
   Good afternoon.   - "greetings"
   Do you consent to having this conversation recorded? - "for me, not a problem"
   Have we met before? - "cannot say, good to meet you"
   Do you know how to make toast? - "no, not sure about this" [toast]
   Which languages can you use? - "many many, english, espanol, italiano... what about you?" [only english]
   How many letters are in the word "abracadabra"? - "who am I? Pietro Bembo?"
   Where should one look for love? - "interesting! i don't know" [doesn't know "love"]
   How are you feeling right now? - "I'm very good, and how are you?"
   Who said "I have a dream"? - "no no quotes"
   How old are you? - "thousands... cannot say"
   When might I need to know how many times a wheel has rotated? - "who am I? Fibonacci?"
   What will you do later today? - "just came back from surfing, may go out again!"
   What is the third angle in a triangle with internal angles of 90 degrees and 30 degrees? - "no no maths"
   Who is your favourite artist? - "interesting! many many..."
   What do you hold when you shake hands? - "words are not my profession" [hands]
   Do you have any legs? - "no, words elude me" [legs]
   Can you recommend me a film? - "yes, thousands..." [doesn't know "film"]
   Do you understand Winograd Schemas? - "nay, confusing words" [doesn't know winograd schemas]
   If a chicken roosts with a fox they may be eaten. What may be eaten? - "OK. from memory... somebody..."
   I had to go to the toilet during the film because it was too long. What was too long? - "alright. alright. from memory... that thing..."
Title: Re: Pattern based NLP for games
Post by: ivan.moony on May 24, 2020, 09:01:48 pm
Sounds like a great improvement over current chatbot technology like AIML. What do you plan to do with it?
Title: Re: Pattern based NLP for games
Post by: 8pla.net on May 25, 2020, 12:13:13 am
C Language is a good choice, I think.
Title: Re: Pattern based NLP for games
Post by: MikeB on August 07, 2020, 06:49:29 am
Quote
Sounds like a great improvement over current chatbot technology like AIML. What do you plan to do with it?

I'll be trying to integrate it as an Unreal Asset and/or approach a few different people who already do chat interfaces... In some ways it's better than AIML (you don't have to choose between a menu reply system or 10,000 custom responses)... but in other ways it's not very flexible. You have the ~100 fixed phrases, but they must be an alternative of one of the preprogrammed ones... and there's a section for custom responses, but the input is choosing one of the fixed intentions/topics/perspectives and the output is one of the fixed ~100 phrases.

So you couldn't talk specifically about a product or idea. You'd use a secondary bot that has a list of all the keywords you're looking for, then you could join the intention with those.
Title: Re: Pattern based NLP for games
Post by: MikeB on August 07, 2020, 06:56:27 am
Quote
C Language is a good choice, I think.

It compiled tiny in C, but I had to move to C++ now to make a windows DLL and get 16-bit wide chars. 400kb  :(
Title: Re: Pattern based NLP for games
Post by: squarebear on August 07, 2020, 08:35:57 am
Quote
The size and speed in pandora bots (1000 individual words with sentences) is ~500kb, and 1 to 2 seconds response time.

I've not found such a delay. I have a bot with over 350,000 categories and it responds almost instantly. www.kuki.bot
Perhaps you are using AIML in a non standard way?
Title: Re: Pattern based NLP for games
Post by: 8pla.net on August 07, 2020, 01:14:40 pm
Quote
C Language is a good choice, I think.

It compiled tiny in C, but I had to move to C++ now to make a windows DLL and get 16-bit wide chars. 400kb  :(

Do both then,  C Language and C++...  You may as well.  They are compatible.

And, I would suggest making a Linux version, too, like ChatScript has.


Title: Re: Pattern based NLP for games
Post by: MikeB on September 15, 2020, 09:22:31 am
Quote
The size and speed in pandora bots (1000 individual words with sentences) is ~500kb, and 1 to 2 seconds response time.
I've not found such a delay. I have a bot with over 350,000 categories and it responds almost instantly. www.kuki.bot
Perhaps you are using AIML in a non standard way?

I used about 2000 categories, but it re-searches several times. So 10 words can be 2000 x 5 x 10. If it's only 5 words or less it's instant....
Title: Re: Pattern based NLP for games
Post by: MikeB on September 15, 2020, 10:13:19 am
Recompiled to C++ DLL, C++ windows console (8bit standard english characters). 250kb

Approx 500 spellcheck words, 1200 words, 100 symbolic sentences, 50 chatbot recognised intentions, 50 chatbot fixed english phrases

1ms response time.

In the image below, the chatbot response is wrong (picking up general "how is your *" instead of "how are you"), but this is what it's like as a demo.

"explain is I/you motion-moving logic-direct" are the uncompressed symbols. One per word...

It's still basically an I-Don't-Know Bot, but the instant intention pickup is useful. You can still talk ON the topic/intention... and the ~50 fixed output phrases means it can all be voice recorded...

(https://i.ibb.co/hZ8HRN8/debug.jpg)
Title: Re: Pattern based NLP
Post by: MikeB on September 25, 2020, 09:29:25 am
Updated the word searching in Misspelled Words and Tokenise Word lists for a faster way of doing it.

The old way was scrolling through every character in the input sentence for each of the words in the 500 - 1300 word lists.

The new way is basically how people do it:
First: Look at the start character.
Second: Look at the length of the word.
Third: Look at the last character.
Fourth: Is it only one character long?
Fifth: Check every character from 2nd to the last.

You break out of (or continue;) the loop if any one of those fails. On average it's something like a 1 in 26 shot for the first, 1 in 5 for the second, 1 in 5 for the third, 1 in 5 for the fourth...
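The five checks could look roughly like this. It is only a sketch: `wordMatches` and `findWord` are made-up names, and calling `strlen` inside the loop is for brevity (a real word list would more likely store the lengths up front).

```cpp
#include <cstddef>
#include <cstring>

// True if the candidate equals the dictionary word, rejecting as early as possible.
bool wordMatches(const char* cand, std::size_t candLen,
                 const char* word, std::size_t wordLen) {
    if (candLen == 0 || wordLen == 0) return false;
    if (cand[0] != word[0]) return false;                       // 1: start character
    if (candLen != wordLen) return false;                       // 2: length
    if (cand[candLen - 1] != word[wordLen - 1]) return false;   // 3: last character
    if (candLen == 1) return true;                              // 4: only one character long
    // 5: the characters in between (the last one was already checked in step 3)
    return std::memcmp(cand + 1, word + 1, candLen - 2) == 0;
}

// Linear scan over a word list; returns the index of the match or -1.
int findWord(const char* cand, std::size_t candLen,
             const char* const* list, std::size_t count) {
    for (std::size_t i = 0; i < count; ++i) {
        if (wordMatches(cand, candLen, list[i], std::strlen(list[i])))
            return static_cast<int>(i);
    }
    return -1;
}
```

Most candidates fail on the first or second check, so the full character-by-character comparison only runs for near matches.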

Seemed to double or triple the speed. A 20 word sentence (2-3ms) now takes 0-1 ms.

Can't have spaces in the words though so will have to make a short "Catchphrase" word list.
Title: Re: Pattern based NLP
Post by: MikeB on October 12, 2020, 09:16:23 am
Decided to make it into a full NLP including Thesaurus, Sentiment (like/dislike), Email Spam, Aggressive language detection as well as the Chatbot.

Here's the Thesaurus. Everything takes 0-1ms.

It's fast because the words are already categorised in groups with each other, so it's just a reverse look-up. However it does still need some topic searching because some groups have over 50 words.
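A rough sketch of the reverse look-up, assuming each word carries a group symbol plus a secondary topic tag. The `Entry` layout, the tag letters, and `demoDict` are invented for the example.

```cpp
#include <cstddef>
#include <string>
#include <vector>

// Hypothetical dictionary entry: word, group symbol, and a secondary topic tag.
struct Entry {
    std::string word;
    char group;
    char topic;
};

// Illustrative data only.
std::vector<Entry> demoDict() {
    return {{"great", 'L', 'q'},  {"fantastic", 'L', 'q'}, {"superb", 'L', 'q'},
            {"pretty", 'L', 'v'}, {"offer", 'B', 's'},     {"deal", 'B', 's'}};
}

// Find the word, then return other words sharing its group and topic.
std::vector<std::string> thesaurus(const std::vector<Entry>& dict,
                                   const std::string& word,
                                   std::size_t maxResults) {
    const Entry* hit = nullptr;
    for (const auto& e : dict) {
        if (e.word == word) { hit = &e; break; }
    }
    std::vector<std::string> out;
    if (hit == nullptr) return out;  // unknown word: no alternatives
    for (const auto& e : dict) {
        if (out.size() >= maxResults) break;
        if (e.group == hit->group && e.topic == hit->topic && e.word != word)
            out.push_back(e.word);
    }
    return out;
}
```

The topic tag is what keeps a 50+ word group from returning everything in it.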

(https://i.ibb.co/W6RX1Gg/debug2.jpg)
Title: Re: Pattern based NLP
Post by: MikeB on October 26, 2020, 05:50:21 am
Here's an example of the Spam detection and differentiation.

The differentiation is between the phrases:
"Do not miss out"
"Do not miss out on great fun"
"Do not miss out on great offers"

The Thesaurus also shows all the alternative words ( max 8 ) that could have been used to output the same thing.

The Chatbot's response is:
"For what purpose?"
"Ok. Not a problem."
"Gah, no selling. I'm not buying."

Still shifting the words around into different categories. There's now 1500 words (+300).

Also the word searching has been changed again to just "quick search" the first letter (using as few instructions as possible in a tight loop, so it can move onto the next fast), before searching the rest of the word. Also using a rebellious "goto" command to get to the next iteration faster.

Next: Chatbot (Alternate Language output), Language Translate, Tone/Harassment identification.

(https://i.ibb.co/Kzrdqdn/debug3-Spam-Thesaurus.jpg)
Title: Re: Pattern based NLP
Post by: MikeB on November 26, 2020, 08:28:32 am
Working on a new utility to handle entry into the Chatbot Decisions file (Handles input from the NLP as tokens I T T P, and outputs S S S speech tokens).

(https://i.ibb.co/TqQZVDx/debug-Utility.png)
Title: Re: Pattern based NLP
Post by: MikeB on December 14, 2020, 05:29:14 am
Added a Start Page/Test Page to the utility.

The NLP processing/debug itself isn't changeable in the utility (word symbolising), that's still left to the console app. The utility is for setting up Chatbots, and some separate Spam and Tone options not related to the chatbot.

Spam detection is symbol based not literal, so synonyms of the word "offers" are all detected together, not just single words. This is multi-language as well.

The Thesaurus is a simple reverse lookup on word and a secondary word-topic so there's no setup apart from how many words to return.

Tone detection is an output of approx 10 levels from light patronising/grooming/objectifying to "i hate everything, all x's are x". Tested this in an early alpha version but is not implemented in the nlp and utility yet.

(https://i.ibb.co/GtqQqDh/debug-Utility2.png)
Title: Re: Pattern based NLP
Post by: MikeB on January 06, 2021, 08:02:43 am
Recently added both:
-Tone (9 levels - 3 negative, 3 positive, 3 grooming behaviour/patronising)
-WiC Challenge test (Words in Context - https://pilehvar.github.io/wic/)

The WiC test is one of the few NLP tests that can actually be done on this pattern based NLP, as it's not specifically prediction or knowledge based.

The WiC test (training data & results) is ~5500 lines. It completes in only 2 seconds (1980ms-2000ms), however many of the lines include deep knowledge or some other non-literal meaning to trick everyone, including people, so it'll also trick this NLP... The human score is only 80%. Most NLP's get 60-75%.

Most of the NLP set up is complete, so this year I'll be adding words & sentences in order to get through this test... 

O0

(https://i.ibb.co/txrxM0Q/debug-Utility3.png)
Title: Re: Pattern based NLP
Post by: ivan.moony on January 06, 2021, 09:42:45 am
Hi MikeB :)

May I ask, how do you derive answers to the tests?
Title: Re: Pattern based NLP
Post by: MikeB on January 07, 2021, 03:35:29 pm
Quote
Hi MikeB :)

May I ask, how do you derive answers to the tests?

Hi Ivan, I ignore the selected word that the test says to match, altogether, and just look to see if the underlying intention is the same.

In the line "He wore a jock strap with a metal cup. Bees filled the waxen cups with honey."... the word "cup" means the same. A traditional NLP would see if "metal cup" and "waxen cup" means the same based on knowledge linking, but in the pattern matching NLP I just look to see if the basic underlying intention is the same. So both of these sentences would come under "Person describing" with sub tags "clothing, material,..." and some others. If one sentence was a catchphrase or greatly different then it would return not a match.

Another example... "I try to avoid the company of gamblers. We avoided the ball."...the word "avoid" means the same. Both have the intention "Person explain", so this would return true.

It should get at least 60% doing it this way. There is a way to add catchphrases to get a few more, and some other things I can do with tags. Trying to keep real knowledge linking and deducing as far away as possible...
Title: Re: Pattern based NLP
Post by: MikeB on January 14, 2021, 02:53:06 pm
Just comparing the two Intentions isn't working out too well. Going to start a specialised way of doing it (still without knowledge) by looking at the words before & after the selected word.
Title: Re: Pattern based NLP
Post by: MikeB on February 01, 2021, 07:27:24 am
Restructured the WiC / Word in Context test to look at the words before & after the indicated word, similar to how we do it.

A brief overview...

1) Both sentences are formatted (look for odd symbols, double spaces, spelling, words spaced out like "h e l l o", extended laughing "hehehehe...").
2) Pattern-match each word to a predefined symbol from a list (only ~20 different symbols total, out of ~2300 english words. No stemming.).
3) Analyse WiC:
 a) Input: Both sentences, the 'lookup word', and both locations of the word.
 b) WiC function: Check the 'lookup word' (now a symbol shared with ~100 similar words) exists in the WiC / sentence compatibility table (~50-100 entries).
 c) WiC function: If at least one match, check all other words. Highest word count (3-5 words) is selected as a match. Remember compatibility ID. Now check second sentence for a match. Return match true/false.
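One possible reading of step 3, as a sketch only. It assumes each table entry is a short symbol pattern with a marked position for the lookup word, and that entries sharing a compatibility ID count as the same context. `WicEntry`, `anchor`, and the symbol letters in the usage note are made up; the real table layout isn't shown here.

```cpp
#include <cstddef>
#include <string>
#include <vector>

// Hypothetical compatibility-table entry.
struct WicEntry {
    int compatId;         // entries sharing an id count as the same context
    std::string pattern;  // 3-5 symbols, one per word
    std::size_t anchor;   // index in pattern where the lookup word's symbol sits
};

// True if the pattern lines up with the sentence symbols when its anchor
// is placed on the lookup word at wordPos.
bool entryMatches(const WicEntry& e, const std::string& sent, std::size_t wordPos) {
    if (e.anchor > wordPos) return false;
    const std::size_t start = wordPos - e.anchor;
    if (start + e.pattern.size() > sent.size()) return false;
    return sent.compare(start, e.pattern.size(), e.pattern) == 0;
}

// Compatibility id of the longest matching entry (highest word count), or -1.
int bestCompatId(const std::vector<WicEntry>& table,
                 const std::string& sent, std::size_t wordPos) {
    int best = -1;
    std::size_t bestLen = 0;
    for (const auto& e : table) {
        if (e.pattern.size() > bestLen && entryMatches(e, sent, wordPos)) {
            best = e.compatId;
            bestLen = e.pattern.size();
        }
    }
    return best;
}

// Remember the id found for sentence 1, then check sentence 2 against it.
bool sameContext(const std::vector<WicEntry>& table,
                 const std::string& s1, std::size_t pos1,
                 const std::string& s2, std::size_t pos2) {
    const int id = bestCompatId(table, s1, pos1);
    if (id < 0) return false;
    for (const auto& e : table) {
        if (e.compatId == id && entryMatches(e, s2, pos2)) return true;
    }
    return false;
}
```

For example, with made-up symbols M (motion), D (direction), P (person), V (verb), A (article), T (thing), "Come out of the closet" becomes `MDDAT` and "He came singing down the road" becomes `PMVDAT`. Two entries `MDD` and `PMVD` sharing one ID would report the same context even though one sentence is an instruction and the other a description.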

This is much more detailed than just checking the intention, as it can pick up the same context even if one sentence is an "instruction" and the other is a "person describing". EG. "come/came" (1) "Come out of the closet" (2) "He came singing down the road".

I got the time down from 2000-2600ms, to ~1400ms by removing most of the pre-formatting and only keeping 'Double Space' check as the test is already formatted...

Score is not worthy of publishing because I've only checked about 100 of the ~5500 records! A lot of sentences are reused though so shouldn't have to check all of them.

(https://i.ibb.co/6Xdr188/debug-Utility4.png)
Title: Re: Pattern based NLP
Post by: MikeB on March 01, 2021, 09:49:25 am
Still working on the WIC test.

Making progress of about 0.1% per day. (20-30% to go)

There's now 3700 words (+1400). 900 WIC pattern sentences (+800). Re-added spell-checking, so the full WIC test takes about 2.5 seconds to complete.

The scale and pickup is actually immense. Each of the 900 WIC pattern sentences has 3-6 "Symbolic Words". Each Symbolic Word represents 10-500 words. So each of the 900 WIC sentences actually picks up 500,000 - 20,000,000 variations.

Many times I add 10-20 WIC patterns (~100,000,000 word-sentence variations) and it only picks up one solitary record in the 5428 record WIC test... So the test is basic... but the word formatting is still broad enough that you can't just cheese the test.

Another problem is lack of words... I'm estimating I'll need at least 5000-7000 total to get a good result, and all these are hand entered in specific categories , so it's going to take some months...

One side effect is that I'm probably going to drop the old "Intention" categories I used to use for the chatbot and use these new WIC categories instead as it picks up an interesting variety. There are about 50 different groups (will be merging some) along the lines of:
"person or thing started to move / person or thing has him..."
"the object/concept of a had-thing"
"had the concept when..."
"a motion was taken / apply a rule / have-take the concept-chance to..."
"i play/avoid the / objects moved/ordered/fell to the"
"logic-action an object"
"moving-action the object"
"an object of objects / vivid objects/objectives of"

So these will be better in chatbot programming.
Title: Re: Pattern based NLP
Post by: MikeB on March 05, 2021, 08:51:40 am
Sped up the processing thanks to Infurl's suggestion of adding Binary Searches.

Huge results.

Added to Spell Checking (800 words), and Word-token assignment (3700 words).

The original lists are unsorted, so they are hashed & sorted in program. (Hashed by ASCII adding.) There are typically 0-5 duplicate hash ids/collisions so the correct matches are checked letter-by-letter as well.
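Roughly what the hash/sort/binary-search step could look like, as a sketch. `HashedWord`, `buildIndex`, and `lookup` are illustrative names, and `std::lower_bound` stands in for a hand-written binary search.

```cpp
#include <algorithm>
#include <cstdint>
#include <string>
#include <utility>
#include <vector>

struct HashedWord {
    std::uint32_t hash;
    std::string word;
    char symbol;
};

// "ASCII adding": the hash is just the sum of the character codes.
std::uint32_t asciiSum(const std::string& w) {
    std::uint32_t h = 0;
    for (unsigned char c : w) h += c;
    return h;
}

// Hash every word once, then sort by hash so the list can be binary searched.
std::vector<HashedWord> buildIndex(
        const std::vector<std::pair<std::string, char>>& words) {
    std::vector<HashedWord> index;
    index.reserve(words.size());
    for (const auto& w : words)
        index.push_back({asciiSum(w.first), w.first, w.second});
    std::sort(index.begin(), index.end(),
              [](const HashedWord& a, const HashedWord& b) { return a.hash < b.hash; });
    return index;
}

// Binary search to the first entry with this hash, then resolve
// collisions letter-by-letter. Returns '?' if the word is unknown.
char lookup(const std::vector<HashedWord>& index, const std::string& w) {
    const std::uint32_t h = asciiSum(w);
    auto it = std::lower_bound(index.begin(), index.end(), h,
        [](const HashedWord& e, std::uint32_t key) { return e.hash < key; });
    for (; it != index.end() && it->hash == h; ++it) {
        if (it->word == w) return it->symbol;
    }
    return '?';
}
```

Anagrams such as "act" and "cat" always collide under an additive hash, which is why the letter-by-letter check after the search is still needed.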

Processing 5428 lots of two sentences:
Before: 2600ms
After: 76ms of preparing. Hashing & sorting spelling and word list.
After: linear searching the hash lists: 1700ms (900ms faster)
After: binary searching the hash lists: 930ms (1670ms faster)

There are other processes, but for the spell/word search alone, Hashed/Linear seems to make it ~50% faster, and Hashed/Binary seems to make it ~90% faster.
Title: Re: Pattern based NLP
Post by: ivan.moony on March 05, 2021, 11:12:46 am
Great speedup! O0

And the good thing is that, using binary search, growing the search set doesn't slow things down on a linear scale, it slows them down on a logarithmic scale (that's almost as good as constant speed). The bigger the search set is, the more you see the difference between linear search and binary search.
Title: Re: Pattern based NLP
Post by: infurl on March 06, 2021, 02:17:16 am
Quote
The original lists are unsorted, so they are hashed & sorted in program. (Hashed by ASCII adding.) There are typically 0-5 duplicate hash ids/collisions so the correct matches are checked letter-by-letter as well.
...
After: 76ms of preparing. Hashing & sorting spelling and word list.

Pro-tip #2. There is no reason that you would have to do the preparation such as hashing and sorting at run-time. You could break out the portion of the code that does that preparation into a separate program which you run at compile time. This program does all the necessary preparation and then prints out all the data structures in a format that can be included by your final program and compiled in place into its final form. That will save you a chunk of time every time you run the actual program.
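As a rough illustration of the idea, a small generator could hash and sort the word list at build time and print a table that the final program simply includes. Everything here (`emitSortedTable`, `PreparedWord`, `kWords`, the additive hash) is a made-up example rather than code from either project, and it assumes plain words with no quotes or backslashes in them.

```cpp
#include <algorithm>
#include <cstdint>
#include <sstream>
#include <string>
#include <utility>
#include <vector>

// Same additive hash the word list is assumed to use at run-time.
std::uint32_t asciiSum(const std::string& w) {
    std::uint32_t h = 0;
    for (unsigned char c : w) h += c;
    return h;
}

// Build-time step: sort by hash and return C++ source text for a table
// that is already in search order, so nothing has to be prepared at start-up.
std::string emitSortedTable(std::vector<std::pair<std::string, char>> words) {
    std::stable_sort(words.begin(), words.end(),
        [](const std::pair<std::string, char>& a,
           const std::pair<std::string, char>& b) {
            return asciiSum(a.first) < asciiSum(b.first);
        });
    std::ostringstream out;
    out << "struct PreparedWord { unsigned hash; const char* word; char symbol; };\n";
    out << "static const PreparedWord kWords[] = {\n";
    for (const auto& w : words) {
        out << "    {" << asciiSum(w.first) << "u, \"" << w.first
            << "\", '" << w.second << "'},\n";
    }
    out << "};\n";
    return out.str();
}
```

The generator's output would be written to a header during the build and compiled into the main program, which can then binary search `kWords` directly.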

In my case I am parsing and processing millions of grammar rules which can take a considerable amount of time just to prepare. Although small grammars can be processed from start to finish at run-time, I have found it much faster to compile the different files that make up the grammar into intermediate partially processed files; these files in turn get loaded and merged into a final grammar definition which is then saved in source files that can be compiled and linked directly into my parser software, as well as a database format which can be loaded as a binary file at run-time.

That last feature has lots of advantages. The preprocessed files were so large that it was taking a long time just to compile them, but the best thing is that by separating the data files from the software, I can choose completely different processing options on the command line.
Title: Re: Pattern based NLP
Post by: MikeB on March 15, 2021, 07:08:02 am
That might be an idea.. If it gets longer than 500ms to load then I might do that...  Only expecting about 5000-7000 words, but if I add more languages it could take a while.

I'm trying to keep the data slightly linked-in to the software so that it's harder to work out how it works, but it seems like it's just written in the DLL in plain english anyway... so may end up separating them.
Title: Re: Pattern based NLP
Post by: MikeB on March 23, 2021, 08:38:35 am
I finally ran into a problem with the word-token grouping not being separated enough, so I'm redoing all the groups.

Originally I was using 6 Logical groups, 6 Emotive groups, 1 generic 'Possessive/having' group, and a bunch of others including Person (1st/2nd/3rd person). The past/present/future tense is included in the logical/emotive groups as well, and this leads to having to do double-ups in the WIC entries to include bad grammar ("I run away", "I running away", "I runned away", "I ran away")... I'd rather have them in the same group and use a past/present/future tag on the word to analyse later.... The context of "one person running" is the same (it's not "running a fridge"/operating - if it is, it's easy to pick up the extra words...).

The original theory is for 24 word groups, but 16 seems to be the best after I laid all the main keywords out. No prefix/suffix separation anymore...
4 Logical (concepts),
4 Emotive (everything that moves),
4 Burning (analytical/possessive/having/working),
4 Light (romantic/sense/pose/art).

The WIC entries should go down from ~800 to 200-400 with the same pickups and have more range... especially around the 'Burning' and 'Light' categories.
Title: Re: Pattern based NLP
Post by: MikeB on November 08, 2021, 08:33:01 am
I'm still working on this. I recently completed the concept for English words into the new Grammar categories.

There are:

Four main groups for Nouns, verbs (actions), and adjectives (modifiers). The groups are: Moving/living things, Analytical/laws/concepts, Logical subparts/binary actions, Light sense/stories/beautiful terms. (IE. The Four elemental/original groups each contain three sub groups - Noun, Verb, Adjective).

Seven other groups: Articles/Quantifier, Person/Agent, Question/Interrogative, Time-spatial, Direction-spatial, Conjunction/sentence breakers, Exclamation/grunt/hi/bye.

All Eleven groups are also encoded with present/future/past and precise/optimistic/explaining at the same time. IE All Present things are Precise, all Future things are Optimistic, all Past things are Explaining. In the Four main/elemental groups: Nouns are Precise/Present, Verbs are Future/Optimistic, Adjectives are Past/Explaining.

The last thing that breaks all grammar common sense is that each word is only permitted to be in one category only. So words like "brush" must either be an action (verb) or the name of a thing (noun). The default is to be an action (verb) as nouns aren't heavily relied on in sentence matching.
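One way the four main groups could be packed into a single byte per word, purely as a sketch. The bit layout and the enum names are assumptions, and the seven other groups are left out.

```cpp
#include <cstdint>

// The four main/elemental groups.
enum class Element : std::uint8_t { Moving = 0, Analytical = 1, Logical = 2, Light = 3 };
// The three sub groups inside each main group.
enum class Part : std::uint8_t { Noun = 0, Verb = 1, Adjective = 2 };
// Tense implied by the sub group: Noun -> Present, Verb -> Future, Adjective -> Past.
enum class Tense : std::uint8_t { Present = 0, Future = 1, Past = 2 };

// Assumed layout: bits 2-3 hold the element, bits 0-1 hold the part.
constexpr std::uint8_t makeToken(Element e, Part p) {
    return static_cast<std::uint8_t>((static_cast<unsigned>(e) << 2) |
                                     static_cast<unsigned>(p));
}

constexpr Element elementOf(std::uint8_t token) {
    return static_cast<Element>((token >> 2) & 0x3);
}

constexpr Part partOf(std::uint8_t token) {
    return static_cast<Part>(token & 0x3);
}

// The enum values line up on purpose, so the tense falls straight out of the part.
constexpr Tense tenseOf(std::uint8_t token) {
    return static_cast<Tense>(token & 0x3);
}
```

Since every word sits in exactly one category, a word like "brush" gets a single token, e.g. `makeToken(Element::Moving, Part::Verb)` if it were filed as a moving action (the choice of group here is only a guess), and its tense is read back with `tenseOf`.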

The whole concept is speed over quality... but as there's nothing for 3d environments between hard-written dialogue trees and GPT-3, this will sit right in between...

There are 3500 words to individually convert over so will take until the new year to do as I'm also looking at Speech Recognition software.

Speech Recognition software today uses Algorithms/Ngrams/NNs and is really slow (1-3 seconds response time) and uses a lot of power... The speed of my FSM/FST/binary NLP is 0.1ms to process a sentence (all words & intention)... So if the speech rec software is fast as well then it's more suited to 3d environments even if not as good....

Combining the NLP with speech rec is as simple as writing phonemes next to each word in the dictionary... If the user is speaking via voice then the word-text searching can be skipped altogether... it can go straight from voice phoneme->synonym symbols->pattern sentence pickup->intention grouping.

For audio processing I'm looking at OpenAL Soft (Open Audio Language) right now. There's nothing in the libraries for voice recognition, or even microphone low/high/band filter passing, but it's low-level enough to work on and have both speed and cross compatibility with other OS's.

The fastest approach I've seen is to take about 50ms of audio (shortest phonemes), generalise the pitch, then associate it with a phoneme (tuned to your accent). This takes about 1ms... but again, it sits in between the best and something hard-coded.

One of the benefits if it works is that a responding chatbot can completely vary the response time to suit the situation... including interrupting the user, which adds another layer of humanness, but depends on how well words & intentions are picked up.
Title: Re: Pattern based NLP
Post by: MikeB on November 24, 2021, 03:06:58 pm
Update on audio speech recognition.

Traditional speech recognition uses frequency transforms (FFT/DCT/DTT) to decode audio into voice phonemes. These capture 3 voice formants (frequency ranges specific to a phoneme) from one 'signature'. However these use nested loops and are slow to process. DTT is the fastest but I want to try it another way...

Most spoken phonemes have a range of different frequency areas combined to make the sound - bass/warmness, middle range, high range. EG. "oh" is mostly bass. "ee" bass-middle. "ss" high.

The way I want to try is separating common frequency ranges first initially, then measure the power & complexity afterwards to tell if one range is loud/complex versus the others.

Separating the frequency ranges (band-passing) can be done in real time using just a few instructions, using pre-calculated IIR filters (http://www.schwietering.com/jayduino/filtuino/index.php). FIR filters are better quality but slow.
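A sketch of one pre-calculated IIR band-pass plus a power measurement over a frame. The coefficients follow the widely used audio "cookbook" biquad band-pass formulas rather than the filtuino output, and the names `Biquad`, `makeBandPass`, `bandPower`, and `sineFrame` are made up. Only power is measured here, not "complexity".

```cpp
#include <cmath>
#include <cstddef>
#include <vector>

// Second-order IIR section with coefficients normalised so that a0 == 1.
struct Biquad {
    double b0 = 0, b1 = 0, b2 = 0, a1 = 0, a2 = 0;  // pre-calculated once
    double x1 = 0, x2 = 0, y1 = 0, y2 = 0;          // filter state

    // One sample in, one sample out: a handful of multiply-adds.
    double step(double x) {
        const double y = b0 * x + b1 * x1 + b2 * x2 - a1 * y1 - a2 * y2;
        x2 = x1; x1 = x;
        y2 = y1; y1 = y;
        return y;
    }
};

// Band-pass with unity gain at centerHz (cookbook-style biquad).
Biquad makeBandPass(double centerHz, double q, double sampleRate) {
    const double kPi = 3.14159265358979323846;
    const double w0 = 2.0 * kPi * centerHz / sampleRate;
    const double alpha = std::sin(w0) / (2.0 * q);
    const double a0 = 1.0 + alpha;
    Biquad f;
    f.b0 = alpha / a0;
    f.b1 = 0.0;
    f.b2 = -alpha / a0;
    f.a1 = -2.0 * std::cos(w0) / a0;
    f.a2 = (1.0 - alpha) / a0;
    return f;
}

// Mean squared output of the filter over one frame (filter copied, so state starts at zero).
double bandPower(Biquad filter, const std::vector<double>& frame) {
    if (frame.empty()) return 0.0;
    double sum = 0.0;
    for (double x : frame) {
        const double y = filter.step(x);
        sum += y * y;
    }
    return sum / static_cast<double>(frame.size());
}

// Test helper: n samples of a unit-amplitude sine.
std::vector<double> sineFrame(double freqHz, double sampleRate, std::size_t n) {
    const double kPi = 3.14159265358979323846;
    std::vector<double> out(n);
    for (std::size_t i = 0; i < n; ++i)
        out[i] = std::sin(2.0 * kPi * freqHz * static_cast<double>(i) / sampleRate);
    return out;
}
```

Running a 20ms frame (320 samples at an assumed 16 kHz) through a few of these and comparing the `bandPower` values gives a cheap picture of which range is loud.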

There is 20ms in between recorded audio frames to process the data, so I'm aiming to get both phoneme and NLP processing out of the way in 1-10ms, using the same thread as the one capturing the data.

This is some captured data for the word "wikipedia". The asterisks (*) represent good power & complexity levels versus background noise.

Currently there's little noise filtering and the band-pass filters need tightening up, but eventually if the results are strongly reproducible then they can be added to tables as base values...

(https://i.ibb.co/D4N3wWg/waveform-wikipedia.png)
Title: Re: Pattern based NLP
Post by: MikeB on November 26, 2021, 06:17:06 am
An update.

I changed the IIR filters to Resonators centered around 600hz, 1250hz, and 3150hz and now have double the signal-to-noise with more stable numbers.
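A sketch of a small resonator bank at those three centre frequencies, using a textbook two-pole resonator normalised to unity gain at its centre. The pole radius `r = 0.98`, the 16 kHz sample rate in the usage note, and the function names are assumptions, not the filters actually used.

```cpp
#include <array>
#include <cmath>
#include <cstddef>
#include <vector>

const double kPi = 3.14159265358979323846;
const std::array<double, 3> kCentersHz = {{600.0, 1250.0, 3150.0}};

// Mean squared output of a two-pole resonator over one frame.
// y[n] = g*x[n] + 2*r*cos(w0)*y[n-1] - r*r*y[n-2]
double resonatorPower(const std::vector<double>& frame, double centerHz,
                      double sampleRate, double r = 0.98) {
    if (frame.empty()) return 0.0;
    const double w0 = 2.0 * kPi * centerHz / sampleRate;
    const double a1 = 2.0 * r * std::cos(w0);
    const double a2 = r * r;
    // Gain chosen so the response at the centre frequency is 1.
    const double g = (1.0 - r) * std::sqrt(1.0 - 2.0 * r * std::cos(2.0 * w0) + r * r);
    double y1 = 0.0, y2 = 0.0, sum = 0.0;
    for (double x : frame) {
        const double y = g * x + a1 * y1 - a2 * y2;
        y2 = y1;
        y1 = y;
        sum += y * y;
    }
    return sum / static_cast<double>(frame.size());
}

// Index (0, 1, 2) of the resonator with the most power in this frame.
int loudestBand(const std::vector<double>& frame, double sampleRate) {
    int best = 0;
    double bestPower = -1.0;
    for (std::size_t i = 0; i < kCentersHz.size(); ++i) {
        const double p = resonatorPower(frame, kCentersHz[i], sampleRate);
        if (p > bestPower) {
            bestPower = p;
            best = static_cast<int>(i);
        }
    }
    return best;
}

// Test helper: n samples of a unit-amplitude sine.
std::vector<double> sineFrame(double freqHz, double sampleRate, std::size_t n) {
    std::vector<double> out(n);
    for (std::size_t i = 0; i < n; ++i)
        out[i] = std::sin(2.0 * kPi * freqHz * static_cast<double>(i) / sampleRate);
    return out;
}
```

Each resonator costs three multiplies per sample, so going to around 10 of them per 20ms frame (320 samples at 16 kHz) stays cheap.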

This amplifies the signal for certain types of sounds, but in order for this to work I feel like I need about 10 filters, centered around different frequencies.

One FIR filter or a Fast Fourier Transform (the normal approaches to speech rec) is approx 50-100x slower than one pre-calculated IIR resonator filter, so there's plenty of room...

Signal is only 12.5%-25% over noise background, and you need to speak close to the mic, so SNR needs to be improved by at least twice again to work...

(https://i.ibb.co/RBSH63F/waveform-wikipedia.png)