Pattern based NLP

*

MikeB

  • Nomad
  • ***
  • 92
Re: Pattern based NLP
« Reply #15 on: January 06, 2021, 08:02:43 am »
Recently added both:
-Tone (9 levels - 3 negative, 3 positive, 3 grooming behaviour/patronising)
-WiC Challenge test (Words in Context - https://pilehvar.github.io/wic/)

The WiC test is one of the few NLP tests that can actually be done on this pattern based NLP, as it's not specifically prediction or knowledge based.

The WiC test (training data & results) is ~5500 lines. It completes in only 2 seconds (1980-2000ms). However, many of the lines rely on deep knowledge or some other non-literal meaning that trips up everyone, including people, so they'll trip up this NLP too... The human score is only 80%. Most NLPs get 60-75%.

Most of the NLP set up is complete, so this year I'll be adding words & sentences in order to get through this test... 



*

ivan.moony

  • Trusty Member
  • ************
  • Bishop
  • *
  • 1509
    • contrast-zone
« Reply #16 on: January 06, 2021, 09:42:45 am »
Hi MikeB :)

May I ask, how do you derive answers to the tests?
There exist some rules interwoven within this world. As much as it is a blessing, so much it is a curse.

*

MikeB

« Reply #17 on: January 07, 2021, 03:35:29 pm »
Quote from ivan.moony: "Hi MikeB :) May I ask, how do you derive answers to the tests?"

Hi Ivan, I ignore the selected word that the test says to match altogether, and just check whether the underlying intention of the two sentences is the same.

Take the pair "He wore a jock strap with a metal cup. Bees filled the waxen cups with honey."... the word "cup" means the same. A traditional NLP would check whether "metal cup" and "waxen cup" mean the same based on knowledge linking, but in the pattern-matching NLP I just check whether the basic underlying intention is the same. Both of these sentences would come under "Person describing" with sub tags "clothing, material,..." and some others. If one sentence was a catchphrase or greatly different, it would return no match.

Another example... "I try to avoid the company of gamblers. We avoided the ball."...the word "avoid" means the same. Both have the intention "Person explain", so this would return true.

It should get at least 60% doing it this way. There is a way to add catchphrases to get a few more, and some other things I can do with tags. Trying to keep real knowledge linking and deducing as far away as possible...

*

MikeB

« Reply #18 on: January 14, 2021, 02:53:06 pm »
Just comparing the two Intentions isn't working out too well. Going to start a specialised way of doing it (still without knowledge) by looking at the words before & after the selected word.

*

MikeB

« Reply #19 on: February 01, 2021, 07:27:24 am »
Restructured the WiC / Word in Context test to look at the words before & after the indicated word, similar to how people do it.

A brief overview...

1) Both sentences are formatted (check for odd symbols, double spaces, spelling, words spaced out like "h e l l o", extended laughing "hehehehe...").
2) Pattern-match each word to a predefined symbol from a list (only ~20 different symbols in total, covering ~2300 English words. No stemming.).
3) Analyse WiC:
 a) Input: Both sentences, the 'lookup word', and the location of the word in each sentence.
 b) WiC function: Check that the 'lookup word' (now a symbol shared with ~100 similar words) exists in the WiC / sentence compatibility table (~50-100 entries).
 c) WiC function: If there is at least one match, check all the other words. The entry with the highest matching word count (3-5 words) is selected as the match. Remember its compatibility ID. Now check the second sentence for a match. Return match true/false.
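
For readers who want something concrete, here is a rough Python sketch of steps 2 and 3 as I read them. It is not MikeB's code: the symbol names, the tiny word list, the pattern table and the function names are all made up for illustration, and it assumes that "check second sentence for a match" means matching against an entry with the same compatibility ID.

```python
# Hypothetical word -> symbol list (step 2). The real list maps thousands of
# words to ~20 symbols; these few entries are invented for the example.
WORD_TO_SYMBOL = {
    "he": "PERSON", "we": "PERSON",
    "come": "MOVE", "came": "MOVE",
    "singing": "ACTION",
    "out": "DIRECTION", "down": "DIRECTION",
    "of": "OF", "the": "DET",
    "closet": "OBJECT", "road": "OBJECT",
}

# Hypothetical compatibility table (step 3b): (compatibility ID, symbol pattern).
PATTERNS = [
    (1, ("MOVE", "DIRECTION", "OF", "DET", "OBJECT")),
    (1, ("MOVE", "ACTION", "DIRECTION", "DET", "OBJECT")),
]

def to_symbols(sentence):
    """Step 2: replace each word with its symbol (no stemming)."""
    words = (w.strip('.,!?"').lower() for w in sentence.split())
    return [WORD_TO_SYMBOL.get(w, "UNKNOWN") for w in words]

def best_match(symbols, pos, only_id=None):
    """Step 3c: longest pattern that covers the lookup word at index pos.

    Returns (compatibility ID, pattern length) or None.
    """
    best = None
    for compat_id, pattern in PATTERNS:
        if only_id is not None and compat_id != only_id:
            continue
        if symbols[pos] not in pattern:      # step 3b: lookup symbol must be in the entry
            continue
        n = len(pattern)
        # Try every window of length n that contains position pos.
        for start in range(max(0, pos - n + 1), min(pos, len(symbols) - n) + 1):
            if tuple(symbols[start:start + n]) == pattern:
                if best is None or n > best[1]:
                    best = (compat_id, n)
    return best

def same_context(sentence1, pos1, sentence2, pos2):
    """Return True if both sentences match an entry with the same compatibility ID."""
    first = best_match(to_symbols(sentence1), pos1)
    if first is None:
        return False
    return best_match(to_symbols(sentence2), pos2, only_id=first[0]) is not None

print(same_context("Come out of the closet", 0, "He came singing down the road", 1))
```

With this toy table the "come/came" pair is reported as the same context, while a sentence whose lookup word has no symbol in the table (for example "We avoided the ball") is reported as not matching.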

This is much more detailed than just checking the intention, as it can pick up the same context even if one sentence is an "instruction" and the other is a "person describing". E.g. for "come/came": (1) "Come out of the closet" (2) "He came singing down the road".

I got the time down from 2000-2600ms to ~1400ms by removing most of the pre-formatting and keeping only the 'Double Space' check, as the test data is already formatted...

Score is not worthy of publishing because I've only checked about 100 of the ~5500 records! A lot of sentences are reused though so shouldn't have to check all of them.


*

MikeB

« Reply #20 on: March 01, 2021, 09:49:25 am »
Still working on the WIC test.

Making progress of about 0.1% per day. (20-30% to go)

There's now 3700 words (+1400). 900 WIC pattern sentences (+800). Re-added spell-checking, so the full WIC test takes about 2.5 seconds to complete.

The scale and pickup is actually immense. Each of the 900 WIC pattern sentences has 3-6 "Symbolic Words". Each Symbolic Word represents 10-500 words. So each of the 900 WIC sentences actually picks up 500,000 - 20,000,000 variations.
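
To show where numbers of that size come from: the count for one pattern sentence is just the product of the sizes of its Symbolic Word groups. The group sizes below are invented (the post only gives the 10-500 range), so treat this as a sketch of the arithmetic rather than real figures from the word list.

```python
import math

def variations(group_sizes):
    """Number of distinct word sequences one pattern covers:
    the product of the sizes of its symbolic-word groups."""
    return math.prod(group_sizes)

# Hypothetical 3-word pattern whose groups hold 100, 50 and 100 words.
small = variations([100, 50, 100])
# Hypothetical 4-word pattern whose groups hold 50, 20, 100 and 200 words.
large = variations([50, 20, 100, 200])

print(small)  # 500000
print(large)  # 20000000
```

Taking the extremes literally (3 groups of 10 words up to 6 groups of 500 words) would give anything from 1,000 to about 1.6e16, so the 500,000 - 20,000,000 figure presumably describes typical patterns rather than the full possible range.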

Many times I add 10-20 WIC patterns (~100,000,000 word-sentence variations) and it only picks up one solitary record in the 5428 record WIC test... So the test is basic... but the word formatting is still broad enough that you can't just cheese the test.

Another problem is the lack of words... I'm estimating I'll need at least 5000-7000 in total to get a good result, and all of these are hand-entered into specific categories, so it's going to take some months...

One side effect is that I'm probably going to drop the old "Intention" categories I used for the chatbot and use these new WIC categories instead, as they pick up an interesting variety. There are about 50 different groups (I will be merging some) along the lines of:
"person or thing started to move / person or thing has him..."
"the object/concept of a had-thing"
"had the concept when..."
"a motion was taken / apply a rule / have-take the concept-chance to..."
"i play/avoid the / objects moved/ordered/fell to the"
"logic-action an object"
"moving-action the object"
"an object of objects / vivid objects/objectives of"

So these will be better in chatbot programming.

*

MikeB

« Reply #21 on: March 05, 2021, 08:51:40 am »
Sped up the processing thanks to Infurl's suggestion of adding Binary Searches.

Huge results.

Added to Spell Checking (800 words), and Word-token assignment (3700 words).

The original lists are unsorted, so they are hashed & sorted in program. (Hashed by ASCII adding.) There are typically 0-5 duplicate hash ids/collisions so the correct matches are checked letter-by-letter as well.

Processing 5428 lots of two sentences:
Before: 2600ms
After: 76ms of preparing. Hashing & sorting spelling and word list.
After: linear searching the hash lists: 1700ms (900ms faster)
After: binary searching the hash lists: 930ms (1670ms faster)

There are other processes, but for the spell/word search alone, Hashed/Linear seems to make it ~50% faster, and Hashed/Binary seems to make it ~90% faster.
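
In case it helps anyone following along, this is roughly what the hashed + sorted + binary search lookup looks like in Python. It is only a sketch of the idea described above, not the actual program: the function names and the four example words are made up, and the hash is assumed to be a plain sum of character codes ("ASCII adding").

```python
from bisect import bisect_left

def ascii_sum(word):
    """Hash a word by adding up its character codes."""
    return sum(ord(c) for c in word)

def build_index(entries):
    """entries: dict of word -> token. Returns (hashes, rows), both sorted by hash."""
    rows = sorted((ascii_sum(word), word, token) for word, token in entries.items())
    hashes = [row[0] for row in rows]
    return hashes, rows

def lookup(hashes, rows, word):
    """Binary-search the hash, then compare the word itself to resolve collisions."""
    h = ascii_sum(word)
    i = bisect_left(hashes, h)          # first row with this hash, if any
    while i < len(rows) and rows[i][0] == h:
        if rows[i][1] == word:          # letter-by-letter check
            return rows[i][2]
        i += 1
    return None

# Hypothetical word list; "act" and "cat" collide because they share letters.
hashes, rows = build_index({"act": "LOGIC", "cat": "OBJECT", "run": "MOVE", "ran": "MOVE"})
print(lookup(hashes, rows, "cat"))   # OBJECT
print(lookup(hashes, rows, "tac"))   # None (same hash as "cat", but not in the list)
```

An additive hash collides for every anagram, which is why the word comparison after the binary search is needed.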

*

ivan.moony

« Reply #22 on: March 05, 2021, 11:12:46 am »
Great speedup!

And the good thing is that, with binary search, growing the search set doesn't slow lookups down linearly; it slows them down logarithmically (that's almost as good as constant time). The bigger the search set, the more you see the difference between linear search and binary search.
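
A quick way to see the difference is to count worst-case comparisons for the list sizes mentioned in this thread (800 spelling words, 3700 words now, ~7000 expected). This is just the textbook bound, sketched in Python, and says nothing about actual timings.

```python
def worst_case_comparisons(n):
    """Worst-case comparisons to find one item among n sorted items.

    Linear search may look at all n items. Binary search halves the range
    each step, needing ceil(log2(n + 1)) steps, which for n >= 1 equals
    the number of bits in n.
    """
    linear = n
    binary = n.bit_length()
    return linear, binary

for n in (800, 3700, 7000):
    print(n, worst_case_comparisons(n))
# 800 (800, 10)
# 3700 (3700, 12)
# 7000 (7000, 13)
```

Almost doubling the word list from 3700 to 7000 adds only one extra comparison per binary search.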
« Last Edit: March 05, 2021, 01:02:30 pm by ivan.moony »

*

infurl

  • Administrator
  • **********
  • Millennium Man
  • *
  • 1148
  • Humans will disappoint you.
    • Home Page
« Reply #23 on: March 06, 2021, 02:17:16 am »
Quote from MikeB:
"The original lists are unsorted, so they are hashed & sorted in program. (Hashed by ASCII adding.) There are typically 0-5 duplicate hash ids/collisions so the correct matches are checked letter-by-letter as well.
...
After: 76ms of preparing. Hashing & sorting spelling and word list."

Pro-tip #2. There is no reason that you would have to do the preparation such as hashing and sorting at run-time. You could break out the portion of the code that does that preparation into a separate program which you run at compile time. This program does all the necessary preparation and then prints out all the data structures in a format that can be included by your final program and compiled in place into its final form. That will save you a chunk of time every time you run the actual program.
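
As a small illustration of the idea (not infurl's or MikeB's actual tooling), a separate generator script can do the hashing and sorting once and print the finished table as source code. The sketch below uses Python for both the generator and the generated file; for a compiled program the generator would print an array in that program's language instead. The names `generate_table_source` and `WORD_TABLE` and the two-word list are made up.

```python
def ascii_sum(word):
    """Hash a word by adding up its character codes."""
    return sum(ord(c) for c in word)

def generate_table_source(entries):
    """Do the hashing and sorting now, and return source text that defines
    the finished, already-sorted table as a literal."""
    rows = sorted((ascii_sum(word), word, token) for word, token in entries.items())
    lines = ["# Generated file: do not edit by hand.", "WORD_TABLE = ["]
    lines += [f"    ({h}, {word!r}, {token!r})," for h, word, token in rows]
    lines.append("]")
    return "\n".join(lines) + "\n"

# Hypothetical word list. A build step would write this text to a file that
# the main program includes, so nothing is hashed or sorted at run-time.
source = generate_table_source({"run": "MOVE", "cat": "OBJECT"})
print(source)
```

The main program then only has to load a table that is already in search order.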

In my case I am parsing and processing millions of grammar rules which can take a considerable amount of time just to prepare. Although small grammars can be processed from start to finish at run-time, I have found it much faster to compile the different files that make up the grammar into intermediate partially processed files; these files in turn get loaded and merged into a final grammar definition which is then saved in source files that can be compiled and linked directly into my parser software, as well as a database format which can be loaded as a binary file at run-time.

That last feature has lots of advantages. The preprocessed files were so large that it was taking a long time just to compile them, but the best thing is that by separating the data files from the software, I can choose completely different processing options on the command line.

*

MikeB

« Reply #24 on: March 15, 2021, 07:08:02 am »
That might be an idea... If it takes longer than 500ms to load then I might do that. I'm only expecting about 5000-7000 words, but if I add more languages it could take a while.

I'm trying to keep the data slightly linked-in to the software so that it's harder to work out how it works, but it seems like it's just stored in the DLL in plain English anyway... so I may end up separating them.

*

MikeB

« Reply #25 on: March 23, 2021, 08:38:35 am »
I finally ran into a problem with the word-token grouping not being separated enough, so I'm redoing all the groups.

Originally I was using 6 Logical groups, 6 Emotive groups, 1 generic 'Possessive/having' group, and a bunch of others including Person (1st/2nd/3rd person). The past/present/future tense is built into the logical/emotive groups as well, which means doubling up the WIC entries to cover bad grammar ("I run away", "I running away", "I runned away", "I ran away")... I'd rather have them in the same group and put a past/present/future tag on the word to analyse later... The context of "one person running" is the same (it's not "running a fridge"/operating; if it is, that's easy to pick up from the extra words...).
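
Here is a minimal Python sketch of what "same group, tense as a tag" could look like. The group names, the tense labels and the tiny word list are invented for the example and are not the real groups.

```python
from typing import NamedTuple

class Token(NamedTuple):
    group: str   # word group used for pattern matching
    tense: str   # kept as a separate tag, analysed later

# Hypothetical word list: every form of "run" shares one group.
WORD_TOKENS = {
    "i": Token("PERSON_1ST", "none"),
    "run": Token("EMOTIVE_MOVE", "present"),
    "running": Token("EMOTIVE_MOVE", "present"),
    "runned": Token("EMOTIVE_MOVE", "past"),
    "ran": Token("EMOTIVE_MOVE", "past"),
    "away": Token("DIRECTION", "none"),
}

def groups(sentence):
    """Group sequence used for matching; tense plays no part in it."""
    return tuple(WORD_TOKENS[w].group for w in sentence.lower().split())

def tenses(sentence):
    """Tense tags found in the sentence, for later analysis."""
    return [WORD_TOKENS[w].tense for w in sentence.lower().split()
            if WORD_TOKENS[w].tense != "none"]

for s in ("I run away", "I running away", "I runned away", "I ran away"):
    print(groups(s), tenses(s))
```

All four sentences produce the same group sequence, so a single WIC entry covers them, and the tense is still available afterwards from the tag.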

The original theory was for 24 word groups, but 16 seems to be the best after I laid all the main keywords out. No prefix/suffix separation anymore...
4 Logical (concepts),
4 Emotive (everything that moves),
4 Burning (analytical/possessive/having/working),
4 Light (romantic/sense/pose/art).

The WIC entries should go down from ~800 to 200-400 with the same pickups and have more range... especially around the 'Burning' and 'Light' categories.

 

