I'm still working on this. I recently completed the concept for mapping English words into the new grammar categories.
There are:
Four main groups covering nouns, verbs (actions), and adjectives (modifiers). The groups are: moving/living things; analytical/laws/concepts; logical subparts/binary actions; light sense/stories/beautiful terms. (i.e. each of the four elemental/original groups contains three subgroups: noun, verb, adjective.)
Seven other groups: articles/quantifiers, person/agent, question/interrogative, time-spatial, direction-spatial, conjunctions/sentence breakers, exclamations/grunts/hi/bye.
All eleven groups are also encoded with present/future/past and precise/optimistic/explaining at the same time, i.e. all present things are precise, all future things are optimistic, and all past things are explaining. In the four main/elemental groups: nouns are precise/present, verbs are future/optimistic, adjectives are past/explaining.
The last thing that breaks all grammatical common sense is that each word is permitted to be in only one category. Words like "brush" must be either an action (verb) or the name of a thing (noun). The default is verb, as nouns aren't heavily relied on in sentence matching.
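To make the single-category rule concrete, here's a minimal Python sketch of how such a dictionary could look. The group names, word entries, and defaults are my own illustrative assumptions, not the actual lexicon — the point is just that tense/mood never need storing per word, since they follow from the part of speech:

```python
# Tense and mood are derived from the part of speech, per the encoding
# rule: noun=present/precise, verb=future/optimistic, adjective=past/explaining.
DERIVED = {
    "noun":      ("present", "precise"),
    "verb":      ("future",  "optimistic"),
    "adjective": ("past",    "explaining"),
}

# One category per word only. Group names are illustrative placeholders.
LEXICON = {
    "dog":   ("moving/living", "noun"),
    "law":   ("analytical",    "noun"),
    "brush": ("moving/living", "verb"),   # ambiguous words default to verb
    "shiny": ("light-sense",   "adjective"),
}

def classify(word):
    group, pos = LEXICON[word]
    tense, mood = DERIVED[pos]
    return group, pos, tense, mood
```

With this layout, every lookup is a single dictionary hit and two tuple unpacks — no disambiguation pass at all.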
The whole concept is speed over quality... but as there's nothing for 3D environments between hard-written dialogue trees and GPT-3, this will sit right in between...
There are 3,500 words to convert over individually, so it will take until the new year, as I'm also looking at speech recognition software.
Speech recognition software today uses algorithms/n-grams/NNs and is really slow (1-3 seconds response time) and uses a lot of power... The speed of my FSM/FST/binary NLP is 0.1 ms to process a sentence (all words & intention)... So if the speech recognition software is fast as well, then it's more suited to 3D environments even if it isn't as good...
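As a rough illustration of why an FSM/FST matcher can hit sub-millisecond times, here's a toy Python sketch: words are pre-mapped to small integer category IDs, and intention matching is just a walk through a transition table. The word list, table, and the single "command" intention are invented for the example:

```python
# Words pre-mapped to category IDs: 0=article, 1=noun, 2=verb (illustrative).
WORD_TO_CAT = {"open": 2, "the": 0, "door": 1}

# TRANSITIONS[state][category] -> next state; state 3 is the accepting
# state for a "command" intention (verb [article] noun).
TRANSITIONS = {
    0: {2: 1},          # start: expect a verb
    1: {0: 2, 1: 3},    # after verb: article or noun
    2: {1: 3},          # after article: noun
}

def match_intention(sentence):
    state = 0
    for word in sentence.lower().split():
        cat = WORD_TO_CAT.get(word)
        state = TRANSITIONS.get(state, {}).get(cat)
        if state is None:
            return None     # no transition: sentence doesn't match
    return "command" if state == 3 else None
```

Every step is a couple of hash lookups, which is why this style of matcher runs in microseconds rather than the seconds a full recognition pipeline takes.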
Combining the NLP with speech recognition is as simple as writing phonemes next to each word in the dictionary... If the user is speaking via voice, then the word-text searching can be skipped altogether... it can go straight from voice phonemes -> synonym symbols -> pattern sentence pickup -> intention grouping.
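A sketch of the skip-the-text idea: if each dictionary entry also stores its phoneme spelling, a recognised phoneme run can map straight to the same internal symbol the text path uses. The phoneme spellings below are rough ARPAbet-style guesses and the symbols are placeholders, purely illustrative:

```python
# Phoneme sequences mapped directly to internal word symbols, bypassing
# any text lookup. Spellings are rough ARPAbet-style guesses.
PHONEMES_TO_SYMBOL = {
    ("OW", "P", "AH", "N"): "OPEN",
    ("D", "AO", "R"):       "DOOR",
}

def phonemes_to_symbols(phoneme_stream):
    """Greedy longest-match of phoneme runs against the dictionary."""
    symbols, i = [], 0
    while i < len(phoneme_stream):
        for length in range(4, 0, -1):   # longest entries first
            key = tuple(phoneme_stream[i:i + length])
            if key in PHONEMES_TO_SYMBOL:
                symbols.append(PHONEMES_TO_SYMBOL[key])
                i += len(key)
                break
        else:
            i += 1   # unrecognised phoneme: skip it and carry on
    return symbols
```

The output symbols then feed the same pattern-sentence and intention stages as typed text, so only the front end changes between voice and keyboard input.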
For audio processing I'm looking at OpenAL Soft (a software implementation of the Open Audio Library) right now. There's nothing in the library for voice recognition, or even low/high/band-pass microphone filtering, but it's low-level enough to build on and has both speed and cross-compatibility with other OSes.
The fastest approach I've seen is to take about 50 ms of audio (the shortest phonemes), generalise the pitch, then associate it with a phoneme (tuned to your accent). This takes about 1 ms... but again, it sits in between the best and something hard-coded.
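One cheap way to "generalise the pitch" of a 50 ms frame — not necessarily the exact method described above — is a zero-crossing frequency estimate followed by a nearest match against a per-speaker phoneme pitch table. The sample rate, table values, and phoneme labels here are all made up for the sketch:

```python
SAMPLE_RATE = 16000                                   # assumed capture rate
PHONEME_PITCH = {"AA": 110.0, "IY": 220.0, "S": 4000.0}  # illustrative table

def estimate_pitch(frame):
    """Crude frequency estimate in Hz from the zero-crossing count."""
    crossings = sum(
        1 for a, b in zip(frame, frame[1:]) if (a < 0) != (b < 0)
    )
    duration = len(frame) / SAMPLE_RATE
    return crossings / (2 * duration)   # two crossings per cycle

def nearest_phoneme(frame):
    pitch = estimate_pitch(frame)
    return min(PHONEME_PITCH, key=lambda p: abs(PHONEME_PITCH[p] - pitch))
```

Zero-crossing counting is far cruder than autocorrelation, but it's a handful of comparisons per frame, which is how the per-frame cost stays around a millisecond.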
One of the benefits, if it works, is that a responding chatbot can completely vary its response time to suit the situation... including interrupting the user, which adds another layer of humanness, but that depends on how well words & intentions are picked up.