V-Parser: developing an option towards NLP and some other thingies

ivan.moony · « **on:** July 17, 2017, 02:50:08 pm »

One way to parse a natural language text is to parse it by certain BNF notation grammar rules of relevant language. BNF could be a good choice because it is able to describe context free grammars (CFG) that allow ambiguity parses, while natural languages grammars are full of ambiguous alternations in the first-pass parse. Yet in the second pass we resolve which alternation is the right one, according to common sense rules. In example of: "I poked a monkey with a stick" we actually have to decide whether the narrator used a stick to poke the monkey or the monkey holds a stick. This is a very difficult task and even modern AI contests are arranged around resolving ambiguities like this. So, it is necessary to parse all ambiguities in order to pick the right one.

Recently I've been working on a CFG parser (the first pass), and what I've got so far is a fairly simple top-down algorithm with support for ambiguous parses that looks like this:

Code

function parse (startSymbol, text) {
    success = false;
    node = START_SYMBOL;

    while (true) {
        if (node is sequence) {
            node = node.FIRST_SEQUENCE_ELEMENT;

        } else if (node is alternation) {
            node = node.FIRST_ALTERNATION_ELEMENT;

        } else if (node.rule is terminal) {
            length = SCAN_TERMINAL (node.POSITION, node.TERMINAL);
            
            if (node.POSITION + length === text.length && length > -1)
                success = true;

            if (node.AHEAD_ITEM !== null && length > -1) {
                node = node.AHEAD_ITEM;

            } else if (node.NEXT_ALTERNATION !== null) {
                node = node.NEXT_ALTERNATION;

            } else
                break;
        }
    }

    return success;
}

Please note that this is only check-out parser that reports a success or a failure, providing a BNF-ish grammar and a text to parse. Some extra code is needed to record a pass through to an abstract syntax tree. Also, unavoidable "memo table" speed-up is omitted in this pseudo-code for simplicity.

The algorithm subsumes that we know every ahead item and next alternation for each node. This is not so hard to achieve, even on the fly, and releases us of having a depth stack while parsing. The algorithm is arranged to exhibit depth-first parse through a grammar tree. The difficulty with depth-first parsing is that it chokes on "left recursion", entering an infinite loop. Left recursion could be resolved by detecting it when we reach it, skipping it over, and placing multiple AHEAD_ITEM-s for each recursive symbol. I think it would work (parser just considers several alternative ahead items, one that ends the recursion and others that continue recursion), but I am just not happy with this solution, it just doesn't "feel right" for me, I don't know why.

I am hoping for an universal solution that handles left recursion without extra ad-hoc code. The plan is to try to rewrite the algorithm to walk through symbols in breadth-first manner. I think some node position chart is necessary in order for breadth-first version to work. Could be cool, I have to check it.

[Edit]
Utilizing memo table is basically the same problem as recursion. In both cases multiple ahead items are continued from two or more same symbols that start at the same place.

infurl · « **Reply #1 on:** July 18, 2017, 08:53:47 pm »

Using a memo table is the same principle as chart parsing. It stores intermediate results so it doesn't have to recalculate them. GLR parsers work using shift-reduce instead of recursion and don't choke on left-recursive rules. In fact they are most efficient when rules are written to use left-recursion.

ivan.moony · « **Reply #2 on:** July 18, 2017, 09:04:55 pm »

Considering GLR is basically bottom-up, how does it perform regarding to speed? Once I programmed a shift-reduce parser, but it was a way too slow because it was checking all the combinations from bottom towards up, no matter if an upper symbol fits at given place. I think that a major reason for speed of top down parsers is that they check only what they expect at given positions.

LOCKSUIT · « **Reply #3 on:** July 18, 2017, 11:35:06 pm »

The best thing I can offer for LNNs or VNNs, is that from naturally living a life, we make enough connections so that the network recognizes the correct sequence selection and also outputs the correct sequence selection to say externally or internally to one's self. And also allows discovering by mere words, or even images, bits, or particles.

ivan.moony · « **Reply #4 on:** July 19, 2017, 09:51:48 am »

Actually, there is analogy between parsers and neural networks. They both move up to the upper symbol if conditions are satisfied. With NN, only some of the conditions would be satisfied to move up, while with parsers all of the conditions should be satisfied to move up. In both cases we feed the bottom layer and we get the top layer by an algorithm. I think that some kind of a fuzzy parser implementation could serve as a replacement for NN.

8pla.net · « **Reply #5 on:** July 19, 2017, 03:37:21 pm »

The parameter named "startSymbol" is never used in the function.

ivan.moony · « **Reply #6 on:** July 19, 2017, 05:38:51 pm »

@8pla.net: it's a pseudocode. Upper case letters are used where some sort of calculation is needed. In the case of "START_SYMBOL", it is an object constructed from "startSymbol", along with some extra informations. For example, "START_SYMBOL" in javascript could look like this:

Code

{
    position: 0,
    rule: startSymbol,
    aheadItem: null,
    nextAlternation: null
}

"node.FIRST_SEQUENCE_ELEMENT" could look like this:

Code

{
    position: node.position,
    rule: node.rule.sequence[0],
    aheadItem: {parentSequence: node, index: 1},
    nextAlternation: node.nextAlternation
}

and "node.AHEAD_ITEM" could look like this:

Code

{
    position: node.position + length,
    rule: node.aheadItem.parentSequence.sequence[node.aheadItem.index],
    aheadItem: (
        (node.aheadItem.index === node.aheadItem.parentSequence.length - 1)?
            node.aheadItem.parentSequence.aheadItem:
            {parentSequence: node.aheadItem.parentSequence, index: node.aheadItem.index + 1}
    ),
    nextAlternation: node.nextAlternation
}

Anyway, the top post pseudocode doesn't handle memos and left recursion. Ive been developing it further, and I think this time I have a complete version. This is the pseudocode:

Code

FUNCTION parse (startSymbol, text)
    VAR schedule as FIFO_QUEUE, memo as array, branch as object, node as object, success as boolean
    
    success = false
    schedule.ENQUEUE (START_SYMBOL)

    WHILE schedule is not empty
        branch = schedule.DEQUEUE ()

        IF branch is in memo || branch is terminal
            IF branch is in memo
                node = memo.GET (branch.POSITION, branch.RULE)

            ELSE
                node = branch.SCAN_TERMINAL ()

            IF node.SUCCESS AND branch.AHEAD_NETWORK is null AND text.length in node.RIGHT_EXTENTS
                success = true

            IF node.SUCCESS AND branch.AHEAD_NETWORK is not null
                FOR each a in branch.AHEAD_NETWORK
                    node.AHEAD_NETWORK.PUSH (a)
                    FOR each position in node.RIGHT_EXTENDS
                        schedule.ENQUEUE (a)

            ELSE IF branch.NEXT_ALTERNATION is not null
                schedule.ENQUEUE (branch.NEXT_ALTERNATION)

        ELSE
            IF branch is alternation
                schedule.ENQUEUE (branch.FIRST_ALTERNATION_ELEMENT)

            ELSE IF branch is sequence
                schedule.ENQUEUE (branch.FIRST_SEQUENCE_ELEMENT)

        memo.UPDATE (branch)

    RETURN success

It is about a network of ahead symbols. We walk through symbols in a depth-first manner. When a symbol is parsed for the first time, it's ahead symbol is stitched on memo chart. Later, in the process of parsing, if the symbol repeats on the same location, its new ahead symbols are additionally stitched and processed along with its old ahead symbols. I would call this as "ahead network" because it contains some structure in which parent ahead symbols are inherited to end-of-sequence and sub-alternations children symbols. Something like:

Code

... -> [
  sub-sub-node -> [
    sub-node -> [
      node -> [node_aheads],
      [sub-node_aheads]
    ],
    [sub-sub-node_aheads]
  ],
  ...
]

this way each network element can be enriched by new ahead symbols in the process of parsing, waiting for the future processing, if the same symbol gets re-considered. Note that inheriting works only downwards, and not upwards, which is exactly what we need.

The algorithm handles left recursion in the same way it handles symbols matched in memo chart. Recursive symbol gets enriched by a new ahead symbol that continues recursion, along with recursion terminating ahead symbol. It should also have no problems with nulls. This algorithm is constructed using depth-first tree walking policy, but the ahead system should work if we rewrite it to any tree walking policy, including breadth-first and other possible variations.

Another new thing I plan to introduce is what I would call "Compact Parse Forest". In CPF there are no additional syntax tree returned from the parsing function. Instead, all the symbol left and right extents are stored directly into the grammar tree, as arrays belonging to relevant symbols. If we later want to reconstruct the abstract syntax forest, we have to see which previous node right extent matches with which next node left extent in a sequence. A bit more complicated for extracting, but uses less memory and makes exposed kind of parsing possible (a kind of parsing based only on walking through ahead symbols). When processing a final result, special lens that do the matching could be used in a form of javascript proxy handlers, so the end user would see no difference, while saving a memory for other stuff. CPF would also directly be used instead of memo table, as it has all the required informations - a symbol and its extents, and this additionally saves the memory. Someone would say that I'm obsessed with memory issues in a time when computers have several gigabytes of ram, but remember that more memory consumption means more reading and writing to memory, and that means more time spent to read/write the data. Maybe I'm also hoping to break some speed records for parsing CFG grammar based text data, but we'll see.

And last, but not least, an innovation would also be a human readable parse forest representation based on indexed continuations. I'll present this innovation a bit later, but first I have to check how the algorithm behaves in the practice, and does it work as it is intended anyway.

8pla.net · « **Reply #7 on:** July 20, 2017, 01:29:20 am »

What does SCAN_TERMINAL() do?

Code

function SCAN_TERMINAL(x,y)
{
  len = y-x;
  return len;
}

ivan.moony · « **Reply #8 on:** July 20, 2017, 06:25:13 am »

Quote

What does SCAN_TERMINAL() do?

If the terminal is a constant string, it checks if the constant resides at a position in parsed text:

Code

function parseConstant (text, pos, constant)
    if (text.substring (pos, pos + constant.length) === constant)
        return pos + constant.length;
    else
        return -1;
}

but it could be extended to parse a regexp too. Lexer is a regular pre-parsing step where input string is broken into a word array, ready for further processing. But exposed ahead network parsing algorithm uses scannerless method where scanning is done on the fly.

8pla.net · « **Reply #9 on:** July 21, 2017, 01:55:56 am »

C Language:

Code

#include <stdio.h>
#include <string.h>

void main()
{
   const char *scan = "Scan Terminal", *terminal = "Terminal";
   char *node;
   int pos;

   node = strstr(scan, terminal);
   pos = node - scan;

   printf("Scanned, \"%s\". For, \"%s\" (at position %d).\n", scan, node,pos);
}

PROGRAM OUTPUT:

Scanned, "Scan Terminal". For, "Terminal" (at position 5).

@ivan.moony You mean something like this?
@Freddy: May we have syntax highlighting?
@All readers: This source code is for discussion purposes only.

infurl · « **Reply #10 on:** July 21, 2017, 03:17:48 am »

Quote from: ivan.moony on July 18, 2017, 09:04:55 pm

Considering GLR is basically bottom-up, how does it perform regarding to speed? Once I programmed a shift-reduce parser, but it was a way too slow because it was checking all the combinations from bottom towards up, no matter if an upper symbol fits at given place. I think that a major reason for speed of top down parsers is that they check only what they expect at given positions.

Implemented correctly, shift reduce parsers are as fast as parsing gets and GLR parsers are the fastest way to parse context free grammars.

This is achieved by repeatedly expanding constituents until you have lists of terminals in lookup tables so at any given point in the parse, the parser is only looking for certain terminal symbols which in turn yield a specific subset of possible non-terminals. Certainly it can take time to generate the lookup tables that the parser uses, but the algorithms to generate the lookup tables are very efficient and can also be implemented to create parts of the lookup tables "just in time".

This isn't just stuff that I've read, I spent years researching and developing the subject and I've written some extremely sophisticated libraries implementing GLR parsing. My parser can handle context free grammars containing millions of rules and can parse a book in the time that it takes most natural language parsers to parse a sentence. I'm using these libraries for my own purposes and have no immediate plans to release them since so few people seem to have a clue about how any of this actually works, it would be more trouble than it was worth.

Nevertheless, keep going the way you're going. You're doing great and you might get there yourself in another ten or twenty years. ;-)

ivan.moony · « **Reply #11 on:** July 21, 2017, 09:14:50 am »

I just fixed a small bug, a single line loop that updates past nodes is added when advancing to an ahead symbol.

@8pla.net: Almost. The only difference is that scan function doesn't move the parsing pointer arbitrary through text, yet in scanning, the position of pointer is set at fixed value that depends on the previous terminal position and length. Basically, the scanning function is called repeatedly until it reaches the end of text string, when the parsing returns success.

@infurl: Ten or twenty years? Boy, it doesn't sound very promising. I wonder, do you use a word binary search for very big alternations?

Freddy · « **Reply #12 on:** July 21, 2017, 11:10:36 pm »

Quote from: 8pla.net on July 21, 2017, 01:55:56 am

@Freddy: May we have syntax highlighting?

Gosh you ask a lot

It was a pain in the bottom, but I managed to crowbar one in. I picked a dark background, just because I prefer to read code that way and the colours look better.

If you all prefer something else then maybe have a look at these and we can nit pick it : https://jmblog.github.io/color-themes-for-google-code-prettify/

infurl · « **Reply #13 on:** July 22, 2017, 02:37:23 am »

Quote from: ivan.moony on July 21, 2017, 09:14:50 am

@infurl: Ten or twenty years? Boy, it doesn't sound very promising. I wonder, do you use a word binary search for very big alternations?

How long it actually takes you will largely depend on how long you're willing to waste your time trying to reinvent everything yourself like I did. Once you realize that no matter what you invent somebody else already did it and did it better, then you start learning from others and you make much faster progress. It's only then that you stand a real chance of inventing something that actually is new. Hubris is the biggest obstacle to progress. I wish I'd learned that when I was a lot younger.

I tested my libraries extensively over a long period of time and optimized just about everything that could be optimized. Anything that can be precalculated is also pre-sorted and uses binary search, but in a lot of places the data structures are dynamic and so I make heavy use of top-down splay trees to speed up searching. It also collapses symbols into ranges wherever possible so it can do range comparisons instead of list searches.

The main library is a scannerless semantic parser. It goes all the way from raw characters to high level abstractions in one pass. Here's a sample of its output.

Code

<Clause>
    <Subject Clause_type='declarative' Number='plural' Person='third' Verb_type='complex_intransitive'>
        <Determinative_THE Category='definite'>The</Determinative_THE>
        <Nominal Number='plural'>
            <Noun_BOX Case='plain_case' Category='common_noun' Number='plural'>boxes</Noun_BOX>
        </Nominal>
    </Subject>
    <Predicator Clause_type='declarative' Number='plural' Person='third' Verb_type='complex_intransitive'>
        <Nonmodal_BE Number='plural' Person='third' Polarity='negative' Verb_form='present' Verb_type='complex_intransitive' to_infinitival='yes'>are not</Nonmodal_BE>
    </Predicator>
    <Subjective>
        <Preposition_phrase>
            <Preposition_ON Category='preposition'>on</Preposition_ON>
            <Determinative_THE Category='definite'>the</Determinative_THE>
            <Nominal Number='plural'>
                <Noun_TABLE Case='plain_case' Category='common_noun' Number='plural'>tables</Noun_TABLE>
            </Nominal>  
        </Preposition_phrase>
    </Subjective>
</Clause>

ivan.moony · « **Reply #14 on:** July 22, 2017, 03:16:48 pm »

I investigated GLR parsing. Ahead network parser is almost it, but not quite. The difference is that GLR is bottom-up, while ahead network parser is top-down. Ahead network parser also constructs the look ahead table elements on the fly, and ahead symbols are stitched to symbols, while GLR stitches ahead symbols to sequence positions. Thinking about speed, GLR parser could be somewhat faster, but I kind of like a simplicity of ahead network parser. I'd almost abandon my design in the favor of GLR parser, but what a heck, a simplicity is an advantage too.

V-Parser: developing an option towards NLP and some other thingies

ivan.moony

V-Parser: developing an option towards NLP and some other thingies

infurl

Re: V-Parser: developing an option towards NLP and some other thingies

ivan.moony

Re: V-Parser: developing an option towards NLP and some other thingies

LOCKSUIT

Re: V-Parser: developing an option towards NLP and some other thingies

ivan.moony

Re: V-Parser: developing an option towards NLP and some other thingies

8pla.net

Re: V-Parser: developing an option towards NLP and some other thingies

ivan.moony

Re: V-Parser: developing an option towards NLP and some other thingies

8pla.net

Re: Ahead Network Parser: developing an option towards NLP and some other thingies

ivan.moony

Re: Ahead Network Parser: developing an option towards NLP and some other thingies

8pla.net

Re: Ahead Network Parser: developing an option towards NLP and some other thingies

infurl

Re: V-Parser: developing an option towards NLP and some other thingies

ivan.moony

Re: Ahead Network Parser: developing an option towards NLP and some other thingies

Freddy

Re: Ahead Network Parser: developing an option towards NLP and some other thingies

infurl

Re: Ahead Network Parser: developing an option towards NLP and some other thingies

ivan.moony

Re: Ahead Network Parser: developing an option towards NLP and some other thingies

Recent Topics

Recent News

Users Online

Articles