Ai Dreams Forum

Member's Experiments & Projects => General Project Discussion => Topic started by: ivan.moony on July 17, 2017, 02:50:08 pm

Title: Ahead Network Parser: developing an option towards NLP and some other thingies
Post by: ivan.moony on July 17, 2017, 02:50:08 pm
One way to parse a natural language text is to parse it by certain BNF (https://en.wikipedia.org/wiki/Backus%E2%80%93Naur_form) notation grammar rules of relevant language. BNF could be a good choice because it is able to describe context free grammars (CFG) that allow ambiguity parses, while natural languages grammars are full of ambiguous alternations in the first-pass parse. Yet in the second pass we resolve which alternation is the right one, according to common sense rules. In example of: "I poked a monkey with a stick" we actually have to decide whether the narrator used a stick to poke the monkey or the monkey holds a stick. This is a very difficult task and even modern AI contests are arranged around resolving ambiguities like this. So, it is necessary to parse all ambiguities  in order to pick the right one.

Recently I've been working on a CFG parser (the first pass), and what I've got so far is a fairly simple top-down algorithm with support for ambiguous parses that looks like this:
Code: [Select]
function parse (startSymbol, text) {
    success = false;
    node = START_SYMBOL;

    while (true) {
        if (node is sequence) {
            node = node.FIRST_SEQUENCE_ELEMENT;

        } else if (node is alternation) {
            node = node.FIRST_ALTERNATION_ELEMENT;

        } else if (node.rule is terminal) {
            length = SCAN_TERMINAL (node.POSITION, node.TERMINAL);
           
            if (node.POSITION + length === text.length && length > -1)
                success = true;

            if (node.AHEAD_ITEM !== null && length > -1) {
                node = node.AHEAD_ITEM;

            } else if (node.NEXT_ALTERNATION !== null) {
                node = node.NEXT_ALTERNATION;

            } else
                break;
        }
    }

    return success;
}

Please note that this is only check-out parser that reports a success or a failure, providing a BNF-ish grammar and a text to parse. Some extra code is needed to record a pass through to an abstract syntax tree. Also, unavoidable "memo table" speed-up is omitted in this pseudo-code for simplicity.

The algorithm subsumes that we know every ahead item and next alternation for each node. This is not so hard to achieve, even on the fly, and releases us of having a depth stack while parsing.  The algorithm is arranged to exhibit  depth-first (https://en.wikipedia.org/wiki/Depth-first_search) parse through a grammar tree. The difficulty with depth-first parsing is that it chokes on "left recursion", entering an infinite loop. Left recursion could be resolved by detecting it when we reach it, skipping it over, and placing multiple AHEAD_ITEM-s for each recursive symbol. I think it would work (parser just considers several alternative ahead items, one that ends the recursion and others that continue recursion), but I am just not happy with this solution, it just doesn't "feel right" for me, I don't know why.

I am hoping for an universal solution that handles left recursion without extra ad-hoc code. The plan is to try to rewrite the algorithm to walk through symbols in breadth-first manner (https://en.wikipedia.org/wiki/Breadth-first_search). I think some node position chart is necessary in order for breadth-first version to work. Could be cool, I have to check it.

[Edit]
Utilizing memo table is basically the same problem as recursion. In both cases multiple ahead items are continued from two or more same symbols that start at the same place.
Title: Re: V-Parser: developing an option towards NLP and some other thingies
Post by: infurl on July 18, 2017, 08:53:47 pm
Using a memo table is the same principle as chart parsing. It stores intermediate results so it doesn't have to recalculate them. GLR parsers work using shift-reduce instead of recursion and don't choke on left-recursive rules. In fact they are most efficient when rules are written to use left-recursion.
Title: Re: V-Parser: developing an option towards NLP and some other thingies
Post by: ivan.moony on July 18, 2017, 09:04:55 pm
Considering GLR is basically bottom-up, how does it perform regarding to speed? Once I programmed a shift-reduce parser, but it was a way too slow because it was checking all the combinations from bottom towards up, no matter if an upper symbol fits at given place. I think that a major reason for speed of top down parsers is that they check only what they expect at given positions.
Title: Re: V-Parser: developing an option towards NLP and some other thingies
Post by: LOCKSUIT on July 18, 2017, 11:35:06 pm
The best thing I can offer for LNNs or VNNs, is that from naturally living a life, we make enough connections so that the network recognizes the correct sequence selection and also outputs the correct sequence selection to say externally or internally to one's self. And also allows discovering by mere words, or even images, bits, or particles.
Title: Re: V-Parser: developing an option towards NLP and some other thingies
Post by: ivan.moony on July 19, 2017, 09:51:48 am
Actually, there is analogy between parsers and neural networks. They both move up to the upper symbol if conditions are satisfied. With NN, only some of the conditions would be satisfied to move up, while with parsers all of the conditions should be satisfied to move up. In both cases we feed the bottom layer and we get the top layer by an algorithm. I think that some kind of a fuzzy parser implementation could serve as a replacement for NN.
Title: Re: V-Parser: developing an option towards NLP and some other thingies
Post by: 8pla.net on July 19, 2017, 03:37:21 pm
The parameter named "startSymbol" is never used in the function.
Title: Re: V-Parser: developing an option towards NLP and some other thingies
Post by: ivan.moony on July 19, 2017, 05:38:51 pm
@8pla.net: it's a pseudocode. Upper case letters are used where some sort of calculation is needed. In the case of "START_SYMBOL", it is an object constructed from "startSymbol", along with some extra informations. For example, "START_SYMBOL" in javascript could look like this:
Code: [Select]
{
    position: 0,
    rule: startSymbol,
    aheadItem: null,
    nextAlternation: null
}
"node.FIRST_SEQUENCE_ELEMENT" could look like this:
Code: [Select]
{
    position: node.position,
    rule: node.rule.sequence[0],
    aheadItem: {parentSequence: node, index: 1},
    nextAlternation: node.nextAlternation
}
and "node.AHEAD_ITEM" could look like this:
Code: [Select]
{
    position: node.position + length,
    rule: node.aheadItem.parentSequence.sequence[node.aheadItem.index],
    aheadItem: (
        (node.aheadItem.index === node.aheadItem.parentSequence.length - 1)?
            node.aheadItem.parentSequence.aheadItem:
            {parentSequence: node.aheadItem.parentSequence, index: node.aheadItem.index + 1}
    ),
    nextAlternation: node.nextAlternation
}

Anyway, the top post pseudocode doesn't handle memos and left recursion. Ive been developing it further, and I think this time I have a complete version. This is the pseudocode:
Code: [Select]
FUNCTION parse (startSymbol, text)
    VAR schedule as FIFO_QUEUE, memo as array, branch as object, node as object, success as boolean
   
    success = false
    schedule.ENQUEUE (START_SYMBOL)

    WHILE schedule is not empty
        branch = schedule.DEQUEUE ()

        IF branch is in memo || branch is terminal
            IF branch is in memo
                node = memo.GET (branch.POSITION, branch.RULE)

            ELSE
                node = branch.SCAN_TERMINAL ()

            IF node.SUCCESS AND branch.AHEAD_NETWORK is null AND text.length in node.RIGHT_EXTENTS
                success = true

            IF node.SUCCESS AND branch.AHEAD_NETWORK is not null
                FOR each a in branch.AHEAD_NETWORK
                    node.AHEAD_NETWORK.PUSH (a)
                    FOR each position in node.RIGHT_EXTENDS
                        schedule.ENQUEUE (a)

            ELSE IF branch.NEXT_ALTERNATION is not null
                schedule.ENQUEUE (branch.NEXT_ALTERNATION)

        ELSE
            IF branch is alternation
                schedule.ENQUEUE (branch.FIRST_ALTERNATION_ELEMENT)

            ELSE IF branch is sequence
                schedule.ENQUEUE (branch.FIRST_SEQUENCE_ELEMENT)

        memo.UPDATE (branch)

    RETURN success

It is about a network of ahead symbols. We walk through symbols in a depth-first manner. When a symbol is parsed for the first time, it's ahead symbol is stitched on memo chart. Later, in the process of parsing, if the symbol repeats on the same location, its new ahead symbols are additionally stitched and processed along with its old ahead symbols. I would call this as "ahead network" because it contains some structure in which parent ahead symbols are inherited to end-of-sequence and sub-alternations children symbols. Something like:
Code: [Select]
... -> [
  sub-sub-node -> [
    sub-node -> [
      node -> [node_aheads],
      [sub-node_aheads]
    ],
    [sub-sub-node_aheads]
  ],
  ...
]
this way each network element can be enriched by new ahead symbols in the process of parsing, waiting for the future processing, if the same symbol gets re-considered. Note that inheriting works only downwards, and not upwards, which is exactly what we need.

The algorithm handles left recursion in the same way it handles symbols matched in memo chart. Recursive symbol gets enriched by a new ahead symbol that continues recursion, along with recursion terminating ahead symbol. It should also have no problems with nulls. This algorithm is constructed using depth-first tree walking policy, but the ahead system should work if we rewrite it to any tree walking policy, including breadth-first and other possible variations (https://en.wikipedia.org/wiki/Tree_traversal).

Another new thing I plan to introduce is what I would call "Compact Parse Forest". In CPF there are no additional syntax tree returned from the parsing function. Instead, all the symbol left and right extents are stored directly into the grammar tree, as arrays belonging to relevant symbols. If we later want to reconstruct the abstract syntax forest, we have to see which previous node right extent matches with which next node left extent in a sequence. A bit more complicated for extracting, but uses less memory and makes exposed kind of parsing possible (a kind of parsing based only on walking through ahead symbols). When processing a final result, special lens that do the matching could be used in a form of javascript proxy handlers (https://developer.mozilla.org/en-US/docs/Web/JavaScript/Reference/Global_Objects/Proxy), so the end user would see no difference, while saving a memory for other stuff. CPF would also directly be used instead of memo table, as it has all the required informations - a symbol and its extents, and this additionally saves the memory. Someone would say that I'm obsessed with memory issues in a time when computers have several gigabytes of ram, but remember that more memory consumption means more reading and writing to memory, and that means more time spent to read/write the data. Maybe I'm also hoping to break some speed records for parsing CFG grammar based text data, but we'll see.

And last, but not least, an innovation would also be a human readable parse forest representation based on indexed continuations. I'll present this innovation a bit later, but first I have to check how the algorithm behaves in the practice, and does it work as it is intended anyway.
Title: Re: Ahead Network Parser: developing an option towards NLP and some other thingies
Post by: 8pla.net on July 20, 2017, 01:29:20 am
What does SCAN_TERMINAL() do?

Code: [Select]
function SCAN_TERMINAL(x,y)
{
  len = y-x;
  return len;
}
Title: Re: Ahead Network Parser: developing an option towards NLP and some other thingies
Post by: ivan.moony on July 20, 2017, 06:25:13 am
Quote
What does SCAN_TERMINAL() do?
If the terminal is a constant string, it checks if the constant resides at a position in parsed text:
Code: [Select]
function parseConstant (text, pos, constant)
    if (text.substring (pos, pos + constant.length) === constant)
        return pos + constant.length;
    else
        return -1;
}

but it could be extended to parse a regexp too. Lexer (https://en.wikipedia.org/wiki/Lexical_analysis) is a regular pre-parsing step where input string is broken into a word array, ready for further processing. But exposed ahead network parsing algorithm uses scannerless (https://en.wikipedia.org/wiki/Scannerless_parsing) method where scanning is done on the fly.
Title: Re: Ahead Network Parser: developing an option towards NLP and some other thingies
Post by: 8pla.net on July 21, 2017, 01:55:56 am
C Language:

Code: [Select]
#include <stdio.h>
#include <string.h>

void main()
{
   const char *scan = "Scan Terminal", *terminal = "Terminal";
   char *node;
   int pos;

   node = strstr(scan, terminal);
   pos = node - scan;

   printf("Scanned, \"%s\". For, \"%s\" (at position %d).\n", scan, node,pos);
}

PROGRAM OUTPUT:

Scanned, "Scan Terminal". For, "Terminal" (at position 5).

@ivan.moony You mean something like this?
@Freddy:  May we have syntax highlighting?
@All readers: This source code is for discussion purposes only.
Title: Re: V-Parser: developing an option towards NLP and some other thingies
Post by: infurl on July 21, 2017, 03:17:48 am
Considering GLR is basically bottom-up, how does it perform regarding to speed? Once I programmed a shift-reduce parser, but it was a way too slow because it was checking all the combinations from bottom towards up, no matter if an upper symbol fits at given place. I think that a major reason for speed of top down parsers is that they check only what they expect at given positions.

Implemented correctly, shift reduce parsers are as fast as parsing gets and GLR parsers are the fastest way to parse context free grammars.

This is achieved by repeatedly expanding constituents until you have lists of terminals in lookup tables so at any given point in the parse, the parser is only looking for certain terminal symbols which in turn yield a specific subset of possible non-terminals. Certainly it can take time to generate the lookup tables that the parser uses, but the algorithms to generate the lookup tables are very efficient and can also be implemented to create parts of the lookup tables "just in time".

This isn't just stuff that I've read, I spent years researching and developing the subject and I've written some extremely sophisticated libraries implementing GLR parsing. My parser can handle context free grammars containing millions of rules and can parse a book in the time that it takes most natural language parsers to parse a sentence. I'm using these libraries for my own purposes and have no immediate plans to release them since so few people seem to have a clue about how any of this actually works, it would be more trouble than it was worth.

Nevertheless, keep going the way you're going. You're doing great and you might get there yourself in another ten or twenty years. ;-)
Title: Re: Ahead Network Parser: developing an option towards NLP and some other thingies
Post by: ivan.moony on July 21, 2017, 09:14:50 am
I just fixed a small bug, a single line loop that updates past nodes is added when advancing to an ahead symbol.

@8pla.net: Almost. The only difference is that scan function doesn't move the parsing pointer arbitrary through text, yet in scanning, the position of pointer is set at fixed value that depends on the previous terminal position and length. Basically, the scanning function is called repeatedly until it reaches the end of text string, when the parsing returns success.

@infurl: Ten or twenty years? Boy, it doesn't sound very promising. I wonder, do you use a word binary search for very big alternations?
Title: Re: Ahead Network Parser: developing an option towards NLP and some other thingies
Post by: Freddy on July 21, 2017, 11:10:36 pm
@Freddy:  May we have syntax highlighting?

Gosh you ask a lot ;)

It was a pain in the bottom, but I managed to crowbar one in. I picked a dark background, just because I prefer to read code that way and the colours look better.

If you all prefer something else then maybe have a look at these and we can nit pick it : https://jmblog.github.io/color-themes-for-google-code-prettify/
Title: Re: Ahead Network Parser: developing an option towards NLP and some other thingies
Post by: infurl on July 22, 2017, 02:37:23 am
@infurl: Ten or twenty years? Boy, it doesn't sound very promising. I wonder, do you use a word binary search for very big alternations?

How long it actually takes you will largely depend on how long you're willing to waste your time trying to reinvent everything yourself like I did. Once you realize that no matter what you invent somebody else already did it and did it better, then you start learning from others and you make much faster progress. It's only then that you stand a real chance of inventing something that actually is new. Hubris is the biggest obstacle to progress. I wish I'd learned that when I was a lot younger.

I tested my libraries extensively over a long period of time and optimized just about everything that could be optimized. Anything that can be precalculated is also pre-sorted and uses binary search, but in a lot of places the data structures are dynamic and so I make heavy use of top-down splay trees to speed up searching. It also collapses symbols into ranges wherever possible so it can do range comparisons instead of list searches.

The main library is a scannerless semantic parser. It goes all the way from raw characters to high level abstractions in one pass. Here's a sample of its output.

Code: [Select]
<Clause>
    <Subject Clause_type='declarative' Number='plural' Person='third' Verb_type='complex_intransitive'>
        <Determinative_THE Category='definite'>The</Determinative_THE>
        <Nominal Number='plural'>
            <Noun_BOX Case='plain_case' Category='common_noun' Number='plural'>boxes</Noun_BOX>
        </Nominal>
    </Subject>
    <Predicator Clause_type='declarative' Number='plural' Person='third' Verb_type='complex_intransitive'>
        <Nonmodal_BE Number='plural' Person='third' Polarity='negative' Verb_form='present' Verb_type='complex_intransitive' to_infinitival='yes'>are not</Nonmodal_BE>
    </Predicator>
    <Subjective>
        <Preposition_phrase>
            <Preposition_ON Category='preposition'>on</Preposition_ON>
            <Determinative_THE Category='definite'>the</Determinative_THE>
            <Nominal Number='plural'>
                <Noun_TABLE Case='plain_case' Category='common_noun' Number='plural'>tables</Noun_TABLE>
            </Nominal> 
        </Preposition_phrase>
    </Subjective>
</Clause>
Title: Re: Ahead Network Parser: developing an option towards NLP and some other thingies
Post by: ivan.moony on July 22, 2017, 03:16:48 pm
I investigated GLR parsing. Ahead network parser is almost it, but not quite. The difference is that GLR is bottom-up, while ahead network parser is top-down. Ahead network parser also constructs the look ahead table elements on the fly, and ahead symbols are stitched to symbols, while GLR stitches ahead symbols to sequence positions. Thinking about speed, GLR parser could be somewhat faster,  but I kind of like a simplicity of ahead network parser. I'd almost abandon my design in the favor of GLR parser, but what a heck, a simplicity is an advantage too.
Title: Re: Ahead Network Parser: developing an option towards NLP and some other thingies
Post by: ivan.moony on July 22, 2017, 07:15:26 pm
Quote
How long it actually takes you will largely depend on how long you're willing to waste your time trying to reinvent everything yourself like I did. Once you realize that no matter what you invent somebody else already did it and did it better, then you start learning from others and you make much faster progress. It's only then that you stand a real chance of inventing something that actually is new. Hubris is the biggest obstacle to progress. I wish I'd learned that when I was a lot younger.

Dream big. The bigger the dream is, more beautiful the world could be.
Title: Re: Ahead Network Parser: developing an option towards NLP and some other thingies
Post by: ivan.moony on July 28, 2017, 07:14:58 pm
You know that feeling when you program something, it works, but you still having an itch about it? Well, it happened to me with the last version of parsing algorithm. It should work in the practice, I even started implementing it, but something wouldn't give me a break, so I stopped what I was doing and returned to my scratch book. I spent another five days thinking, although the previous version could work fine, but a little bird was inexorably tweeting that the algorithm can be better.

... :headscratch: ...

And it can. You know that feeling when you give everything out of yourself, and the result just feels right? Well, it is just happening to me now, I'm celebrating it with a fresh lemon beer, and I'm enjoying in a good milestone.

 :weee:

See for yourself:
Quote
FUNCTION parse (grammar, text)
        DECLARE chart as ARRAY[LENGTH(text)][], terminals as ARRAY[]
        chart[0][0] := {
                Rule: grammar.START_SYMBOL,
                Aheads: []
        };
        FOR each column in chart
                terminals.CLEAR ()
                WHILE new rows in column
                        // Phase 1: collect
                        FOR each new row in column
                                memo := column.GET_FIRST (row.Rule);
                                IF memo not equals row // found memo
                                        FOR each item in row.Aheads
                                                memo.Aheads.PUSH (item);

                                ELSE IF row.Rule is not terminal
                                        FOR each production of row.Rule
                                                column.PUSH (NODE_WITH_AHEAD_NETWORK (production.RIGHT_SIDE));

                                ELSE // row.Rule is terminal
                                        IF row.SCAN_TERMINAL ()
                                                terminals.PUSH (row);

                        // Phase 2: advance
                        FOR each term in terminals
                                FOR each new item in term.Aheads
                                        IF item not in chart[term.RIGHT_EXTENT]
                                                chart[term.RIGHT_EXTENT].PUSH (item);

        RETURN chart;

Prose: a chart of empty columns is declared with a length of input text, and the top symbol is pushed at chart[0] column. Then the algorithm repeats two phases:
Ahead network structure remained the same, meaning that we have a structure:
Quote
... <- [
        sub-sub-node <- [
                sub-node <- [
                        node <- [node_aheads],
                        [sub-node_aheads]
                ],
                [sub-sub-node_aheads]
        ],
        ...
]
that is capable of inheriting elements from its parent nodes. When we extend an array of ahead symbols of an parent, we don't have to update any of its children because the structure itself do the magic for us. For example, all alternation elements inherit ahead elements from their parents. Also, some complex inheritance structures can happen to be constructed in cases of multiple referencing of symbols over complex grammars. Specific implementation of ahead network parser could need an array  flattening function when ranging over structure of ahead elements, but that depends on skills of a programmer who might choose tail recursion iterator over array flattening.

What do you think? It looks almost like those cool simple pseudo-codes on Wikipedia when you search it for parsing algorithms. Theoretically, it shouldn't be ashamed of its speed, and It should exhibit linear parsing time for any periodically convergent or non-ambiguous grammars, no matter if it has a left or a right recursion. Its linearity of parsing speed should be reverse proportional to a degree of its grammar ambiguity divergence.

Tx little bird, I don't know if it was worth of trouble, but I'll try to implement it soon, after some obligations are done before, and then we'll see how it behaves in practice.

 :favicon:
Title: Re: Ahead Network Parser: developing an option towards NLP and some other thingies
Post by: ivan.moony on August 03, 2017, 11:20:54 am
It is time to reveal what is going on behind ahead network system. Things are fairly simple, the Javascript function that lazily unfolds ahead items looks like this:
Code: [Select]
function NODE_WITH_AHEAD_NETWORK (parent) {
    "use strict";
    function recurse (index) {
        "use strict";
        var aheads;
        return {
            rule: parent.rule.sequence[index],
            aheads: function () {
                if (!aheads)
                    aheads = (index < parent.rule.sequence.length - 1)?
                        [recurse (index + 1)]:
                        [parent.aheads ()];

                return aheads;
            }
        }
    }
    return recurse (0);
}
It works by incrementing a sequence index. If it reaches end of sequence, it inherits ahead item from its parent. This way, inheriting works for any depth in one assignment cycle, without expensive loops that get heavy weighted on right recursion. I updated a bit the above parser algorithm to match NODE_WITH_AHEAD_NETWORK function and simplified it a bit further. Next week I hope I'll have the first results on parsing speed. The beauty of the algorithm is that it never looks back in the chart, so it might be fast enough, even in Javascript.
Title: Re: Ahead Network Parser: developing an option towards NLP and some other thingies
Post by: ivan.moony on August 08, 2017, 04:50:54 pm
I tested the algorithm, it works like a charm. Well, almost, it parses and reports errors well, but it's a bit slow. It takes about the same time as an Earley parser, but I suspect it is due to building ahead JSON function conglomerate instead of having immutable code with mutable data. I'll see what can I do about this.

The previous javascript post contained some bugs (naturally), so I'll update it now, according to experimental code on my machine. It works, but it is subject to change, if I manage to optimize that part.
Title: Re: Ahead Network Parser: developing an option towards NLP and some other thingies
Post by: ivan.moony on August 11, 2017, 10:54:34 pm
I optimized the algorithm, got about 150% of starting speed. Moving function from JSON to outside didn't do much, but I managed to apply a few array optimizations to get the speed up.

I've done some tests with two grammars. One is left recursive and the other is right recursive version of the same self definition grammar for parsing.

Right recursive grammar version looks like this:
Code: [Select]
(
    Parser :> (
        OptSpaceOrComments :> (
            (/(\s+)|(\/\/((.*\n)|(.*$)))|(\/\*[\S\s]*?\*\/)/, OptSpaceOrComments) |
            ''
        ) |
        String :> /("([^"\\\n]|(\\[\S\s]))*")|('([^'\\\n]|(\\[\S\s]))*')|(`([^`\\]|(\\[\S\s]))*`)/ |
        RegExp :> /\/(?!\*)(?!\/)([^\/\\\n]|(\\.))*\/i?/ |
        SymbolName :> /[_A-z][_A-z0-9]*/ |
        Root :> (
            AExp :> (
                Alternation :> (Item :> Root.SExp, '|', Next :> Root.AExp) |
                SExp :> (
                    Sequence :> (Item :> Root.STExp, ',', Next :> Root.SExp) |
                    STExp :> (
                        Subtype :> (Left :> Root.Exp, ':>', Right :> Root.STExp) |
                        Exp :> (
                            Parser.OptSpaceOrComments,
                            (
                                Parser.String |
                                Parser.RegExp |
                                NewSymbol :> Parser.SymbolName |
                                Reference :> (
                                    Parent :> ('' | Root.Group | Root.Reference | Parser.SymbolName),
                                    '.',
                                    Spec :> Parser.SymbolName
                                ) |
                                Group :> ('(', Root.AExp, ')')
                            ),
                            Parser.OptSpaceOrComments
                        )
                    )
                )
            )
        )
    )
).Parser.Root
and I got these rough results when parsing texts by this grammar:
(https://preview.ibb.co/iy4vHF/chart_RR.png)

Left recursive grammar version looks like this:
Code: [Select]
(
    Parser :> (
        OptSpaceOrComments :> (
            (/(\s+)|(\/\/((.*\n)|(.*$)))|(\/\*[\S\s]*?\*\/)/, OptSpaceOrComments) |
            ''
        ) |
        String :> /("([^"\\\n]|(\\[\S\s]))*")|('([^'\\\n]|(\\[\S\s]))*')|(`([^`\\]|(\\[\S\s]))*`)/ |
        RegExp :> /\/(?!\*)(?!\/)([^\/\\\n]|(\\.))*\/i?/ |
        SymbolName :> /[_A-z][_A-z0-9]*/ |
        Root :> (
            AExp :> (
                Alternation :> (Item :> Root.AExp, '|', Next :> Root.SExp) |
                SExp :> (
                    Sequence :> (Item :> Root.SExp, ',', Next :> Root.STExp) |
                    STExp :> (
                        Subtype :> (Left :> Root.STExp, ':>', Right :> Root.Exp) |
                        Exp :> (
                            Parser.OptSpaceOrComments,
                            (
                                Parser.String |
                                Parser.RegExp |
                                NewSymbol :> Parser.SymbolName |
                                Reference :> (
                                    Parent :> ('' | Root.Group | Root.Reference | Parser.SymbolName),
                                    '.',
                                    Spec :> Parser.SymbolName
                                ) |
                                Group :> ('(', Root.AExp, ')')
                            ),
                            Parser.OptSpaceOrComments
                        )
                    )
                )
            )
        )
    )
).Parser.Root
and I got these results  when parsing texts by this grammar:
(https://preview.ibb.co/eOnFHF/chart_LR.png)

I think I can be satisfied by speed, considering that this is a Javascript version operating on 4 years old Celeron M1000 on 1.8 GHz. As you can see, Chromium beats up Firefox for running this parser. Left recursive version performs slightly better. Both versions exhibit linear parsing speed, as they are non-ambuguous.

Ambiguously divergent grammars are parsing texts in logarithm scale, and as such, they are not very interesting for this parser. Natural languages are ambiguously periodically convergent, and as such, should be parsed in linear time.

Don't be intimidated by the grammar format, it is something I invested a lot of effort in and I hope you like it. It is self-explanatory if you are familiar with BNF grammar language (https://en.wikipedia.org/wiki/Backus%E2%80%93Naur_form). The two above examples are actually the grammar self definitions in left and right recursive versions.

I guess that the next steps include building English (and possibly other languages) grammars (possibly automatically from Treebank (https://en.wikipedia.org/wiki/Treebank) text files). I think I'm going open source on Github with this, but maybe I want to try to do something weird before publishing (I'll stay silent for now, you'd think I'm crazier than I am).
Title: Re: Ahead Network Parser: developing an option towards NLP and some other thingies
Post by: infurl on August 11, 2017, 11:10:45 pm
Nice work. Have you tried memoizing your function calls? It's generally the cheapest way to get a big performance improvement when parsing because it emulates the chart parsing mechanism, though of course there are many more advanced techniques which gain even greater comparative performance improvements.

Your grammar isn't really pure BNF because you've mixed a bunch of regular expressions in with it but most practical grammars do that anyway, that's why there are so many incompatible variations. I stuck to pure BNF for the syntax aspect of my parsing software because in the long run the results are much simpler and clearer. I did try variations like yours along the way. I think it was called ABNF or EBNF. It is used throughout the entire collection of internet RFC documents and was designed for handling pseudo code as well as formal languages.

To cover parsing natural languages you'll need to enhance your grammar in other ways, specifically to handle feature constraints. I found a particularly elegant solution to that problem. I will be very interested to see how you solve it.

Here's the grammar for my grammar (the syntax portion of it). I omitted the character definitions as they're a bit repetitive and should be easy to infer. :)

Code: [Select]
; CONTEXT FREE GRAMMAR ---------------------------------------------------------

Grammar := rules rule_separator
Grammar := rule_separator rules rule_separator

rules := Rule
rules := rules rule_separator Rule

Rule := RuleName defined_as RuleBody

defined_as := char_colon char_equals
defined_as := char_colon char_equals white_space
defined_as := white_space char_colon char_equals
defined_as := white_space char_colon char_equals white_space

RuleName := letter
RuleName := letter rule_name_chars

rule_name_chars := rule_name_char
rule_name_chars := rule_name_chars rule_name_char

rule_name_char := letter
rule_name_char := digit
rule_name_char := char_underscore

RuleBody := rule_items

rule_items := rule_item
rule_items := rule_items white_space rule_item

rule_item := RuleName
rule_item := CharCode

CharCode := digits

rule_separator := end_of_line
rule_separator := line_comment
rule_separator := rule_separator end_of_line
rule_separator := rule_separator line_comment

line_comment := char_semicolon end_of_line
line_comment := char_semicolon line_comment_chars end_of_line

line_comment_chars := line_comment_char
line_comment_chars := line_comment_chars line_comment_char

line_comment_char := alphanumeric
line_comment_char := punctuation
line_comment_char := white_space


;-------------------------------------------------------------------------------
Title: Re: Ahead Network Parser: developing an option towards NLP and some other thingies
Post by: ivan.moony on August 11, 2017, 11:46:15 pm
Quote from: infurl
Have you tried memoizing your function calls?
Yes, they were memoized from the start, it is a must have. Check out the above function that returns the ahead network.

Quote from: infurl
To cover parsing natural languages you'll need to enhance your grammar in other ways, specifically to handle feature constraints.
I plan to handle grammar as pure CFG for sentence structure. For now, word formation will be left outside the parser, when the parsing is done. I'll just return an ambiguous forest, and it is up to user to check out if nouns, adjectives and verbs are what they are supposed to be, and to pick the combination that fits a word list and a common sense. Only rare word types like adverbs, prepositions, auxiliary verbs and others will be hard coded into a grammar, while the user is responsible to pick up the right combination of other words from given palette. That simple for this iteration, although I have plans for moving a word formation system right into the grammar, in the future, but that requires wiring with functional programming paradigm. I'll probably post a paper before on this subject, I also have some cool solutions worth to be seen.

I like your grammar. It is clean and simple.
Title: Re: Ahead Network Parser: developing an option towards NLP and some other thingies
Post by: Don Patrick on August 12, 2017, 08:28:17 am
Good work, Ivan. You've come a long way  O0
Title: Re: Ahead Network Parser: developing an option towards NLP and some other thingies
Post by: ivan.moony on August 12, 2017, 10:43:08 am
Thank you, Don Patrick :)
Title: Re: Ahead Network Parser: developing an option towards NLP and some other thingies
Post by: ivan.moony on August 31, 2017, 03:34:40 pm
People, it seems that any grammar can be parsed in linear time in worst case, no matter how ambiguous it is!

Time of parsing (among other values) depends on string size and, until now, it has been a worst case cubic function of string size. This is something I've been quiet about for a month, as I know how crazy it sounds, but after some coding and testing it on the following grammars, it seems that ahead network parser always operates in linear parsing time (a linear function of string size):

Code: [Select]
// grammar1
(
    A :> (
        (.A, .A) |
        'a'
    ),
    'b'
)

//grammar2
(
    A :> (
        (
            .A,
            (
                (.A, .A, .A) |
                'a'
            ),
            .A           
        ) |
        'a'
    ),
    'b'
)

// grammar 3
(
    A :> (
        B :> (
            (
                .A,
                (
                    (.A, 'a', .A) |
                    .B
                ),
                .A
        ) |
        'a'
    ),
    'b'
)

These grammars are parsing thousands of characters in a matter of dozens milliseconds! This speed is obtained using the fact that "compact parse forest" (another invention of mine) maps each grammar rule only once at given position, while still retaining possibility to reconstruct input string. Current fastest parsing algorithms build up the whole shared packed parse forest (SPPF) in explicit format that bounds them to worst case cubic time. Compact parse forest is a kind of implicit format where you can later construct SPPF, or anything else of the interest.

Man, if this is true for all cases of ambiguous grammars... after 50 years of worst case cubic time... I still can't believe it, I must be doing something wrong  :-\
Title: Re: Ahead Network Parser: developing an option towards NLP and some other thingies
Post by: infurl on August 31, 2017, 08:55:50 pm
Great results if true. I'm not convinced that you're not just parsing regular expressions though. Try defining a grammar to parse XML and see how that performs. XML is very easy to parse using simple string searches, it was designed that way, but defining it and parsing it using a formal grammar will really stress your parser. Also, how exactly does your compact parse forest improve on the one that was published by Tomita in 1986?
Title: Re: Ahead Network Parser: developing an option towards NLP and some other thingies
Post by: ivan.moony on August 31, 2017, 10:54:38 pm
I'm not convinced that you're not just parsing regular expressions though. Try defining a grammar to parse XML and see how that performs. XML is very easy to parse using simple string searches, it was designed that way, but defining it and parsing it using a formal grammar will really stress your parser
Thank you for reply, I know how crazy I look even to myself. I plan to do something - a parser generator - to thoroughly test these preliminary results. Will publish it on Github if everything goes ok, probably together with automatic grammar extractor from Treebank files (for natural language) in later iteration. If not, we will have an alternative O(n^3) parser which I can also use for further work.

Also, how exactly does your compact parse forest improve on the one that was published by Tomita in 1986?
I'm not sure what Tomita published in 1986., but I'll refer to what I saw is called "todays most effective parse forest representation": Shared Packed Parse Forest (SPPF) (http://www.bramvandersanden.com/post/2014/06/shared-packed-parse-forest/). Consider grammar top symbol and all the possible paths that lead from top symbol to terminals, in whichever recursion there is. SPPF tracks all the paths, and considering there is maximum n^3 in-between nodes (n is input length), parsing naturally takes the time of O(n^3) in the worst case.

Compact parse forest (CPF) differently maps these nodes and greatly reduces amount of data it uses for forest representation. Main idea came from the fact that a context free grammar itself has a tree-like representation and I decided to reuse it for representing forests. What CPF really does during parsing is stitching a different array to each different grammar node (rule). These arrays contain elements of different pairs of left and right extents that occur inside parsed input for each rule. Thus, we don't store exact paths in CPF, we just have a grammar forest from before parsing where each tree is represented by unique rule that occur in text input on certain positions. Mentioned array only says on which positions a tree (rule) can be found inside the whole forest (input text). Watching from this perspective, a great deal of data is reduced (actual paths to terminals), while we store only rule positions that seem to grow linearly as the input text size grows.

However, there is a drawback in CPF, and that is in order to process it, we have to reconstruct paths to terminals afterwards parsing, which impacts speed of afterwards processing, but maybe it is worth of linear parsing time (if my results don't lie). It is a question and a matter of programming task whether CPF combination suits better than SPPF, but for my personal needs of constructing a language that holds a knowledge base, CPF will probably serve me better than SPPF because I don't plan to reconstruct whole paths, I just need references to certain rule parses. As for statistical analyzing of heavily ambiguous parse forests, CPF could have its advantages over SPPF.

The essence of CPF gave me indices that I could expect linear time parse to construct it, and by now I managed to successfully parse texts by all the grammars in this thread and more in nearly linear time, but we will see how it performs with other grammars when I build the parser generator. Anyway, I'm still not sure how to calculate exact worst case parsing time complexity.
Title: Re: Ahead Network Parser: developing an option towards NLP and some other thingies
Post by: Zero on September 04, 2017, 01:58:12 pm
I wish I could understand this.
Title: Re: Ahead Network Parser: developing an option towards NLP and some other thingies
Post by: ivan.moony on September 04, 2017, 07:40:47 pm
Yes, it is easy to understand, but complex to explain, as always.

Let's try this: regular parsers construct all the paths from the root to all the terminals as abstract syntax forest. I removed path construction outside of the algorithm, so I track only extents. As a great deal of nodes overlap, although they belong to different paths, when constructing whole paths we need more data than when attaching only extents. It is about a node reuse that is more frequent than with regular parsers.

The drawback is that what you don't do right away, you have to do later, and that means constructing paths from extents, if needed. The advantage comes up when we don't need to reconstruct the whole forest, but only fragments. In those cases it might be reasonable to use ahead network parser. This might make sense if you think about it in a frames of natural language, when we refer to a notion from another sentence. Then we don't have to know the whole referred sentence structure, we only need a reference to its part. This partial referencing pattern can be observed in some artificial languages too. And that is where it makes sense to use the proposed speedup.
Title: Re: Ahead Network Parser: developing an option towards NLP and some other thingies
Post by: korrelan on September 04, 2017, 08:02:15 pm
Really liking your work Ivan…

Search speed is finite (GHz) storage is plentiful (TB).

 :)
Title: Re: Ahead Network Parser: developing an option towards NLP and some other thingies
Post by: ivan.moony on September 04, 2017, 08:13:23 pm
Really liking your work Ivan…
Thank you, but I should take these preliminary results with reserve. Looks cool for now, but it's not fully tested yet. :-\

Search speed is finite (GHz) storage is plentiful (TB).
For writing each byte, some time is spent. More there are bytes, more time is needed for their writing.
Title: Re: Ahead Network Parser: developing an option towards NLP and some other thingies
Post by: Zero on September 05, 2017, 09:49:36 am
Ok I see the point, thanks.
I think Korrelan is talking about memoization.

Hey, thinking about it, chatbots are like big memoized things... but let's not drift away from parsers!
Title: Re: Ahead Network Parser: developing an option towards NLP and some other thingies
Post by: ivan.moony on October 08, 2017, 06:44:41 pm
Well, this is something that works too (tested in practice), but it is not linear. The speed is about the same as in graphs already published in this thread:
Code: [Select]
PROCEDURE Parse (grammar, text)
    DECLARE chart: [];
    Populate (chart[0], grammar.TOP_RULE, 0, []);
    FOR each nonempty column in chart  // chart can expand during this loop
        FOR each new item in column  // column can expand during this loop
            IF item.Rule is not terminal
                FOR each alternation of item.rule
                    Populate (column, alternation.sequence, 0, [item]);
            ELSE
                IF item.Rule.SCAN_TERMINAL (column.index, text)
                    item.Aheads.ADD_IF_NOT_EXIST (chart[item.RIGHT_EXTENT]);
                    FOR each occurence in item.Occurences
                        Populate (chart[item.RIGHT_EXTENT], occurence.sequence, occurence.index + 1, occurence.Parents);

    RETURN chart;

PROCEDURE Populate (array, sequence, index, parents)
    rule ← sequence[Index];
    item ← array.FIND (rule);
    IF not found item
        item ← NEW {Rule: rule, Occurences: [], Inheritors: [], Aheads:[]};
        array.ADD (item);

    IF Sequence.LENGTH > Index + 1
        AddOccurence (item, sequence, index, parents);

    ELSE
        FOR each parent in parents
            parent.Inheritors.ADD (item);
            FOR each toInherit in parent.Occurences
                AddOccurence (item, toInherit.Sequence, toInherit.Index, toInherit.Parents);

PROCEDURE AddOccurence (item, sequence, index, parents)
    occurence ← item.Occurences.FIND (sequence, index);
    IF not found occurence
        occurence ← NEW {Sequence: sequence, Index: index, Parents: []};
        item.Occurences.ADD (occurence);

    FOR each parent in parents
        occurence.Parents.ADD_IF_NOT_EXIST (parent);

    FOR each inheritor in item.Inheritors
        AddOccurence (inheritor, occurence.Sequence, occurence.Index, occurence.Parents);

    FOR each column in item.Aheads
        Populate (column, occurence.Sequence, occurence.Index + 1, occurence.Parents);

I thought of just adding a few lines to merge parents of the same right extent to make the parsing linear, but I'm onto implementing new (left-align)<->(right-align) philosophy because I don't like combining several different philosophies, using only fragments of them where appropriate, it doesn't feel right. I have at least two different algorithms by now (both cubic, I guess), and I wonder how many parsing algorithms are even possible, and which of them could be linear...