Ahead Network Parser: developing an option towards NLP and some other thingies

  • 32 Replies
  • 1841 Views
*

ivan.moony

  • Trusty Member
  • *********
  • Terminator
  • *
  • 757
  • look, a star is falling
Quote
How long it actually takes you will largely depend on how long you're willing to waste your time trying to reinvent everything yourself like I did. Once you realize that no matter what you invent somebody else already did it and did it better, then you start learning from others and you make much faster progress. It's only then that you stand a real chance of inventing something that actually is new. Hubris is the biggest obstacle to progress. I wish I'd learned that when I was a lot younger.

Dream big. The bigger the dream is, more beautiful the world could be.
« Last Edit: July 23, 2017, 10:18:08 am by ivan.moony »
Wherever you see a nice spot, plant another knowledge tree :favicon:

*

ivan.moony

  • Trusty Member
  • *********
  • Terminator
  • *
  • 757
  • look, a star is falling
You know that feeling when you program something, it works, but you still having an itch about it? Well, it happened to me with the last version of parsing algorithm. It should work in the practice, I even started implementing it, but something wouldn't give me a break, so I stopped what I was doing and returned to my scratch book. I spent another five days thinking, although the previous version could work fine, but a little bird was inexorably tweeting that the algorithm can be better.

... :headscratch: ...

And it can. You know that feeling when you give everything out of yourself, and the result just feels right? Well, it is just happening to me now, I'm celebrating it with a fresh lemon beer, and I'm enjoying in a good milestone.

 :weee:

See for yourself:
Quote
FUNCTION parse (grammar, text)
        DECLARE chart as ARRAY[LENGTH(text)][], terminals as ARRAY[]
        chart[0][0] := {
                Rule: grammar.START_SYMBOL,
                Aheads: []
        };
        FOR each column in chart
                terminals.CLEAR ()
                WHILE new rows in column
                        // Phase 1: collect
                        FOR each new row in column
                                memo := column.GET_FIRST (row.Rule);
                                IF memo not equals row // found memo
                                        FOR each item in row.Aheads
                                                memo.Aheads.PUSH (item);

                                ELSE IF row.Rule is not terminal
                                        FOR each production of row.Rule
                                                column.PUSH (NODE_WITH_AHEAD_NETWORK (production.RIGHT_SIDE));

                                ELSE // row.Rule is terminal
                                        IF row.SCAN_TERMINAL ()
                                                terminals.PUSH (row);

                        // Phase 2: advance
                        FOR each term in terminals
                                FOR each new item in term.Aheads
                                        IF item not in chart[term.RIGHT_EXTENT]
                                                chart[term.RIGHT_EXTENT].PUSH (item);

        RETURN chart;

Prose: a chart of empty columns is declared with a length of input text, and the top symbol is pushed at chart[0] column. Then the algorithm repeats two phases:
  • Phase 1: a loop goes through the column, unfolding non-terminal symbols to the end of the column. Each symbol internally remembers its possibly inherited ahead elements at the each unfold. When a duplicate symbol is reached (memo), its ahead elements are enriched by those of the new position in the grammar. When a terminal is reached, it is pushed at a terminal array and nothing new is written to the column, slowing down the column growth by ending that branch.
  • Phase 2: when the column loop is over (meaning that all terminals in the column and their ahead elements are recorded), each ahead element of all terminals at the terminal array is written at the corresponding distance column in the chart. If there was a null sized terminal in the terminal array, and there were previously new ahead elements put by the unfolding or memo system in the phase 1, the same chart column grows, and it is continued to be processed over again (it returns to the phase 1), until there is no new rows put in it. When there are no new rows to put in the same column, the terminal array is cleared, and the pointer moves to the next non-empty column in the chart and the process repeats.
Ahead network structure remained the same, meaning that we have a structure:
Quote
... <- [
        sub-sub-node <- [
                sub-node <- [
                        node <- [node_aheads],
                        [sub-node_aheads]
                ],
                [sub-sub-node_aheads]
        ],
        ...
]
that is capable of inheriting elements from its parent nodes. When we extend an array of ahead symbols of an parent, we don't have to update any of its children because the structure itself do the magic for us. For example, all alternation elements inherit ahead elements from their parents. Also, some complex inheritance structures can happen to be constructed in cases of multiple referencing of symbols over complex grammars. Specific implementation of ahead network parser could need an array  flattening function when ranging over structure of ahead elements, but that depends on skills of a programmer who might choose tail recursion iterator over array flattening.

What do you think? It looks almost like those cool simple pseudo-codes on Wikipedia when you search it for parsing algorithms. Theoretically, it shouldn't be ashamed of its speed, and It should exhibit linear parsing time for any periodically convergent or non-ambiguous grammars, no matter if it has a left or a right recursion. Its linearity of parsing speed should be reverse proportional to a degree of its grammar ambiguity divergence.

Tx little bird, I don't know if it was worth of trouble, but I'll try to implement it soon, after some obligations are done before, and then we'll see how it behaves in practice.

 :favicon:
« Last Edit: August 08, 2017, 05:14:21 pm by ivan.moony »
Wherever you see a nice spot, plant another knowledge tree :favicon:

*

ivan.moony

  • Trusty Member
  • *********
  • Terminator
  • *
  • 757
  • look, a star is falling
It is time to reveal what is going on behind ahead network system. Things are fairly simple, the Javascript function that lazily unfolds ahead items looks like this:
Code: [Select]
function NODE_WITH_AHEAD_NETWORK (parent) {
    "use strict";
    function recurse (index) {
        "use strict";
        var aheads;
        return {
            rule: parent.rule.sequence[index],
            aheads: function () {
                if (!aheads)
                    aheads = (index < parent.rule.sequence.length - 1)?
                        [recurse (index + 1)]:
                        [parent.aheads ()];

                return aheads;
            }
        }
    }
    return recurse (0);
}
It works by incrementing a sequence index. If it reaches end of sequence, it inherits ahead item from its parent. This way, inheriting works for any depth in one assignment cycle, without expensive loops that get heavy weighted on right recursion. I updated a bit the above parser algorithm to match NODE_WITH_AHEAD_NETWORK function and simplified it a bit further. Next week I hope I'll have the first results on parsing speed. The beauty of the algorithm is that it never looks back in the chart, so it might be fast enough, even in Javascript.
« Last Edit: August 08, 2017, 05:02:53 pm by ivan.moony »
Wherever you see a nice spot, plant another knowledge tree :favicon:

*

ivan.moony

  • Trusty Member
  • *********
  • Terminator
  • *
  • 757
  • look, a star is falling
I tested the algorithm, it works like a charm. Well, almost, it parses and reports errors well, but it's a bit slow. It takes about the same time as an Earley parser, but I suspect it is due to building ahead JSON function conglomerate instead of having immutable code with mutable data. I'll see what can I do about this.

The previous javascript post contained some bugs (naturally), so I'll update it now, according to experimental code on my machine. It works, but it is subject to change, if I manage to optimize that part.
« Last Edit: August 08, 2017, 05:46:43 pm by ivan.moony »
Wherever you see a nice spot, plant another knowledge tree :favicon:

*

ivan.moony

  • Trusty Member
  • *********
  • Terminator
  • *
  • 757
  • look, a star is falling
I optimized the algorithm, got about 150% of starting speed. Moving function from JSON to outside didn't do much, but I managed to apply a few array optimizations to get the speed up.

I've done some tests with two grammars. One is left recursive and the other is right recursive version of the same self definition grammar for parsing.

Right recursive grammar version looks like this:
Code: [Select]
(
    Parser :> (
        OptSpaceOrComments :> (
            (/(\s+)|(\/\/((.*\n)|(.*$)))|(\/\*[\S\s]*?\*\/)/, OptSpaceOrComments) |
            ''
        ) |
        String :> /("([^"\\\n]|(\\[\S\s]))*")|('([^'\\\n]|(\\[\S\s]))*')|(`([^`\\]|(\\[\S\s]))*`)/ |
        RegExp :> /\/(?!\*)(?!\/)([^\/\\\n]|(\\.))*\/i?/ |
        SymbolName :> /[_A-z][_A-z0-9]*/ |
        Root :> (
            AExp :> (
                Alternation :> (Item :> Root.SExp, '|', Next :> Root.AExp) |
                SExp :> (
                    Sequence :> (Item :> Root.STExp, ',', Next :> Root.SExp) |
                    STExp :> (
                        Subtype :> (Left :> Root.Exp, ':>', Right :> Root.STExp) |
                        Exp :> (
                            Parser.OptSpaceOrComments,
                            (
                                Parser.String |
                                Parser.RegExp |
                                NewSymbol :> Parser.SymbolName |
                                Reference :> (
                                    Parent :> ('' | Root.Group | Root.Reference | Parser.SymbolName),
                                    '.',
                                    Spec :> Parser.SymbolName
                                ) |
                                Group :> ('(', Root.AExp, ')')
                            ),
                            Parser.OptSpaceOrComments
                        )
                    )
                )
            )
        )
    )
).Parser.Root
and I got these rough results when parsing texts by this grammar:


Left recursive grammar version looks like this:
Code: [Select]
(
    Parser :> (
        OptSpaceOrComments :> (
            (/(\s+)|(\/\/((.*\n)|(.*$)))|(\/\*[\S\s]*?\*\/)/, OptSpaceOrComments) |
            ''
        ) |
        String :> /("([^"\\\n]|(\\[\S\s]))*")|('([^'\\\n]|(\\[\S\s]))*')|(`([^`\\]|(\\[\S\s]))*`)/ |
        RegExp :> /\/(?!\*)(?!\/)([^\/\\\n]|(\\.))*\/i?/ |
        SymbolName :> /[_A-z][_A-z0-9]*/ |
        Root :> (
            AExp :> (
                Alternation :> (Item :> Root.AExp, '|', Next :> Root.SExp) |
                SExp :> (
                    Sequence :> (Item :> Root.SExp, ',', Next :> Root.STExp) |
                    STExp :> (
                        Subtype :> (Left :> Root.STExp, ':>', Right :> Root.Exp) |
                        Exp :> (
                            Parser.OptSpaceOrComments,
                            (
                                Parser.String |
                                Parser.RegExp |
                                NewSymbol :> Parser.SymbolName |
                                Reference :> (
                                    Parent :> ('' | Root.Group | Root.Reference | Parser.SymbolName),
                                    '.',
                                    Spec :> Parser.SymbolName
                                ) |
                                Group :> ('(', Root.AExp, ')')
                            ),
                            Parser.OptSpaceOrComments
                        )
                    )
                )
            )
        )
    )
).Parser.Root
and I got these results  when parsing texts by this grammar:


I think I can be satisfied by speed, considering that this is a Javascript version operating on 4 years old Celeron M1000 on 1.8 GHz. As you can see, Chromium beats up Firefox for running this parser. Left recursive version performs slightly better. Both versions exhibit linear parsing speed, as they are non-ambuguous.

Ambiguously divergent grammars are parsing texts in logarithm scale, and as such, they are not very interesting for this parser. Natural languages are ambiguously periodically convergent, and as such, should be parsed in linear time.

Don't be intimidated by the grammar format, it is something I invested a lot of effort in and I hope you like it. It is self-explanatory if you are familiar with BNF grammar language. The two above examples are actually the grammar self definitions in left and right recursive versions.

I guess that the next steps include building English (and possibly other languages) grammars (possibly automatically from Treebank text files). I think I'm going open source on Github with this, but maybe I want to try to do something weird before publishing (I'll stay silent for now, you'd think I'm crazier than I am).
Wherever you see a nice spot, plant another knowledge tree :favicon:

*

infurl

  • Trusty Member
  • *******
  • Starship Trooper
  • *
  • 276
  • Humans will disappoint you.
    • Home Page
Nice work. Have you tried memoizing your function calls? It's generally the cheapest way to get a big performance improvement when parsing because it emulates the chart parsing mechanism, though of course there are many more advanced techniques which gain even greater comparative performance improvements.

Your grammar isn't really pure BNF because you've mixed a bunch of regular expressions in with it but most practical grammars do that anyway, that's why there are so many incompatible variations. I stuck to pure BNF for the syntax aspect of my parsing software because in the long run the results are much simpler and clearer. I did try variations like yours along the way. I think it was called ABNF or EBNF. It is used throughout the entire collection of internet RFC documents and was designed for handling pseudo code as well as formal languages.

To cover parsing natural languages you'll need to enhance your grammar in other ways, specifically to handle feature constraints. I found a particularly elegant solution to that problem. I will be very interested to see how you solve it.

Here's the grammar for my grammar (the syntax portion of it). I omitted the character definitions as they're a bit repetitive and should be easy to infer. :)

Code: [Select]
; CONTEXT FREE GRAMMAR ---------------------------------------------------------

Grammar := rules rule_separator
Grammar := rule_separator rules rule_separator

rules := Rule
rules := rules rule_separator Rule

Rule := RuleName defined_as RuleBody

defined_as := char_colon char_equals
defined_as := char_colon char_equals white_space
defined_as := white_space char_colon char_equals
defined_as := white_space char_colon char_equals white_space

RuleName := letter
RuleName := letter rule_name_chars

rule_name_chars := rule_name_char
rule_name_chars := rule_name_chars rule_name_char

rule_name_char := letter
rule_name_char := digit
rule_name_char := char_underscore

RuleBody := rule_items

rule_items := rule_item
rule_items := rule_items white_space rule_item

rule_item := RuleName
rule_item := CharCode

CharCode := digits

rule_separator := end_of_line
rule_separator := line_comment
rule_separator := rule_separator end_of_line
rule_separator := rule_separator line_comment

line_comment := char_semicolon end_of_line
line_comment := char_semicolon line_comment_chars end_of_line

line_comment_chars := line_comment_char
line_comment_chars := line_comment_chars line_comment_char

line_comment_char := alphanumeric
line_comment_char := punctuation
line_comment_char := white_space


;-------------------------------------------------------------------------------

*

ivan.moony

  • Trusty Member
  • *********
  • Terminator
  • *
  • 757
  • look, a star is falling
Quote from: infurl
Have you tried memoizing your function calls?
Yes, they were memoized from the start, it is a must have. Check out the above function that returns the ahead network.

Quote from: infurl
To cover parsing natural languages you'll need to enhance your grammar in other ways, specifically to handle feature constraints.
I plan to handle grammar as pure CFG for sentence structure. For now, word formation will be left outside the parser, when the parsing is done. I'll just return an ambiguous forest, and it is up to user to check out if nouns, adjectives and verbs are what they are supposed to be, and to pick the combination that fits a word list and a common sense. Only rare word types like adverbs, prepositions, auxiliary verbs and others will be hard coded into a grammar, while the user is responsible to pick up the right combination of other words from given palette. That simple for this iteration, although I have plans for moving a word formation system right into the grammar, in the future, but that requires wiring with functional programming paradigm. I'll probably post a paper before on this subject, I also have some cool solutions worth to be seen.

I like your grammar. It is clean and simple.
Wherever you see a nice spot, plant another knowledge tree :favicon:

*

Don Patrick

  • Trusty Member
  • *******
  • Starship Trooper
  • *
  • 389
    • Artificial Detective
Good work, Ivan. You've come a long way  O0
Personal project: NLP -> learning -> knowledge -> logical inference -> A.I.

*

ivan.moony

  • Trusty Member
  • *********
  • Terminator
  • *
  • 757
  • look, a star is falling
Thank you, Don Patrick :)
Wherever you see a nice spot, plant another knowledge tree :favicon:

*

ivan.moony

  • Trusty Member
  • *********
  • Terminator
  • *
  • 757
  • look, a star is falling
People, it seems that any grammar can be parsed in linear time in worst case, no matter how ambiguous it is!

Time of parsing (among other values) depends on string size and, until now, it has been a worst case cubic function of string size. This is something I've been quiet about for a month, as I know how crazy it sounds, but after some coding and testing it on the following grammars, it seems that ahead network parser always operates in linear parsing time (a linear function of string size):

Code: [Select]
// grammar1
(
    A :> (
        (.A, .A) |
        'a'
    ),
    'b'
)

//grammar2
(
    A :> (
        (
            .A,
            (
                (.A, .A, .A) |
                'a'
            ),
            .A           
        ) |
        'a'
    ),
    'b'
)

// grammar 3
(
    A :> (
        B :> (
            (
                .A,
                (
                    (.A, 'a', .A) |
                    .B
                ),
                .A
        ) |
        'a'
    ),
    'b'
)

These grammars are parsing thousands of characters in a matter of dozens milliseconds! This speed is obtained using the fact that "compact parse forest" (another invention of mine) maps each grammar rule only once at given position, while still retaining possibility to reconstruct input string. Current fastest parsing algorithms build up the whole shared packed parse forest (SPPF) in explicit format that bounds them to worst case cubic time. Compact parse forest is a kind of implicit format where you can later construct SPPF, or anything else of the interest.

Man, if this is true for all cases of ambiguous grammars... after 50 years of worst case cubic time... I still can't believe it, I must be doing something wrong  :-\
Wherever you see a nice spot, plant another knowledge tree :favicon:

*

infurl

  • Trusty Member
  • *******
  • Starship Trooper
  • *
  • 276
  • Humans will disappoint you.
    • Home Page
Great results if true. I'm not convinced that you're not just parsing regular expressions though. Try defining a grammar to parse XML and see how that performs. XML is very easy to parse using simple string searches, it was designed that way, but defining it and parsing it using a formal grammar will really stress your parser. Also, how exactly does your compact parse forest improve on the one that was published by Tomita in 1986?

*

ivan.moony

  • Trusty Member
  • *********
  • Terminator
  • *
  • 757
  • look, a star is falling
I'm not convinced that you're not just parsing regular expressions though. Try defining a grammar to parse XML and see how that performs. XML is very easy to parse using simple string searches, it was designed that way, but defining it and parsing it using a formal grammar will really stress your parser
Thank you for reply, I know how crazy I look even to myself. I plan to do something - a parser generator - to thoroughly test these preliminary results. Will publish it on Github if everything goes ok, probably together with automatic grammar extractor from Treebank files (for natural language) in later iteration. If not, we will have an alternative O(n^3) parser which I can also use for further work.

Also, how exactly does your compact parse forest improve on the one that was published by Tomita in 1986?
I'm not sure what Tomita published in 1986., but I'll refer to what I saw is called "todays most effective parse forest representation": Shared Packed Parse Forest (SPPF). Consider grammar top symbol and all the possible paths that lead from top symbol to terminals, in whichever recursion there is. SPPF tracks all the paths, and considering there is maximum n^3 in-between nodes (n is input length), parsing naturally takes the time of O(n^3) in the worst case.

Compact parse forest (CPF) differently maps these nodes and greatly reduces amount of data it uses for forest representation. Main idea came from the fact that a context free grammar itself has a tree-like representation and I decided to reuse it for representing forests. What CPF really does during parsing is stitching a different array to each different grammar node (rule). These arrays contain elements of different pairs of left and right extents that occur inside parsed input for each rule. Thus, we don't store exact paths in CPF, we just have a grammar forest from before parsing where each tree is represented by unique rule that occur in text input on certain positions. Mentioned array only says on which positions a tree (rule) can be found inside the whole forest (input text). Watching from this perspective, a great deal of data is reduced (actual paths to terminals), while we store only rule positions that seem to grow linearly as the input text size grows.

However, there is a drawback in CPF, and that is in order to process it, we have to reconstruct paths to terminals afterwards parsing, which impacts speed of afterwards processing, but maybe it is worth of linear parsing time (if my results don't lie). It is a question and a matter of programming task whether CPF combination suits better than SPPF, but for my personal needs of constructing a language that holds a knowledge base, CPF will probably serve me better than SPPF because I don't plan to reconstruct whole paths, I just need references to certain rule parses. As for statistical analyzing of heavily ambiguous parse forests, CPF could have its advantages over SPPF.

The essence of CPF gave me indices that I could expect linear time parse to construct it, and by now I managed to successfully parse texts by all the grammars in this thread and more in nearly linear time, but we will see how it performs with other grammars when I build the parser generator. Anyway, I'm still not sure how to calculate exact worst case parsing time complexity.
Wherever you see a nice spot, plant another knowledge tree :favicon:

*

Zero

  • Trusty Member
  • *******
  • Starship Trooper
  • *
  • 403
  • Fictional character
    • SYN CON DEV LOG
I wish I could understand this.

*

ivan.moony

  • Trusty Member
  • *********
  • Terminator
  • *
  • 757
  • look, a star is falling
Yes, it is easy to understand, but complex to explain, as always.

Let's try this: regular parsers construct all the paths from the root to all the terminals as abstract syntax forest. I removed path construction outside of the algorithm, so I track only extents. As a great deal of nodes overlap, although they belong to different paths, when constructing whole paths we need more data than when attaching only extents. It is about a node reuse that is more frequent than with regular parsers.

The drawback is that what you don't do right away, you have to do later, and that means constructing paths from extents, if needed. The advantage comes up when we don't need to reconstruct the whole forest, but only fragments. In those cases it might be reasonable to use ahead network parser. This might make sense if you think about it in a frames of natural language, when we refer to a notion from another sentence. Then we don't have to know the whole referred sentence structure, we only need a reference to its part. This partial referencing pattern can be observed in some artificial languages too. And that is where it makes sense to use the proposed speedup.
Wherever you see a nice spot, plant another knowledge tree :favicon:

*

korrelan

  • Trusty Member
  • ********
  • Replicant
  • *
  • 711
  • Look into my eyes! WOAH!
    • Google +
Really liking your work Ivan…

Search speed is finite (GHz) storage is plentiful (TB).

 :)
It thunk... therefore it is!

 


Users Online

31 Guests, 3 Users
Users active in past 15 minutes:
LOCKSUIT, Marco, yotamarker
[Bumblebee]
[Trusty Member]

Most Online Today: 55. Most Online Ever: 208 (August 27, 2008, 09:36:30 am)

Articles