Ahead Network Parser: developing an option towards NLP and some other thingies

  • 23 Replies
  • 832 Views
*

ivan.moony

  • Trusty Member
  • ********
  • Replicant
  • *
  • 711
  • look, a star is falling
Quote
How long it actually takes you will largely depend on how long you're willing to waste your time trying to reinvent everything yourself like I did. Once you realize that no matter what you invent somebody else already did it and did it better, then you start learning from others and you make much faster progress. It's only then that you stand a real chance of inventing something that actually is new. Hubris is the biggest obstacle to progress. I wish I'd learned that when I was a lot younger.

Dream big. The bigger the dream is, more beautiful the world could be.
« Last Edit: July 23, 2017, 10:18:08 am by ivan.moony »
Wherever you see a nice spot, plant another knowledge tree :favicon:

*

ivan.moony

  • Trusty Member
  • ********
  • Replicant
  • *
  • 711
  • look, a star is falling
You know that feeling when you program something, it works, but you still having an itch about it? Well, it happened to me with the last version of parsing algorithm. It should work in the practice, I even started implementing it, but something wouldn't give me a break, so I stopped what I was doing and returned to my scratch book. I spent another five days thinking, although the previous version could work fine, but a little bird was inexorably tweeting that the algorithm can be better.

... :headscratch: ...

And it can. You know that feeling when you give everything out of yourself, and the result just feels right? Well, it is just happening to me now, I'm celebrating it with a fresh lemon beer, and I'm enjoying in a good milestone.

 :weee:

See for yourself:
Quote
FUNCTION parse (grammar, text)
        DECLARE chart as ARRAY[LENGTH(text)][], terminals as ARRAY[]
        chart[0][0] := {
                Rule: grammar.START_SYMBOL,
                Aheads: []
        };
        FOR each column in chart
                terminals.CLEAR ()
                WHILE new rows in column
                        // Phase 1: collect
                        FOR each new row in column
                                memo := column.GET_FIRST (row.Rule);
                                IF memo not equals row // found memo
                                        FOR each item in row.Aheads
                                                memo.Aheads.PUSH (item);

                                ELSE IF row.Rule is not terminal
                                        FOR each production of row.Rule
                                                column.PUSH (NODE_WITH_AHEAD_NETWORK (production.RIGHT_SIDE));

                                ELSE // row.Rule is terminal
                                        IF row.SCAN_TERMINAL ()
                                                terminals.PUSH (row);

                        // Phase 2: advance
                        FOR each term in terminals
                                FOR each new item in term.Aheads
                                        IF item not in chart[term.RIGHT_EXTENT]
                                                chart[term.RIGHT_EXTENT].PUSH (item);

        RETURN chart;

Prose: a chart of empty columns is declared with a length of input text, and the top symbol is pushed at chart[0] column. Then the algorithm repeats two phases:
  • Phase 1: a loop goes through the column, unfolding non-terminal symbols to the end of the column. Each symbol internally remembers its possibly inherited ahead elements at the each unfold. When a duplicate symbol is reached (memo), its ahead elements are enriched by those of the new position in the grammar. When a terminal is reached, it is pushed at a terminal array and nothing new is written to the column, slowing down the column growth by ending that branch.
  • Phase 2: when the column loop is over (meaning that all terminals in the column and their ahead elements are recorded), each ahead element of all terminals at the terminal array is written at the corresponding distance column in the chart. If there was a null sized terminal in the terminal array, and there were previously new ahead elements put by the unfolding or memo system in the phase 1, the same chart column grows, and it is continued to be processed over again (it returns to the phase 1), until there is no new rows put in it. When there are no new rows to put in the same column, the terminal array is cleared, and the pointer moves to the next non-empty column in the chart and the process repeats.
Ahead network structure remained the same, meaning that we have a structure:
Quote
... <- [
        sub-sub-node <- [
                sub-node <- [
                        node <- [node_aheads],
                        [sub-node_aheads]
                ],
                [sub-sub-node_aheads]
        ],
        ...
]
that is capable of inheriting elements from its parent nodes. When we extend an array of ahead symbols of an parent, we don't have to update any of its children because the structure itself do the magic for us. For example, all alternation elements inherit ahead elements from their parents. Also, some complex inheritance structures can happen to be constructed in cases of multiple referencing of symbols over complex grammars. Specific implementation of ahead network parser could need an array  flattening function when ranging over structure of ahead elements, but that depends on skills of a programmer who might choose tail recursion iterator over array flattening.

What do you think? It looks almost like those cool simple pseudo-codes on Wikipedia when you search it for parsing algorithms. Theoretically, it shouldn't be ashamed of its speed, and It should exhibit linear parsing time for any periodically convergent or non-ambiguous grammars, no matter if it has a left or a right recursion. Its linearity of parsing speed should be reverse proportional to a degree of its grammar ambiguity divergence.

Tx little bird, I don't know if it was worth of trouble, but I'll try to implement it soon, after some obligations are done before, and then we'll see how it behaves in practice.

 :favicon:
« Last Edit: August 08, 2017, 05:14:21 pm by ivan.moony »
Wherever you see a nice spot, plant another knowledge tree :favicon:

*

ivan.moony

  • Trusty Member
  • ********
  • Replicant
  • *
  • 711
  • look, a star is falling
It is time to reveal what is going on behind ahead network system. Things are fairly simple, the Javascript function that lazily unfolds ahead items looks like this:
Code: [Select]
function NODE_WITH_AHEAD_NETWORK (parent) {
    "use strict";
    function recurse (index) {
        "use strict";
        var aheads;
        return {
            rule: parent.rule.sequence[index],
            aheads: function () {
                if (!aheads)
                    aheads = (index < parent.rule.sequence.length - 1)?
                        [recurse (index + 1)]:
                        [parent.aheads ()];

                return aheads;
            }
        }
    }
    return recurse (0);
}
It works by incrementing a sequence index. If it reaches end of sequence, it inherits ahead item from its parent. This way, inheriting works for any depth in one assignment cycle, without expensive loops that get heavy weighted on right recursion. I updated a bit the above parser algorithm to match NODE_WITH_AHEAD_NETWORK function and simplified it a bit further. Next week I hope I'll have the first results on parsing speed. The beauty of the algorithm is that it never looks back in the chart, so it might be fast enough, even in Javascript.
« Last Edit: August 08, 2017, 05:02:53 pm by ivan.moony »
Wherever you see a nice spot, plant another knowledge tree :favicon:

*

ivan.moony

  • Trusty Member
  • ********
  • Replicant
  • *
  • 711
  • look, a star is falling
I tested the algorithm, it works like a charm. Well, almost, it parses and reports errors well, but it's a bit slow. It takes about the same time as an Earley parser, but I suspect it is due to building ahead JSON function conglomerate instead of having immutable code with mutable data. I'll see what can I do about this.

The previous javascript post contained some bugs (naturally), so I'll update it now, according to experimental code on my machine. It works, but it is subject to change, if I manage to optimize that part.
« Last Edit: August 08, 2017, 05:46:43 pm by ivan.moony »
Wherever you see a nice spot, plant another knowledge tree :favicon:

*

ivan.moony

  • Trusty Member
  • ********
  • Replicant
  • *
  • 711
  • look, a star is falling
I optimized the algorithm, got about 150% of starting speed. Moving function from JSON to outside didn't do much, but I managed to apply a few array optimizations to get the speed up.

I've done some tests with two grammars. One is left recursive and the other is right recursive version of the same self definition grammar for parsing.

Right recursive grammar version looks like this:
Code: [Select]
(
    Parser :> (
        OptSpaceOrComments :> (
            (/(\s+)|(\/\/((.*\n)|(.*$)))|(\/\*[\S\s]*?\*\/)/, OptSpaceOrComments) |
            ''
        ) |
        String :> /("([^"\\\n]|(\\[\S\s]))*")|('([^'\\\n]|(\\[\S\s]))*')|(`([^`\\]|(\\[\S\s]))*`)/ |
        RegExp :> /\/(?!\*)(?!\/)([^\/\\\n]|(\\.))*\/i?/ |
        SymbolName :> /[_A-z][_A-z0-9]*/ |
        Root :> (
            AExp :> (
                Alternation :> (Item :> Root.SExp, '|', Next :> Root.AExp) |
                SExp :> (
                    Sequence :> (Item :> Root.STExp, ',', Next :> Root.SExp) |
                    STExp :> (
                        Subtype :> (Left :> Root.Exp, ':>', Right :> Root.STExp) |
                        Exp :> (
                            Parser.OptSpaceOrComments,
                            (
                                Parser.String |
                                Parser.RegExp |
                                NewSymbol :> Parser.SymbolName |
                                Reference :> (
                                    Parent :> ('' | Root.Group | Root.Reference | Parser.SymbolName),
                                    '.',
                                    Spec :> Parser.SymbolName
                                ) |
                                Group :> ('(', Root.AExp, ')')
                            ),
                            Parser.OptSpaceOrComments
                        )
                    )
                )
            )
        )
    )
).Parser.Root
and I got these rough results when parsing texts by this grammar:


Left recursive grammar version looks like this:
Code: [Select]
(
    Parser :> (
        OptSpaceOrComments :> (
            (/(\s+)|(\/\/((.*\n)|(.*$)))|(\/\*[\S\s]*?\*\/)/, OptSpaceOrComments) |
            ''
        ) |
        String :> /("([^"\\\n]|(\\[\S\s]))*")|('([^'\\\n]|(\\[\S\s]))*')|(`([^`\\]|(\\[\S\s]))*`)/ |
        RegExp :> /\/(?!\*)(?!\/)([^\/\\\n]|(\\.))*\/i?/ |
        SymbolName :> /[_A-z][_A-z0-9]*/ |
        Root :> (
            AExp :> (
                Alternation :> (Item :> Root.AExp, '|', Next :> Root.SExp) |
                SExp :> (
                    Sequence :> (Item :> Root.SExp, ',', Next :> Root.STExp) |
                    STExp :> (
                        Subtype :> (Left :> Root.STExp, ':>', Right :> Root.Exp) |
                        Exp :> (
                            Parser.OptSpaceOrComments,
                            (
                                Parser.String |
                                Parser.RegExp |
                                NewSymbol :> Parser.SymbolName |
                                Reference :> (
                                    Parent :> ('' | Root.Group | Root.Reference | Parser.SymbolName),
                                    '.',
                                    Spec :> Parser.SymbolName
                                ) |
                                Group :> ('(', Root.AExp, ')')
                            ),
                            Parser.OptSpaceOrComments
                        )
                    )
                )
            )
        )
    )
).Parser.Root
and I got these results  when parsing texts by this grammar:


I think I can be satisfied by speed, considering that this is a Javascript version operating on 4 years old Celeron M1000 on 1.8 GHz. As you can see, Chromium beats up Firefox for running this parser. Left recursive version performs slightly better. Both versions exhibit linear parsing speed, as they are non-ambuguous.

Ambiguously divergent grammars are parsing texts in logarithm scale, and as such, they are not very interesting for this parser. Natural languages are ambiguously periodically convergent, and as such, should be parsed in linear time.

Don't be intimidated by the grammar format, it is something I invested a lot of effort in and I hope you like it. It is self-explanatory if you are familiar with BNF grammar language. The two above examples are actually the grammar self definitions in left and right recursive versions.

I guess that the next steps include building English (and possibly other languages) grammars (possibly automatically from Treebank text files). I think I'm going open source on Github with this, but maybe I want to try to do something weird before publishing (I'll stay silent for now, you'd think I'm crazier than I am).
Wherever you see a nice spot, plant another knowledge tree :favicon:

*

infurl

  • Trusty Member
  • *******
  • Starship Trooper
  • *
  • 263
  • Humans will disappoint you.
    • Home Page
Nice work. Have you tried memoizing your function calls? It's generally the cheapest way to get a big performance improvement when parsing because it emulates the chart parsing mechanism, though of course there are many more advanced techniques which gain even greater comparative performance improvements.

Your grammar isn't really pure BNF because you've mixed a bunch of regular expressions in with it but most practical grammars do that anyway, that's why there are so many incompatible variations. I stuck to pure BNF for the syntax aspect of my parsing software because in the long run the results are much simpler and clearer. I did try variations like yours along the way. I think it was called ABNF or EBNF. It is used throughout the entire collection of internet RFC documents and was designed for handling pseudo code as well as formal languages.

To cover parsing natural languages you'll need to enhance your grammar in other ways, specifically to handle feature constraints. I found a particularly elegant solution to that problem. I will be very interested to see how you solve it.

Here's the grammar for my grammar (the syntax portion of it). I omitted the character definitions as they're a bit repetitive and should be easy to infer. :)

Code: [Select]
; CONTEXT FREE GRAMMAR ---------------------------------------------------------

Grammar := rules rule_separator
Grammar := rule_separator rules rule_separator

rules := Rule
rules := rules rule_separator Rule

Rule := RuleName defined_as RuleBody

defined_as := char_colon char_equals
defined_as := char_colon char_equals white_space
defined_as := white_space char_colon char_equals
defined_as := white_space char_colon char_equals white_space

RuleName := letter
RuleName := letter rule_name_chars

rule_name_chars := rule_name_char
rule_name_chars := rule_name_chars rule_name_char

rule_name_char := letter
rule_name_char := digit
rule_name_char := char_underscore

RuleBody := rule_items

rule_items := rule_item
rule_items := rule_items white_space rule_item

rule_item := RuleName
rule_item := CharCode

CharCode := digits

rule_separator := end_of_line
rule_separator := line_comment
rule_separator := rule_separator end_of_line
rule_separator := rule_separator line_comment

line_comment := char_semicolon end_of_line
line_comment := char_semicolon line_comment_chars end_of_line

line_comment_chars := line_comment_char
line_comment_chars := line_comment_chars line_comment_char

line_comment_char := alphanumeric
line_comment_char := punctuation
line_comment_char := white_space


;-------------------------------------------------------------------------------

*

ivan.moony

  • Trusty Member
  • ********
  • Replicant
  • *
  • 711
  • look, a star is falling
Quote from: infurl
Have you tried memoizing your function calls?
Yes, they were memoized from the start, it is a must have. Check out the above function that returns the ahead network.

Quote from: infurl
To cover parsing natural languages you'll need to enhance your grammar in other ways, specifically to handle feature constraints.
I plan to handle grammar as pure CFG for sentence structure. For now, word formation will be left outside the parser, when the parsing is done. I'll just return an ambiguous forest, and it is up to user to check out if nouns, adjectives and verbs are what they are supposed to be, and to pick the combination that fits a word list and a common sense. Only rare word types like adverbs, prepositions, auxiliary verbs and others will be hard coded into a grammar, while the user is responsible to pick up the right combination of other words from given palette. That simple for this iteration, although I have plans for moving a word formation system right into the grammar, in the future, but that requires wiring with functional programming paradigm. I'll probably post a paper before on this subject, I also have some cool solutions worth to be seen.

I like your grammar. It is clean and simple.
Wherever you see a nice spot, plant another knowledge tree :favicon:

*

Don Patrick

  • Trusty Member
  • *******
  • Starship Trooper
  • *
  • 386
    • Artificial Detective
Good work, Ivan. You've come a long way  O0
Personal project: NLP -> learning -> knowledge -> logical inference -> A.I.

*

ivan.moony

  • Trusty Member
  • ********
  • Replicant
  • *
  • 711
  • look, a star is falling
Thank you, Don Patrick :)
Wherever you see a nice spot, plant another knowledge tree :favicon:

 


Users Online

21 Guests, 2 Users
Users active in past 15 minutes:
WriterOfMinds, Art
[Global Moderator]
[Trusty Member]

Most Online Today: 45. Most Online Ever: 208 (August 27, 2008, 09:36:30 am)

Articles