function parse (startSymbol, text) {
    success = false;
    node = START_SYMBOL; // the initial node, built from startSymbol as shown below
    while (true) {
        if (node.rule is sequence) {
            node = node.FIRST_SEQUENCE_ELEMENT;
        } else if (node.rule is alternation) {
            node = node.FIRST_ALTERNATION_ELEMENT;
        } else if (node.rule is terminal) {
            length = SCAN_TERMINAL (node.POSITION, node.TERMINAL);
            if (length > -1 && node.POSITION + length === text.length && node.AHEAD_ITEM === null)
                success = true;
            if (length > -1 && node.AHEAD_ITEM !== null) {
                node = node.AHEAD_ITEM;
            } else if (node.NEXT_ALTERNATION !== null) {
                node = node.NEXT_ALTERNATION;
            } else
                break;
        }
    }
    return success;
}
// the initial node, built from the start symbol:
{
    position: 0,
    rule: startSymbol,
    aheadItem: null,
    nextAlternation: null
}
// descending into a sequence: the node for its first element, with an
// ahead item pointing back at the rest of the sequence:
{
    position: node.position,
    rule: node.rule.sequence[0],
    aheadItem: {parentSequence: node, index: 1},
    nextAlternation: node.nextAlternation
}
// advancing over a matched terminal: the node for the ahead item, popping
// up to the parent's own ahead item at the end of a sequence:
{
    position: node.position + length,
    rule: node.aheadItem.parentSequence.rule.sequence[node.aheadItem.index],
    aheadItem: (
        (node.aheadItem.index === node.aheadItem.parentSequence.rule.sequence.length - 1) ?
            node.aheadItem.parentSequence.aheadItem :
            {parentSequence: node.aheadItem.parentSequence, index: node.aheadItem.index + 1}
    ),
    nextAlternation: node.nextAlternation
}
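Putting the loop and the three node templates together, here is a minimal runnable JavaScript sketch. The concrete rule shapes ({seq: [...]} for sequences, {alt: [...]} for alternations, plain strings for terminals), the makeAlternative helper, and the end-of-input guard are my assumptions for illustration, not part of the original pseudocode:

// A minimal sketch of the depth-first backtracking parser, under assumed
// rule shapes: {seq: [...]}, {alt: [...]}, and plain strings as terminals.
// Note: like the pseudocode above, it loops forever on left recursion.
function parse (startSymbol, text) {
    var success = false;
    var node = {position: 0, rule: startSymbol, aheadItem: null, nextAlternation: null};
    while (node) {
        var rule = node.rule;
        if (rule.seq) {                                    // descend into a sequence
            node = {
                position: node.position,
                rule: rule.seq[0],
                aheadItem: rule.seq.length > 1 ?
                    {parentSequence: node, index: 1} : node.aheadItem,
                nextAlternation: node.nextAlternation
            };
        } else if (rule.alt) {                             // try alternatives left to right
            node = makeAlternative (node, 0);
        } else {                                           // terminal: a constant string
            var length = text.startsWith (rule, node.position) ? rule.length : -1;
            if (length > -1 && node.position + length === text.length && node.aheadItem === null)
                success = true;
            if (length > -1 && node.aheadItem !== null) {  // advance to the ahead item
                var ahead = node.aheadItem, parent = ahead.parentSequence;
                node = {
                    position: node.position + length,
                    rule: parent.rule.seq[ahead.index],
                    aheadItem: (ahead.index === parent.rule.seq.length - 1) ?
                        parent.aheadItem :
                        {parentSequence: parent, index: ahead.index + 1},
                    nextAlternation: node.nextAlternation
                };
            } else {
                node = node.nextAlternation;               // backtrack, or stop when null
            }
        }
    }
    return success;
}

// Builds the node for the i-th alternative; its nextAlternation chains to
// the following alternative so a failed branch resumes at the same position.
function makeAlternative (node, i) {
    return {
        position: node.position,
        rule: node.rule.alt[i],
        aheadItem: node.aheadItem,
        nextAlternation: (i + 1 < node.rule.alt.length) ?
            makeAlternative (node, i + 1) :
            node.nextAlternation
    };
}

// usage: S -> ('a' | 'b') 'c'
console.log (parse ({seq: [{alt: ['a', 'b']}, 'c']}, "bc")); // true
console.log (parse ({seq: [{alt: ['a', 'b']}, 'c']}, "cc")); // false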
FUNCTION parse (startSymbol, text)
    VAR schedule as FIFO_QUEUE, memo as array, branch as object, node as object, success as boolean
    success = false
    schedule.ENQUEUE (START_SYMBOL)
    WHILE schedule is not empty
        branch = schedule.DEQUEUE ()
        IF branch is in memo OR branch is terminal
            IF branch is in memo
                node = memo.GET (branch.POSITION, branch.RULE)
            ELSE
                node = branch.SCAN_TERMINAL ()
            IF node.SUCCESS AND branch.AHEAD_NETWORK is null AND text.length in node.RIGHT_EXTENTS
                success = true
            IF node.SUCCESS AND branch.AHEAD_NETWORK is not null
                FOR each a in branch.AHEAD_NETWORK
                    node.AHEAD_NETWORK.PUSH (a)
                    FOR each position in node.RIGHT_EXTENTS
                        schedule.ENQUEUE (a at position)
            ELSE IF branch.NEXT_ALTERNATION is not null
                schedule.ENQUEUE (branch.NEXT_ALTERNATION)
        ELSE
            IF branch is alternation
                schedule.ENQUEUE (branch.FIRST_ALTERNATION_ELEMENT)
            ELSE IF branch is sequence
                schedule.ENQUEUE (branch.FIRST_SEQUENCE_ELEMENT)
        memo.UPDATE (branch)
    RETURN success
... -> [
    sub-sub-node -> [
        sub-node -> [
            node -> [node_aheads],
            [sub-node_aheads]
        ],
        [sub-sub-node_aheads]
    ],
    ...
]
function SCAN_TERMINAL (x, y)
{
    len = y - x;
    return len;
}
Quote: "What does SCAN_TERMINAL() do?"

If the terminal is a constant string, it checks whether the constant occurs at a given position in the parsed text:
function parseConstant (text, pos, constant) {
    if (text.substring (pos, pos + constant.length) === constant)
        return pos + constant.length;
    else
        return -1;
}
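If the terminal is a regular expression rather than a constant, the same interface can be kept by anchoring the match at the given position. A sketch, assuming ES2015 sticky ("y") regexes; the helper name parseRegExp is mine:

// Match a regex terminal that must start exactly at pos; returns the end
// position on success and -1 on failure, mirroring parseConstant above.
function parseRegExp (text, pos, regexp) {
    var sticky = new RegExp (regexp.source, "y"); // "y" anchors the match at lastIndex
    sticky.lastIndex = pos;
    var match = sticky.exec (text);
    return match ? pos + match[0].length : -1;
}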
#include <stdio.h>
#include <string.h>

int main(void)
{
    const char *scan = "Scan Terminal", *terminal = "Terminal";
    const char *node;
    int pos;

    node = strstr(scan, terminal);  /* find the terminal inside the scanned text */
    if (node != NULL) {
        pos = (int)(node - scan);
        printf("Scanned, \"%s\". For, \"%s\" (at position %d).\n", scan, node, pos);
    }
    return 0;
}
Considering that GLR is basically bottom-up, how does it perform in terms of speed? I once programmed a shift-reduce parser, but it was way too slow because it checked all the combinations from the bottom up, regardless of whether an upper symbol fit at a given place. I think that a major reason for the speed of top-down parsers is that they check only what they expect at given positions.
@Freddy: May we have syntax highlighting?
@infurl: Ten or twenty years? Boy, that doesn't sound very promising. I wonder, do you use binary search over words for very big alternations?
<Clause>
    <Subject Clause_type='declarative' Number='plural' Person='third' Verb_type='complex_intransitive'>
        <Determinative_THE Category='definite'>The</Determinative_THE>
        <Nominal Number='plural'>
            <Noun_BOX Case='plain_case' Category='common_noun' Number='plural'>boxes</Noun_BOX>
        </Nominal>
    </Subject>
    <Predicator Clause_type='declarative' Number='plural' Person='third' Verb_type='complex_intransitive'>
        <Nonmodal_BE Number='plural' Person='third' Polarity='negative' Verb_form='present' Verb_type='complex_intransitive' to_infinitival='yes'>are not</Nonmodal_BE>
    </Predicator>
    <Subjective>
        <Preposition_phrase>
            <Preposition_ON Category='preposition'>on</Preposition_ON>
            <Determinative_THE Category='definite'>the</Determinative_THE>
            <Nominal Number='plural'>
                <Noun_TABLE Case='plain_case' Category='common_noun' Number='plural'>tables</Noun_TABLE>
            </Nominal>
        </Preposition_phrase>
    </Subjective>
</Clause>
How long it actually takes you will largely depend on how much time you're willing to waste trying to reinvent everything yourself, like I did. Once you realize that no matter what you invent, somebody else already did it and did it better, then you start learning from others and you make much faster progress. It's only then that you stand a real chance of inventing something that actually is new. Hubris is the biggest obstacle to progress. I wish I'd learned that when I was a lot younger.
FUNCTION parse (grammar, text)
    DECLARE chart as ARRAY[LENGTH(text)][], terminals as ARRAY[]
    chart[0][0] := {
        Rule: grammar.START_SYMBOL,
        Aheads: []
    };
    FOR each column in chart
        terminals.CLEAR ()
        WHILE new rows in column
            // Phase 1: collect
            FOR each new row in column
                memo := column.GET_FIRST (row.Rule);
                IF memo not equals row // found memo
                    FOR each item in row.Aheads
                        memo.Aheads.PUSH (item);
                ELSE IF row.Rule is not terminal
                    FOR each production of row.Rule
                        column.PUSH (NODE_WITH_AHEAD_NETWORK (production.RIGHT_SIDE));
                ELSE // row.Rule is terminal
                    IF row.SCAN_TERMINAL ()
                        terminals.PUSH (row);
            // Phase 2: advance
            FOR each term in terminals
                FOR each new item in term.Aheads
                    IF item not in chart[term.RIGHT_EXTENT]
                        chart[term.RIGHT_EXTENT].PUSH (item);
    RETURN chart;
The ahead network is a structure that is capable of inheriting elements from its parent nodes. When we extend a parent's array of ahead symbols, we don't have to update any of its children, because the structure itself does the magic for us. For example, all alternation elements inherit ahead elements from their parents. Some complex inheritance structures can also arise when symbols are referenced multiple times across complex grammars. A specific implementation of the ahead network parser could need an array-flattening function when ranging over the structure of ahead elements, but that depends on the programmer, who might choose a tail-recursive iterator over array flattening.

... <- [
    sub-sub-node <- [
        sub-node <- [
            node <- [node_aheads],
            [sub-node_aheads]
        ],
        [sub-sub-node_aheads]
    ],
    ...
]
function NODE_WITH_AHEAD_NETWORK (parent) {
    "use strict";
    function recurse (index) {
        var aheads;
        return {
            rule: parent.rule.sequence[index],
            // aheads are computed lazily on the first call, then cached; the
            // last element of the sequence inherits the parent's ahead network.
            aheads: function () {
                if (!aheads)
                    aheads = (index < parent.rule.sequence.length - 1) ?
                        [recurse (index + 1)] :
                        [parent.aheads ()];
                return aheads;
            }
        };
    }
    return recurse (0);
}
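For example, with a hypothetical parent node for the sequence A B C whose own ahead network is empty, the aheads are only materialized on first request and then cached:

// Hypothetical parent: a three-element sequence with no aheads of its own.
var parent = {
    rule: {sequence: ["A", "B", "C"]},
    aheads: function () { return []; }
};

var first = NODE_WITH_AHEAD_NETWORK (parent);
console.log (first.rule);           // "A"
var second = first.aheads ()[0];    // built lazily on the first call
console.log (second.rule);          // "B"
var third = second.aheads ()[0];
console.log (third.rule);           // "C"
console.log (third.aheads ());      // [[]] - the parent's aheads, nested one level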
(
    Parser :> (
        OptSpaceOrComments :> (
            (/(\s+)|(\/\/((.*\n)|(.*$)))|(\/\*[\S\s]*?\*\/)/, OptSpaceOrComments) |
            ''
        ) |
        String :> /("([^"\\\n]|(\\[\S\s]))*")|('([^'\\\n]|(\\[\S\s]))*')|(`([^`\\]|(\\[\S\s]))*`)/ |
        RegExp :> /\/(?!\*)(?!\/)([^\/\\\n]|(\\.))*\/i?/ |
        SymbolName :> /[_A-z][_A-z0-9]*/ |
        Root :> (
            AExp :> (
                Alternation :> (Item :> Root.SExp, '|', Next :> Root.AExp) |
                SExp :> (
                    Sequence :> (Item :> Root.STExp, ',', Next :> Root.SExp) |
                    STExp :> (
                        Subtype :> (Left :> Root.Exp, ':>', Right :> Root.STExp) |
                        Exp :> (
                            Parser.OptSpaceOrComments,
                            (
                                Parser.String |
                                Parser.RegExp |
                                NewSymbol :> Parser.SymbolName |
                                Reference :> (
                                    Parent :> ('' | Root.Group | Root.Reference | Parser.SymbolName),
                                    '.',
                                    Spec :> Parser.SymbolName
                                ) |
                                Group :> ('(', Root.AExp, ')')
                            ),
                            Parser.OptSpaceOrComments
                        )
                    )
                )
            )
        )
    )
).Parser.Root
(
    Parser :> (
        OptSpaceOrComments :> (
            (/(\s+)|(\/\/((.*\n)|(.*$)))|(\/\*[\S\s]*?\*\/)/, OptSpaceOrComments) |
            ''
        ) |
        String :> /("([^"\\\n]|(\\[\S\s]))*")|('([^'\\\n]|(\\[\S\s]))*')|(`([^`\\]|(\\[\S\s]))*`)/ |
        RegExp :> /\/(?!\*)(?!\/)([^\/\\\n]|(\\.))*\/i?/ |
        SymbolName :> /[_A-z][_A-z0-9]*/ |
        Root :> (
            AExp :> (
                Alternation :> (Item :> Root.AExp, '|', Next :> Root.SExp) |
                SExp :> (
                    Sequence :> (Item :> Root.SExp, ',', Next :> Root.STExp) |
                    STExp :> (
                        Subtype :> (Left :> Root.STExp, ':>', Right :> Root.Exp) |
                        Exp :> (
                            Parser.OptSpaceOrComments,
                            (
                                Parser.String |
                                Parser.RegExp |
                                NewSymbol :> Parser.SymbolName |
                                Reference :> (
                                    Parent :> ('' | Root.Group | Root.Reference | Parser.SymbolName),
                                    '.',
                                    Spec :> Parser.SymbolName
                                ) |
                                Group :> ('(', Root.AExp, ')')
                            ),
                            Parser.OptSpaceOrComments
                        )
                    )
                )
            )
        )
    )
).Parser.Root
; CONTEXT FREE GRAMMAR ---------------------------------------------------------
Grammar := rules rule_separator
Grammar := rule_separator rules rule_separator
rules := Rule
rules := rules rule_separator Rule
Rule := RuleName defined_as RuleBody
defined_as := char_colon char_equals
defined_as := char_colon char_equals white_space
defined_as := white_space char_colon char_equals
defined_as := white_space char_colon char_equals white_space
RuleName := letter
RuleName := letter rule_name_chars
rule_name_chars := rule_name_char
rule_name_chars := rule_name_chars rule_name_char
rule_name_char := letter
rule_name_char := digit
rule_name_char := char_underscore
RuleBody := rule_items
rule_items := rule_item
rule_items := rule_items white_space rule_item
rule_item := RuleName
rule_item := CharCode
CharCode := digits
rule_separator := end_of_line
rule_separator := line_comment
rule_separator := rule_separator end_of_line
rule_separator := rule_separator line_comment
line_comment := char_semicolon end_of_line
line_comment := char_semicolon line_comment_chars end_of_line
line_comment_chars := line_comment_char
line_comment_chars := line_comment_chars line_comment_char
line_comment_char := alphanumeric
line_comment_char := punctuation
line_comment_char := white_space
;-------------------------------------------------------------------------------
Quote: "Have you tried memoizing your function calls?"

Yes, they were memoized from the start; it's a must-have. Check out the above function that returns the ahead network.
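For illustration only, here is a generic argument-keyed memoizer in JavaScript; it is a sketch of the general technique, not the project's actual mechanism, which bakes memoization into the node structures themselves:

// Generic memoizer: caches results keyed by the JSON-serialized argument
// list. Assumes arguments are cheaply serializable; illustration only.
function memoize (fn) {
    var cache = new Map ();
    return function () {
        var key = JSON.stringify (Array.prototype.slice.call (arguments));
        if (!cache.has (key))
            cache.set (key, fn.apply (this, arguments));
        return cache.get (key);
    };
}

var memoizedParseConstant = memoize (parseConstant); // e.g. wrapping the scanner above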
Quote: "To cover parsing natural languages you'll need to enhance your grammar in other ways, specifically to handle feature constraints."

I plan to handle the grammar as a pure CFG for sentence structure. For now, word formation will be left outside the parser, applied after the parsing is done. I'll just return an ambiguous forest, and it is up to the user to check whether nouns, adjectives, and verbs are what they are supposed to be, and to pick the combination that fits a word list and common sense. Only rare word types like adverbs, prepositions, auxiliary verbs, and a few others will be hard-coded into the grammar, while the user is responsible for picking the right combination of the other words from the given palette. It's that simple for this iteration, although I have plans for moving a word-formation system right into the grammar in the future, but that requires wiring in the functional programming paradigm. I'll probably post a paper on this subject first; I also have some cool solutions worth seeing.
// grammar 1
(
    A :> (
        (.A, .A) |
        'a'
    ),
    'b'
)

// grammar 2
(
    A :> (
        (
            .A,
            (
                (.A, .A, .A) |
                'a'
            ),
            .A
        ) |
        'a'
    ),
    'b'
)
// grammar 3
(
    A :> (
        B :> (
            (
                .A,
                (
                    (.A, 'a', .A) |
                    .B
                ),
                .A
            ) |
            'a'
        ),
        'b'
    )
)
Quote: "I'm not convinced that you're not just parsing regular expressions though. Try defining a grammar to parse XML and see how that performs. XML is very easy to parse using simple string searches, it was designed that way, but defining it and parsing it using a formal grammar will really stress your parser."

Thank you for the reply; I know how crazy I look, even to myself. I plan to build something - a parser generator - to thoroughly test these preliminary results. I will publish it on GitHub if everything goes ok, probably together with an automatic grammar extractor for Treebank files (for natural language) in a later iteration. If not, we will still have an alternative O(n^3) parser which I can also use for further work.
Quote: "Also, how exactly does your compact parse forest improve on the one that was published by Tomita in 1986?"

I'm not sure what Tomita published in 1986, but I'll refer to what I've seen called "today's most effective parse forest representation": the Shared Packed Parse Forest (SPPF) (http://www.bramvandersanden.com/post/2014/06/shared-packed-parse-forest/). Consider the grammar's top symbol and all the possible paths that lead from the top symbol down to terminals, through whatever recursion there is. An SPPF tracks all of these paths, and since it contains at most on the order of n^3 in-between nodes (n is the input length), parsing naturally takes O(n^3) time in the worst case.
Quote: "Really liking your work Ivan…"

Thank you, but these preliminary results should be taken with a grain of salt. It looks cool for now, but it's not fully tested yet. :-\
Quote: "Search speed is finite (GHz), storage is plentiful (TB)."

Some time is still spent writing each byte: the more bytes there are, the more time is needed to write them.
PROCEDURE Parse (grammar, text)
    DECLARE chart: [];
    Populate (chart[0], grammar.TOP_RULE, 0, []);
    FOR each nonempty column in chart // chart can expand during this loop
        FOR each new item in column // column can expand during this loop
            IF item.Rule is not terminal
                FOR each alternation of item.Rule
                    Populate (column, alternation.sequence, 0, [item]);
            ELSE
                IF item.Rule.SCAN_TERMINAL (column.index, text)
                    item.Aheads.ADD_IF_NOT_EXIST (chart[item.RIGHT_EXTENT]);
                    FOR each occurrence in item.Occurrences
                        Populate (chart[item.RIGHT_EXTENT], occurrence.sequence, occurrence.index + 1, occurrence.Parents);
    RETURN chart;
PROCEDURE Populate (array, sequence, index, parents)
    rule ← sequence[index];
    item ← array.FIND (rule);
    IF not found item
        item ← NEW {Rule: rule, Occurrences: [], Inheritors: [], Aheads: []};
        array.ADD (item);
    IF sequence.LENGTH > index + 1
        AddOccurrence (item, sequence, index, parents);
    ELSE
        FOR each parent in parents
            parent.Inheritors.ADD (item);
            FOR each toInherit in parent.Occurrences
                AddOccurrence (item, toInherit.Sequence, toInherit.Index, toInherit.Parents);
PROCEDURE AddOccurrence (item, sequence, index, parents)
    occurrence ← item.Occurrences.FIND (sequence, index);
    IF not found occurrence
        occurrence ← NEW {Sequence: sequence, Index: index, Parents: []};
        item.Occurrences.ADD (occurrence);
    FOR each parent in parents
        occurrence.Parents.ADD_IF_NOT_EXIST (parent);
    FOR each inheritor in item.Inheritors
        AddOccurrence (inheritor, occurrence.Sequence, occurrence.Index, occurrence.Parents);
    FOR each column in item.Aheads
        Populate (column, occurrence.Sequence, occurrence.Index + 1, occurrence.Parents);
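To make the bookkeeping concrete, the records that Populate and AddOccurrence manipulate could be shaped as below in JavaScript; the factory functions are my own illustration, only the field names come from the pseudocode:

// A chart item: one rule at one chart column, with inheritance bookkeeping.
function newChartItem (rule) {
    return {Rule: rule, Occurrences: [], Inheritors: [], Aheads: []};
}

// An occurrence: a dot position inside some production's sequence, together
// with the parent items that predicted it.
function newOccurrence (sequence, index) {
    return {Sequence: sequence, Index: index, Parents: []};
}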
DECLARE chart: [][], text: STRING;

FUNCTION Parse (grammar, input)
    text ← input;
    chart.CLEAR ();
    MergeItem (0, [grammar.START_SEQUENCE], 0, null);
    FOR each new column in chart DO
        FOR each new item in column DO
            FOR each production of item.Sequence[item.Index] in grammar DO
                MergeItem (column.Index, production.sequence, 0, item);

    RETURN chart;

PROCEDURE MergeItem (offset, sequence, index, parent)
    item ← chart[offset].FIND (sequence, index);
    IF not found item THEN
        item ← {Sequence: sequence, Index: index, Inherited: [], Inheritors: [], Parents: []};
        chart[offset].ADD (item);

    IF parent not in item.Parents THEN
        item.Parents.ADD (parent);
        FOR each x in [parent] UNION parent.Inherited DO
            IF x.Index + 1 < x.Sequence.LENGTH THEN
                FOR each y in [item] UNION item.Inheritors DO
                    IF y.Index + 1 == y.Sequence.LENGTH OR y is terminal THEN
                        IF (x.Sequence, x.Index) not in y.Inherited THEN
                            x.Inheritors.ADD (y);
                            y.Inherited.ADD (x);

                        IF y is terminal succeeded in text at offset THEN
                            FOR each z in x.Parents DO
                                MergeItem (offset + y.LENGTH, x.Sequence, x.Index + 1, z);
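The final version merges all of this into a single item record per dotted sequence. As a sketch (again, only the field names come from the pseudocode; the factory itself is illustrative):

// One chart item of the final algorithm: a dotted sequence plus the links
// that let completed items inherit their continuations.
function newItem (sequence, index) {
    return {
        Sequence: sequence, // the production right-hand side being matched
        Index: index,       // the dot position within that sequence
        Inherited: [],      // items whose continuations this item inherited
        Inheritors: [],     // items that inherit continuations from this item
        Parents: []         // items on whose behalf this item was predicted
    };
}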
I'm following your project with great interest, Ivan, and I'm looking forward to your updates.
Is "sequential normal form" your own invention?
From your github documentation:

Quote: "Definition of the grammars and their ability to develop rules towards input string is similar to the context-free grammars definition, except that the left side of rules may also be sequences of symbols."
That seems like the same thing as a context-sensitive grammar, which sits between context-free grammars and recursively enumerable grammars in the Chomsky hierarchy. What's different about yours?

The trouble with context-sensitive grammars is that parsing them is an intractable problem. Even human brains can't handle context-sensitive grammars, and very few natural languages have syntactic elements that could be considered context sensitive. (This shouldn't be confused with the semantics of natural languages, which is another story altogether.)

One day I imagine there will be quantum computers programming each other using unrestricted context-sensitive grammars, but they will be beyond our comprehension and they are certainly beyond the capabilities of any of our current machines.