@8pla.net: it's a pseudocode. Upper case letters are used where some sort of calculation is needed. In the case of "START_SYMBOL", it is an object constructed from "startSymbol", along with some extra informations. For example, "START_SYMBOL" in javascript could look like this:
{
position: 0,
rule: startSymbol,
aheadItem: null,
nextAlternation: null
}
"node.FIRST_SEQUENCE_ELEMENT" could look like this:
{
position: node.position,
rule: node.rule.sequence[0],
aheadItem: {parentSequence: node, index: 1},
nextAlternation: node.nextAlternation
}
and "node.AHEAD_ITEM" could look like this:
{
position: node.position + length,
rule: node.aheadItem.parentSequence.sequence[node.aheadItem.index],
aheadItem: (
(node.aheadItem.index === node.aheadItem.parentSequence.length - 1)?
node.aheadItem.parentSequence.aheadItem:
{parentSequence: node.aheadItem.parentSequence, index: node.aheadItem.index + 1}
),
nextAlternation: node.nextAlternation
}
Anyway, the top post pseudocode doesn't handle memos and left recursion. Ive been developing it further, and I think this time I have a complete version. This is the pseudocode:
FUNCTION parse (startSymbol, text)
VAR schedule as FIFO_QUEUE, memo as array, branch as object, node as object, success as boolean
success = false
schedule.ENQUEUE (START_SYMBOL)
WHILE schedule is not empty
branch = schedule.DEQUEUE ()
IF branch is in memo || branch is terminal
IF branch is in memo
node = memo.GET (branch.POSITION, branch.RULE)
ELSE
node = branch.SCAN_TERMINAL ()
IF node.SUCCESS AND branch.AHEAD_NETWORK is null AND text.length in node.RIGHT_EXTENTS
success = true
IF node.SUCCESS AND branch.AHEAD_NETWORK is not null
FOR each a in branch.AHEAD_NETWORK
node.AHEAD_NETWORK.PUSH (a)
FOR each position in node.RIGHT_EXTENDS
schedule.ENQUEUE (a)
ELSE IF branch.NEXT_ALTERNATION is not null
schedule.ENQUEUE (branch.NEXT_ALTERNATION)
ELSE
IF branch is alternation
schedule.ENQUEUE (branch.FIRST_ALTERNATION_ELEMENT)
ELSE IF branch is sequence
schedule.ENQUEUE (branch.FIRST_SEQUENCE_ELEMENT)
memo.UPDATE (branch)
RETURN success
It is about a network of ahead symbols. We walk through symbols in a depth-first manner. When a symbol is parsed for the first time, it's ahead symbol is stitched on memo chart. Later, in the process of parsing, if the symbol repeats on the same location, its new ahead symbols are additionally stitched and processed along with its old ahead symbols. I would call this as "ahead network" because it contains some structure in which parent ahead symbols are inherited to end-of-sequence and sub-alternations children symbols. Something like:
... -> [
sub-sub-node -> [
sub-node -> [
node -> [node_aheads],
[sub-node_aheads]
],
[sub-sub-node_aheads]
],
...
]
this way each network element can be enriched by new ahead symbols in the process of parsing, waiting for the future processing, if the same symbol gets re-considered. Note that inheriting works only downwards, and not upwards, which is exactly what we need.
The algorithm handles left recursion in the same way it handles symbols matched in memo chart. Recursive symbol gets enriched by a new ahead symbol that continues recursion, along with recursion terminating ahead symbol. It should also have no problems with nulls. This algorithm is constructed using depth-first tree walking policy, but the ahead system should work if we rewrite it to any tree walking policy, including breadth-first and
other possible variations.
Another new thing I plan to introduce is what I would call "Compact Parse Forest". In CPF there are no additional syntax tree returned from the parsing function. Instead, all the symbol left and right extents are stored directly into the grammar tree, as arrays belonging to relevant symbols. If we later want to reconstruct the abstract syntax forest, we have to see which previous node right extent matches with which next node left extent in a sequence. A bit more complicated for extracting, but uses less memory and makes exposed kind of parsing possible (a kind of parsing based only on walking through ahead symbols). When processing a final result, special lens that do the matching could be used in a form of javascript
proxy handlers, so the end user would see no difference, while saving a memory for other stuff. CPF would also directly be used instead of memo table, as it has all the required informations - a symbol and its extents, and this additionally saves the memory. Someone would say that I'm obsessed with memory issues in a time when computers have several gigabytes of ram, but remember that more memory consumption means more reading and writing to memory, and that means more time spent to read/write the data. Maybe I'm also hoping to break some speed records for parsing CFG grammar based text data, but we'll see.
And last, but not least, an innovation would also be a human readable parse forest representation based on indexed continuations. I'll present this innovation a bit later, but first I have to check how the algorithm behaves in the practice, and does it work as it is intended anyway.