This is the file that I am currently using for testing:
http://asmith.id.au/files/the-country-of-the-blind.text and this is the output that I am currently getting after running it through the parser
http://asmith.id.au/files/the-country-of-the-blind.xml It is almost correct but if you scroll down to the "ambiguous" tags you will see where it is still having trouble determining the end of some sentences. In such cases it creates an ambiguous element and puts each distinct possible interpretation in an option element. I know how to fix that, by not allowing sentences that begin with a lower case letter, but I'm still working to refine the grammar definition to express that cleanly.
At the moment I am just writing grammar rules in
Chomsky normal form but as things progress I'll be able to start using more succinct grammar constructs. Anyway, here by way of example are the first few rules that were used to create the parser that produced the aforementioned output.
Grammar := paragraph_list
paragraph_list := paragraph_break
paragraph_list :=
Paragraph paragraph_break
paragraph_list := paragraph_list
Paragraph paragraph_break
paragraph_break := end_of_line
paragraph_break := paragraph_break end_of_line
Paragraph := sentence_list
sentence_list :=
Sentencesentence_list := sentence_list white_space
SentenceI've adopted the convention that non-terminal symbols that are capitalised are converted in to elements in the output and non-terminal symbols that are not capitalised are just place holders. With syntactic short hand such as x (asterisk) for "repeating" and (open square bracket)x(close square bracket) for "optional" many of these wouldn't be needed, however first things first, I have to finish testing this thoroughly with what amounts to the assembly language of grammar construction. Can start doing fancier things once I am sure that this stage is working properly.
The grammar definition is compiled in to a set of look up tables that allow the parser to perform every possible test and operation as efficiently as possible. In effect, the grammar compilation process pre-processes as much as possible to minimise the amount of computation required at parse time. In this case, it can parse the 9,500 word test file on a single CPU of my work station in approximately 100 milliseconds. Larger files can be split up and parsed in parallel on multiple CPUs yielding quite breathtaking performance.
Rather than generating look up tables, the Common Lisp version actually compiles the grammar directly to executable code so even though it was interpreted, it was as fast as the C version. I will probably go on to write a version that similarly compiles the grammar to C code, but at this stage performance is not the issue.