GLR Parser

infurl · « **on:** July 27, 2011, 01:11:57 pm »

Well I'm happy to announce that I spent all day today using the GLR Parser software that I've been developing for the last four years, and for the first time, I didn't find any more bugs to fix! I've already developed a grammar definition for the grammar definitions that my parser uses so I will be able to make it "self-homed" i.e. able to parse itself.

More ambitious but nearly finished and debugged is a grammar definition for the Typed Feature Structure language used by the massive English Resource Grammar. I've already tested it extensively on the full lexicon version of the ERG. It takes my 4 CPU work station a whole ten seconds to translate the entire thing into more than 3 million lines of XML. With a fair bit more effort I should be able to use the ERG directly as the input to my parser library, thus producing a high performance parser for the entire English language. That will probably only take another four years to do.

However today I've been developing a grammar for splitting English prose into sentences. Still a couple of minor problems left to resolve but it only seems to be taking a few dozen rules to do it, and it is already able to handle all the complexities of punctuation and nested quotations. I guess the best way to celebrate will be to keep on working on it, eh.

Freddy · « **Reply #1 on:** July 27, 2011, 02:38:28 pm »

Congratulations

10 seconds to make 3 million lines of XML...hmm...that sounds really fast to me

mendicott · « **Reply #2 on:** July 27, 2011, 04:06:01 pm »

Andrew, I've worked a good bit parsing sentences of a corpus into XML. What I don't get yet is how to retain "relatedness", in other words sentences of a paragraph, section or chapter. (Which would also have to do with the Interpreter.) One can do statistical Semantic Analysis on the corpus and chain that to the XML, which does give some form of "relatedness", but does not necessarily show sentences in a paragraph, etc. For example, how to get a conversational agent to read a series of sentences consecutively related to given topic, rather than randomly jumping around through the corpus, based solely on statistical similarity?

infurl · « **Reply #3 on:** July 27, 2011, 11:29:13 pm »

@Marcus If I understand you correctly, the information that you are looking for comes under the heading Discourse Analysis.

http://en.wikipedia.org/wiki/Discourse_analysis

Just as grammar defines the structure of valid sentences in a language, discourse defines the protocols for arranging sentences in a meaningful dialog or narrative.

Art · « **Reply #4 on:** July 28, 2011, 12:29:27 am »

Andrew,

If you might be so inclined, a demo flick might be nice to witness...<hint...hint...wink...nudge...>

infurl · « **Reply #5 on:** July 28, 2011, 12:38:13 am »

Quote from: Art on July 28, 2011, 12:29:27 am

If you might be so inclined, a demo flick might be nice to witness...<hint...hint...wink...nudge...>

Not sure what you mean by a "demo flick" because there is nothing interesting enough to make a movie of happening here yet. Happy to post samples of data though to give a clearer idea of what I'm going on about if that would help.

DaveMorton · « **Reply #6 on:** July 28, 2011, 07:45:56 am »

Even if it doesn't help, we'd still love to see it.

infurl · « **Reply #7 on:** July 28, 2011, 08:30:16 am »

This is the file that I am currently using for testing: http://asmith.id.au/files/the-country-of-the-blind.text and this is the output that I am currently getting after running it through the parser http://asmith.id.au/files/the-country-of-the-blind.xml It is almost correct but if you scroll down to the "ambiguous" tags you will see where it is still having trouble determining the end of some sentences. In such cases it creates an ambiguous element and puts each distinct possible interpretation in an option element. I know how to fix that, by not allowing sentences that begin with a lower case letter, but I'm still working to refine the grammar definition to express that cleanly.

At the moment I am just writing grammar rules in Chomsky normal form but as things progress I'll be able to start using more succinct grammar constructs. Anyway, here by way of example are the first few rules that were used to create the parser that produced the aforementioned output.

Grammar := paragraph_list
paragraph_list := paragraph_break
paragraph_list := Paragraph paragraph_break
paragraph_list := paragraph_list Paragraph paragraph_break
paragraph_break := end_of_line
paragraph_break := paragraph_break end_of_line
Paragraph := sentence_list
sentence_list := Sentence
sentence_list := sentence_list white_space Sentence

I've adopted the convention that non-terminal symbols that are capitalised are converted in to elements in the output and non-terminal symbols that are not capitalised are just place holders. With syntactic short hand such as x (asterisk) for "repeating" and (open square bracket)x(close square bracket) for "optional" many of these wouldn't be needed, however first things first, I have to finish testing this thoroughly with what amounts to the assembly language of grammar construction. Can start doing fancier things once I am sure that this stage is working properly.

The grammar definition is compiled in to a set of look up tables that allow the parser to perform every possible test and operation as efficiently as possible. In effect, the grammar compilation process pre-processes as much as possible to minimise the amount of computation required at parse time. In this case, it can parse the 9,500 word test file on a single CPU of my work station in approximately 100 milliseconds. Larger files can be split up and parsed in parallel on multiple CPUs yielding quite breathtaking performance.

Rather than generating look up tables, the Common Lisp version actually compiles the grammar directly to executable code so even though it was interpreted, it was as fast as the C version. I will probably go on to write a version that similarly compiles the grammar to C code, but at this stage performance is not the issue.

Bragi · « **Reply #8 on:** July 28, 2011, 08:55:32 am »

congrats Andrew.
I'm a bit confused though, from what I see in the files, it only appears to be picking up the sentence structure, but is not yet tagging everything within the sentence (aka, find the subject, verb,...). Is that correct, or am I missing something?

infurl · « **Reply #9 on:** July 28, 2011, 09:01:29 am »

Quote from: Bragi on July 28, 2011, 08:55:32 am

congrats Andrew.
I'm a bit confused though, from what I see in the files, it only appears to be picking up the sentence structure, but is not yet tagging everything within the sentence (aka, find the subject, verb,...). Is that correct, or am I missing something?

That's correct Jan, at this stage I'm only trying to split up free form prose into sentences. Even that apparently simple task is non-trivial, but as I'm more concerned with knowledge acquisition than conversation (for the time being), it is an important step. I'm also planning to compare my results with those of sentence splitting tools from other NLP systems such as OpenNLP. I may not be a PhD candidate, but I do try to conduct my research like one.

By the way Jan, are you on Google+ yet? If so, how do I find you there?

Bragi · « **Reply #10 on:** July 28, 2011, 11:06:17 am »

Ahh, I see. I also did something similar in Aici: second stage splitting into sentences, can be tricky indeed.

No, I'm not on google+, so It's gonna be hard to find me there, sorry.

DaveMorton · « **Reply #11 on:** July 28, 2011, 02:23:48 pm »

Andrew, that's great stuff! I'm looking forward to seeing more, as it comes along.

@Jan: If you would like to give Google+ (G+) a go, you're welcome to PM either myself or Andrew (infurl) your email address, and we'd be more than happy to invite you.

Freddy · « **Reply #12 on:** July 30, 2011, 01:12:14 pm »

Topic split at this point to here...

DaveMorton · « **Reply #13 on:** July 30, 2011, 01:36:05 pm »

Thanks, Freddy! You're my hero!

Now, back to the original topic...

Andrew, I seem to recall you asking if there was interest in helping to test your parser. Are you still looking for someone to test it?

infurl · « **Reply #14 on:** August 12, 2011, 08:49:08 am »

This is an IndieGoGo fund-raising campaign which was launched yesterday and with 20 days still to go has already raised double the target amount of $3000, that is, it has raised $6000 in barely 24 hours. The primary objective of this project was to improve the Windows port of SBCL, a cross-platform open source CommonLisp compiler, however with such a huge turnout by the programming community in support of this project, the author will be able to make further enhancements to the Linux version as well.

http://www.indiegogo.com/SBCL-Threading-Improvements-1

This is the second wildly successful IndieGoGo fund-raiser for artificial intelligence software development this year (that I know about -- the other being for Grandroids) so it certainly seems that anyone with the skills and proven track record can get funding this way. I'm pretty sure the videos helped too -- there is something about being able to look someone in the eye and hear them speak, even if only via video, that engenders confidence.

Oh, I would also like to add that in spite of having 30 years of professional programming experience in C and C++ and only 12 years with CommonLisp, porting my parser software to C from CommonLisp has been such a time-consuming pain in the neck that I'm seriously considering giving up on it and just sticking with CommonLisp from now on. This might be a good time to switch to SBCL. I'll certainly be making the biggest donation that I can afford come next pay day. Heh.

GLR Parser

infurl

GLR Parser

Freddy

Re: GLR Parser

mendicott

Re: GLR Parser

infurl

Re: GLR Parser

Art

Re: GLR Parser

infurl

Re: GLR Parser

DaveMorton

Re: GLR Parser

infurl

Re: GLR Parser

Bragi

Re: GLR Parser

infurl

Re: GLR Parser

Bragi

Re: GLR Parser

DaveMorton

Re: GLR Parser

Freddy

Re: GLR Parser

DaveMorton

Re: GLR Parser

infurl

Re: GLR Parser

Recent Topics

Recent News

Users Online

Articles