Ai Dreams Forum

Member's Experiments & Projects => AI Programming => Topic started by: infurl on July 27, 2011, 01:11:57 pm

Title: GLR Parser
Post by: infurl on July 27, 2011, 01:11:57 pm
Well I'm happy to announce that I spent all day today using the GLR Parser software that I've been developing for the last four years, and for the first time, I didn't find any more bugs to fix! I've already developed a grammar definition for the grammar definitions that my parser uses so I will be able to make it "self-homed" i.e. able to parse itself.

More ambitious but nearly finished and debugged is a grammar definition for the Typed Feature Structure language used by the massive English Resource Grammar. I've already tested it extensively on the full lexicon version of the ERG. It takes my 4 CPU work station a whole ten seconds to translate the entire thing into more than 3 million lines of XML. With a fair bit more effort I should be able to use the ERG directly as the input to my parser library, thus producing a high performance parser for the entire English language. That will probably only take another four years to do.  :)

However today I've been developing a grammar for splitting English prose into sentences. Still a couple of minor problems left to resolve but it only seems to be taking a few dozen rules to do it, and it is already able to handle all the complexities of punctuation and nested quotations. I guess the best way to celebrate will be to keep on working on it, eh.
Title: Re: GLR Parser
Post by: Freddy on July 27, 2011, 02:38:28 pm
Congratulations  :)

10 seconds to make 3 million lines of XML...hmm...that sounds really fast to me  :o
Title: Re: GLR Parser
Post by: mendicott on July 27, 2011, 04:06:01 pm
Andrew, I've worked a good bit parsing sentences of a corpus into XML.  What I don't get yet is how to retain "relatedness", in other words sentences of a paragraph, section or chapter.  (Which would also have to do with the Interpreter.)  One can do statistical Semantic Analysis on the corpus and chain that to the XML, which does give some form of "relatedness", but does not necessarily show sentences in a paragraph, etc.  For example, how to get a conversational agent to read a series of sentences consecutively related to given topic, rather than randomly jumping around through the corpus, based solely on statistical similarity?
Title: Re: GLR Parser
Post by: infurl on July 27, 2011, 11:29:13 pm
@Marcus If I understand you correctly, the information that you are looking for comes under the heading Discourse Analysis.

http://en.wikipedia.org/wiki/Discourse_analysis (http://en.wikipedia.org/wiki/Discourse_analysis)

Just as grammar defines the structure of valid sentences in a language, discourse defines the protocols for arranging sentences in a meaningful dialog or narrative.

Title: Re: GLR Parser
Post by: Art on July 28, 2011, 12:29:27 am
Andrew,

If you might be so inclined, a demo flick might be nice to witness...<hint...hint...wink...nudge...> ;)
Title: Re: GLR Parser
Post by: infurl on July 28, 2011, 12:38:13 am
If you might be so inclined, a demo flick might be nice to witness...<hint...hint...wink...nudge...> ;)

Not sure what you mean by a "demo flick" because there is nothing interesting enough to make a movie of happening here yet. Happy to post samples of data though to give a clearer idea of what I'm going on about if that would help.
Title: Re: GLR Parser
Post by: DaveMorton on July 28, 2011, 07:45:56 am
Even if it doesn't help, we'd still love to see it. :P
Title: Re: GLR Parser
Post by: infurl on July 28, 2011, 08:30:16 am
This is the file that I am currently using for testing: http://asmith.id.au/files/the-country-of-the-blind.text (http://asmith.id.au/files/the-country-of-the-blind.text) and this is the output that I am currently getting after running it through the parser http://asmith.id.au/files/the-country-of-the-blind.xml (http://asmith.id.au/files/the-country-of-the-blind.xml) It is almost correct but if you scroll down to the "ambiguous" tags you will see where it is still having trouble determining the end of some sentences. In such cases it creates an ambiguous element and puts each distinct possible interpretation in an option element. I know how to fix that, by not allowing sentences that begin with a lower case letter, but I'm still working to refine the grammar definition to express that cleanly.

At the moment I am just writing grammar rules in Chomsky normal form (http://en.wikipedia.org/wiki/Chomsky_normal_form) but as things progress I'll be able to start using more succinct grammar constructs. Anyway, here by way of example are the first few rules that were used to create the parser that produced the aforementioned output.

Grammar := paragraph_list
paragraph_list := paragraph_break
paragraph_list := Paragraph paragraph_break
paragraph_list := paragraph_list Paragraph paragraph_break
paragraph_break := end_of_line
paragraph_break := paragraph_break end_of_line
Paragraph := sentence_list
sentence_list := Sentence
sentence_list := sentence_list white_space Sentence

I've adopted the convention that non-terminal symbols that are capitalised are converted in to elements in the output and non-terminal symbols that are not capitalised are just place holders. With syntactic short hand such as x (asterisk) for "repeating" and (open square bracket)x(close square bracket) for "optional" many of these wouldn't be needed, however first things first, I have to finish testing this thoroughly with what amounts to the assembly language of grammar construction. Can start doing fancier things once I am sure that this stage is working properly.

The grammar definition is compiled in to a set of look up tables that allow the parser to perform every possible test and operation as efficiently as possible. In effect, the grammar compilation process pre-processes as much as possible to minimise the amount of computation required at parse time. In this case, it can parse the 9,500 word test file on a single CPU of my work station in approximately 100 milliseconds. Larger files can be split up and parsed in parallel on multiple CPUs yielding quite breathtaking performance.

Rather than generating look up tables, the Common Lisp version actually compiles the grammar directly to executable code so even though it was interpreted, it was as fast as the C version. I will probably go on to write a version that similarly compiles the grammar to C code, but at this stage performance is not the issue.
Title: Re: GLR Parser
Post by: Bragi on July 28, 2011, 08:55:32 am
congrats Andrew.
I'm a bit confused though, from what I see in the files, it only appears to be picking up the sentence structure, but is not yet tagging everything within the sentence  (aka, find the subject, verb,...). Is that correct, or am I missing something?
Title: Re: GLR Parser
Post by: infurl on July 28, 2011, 09:01:29 am
congrats Andrew.
I'm a bit confused though, from what I see in the files, it only appears to be picking up the sentence structure, but is not yet tagging everything within the sentence  (aka, find the subject, verb,...). Is that correct, or am I missing something?

That's correct Jan, at this stage I'm only trying to split up free form prose into sentences. Even that apparently simple task is non-trivial, but as I'm more concerned with knowledge acquisition than conversation (for the time being), it is an important step. I'm also planning to compare my results with those of sentence splitting tools from other NLP systems such as OpenNLP (http://incubator.apache.org/opennlp). I may not be a PhD candidate, but I do try to conduct my research like one. ;)

By the way Jan, are you on Google+ yet? If so, how do I find you there?
Title: Re: GLR Parser
Post by: Bragi on July 28, 2011, 11:06:17 am
Ahh, I see. I also did something similar in Aici: second stage splitting into sentences, can be tricky indeed.

No, I'm not on google+, so It's gonna be hard to find me there, sorry.
Title: Re: GLR Parser
Post by: DaveMorton on July 28, 2011, 02:23:48 pm
Andrew, that's great stuff! I'm looking forward to seeing more, as it comes along. :)

@Jan: If you would like to give Google+ (G+) a go, you're welcome to PM either myself or Andrew (infurl) your email address, and we'd be more than happy to invite you.
Title: Re: GLR Parser
Post by: Freddy on July 30, 2011, 01:12:14 pm
Topic split at this point to here... (http://aidreams.co.uk/forum/index.php?topic=4218)
Title: Re: GLR Parser
Post by: DaveMorton on July 30, 2011, 01:36:05 pm
Thanks, Freddy! You're my hero! :)

Now, back to the original topic...

Andrew, I seem to recall you asking if there was interest in helping to test your parser. Are you still looking for someone to test it?
Title: Re: GLR Parser
Post by: infurl on August 12, 2011, 08:49:08 am
This is an IndieGoGo fund-raising campaign which was launched yesterday and with 20 days still to go has already raised double the target amount of $3000, that is, it has raised $6000 in barely 24 hours. The primary objective of this project was to improve the Windows port of SBCL, a cross-platform open source CommonLisp compiler, however with such a huge turnout by the programming community in support of this project, the author will be able to make further enhancements to the Linux version as well.

http://www.indiegogo.com/SBCL-Threading-Improvements-1 (http://www.indiegogo.com/SBCL-Threading-Improvements-1)

This is the second wildly successful IndieGoGo fund-raiser for artificial intelligence software development this year (that I know about -- the other being for Grandroids) so it certainly seems that anyone with the skills and proven track record can get funding this way. I'm pretty sure the videos helped too -- there is something about being able to look someone in the eye and hear them speak, even if only via video, that engenders confidence.

Oh, I would also like to add that in spite of having 30 years of professional programming experience in C and C++ and only 12 years with CommonLisp, porting my parser software to C from CommonLisp has been such a time-consuming pain in the neck that I'm seriously considering giving up on it and just sticking with CommonLisp from now on. This might be a good time to switch to SBCL. I'll certainly be making the biggest donation that I can afford come next pay day. Heh. :)
Title: Re: GLR Parser
Post by: infurl on August 18, 2011, 05:44:19 pm
I've been doing some stress testing of my GLR parser library. One of the files that I'm using is the Project Gutenberg catalog in RDF format. It is about 210 megabytes decompressed, and it takes my parser about 8 minutes to process it using a single core on my work station. That compares favorably with the time that it takes xsltproc to process it with an XSLT script (about 7 minutes) but falls far short of the time that it takes a compiled flex program to scan it (about 7 seconds).

Lexical scanners like flex only accept regular expressions which are much less expressive than context free grammars, however they have the advantage that it takes relatively little processing power to handle them. It would be neat if it was possible to recognise grammars, or parts of grammars, that could be handled by a regular expression parser instead of a full context free grammar parser. Could get the best of both worlds then.
Title: Re: GLR Parser
Post by: mendicott on August 18, 2011, 06:15:31 pm
> http://en.wikipedia.org/wiki/Parsing_expression_grammar (http://en.wikipedia.org/wiki/Parsing_expression_grammar)

How would "Parsing expression grammar" relate (or not relate) to this?

So, what is needed is a cloud-based API to translate between (<=>) Regex and "Phrase structure grammar" ?!?

If so, what would that look like, can you give a braindead simple example??
Title: Re: GLR Parser
Post by: Bragi on August 19, 2011, 07:20:21 am
Quote
Lexical scanners like flex only accept regular expressions which are much less expressive than context free grammars
I think you are a bit confused. Flex doesn't use regular expressions, it's a lexical scanner. It's fast because it doesn't do the full parse, not because it uses regular expressions, that would make it a lot slower.
Flex is what you use to create the data for bizon or yacc.
Title: Re: GLR Parser
Post by: infurl on August 19, 2011, 07:33:38 am
Quote
Lexical scanners like flex only accept regular expressions which are much less expressive than context free grammars
I think you are a bit confused. Flex doesn't use regular expressions, it's a lexical scanner. It's fast because it doesn't do the full parse, not because it uses regular expressions, that would make it a lot slower.
Flex is what you use to create the data for bizon or yacc.

From the flex lexical scanner manual http://flex.sourceforge.net/manual/Introduction.html#Introduction (http://flex.sourceforge.net/manual/Introduction.html#Introduction)

Quote
flex is a tool for generating scanners. A scanner is a program which recognizes lexical patterns in text. The flex program reads the given input files, or its standard input if no file names are given, for a description of a scanner to generate. The description is in the form of pairs of regular expressions and C code, called rules. flex generates as output a C source file, lex.yy.c by default, which defines a routine yylex(). This file can be compiled and linked with the flex runtime library to produce an executable. When the executable is run, it analyzes its input for occurrences of the regular expressions. Whenever it finds one, it executes the corresponding C code.

Jan, you did actually say something intelligent last week, I remember because it was so out of character for you, but mostly you just talk rubbish. However I'll give you some credit this time for at least realising that I wasn't talking about the product that Adobe has which is also called Flex.
Title: Re: GLR Parser
Post by: Bragi on August 19, 2011, 08:23:57 am
My bad, I had the flex def confused with that of coco. No need to be insulting though
Title: Re: GLR Parser
Post by: infurl on August 19, 2011, 08:30:29 am
My bad, I had the flex def confused with that of coco. No need to be insulting though

Accusing me of being confused isn't insulting? Maybe the word has different connotations in different cultures, but more often it is a way of insulting someone in such a way as to be able to feign innocence when challenged. Even if your command of English isn't good enough for you to know that (and I believe it is), it wouldn't be the first time that you tinged your comments with a little bit of spite would it. :)
Title: Re: GLR Parser
Post by: Bragi on August 19, 2011, 08:51:32 am
Ah, I think there we have the culture difference.  It's a statement that clearly bothers you, my apologies for that. It certainly wasn't meant in the way that you took it. Maybe you haven't noticed before, but I often am very confused (up to the point that I sometimes wonder what the hell I am doing in room x while I need to be in y). People say it to me all the time (it's not considered offensive for us, especially in my family since we all share the trait). I simply, in this case incorrectly, assumed that you were also confused. Turned out it was me again.
Title: Re: GLR Parser
Post by: infurl on August 23, 2011, 11:33:21 am
I've begun the task of translating my GLR parser generator from Common Lisp to C. So far I've rewritten the code that parses the grammar definition file and builds the data structures for it. The Common Lisp code that does this is entirely contained in the following three function definitions while the C version is more than 400 lines just for the data structure definitions and functions. Strictly speaking I should also include the 1800 lines of C for the parser driver and the 500 lines for the compiled parser definitions, although this part could probably be rewritten in a few hundred lines as a simple recursive descent parser instead.

Still, that's 23 lines of Common Lisp versus more than 3000 lines of C. The performance, portability, and scalability gains, if any, had better be worth the effort in the long run.  :uglystupid2:

Code
(defun read-grammar-name (istream)
  (let ((item (read istream nil)))
    (when item
      (etypecase item
        (number item)
        (keyword item)
        (symbol (symbol-name item))))))

(defun read-grammar-rule (istring)
  (when istring
    (with-input-from-string (istream istring)
      (loop :for token = (read-grammar-name istream)
            :while token :unless (keywordp token) :collect token))))

(defun read-grammar-file (file-name)
  (let ((*readtable* (copy-readtable nil)))
    (setf (readtable-case *readtable*) :preserve)
    (with-open-file (istream file-name :direction :input)
      (loop :for line = (read-line istream nil)
            :for rule = (read-grammar-rule line)
            :while line :when rule :collect rule))))