Part of the reason she seems better, I figure, is that she is written with a theory of chat in mind. And the engine directly supports a theory of chat. From more of my early notes:
Chat is generally an exchange of words between two participants in which facts and opinions are traded. The exchange itself is important. One does not like to give out much information, particularly personal information, without receiving some back. Hence a common flow is for one side to volunteer something and, perhaps after optional follow-up questions about it, the other side returns a similar bit of information. An equivalent flow is for one side to ask a question; the other answers it and either flips the question back or expects the originator to volunteer the equivalent information automatically.
Chat is predictable in that topics tend to last for a bit, going into deeper and deeper material after skimming off the common initial information. "Where do you live?" can eventually lead to "What tourist attractions are there?"
Chat is interesting in that it is unpredictable and goes off on unexpected tangents. I say "I like onions" and you react with a rant about "The Onion", some liberal writer's magazine.
------
The following is a statistical analysis of Jabberwacky chat against humans. Nearly 30,000 sentences were analyzed, and I count a word once per sentence even if it shows up multiple times in that sentence. The input chat line itself may consist of one or more sentences at a time. If a sentence begins with an interjection, I consider that a separate sentence. So I actually fed in 26,378 user input lines, which became roughly 30,000 sentences.
Chat is heavily biased toward the pronouns you (9935) and I (8315), as those are the primary areas of interest. In fact, you was the most frequent of all real words. Sentences containing you, your, I, or my make up 2/3 of all chat.
The what (2114) question dominates the w-words, with intermediate use of how (863) and why (646) questions and low use of where (232) or when (166). Do (3832) and can (961) are common as questions or statements, as is speaking about what is liked or favorite (1107). Some flavor of yes (1977) or no (1218) makes up 10% of all sentences.
The most common human-uttered single or composite word sentences in order: some form of yes (937), some form of no (584), some form of goodbye (140), some form of happy expression (106), a standalone why question (101), some form of thanks (99), some form of funny (63), some form of apology (51), a standalone what question (48).
The most common words are: period (22496), the verb be (10440), I/me (10308), you (9935), question mark (7227), not (4617), do (3832), a (3487), comma (3396), to (3137), that (2880), the (2423), it (2134), what (2114), some form of yes (1977), exclamation mark (1873), the verb have (1604), know (1356), and (1276), some form of no (1218), of (1181), your (1108), like/favorite (1107), can (924), so (904), my (898). The period shows up so heavily because when I split an interjection into a separate sentence, I add a period after it.
Negation (not, never, xxxn't) appears in 1/6 of all sentences. Of the nots, about 15% negate an adjective that could simply be flipped to its positive form.
--------
Issues in Pattern Matching
Our chatbot works as an expert system, matching rules against the input information to find a response. The system has various collections of rules (called topics) that are executed to see if they match. Some topics execute all rules to find all matches. Most execute only until a match is found. When considering pattern matching there are some fundamental issues that coexist in tension.
Randomness/Variety
First is the issue of randomness and variety. We try not to have the user see the same output, or be able to predict the output with certainty. To ensure little repetition the system tracks the last 20 replies given by the program, and if a new reply completely matches one of those, it is blocked from being used for now. This becomes equivalent to the pattern failing, and the system will go find a new pattern to use.
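A minimal Python sketch of that repeat-blocking idea. The 20-reply window and the treat-it-as-a-failed-pattern behavior come straight from the description above; the names and the surrounding plumbing are just assumptions for illustration.

from collections import deque

class ReplyHistory:
    # Remembers the last N replies so an identical reply can be rejected.
    def __init__(self, window=20):
        self.recent = deque(maxlen=window)

    def allowed(self, reply):
        return reply not in self.recent

    def record(self, reply):
        self.recent.append(reply)

def choose_reply(candidate_replies, history):
    # A blocked reply acts like a failed pattern: just try the next candidate.
    for reply in candidate_replies:
        if history.allowed(reply):
            history.record(reply)
            return reply
    return None  # nothing usable here; the caller falls through to another topic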
To ensure randomness within common replies, one can use a feature that randomly picks a phrase from a collection of phrases. To ensure randomness within a topic, the system can be told to order the responders randomly before testing them. To ensure randomness across topics, the SQL queries that find topics matching keywords return the list of topics in a random order (subject to best match, however).
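The three layers of randomness could be sketched like this in Python (the helper names are made up, and the real engine does the topic-level shuffle inside its SQL query rather than in code like this):

import random

def pick_phrase(equivalent_phrases):
    # Randomness within a reply: pick one phrasing from a set of equivalents.
    return random.choice(equivalent_phrases)

def shuffled_responders(responders):
    # Randomness within a topic: test responders in a random order when asked to.
    order = list(responders)
    random.shuffle(order)
    return order

def shuffled_topics(matching_topics):
    # Randomness across topics: the list of keyword-matching topics arrives shuffled,
    # subject to the best-match scoring discussed later.
    order = list(matching_topics)
    random.shuffle(order)
    return order

print(pick_phrase(["Sure.", "Of course.", "Absolutely."]))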
Variety extends beyond randomness. You want to vary your sentence structures and lengths as well, maybe sometimes answering in full sentences and sometimes in fragments. The system classifies responses by number of words, and you can explicitly request short, medium or long responses. This affects the quibbling area only, because that generally has multiple ways of quibbling within a single rule.
Priority
Second is the issue of priority. You want user-known information to have priority over generally known information, which in turn should have priority over mere stalling or quibbling chat. And specific replies should have priority over general ones. The current topic should have priority over other topics, so if a question or statement has a response within the current topic, there is no reason to change to some other topic that might also answer it. On the other hand, if the current topic does not have a specific answer and some other topic does, it makes sense to change topics rather than stay in the current one and make a gambit response or quibble.
To control priority some other chat languages allow you to assign a priority number to a response. This is overkill and hard to manage. Instead, the standard topic processes responders in order, so you just place them in the priority order you want (usually most specific matching first and more general matching last). Similarly, gambits often have a flow that tells a story, and should not normally be scrambled. They execute out of order only if needed to answer a statement or question of the user. And, unless the topic is marked otherwise, once the system jumps to a later gambit it will continue dishing out gambits from there, continuing that part of the story. Only when it reaches the end of the gambits will it return to the earlier ones to use those up as well.
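One way to picture this is the Python sketch below (the data layout is assumed, not the engine's actual one): responders are tried strictly in written order, and gambits are dished out from a story index that only moves forward.

class Topic:
    def __init__(self, responders, gambits):
        self.responders = responders   # list of (pattern, reply), already in priority order
        self.gambits = gambits         # the story, in the order it should be told
        self.next_gambit = 0           # where the story currently stands

    def respond(self, matches, user_input):
        # The first responder whose pattern matches wins, so file order is the priority.
        for pattern, reply in self.responders:
            if matches(pattern, user_input):
                return reply
        return None

    def gambit(self):
        # Volunteer the next story line; if a responder jumped ahead, this index
        # would have been moved forward so the story continues from there.
        if self.next_gambit >= len(self.gambits):
            return None
        line = self.gambits[self.next_gambit]
        self.next_gambit += 1
        return line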
We prioritize when choosing which words of a sentence to match against topics to find a match. Sequences of words have priority over single words (so "racial discrimination" has priority over any individual word in the sentence). Fundamental nouns (subject and direct object) have priority over verbs, which have priority over all other words. We also prioritize topics that have more keywords matching the input than others.
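That word-priority order might look like this in a Python sketch (the noun and verb sets would really come from the engine's part-of-speech data; here they are assumed inputs):

def ordered_match_keys(sentence_words, known_phrases, nouns, verbs):
    # Multi-word sequences first, then key nouns, then verbs, then everything else.
    text = " ".join(sentence_words)
    keys = [p for p in known_phrases if p in text]        # "racial discrimination" beats its parts
    keys += [w for w in sentence_words if w in nouns]     # subject and direct object
    keys += [w for w in sentence_words if w in verbs]
    keys += [w for w in sentence_words if w not in nouns and w not in verbs]
    return keys

print(ordered_match_keys(
    "i hate racial discrimination".split(),
    known_phrases=["racial discrimination"],
    nouns={"discrimination"},
    verbs={"hate"}))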
Priority is also managed by the collections being processed in a priority order. These include the current topic, a special system topic, other user topics, the general quibble topics, etc. Within a topic, one can request that a subtopic take control if it can match. So one can group collections of patterns within a topic by a common matching characteristic and then execute the subtopic. This is used, for example, in the quibble system topic. If you use the word not or never this tends to drastically alter the intent of a sentence, so a special negative subtopic is matched FIRST. Only if it finds no match will the quibble topic move on to the other quibble choices.
Reuse
Third is the issue of reuse. Replies to specific statements or questions are information that might also be spontaneously volunteered as gambits when within a topic. To support this, patterns can have labels attached to them, and other patterns can direct that their output reuse a label, meaning produce the output that the labelled pattern would produce. Therefore most statement or question responders execute a reuse on some gambit within the topic, making it possible to tell the user as much as possible, yet reply as focused as possible if the user asks a question or makes a statement.
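A tiny Python sketch of the label/reuse idea (labels and reuse are from the description above; the class and label names are invented for illustration):

class LabelledTopic:
    def __init__(self):
        self.labelled_output = {}          # label -> what that gambit would say

    def add_gambit(self, label, output):
        self.labelled_output[label] = output

    def reuse(self, label):
        # A responder points at a labelled gambit and simply emits its output.
        # If the gambit has been erased (see below), the reuse fails and some
        # other pattern gets its chance instead.
        return self.labelled_output.get(label)

topic = LabelledTopic()
topic.add_gambit("MY_JOB", "I work as a writer.")
print(topic.reuse("MY_JOB"))   # the responder for "what do you do?" reuses the gambit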
But in a normal conversation, once I have told you something, I am not expecting to tell it to you again. If I tell you I am a writer, I expect you to remember, and I shouldn't volunteer that again. (Correspondingly, you shouldn't ask me what I do again.) This is handled in normal user topics by actually erasing the pattern after it gets used and saving that topic's state within the specific user's datafile (so each user has a different state of the system based on their chat so far). If the label target of a reuse no longer exists, then the question or statement cannot respond and a different pattern is matched instead. So if the system has answered what do you do, it cannot accidentally volunteer the same answer later.
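Here is what the erase-and-remember mechanics could look like in Python, assuming one small state file per user (the JSON layout and file naming are my assumptions; the engine's real per-user datafile format is not shown here):

import json, os

def load_used_rules(user_id, folder="users"):
    # Each user gets their own record of which rules have already fired.
    try:
        with open(os.path.join(folder, user_id + ".json")) as f:
            return set(json.load(f))
    except FileNotFoundError:
        return set()

def save_used_rules(user_id, used_rules, folder="users"):
    os.makedirs(folder, exist_ok=True)
    with open(os.path.join(folder, user_id + ".json"), "w") as f:
        json.dump(sorted(used_rules), f)

def fire_rule(rule_id, used_rules):
    # Once a rule has spoken it is "erased" for this user: it behaves as if
    # the pattern no longer exists, so it can never be volunteered again.
    if rule_id in used_rules:
        return False
    used_rules.add(rule_id)
    return True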
In effect, chatting with someone is a self-extinguishing process. Without new data coming into the system, eventually you will have talked about everything you know and have nothing new to say to someone. This happens a lot in real life too, though there new data is usually coming in. In the system, however, there is rarely new data coming in. Should you use up all of its data, it will reset itself and start back at the beginning. The overall system does allow you to add or revise topics, and it will continue to work fine with the existing state of all users after you restart the server (restarting is required to add new topics but not merely to modify existing ones).
Specificity
Fourth is the issue of pattern specificity. If you write your pattern to exactly match an input, it will be triggered correctly, but all sorts of closely related input will fail. E.g., if your pattern is "what is my name" then it would fail on What is my given name? or What in heck is my name? At the other extreme, if your pattern is just "name", then it would match all forms of questions involving name, but would also match things like "Name an animal you like." The system does not manage this issue directly, but instead gives you a range of choices in how you express the pattern you want. The keywords AND, THEN, SOON, NEXT control the ordering of words being looked for. If you literally want a sequence of words, you would use NEXT (a quoted expression like "what is my name" actually decodes automatically to that sequence of words separated by NEXTs). Except for special idioms which need this behavior, you are better off using THEN or SOON, which allow other words to intrude without impact. THEN allows any number of intervening words and SOON allows up to two. If your pattern is WHAT THEN IS THEN MY THEN NAME, you have named the essence of a larger set of sentences. In fact, you could probably go WHAT THEN MY THEN NAME safely. It's like trying to guess the minimum number of words you need to understand over a distorting phone to know what is being asked.
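The NEXT/THEN/SOON distinction can be illustrated with a small Python matcher. This is only my sketch of the semantics described above, not the engine's actual matcher:

def match_in_order(words, pattern_words, max_gap):
    # max_gap=0 mimics NEXT (adjacent words), max_gap=2 mimics SOON,
    # max_gap=None mimics THEN (any number of intervening words).
    pos = -1
    for target in pattern_words:
        found = None
        for i in range(pos + 1, len(words)):
            if max_gap is not None and pos >= 0 and i - pos - 1 > max_gap:
                break
            if words[i] == target:
                found = i
                break
        if found is None:
            return False
        pos = found
    return True

sentence = "what in heck is my name".split()
print(match_in_order(sentence, ["what", "is", "my", "name"], max_gap=None))  # THEN: True
print(match_in_order(sentence, ["what", "is", "my", "name"], max_gap=0))     # NEXT: False
print(match_in_order(sentence, ["what", "is", "my", "name"], max_gap=2))     # SOON: True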
The minimum number of words you need often depends on context. If you are talking about favorite things, getting the input Movie? is enough to know you are being asked "what is your favorite movie". There are pattern features that allow you to make a pattern contingent upon being within the current topic or upon also matching keywords of the topic. So sentences that reference the topic can get you into the topic, and once the topic becomes current, you don't need to use those keywords to keep matching other patterns.
Patterns in continuations have similar issues but can usually be less specific. If you have made a statement like "I like this book a lot", the pattern looking at the user's response to that might well just be why, because it's the obvious question and it doesn't matter whether it's why do you like it or why is it so interesting; the odds that the user will say why do you go swimming on Tuesdays are extremely low in a continuation context. And even if they did, they wouldn't find it amiss if you ignored their input and answered as though they had asked why you liked it. However, you have to be careful not to be as general as a single wildcard "*" that matches anything. The problem is that if a user wants to change topics by asking some clearly different question, the system will ignore him and pretend he asked for the continuation. Using the pattern "what" is better (if the question would logically be what), except that what is a common question in a lot of topics. You may not be able to win here, and again it isn't fatal if the system just plows ahead in place.
Orthogonality
Fifth is the issue of orthogonality. It is important to be able to add new data without worrying about how it affects existing data. The system strongly supports separation of data into topics such that interactions are minimal. If two topics have overlapping responders and either could handle the input, it doesn't matter. If you are not in those topics, one of them will get chosen, handle the input, and steer the conversation that way. If you are already in one of the topics, inertia will keep you there and the answer will come from that topic.
It could be that the less congruent topic gets chosen to start with. If a sentence clearly belongs more to one topic, you should decide how you know that. Topic picking comes from finding which topic has more keywords or keyword sequences covering the sentence. Adding keyword phrases to a topic will help focus which gets chosen (phrases are worth more than the sum of their individual words). For example, the topic baseball may have (baseball player umpire coach ball bat) as keywords and the topic baseball_job may have (baseball player coach umpire). This will not route I am a professional baseball player to the more significant topic of the job. But adding the word professional or the phrase "professional baseball" to that topic's keywords would route this toward baseball_job without cluttering the keyword field with irrelevant words.
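A rough Python sketch of that scoring (the exact weights the engine uses are not given above, so the phrase bonus here is just an assumption to show that a phrase counts for more than its parts):

def topic_score(sentence_words, keywords, phrase_bonus=1):
    text = " ".join(sentence_words)
    score = 0
    for key in keywords:
        if " " in key:                      # a keyword phrase
            if key in text:
                score += len(key.split()) + phrase_bonus
        elif key in sentence_words:         # a single keyword
            score += 1
    return score

sentence = "i am a professional baseball player".split()
baseball     = ["baseball", "player", "umpire", "coach", "ball", "bat"]
baseball_job = ["baseball", "player", "coach", "umpire", "professional baseball"]
print(topic_score(sentence, baseball))       # 2
print(topic_score(sentence, baseball_job))   # 5, so the input routes toward baseball_job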
Within a topic, if you want to ensure two overlapping responders go to different places, make sure one includes some important word and the other says explicitly that it does not include it. For example, what is your favorite show could be coded as (WHAT AND FAVORITE AND SHOW) and what is your favorite kind of show as (WHAT AND FAVORITE AND KIND AND SHOW). But they collide because both can match the latter question. You can reformulate the former as (WHAT AND FAVORITE AND SHOW AND !kind) to ensure they don't collide.
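In Python terms, the collision and its fix look like this (the word-level matching is simplified; the engine's patterns do more than this):

def matches(sentence, required, excluded=()):
    # AND every required word; fail if any excluded word appears (the !kind idea).
    words = set(sentence.split())
    return all(w in words for w in required) and not any(w in words for w in excluded)

q1 = "what is your favorite show"
q2 = "what is your favorite kind of show"
print(matches(q1, ["what", "favorite", "show"], excluded=["kind"]))   # True
print(matches(q2, ["what", "favorite", "show"], excluded=["kind"]))   # False, no collision
print(matches(q2, ["what", "favorite", "kind", "show"]))              # True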
-----------------------------------
A topic begins with all of its t-lines (topic, or gambit, lines) first, as well as any continuation lines attached to them. They tell the story of the topic. The story should usually alternate between asking questions of the user and volunteering corresponding data, probably in that order. Only asking questions gets boring quickly and the user feels cheated. Only giving information deprives the user of the sense that you are interested in them.
The order of asking then volunteering is important. If you volunteer information first (e.g., I own a dog), the user may ask a follow-up question (which a responder or continuation line could manage) or the user may volunteer some related information (e.g., I've heard there are over 300 breeds of dogs). Asking a question first helps force the user into a narrower range of responses. Then, when you volunteer your corresponding data, it seems like a fair exchange has happened and that continuity is being maintained.
When you write a topic sentence you should generally make it a stand-alone complete sentence. Sentence fragments and brief word answers are fine in continuations and responders because the user has the context of his triggering sentence. A topic sentence does not. It may or may not actually come immediately after its preceding topic sentence. The user may drag the conversation off to a completely different topic for ages, and then when that is done, the system may pop back to this topic and issue the current topic sentence. Any prior context of this topic has long since been forgotten.