I guess it depends on the level at which the xml needs to be interpreted: is it the bot's internal file-processing mechanism that needs to process the xml and extract the text-only part, which is then sent to the pattern matcher for further processing, or does the pattern matcher itself need to process the xml?
-In the first case, you need a well-defined xml file format so that it can be properly queried (hardcoded stuff; perhaps you could use xslt to query the xml and make those xslt definitions dynamic?).
-For the second case: in the end, an xml file is also just a string, so if you have a pattern matcher that's powerful enough to handle the xml specification, you should be able to define a set of patterns that can read any type of xml, extract the text values, and use those for further processing.
A long time ago, I implemented a general-purpose xml parser (it was part of the library for a programming language that I developed), so I think I have a fairly good idea what it takes to make one. The first thing that came to mind was 'recursion' (pretty obvious): an xml element can have other xml elements as children, so the pattern matcher needs a way to declare a pattern that references itself. Handling all the file formats can also be a problem, but that should be done by the bot's internal file loading.
I'm not certain this can be done with AIML, since I don't know if it can handle recursion (can an AIML pattern reference another pattern or itself?). Furthermore, defining xml tags in xml files is tedious, at best.
From what I remember about the sourceforge version, it can't have patterns that reference other patterns, so that's out.
I'm not certain about chatscript, though I would be surprised if it couldn't do recursion, so I think it can parse xml. (Does anyone know this exactly? I'd really like to know.)
The pattern-matching language that I am using is based on compiler-generator techniques and should be able to handle the full xml specification, though it would definitely be slower than, say, a C# xml parser (I think). Code size is probably about the same as a compiler generator like coco/r.
Off the top of my head, it would look something like this:
TOPIC name: XMLElement
Rule name: Element
you say: <$FrontName {~XMLElement.attribs} >{~XMLElement.Element | $content} </$BackName>
<$FrontName/>
When: $BackName && ($FrontName != $BackName)
bot says: there was an error in the xml formatting
else
bot says: $content:Evaluate
Rule name: attrib
Inputs: $AttribName = \' $AttribValue \'
This is just a rough, untested sketch with big holes and errors; just a basic start for a simple xml element. The key here is the recursion: ~XMLElement.Element, which allows for nested xml elements.
Also: ':Evaluate' is actually called something different (I forgot the exact name; it's for the next release anyway). It sends the text part back to the pattern matcher for further evaluation.
This scheme probably also only works if the bot has the ability to turn certain patterns on/off. For instance, if you first need to extract the text out of the xml and process it separately, you need to make certain that the patterns which handle the content don't overrule the xml patterns. In my system, this can be done by turning an entire topic on and off.
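The recursion at the heart of the sketch is easier to see in ordinary code. Here is a minimal recursive-descent element reader in Python; a sketch only, with the names, the single-quoted-attribute convention, and the error message taken as illustrations of the pattern above rather than the actual engine:

```python
import re

# One token: an open/close/empty tag (single-quoted attributes only,
# matching the \' ... \' convention in the sketch) or a run of text.
TOKEN = re.compile(r"<(/?)([\w:.-]+)((?:\s+[\w:.-]+='[^']*')*)\s*(/?)>|([^<]+)")
ATTRIB = re.compile(r"([\w:.-]+)='([^']*)'")

def parse_element(s, pos=0):
    """Parse one element at pos; returns (node, new_pos) where a node is
    (name, attribs, children) and children mixes text and child nodes."""
    m = TOKEN.match(s, pos)
    if m is None or m.group(1) or not m.group(2):
        raise ValueError("expected an opening tag at position %d" % pos)
    name = m.group(2)
    attribs = dict(ATTRIB.findall(m.group(3) or ""))
    pos = m.end()
    children = []
    if m.group(4):                      # <name/> has no content or close tag
        return (name, attribs, children), pos
    while True:
        m = TOKEN.match(s, pos)
        if m is None:
            raise ValueError("unexpected end of input inside <%s>" % name)
        if m.group(5):                  # plain text: the $content part
            children.append(m.group(5))
            pos = m.end()
        elif m.group(1):                # close tag: front/back names must match
            if m.group(2) != name:
                raise ValueError("there was an error in the xml formatting")
            return (name, attribs, children), m.end()
        else:                           # nested element: recurse, as in
            child, pos = parse_element(s, pos)   # ~XMLElement.Element
            children.append(child)

def text_only(node):
    """Collect just the text parts, like sending $content back for evaluation."""
    return "".join(c if isinstance(c, str) else text_only(c)
                   for c in node[2])
```

The mismatch branch plays the same role as the "When: $BackName && ($FrontName != $BackName)" rule.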
Finally, if you use the neural network, anything is possible, you just need to code it out.
I did some thinking about this last night. It's actually an interesting concept: you could potentially 'feed' a bot with large amounts of data like this. One thing has been bothering me, though. As it is now, with my pattern matcher at least, it tries to use every word in the input as the start of a pattern, in parallel, which is great for regular input but just overkill for an xml-formatted file. It shouldn't produce different results, just too much processing. This can be solved, though, either by using a different input channel (there is text, int, image,... a new one would be xml), or by somehow adding a switch in the regular pattern-matching code to select between 2 modes: parallel or single shot.
You can capture the structure of the xml file as an asset or a thesaurus, and if you use a predefined xml file structure, you could even mix both. These assets/thesaurus structures can then be queried to retrieve information. I think that a lot depends on the structure of the xml file. You could make a topic that simply captures the 'raw' structure of the xml file, something like this:
//note: this is untested, so pseudo code
when #bot.xml
#bot.($FrontName).value = $content
#bot.($FrontName) = #bot.xml
#bot.xml = #bot.($FrontName)
#bot -= $FrontName
else
#bot.xml.($FrontName).value = $content
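To make the idea concrete, here is roughly what that raw-structure capture produces, with assets mimicked as nested Python dicts (an assumption for illustration only; duplicate tags would overwrite each other in this sketch, where a real implementation would keep lists):

```python
import xml.etree.ElementTree as ET

def capture(element):
    """Mimic the topic above: each element becomes a nested entry,
    with its text stored under a 'value' key (asset-like nested dicts)."""
    node = {}
    text = (element.text or "").strip()
    if text:
        node["value"] = text
    for child in element:
        node[child.tag] = capture(child)
    return node

bot = {"xml": capture(ET.fromstring(
    "<forecast><city>Brussels</city><temp>21</temp></forecast>"))}
# bot["xml"] is now {'city': {'value': 'Brussels'}, 'temp': {'value': '21'}}
```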
But you'll probably want to interpret the data a little more. With the google api, for instance, 'forecastInformation' can be used to look up the city in the database and, when it doesn't exist, create an asset for it. The second part can be used to store the weather information in the asset.
Do you perhaps have a specific xml file format in mind?
Could this "knowledge" then be retained within a file, or could other such files even be appended to create an ever larger knowledgebase?
Yes, once it's stored in the neural network, it remains there until a delete instruction is performed. So you can 'join' the data of multiple xml files.
Once the data is stored and additional data is likewise stored, would the two files be stored as individual files, or could they be merged together into one large (and potentially growing) file?
Sort of. The contents of the files will be merged into 1 dataset, but internally the xml files are no longer used; instead, the data is stored in a binary form, split across multiple database files (currently 8). So, once the data has been imported, it is stored internally and the xml files no longer have any purpose (other than retaining the data in text form).
From a usage point of view, it all depends on the structure used to store the content of the files: you can either mimic the file structure, or transform it into something different.
There would need to be certain parameters or even keywords for the bot to use so that, at a later date, it could go to a specific portion of said file to obtain the information and answer questions in that regard.
Probably. But if you write a parser for a specific xml file format so that it merges the data properly, it will simply become part of the general dataset. Take the google weather api (http://blog.programmableweb.com/2010/02/08/googles-secret-weather-api/), for instance: you could save the weather info into the asset that represents the city (and not just store the xml file 'as is'), something like so:
$city = $CityName:ResolvePerson //can also use $CityName:FindAssetFromValue(name) which looks for an asset where $cityName is the value, 'name' is the attrib
#city.weather = $condition //store 'cloudy' or some other value in the 'weather' field
#city.temp = $temp //store the actual temperature
//note: maybe also need to store some date info?
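As a hedged illustration of that import step, here is a Python sketch; the element names are what I recall of the (now retired) google weather response and should be checked against a real sample, and dicts stand in for assets:

```python
import xml.etree.ElementTree as ET

# Assumed shape of the google weather reply; illustrative only.
sample = """<xml_api_reply><weather>
  <forecast_information><city data="Antwerp"/></forecast_information>
  <current_conditions>
    <condition data="Cloudy"/>
    <temp_c data="21"/>
  </current_conditions>
</weather></xml_api_reply>"""

def import_weather(xml_text, cities):
    """Look the city up (create its asset if missing) and store the
    condition and temperature on it, like the pattern sketch above."""
    root = ET.fromstring(xml_text)
    name = root.find(".//forecast_information/city").get("data")
    city = cities.setdefault(name, {"name": name})  # ResolvePerson-ish lookup
    city["weather"] = root.find(".//current_conditions/condition").get("data")
    city["temp"] = root.find(".//current_conditions/temp_c").get("data")
    return city

cities = {}
import_weather(sample, cities)
```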
Then you can return this info if a user asks for the weather in that city, like so (again untested, so pseudo code):
input: what's the weather in $cityName
calculate:
$city = $cityName:ResolvePerson
output when: $city && #city.weather
It is #city.weather in $cityName. It's currently #city.temp
output when: $city
I know that place, but I don't know what the weather is like over there.
else: Never heard of $cityName
Note: in the code, I am switching between $city and #city. This has to do with how you want to approach/use the variable content. When you write $value, it's just a regular variable, so when you assign to $city, the assigned value is temporarily retained under that variable name. #city is asset specific, so it will do some transformations on the variable content (like making certain that there is always just 1 item stored in the asset variable; regular vars like $city can actually contain a list of values).
In short, when you want to store a temporary value for further calculation, use $xxx (like $city = xxxx), but if you need to do an asset operation, like storing something as an asset value or retrieving an asset value, #xxx needs to be used (it takes some getting used to).
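The three output branches can also be mirrored in plain code; a Python sketch over the same kind of city asset, with dicts standing in for assets and a plain lookup standing in for :ResolvePerson (all names illustrative):

```python
def answer_weather(city_name, cities):
    """Mirror the three output branches: known city with weather data,
    known city without weather data, unknown city."""
    city = cities.get(city_name)        # plays the role of :ResolvePerson
    if city is not None and "weather" in city:
        return "It is %s in %s. It's currently %s" % (
            city["weather"], city_name, city["temp"])
    if city is not None:
        return ("I know that place, but I don't know "
                "what the weather is like over there.")
    return "Never heard of %s" % city_name

# Sample assets: one city with weather data, one without.
cities = {"Antwerp": {"weather": "Cloudy", "temp": "21"}, "Ghent": {}}
```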
Here's something I could relatively easily do:
In the chatbot designer, add a new menu item to import any 'generic' file. When you do 'file/import/generic', you can first select 1 or more files (xml or another type of text file) that need to be imported. Next, you select the topics that should be used for 'reading' and importing the content. Only patterns from those topics would be used during the pattern-matching process. This could also be triggered from within the patterns: when sending some data back to the pattern matcher as input, you could specify the projects that are allowed to be used (this is relatively easy to add).
Also, as a side note: suppose you have 2 xml file formats which contain similar data, but formatted differently (with different labels and so on). If you use thesaurus variables for the xml element and attribute names, you can most likely (though probably not always) create 1 pattern definition able to handle both formats. This could be something like:
<^Front:noun.WeatherTag {~XMLElement.attribs} >{~XMLElement.Element | $content} </$BackName>
<^Back:noun.WeatherTag/>
Something I forgot to mention: the designer application already supports 'asset xml' files. These allow you to import/export asset data. So if you have an xml file, you can transform it (using xslt or something else) into the asset xml format and import that.
I've also fixed the asset editor, so starting from the next release, you can edit this data as shown in the attached image.
The xml file for the data in the image looks like this:
<?xml version="1.0" encoding="utf-8"?>
<Asset ID="93b3585d-66b0-4214-ab81-817d735b0f5e">
<Name>user</Name>
<Items>
<Item>
<Attribute>
<Text Value="name" />
</Attribute>
<Data>
<DataItem>
<Meaning>Value</Meaning>
<Value>
<Text Value="jan" />
</Value>
</DataItem>
</Data>
</Item>
<Item>
<Attribute>
<Text Value="birthday" />
</Attribute>
<Data>
<DataItem>
<Meaning>Value</Meaning>
<Time>21/07/2011 0:00:00</Time>
</DataItem>
</Data>
</Item>
<Item>
<Attribute>
<Text Value="hand" />
</Attribute>
<Data>
<DataItem>
<Meaning>Value</Meaning>
<Children ID="d0761ff6-8cb3-425b-a111-e3591d40dc46" IsRoot="False">
<Item>
<Attribute>
<Text Value="location" />
</Attribute>
<Data>
<DataItem>
<Meaning>Value</Meaning>
<Value>
<Text Value="left" />
</Value>
</DataItem>
</Data>
</Item>
</Children>
</DataItem>
</Data>
</Item>
</Items>
</Asset>
It's a bit verbose, and this is just a subset of every possible element, but it describes the data in full detail.
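For illustration, a small Python sketch that emits this Asset layout from a plain dict could be one way to transform an arbitrary xml file into the format (only the Text/Value item kind from the sample is generated; Time and nested Children items are left out, and the GUID is freshly generated rather than meaningful):

```python
import uuid
import xml.etree.ElementTree as ET

def dict_to_asset(name, values):
    """Build an <Asset> tree in the layout shown above from a flat dict
    of attribute -> value pairs (Text values only, for brevity)."""
    asset = ET.Element("Asset", ID=str(uuid.uuid4()))
    ET.SubElement(asset, "Name").text = name
    items = ET.SubElement(asset, "Items")
    for attrib, value in values.items():
        item = ET.SubElement(items, "Item")
        a = ET.SubElement(item, "Attribute")
        ET.SubElement(a, "Text", Value=attrib)
        d = ET.SubElement(ET.SubElement(item, "Data"), "DataItem")
        ET.SubElement(d, "Meaning").text = "Value"
        v = ET.SubElement(d, "Value")
        ET.SubElement(v, "Text", Value=str(value))
    return asset

xml_text = ET.tostring(dict_to_asset("user", {"name": "jan"}), encoding="unicode")
```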
Yes, if you have the patterns and queries defined. For instance, suppose you have a bunch of records on the birthdays of famous people. You could transform each record into an asset like the one above (with the attributes 'name' and 'birthday', and a 'value' data field for both) and import all of them into the chatbot.
Next, you need some patterns and queries like so:
input: when was ^name:noun.name born?
calculate: $person = $name:resolvePerson
output when $person && #person.birthday
$name was born on #person.birthday:month / #person.birthday:day / #person.birthday:year
output when $person
I have heard of $name, but I don't know his birthday.
else
I don't know $name
//this is not tested, so pseudo again.
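The birthday patterns above can be sketched the same way in Python, with dicts standing in for the imported assets and a plain lookup for resolvePerson (the names and dates are just sample data):

```python
from datetime import date

# Hypothetical in-memory stand-in for the imported birthday assets.
people = {
    "Ada Lovelace": {"birthday": date(1815, 12, 10)},
    "Alan Turing": {},
}

def answer_birthday(name):
    """The three output branches of the pattern above."""
    person = people.get(name)           # stands in for :resolvePerson
    if person is not None and "birthday" in person:
        b = person["birthday"]
        return "%s was born on %d/%d/%d" % (name, b.month, b.day, b.year)
    if person is not None:
        return "I have heard of %s, but I don't know his birthday." % name
    return "I don't know %s" % name
```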