wikimedia resources

infurl · « **on:** March 15, 2018, 12:39:20 am »

I'm not sure how many of you are specialised in natural language understanding but I just found a publicly accessible project which is about to be released and which you might find interesting too.

https://www.wikidata.org/wiki/Wikidata:Lexicographical_data

Wikimedia is probably the greatest (albeit flawed) single source of information on the planet today, and I've heard Wikipedia described as one of the wonders of the modern world. Unfortunately, all that information is encoded as wikitext, a human-friendly but machine-unfriendly format, which makes it quite hard to extract information for computing purposes. There are a number of related projects which attempt to address this, such as ConceptNet5 and Yago3, which extract facts from Wikipedia and Wiktionary and ground them by combining them with other resources such as WordNet.

http://www.conceptnet.io/

https://www.mpi-inf.mpg.de/departments/databases-and-information-systems/research/yago-naga/yago/

Wikidata is the branch of Wikimedia which has been manually extracting facts from Wikipedia and putting them in a machine readable format, and from next month, they will start to include lexical information from Wiktionary as well. This may or may not save me a lot of trouble. Maybe it will be helpful to you too.

Freddy · « **Reply #1 on:** March 15, 2018, 01:31:06 am »

Thanks, yes I might find that useful. I have some basic functions in ElfScript that use the Wikidata API - this project may help me expand that some more when I get back to working on it

infurl · « **Reply #2 on:** March 15, 2018, 10:30:23 pm »

Out of curiosity I downloaded the existing WikiData file (the full version) and loaded it into a database. The data file is 20GB compressed of JSON data and the resulting database table was 120GB. It took six hours to load and I'm still building an index on the JSON data. I think I will write a C program to unpack the JSON data into a relational model once I've had a chance to analyse it. It might be worth trying to update it incrementally as well, rather than reprocess the whole lot each time.
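A rough sketch of the unpacking step, in Python rather than C for brevity. It assumes the dump is one big JSON array with one entity per line (each line ending in a comma), and that entities carry `id` and `claims` fields shaped like the tiny sample below; check both against the dump you actually download. The function name `iter_item_triples` and the sample data are made up for illustration.

```python
import io
import json

def iter_item_triples(lines):
    """Yield (subject, property, object) for claims whose value is another entity."""
    for line in lines:
        line = line.strip().rstrip(",")
        if line in ("", "[", "]"):
            continue  # array brackets and blank lines carry no entity
        entity = json.loads(line)
        for prop, statements in entity.get("claims", {}).items():
            for statement in statements:
                snak = statement.get("mainsnak", {})
                if snak.get("snaktype") != "value":
                    continue  # snaks without a concrete value have no datavalue
                value = snak["datavalue"]["value"]
                if isinstance(value, dict) and "id" in value:
                    yield (entity["id"], prop, value["id"])

# Hypothetical two-entity sample in the assumed one-entity-per-line layout.
SAMPLE = io.StringIO(
    '[\n'
    '{"id": "Q42", "claims": {"P31": [{"mainsnak": {"snaktype": "value", '
    '"datavalue": {"value": {"id": "Q5"}}}}]}},\n'
    '{"id": "Q5", "claims": {}}\n'
    ']\n'
)
triples = list(iter_item_triples(SAMPLE))
print(triples)
```

Because it reads one line at a time, the same loop could be fed from a compressed file opened in text mode (for example with `bz2.open(path, "rt")`) without ever holding the whole dump in memory.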

infurl · « **Reply #3 on:** March 16, 2018, 10:44:42 am »

Finally done. The index took 25 hours to build and takes 30GB on its own. I'm pretty sure a triple store would be a much better way to do this. JSON may be fine for exchanging small amounts of data but a database it isn't.

Freddy · « **Reply #4 on:** March 16, 2018, 02:23:52 pm »

That kind of thing is why I just went to online versions, but it's nice to have it local.

infurl · « **Reply #5 on:** March 16, 2018, 08:26:20 pm »

These are only little databases but at least the technology continues to improve to bring the cost of processing down. Yago is a third-party project that is similar to WikiData but also combines it with other knowledge bases such as GeoNames, WordNet and ConceptNet. I processed an earlier version of Yago without the benefit of SSDs and it took a whole week. However, I also wrote a NoSQL solution in C which could build a completely cross-checked and indexed implementation of Yago in less than an hour. I used techniques like sorting binary files from disk to disk to get a hundredfold speed-up. One of the things that I discovered was that millions of Yago's records were corrupted due to a bug in their processing environment. I reported it and it was subsequently fixed.

The trouble with these knowledge resources is that you never really know what's in them until you get them onto your own system in a form that you can process quickly and easily. Even better is to ground them all on a common base line so they can be merged, compared and cross referenced.

Freddy · « **Reply #6 on:** March 17, 2018, 12:29:42 am »

I suppose that after all the shoehorning, how useful it is depends a lot on reference time. Have you been able to get fast responses from all this data? Not that it has to be lightning fast; I'm only thinking of it from a chatbot perspective when I ask that, because that's mainly what I do.

infurl · « **Reply #7 on:** March 17, 2018, 12:54:53 am »

In JSON form it's not very fast at all, though it's useable with the index on it. I'm in the process of extracting the contents into a relational database in optimal normal form. Once that's done it should be able to execute thousands of queries per second and many more applications will be possible.

While relational form is the fastest, ultimately the most versatile format is "triples" or RDF. WikiData can already be downloaded as triples but I'm not sure it's complete or up to date. At any rate, they strongly recommend the JSON form so I'm working from that.

When you have triples you have exactly one fact per record. Each record has a predicate, a subject and an object. The predicate defines the relationship between the subject and the object. Predicates are predefined and limited but subject and object can be anything, including other triples. In practice a fourth field is added giving every triple a unique identifier. This is called reification and it allows triples to refer to each other more efficiently. It's also useful to add a fifth field which records a weight or confidence in the fact. This also allows facts that are known to be wrong to be rejected so they're not continuously being reinstated.
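To make the record layout concrete, here is a minimal Python sketch of a triple with the two extra fields described above. The class name, the field names, and the convention that an integer subject or object refers to another triple's identifier are all assumptions made for the example, not part of any standard.

```python
from dataclasses import dataclass
from typing import Union

@dataclass(frozen=True)
class Triple:
    id: int                    # fourth field: unique identifier
    predicate: str
    subject: Union[str, int]   # a term, or the id of another triple (assumed convention)
    object: Union[str, int]
    weight: float = 1.0        # fifth field: confidence; 0.0 marks a rejected fact

facts = [
    Triple(1, "invented", "Johannes Gutenberg", "printing press"),
    # A statement about triple 1, made by referring to its identifier.
    Triple(2, "source", 1, "Wikipedia", weight=0.9),
    # Known to be wrong, but kept with zero weight so it is not reinstated later.
    Triple(3, "invented", "Albert Einstein", "printing press", weight=0.0),
]

by_id = {t.id: t for t in facts}
believed = [t for t in facts if t.weight > 0.0]
about = by_id[facts[1].subject]   # follow the reference from triple 2 to triple 1
print(about.subject, about.predicate, about.object)
print(len(believed))
```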

The great advantage with triples is that if you have multiple knowledge bases grounded in a common ontology, they can be easily merged or interconnected without any data conversion worries. This makes them incredibly powerful and you will find that there are now thousands of such interconnected knowledge bases all over the web, some of them containing trillions of facts.

Don Patrick · « **Reply #8 on:** March 17, 2018, 09:12:13 am »

Quote

In practice a fourth field is added giving every triple a unique identifier. This is called reification and it allows triples to refer to each other more efficiently.

This is something I've been wanting to implement, however, my knowledge database is in a constant state of change due to automated learning: A fact that was there before may have been deleted months later (e.g. if a fact turned out to be negligible rubbish). Wouldn't these identifiers require the database contents to remain constant, and wouldn't one run into a limit on integer size for the identifiers if the database becomes too big? Or if the identifiers are not direct memory addresses, do they refer to an index?

infurl · « **Reply #9 on:** March 17, 2018, 09:39:06 am »

If you are simulating a triple store with a relational database then you would assign a serial number to each triple and index it. Then it doesn't matter if there are gaps, as long as you never change a serial number once it's assigned. When you add a new fact, if it is already there, you use the existing record instead of creating a new one and return the existing serial number. As I mentioned previously, you can also use a weight attribute to mark deleted facts because even if you find something to be wrong, you want to avoid repeatedly putting it back in the knowledge base. 32-bit integers allow up to four billion distinct numbers which is plenty for a domain specific knowledge base, but you can opt for 64-bit integers without any problem.
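The serial number scheme can be sketched with SQLite from Python's standard library. The table and column names here are hypothetical. `AUTOINCREMENT` is used so that a serial number is never handed out twice even after deletions, and SQLite integers are 64-bit, so the four billion limit does not apply.

```python
import sqlite3

db = sqlite3.connect(":memory:")
db.execute("""
    CREATE TABLE triple (
        id        INTEGER PRIMARY KEY AUTOINCREMENT,  -- serial number, never reused
        predicate TEXT NOT NULL,
        subject   TEXT NOT NULL,
        object    TEXT NOT NULL,
        weight    REAL NOT NULL DEFAULT 1.0,
        UNIQUE (predicate, subject, object)           -- also serves as the lookup index
    )
""")

def add_fact(predicate, subject, obj):
    """Return the serial number of the fact, inserting it only if it is new."""
    row = db.execute(
        "SELECT id FROM triple WHERE predicate = ? AND subject = ? AND object = ?",
        (predicate, subject, obj),
    ).fetchone()
    if row:
        return row[0]
    cur = db.execute(
        "INSERT INTO triple (predicate, subject, object) VALUES (?, ?, ?)",
        (predicate, subject, obj),
    )
    return cur.lastrowid

def reject_fact(fact_id):
    """Mark a fact as wrong without deleting it, so re-adding it changes nothing."""
    db.execute("UPDATE triple SET weight = 0.0 WHERE id = ?", (fact_id,))

a = add_fact("capital_of", "Paris", "France")
b = add_fact("capital_of", "Lyon", "France")
reject_fact(b)
c = add_fact("capital_of", "Lyon", "France")   # same serial number, still rejected
weight = db.execute("SELECT weight FROM triple WHERE id = ?", (c,)).fetchone()[0]
print(a, b, c, weight)
```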

I also use a packed binary format for gathering and merging knowledge bases. It omits the serial number and instead keeps all the references renumbered to index the records as an array. As it is so compact I can make three copies of the triples and sort each copy on each field. This allows incredibly fast binary searching and keeps memory and storage requirements to a minimum. I wouldn't recommend doing this yourself though unless you're prepared to write a lot of low level custom database operations in C. Other than the ability to merge knowledge bases extremely quickly this isn't a dynamic format either. Typically you would use two tiers, one fixed fast efficient knowledge base for the long term memory, and a slow but flexible knowledge base for the short term memory. Over time you can encapsulate things from the short term memory and merge them with the long term memory.
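The idea of three sorted copies can be illustrated in a few lines of Python. This is only a sketch of the lookup scheme: a real packed format would store fixed-width integers in binary files, whereas this uses in-memory tuples, and the sample terms are invented.

```python
from bisect import bisect_left

# Terms renumbered so that each integer is simply an index into this array.
terms = ["invented", "Gutenberg", "printing press", "Einstein", "relativity"]
term_id = {t: i for i, t in enumerate(terms)}

def pack(p, s, o):
    return (term_id[p], term_id[s], term_id[o])

triples = [
    pack("invented", "Gutenberg", "printing press"),
    pack("invented", "Einstein", "relativity"),
]

# One sorted copy per field; each copy is rotated so its key field comes first.
by_predicate = sorted(triples)
by_subject = sorted((s, p, o) for p, s, o in triples)
by_object = sorted((o, p, s) for p, s, o in triples)

def lookup(sorted_copy, key):
    """Binary search for every record whose first field equals key."""
    lo = bisect_left(sorted_copy, (key,))
    hi = bisect_left(sorted_copy, (key + 1,))
    return sorted_copy[lo:hi]

rows = lookup(by_subject, term_id["Einstein"])
found = [(terms[p], terms[s], terms[o]) for s, p, o in rows]
print(found)
```

Each lookup costs two binary searches regardless of which field is known, at the price of storing every record three times.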

There are also plenty of free and open source industrial strength triple store software packages available that you could use too.

https://en.wikipedia.org/wiki/Redland_RDF_Application_Framework

https://en.wikipedia.org/wiki/Sesame_(framework)

https://en.wikipedia.org/wiki/Jena_(framework)

Don Patrick · « **Reply #10 on:** March 17, 2018, 10:44:34 am »

Quote

even if you find something to be wrong, you want to avoid repeatedly putting it back in the knowledge base. 32-bit integers allow up to four billion distinct numbers which is plenty for a domain specific knowledge base, but you can opt for 64-bit integers without any problem.

So basically the serial numbers keep ever increasing. I suppose I could switch to 64-bit integers once it becomes necessary. My program should be able to store every fact in the universe for every person and item in human history plus 100 years.

The reason I automatically clean up entries with zero confidence (I do keep untrue facts) is that every entry adds to the (exponential) search space for my inference engine, and my database is file-based and thus noticeably slow. I suppose sooner or later I'll have to learn how to construct a faster database, but I am nowhere close to your remarkable abilities in that area yet. The long/short-term memory division is also further down on my to-do list; I generally keep my fingers crossed expecting future computers to become as fast as I need them to.

Thanks for all the references, Infurl.

Korrelan · « **Reply #11 on:** March 17, 2018, 05:19:18 pm »

I've been writing a chatbot engine as a side project.

I had no idea it was called 'triples' or RDF, but this is the structure I'm using.

ED: The last array column is obviously a skip link; once indexed, each row points to the next relevant row for that subject. A simple loop only retrieves objects relevant to the subject; saves having to loop through the whole array searching. I use a similar method for searching the dictionary: an index of a, aa, ab, ac, ad, pointing/linking to the start/end positions in the dictionary array.

I don't think you will find a faster method... if you do let me know.
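Since the original diagram is not reproduced here, the following Python sketch is a guess at the layout described: an array of rows whose last column links to the next row for the same subject. The column order, the `-1` end marker, and the sample rows are assumptions.

```python
# Each row is [subject, predicate, object, next_row]; next_row is the skip link.
rows = [
    ["cat", "is_a", "animal", -1],
    ["dog", "is_a", "animal", -1],
    ["cat", "has", "whiskers", -1],
    ["dog", "likes", "bones", -1],
    ["cat", "likes", "fish", -1],
]

def build_links(rows):
    """Fill in the skip links and return the first row index for each subject."""
    first, last = {}, {}
    for i, row in enumerate(rows):
        subject = row[0]
        if subject in last:
            rows[last[subject]][3] = i   # previous row for this subject points here
        else:
            first[subject] = i
        last[subject] = i
    return first

def facts_about(rows, first, subject):
    """Follow the skip links, visiting only rows for the given subject."""
    found = []
    i = first.get(subject, -1)
    while i != -1:
        found.append((rows[i][1], rows[i][2]))
        i = rows[i][3]
    return found

first = build_links(rows)
cat_facts = facts_about(rows, first, "cat")
print(cat_facts)
```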

infurl · « **Reply #12 on:** March 18, 2018, 12:30:26 pm »

Quote from: korrelan on March 17, 2018, 05:19:18 pm

ED: The last array column is obviously a skip link, once indexed each row points to the next relevant row for that subject. A simple loop only retrieves objects relevant to the subject; saves having to loop through the whole array searching. I use a similar method for searching the dictionary, and index of a, aa, ab, ac, ad, pointing/ linking to the start/ end positions in the dictionary array. I don't think you will find a faster method... if you do let me know.

I mainly use top-down splay trees, which are a relatively recent invention. They look just like skip lists, but they rearrange themselves dynamically according to the actual search pattern, so those items that are searched for most often migrate to the shortest lists. I'm using them in dynamic structures handling millions of records at a time and the performance is phenomenal. They handle deletion as easily as insertion and it's about 100 lines of C in total. (Actually they are so simple that when performance really matters I duplicate the splay tree code for each record type and have the compiler in-line and optimise everything, resulting in a doubling of performance compared to calls to shared library functions with call-backs.)

However, you should never underestimate the performance of a simple linear search. As a general rule of thumb, if you are searching sets of less than 100 items, you might as well just use a linear search. By the time the compiler has optimised the hell out of it, it can beat any number of complex structures for smaller jobs. Also you can double its performance by using a linked list and keeping it sorted.
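One reading of the remark about keeping the list sorted is that a failed search can stop early instead of scanning to the end. The Python sketch below counts comparisons for a miss under that interpretation; the step counter and sample data are only there for illustration, and a plain Python list stands in for the linked list.

```python
def find_unsorted(items, key):
    """Plain linear search: a miss has to look at every item."""
    steps = 0
    for item in items:
        steps += 1
        if item == key:
            return True, steps
    return False, steps

def find_sorted(items, key):
    """Linear search over sorted items: stop as soon as an item exceeds the key."""
    steps = 0
    for item in items:
        steps += 1
        if item == key:
            return True, steps
        if item > key:
            break
    return False, steps

data = [7, 3, 9, 1, 5]
miss_unsorted = find_unsorted(data, 4)
miss_sorted = find_sorted(sorted(data), 4)
print(miss_unsorted, miss_sorted)
```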

8pla.net · « **Reply #13 on:** March 18, 2018, 06:59:40 pm »

While looking at wikimedia resources, I came across ChatScript triples. So, I wanted to check my understanding.

Reference: https://en.wikipedia.org/wiki/ChatScript

"ChatScript supports facts – triples of data, which can also be transient or permanent."

Code

table: ~inventors(^who ^what)
createfact(^who invent ^what)
DATA:
"Johannes Gutenberg" "printing press"
"Albert Einstein" ["Theory of Relativity" photon "Theory of General Relativity"]

"The above table links people to what they invented (1 per line) with Einstein getting a list of things he did."

Ported to PHP for discussion purposes of triples:

Code

<?php
// Build one sentence per inventor: "<who> <verb> <what> [and <what> ...]".
function createfact($who, $invent, $what)
{
    $fact = array();   // initialise so an empty table still returns an array
    foreach ($who as $index => $record) {
        $buf = $record[0] . " " . $invent . " ";
        foreach ($what[$index] as $id => $that) {
            if ($id) $buf .= " and ";   // separator before every item except the first
            $buf .= $that;
        }
        $fact[] = $buf;
    }

    return $fact;
}

/* table */

$inventor = array(
    array("Bruce Wilcox"),
    array("Albert Einstein"),
);

$invention = array(
    array("ChatScript"),
    array("Theory of Relativity", "Theory of General Relativity"),
);

$facts = createfact($inventor, "invented", $invention);

print_r($facts);

For the C coders reading, PHP is like an easier version of C, if you can imagine that. One difference is that PHP variable names start with a dollar sign. In most cases, it does not take much to port PHP to C, or vice versa. So, my question is: based on what I coded here in PHP, how can I improve my understanding of triples? Feel free to answer me in C. I love C programming.

PHP Program Output

Quote

array (
0 => 'Bruce Wilcox invented ChatScript',
1 => 'Albert Einstein invented Theory of Relativity and Theory of General Relativity',
)
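For comparison, here is a small Python sketch that treats each invention as its own record, on the assumption that a bracketed list in the ChatScript table yields one fact per item. Each fact is kept as a separate (subject, predicate, object) tuple rather than joined into a sentence, so it can be queried field by field. The function name `create_facts` is made up for the example.

```python
def create_facts(table, predicate):
    """Expand (who, what) rows into one (subject, predicate, object) triple per invention."""
    facts = []
    for who, what in table:
        inventions = what if isinstance(what, list) else [what]
        for item in inventions:
            facts.append((who, predicate, item))
    return facts

inventors = [
    ("Johannes Gutenberg", "printing press"),
    ("Albert Einstein", ["Theory of Relativity", "photon", "Theory of General Relativity"]),
]

facts = create_facts(inventors, "invent")
for fact in facts:
    print(fact)

# Querying by field: everything Einstein is recorded as having invented.
einstein = [o for s, p, o in facts if s == "Albert Einstein" and p == "invent"]
```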

infurl · « **Reply #14 on:** March 18, 2018, 10:21:29 pm »

8pla.net I posted something like that here http://aidreams.co.uk/forum/index.php?topic=12949.msg51315#msg51315 a couple of days ago.

The facts or triples are just items of the form (predicate subject object) and then you use first order logic to define rules governing their behaviour. The rules contain logical operations like implication, equivalence, existential and universal quantifiers, and, or, and not. To facilitate processing, the rules and facts are converted to conjunctive normal form (CNF).

You can test the correctness of a new fact or rule, or find facts that meet some criteria, using logical resolution. The interface is very simple. You can "tell" the knowledge base something new and if it doesn't contradict anything that it already knows, it accepts it, otherwise it rejects it. You can also "ask" the knowledge base for the answer to a question and it will return a matching list of facts, as in the example that I posted.
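The shape of the "tell" and "ask" interface can be shown with a deliberately tiny Python sketch. It handles ground facts only and rejects a new fact only when its direct negation is already stored; there are no rules, quantifiers, CNF conversion or resolution here, so treat it as an illustration of the interface rather than of the inference behind it. The class and method names are assumptions.

```python
class KnowledgeBase:
    """Ground facts only; each fact is (predicate, subject, object, is_true)."""

    def __init__(self):
        self.facts = set()

    def tell(self, predicate, subject, obj, is_true=True):
        """Accept the fact unless its direct negation is already known."""
        if (predicate, subject, obj, not is_true) in self.facts:
            return False
        self.facts.add((predicate, subject, obj, is_true))
        return True

    def ask(self, predicate=None, subject=None, obj=None):
        """Return the true facts matching the pattern; None matches anything."""
        pattern = (predicate, subject, obj)
        return sorted(
            fact[:3] for fact in self.facts
            if fact[3] and all(want is None or want == got
                               for want, got in zip(pattern, fact))
        )

kb = KnowledgeBase()
accepted_1 = kb.tell("parent", "alice", "bob")
accepted_2 = kb.tell("parent", "alice", "carol")
accepted_3 = kb.tell("parent", "alice", "bob", is_true=False)   # contradicts the first fact
children = kb.ask("parent", "alice")
print(accepted_1, accepted_2, accepted_3, children)
```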

If you are interested, a good way to learn more about it would be to start using the Prolog programming language.
