While looking at Wikimedia resources, I came across ChatScript triples, so I wanted to check my understanding.
Reference: https://en.wikipedia.org/wiki/ChatScript
"ChatScript supports facts – triples of data, which can also be transient or permanent."
table: ~inventors(^who ^what)
createfact(^who invent ^what)
DATA:
"Johannes Gutenberg" "printing press"
"Albert Einstein" ["Theory of Relativity" photon "Theory of General Relativity"]
"The above table links people to what they invented (1 per line) with Einstein getting a list of things he did."
Ported to PHP for the purpose of discussing triples:
function createfact($who, $invent, $what)
{
    $fact = array(); // initialize so an empty table returns an empty array
    foreach ($who as $index => $record) {
        $buf = $record[0] . " " . $invent . " ";
        foreach ($what[$index] as $id => $that) {
            if ($id) $buf .= " and ";
            $buf .= $that;
        }
        $fact[] = $buf;
    }
    return $fact;
}
/* table */
$inventor = array(
    array("Bruce Wilcox"),
    array("Albert Einstein"),
);
$invention = array(
    array("ChatScript"),
    array("Theory of Relativity", "Theory of General Relativity"),
);
$facts = createfact($inventor, "invented", $invention);
print_r($facts);
For the C coders reading, PHP is like an easier version of C Language, if you can imagine that. One difference is that PHP variable names start with a dollar sign. In most cases, it does not take much to port PHP to C Language, or vice versa. So, my question is: based on what I coded here in PHP, how can I improve my understanding of triples? Feel free to answer me in C Language. I love C programming.
PHP Program Output
Array
(
    [0] => Bruce Wilcox invented ChatScript
    [1] => Albert Einstein invented Theory of Relativity and Theory of General Relativity
)
Manually extracting WikiData into a relational database took about a week and resulted in an 800GB database. Even with 4TB of SSD, that doesn't seem all that practical. Luckily there is a lot of scope for further optimisation, and that's what I'll be exploring next. I wrote a simple function to extract all the facts from a JSON blob.
CREATE FUNCTION dScanJson
(
aObject JSONB,
aPath TEXT[] DEFAULT ARRAY[]::TEXT[]
)
RETURNS SETOF TEXT[]
LANGUAGE sql STABLE AS
$$
WITH
arrays(value,item) AS
(SELECT * FROM JSONB_ARRAY_ELEMENTS(aObject) WITH ORDINALITY WHERE JSONB_TYPEOF(aObject) = 'array'),
objects(key,value) AS
(SELECT * FROM JSONB_EACH(aObject) WHERE JSONB_TYPEOF(aObject) = 'object')
SELECT aPath||ARRAY[BTRIM(aObject::TEXT,'"')] WHERE JSONB_TYPEOF(aObject) = 'string' UNION
SELECT dScanJson(value,aPath||ARRAY[item::TEXT]) FROM arrays UNION
SELECT dScanJson(value,aPath||ARRAY[key]) FROM objects
$$;
This produces a set of tuples specifying the path and value of every fact in a JSON object. The first record alone yields over 1500 such facts. Here's a sample.
{claims,P998,1,id,Q26$2530265C-B49F-4A4A-9F23-C1BFE27C27C7}
{claims,P998,1,mainsnak,datatype,external-id}
{claims,P998,1,mainsnak,datavalue,type,string}
{claims,P998,1,mainsnak,datavalue,value,Regional/Europe/United_Kingdom/Northern_Ireland/}
{claims,P998,1,mainsnak,property,P998}
{claims,P998,1,mainsnak,snaktype,value}
{claims,P998,1,rank,normal}
{claims,P998,1,type,statement}
As WikiData contains about 45 million JSON blobs, I would estimate the end result to contain as many as 100 billion facts or triples. Should be easy enough to handle with some more C. I'd hate to have to upgrade my server with another 100GB of RAM just to deal with this problem. :D