wikimedia resources


infurl

  • Trusty Member
  • Replicant
  • 616
  • Humans will disappoint you.
    • Home Page
Re: wikimedia resources
« Reply #15 on: March 20, 2018, 09:57:46 pm »
Manually extracting WikiData into a relational database took about a week and resulted in an 800GB database. Even with 4TB of SSD that doesn't seem all that practical. Luckily there is a lot of scope for further optimisation and that's what I'll be exploring next. I wrote a simple function to extract all the facts from a JSON blob.

Code:
CREATE FUNCTION dScanJson
(
    aObject JSONB,
    aPath TEXT[] DEFAULT ARRAY[]::TEXT[]
)
    RETURNS SETOF TEXT[]
    LANGUAGE sql STABLE AS
$$
    -- Recursively walk a JSONB value, emitting one TEXT[] per leaf:
    -- the path elements followed by the leaf value itself. Only string
    -- leaves are captured here; numbers, booleans and nulls would need
    -- their own branches. Because the function calls itself, it may be
    -- necessary to SET check_function_bodies = off before creating it.
    WITH
        -- elements of an array, numbered from 1 by WITH ORDINALITY
        arrays(value,item) AS
            (SELECT * FROM JSONB_ARRAY_ELEMENTS(aObject) WITH ORDINALITY WHERE JSONB_TYPEOF(aObject) = 'array'),
        -- key/value pairs of an object
        objects(key,value) AS
            (SELECT * FROM JSONB_EACH(aObject) WHERE JSONB_TYPEOF(aObject) = 'object')
    SELECT aPath||ARRAY[BTRIM(aObject::TEXT,'"')] WHERE JSONB_TYPEOF(aObject) = 'string' UNION
    SELECT dScanJson(value,aPath||ARRAY[item::TEXT]) FROM arrays UNION
    SELECT dScanJson(value,aPath||ARRAY[key]) FROM objects
$$;

This produces a set of tuples specifying the path and value of every fact in a JSON object. The first record alone yields over 1500 such facts. Here's a sample.

Code:
{claims,P998,1,id,Q26$2530265C-B49F-4A4A-9F23-C1BFE27C27C7}
{claims,P998,1,mainsnak,datatype,external-id}
{claims,P998,1,mainsnak,datavalue,type,string}
{claims,P998,1,mainsnak,datavalue,value,Regional/Europe/United_Kingdom/Northern_Ireland/}
{claims,P998,1,mainsnak,property,P998}
{claims,P998,1,mainsnak,snaktype,value}
{claims,P998,1,rank,normal}
{claims,P998,1,type,statement}
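
For comparison, the same flattening can be sketched in a few lines of Python (a rough equivalent of the SQL function above, not the C version; the names here are my own):

```python
import json

def scan_json(obj, path=()):
    """Yield one tuple per leaf: the path elements followed by the value."""
    if isinstance(obj, dict):
        for key, value in obj.items():
            yield from scan_json(value, path + (key,))
    elif isinstance(obj, list):
        # JSONB_ARRAY_ELEMENTS ... WITH ORDINALITY numbers items from 1
        for item, value in enumerate(obj, start=1):
            yield from scan_json(value, path + (str(item),))
    else:
        yield path + (str(obj),)

claim = json.loads('{"claims": {"P998": [{"rank": "normal", "type": "statement"}]}}')
for fact in scan_json(claim):
    print("{" + ",".join(fact) + "}")
# → {claims,P998,1,rank,normal}
# → {claims,P998,1,type,statement}
```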

As WikiData contains about 45 million JSON blobs, I would estimate the end result to contain as many as 100 billion facts or triples. Should be easy enough to handle with some more C. I'd hate to have to upgrade my server with another 100GB of RAM just to deal with this problem. :D

*

ivan.moony

  • Trusty Member
  • Millennium Man
  • 1109
    • Some of my projects
Re: wikimedia resources
« Reply #16 on: March 20, 2018, 10:24:54 pm »
That seems like compiling the database from one format to another while increasing the space it occupies. Another approach would be to keep all the data in the format that occupies the least space, compiling it just in time, only when you need it. Extra space is usually spent only when optimizing data access for speed, but at some point you run out of available memory before all the data is optimized.

I like to think there are two kinds of optimization: (1) for speed and (2) for memory occupation. In the case of Wikidata, is there any chance of switching dynamically between these two optimizations on demand?
Dream big. The bigger the dream is, the more beautiful place the world becomes.

*

infurl

Re: wikimedia resources
« Reply #17 on: March 20, 2018, 10:42:15 pm »
That is exactly right Ivan, I'm looking for the optimum trade-off between space and speed. The first thing I tried was to store the data as binary JSON in a simple database table and index that. The original data is 20GB compressed, the table took 120GB, and the index was another 30GB. It took 3 hours to download, 6 hours to import, and 9 hours to index. While the index allowed any given piece of information to be retrieved from all that mass of data in a minute or so, that was way too slow for my needs, so I'm exploring other options.

I've already written and debugged a C version of the SQL function; it's about 700 lines. Ultimately I should be able to extract the data for just one language, which would make it an order of magnitude smaller and much more useful to other people, but for my purposes I want it all. Incidentally, I'm using ZFS (the Zettabyte File System) on my SSD, so it automatically compresses the data on the fly and in reality it's only taking one tenth of the space, plus IO is even faster. Not everyone has a powerful enough system to run ZFS though.

*

infurl

Re: wikimedia resources
« Reply #18 on: March 24, 2018, 03:26:44 am »
I've been running some tests during the week to get a sense of the best way to deal with this wikidata file. It's making me wish for the good old days when a movie seemed like a really big file. :D

The downloaded file is 20GB. It must have been packed with bzip2 on a very high setting, because when I unpack and repack it with the default bzip2 settings it is 40GB. Unpacking it takes 3 hours; repacking it took 36 hours. The full unpacked size is 426GB, but on ZFS it only takes up 124GB, which isn't quite so bad. It takes ZFS about an hour to save it compressed; its compression algorithm is much faster than bzip2 but not as tight. At 500MB per second, the time to read and write these files to SSD isn't really significant, but on a normal hard drive at 50MB per second it would be.
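
As a small aside, the effect of the bzip2 setting can be demonstrated with Python's bz2 module; compresslevel runs from 1 (fastest, 100kB blocks) to 9 (tightest, 900kB blocks):

```python
import bz2

# Toy data, repetitive enough to span several compression blocks.
data = b"the quick brown fox jumps over the lazy dog\n" * 50000

fast = bz2.compress(data, compresslevel=1)   # small blocks, quick
tight = bz2.compress(data, compresslevel=9)  # big blocks, tight
print(len(data), len(fast), len(tight))

# Either way the round trip is lossless.
assert bz2.decompress(fast) == data
assert bz2.decompress(tight) == data
```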

It sure looks like the best policy is going to be to leave the file in its downloaded form, unpack it straight to memory, parse it and convert it to relational form, and send it straight to the database without storing it in an intermediate file. For software that takes this long to do its job, it's important to be able to stop and restart it without losing work. It is possible to resynchronise with blocks in the middle of a bzip2 data stream, so resuming without having to rescan the file from the start shouldn't be too difficult. Two hours just to get back to where it left off wouldn't be acceptable.
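
That pipeline can be sketched as a generator that never touches the disk between the compressed dump and the parser. This is a simplified sketch: the line handling assumes the dump's one-entity-per-line layout inside a JSON array, and real restart logic would first seek to a bzip2 block boundary as described above.

```python
import bz2
import io
import json

def stream_entities(fileobj, chunk_size=1 << 20):
    """Decompress a bzip2 stream chunk by chunk and yield one JSON
    entity per line, without writing the unpacked file to disk."""
    decomp = bz2.BZ2Decompressor()
    buffer = b""
    while True:
        chunk = fileobj.read(chunk_size)
        if not chunk:
            break
        buffer += decomp.decompress(chunk)
        *lines, buffer = buffer.split(b"\n")
        for line in lines:
            line = line.strip().rstrip(b",")  # lines are wrapped in [ ... ],
            if line and line not in (b"[", b"]"):
                yield json.loads(line)

# Demonstration with a tiny in-memory "dump".
raw = b'[\n{"id": "Q1"},\n{"id": "Q2"}\n]\n'
packed = io.BytesIO(bz2.compress(raw))
for entity in stream_entities(packed):
    print(entity["id"])
# → Q1
# → Q2
```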

So, I'm trying to imagine the practicality of dumping and reloading a human brain with a flash drive, like they do in the Netflix series "Altered Carbon". Sadly it's not going to happen with our current technology, but they were supposed to be alien flash drives.
« Last Edit: March 25, 2018, 05:56:52 am by infurl »

*

Freddy

  • Administrator
  • Colossus
  • 6474
  • Mostly Harmless
Re: wikimedia resources
« Reply #19 on: March 25, 2018, 09:45:01 pm »
Too much information? Yes, that sounds about right  :D

Just look at bound encyclopedia sets, like Britannica, if they even still make them. You should be happy you don't have images in there too  ;)

It's very interesting though. I like all that sifting through large amounts of data; it can keep me happy for days finding out what can be done with it. My mountains of information are mere molehills compared to this though.

*

LOCKSUIT

  • Emerged from nothing
  • Trusty Member
  • Transformer
  • 2116
  • First it wiggles, then it is rewarded.
    • Enter Lair
Re: wikimedia resources
« Reply #20 on: March 26, 2018, 02:56:55 am »
What is this Wikimedia for? Why a relational database? Are you building an AGI off others' hard work (i.e. forked off of Wikimedia)? For example, you likely wouldn't build a game engine to make a game - you would just use an existing one.

I ask because I just happened to be reading about Wikimedia, Wikidata, etc.
Emergent

*

infurl

Re: wikimedia resources
« Reply #21 on: March 26, 2018, 03:07:42 am »
What is this Wikimedia for? Why a relational database? Are you building an AGI off others' hard work (i.e. forked off of Wikimedia)? For example, you likely wouldn't build a game engine to make a game - you would just use an existing one.

That's right Locksuit, build on the work of others. As Isaac Newton once said "If I have seen further it is by standing on ye shoulders of giants."

You might be interested in some of the other projects that are derived from Wikimedia too, such as ConceptNet5 and Yago3. They are very useful resources and I have studied them extensively. I found millions of errors in Yago which I reported to the authors and they subsequently fixed them. I believe that with my skills I can improve on those projects, just a little bit, and make a worthwhile contribution to the whole.

*

Freddy

Re: wikimedia resources
« Reply #22 on: March 26, 2018, 03:34:34 am »
The thing about those kinds of projects is they are never really finished - it needs other people to take up the baton.

*

LOCKSUIT

Re: wikimedia resources
« Reply #23 on: March 26, 2018, 06:46:23 pm »
Ok but, if I'm trying to teach an NLP LSTM like a child by feeding it lots of conversational text data from the web and then talking to it one on one, then where does Wikimedia fit in? Is it better than feeding it text off the internet? I.e. does Wikimedia have a better list of common-sense knowledge (in the form of normal English and not a strange format)?
Emergent

*

infurl

Re: wikimedia resources
« Reply #24 on: March 26, 2018, 09:43:52 pm »
Ok but, if I'm trying to teach an NLP LSTM like a child by feeding it lots of conversational text data from the web and then talking to it one on one, then where does Wikimedia fit in? Is it better than feeding it text off the internet? I.e. does Wikimedia have a better list of common-sense knowledge (in the form of normal English and not a strange format)?

Those are two very good questions Mr Suit.

To answer your first question, it takes a very long time for an entity to learn things from scratch, let alone accurately. Take a look at Carnegie Mellon's Never-Ending Language Learning project, which has been running for many years. It has learned a lot in that time but it still has some very strange ideas. For example, looking at the summary page I can see that a few weeks ago it came to the conclusion that larvae feed on bees. I don't have the time or patience for that.

http://rtw.ml.cmu.edu/rtw/

Another thing to consider is that the more you know, the faster you can learn, because you have more context in which to place new knowledge.

To answer your second question, what seems like a strange format to you (JSON, or JavaScript Object Notation) is a much easier format for a computer to process. The syntax of JSON has been designed to make it very fast and easy to parse, needing only one character of lookahead. On the other hand, the English language, which is so easy for you to parse and understand, is still impossible for any computer to process completely, although some projects (including my own) are making progress in that area.
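
That single character of lookahead is easy to illustrate: the first non-blank character of any JSON value already tells a parser which branch of the grammar to take.

```python
def json_type(text):
    """Classify a JSON value by its first non-blank character only --
    one character of lookahead is all the grammar requires."""
    c = text.lstrip()[0]
    if c == '{':
        return 'object'
    if c == '[':
        return 'array'
    if c == '"':
        return 'string'
    if c in 'tf':
        return 'boolean'
    if c == 'n':
        return 'null'
    return 'number'  # digits or a leading minus sign

print(json_type('{"a": 1}'))   # object
print(json_type('[1, 2, 3]'))  # array
print(json_type('-12.5'))      # number
```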

So, to sum up what I'm doing: I'm processing Wikidata into a format that I can use directly, partly because it's quicker and easier than the alternatives, but also because I will then have something to leverage when parsing the English Wikipedia. Not only will my software be able to learn faster, it will be able to check some things against its built-in knowledge base to see if it's getting them right.

*

LOCKSUIT

Re: wikimedia resources
« Reply #25 on: March 26, 2018, 10:39:27 pm »
Wikimedia/Wikidata's format is different. Instead of containing text knowledge like "Brown trees are ripe in the winter season, ready to eat for bears." it contains text like "Brown Trees: ripe and only; red only in winter." right? I know that was a little rough, but my question is that its text knowledge has a weird format, right?

Second question. Let's say I went with it and used the wikis for my project's acquisition of text knowledge. This will result in the baby mind talking English, but it won't sound grammatical and I won't be able to speak to it using my own voice like you would normally to children. Right?
Emergent

*

infurl

Re: wikimedia resources
« Reply #26 on: March 26, 2018, 11:35:27 pm »
I think you may be missing a fundamental idea here. The goal is to separate representation and presentation. That is, the information is stored in a way that is independent of how it appears to you. Ultimately the knowledge base should be able to receive information in any form. That information is broken down into its fundamental elements (facts) and stored in a way that makes it easy to manipulate (think about) with rules. The results of that processing are then converted back into whatever format is most suitable for communication. It could be in the form of a diagram, English speech, or written Hindi.

So, the knowledge base has rules for converting speech or writing or other media in different languages and styles into facts, it has rules for thinking about facts to create new facts, and it has rules for translating facts back into speech or writing or other media, again in different languages and styles. The word for all this is ABSTRACTION.
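A toy illustration of that separation, with made-up facts and templates (the names here are hypothetical, not the actual knowledge base):

```python
# Representation: facts stored as triples, independent of any language.
facts = [
    ("Belfast", "instance_of", "city"),
    ("Belfast", "capital_of", "Northern Ireland"),
]

# Presentation: a separate layer, one English template per relation.
templates = {
    "instance_of": "{0} is a {2}.",
    "capital_of": "{0} is the capital of {2}.",
}

def render(fact):
    """Turn a stored triple back into an English sentence."""
    return templates[fact[1]].format(*fact)

for fact in facts:
    print(render(fact))
# → Belfast is a city.
# → Belfast is the capital of Northern Ireland.
```

Swapping in a different template table would present the same facts in another language or medium without touching the stored triples.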

*

LOCKSUIT

Re: wikimedia resources
« Reply #27 on: March 27, 2018, 01:16:36 am »
So to sum it up, I can get text knowledge anywhere on the web, say WikiPEDIA, and the only difference with WikiMEDIA/DATA is that the format is not in grammatical fashion, right? It converts factual sentences into very short factual 'phrases', right?
Emergent

*

infurl

Re: wikimedia resources
« Reply #28 on: March 27, 2018, 01:21:15 am »
Close enough.

*

infurl

Re: wikimedia resources
« Reply #29 on: April 01, 2018, 01:33:34 am »
I've made another interesting discovery about the WikiData that I've been processing over the past few weeks. Fully unpacked into a database system it takes up around 800GB, but short of putting it in optimal normal form I haven't made much attempt to reduce the size requirements yet. Now I've got to the point where it is organised enough that I can start to take a closer look at the contents, and the first thing to do there is determine just how much text there really is. In its raw form the text runs to billions of phrases and hundreds of gigabytes, but if I sort the phrases in order I can drop all the duplicates. To my amazement (and a mixture of relief and disappointment) I found that there are only about 90 million unique phrases, which fit comfortably in 3GB. This opens up a whole new range of practical processing and deployment options. Please stay tuned.
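
The deduplication step itself is nothing exotic; sketched in Python with made-up sample phrases (at full scale the data would go through something like sort -u rather than fitting in memory):

```python
# Hypothetical sample phrases; the point is only the sort-and-dedupe step.
phrases = [
    "capital of Ireland",
    "chemical element",
    "capital of Ireland",
    "chemical element",
    "Wikimedia disambiguation page",
]
unique = sorted(set(phrases))
print(len(phrases), "->", len(unique))  # 5 -> 3
```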

 

