The size of Conceptnet... seriously!

Zero · « **on:** February 29, 2020, 01:38:32 pm »

Check out https://github.com/commonsense/conceptnet5/wiki/Build-process

Today, February 2020, it's 240 GB of disk space. Ok, you have a nice Beowulf in your basement, you can handle it Korr

but seriously! I have a feeling there's something wrong here. It just can't be that big. What do you think guys?

infurl · « **Reply #1 on:** February 29, 2020, 02:22:48 pm »

The finished ConceptNet database is not actually that big. I've got it loaded on my laptop and the database is only about 20 gigabytes comprising 33 million records. The 240 gigabytes listed there is the scratch space that's needed to extract and process all the data and crunch it down into its final form. It really is quite small and it is just one of many that I have aggregated into a unified pool of knowledge for my AI software.

In contrast to that, I've been doing a lot of work with WikiData. The raw data is about 200 gigabytes and I need a couple of terabytes of scratch space to do much with it, but that's not a problem for me as I have 5 terabytes of SSD on my laptop and far more than that on other slower disks.

The quickest way to get more computing resources is to sign up to Google Compute Engine (GCE) or Amazon Web Services (AWS). You can ramp up very quickly that way but you would have to pay for it. They give you a few hundred dollars of free credit to get started with, but that won't last you more than a couple of weeks.

Zero · « **Reply #2 on:** February 29, 2020, 02:43:21 pm »

Well I can't afford that.

But why don't they just give the 20 GB version too? I found in the download section a "pre-built list of all the edges (assertions)". How does it differ from the 20 GB you get after the process of crunching these famous 240 GB down?

infurl · « **Reply #3 on:** February 29, 2020, 03:02:25 pm »

The last version of the prebuilt edge list that I downloaded was a few hundred megabytes. It consists of a single .csv text file (the fields are actually tab-separated) which contains triples (subject,predicate,object) and also a field defining extra properties in a JSON object. I use it by decompressing the file and loading it directly into a table in a relational database that assigns a serial number to each row and builds indexes on the columns of the table so it can be searched easily. I haven't done much more with it than that lately as I've been more interested in WikiData.

Decompressed, the raw data would be a few gigabytes in size, but by the time you lay it out in indexed structures for fast retrieval and processing it grows to the 20 gigabytes that I mentioned.
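
For anyone who wants to try that kind of load without setting up a database server, here is a minimal sketch in Python. It swaps in SQLite for the PostgreSQL database shown in this thread, and it assumes each line of the edge file has five tab-separated fields (edge URI, relation, start, end, JSON properties). The table and column names follow the queries in this post; the `fextra` column, the `load_edges` helper and the sample lines are made up for illustration.

```python
import csv
import io
import sqlite3

def load_edges(lines, db_path=":memory:"):
    """Load tab-separated edge rows into an indexed SQLite table."""
    conn = sqlite3.connect(db_path)
    conn.execute(
        "CREATE TABLE tclause ("
        " id INTEGER PRIMARY KEY AUTOINCREMENT,"  # serial number for each row
        " fpredicate TEXT, fsubject TEXT, fobject TEXT,"
        " fextra TEXT)"  # hypothetical column holding the JSON properties
    )
    # QUOTE_NONE keeps the double quotes inside the JSON field literal.
    reader = csv.reader(lines, delimiter="\t", quoting=csv.QUOTE_NONE)
    rows = ((r[1], r[2], r[3], r[4]) for r in reader if len(r) == 5)
    conn.executemany(
        "INSERT INTO tclause (fpredicate, fsubject, fobject, fextra)"
        " VALUES (?, ?, ?, ?)",
        rows,
    )
    # Build one index per searchable column after the bulk insert.
    for col in ("fpredicate", "fsubject", "fobject"):
        conn.execute(f"CREATE INDEX idx_tclause_{col} ON tclause ({col})")
    conn.commit()
    return conn

# Made-up sample lines; the edge URIs and the JSON field are placeholders.
sample = io.StringIO(
    "/a/example1\t/r/Antonym\t/c/af/hoog/a\t/c/af/laag\t{}\n"
    "/a/example2\t/r/Antonym\t/c/af/dogter/n\t/c/af/seun\t{}\n"
    "/a/example3\t/r/IsA\t/c/en/zimbabwe\t/c/en/country/n\t{}\n"
)
conn = load_edges(sample)
antonyms = conn.execute(
    "SELECT id, fsubject, fobject FROM tclause WHERE fpredicate = ?",
    ("/r/Antonym",),
).fetchall()
print(antonyms)
```

For the real download, the same function should accept a file opened with `gzip.open(path, "rt", encoding="utf-8", newline="")` and a `db_path` on disk. The indexes are a large part of why the loaded database ends up so much bigger than the raw text.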

Here are the first 20 rows of the 33 million rows in the database to give you an idea.

Code

words=# select fpredicate,fsubject,fobject from tclause limit 20;
 fpredicate |      fsubject       |      fobject       
------------+---------------------+--------------------
 /r/Antonym | /c/ab/агыруа/n      | /c/ab/аҧсуа
 /r/Antonym | /c/adx/thəχ kwo/a   | /c/adx/ʂap wə
 /r/Antonym | /c/adx/tok po/a     | /c/adx/ʂa wə
 /r/Antonym | /c/adx/ʂa wə/a      | /c/adx/tok po
 /r/Antonym | /c/adx/ʂap wə/a     | /c/adx/thəχ kwo
 /r/Antonym | /c/ae/𐬨𐬀𐬰𐬛𐬀𐬌𐬌𐬀𐬯𐬥𐬀/n | /c/ae/𐬛𐬀𐬉𐬎𐬎𐬀𐬌𐬌𐬀𐬯𐬥𐬀
 /r/Antonym | /c/ae/𐬵𐬎            | /c/ae/𐬛𐬎𐬱
 /r/Antonym | /c/af/aanskakel/v   | /c/af/afskakel
 /r/Antonym | /c/af/afgaan/v      | /c/af/opgaan
 /r/Antonym | /c/af/afskakel/v    | /c/af/aanskakel
 /r/Antonym | /c/af/aktiveer/v    | /c/af/deaktiveer
 /r/Antonym | /c/af/alkali/n      | /c/af/suur
 /r/Antonym | /c/af/blank/a       | /c/af/swart
 /r/Antonym | /c/af/dogter/n      | /c/af/seun
 /r/Antonym | /c/af/dogtertjie/n  | /c/af/seuntjie
 /r/Antonym | /c/af/geag/a        | /c/af/ongeag
 /r/Antonym | /c/af/gelukkig/a    | /c/af/ongelukkig
 /r/Antonym | /c/af/hoog/a        | /c/af/laag
 /r/Antonym | /c/af/iemand/n      | /c/af/niemand
 /r/Antonym | /c/af/ingaan/v      | /c/af/uitgaan
(20 rows)

Here are some of the predicates that are used, and the number of records for each.

Code

words=# select fpredicate,count(*) from tclause group by 1 order by 1 limit 20;
          fpredicate          |  count  
------------------------------+---------
 /r/Antonym                   |   67129
 /r/AtLocation                |   81756
 /r/CapableOf                 |   51607
 /r/Causes                    |   89413
 /r/CausesDesire              |   24899
 /r/CreatedBy                 |     391
 /r/DefinedAs                 |   11187
 /r/DerivedFrom               |  684228
 /r/Desires                   |   24887
 /r/DistinctFrom              |   71967
 /r/Entails                   |     405
 /r/EtymologicallyDerivedFrom |  176470
 /r/EtymologicallyRelatedTo   |  577305
 /r/ExternalURL               | 9621701
 /r/FormOf                    | 3664162
 /r/HasA                      |   17333
 /r/HasContext                |  817553
 /r/HasFirstSubevent          |   15858
 /r/HasLastSubevent           |    3370
 /r/HasPrerequisite           |   24808
(20 rows)

As you can see, ConceptNet contains 67129 antonyms in many different languages.

Code

words=# select count(distinct fpredicate) as predicates,count(distinct fsubject) as subjects,count(distinct fobject) as objects from tclause;
 predicates | subjects | objects  
------------+----------+----------
         50 | 16564137 | 13157777
(1 row)
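
The tallies above can also be approximated without a database by streaming over the triples once. This is only a sketch: the `summarize` helper is hypothetical, the sample rows are copied from the first listing in this post, and on the full 33 million rows the two sets of distinct subjects and objects would need a lot of memory.

```python
from collections import Counter

def summarize(rows):
    """Count rows per predicate and the distinct subjects and objects."""
    per_predicate = Counter()
    subjects, objects = set(), set()
    for predicate, subject, obj in rows:
        per_predicate[predicate] += 1
        subjects.add(subject)
        objects.add(obj)
    return per_predicate, len(subjects), len(objects)

# A handful of (predicate, subject, object) rows from the listing above.
rows = [
    ("/r/Antonym", "/c/af/aanskakel/v", "/c/af/afskakel"),
    ("/r/Antonym", "/c/af/afskakel/v", "/c/af/aanskakel"),
    ("/r/Antonym", "/c/af/hoog/a", "/c/af/laag"),
    ("/r/Antonym", "/c/af/dogter/n", "/c/af/seun"),
]
per_predicate, n_subjects, n_objects = summarize(rows)
print(per_predicate.most_common(), n_subjects, n_objects)
```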

Zero · « **Reply #4 on:** February 29, 2020, 04:02:13 pm »

Yes, I have it decompressed at 9 GB, currently playing with it, streaming it up and down... I like the fact that relations are so easy to understand. On the other hand, I already saw a couple of weird assertions. I think a good way to go, in my project, would be to keep the structure, but drop the content. Or at least, carefully select some tiny parts of it, double-checking everything. You can't just swallow everything and hope there's not too much junk in there.

Thanks for sharing your knowledge.

Zero · « **Reply #5 on:** February 29, 2020, 10:33:33 pm »

So here is what I did. I took a list of 1750 words used by French 4-year-old children. I added 400 words from the semantic field of "computers", and used it to filter and format the raw 9 GB of Conceptnet. That thing will probably talk like a geek.
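
A filter of that kind could look roughly like the sketch below. It is not the actual script: it assumes concept URIs of the form `/c/<language>/<term>` with an optional part-of-speech suffix, keeps an edge only when both ends are in the word list, and the helper names, the `lang` default and the sample rows are made up for illustration.

```python
def term_of(uri):
    """Return (language, term) for a concept URI such as /c/en/country/n."""
    parts = uri.split("/")
    # parts[0] is empty because the URI starts with "/".
    if len(parts) < 4 or parts[1] != "c":
        return None
    return parts[2], parts[3]

def filter_edges(rows, vocabulary, lang="en"):
    """Yield (relation, start, end) for edges whose two terms are in vocabulary."""
    for relation, start, end in rows:
        s, e = term_of(start), term_of(end)
        if s is None or e is None:
            continue
        if s[0] != lang or e[0] != lang:
            continue
        if s[1] in vocabulary and e[1] in vocabulary:
            # Drop the "/r/" prefix so "/r/IsA" becomes "IsA".
            yield relation.rsplit("/", 1)[-1], s[1], e[1]

# Made-up rows and word list for illustration.
rows = [
    ("/r/IsA", "/c/en/zimbabwe", "/c/en/country/n"),
    ("/r/Antonym", "/c/af/hoog/a", "/c/af/laag"),
    ("/r/PartOf", "/c/en/zimbabwe", "/c/en/africa"),
    ("/r/RelatedTo", "/c/en/zimbabwe", "/c/en/rhodesia"),
]
vocabulary = {"zimbabwe", "country", "africa"}
filtered = list(filter_edges(rows, vocabulary))
print(filtered)
```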

Edit:

Wow, hey what's wrong with this picture?
Is that what they call "open mind common sense"? There's no way I'm using Conceptnet data.

infurl · « **Reply #6 on:** March 01, 2020, 12:43:25 am »

Quote from: Zero on February 29, 2020, 10:33:33 pm

Is that what they call "open mind common sense"? There's no way I'm using Conceptnet data.

While all the statements are true by themselves, when taken together they create a picture that is not accurate because it is incomplete. It could lead to the fallacy of faulty generalization.

https://en.wikipedia.org/wiki/List_of_fallacies

Bias in training data is one of the biggest problems facing machine learning at the present time. I've heard of much worse cases than this. If you search for "bias in machine learning" you will find many articles about it, such as this one.

https://techcrunch.com/2018/11/06/3-ways-to-avoid-bias-in-machine-learning/

There are many projects attempting to remedy the situation but there is still a lot of work to do.

Good on you for paying attention to the data that you're working with. Incidentally, you shouldn't reject the whole of ConceptNet just because there are errors in it. If you find problems like that you should bring them to the attention of the people maintaining the knowledge base. They are working hard to try to improve it.

You've rekindled my interest in ConceptNet too; if I have time I will do some more work with it today.

Zero · « **Reply #7 on:** March 01, 2020, 01:38:45 am »

I think you're a wise man. I tend to get angry too quickly. I understand it's a giant task, and by nature a problem that's hard to tackle. I really hope this is a top priority for the Conceptnet team, because after all, we're talking about the very foundations of a lot of potentially powerful projects based on it.

Thanks for the links.

WriterOfMinds · « **Reply #8 on:** March 02, 2020, 06:55:30 pm »

Acuitas couldn't ingest the ConceptNet data in Zero's example either. Leaving aside the social bias issue, most of it isn't even well-formatted. Let's look at just the IsA relations:

[zimbabwe] is_type_of [a country]

Articles are a grammatical frill that need to be stripped before a noun is reduced to its essence, since the article attached to a noun can change with context.

[zimbabwe] is_type_of [a country in Africa]

This would need to be distilled into two different relationships: "[zimbabwe] is_type_of [country]" and "[zimbabwe] is_located [in] [Africa]."

[zimbabwe] is_type_of [in Africa]

What?

[zimbabwe] is_type_of [subsahara]

This should be "[zimbabwe] has_quality [sub-saharan]."

[zimbabwe] is_type_of [country (n)]

This is the only one I would describe as correct, and it needs a little additional processing to pop off the part-of-speech superscript.

So, that's 20% good data and 80% garbage data, which doesn't make this database seem like much of a time saver.
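
As a rough illustration of the clean-up steps described above (stripping articles, popping off the part-of-speech marker, rejecting things like "in Africa"), here is a minimal sketch. It is not how Acuitas does it; the helper names and the short preposition list are assumptions made for the example.

```python
import re

ARTICLES = ("a ", "an ", "the ")
# Trailing part-of-speech marker such as "(n)", as in "country (n)".
POS_MARKER = re.compile(r"\s*\([a-z]\)$")
# Hypothetical, deliberately short list of prepositions for the sanity check.
PREPOSITIONS = ("in ", "on ", "at ", "of ", "from ")

def normalize(label):
    """Strip a trailing part-of-speech marker and a leading article."""
    text = POS_MARKER.sub("", label.strip())
    lowered = text.lower()
    for article in ARTICLES:
        if lowered.startswith(article):
            return text[len(article):]
    return text

def plausible_type(label):
    """Reject labels that start with a preposition, e.g. "in Africa"."""
    return not label.lower().startswith(PREPOSITIONS)

raw = ["a country", "a country in Africa", "in Africa", "subsahara", "country (n)"]
cleaned = [normalize(label) for label in raw]
kept = [label for label in cleaned if plausible_type(label)]
print(cleaned)
print(kept)
```

This only covers the mechanical part. Splitting "country in Africa" into a type and a location, or recasting "subsahara" as a quality, would still need real parsing.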

Don Patrick · « **Reply #9 on:** March 02, 2020, 07:50:59 pm »

Most people I've met who have tried to implement ConceptNet complain that there are too many mistakes. The story goes that at one point in its development, they crowdsourced some of the data for something like 5 cents per fact. This also attracted non-native speakers and resulted in spelling errors and questionable categorisation, with the official categories already being flawed to begin with in my opinion.

If one were looking for a yes/no answer on "Is X a type of country?" (i.e. when you can rely on the question itself to be sensible) then you might use ConceptNet to look that up. If you're looking to populate your AI's brain with reliable knowledge however, this is not it.

Zero · « **Reply #10 on:** March 02, 2020, 10:13:35 pm »

Actually, weird articles get added by the web interface. Here is the JSON I extracted:

Code

{
    "forward": {
        "IsA": [
            "country",
            "country",
            "country in africa",
            "in africa",
            "subsahara"
        ],
        "PartOf": [
            "africa"
        ],
        "ReceivesAction": [
            "populated by negroes"
        ],
        "RelatedTo": [
            "aids",
            "appendix:countries of world",
            "hiv",
            "zimbabwe",
            "republic of zimbabwe",
            "rhodesia",
            "southern rhodesia"
        ],
        "Synonym": [
            "zimbabwe"
        ],
        "dbpedia/capital": [
            "harare"
        ]
    },
    "backward": {
        "AtLocation": [
            "alot of poor people",
            "gazelle"
        ],
        "DerivedFrom": [
            "zimbabwean"
        ],
        "FormOf": [
            "zimbabwes"
        ],
        "HasContext": [
            "dagga",
            "go moggy",
            "goffel",
            "gumaguma",
            "joburg",
            "makoronyera",
            "zhing zhong",
            "zim"
        ],
        "PartOf": [
            "bulawayo",
            "harare",
            "harare",
            "victoria",
            "zambezi"
        ],
        "RelatedTo": [
            "african hunting dog",
            "bulawayo",
            "chatsworth",
            "fanagalo",
            "harare",
            "high school",
            "hosho",
            "hwange",
            "kalanga",
            "kaodzera",
            "kariba",
            "mashonaland",
            "mashonaland east",
            "masvingo",
            "matabeleland",
            "mount pleasant",
            "mthwakazi",
            "ndebele",
            "northern ndebele",
            "rhodesia",
            "sadza",
            "salisbury",
            "save",
            "sena",
            "shona",
            "test nation",
            "victoria falls",
            "wilton",
            "zambezi",
            "zim",
            "zimbabwe",
            "zimbabwean"
        ],
        "Synonym": [
            "republic of zimbabwe",
            "rhodesia",
            "southern rhodesia",
            "zimbabwe"
        ]
    }
}

"IsA country" is fine. But for example we find "IsA in africa", which is obviously incorrect, "IsA subsahara" too. So yeah, maybe not 80% junk, but still a lot of junk.
Paying people for crowdsourcing was silly IMO.

But it can be used as an informal "todo list" (an AI should know those things).
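
For reference, a forward/backward structure like the JSON above can be built from plain (relation, start, end) triples along these lines. This is a sketch, not the extraction script itself: the `neighbourhood` helper is hypothetical, the sample triples are a small subset of the data shown above, and unlike that JSON it removes duplicates by collecting values in sets.

```python
import json
from collections import defaultdict

def neighbourhood(triples, concept):
    """Group a concept's outgoing and incoming edges by relation name."""
    forward = defaultdict(set)
    backward = defaultdict(set)
    for relation, start, end in triples:
        if start == concept:
            forward[relation].add(end)
        if end == concept:
            backward[relation].add(start)
    return {
        "forward": {rel: sorted(ends) for rel, ends in sorted(forward.items())},
        "backward": {rel: sorted(starts) for rel, starts in sorted(backward.items())},
    }

# A small subset of the edges shown in the JSON above.
triples = [
    ("IsA", "zimbabwe", "country"),
    ("IsA", "zimbabwe", "country"),
    ("IsA", "zimbabwe", "in africa"),
    ("PartOf", "zimbabwe", "africa"),
    ("PartOf", "harare", "zimbabwe"),
    ("DerivedFrom", "zimbabwean", "zimbabwe"),
]
result = neighbourhood(triples, "zimbabwe")
print(json.dumps(result, indent=4))
```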

infurl · « **Reply #11 on:** March 02, 2020, 10:17:43 pm »

Going back twenty years, there were originally two competing crowd-sourced artificial intelligence projects: MindPixel and Open Mind Common Sense. MindPixel submissions were strictly moderated whereas Open Mind would accept anything. As a result, Open Mind was a lot more popular with users but its content was of very poor quality. During this time the authors of these projects (Chris McKinstry and Push Singh) were engaged in a very public feud and eventually they both committed suicide.

https://en.wikipedia.org/wiki/Mindpixel
https://en.wikipedia.org/wiki/Open_Mind_Common_Sense

Since OMCS had the backing of MIT the effort was preserved and it eventually formed the basis of ConceptNet.

If you are concerned about the quality of the data that you are using then you should turn your attention to WikiData. It uses the same successful model as Wikipedia and while not perfect either, it adheres to best practices and it is as good as these things get.

https://www.wikidata.org/wiki/Wikidata:Main_Page

Zero · « **Reply #12 on:** March 02, 2020, 10:38:15 pm »

In fact, I don't think I'll just fill my AI with ad-hoc content. But I took some inspiration from Conceptnet's relations to build my own little draft of an upper ontology: draft. The set of relations used in Conceptnet is very straightforward. The good thing is that it feels natural, it's like how we humans think about things. But I modified it, because I thought it was wrong on some details (for example, there was no distinction between instance and subclass, which seemed unjustifiable).
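
To make the instance/subclass point concrete, here is a tiny sketch that keeps the two relations apart. It is not taken from the draft mentioned above; the relation names, the helper functions and the toy facts are all made up for illustration.

```python
# Toy facts: an individual is an instance of a class,
# while a class is a subclass of broader classes.
instance_of = {"zimbabwe": {"country"}}
subclass_of = {"country": {"territory"}, "territory": {"place"}}

def ancestors(cls):
    """Return every class reachable from cls through subclass_of links."""
    seen = set()
    stack = [cls]
    while stack:
        current = stack.pop()
        for parent in subclass_of.get(current, ()):
            if parent not in seen:
                seen.add(parent)
                stack.append(parent)
    return seen

def classes_of(individual):
    """Return the direct classes of an individual plus all their ancestors."""
    direct = instance_of.get(individual, set())
    result = set(direct)
    for cls in direct:
        result |= ancestors(cls)
    return result

zimbabwe_classes = classes_of("zimbabwe")
country_classes = classes_of("country")
print(sorted(zimbabwe_classes))
print(sorted(country_classes))
```

With a single IsA relation, "zimbabwe IsA country" and "country IsA territory" look the same. Here "country" is a class and not an individual, so `classes_of("country")` comes back empty while `ancestors("country")` does not.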

I'm studying Wikidata now.
