While looking at Wikimedia resources, I came across ChatScript triples, so I wanted to check my understanding.
Reference: https://en.wikipedia.org/wiki/ChatScript
"ChatScript supports facts – triples of data, which can also be transient or permanent."
table: ~inventors(^who ^what)
createfact(^who invent ^what)
DATA:
"Johannes Gutenberg" "printing press"
"Albert Einstein" ["Theory of Relativity" photon "Theory of General Relativity"]
"The above table links people to what they invented (1 per line) with Einstein getting a list of things he did."
Ported to PHP for the purpose of discussing triples:
function createfact($who, $invent, $what)
{
    $fact = array(); // initialize so an empty table returns an empty array
    foreach ($who as $index => $record) {
        $buf = $record[0] . " " . $invent . " ";
        foreach ($what[$index] as $id => $that) {
            if ($id) $buf .= " and ";
            $buf .= $that;
        }
        $fact[] = $buf;
    }
    return $fact;
}
/* table */
$inventor = array(
    array("Bruce Wilcox"),
    array("Albert Einstein"),
);
$invention = array(
    array("ChatScript"),
    array("Theory of Relativity", "Theory of General Relativity"),
);
$facts = createfact($inventor, "invented", $invention);
print_r($facts);
For the C coders reading, PHP is like an easier version of C Language, if you can imagine that. One difference is that PHP variable names start with a dollar sign. In most cases, it does not take much to port PHP to C Language, or vice versa. So, my question is: based on what I coded here in PHP, how can I improve my understanding of triples? Feel free to answer me in C Language. I love C programming.
PHP Program Output
Array
(
    [0] => Bruce Wilcox invented ChatScript
    [1] => Albert Einstein invented Theory of Relativity and Theory of General Relativity
)
Manually extracting WikiData into a relational database took about a week and resulted in an 800GB database. Even with 4TB of SSD, that doesn't seem all that practical. Luckily there is a lot of scope for further optimisation, and that's what I'll be exploring next. I wrote a simple function to extract all the facts from a JSON blob.
CREATE FUNCTION dScanJson
(
aObject JSONB,
aPath TEXT[] DEFAULT ARRAY[]::TEXT[]
)
RETURNS SETOF TEXT[]
LANGUAGE sql STABLE AS
$$
WITH
arrays(value,item) AS
(SELECT * FROM JSONB_ARRAY_ELEMENTS(aObject) WITH ORDINALITY WHERE JSONB_TYPEOF(aObject) = 'array'),
objects(key,value) AS
(SELECT * FROM JSONB_EACH(aObject) WHERE JSONB_TYPEOF(aObject) = 'object')
SELECT aPath||ARRAY[BTRIM(aObject::TEXT,'"')] WHERE JSONB_TYPEOF(aObject) = 'string' UNION
SELECT dScanJson(value,aPath||ARRAY[item::TEXT]) FROM arrays UNION
SELECT dScanJson(value,aPath||ARRAY[key]) FROM objects
$$;
This produces a set of tuples specifying the path and value of every fact in a JSON object. The first record alone yields over 1500 such facts. Here's a sample.
{claims,P998,1,id,Q26$2530265C-B49F-4A4A-9F23-C1BFE27C27C7}
{claims,P998,1,mainsnak,datatype,external-id}
{claims,P998,1,mainsnak,datavalue,type,string}
{claims,P998,1,mainsnak,datavalue,value,Regional/Europe/United_Kingdom/Northern_Ireland/}
{claims,P998,1,mainsnak,property,P998}
{claims,P998,1,mainsnak,snaktype,value}
{claims,P998,1,rank,normal}
{claims,P998,1,type,statement}
As WikiData contains about 45 million JSON blobs, I would estimate the end result to contain as many as 100 billion facts or triples. Should be easy enough to handle with some more C. I'd hate to have to upgrade my server with another 100GB of RAM just to deal with this problem. :D