Once the triple-analysis method works locally with unit tests, apply it to a larger dataset in Spark (one day of queries).
Update 19/5/2021
Ran the code on one day of data (10 May 2021) and did some basic analysis. The detailed analysis, along with the code, is in the PDF of the notebook.
~92% of the day's SPARQL queries were parsed successfully. The queries that failed to parse contain additional prefixes such as mwapi. Not all of the queries that failed parsing were checked.
Example Extracted Triples
The Query:
SELECT * WHERE { ?s wdt:P31/wdt:P279 <_:bn>; skos:altLabel "alias"@en. }
The extracted triples:
[
  TripleInfo(
    NodeInfo(NODE_VAR,s), //subject
    NodeInfo(PATH,<http://www.wikidata.org/prop/direct/P31>/<http://www.wikidata.org/prop/direct/P279>), //predicate
    NodeInfo(NODE_BLANK,bn) //object
  ),
  TripleInfo(
    NodeInfo(NODE_VAR,s), //subject
    NodeInfo(NODE_URI,skos:altLabel), //predicate
    NodeInfo(NODE_LITERAL,alias@en) //object
  )
]
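The shape of the output above can be sketched as plain data classes. This is a minimal Python illustration, not the notebook's actual code. Only the names TripleInfo, NodeInfo, and the node type labels come from the output above. The field names (node_type, value, subject, predicate, obj) are assumptions.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class NodeInfo:
    node_type: str  # assumed field name, e.g. "NODE_VAR", "NODE_URI", "PATH"
    value: str      # assumed field name: variable name, URI, literal, or path


@dataclass(frozen=True)
class TripleInfo:
    subject: NodeInfo
    predicate: NodeInfo
    obj: NodeInfo   # "obj" rather than "object" to avoid shadowing the builtin


# The two triples extracted from the example query, written out by hand.
example_triples = [
    TripleInfo(
        NodeInfo("NODE_VAR", "s"),
        NodeInfo(
            "PATH",
            "<http://www.wikidata.org/prop/direct/P31>/"
            "<http://www.wikidata.org/prop/direct/P279>",
        ),
        NodeInfo("NODE_BLANK", "bn"),
    ),
    TripleInfo(
        NodeInfo("NODE_VAR", "s"),
        NodeInfo("NODE_URI", "skos:altLabel"),
        NodeInfo("NODE_LITERAL", "alias@en"),
    ),
]

for triple in example_triples:
    print(triple.subject.node_type, triple.predicate.node_type, triple.obj.node_type)
```

Both triples share the subject ?s because the query uses ";" to attach a second predicate-object pair to the same subject.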
Distribution of node types for subject, predicate, and object nodes.
subjectNodeType | count |
NODE_URI | 16104919 |
NODE_VAR | 14032060 |
NODE_LITERAL | 30 |
predicateNodeType | count |
NODE_URI | 27011280 |
NODE_VAR | 1953145 |
PATH | 1172584 |
objectNodeType | count |
NODE_VAR | 13886031 |
NODE_LITERAL | 11110623 |
NODE_URI | 5140353 |
NODE_BLANK | 2 |
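A quick consistency check on the three tables above: every triple has exactly one subject, one predicate, and one object, so the counts for each position should add up to the same number of triples. The sketch below copies the counts from the tables. The variable names are illustrative only.

```python
# Counts copied from the three node-type tables above.
subject_counts = {"NODE_URI": 16104919, "NODE_VAR": 14032060, "NODE_LITERAL": 30}
predicate_counts = {"NODE_URI": 27011280, "NODE_VAR": 1953145, "PATH": 1172584}
object_counts = {
    "NODE_VAR": 13886031,
    "NODE_LITERAL": 11110623,
    "NODE_URI": 5140353,
    "NODE_BLANK": 2,
}

# Sum the counts per position; all three sums should be equal.
totals = {
    "subject": sum(subject_counts.values()),
    "predicate": sum(predicate_counts.values()),
    "object": sum(object_counts.values()),
}
print(totals)
```

All three positions sum to 30137009 triples, so the tables are consistent with each other.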
The distribution of the values these nodes contain, both combined and per node position (subject/predicate/object), is in the notebook.
Triples Distribution
Based on Node types
triple_string | count |
NODE_URI NODE_URI NODE_LITERAL | 8268924 |
NODE_VAR NODE_URI NODE_VAR | 7443036 |
NODE_URI NODE_URI NODE_VAR | 4997654 |
NODE_URI NODE_URI NODE_URI | 2508824 |
NODE_VAR NODE_URI NODE_LITERAL | 2113018 |
NODE_VAR NODE_URI NODE_URI | 1679795 |
NODE_VAR NODE_VAR NODE_VAR | 943751 |
NODE_VAR PATH NODE_URI | 862532 |
NODE_VAR NODE_VAR NODE_LITERAL | 721584 |
NODE_VAR PATH NODE_VAR | 216456 |
NODE_URI NODE_VAR NODE_VAR | 204853 |
NODE_URI PATH NODE_VAR | 80251 |
NODE_VAR NODE_VAR NODE_URI | 44789 |
NODE_URI NODE_VAR NODE_URI | 38165 |
NODE_VAR PATH NODE_LITERAL | 7097 |
NODE_URI PATH NODE_URI | 6248 |
NODE_LITERAL NODE_URI NODE_VAR | 29 |
NODE_VAR NODE_VAR NODE_BLANK | 2 |
NODE_LITERAL NODE_VAR NODE_VAR | 1 |
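The triple_string column above can be read as the node types of subject, predicate, and object joined by spaces. A minimal sketch of that grouping in plain Python follows, assuming each triple is available as three (node type, value) pairs. The toy input and the helper name type_signature are hypothetical, and the real analysis ran in Spark rather than on a Python list.

```python
from collections import Counter

# Toy input for illustration only: (node_type, value) for subject, predicate, object.
triples = [
    (("NODE_VAR", "s"), ("PATH", "wdt:P31/wdt:P279"), ("NODE_BLANK", "bn")),
    (("NODE_VAR", "s"), ("NODE_URI", "skos:altLabel"), ("NODE_LITERAL", "alias@en")),
    (("NODE_VAR", "item"), ("NODE_URI", "rdfs:label"), ("NODE_LITERAL", "label@en")),
]


def type_signature(triple):
    """Join the node types of subject, predicate, and object into one string."""
    return " ".join(node_type for node_type, _value in triple)


# Count how often each combination of node types occurs.
type_counts = Counter(type_signature(t) for t in triples)
for signature, count in type_counts.most_common():
    print(signature, "|", count)
```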
Top 50 triples by value, with variables and blank nodes replaced by their node type. Variable names are arbitrary, so triples that differ only in the names of their variables are counted as the same triple.
triple_string | count |
bd:serviceParam wikibase:language en | 3717731 |
NODE_VAR rdfs:label NODE_VAR | 1390180 |
NODE_VAR wdt:P279 NODE_VAR | 1245462 |
gas:program gas:out1 NODE_VAR | 1242919 |
gas:program gas:out NODE_VAR | 1242919 |
gas:program gas:traversalDirection Forward | 1242594 |
gas:program gas:gasClass com.bigdata.rdf.graph.analytics.SSSP | 1242387 |
gas:program gas:linkType wdt:P279 | 1242332 |
gas:program gas:maxIterations 3^^http://www.w3.org/2001/XMLSchema#integer | 1242307 |
NODE_VAR NODE_VAR NODE_VAR | 943751 |
NODE_VAR http://www.wikidata.org/prop/direct/P31/(http://www.wikidata.org/prop/direct/P279)* wd:Q16521 | 677313 |
NODE_VAR schema:about NODE_VAR | 584918 |
bd:serviceParam wikibase:language [AUTO_LANGUAGE],en | 555352 |
NODE_VAR schema:isPartOf https://en.wikipedia.org/ | 312901 |
NODE_VAR wdt:P569 NODE_VAR | 289123 |
NODE_VAR wdt:P570 NODE_VAR | 283221 |
NODE_VAR wdt:P1630 NODE_VAR | 251418 |
NODE_VAR wikibase:propertyType NODE_VAR | 248225 |
NODE_VAR schema:name NODE_VAR | 210927 |
NODE_VAR wdt:P31 NODE_VAR | 207363 |
NODE_VAR wdt:P18 NODE_VAR | 150968 |
NODE_VAR pq:P6552 NODE_VAR | 136415 |
NODE_VAR p:P2002 NODE_VAR | 136376 |
NODE_VAR rdf:type wikibase:Property | 120507 |
NODE_VAR wikibase:claim NODE_VAR | 82803 |
NODE_VAR wdt:P856 NODE_VAR | 79010 |
NODE_VAR wikibase:statementProperty NODE_VAR | 78692 |
hint:Query hint:optimizer None | 68602 |
NODE_VAR schema:inLanguage en | 65903 |
NODE_VAR skos:altLabel NODE_VAR | 64923 |
NODE_VAR wdt:P577 NODE_VAR | 61687 |
NODE_VAR pq:P1545 NODE_VAR | 55542 |
NODE_VAR schema:isPartOf https://sv.wikipedia.org/ | 55440 |
NODE_VAR wdt:P282 wd:Q8229 | 50172 |
http://www.wikidata.org schema:dateModified NODE_VAR | 49241 |
NODE_VAR wdt:P21 NODE_VAR | 48106 |
NODE_VAR wdt:P50 NODE_VAR | 46583 |
NODE_VAR schema:description NODE_VAR | 46269 |
NODE_VAR wikibase:propertyType wikibase:ExternalId | 44854 |
NODE_VAR wdt:P31 wd:Q5 | 43986 |
NODE_VAR p:P179 NODE_VAR | 42521 |
NODE_VAR wdt:P300 NODE_VAR | 42304 |
bd:serviceParam wikibase:language fr,en,it,sp,de | 41919 |
NODE_VAR wdt:P227 NODE_VAR | 39279 |
NODE_VAR wdt:P136 NODE_VAR | 39038 |
NODE_VAR wdt:P27 NODE_VAR | 38119 |
NODE_VAR ps:P179 NODE_VAR | 36276 |
NODE_VAR wdt:P19 NODE_VAR | 33985 |
NODE_VAR wdt:P1843 NODE_VAR | 33459 |
NODE_VAR wdt:P106 NODE_VAR | 32173 |
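The obfuscation used for the table above can be sketched as follows: variables and blank nodes are replaced by their node type, while URIs, literals, and paths keep their values. This is a hedged Python illustration on toy data, not the notebook's code. The names OBFUSCATED_TYPES, obfuscate, and triple_string are assumptions.

```python
from collections import Counter

# Node types whose values are arbitrary names and are therefore hidden.
OBFUSCATED_TYPES = {"NODE_VAR", "NODE_BLANK"}


def obfuscate(node_type, value):
    """Return the node type for variables and blank nodes, the value otherwise."""
    return node_type if node_type in OBFUSCATED_TYPES else value


def triple_string(triple):
    """Build the space-separated string used as the grouping key."""
    return " ".join(obfuscate(node_type, value) for node_type, value in triple)


# Toy input for illustration only: the first two triples differ only in variable name.
triples = [
    (("NODE_VAR", "item"), ("NODE_URI", "wdt:P31"), ("NODE_URI", "wd:Q5")),
    (("NODE_VAR", "person"), ("NODE_URI", "wdt:P31"), ("NODE_URI", "wd:Q5")),
    (("NODE_VAR", "item"), ("NODE_URI", "rdfs:label"), ("NODE_VAR", "label")),
]

value_counts = Counter(triple_string(t) for t in triples)
for key, count in value_counts.most_common():
    print(key, "|", count)
```

Here ?item and ?person collapse into the same key, "NODE_VAR wdt:P31 wd:Q5", which is why that key is counted twice.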