
Test triple-analysis functions over a large dataset with Spark
Closed, Resolved, Public

Description

Once the triple-analysis method is ready locally with unit tests, apply it to a larger dataset in Spark (one day of queries).

Update 19/5/2021

Ran the code on one day of data (10 May 2021) and did some basic analysis. The detailed analysis, along with the code, is in the PDF of the notebook.

About 92% of the day's SPARQL queries are processed successfully. The queries that fail to parse contain additional prefixes such as mwapi. Not all of the queries that failed parsing were checked.
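One way to spot such queries before parsing is to list the prefixes a query uses that are neither declared in the query nor already known to the parser. The sketch below is plain Python with regular expressions, not the parser used in the notebook. KNOWN_PREFIXES is a hypothetical, partial list, the sample query is illustrative, and the regexes ignore comments and other edge cases.

```python
import re

# Hypothetical, partial set of prefixes the parser is assumed to know already.
KNOWN_PREFIXES = {"wd", "wdt", "p", "ps", "pq", "rdfs", "skos", "schema", "wikibase", "bd"}

# Full IRIs and string literals are blanked out first so "http:" is not read as a prefix.
IRI_OR_STRING = re.compile(r'<[^<>\s]*>|"(?:[^"\\]|\\.)*"')
DECLARED = re.compile(r'\bPREFIX\s+([A-Za-z][\w-]*):', re.IGNORECASE)
USED = re.compile(r'(?<![\w:])([A-Za-z][\w-]*):')


def unknown_prefixes(query, known=KNOWN_PREFIXES):
    """Return prefixes used in the query that are neither declared nor known."""
    text = IRI_OR_STRING.sub(" ", query)
    declared = set(DECLARED.findall(text))
    used = set(USED.findall(text))
    return used - declared - set(known)


# Illustrative query using the mwapi prefix without declaring it.
query = """
SELECT * WHERE {
  SERVICE wikibase:mwapi {
    bd:serviceParam wikibase:api "EntitySearch";
                    mwapi:search "cheese";
                    mwapi:language "en".
    ?item wikibase:apiOutputItem mwapi:item.
  }
}
"""
print(unknown_prefixes(query))
```

A non-empty result marks a query as a candidate for the failing 8%; whether a missing prefix is the actual cause would still need to be confirmed against the parser.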

Example Extracted Triples

The Query:

SELECT * WHERE
    {
      ?s wdt:P31/wdt:P279 <_:bn>;
         skos:altLabel "alias"@en.
    }

The extracted triples:

[

TripleInfo(
    NodeInfo(NODE_VAR,s), //subject
    NodeInfo(PATH,<http://www.wikidata.org/prop/direct/P31>/<http://www.wikidata.org/prop/direct/P279>), //predicate
    NodeInfo(NODE_BLANK,bn) //object
), 

TripleInfo(
    NodeInfo(NODE_VAR,s), //subject
    NodeInfo(NODE_URI,skos:altLabel), //predicate
    NodeInfo(NODE_LITERAL,alias@en) //object
)

]
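
To work with this output outside Spark, the structure above can be mirrored with two small record types. This is a minimal Python sketch: only the class names and node-type labels come from the output above, while the field names (node_type, value, subject, predicate, object) are assumptions.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class NodeInfo:
    node_type: str  # NODE_VAR, NODE_URI, NODE_LITERAL, NODE_BLANK or PATH
    value: str      # variable name, URI, literal text, blank node label or path


@dataclass(frozen=True)
class TripleInfo:
    subject: NodeInfo
    predicate: NodeInfo
    object: NodeInfo


# The two triples extracted from the example query.
triples = [
    TripleInfo(
        NodeInfo("NODE_VAR", "s"),
        NodeInfo(
            "PATH",
            "<http://www.wikidata.org/prop/direct/P31>/"
            "<http://www.wikidata.org/prop/direct/P279>",
        ),
        NodeInfo("NODE_BLANK", "bn"),
    ),
    TripleInfo(
        NodeInfo("NODE_VAR", "s"),
        NodeInfo("NODE_URI", "skos:altLabel"),
        NodeInfo("NODE_LITERAL", "alias@en"),
    ),
]
```

Frozen dataclasses are hashable, so identical triples can be counted directly, for example with collections.Counter.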

Distribution of Node Types for subject, object, and predicate nodes.

subjectNodeType    count
NODE_URI           16104919
NODE_VAR           14032060
NODE_LITERAL       30

predicateNodeType  count
NODE_URI           27011280
NODE_VAR           1953145
PATH               1172584

objectNodeType     count
NODE_VAR           13886031
NODE_LITERAL       11110623
NODE_URI           5140353
NODE_BLANK         2

The distribution of the values these nodes contain, both combined and per node position (subject/predicate/object), is in the notebook.
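
The per-position counts above amount to a group-and-count over the extracted triples. The following is a plain-Python sketch of that counting, not the Spark code from the notebook; the tuple layout, the POSITIONS tuple and the function name are assumptions made for illustration.

```python
from collections import Counter

# Each triple is (subject, predicate, object); each node is (node_type, value).
# These are the two triples from the example query.
triples = [
    (
        ("NODE_VAR", "s"),
        ("PATH", "<http://www.wikidata.org/prop/direct/P31>/<http://www.wikidata.org/prop/direct/P279>"),
        ("NODE_BLANK", "bn"),
    ),
    (
        ("NODE_VAR", "s"),
        ("NODE_URI", "skos:altLabel"),
        ("NODE_LITERAL", "alias@en"),
    ),
]

POSITIONS = ("subject", "predicate", "object")


def node_type_distribution(triples):
    """Count node types separately for subject, predicate and object."""
    counts = {position: Counter() for position in POSITIONS}
    for triple in triples:
        for position, (node_type, _value) in zip(POSITIONS, triple):
            counts[position][node_type] += 1
    return counts


distribution = node_type_distribution(triples)
for position in POSITIONS:
    print(position, dict(distribution[position]))
```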

Triples Distribution

Based on Node types

triple_string                   count
NODE_URI NODE_URI NODE_LITERAL  8268924
NODE_VAR NODE_URI NODE_VAR      7443036
NODE_URI NODE_URI NODE_VAR      4997654
NODE_URI NODE_URI NODE_URI      2508824
NODE_VAR NODE_URI NODE_LITERAL  2113018
NODE_VAR NODE_URI NODE_URI      1679795
NODE_VAR NODE_VAR NODE_VAR      943751
NODE_VAR PATH NODE_URI          862532
NODE_VAR NODE_VAR NODE_LITERAL  721584
NODE_VAR PATH NODE_VAR          216456
NODE_URI NODE_VAR NODE_VAR      204853
NODE_URI PATH NODE_VAR          80251
NODE_VAR NODE_VAR NODE_URI      44789
NODE_URI NODE_VAR NODE_URI      38165
NODE_VAR PATH NODE_LITERAL      7097
NODE_URI PATH NODE_URI          6248
NODE_LITERAL NODE_URI NODE_VAR  29
NODE_VAR NODE_VAR NODE_BLANK    2
NODE_LITERAL NODE_VAR NODE_VAR  1
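
As a consistency check, the table above can be summed per position and compared with the per-position node-type distributions reported earlier. The sketch below is plain Python; the counts are copied from the table, and the function and variable names are assumptions.

```python
from collections import Counter

# Counts copied from the "Based on Node types" table.
triple_type_counts = {
    "NODE_URI NODE_URI NODE_LITERAL": 8268924,
    "NODE_VAR NODE_URI NODE_VAR": 7443036,
    "NODE_URI NODE_URI NODE_VAR": 4997654,
    "NODE_URI NODE_URI NODE_URI": 2508824,
    "NODE_VAR NODE_URI NODE_LITERAL": 2113018,
    "NODE_VAR NODE_URI NODE_URI": 1679795,
    "NODE_VAR NODE_VAR NODE_VAR": 943751,
    "NODE_VAR PATH NODE_URI": 862532,
    "NODE_VAR NODE_VAR NODE_LITERAL": 721584,
    "NODE_VAR PATH NODE_VAR": 216456,
    "NODE_URI NODE_VAR NODE_VAR": 204853,
    "NODE_URI PATH NODE_VAR": 80251,
    "NODE_VAR NODE_VAR NODE_URI": 44789,
    "NODE_URI NODE_VAR NODE_URI": 38165,
    "NODE_VAR PATH NODE_LITERAL": 7097,
    "NODE_URI PATH NODE_URI": 6248,
    "NODE_LITERAL NODE_URI NODE_VAR": 29,
    "NODE_VAR NODE_VAR NODE_BLANK": 2,
    "NODE_LITERAL NODE_VAR NODE_VAR": 1,
}


def marginal(position):
    """Sum triple counts by the node type at one position (0=subject, 1=predicate, 2=object)."""
    counts = Counter()
    for triple_string, count in triple_type_counts.items():
        counts[triple_string.split()[position]] += count
    return counts


subject, predicate, obj = marginal(0), marginal(1), marginal(2)
total = sum(triple_type_counts.values())
print(total)
print(dict(subject))
print(dict(predicate))
print(dict(obj))
```

The three marginals agree with the per-position tables, and each sums to the same total of 30137009 triples.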

Top 50 triples by value, with variables and blank nodes replaced by their node type. Variable names are arbitrary, so two triples that differ only in how their variables are named count as the same triple.

triple_string                                                               count
bd:serviceParam wikibase:language en                                        3717731
NODE_VAR rdfs:label NODE_VAR                                                1390180
NODE_VAR wdt:P279 NODE_VAR                                                  1245462
gas:program gas:out1 NODE_VAR                                               1242919
gas:program gas:out NODE_VAR                                                1242919
gas:program gas:traversalDirection Forward                                  1242594
gas:program gas:gasClass com.bigdata.rdf.graph.analytics.SSSP               1242387
gas:program gas:linkType wdt:P279                                           1242332
gas:program gas:maxIterations 3^^http://www.w3.org/2001/XMLSchema#integer   1242307
NODE_VAR NODE_VAR NODE_VAR                                                  943751
NODE_VAR http://www.wikidata.org/prop/direct/P31/(http://www.wikidata.org/prop/direct/P279)* wd:Q16521  677313
NODE_VAR schema:about NODE_VAR                                              584918
bd:serviceParam wikibase:language [AUTO_LANGUAGE],en                        555352
NODE_VAR schema:isPartOf https://en.wikipedia.org/                          312901
NODE_VAR wdt:P569 NODE_VAR                                                  289123
NODE_VAR wdt:P570 NODE_VAR                                                  283221
NODE_VAR wdt:P1630 NODE_VAR                                                 251418
NODE_VAR wikibase:propertyType NODE_VAR                                     248225
NODE_VAR schema:name NODE_VAR                                               210927
NODE_VAR wdt:P31 NODE_VAR                                                   207363
NODE_VAR wdt:P18 NODE_VAR                                                   150968
NODE_VAR pq:P6552 NODE_VAR                                                  136415
NODE_VAR p:P2002 NODE_VAR                                                   136376
NODE_VAR rdf:type wikibase:Property                                         120507
NODE_VAR wikibase:claim NODE_VAR                                            82803
NODE_VAR wdt:P856 NODE_VAR                                                  79010
NODE_VAR wikibase:statementProperty NODE_VAR                                78692
hint:Query hint:optimizer None                                              68602
NODE_VAR schema:inLanguage en                                               65903
NODE_VAR skos:altLabel NODE_VAR                                             64923
NODE_VAR wdt:P577 NODE_VAR                                                  61687
NODE_VAR pq:P1545 NODE_VAR                                                  55542
NODE_VAR schema:isPartOf https://sv.wikipedia.org/                          55440
NODE_VAR wdt:P282 wd:Q8229                                                  50172
http://www.wikidata.org schema:dateModified NODE_VAR                        49241
NODE_VAR wdt:P21 NODE_VAR                                                   48106
NODE_VAR wdt:P50 NODE_VAR                                                   46583
NODE_VAR schema:description NODE_VAR                                        46269
NODE_VAR wikibase:propertyType wikibase:ExternalId                          44854
NODE_VAR wdt:P31 wd:Q5                                                      43986
NODE_VAR p:P179 NODE_VAR                                                    42521
NODE_VAR wdt:P300 NODE_VAR                                                  42304
bd:serviceParam wikibase:language fr,en,it,sp,de                            41919
NODE_VAR wdt:P227 NODE_VAR                                                  39279
NODE_VAR wdt:P136 NODE_VAR                                                  39038
NODE_VAR wdt:P27 NODE_VAR                                                   38119
NODE_VAR ps:P179 NODE_VAR                                                   36276
NODE_VAR wdt:P19 NODE_VAR                                                   33985
NODE_VAR wdt:P1843 NODE_VAR                                                 33459
NODE_VAR wdt:P106 NODE_VAR                                                  32173
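
The obfuscation used for this ranking can be sketched as follows: variables and blank nodes are replaced by their node type, and every other node keeps its value. This is a plain-Python illustration under assumptions, not the notebook's Spark code; the tuple layout, the function name and the three sample triples are hypothetical.

```python
from collections import Counter

# Node types whose values are arbitrary names and are therefore hidden.
OBFUSCATED_TYPES = {"NODE_VAR", "NODE_BLANK"}


def triple_string(triple):
    """Join the three nodes, replacing variables and blank nodes by their type."""
    return " ".join(
        node_type if node_type in OBFUSCATED_TYPES else value
        for node_type, value in triple
    )


# Hypothetical sample: the first two triples differ only in variable names.
triples = [
    (("NODE_VAR", "item"), ("NODE_URI", "rdfs:label"), ("NODE_VAR", "label")),
    (("NODE_VAR", "x"), ("NODE_URI", "rdfs:label"), ("NODE_VAR", "y")),
    (("NODE_VAR", "s"), ("NODE_URI", "wdt:P31"), ("NODE_URI", "wd:Q5")),
]

top = Counter(triple_string(t) for t in triples).most_common(50)
for string, count in top:
    print(string, count)
```

Because the first two sample triples collapse to the same string, they are counted together, which is the effect described for the table above.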