Once the triple-analysis method works locally and is covered by unit tests, apply it to a larger dataset in Spark (one day of queries).
#### Update 19/5/2021
Ran the code on one day of data (10 May 2021) and did some basic analysis. The detailed analysis, along with the code, is in the PDF of the notebook.
{F34459867}
~92% of the day's SPARQL queries were processed successfully. Those that could not be parsed contain additional prefixes such as `mwapi`, although not every query that failed parsing was checked.
##### Example Extracted Triples
The Query:
```
SELECT * WHERE
{
?s wdt:P31/wdt:P279 <_:bn>;
skos:altLabel "alias"@en.
}
```
The extracted triples:
```
[
TripleInfo(
NodeInfo(NODE_VAR,s), //subject
NodeInfo(PATH,<http://www.wikidata.org/prop/direct/P31>/<http://www.wikidata.org/prop/direct/P279>), //predicate
NodeInfo(NODE_BLANK,bn) //object
),
TripleInfo(
NodeInfo(NODE_VAR,s), //subject
NodeInfo(NODE_URI,skos:altLabel), //predicate
NodeInfo(NODE_LITERAL,alias@en) //object
)
]
```
##### Distribution of Node Types for subject, predicate, and object nodes
| subjectNodeType | count |
| --- | --- |
| NODE_URI | 16104919 |
| NODE_VAR | 14032060 |
| NODE_LITERAL | 30 |

| predicateNodeType | count |
| --- | --- |
| NODE_URI | 27011280 |
| NODE_VAR | 1953145 |
| PATH | 1172584 |

| objectNodeType | count |
| --- | --- |
| NODE_VAR | 13886031 |
| NODE_LITERAL | 11110623 |
| NODE_URI | 5140353 |
| NODE_BLANK | 2 |

The distribution of the values these nodes contain, both combined and per node position (subject/predicate/object), is in the notebook.
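
As a quick consistency check on the tables above: every extracted triple has exactly one subject, one predicate, and one object, so the three per-position totals should be equal. The small Python check below uses only the counts reported above; the dictionary and variable names are illustrative and not taken from the notebook.

```python
# Counts copied from the node-type tables above (data of 10 May 2021).
subject_counts = {"NODE_URI": 16104919, "NODE_VAR": 14032060, "NODE_LITERAL": 30}
predicate_counts = {"NODE_URI": 27011280, "NODE_VAR": 1953145, "PATH": 1172584}
object_counts = {
    "NODE_VAR": 13886031,
    "NODE_LITERAL": 11110623,
    "NODE_URI": 5140353,
    "NODE_BLANK": 2,
}

totals = {
    "subject": sum(subject_counts.values()),
    "predicate": sum(predicate_counts.values()),
    "object": sum(object_counts.values()),
}

# Each triple contributes one node per position, so all totals must agree.
assert len(set(totals.values())) == 1
total_triples = totals["subject"]
print(total_triples)  # 30137009

# Share of each predicate node type, as a percentage of all triples.
predicate_share = {
    node_type: round(100 * count / total_triples, 2)
    for node_type, count in predicate_counts.items()
}
print(predicate_share)
```

The totals agree at 30,137,009 triples. About 89.6% of predicates are plain URIs, about 6.5% are variables, and about 3.9% are property paths.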
##### Triples Distribution
Based on node types:

|triple_string |count |
|---|---|
|NODE_URI NODE_URI NODE_LITERAL|8268924|
|NODE_VAR NODE_URI NODE_VAR |7443036|
|NODE_URI NODE_URI NODE_VAR |4997654|
|NODE_URI NODE_URI NODE_URI |2508824|
|NODE_VAR NODE_URI NODE_LITERAL|2113018|
|NODE_VAR NODE_URI NODE_URI |1679795|
|NODE_VAR NODE_VAR NODE_VAR |943751 |
|NODE_VAR PATH NODE_URI |862532 |
|NODE_VAR NODE_VAR NODE_LITERAL|721584 |
|NODE_VAR PATH NODE_VAR |216456 |
|NODE_URI NODE_VAR NODE_VAR |204853 |
|NODE_URI PATH NODE_VAR |80251 |
|NODE_VAR NODE_VAR NODE_URI |44789 |
|NODE_URI NODE_VAR NODE_URI |38165 |
|NODE_VAR PATH NODE_LITERAL |7097 |
|NODE_URI PATH NODE_URI |6248 |
|NODE_LITERAL NODE_URI NODE_VAR|29 |
|NODE_VAR NODE_VAR NODE_BLANK |2 |
|NODE_LITERAL NODE_VAR NODE_VAR|1 |
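
The `triple_string` values in the table above are the node types of subject, predicate, and object joined by spaces. A minimal sketch of that grouping in plain Python follows. It assumes a `NodeInfo`/`TripleInfo` shape modelled on the example output earlier in this update; the field names and the `type_signature` helper are hypothetical, and the actual analysis ran in Spark (see the notebook).

```python
from collections import Counter
from dataclasses import dataclass


@dataclass(frozen=True)
class NodeInfo:
    node_type: str  # e.g. "NODE_VAR", "NODE_URI", "NODE_LITERAL", "NODE_BLANK", "PATH"
    value: str


@dataclass(frozen=True)
class TripleInfo:
    subject: NodeInfo
    predicate: NodeInfo
    object: NodeInfo


def type_signature(triple: TripleInfo) -> str:
    """Join the node types of subject, predicate, and object into one string."""
    return " ".join(
        node.node_type for node in (triple.subject, triple.predicate, triple.object)
    )


# The two triples from the example query earlier in this update.
triples = [
    TripleInfo(
        NodeInfo("NODE_VAR", "s"),
        NodeInfo(
            "PATH",
            "<http://www.wikidata.org/prop/direct/P31>/<http://www.wikidata.org/prop/direct/P279>",
        ),
        NodeInfo("NODE_BLANK", "bn"),
    ),
    TripleInfo(
        NodeInfo("NODE_VAR", "s"),
        NodeInfo("NODE_URI", "skos:altLabel"),
        NodeInfo("NODE_LITERAL", "alias@en"),
    ),
]

# Count how often each node-type signature occurs.
signature_counts = Counter(type_signature(t) for t in triples)
for signature, count in signature_counts.most_common():
    print(signature, count)
```

For the two example triples this yields one `NODE_VAR PATH NODE_BLANK` and one `NODE_VAR NODE_URI NODE_LITERAL`.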
Top 50 triples by value, with variables and blank nodes replaced by their node type. Variables can be named anything, so two triples that differ only in their variable names are counted as the same triple.

|triple_string |count |
|---|---|
|bd:serviceParam wikibase:language en |3717731|
|NODE_VAR rdfs:label NODE_VAR |1390180|
|NODE_VAR wdt:P279 NODE_VAR |1245462|
|gas:program gas:out1 NODE_VAR |1242919|
|gas:program gas:out NODE_VAR |1242919|
|gas:program gas:traversalDirection Forward |1242594|
|gas:program gas:gasClass com.bigdata.rdf.graph.analytics.SSSP |1242387|
|gas:program gas:linkType wdt:P279 |1242332|
|gas:program gas:maxIterations 3^^http://www.w3.org/2001/XMLSchema#integer |1242307|
|NODE_VAR NODE_VAR NODE_VAR |943751 |
|NODE_VAR <http://www.wikidata.org/prop/direct/P31>/(<http://www.wikidata.org/prop/direct/P279>)* wd:Q16521|677313 |
|NODE_VAR schema:about NODE_VAR |584918 |
|bd:serviceParam wikibase:language [AUTO_LANGUAGE],en |555352 |
|NODE_VAR schema:isPartOf https://en.wikipedia.org/ |312901 |
|NODE_VAR wdt:P569 NODE_VAR |289123 |
|NODE_VAR wdt:P570 NODE_VAR |283221 |
|NODE_VAR wdt:P1630 NODE_VAR |251418 |
|NODE_VAR wikibase:propertyType NODE_VAR |248225 |
|NODE_VAR schema:name NODE_VAR |210927 |
|NODE_VAR wdt:P31 NODE_VAR |207363 |
|NODE_VAR wdt:P18 NODE_VAR |150968 |
|NODE_VAR pq:P6552 NODE_VAR |136415 |
|NODE_VAR p:P2002 NODE_VAR |136376 |
|NODE_VAR rdf:type wikibase:Property |120507 |
|NODE_VAR wikibase:claim NODE_VAR |82803 |
|NODE_VAR wdt:P856 NODE_VAR |79010 |
|NODE_VAR wikibase:statementProperty NODE_VAR |78692 |
|hint:Query hint:optimizer None |68602 |
|NODE_VAR schema:inLanguage en |65903 |
|NODE_VAR skos:altLabel NODE_VAR |64923 |
|NODE_VAR wdt:P577 NODE_VAR |61687 |
|NODE_VAR pq:P1545 NODE_VAR |55542 |
|NODE_VAR schema:isPartOf https://sv.wikipedia.org/ |55440 |
|NODE_VAR wdt:P282 wd:Q8229 |50172 |
|http://www.wikidata.org schema:dateModified NODE_VAR |49241 |
|NODE_VAR wdt:P21 NODE_VAR |48106 |
|NODE_VAR wdt:P50 NODE_VAR |46583 |
|NODE_VAR schema:description NODE_VAR |46269 |
|NODE_VAR wikibase:propertyType wikibase:ExternalId |44854 |
|NODE_VAR wdt:P31 wd:Q5 |43986 |
|NODE_VAR p:P179 NODE_VAR |42521 |
|NODE_VAR wdt:P300 NODE_VAR |42304 |
|bd:serviceParam wikibase:language fr,en,it,sp,de |41919 |
|NODE_VAR wdt:P227 NODE_VAR |39279 |
|NODE_VAR wdt:P136 NODE_VAR |39038 |
|NODE_VAR wdt:P27 NODE_VAR |38119 |
|NODE_VAR ps:P179 NODE_VAR |36276 |
|NODE_VAR wdt:P19 NODE_VAR |33985 |
|NODE_VAR wdt:P1843 NODE_VAR |33459 |
|NODE_VAR wdt:P106 NODE_VAR |32173 |
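
The obfuscation behind the table above can be sketched as follows: variable and blank-node values are replaced by their node type, all other values are kept, and the resulting strings are counted. This is an illustrative plain-Python version, not the Spark code from the notebook; the `obfuscate` and `triple_string` helpers, the tuple representation, and the sample triples are assumptions.

```python
from collections import Counter

# Node types whose values are arbitrary names and are therefore hidden.
# Assumption: only variables and blank nodes are obfuscated, as described above.
OBFUSCATED_TYPES = {"NODE_VAR", "NODE_BLANK"}


def obfuscate(node_type: str, value: str) -> str:
    """Return the node type for variables and blank nodes, the value otherwise."""
    return node_type if node_type in OBFUSCATED_TYPES else value


def triple_string(triple) -> str:
    """Build the grouping key from three (node_type, value) pairs."""
    return " ".join(obfuscate(node_type, value) for node_type, value in triple)


# Hypothetical sample triples, each given as (subject, predicate, object).
triples = [
    # ?item rdfs:label ?label
    [("NODE_VAR", "item"), ("NODE_URI", "rdfs:label"), ("NODE_VAR", "label")],
    # ?x rdfs:label ?y (same pattern, different variable names)
    [("NODE_VAR", "x"), ("NODE_URI", "rdfs:label"), ("NODE_VAR", "y")],
    # ?person wdt:P31 wd:Q5
    [("NODE_VAR", "person"), ("NODE_URI", "wdt:P31"), ("NODE_URI", "wd:Q5")],
]

counts = Counter(triple_string(t) for t in triples)
top_50 = counts.most_common(50)
print(top_50)
```

The first two sample triples collapse into the single key `NODE_VAR rdfs:label NODE_VAR` with a count of 2, which is the effect the obfuscation is meant to have.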