Observed triple counts:
- Nov 6 munged dump: 5 909 445 794
- Nov 15 munged dump (lexemes): 21 591 655
- Triple count as reported in Grafana (Nov 5): 8 500 000 000
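Taken at face value, that is a gap of roughly 2.6 billion triples: 8 500 000 000 - (5 909 445 794 + 21 591 655) ≈ 2 569 000 000.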
We should export the triples from a production journal to try to understand where the differences are. To do this we need to copy a journal and run some tools provided by Blazegraph.
The tool is ExportKB; to run it we need all the jars present in the war (the jar for the Updater is not sufficient).
mkdir /tmp/blazegraph-war && cd /tmp/blazegraph-war && jar xvf /srv/deployment/wdqs/wdqs/blazegraph-service-*-SNAPSHOT.war
Then move to the folder containing the wikidata.jnl file and run:
java -Dlogback.configurationFile=/tmp/blazegraph-war/WEB-INF/classes/logback.xml -cp '/tmp/blazegraph-war/WEB-INF/lib/*' -server com.bigdata.rdf.sail.ExportKB -outdir journal_export/ -format Turtle /srv/deployment/wdqs/wdqs/RWStore.properties wdq
We don't have to run this on a production machine; we just need the wdqs war and the RWStore.properties file. The required space will probably be somewhere between 500 GB and 1 TB. Ideally we'd like the exported triples to be in HDFS in the analytics network, so we could use more compute to run basic aggregations to detect where the differences are.
Discussing with @elukey:
Mentioned in SAL (#wikimedia-operations) [2019-12-06T13:31:19Z] <gehel> starting transfer of blazegraph journal from wdqs1007 to stat1004 - T239898
Chiming in: I suggest using Spark for the investigation; given the size of the dataset, parallel computation should help (a minimal sketch is below). This means another hop for the data: stat1004 --> HDFS. Please ping me if you want/need help :)
edit: I should have read the last line of the ticket before commenting ...
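For reference, a minimal PySpark sketch of the kind of aggregation I have in mind. It assumes both the journal export and the munged dump have first been converted to N-Triples (one statement per line, for example with rapper) and uploaded to HDFS; the paths below are hypothetical placeholders. Comparing per-predicate counts between the two datasets should point at where the differences are.

# Sketch only: compare per-predicate triple counts between the journal
# export and the munged dump. Assumes N-Triples inputs in HDFS; the paths
# below are hypothetical placeholders.
from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.appName("wdqs-triple-diff").getOrCreate()

def predicate(line):
    # N-Triples puts one "<s> <p> <o> ." statement per line, so the
    # predicate is the second whitespace-separated token.
    parts = line.split(" ", 2)
    return parts[1] if len(parts) == 3 else None

def counts_by_predicate(path, label):
    lines = spark.sparkContext.textFile(path)
    preds = lines.map(predicate).filter(lambda p: p is not None)
    df = preds.map(lambda p: (p,)).toDF(["predicate"])
    return df.groupBy("predicate").count().withColumnRenamed("count", label)

journal = counts_by_predicate("hdfs:///tmp/wdqs/journal_export.nt", "journal")  # hypothetical path
dump = counts_by_predicate("hdfs:///tmp/wdqs/munged_dump.nt", "dump")  # hypothetical path

# Predicates whose counts disagree are a starting point for the investigation.
diff = (journal.join(dump, "predicate", "full_outer")
        .fillna(0)
        .where(F.col("journal") != F.col("dump"))
        .orderBy(F.abs(F.col("journal") - F.col("dump")).desc()))
diff.show(50, truncate=False)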
I properly recounted the triples (using an RDF parser) in the dump after the munge operation and found 8.9B triples; closing as invalid.
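For the record, a minimal sketch of such a recount in Python with rdflib (the ticket does not name the parser actually used, and the file layout below is a guess). rdflib holds the parsed graph in memory, so a dump of this size can only be counted this way chunk by chunk, as produced by the munge step; a streaming parser would avoid that limitation.

# Sketch only: recount triples across munged dump chunks with rdflib.
# The chunk layout is a hypothetical placeholder.
import glob
import gzip
from rdflib import Graph

total = 0
for path in sorted(glob.glob("munged/wikidump-*.ttl.gz")):  # hypothetical chunk layout
    g = Graph()
    with gzip.open(path, "rb") as f:
        g.parse(file=f, format="turtle")
    # len(g) counts distinct triples within a chunk; duplicates across
    # chunks would still be double-counted.
    total += len(g)
print(total)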