Observed triple counts:
- Nov 6 munged dump: 5 909 445 794
- Nov 15 munged dump (lexemes): 21 591 655
- Triple count as reported in Grafana (Nov 5): 8 500 000 000
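Taken at face value, that is a gap of roughly 2.6 billion triples: 8 500 000 000 - (5 909 445 794 + 21 591 655) ≈ 2 569 000 000.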
We should export the triples from a production journal to try to understand where the differences are. To do this we need to copy a journal and run some tools provided by Blazegraph.
The tool is ExportKB; to run it we need all the jars present in the war (the jar for the Updater is not sufficient).
mkdir /tmp/blazegraph-war && cd /tmp/blazegraph-war && jar xvf /srv/deployment/wdqs/wdqs/blazegraph-service-*-SNAPSHOT.war
Then move to the folder containing the wikidata.jnl file and run:
java -Dlogback.configurationFile=/tmp/blazegraph-war/WEB-INF/classes/logback.xml -cp '/tmp/blazegraph-war/WEB-INF/lib/*' -server com.bigdata.rdf.sail.ExportKB -outdir journal_export/ -format Turtle /srv/deployment/wdqs/wdqs/RWStore.properties wdq
We don't have to run this on a production machine; we just need the wdqs war and the RWStore.properties file. The required space will probably be somewhere between 500 GB and 1 TB. Ideally we'd like the exported triples to be in HDFS in the analytics network, so we could use more compute to run basic aggregations to detect where the differences are.
Discussing with @elukey:
Mentioned in SAL (#wikimedia-operations) [2019-12-06T13:31:19Z] <gehel> starting transfer of blazegraph journal from wdqs1007 to stat1004 - T239898
Chiming in: I suggest using Spark for the investigation; given the size of the dataset, parallel computation should help (a minimal sketch is below). This means another hop for the data: stat1004 --> HDFS. Please ping me if you want/need help :)
edit: I should have read the last line of the ticket before commenting ...
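For reference, a minimal PySpark sketch of the kind of aggregation I have in mind. It assumes both the journal export and the munged dump have first been converted to N-Triples (one statement per line, for example with rapper) and uploaded to HDFS; the paths below are hypothetical placeholders. Comparing per-predicate counts between the two datasets should point at where the differences are.

# Sketch only: compare per-predicate triple counts between the journal
# export and the munged dump. Assumes N-Triples inputs in HDFS; the paths
# below are hypothetical placeholders.
from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.appName("wdqs-triple-diff").getOrCreate()

def predicate(line):
    # N-Triples puts one "<s> <p> <o> ." statement per line, so the
    # predicate is the second whitespace-separated token.
    parts = line.split(" ", 2)
    return parts[1] if len(parts) == 3 else None

def counts_by_predicate(path, label):
    lines = spark.sparkContext.textFile(path)
    preds = lines.map(predicate).filter(lambda p: p is not None)
    df = preds.map(lambda p: (p,)).toDF(["predicate"])
    return df.groupBy("predicate").count().withColumnRenamed("count", label)

journal = counts_by_predicate("hdfs:///tmp/wdqs/journal_export.nt", "journal")  # hypothetical path
dump = counts_by_predicate("hdfs:///tmp/wdqs/munged_dump.nt", "dump")  # hypothetical path

# Predicates whose counts disagree are a starting point for the investigation.
diff = (journal.join(dump, "predicate", "full_outer")
        .fillna(0)
        .where(F.col("journal") != F.col("dump"))
        .orderBy(F.abs(F.col("journal") - F.col("dump")).desc()))
diff.show(50, truncate=False)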
I properly recounted the triples (using an RDF parser) in the dump after the munge operation and found 8.9B triples; closing as invalid.
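For the record, a minimal sketch of such a recount in Python with rdflib (the ticket does not name the parser actually used, and the file layout below is a guess). rdflib holds the parsed graph in memory, so a dump of this size can only be counted this way chunk by chunk, as produced by the munge step; a streaming parser would avoid that limitation.

# Sketch only: recount triples across munged dump chunks with rdflib.
# The chunk layout is a hypothetical placeholder.
import glob
import gzip
from rdflib import Graph

total = 0
for path in sorted(glob.glob("munged/wikidump-*.ttl.gz")):  # hypothetical chunk layout
    g = Graph()
    with gzip.open(path, "rb") as f:
        g.parse(file=f, format="turtle")
    # len(g) counts distinct triples within a chunk; duplicates across
    # chunks would still be double-counted.
    total += len(g)
print(total)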