Data corruption when loading RDF data into WDQS
Open, High · Public

Description

Some data are not loaded correctly into WDQS. There are several tickets about this; a specific example: wds:Q58696890-2376E531-DD35-4851-AE50-5D2F0770BCED has different data on
wdq5, wdq4, wdq23, wdq8, wdq24, wdq25
than on
wdq6, wdq21, wdq22, wdq3, wdq7, wdq26, wdq10
and yet another variant on wdq9.

The difference is that on the former group the statement is missing its rank, type and reference.
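
For reference, a minimal sketch of such a per-host comparison, assuming each host exposes its Blazegraph SPARQL endpoint at http://<host>/sparql and answers with SPARQL JSON results (the host names are the shorthand used in this task, not the real FQDNs; adjust both as needed):

```
#!/usr/bin/env python3
# Compare the triples stored for one statement node across WDQS hosts.
# Assumptions (not from this task): endpoint URL pattern and host names.
import json
import urllib.parse
import urllib.request

HOSTS = ['wdq3', 'wdq4', 'wdq5', 'wdq6', 'wdq7', 'wdq8', 'wdq9', 'wdq10',
         'wdq21', 'wdq22', 'wdq23', 'wdq24', 'wdq25', 'wdq26']

QUERY = """
PREFIX wds: <http://www.wikidata.org/entity/statement/>
SELECT ?p ?o WHERE {
  wds:Q58696890-2376E531-DD35-4851-AE50-5D2F0770BCED ?p ?o .
}
"""

def fetch(host):
    """Return the set of (predicate, object) pairs this host has for the statement."""
    url = 'http://%s/sparql?query=%s' % (host, urllib.parse.quote(QUERY))
    req = urllib.request.Request(
        url, headers={'Accept': 'application/sparql-results+json'})
    with urllib.request.urlopen(req, timeout=30) as resp:
        data = json.load(resp)
    return {(b['p']['value'], b['o']['value'])
            for b in data['results']['bindings']}

results = {host: fetch(host) for host in HOSTS}
baseline = results['wdq6']  # one of the hosts reported as having the full data
for host, triples in sorted(results.items()):
    print(host, 'missing:', sorted(baseline - triples),
          'extra:', sorted(triples - baseline))
```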

Restricted Application added a project: Wikidata. · Wed, Nov 21, 6:05 AM
Restricted Application added a subscriber: Aklapper.

Timestamps for data updates:

Hosts with the correct data:
wdq10: 2018-11-20T05:49:19Z
wdq6: 2018-11-20T05:49:25Z
wdq26: 2018-11-20T05:49:31Z
wdq21: 2018-11-20T05:49:32Z
wdq22: 2018-11-20T05:49:32Z
wdq3: 2018-11-20T05:49:35Z
wdq7: 2018-11-20T05:49:40Z

Host with the third variant:
wdq9: 2018-11-20T05:49:39Z

Hosts missing rank, type and reference:
wdq8: 2018-11-20T05:49:35Z
wdq5: 2018-11-20T05:49:39Z
wdq25: 2018-11-20T05:49:39Z
wdq4: 2018-11-20T05:49:48Z
wdq24: 2018-11-20T05:49:53Z
wdq23: 2018-11-20T05:50:13Z

Looks like the group that has the good data was updated earlier than the one that has bad data, even though the timestamps overlap a bit. But the groups appear to be distinct, so it's possible something happened in the meantime... Maybe we need to capture the RDF of the updates to see what the servers are seeing.

Smalyshev triaged this task as High priority.
Smalyshev added a project: User-Smalyshev.
Smalyshev claimed this task.
Smalyshev moved this task from Backlog to Doing on the User-Smalyshev board.
Addshore moved this task from incoming to monitoring on the Wikidata board. · Wed, Nov 21, 8:20 AM

Change 475241 had a related patch set uploaded (by Smalyshev; owner: Smalyshev):
[operations/puppet@production] Enable dumping RDF data for debugging purposes

https://gerrit.wikimedia.org/r/475241

Change 475243 had a related patch set uploaded (by Smalyshev; owner: Smalyshev):
[operations/puppet@production] Enable dumping RDF on test & internal

https://gerrit.wikimedia.org/r/475243

I also discovered that for some items the data is not from the latest revision: e.g. for Q57529925 all servers except wdq5 are on revision 795730255, but wdq5 is on 795729753. This seems to be related to bursts of bot edits on the same entry, which may suggest there is some kind of race condition.

Maybe the Wikidata replica that the Updater is reading from is lagging behind the updates and returns an old revision? Not sure if that is possible. If it is possible to read old data after we have received the Kafka message announcing the new data, that would be a lot of trouble for us.

Update timings for the last item (Q57529925):

wdq5: 2018-11-21T23:35:56Z
wdq4: 2018-11-21T23:35:04Z
wdq6: 2018-11-21T23:34:44Z
wdq21: 2018-11-21T23:34:55Z
wdq22: 2018-11-21T23:34:49Z
wdq23: 2018-11-21T23:35:07Z
wdq3: 2018-11-21T23:35:03Z
wdq7: 2018-11-21T23:34:51Z
wdq8: 2018-11-21T23:34:56Z
wdq24: 2018-11-21T23:34:47Z
wdq25: 2018-11-21T23:34:56Z
wdq26: 2018-11-21T23:34:56Z
wdq9: 2018-11-21T23:34:40Z
wdq10: 2018-11-21T23:34:51Z

So wdq5 is actually updating very late compared to other hosts, yet it gets stale data! Very strange.
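
A quick way to check which revision each host actually ended up with is to read schema:version per host. A rough sketch, under the same endpoint assumption as the earlier snippet and assuming the munged RDF keeps schema:version on the entity node:

```
#!/usr/bin/env python3
# Report which revision of Q57529925 each WDQS host currently holds.
# Assumptions (not from this task): endpoint URL pattern, and that
# schema:version is stored on the entity node in the munged RDF.
import json
import urllib.parse
import urllib.request

HOSTS = ['wdq3', 'wdq4', 'wdq5', 'wdq6', 'wdq7', 'wdq8', 'wdq9', 'wdq10',
         'wdq21', 'wdq22', 'wdq23', 'wdq24', 'wdq25', 'wdq26']

QUERY = """
PREFIX wd: <http://www.wikidata.org/entity/>
PREFIX schema: <http://schema.org/>
SELECT ?rev WHERE { wd:Q57529925 schema:version ?rev }
"""

for host in HOSTS:
    url = 'http://%s/sparql?query=%s' % (host, urllib.parse.quote(QUERY))
    req = urllib.request.Request(
        url, headers={'Accept': 'application/sparql-results+json'})
    with urllib.request.urlopen(req, timeout=30) as resp:
        bindings = json.load(resp)['results']['bindings']
    rev = bindings[0]['rev']['value'] if bindings else '(no schema:version found)'
    print(host, rev)
```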

Change 475241 merged by Gehel:
[operations/puppet@production] Enable dumping RDF data for debugging purposes

https://gerrit.wikimedia.org/r/475241

Change 475243 merged by Gehel:
[operations/puppet@production] Enable dumping RDF on test & internal

https://gerrit.wikimedia.org/r/475243

Could it also be the cause of T210495?

abian added a subscriber: abian. · Sun, Dec 2, 11:55 AM

RDF dumps confirm that the data is coming through fine.

This also happens for single-update items that haven't been touched for a while, so it's not some kind of update race. The weird thing is that it happens on multiple servers in the same way. E.g. a check on Q3601865 reveals:

['wdq21', 'wdq22', 'wdq23', 'wdq24', 'wdq25', 'wdq26'] - are fine
['wdq6', 'wdq7', 'wdq8', 'wdq9', 'wdq10'] - are missing the rank statement for Q36018657-1D11B168-9B90-44EF-88EE-4B8B09EE08A8 - all for the same statement ID!
['wdq5', 'wdq4', 'wdq3'] - are missing the rank statement and the P2860 statement for the same statement.

This does not look like completely random data corruption - something is going on. I'll try to dump the SPARQL we're sending next to see what's up there.
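
To make that split explicit, one can cluster the hosts by the exact triple set they return for the statement node; a sketch under the same endpoint assumption as above:

```
#!/usr/bin/env python3
# Cluster WDQS hosts by the exact triple set they return for one statement
# node, to reproduce the three-way split observed above.
# Assumption (not from this task): endpoint URL pattern and host names.
from collections import defaultdict
import json
import urllib.parse
import urllib.request

HOSTS = ['wdq3', 'wdq4', 'wdq5', 'wdq6', 'wdq7', 'wdq8', 'wdq9', 'wdq10',
         'wdq21', 'wdq22', 'wdq23', 'wdq24', 'wdq25', 'wdq26']

QUERY = """
PREFIX wds: <http://www.wikidata.org/entity/statement/>
SELECT ?p ?o WHERE { wds:Q36018657-1D11B168-9B90-44EF-88EE-4B8B09EE08A8 ?p ?o }
"""

def triples(host):
    url = 'http://%s/sparql?query=%s' % (host, urllib.parse.quote(QUERY))
    req = urllib.request.Request(
        url, headers={'Accept': 'application/sparql-results+json'})
    with urllib.request.urlopen(req, timeout=30) as resp:
        data = json.load(resp)
    return frozenset((b['p']['value'], b['o']['value'])
                     for b in data['results']['bindings'])

groups = defaultdict(list)
for host in HOSTS:
    groups[triples(host)].append(host)

# Each distinct triple set corresponds to one group of hosts.
for triple_set, hosts in groups.items():
    print(sorted(hosts), '->', len(triple_set), 'triples')
```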

Change 477410 had a related patch set uploaded (by Smalyshev; owner: Smalyshev):
[operations/puppet@production] Stop RDF dumps

https://gerrit.wikimedia.org/r/477410

Change 477410 merged by Gehel:
[operations/puppet@production] Stop RDF dumps

https://gerrit.wikimedia.org/r/477410

Change 477429 had a related patch set uploaded (by Smalyshev; owner: Smalyshev):
[operations/puppet@production] Enable SPARQL logging to a separate file

https://gerrit.wikimedia.org/r/477429

Change 477429 merged by Gehel:
[operations/puppet@production] Enable SPARQL logging to a separate file

https://gerrit.wikimedia.org/r/477429

Mentioned in SAL (#wikimedia-operations) [2018-12-04T19:24:53Z] <smalyshev@deploy1001> Started deploy [wdqs/wdqs@81dac18]: Install new Updater for T210044 investigation

Mentioned in SAL (#wikimedia-operations) [2018-12-04T19:35:30Z] <smalyshev@deploy1001> Finished deploy [wdqs/wdqs@81dac18]: Install new Updater for T210044 investigation (duration: 10m 36s)

The SPARQL dumps show that the data is present in the SPARQL updates we send, but not in the database. Filed https://github.com/blazegraph/database/issues/109 with upstream and will dig into it further to see what we can find out there.

The query is captured in F27383365.

Might the clearing of deleted items be related? https://www.wikidata.org/wiki/Lexeme:L31707 was deleted on 14 November 2018, but is still in WDQS:

https://query.wikidata.org/#DESCRIBE%20wd%3AL31707
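
A quick way to check whether a deleted entity still has any triples left is an ASK query against the public endpoint; a minimal sketch (the User-Agent string is just a placeholder):

```
#!/usr/bin/env python3
# Check whether the deleted Lexeme L31707 still has triples in WDQS.
import json
import urllib.parse
import urllib.request

QUERY = 'ASK { <http://www.wikidata.org/entity/L31707> ?p ?o }'
url = 'https://query.wikidata.org/sparql?query=' + urllib.parse.quote(QUERY)
req = urllib.request.Request(url, headers={
    'Accept': 'application/sparql-results+json',
    'User-Agent': 'wdqs-debug-sketch (T210044)',  # placeholder UA
})
with urllib.request.urlopen(req, timeout=30) as resp:
    # ASK results come back as {"head": {}, "boolean": true/false}
    print('L31707 still present:', json.load(resp)['boolean'])
```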

@Fnielsen: the deletion issues were caused by T210451.

KaMan added a subscriber: KaMan. · Sun, Dec 9, 1:13 PM