
Data corruption when loading RDF data into WDQS
Closed, Resolved, Public

Description

Some data is not loaded correctly into WDQS. There are several tickets about this; a specific example: wds:Q58696890-2376E531-DD35-4851-AE50-5D2F0770BCED has different data on
wdq5, wdq4, wdq23, wdq8, wdq24, wdq25
than on
wdq6, wdq21, wdq22, wdq3, wdq7, wdq26, wdq10
and yet another version of the data on wdq9.

The difference is that on the first group the statement is missing its rank, type, and reference.
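
A check like the one behind this report can be scripted. Below is a minimal sketch (in Python) that fetches the statement node from each server and groups hosts by the triples returned; the host names, the internal endpoint URL, and the response handling are assumptions for illustration, not values from this task.

    # Hypothetical sketch: group WDQS hosts by the triples they return for the
    # statement node. Host names and the internal endpoint URL are placeholders.
    import requests

    HOSTS = ["wdq3", "wdq4", "wdq5", "wdq6", "wdq7", "wdq8", "wdq9", "wdq10",
             "wdq21", "wdq22", "wdq23", "wdq24", "wdq25", "wdq26"]
    QUERY = ("PREFIX wds: <http://www.wikidata.org/entity/statement/> "
             "DESCRIBE wds:Q58696890-2376E531-DD35-4851-AE50-5D2F0770BCED")

    def fetch_triples(host):
        url = "http://%s.example.internal/sparql" % host   # assumed endpoint
        r = requests.get(url, params={"query": QUERY},
                         headers={"Accept": "application/n-triples"})
        r.raise_for_status()
        return frozenset(line for line in r.text.splitlines() if line.strip())

    groups = {}
    for host in HOSTS:
        groups.setdefault(fetch_triples(host), []).append(host)

    for triples, hosts in groups.items():
        print("%d triples on %s" % (len(triples), ", ".join(hosts)))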

Event Timeline


Timestamps for data updates:

wdq10: 2018-11-20T05:49:19Z
wdq6: 2018-11-20T05:49:25Z
wdq26: 2018-11-20T05:49:31Z
wdq21: 2018-11-20T05:49:32Z
wdq22: 2018-11-20T05:49:32Z
wdq3: 2018-11-20T05:49:35Z
wdq7: 2018-11-20T05:49:40Z

wdq9: 2018-11-20T05:49:39Z

wdq8: 2018-11-20T05:49:35Z
wdq5: 2018-11-20T05:49:39Z
wdq25: 2018-11-20T05:49:39Z
wdq4: 2018-11-20T05:49:48Z
wdq24: 2018-11-20T05:49:53Z
wdq23: 2018-11-20T05:50:13Z

Looks like the group that has the good data was updated earlier than the one that has the bad data, even though the timestamps overlap a bit. Still, the groups appear to be distinct, so it's possible something happened in the meantime... Maybe we need to capture the update RDF to see what the servers are seeing.

Change 475241 had a related patch set uploaded (by Smalyshev; owner: Smalyshev):
[operations/puppet@production] Enable dumping RDF data for debugging purposes

https://gerrit.wikimedia.org/r/475241

Change 475243 had a related patch set uploaded (by Smalyshev; owner: Smalyshev):
[operations/puppet@production] Enable dumping RDF on test & internal

https://gerrit.wikimedia.org/r/475243

I also discovered that for some items the data is not from the latest revision: e.g. for Q57529925, all servers except wdq5 are on revision 795730255, but wdq5 is on 795729753. This seems to be related to bursts of bot edits on the same entry, which may suggest there is some kind of race condition.

Maybe the Wikidata replica the Updater is reading from is lagging behind the updates and returns an old revision? Not sure if that is possible. If it is possible to read old data after we've got a Kafka message with new data, that would be a lot of trouble for us.
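
If that hypothesis is right, the stale read should be detectable by comparing the revision carried in the fetched RDF with the revision from the change event. A minimal sketch of such a check, assuming the dump-flavored Turtle exposes the revision as a schema:version literal:

    # Compare the revision in the fetched RDF with the revision from the change
    # event; anything older suggests a stale replica read. The schema:version
    # pattern is an assumption about the Turtle shape.
    import re
    import requests

    def fetched_revision(entity_id):
        url = "https://www.wikidata.org/wiki/Special:EntityData/%s.ttl" % entity_id
        ttl = requests.get(url, params={"flavor": "dump"}).text
        match = re.search(r'schema:version\s+"?(\d+)"?', ttl)
        return int(match.group(1)) if match else None

    def is_stale(entity_id, event_revision):
        revision = fetched_revision(entity_id)
        return revision is not None and revision < event_revision

    # e.g. is_stale("Q57529925", 795730255) would be True while a lagging
    # replica still serves revision 795729753.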

timings for the last one:

wdq5: 2018-11-21T23:35:56Z
wdq4: 2018-11-21T23:35:04Z
wdq6: 2018-11-21T23:34:44Z
wdq21: 2018-11-21T23:34:55Z
wdq22: 2018-11-21T23:34:49Z
wdq23: 2018-11-21T23:35:07Z
wdq3: 2018-11-21T23:35:03Z
wdq7: 2018-11-21T23:34:51Z
wdq8: 2018-11-21T23:34:56Z
wdq24: 2018-11-21T23:34:47Z
wdq25: 2018-11-21T23:34:56Z
wdq26: 2018-11-21T23:34:56Z
wdq9: 2018-11-21T23:34:40Z
wdq10: 2018-11-21T23:34:51Z

So wdq5 is actually updating very late compared to other hosts, yet it gets stale data! Very strange.

Change 475241 merged by Gehel:
[operations/puppet@production] Enable dumping RDF data for debugging purposes

https://gerrit.wikimedia.org/r/475241

Change 475243 merged by Gehel:
[operations/puppet@production] Enable dumping RDF on test & internal

https://gerrit.wikimedia.org/r/475243

RDF dumps confirm that the data is coming through fine.

This also happens for single-update items that haven't been touched for a while, so it's not some kind of update race. The weird thing is that it happens on multiple servers in the same way. E.g. a check on Q3601865 reveals:

wdq21, wdq22, wdq23, wdq24, wdq25, wdq26 - are fine
wdq6, wdq7, wdq8, wdq9, wdq10 - are missing the rank statement for Q36018657-1D11B168-9B90-44EF-88EE-4B8B09EE08A8 - all for the same statement ID!
wdq5, wdq4, wdq3 - are missing the rank statement and the P2860 statement for the same statement.

This does not look like completely random data corruption - something systematic is going on. Next I'll try to dump the SPARQL we're sending to see what's happening there.
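
For reference, the per-host grouping above can be reproduced with a targeted query. The sketch below simply asks each server whether the statement node still carries a rank triple; the host names and the internal SPARQL endpoint are placeholders, not values from this task.

    # Placeholder host names and endpoint; the query is just an ASK on the
    # statement node's rank triple.
    import requests

    ASK = ("PREFIX wds: <http://www.wikidata.org/entity/statement/> "
           "PREFIX wikibase: <http://wikiba.se/ontology#> "
           "ASK { wds:Q36018657-1D11B168-9B90-44EF-88EE-4B8B09EE08A8 wikibase:rank ?rank }")

    def has_rank(host):
        url = "http://%s.example.internal/sparql" % host   # assumed endpoint
        r = requests.get(url, params={"query": ASK},
                         headers={"Accept": "application/sparql-results+json"})
        r.raise_for_status()
        return r.json()["boolean"]

    for host in ("wdq3", "wdq6", "wdq21"):
        print(host, has_rank(host))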

Change 477410 had a related patch set uploaded (by Smalyshev; owner: Smalyshev):
[operations/puppet@production] Stop RDF dumps

https://gerrit.wikimedia.org/r/477410

Change 477410 merged by Gehel:
[operations/puppet@production] Stop RDF dumps

https://gerrit.wikimedia.org/r/477410

Change 477429 had a related patch set uploaded (by Smalyshev; owner: Smalyshev):
[operations/puppet@production] Enable SPARQL logging to a separate file

https://gerrit.wikimedia.org/r/477429

Change 477429 merged by Gehel:
[operations/puppet@production] Enable SPARQL logging to a separate file

https://gerrit.wikimedia.org/r/477429

Mentioned in SAL (#wikimedia-operations) [2018-12-04T19:24:53Z] <smalyshev@deploy1001> Started deploy [wdqs/wdqs@81dac18]: Install new Updater for T210044 investigation

Mentioned in SAL (#wikimedia-operations) [2018-12-04T19:35:30Z] <smalyshev@deploy1001> Finished deploy [wdqs/wdqs@81dac18]: Install new Updater for T210044 investigation (duration: 10m 36s)

SPARQL dumps show that data is present in SPARQL but not in the database. Filed https://github.com/blazegraph/database/issues/109 with upstream and will dig into it further to see what we can find out there.

Could the clearing of deleted items be related? https://www.wikidata.org/wiki/Lexeme:L31707 was deleted on 14 November 2018, but it is still in WDQS:

https://query.wikidata.org/#DESCRIBE%20wd%3AL31707
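
A quick, illustrative way to check for leftovers like this against the public endpoint is a plain ASK query, e.g.:

    import requests

    ASK = "ASK { <http://www.wikidata.org/entity/L31707> ?p ?o }"
    r = requests.get("https://query.wikidata.org/sparql",
                     params={"query": ASK, "format": "json"})
    print(r.json()["boolean"])   # True means WDQS still holds triples for L31707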

Hey guys! Any news about this problem?

@Floatingpurr I have fixed some of the problematic data but haven't found the root cause yet. Do you have any specific issues you think are related to this?

Hey @Smalyshev, no specific issues here other than T207675. I was just curious about the cause of the problem : )
Thank you!

Maybe the Wikidata replica the Updater is reading from is lagging behind the updates and returns an old revision? Not sure if that is possible. If it is possible to read old data after we've got a Kafka message with new data, that would be a lot of trouble for us.

That seems totally possible to me; it's a shame that maxlag cannot be requested by the Updater while using Special:EntityData.
Most of the time the replicas are not lagged, but if you're requesting the entity data within a second of the edit, hitting old data could be possible.

I went and looked at the requests of the updater.

"uri_path":"/wiki/Special:EntityData/Q60305784.ttl","uri_query":"?nocache=1550568436299&flavor=dump"

If I'm correct, nocache here is just the current timestamp of the request, added to avoid cache hits?
Although this avoids hitting the Varnish cache, you could still be getting old data in some situations.

It might make sense to remove this and instead pass in either the oldid or revision param (if this data is available in Kafka).
You'll then get a 404 if the revision is not yet retrievable and should probably wait a few seconds and try again; the correct amount of time to wait could probably be determined by maxlag, but maxlag doesn't currently work with Special:EntityData at all.
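
As a rough sketch of that suggestion (the parameter name and the retry/back-off values are assumptions based on this comment, not the Updater's actual behavior):

    import time
    import requests

    def fetch_revision_ttl(entity_id, revision, attempts=5, delay=2.0):
        url = "https://www.wikidata.org/wiki/Special:EntityData/%s.ttl" % entity_id
        for _ in range(attempts):
            r = requests.get(url, params={"revision": revision, "flavor": "dump"})
            if r.status_code == 200:
                return r.text
            if r.status_code == 404:
                time.sleep(delay)            # revision may not be visible yet
                continue
            r.raise_for_status()
        raise RuntimeError("revision %s of %s not visible after %d attempts"
                           % (revision, entity_id, attempts))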

It might make sense to remove this and instead pass in either the oldid or revision param

No, this has a horrible performance impact if several edits happen in a row: since it only fetches that oldid, not the latest one, instead of one update with the latest ID you get 10 updates, one for each of the intermediate IDs.

In any case, stale reads are discussed in T212550.

No, this has a horrible performance impact if several edits happen in a row: since it only fetches that oldid, not the latest one, instead of one update with the latest ID you get 10 updates, one for each of the intermediate IDs.

I think we have had this exact conversation on another ticket before.

In any case, stale reads are discussed in T212550.

I was going to comment on T212550, but it didn't quite fit, so I will comment here.

It almost sounds like, rather than having a live stream of edit events (or at least rather than acting entirely on one), the updater should instead do internal batching (waiting a few seconds before fetching the updated data, to see if there are subsequent writes), or it should be listening to an event stream that already does this (event batches in MediaWiki).
Such a batched event stream probably only makes sense for Wikibase edits? It could look something like (entity: Q123, latestrev: 123, changecount: 5), for example.
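
For illustration only, consuming such a batched event and collapsing a burst of edits into one update might look roughly like this; all names and the time window are made up:

    import time

    def collapse(events, window_seconds=5.0):
        """events: iterable of (entity_id, revision) in arrival order."""
        pending = {}                         # entity_id -> latest revision seen
        deadline = time.monotonic() + window_seconds
        for entity_id, revision in events:
            pending[entity_id] = max(revision, pending.get(entity_id, 0))
            if time.monotonic() >= deadline:
                break                        # flush the window
        return sorted(pending.items())       # one update per entity

    # several edits to the same entity become a single update with the latest revision
    print(collapse([("Q123", 121), ("Q123", 122), ("Q456", 7), ("Q123", 125)]))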

I could be making assumptions about the information the updater needs here.
I'm going to stop thinking on this topic now (unless poked)

It almost sounds like, rather than having a live stream of edit events (or at least rather than acting entirely on one), the updater should instead do internal batching

I thought about it; the problem here is that the starting point can be anything, so batching is a bit problematic. That said, the above is not exactly true: we have the dump as a natural starting point, and then we have starting points based on where the previous update finished. So if we create an aggregated stream that can be synchronized with dumps (e.g. make the dump issue an event to the same stream and make the aggregator never cross dump events), then it could work. Note, however, that a batched stream does not really solve the problem of stale updates - even with batches, an update can still read old data. Right now we detect this situation and re-queue the update, but ideally we'd use ChronologyProtector to fetch the proper data.
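
As a minimal sketch of that aggregation idea, under the assumptions in this comment (event shapes and names are invented): collapse edits per entity, but never merge across a dump marker, so consumers can always re-synchronize at a dump boundary.

    def aggregate(stream):
        """stream: iterable of ("edit", entity_id, revision) or ("dump", dump_id, None)."""
        batch = {}
        for kind, key, revision in stream:
            if kind == "dump":
                if batch:
                    yield ("batch", sorted(batch.items()))
                    batch = {}
                yield ("dump", key)          # dump markers pass through untouched
            else:
                batch[key] = max(revision, batch.get(key, 0))
        if batch:
            yield ("batch", sorted(batch.items()))

    demo = [("edit", "Q1", 10), ("edit", "Q1", 12),
            ("dump", "wikidata-20181119", None), ("edit", "Q1", 15)]
    for item in aggregate(demo):
        print(item)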

No recent instances of this, so I think it's fixed now.