
API sometimes does not return consistent data (sha1 does not match content)
Open, Needs Triage, Public

Description

I do a call to the API like this one:

https://de.wikipedia.org/wiki/Spezial:ApiSandbox#action=query&format=json&prop=revisions&pageids=365531&formatversion=2&rvprop=timestamp|content|ids|sha1&rvslots=main

And I had two problems with it:
From the logfile of my script, times are UTC

* 2019-09-25 09:42:05 Process Page Id:365531 Rev Id:192573058 Timestamp:20190925093700 Title:Liste der konsularischen Vertretungen in Hamburg

This call returned the content of the previous revision along with the revision id and the timestamp of the newest revision. At least, that's what it looked like when analyzing the data created by my script.

On that day the replag was very high; I wrote the following line in the chat:
[09:40:35] <Wurgl> https://tools.wmflabs.org/replag/ <-- 5 hours lag on enwiki? 7 hours on wikidata, others have 2 and 3 hours? What's going on here?

So I changed my code to retrieve the sha1 checksum too, compute the checksum in my script, and compare the two. No problems for about a week, then suddenly my logfile showed another similar case:

2019-10-03 11:18:30 Process Page Id:7645820 Rev Id:192815153 Timestamp:20191003111824 Title:MTV Eintracht Celle
2019-10-03 11:18:30 *** SHA1 does not match computed: da39a3ee5e6b4b0d3255bfef95601890afd80709 API: 628e76e06f3d101ee48121a03cfd10195f9dd784

So here the API returned, in one single call, content and a sha1 checksum that did not match. Even worse, the revision table does not hold either of these two checksums, so something is odd here.

Something is mixed up here. The returned sha1-checksum shall always match the content and the content shall always be the one of the reported timestamp and revision id.
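The check described above can be sketched as follows. This is a minimal illustration, not the script from this report: the function name `sha1_matches` and the sample revision are hypothetical, and it assumes a `formatversion=2` response with `rvslots=main`, where the wikitext sits under `slots.main.content` and the checksum under the revision-level `sha1` key.

```python
import hashlib


def sha1_matches(revision: dict) -> bool:
    """Compare the API-reported sha1 with one computed from the main slot.

    Assumes the layout revision["slots"]["main"]["content"] and a
    revision-level "sha1" key holding a lowercase hex digest.
    """
    content = revision["slots"]["main"]["content"]
    # The checksum is computed over the UTF-8 bytes of the text.
    computed = hashlib.sha1(content.encode("utf-8")).hexdigest()
    return computed == revision["sha1"]


# Hypothetical revision object, built so that the checksum is consistent.
sample = {
    "revid": 1,
    "sha1": hashlib.sha1("Hello".encode("utf-8")).hexdigest(),
    "slots": {"main": {"content": "Hello"}},
}
```

A revision without a `slots` key raises `KeyError` here, which keeps a missing-content response from being mistaken for a checksum mismatch.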

Event Timeline

Wurgl created this task. Oct 3 2019, 1:09 PM
Restricted Application added a project: Core Platform Team. Oct 3 2019, 1:09 PM
Restricted Application added a subscriber: Aklapper.
Anomie added a subscriber: Anomie.

At first analysis this seems likely to be due to bugs in your code rather than the API, but I'll leave it open for the moment while asking for more information.

From the logfile of my script, times are UTC

* 2019-09-25 09:42:05 Process Page Id:365531 Rev Id:192573058 Timestamp:20190925093700 Title:Liste der konsularischen Vertretungen in Hamburg

This call returned the content of the previous revision along with the revision id and the timestamp of the newest revision. At least, that's what it looked like when analyzing the data created by my script.

That seems unlikely, even when there was high replag. I can't think of a way where the code would be likely to return the wrong revision's content, rather than no content or an error about missing content. What's the evidence for this? Is the source code for your script publicly posted somewhere?

So I changed my code to retrieve the sha1 checksum too, compute the checksum in my script, and compare the two. No problems for about a week, then suddenly my logfile showed another similar case:

2019-10-03 11:18:30 Process Page Id:7645820 Rev Id:192815153 Timestamp:20191003111824 Title:MTV Eintracht Celle
2019-10-03 11:18:30 *** SHA1 does not match computed: da39a3ee5e6b4b0d3255bfef95601890afd80709 API: 628e76e06f3d101ee48121a03cfd10195f9dd784

So here the API returned, in one single call, content and a sha1 checksum that did not match. Even worse, the revision table does not hold either of these two checksums, so something is odd here.

The dewiki revision table for that revision contains 'big7540fscbb6h8ndw5aeqedosj5wn8', which is encoded as base-36. That corresponds to '628e76e06f3d101ee48121a03cfd10195f9dd784' in base-16, which is what you report the API returned. When I fetch the content of that revision now, I calculate the same sha1.
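For anyone comparing the revision table against API output, the base-36 to base-16 conversion can be sketched like this. The helper names are hypothetical, and the padding widths (31 base-36 digits, 40 hex digits) are assumptions based on the 31-character value quoted above and the 160-bit size of a SHA-1 digest.

```python
BASE36_DIGITS = "0123456789abcdefghijklmnopqrstuvwxyz"


def base36_to_hex(value: str, width: int = 40) -> str:
    """Convert a base-36 checksum string to zero-padded lowercase hex."""
    return format(int(value, 36), "x").zfill(width)


def hex_to_base36(value: str, width: int = 31) -> str:
    """Convert a hex checksum string to zero-padded lowercase base-36."""
    number = int(value, 16)
    digits = ""
    while number:
        number, remainder = divmod(number, 36)
        digits = BASE36_DIGITS[remainder] + digits
    return digits.zfill(width)


# Value from the dewiki revision table quoted above, converted to hex so it
# can be compared with the "sha1" field returned by the API.
table_value = "big7540fscbb6h8ndw5aeqedosj5wn8"
as_hex = base36_to_hex(table_value)
```

Converting in both directions and comparing the results is a cheap way to confirm that a lookup in the revision table used the right encoding.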

I note that the checksum you report calculating, da39a3ee5e6b4b0d3255bfef95601890afd80709, is the checksum of the empty string.
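This is easy to confirm, and a client could use it to tell "no content was hashed" apart from a real mismatch. A small sketch, where the helper name `hashed_empty_input` is hypothetical:

```python
import hashlib

# SHA-1 digest of zero bytes of input.
EMPTY_SHA1 = hashlib.sha1(b"").hexdigest()


def hashed_empty_input(computed_sha1: str) -> bool:
    """Return True if a computed digest is the digest of the empty string.

    For a page that is not expected to be blank, this suggests the content
    was missing when it was hashed.
    """
    return computed_sha1.lower() == EMPTY_SHA1
```

Applied to the log line in question, `hashed_empty_input("da39a3ee5e6b4b0d3255bfef95601890afd80709")` returns `True`.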

The returned sha1-checksum shall always match the content

Note that's generally true for text-based content like wikitext, but may not be true for other formats. That includes the revision-level checksum for revisions with multiple slots.

Wurgl added a comment. Oct 5 2019, 10:11 AM

Okay, the empty data seems to be solvable. I got the following response from the API (showing just the relevant part):

{"pageid":9577707,"ns":0,"title":"Chambon (Charente-Maritime)","revisions":[{"revid":192865788,"parentid":192865756,"timestamp":"2019-10-05T07:14:27Z","sha1":"1e29e6f2c752a9e3dd172b2d2acfe1844eb58fa4","slotsmissing":true}]}

The full request and the full answer can be found in file /data/project/persondata/data/2019-10-05 07:14:29.txt

I will investigate more.

Anomie added a comment. Oct 7 2019, 5:42 PM

That particular example is due to T212428 ("includes/Revision/RevisionStore.php: Main slot of revision (number) not found in database!"). Sometimes, when a revision is first created, MediaWiki is able to load the revision row but not the content row, even though DB transactions should ensure that both become visible at the same time. Unfortunately, we have yet to figure out what is causing that.
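Until that is resolved, a client could treat `slotsmissing` as a transient condition and ask again shortly afterwards. The sketch below is one possible workaround, not something recommended in this task: the function name, the retry count, and the delay are all assumptions, and `fetch` stands for any caller-supplied function that performs the API request and returns the revision object.

```python
import time
from typing import Callable, Optional


def fetch_with_content(
    fetch: Callable[[], dict],
    attempts: int = 3,
    delay: float = 1.0,
) -> Optional[dict]:
    """Call fetch() until the revision carries slot content.

    Returns the revision dict, or None if every attempt came back with
    "slotsmissing" set or without a "slots" key.
    """
    for attempt in range(attempts):
        revision = fetch()
        if not revision.get("slotsmissing") and "slots" in revision:
            return revision
        # Wait before the next try, but not after the last one.
        if attempt < attempts - 1:
            time.sleep(delay)
    return None
```

Returning `None` rather than raising lets the caller skip the page and log the case, which keeps an occasional missing content row from being reported as a checksum mismatch.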