Investigate time-out issues with Wikidata harvests
Closed, Resolved (Public)


When running the harvester in Docker for the nl-wd_(nl) dataset (in the Wikidata branch) I get:
HTTPSConnectionPool(host='', port=443): Read timed out. (read timeout=30)
Repeating the run makes it work, so I'm guessing the SPARQL query finishes correctly but the harvester is simply not patient enough.

Event Timeline

So a quick and dirty solution might be to catch the timeout, sleep for a short time, then re-run. That way the query will finish on the SPARQL end and the re-run gets the result.

The sleep would have to be short enough that the sparql endpoint doesn't decide to invalidate the data.
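
The catch, sleep, re-run idea could be sketched roughly as below. This is only an illustration, not the merged patch: `run_with_retry`, its parameters, and the default values are hypothetical names and numbers chosen for the example, and the built-in `TimeoutError` stands in for whatever exception the HTTP layer actually raises.

```python
import time


def run_with_retry(run_query, retries=3, delay=5.0,
                   timeout_errors=(TimeoutError,), sleep=time.sleep):
    """Call run_query(), re-running it after a short sleep if it times out.

    run_query      -- zero-argument callable that performs the SPARQL request
    retries        -- how many times to re-run after the first attempt
    delay          -- seconds to wait between attempts (assumed value)
    timeout_errors -- exception types that count as a timeout
    sleep          -- injectable for testing; defaults to time.sleep
    """
    for attempt in range(retries + 1):
        try:
            return run_query()
        except timeout_errors:
            if attempt == retries:
                # Out of attempts: let the caller see the timeout.
                raise
            # Give the endpoint a moment to finish the query, then re-run.
            sleep(delay)
```

If the request is made with `requests`, passing `timeout_errors=(requests.exceptions.Timeout,)` should cover the read timeout shown in the task description. The delay would need tuning so that it stays short, as noted above.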

Change 371602 had a related patch set uploaded (by Lokal Profil; owner: Lokal Profil):
[labs/tools/heritage@wikidata] Handle sparql http timeout issue

Change 371602 merged by jenkins-bot:
[labs/tools/heritage@wikidata] Handle sparql http timeout issue

Lokal_Profil claimed this task.

Additionally an upstream fix in pywikibot has been made in

I am not sure the timeout catch actually works: when testing nl_nl, I got the timeout in the Query Service UI, but on the pywikibot side I just got an empty result.

Could something have changed with the query service timeout reporting?

Change 532414 had a related patch set uploaded (by Jean-Frédéric; owner: Jean-Frédéric):
[labs/tools/heritage@master] Improve SPARQL efficiency for nl_nl

Change 532414 merged by jenkins-bot:
[labs/tools/heritage@master] Improve SPARQL efficiency for nl_nl and au_en

Mentioned in SAL (#wikimedia-cloud) [2019-08-29T20:49:21Z] <JeanFred> Deploy latest from Git master: c518c02 (T172690)

@Lokal_Profil / @JeanFred: All related patches in Gerrit have been merged, so I'm boldly resolving this task.
(Please reopen if there is more to do in this task - thanks!)