Investigate time-out issues with Wikidata harvests
Open, Needs Triage · Public

Description

When running the harvester in Docker for the nl-wd_(nl) dataset (in the Wikidata branch), I get:
HTTPSConnectionPool(host='query.wikidata.org', port=443): Read timed out. (read timeout=30)
Repeating the run makes it work, so I'm guessing the SPARQL query finishes correctly but the harvester is simply not patient enough.

Event Timeline

Restricted Application added a subscriber: PokestarFan. · Aug 7 2017, 2:14 PM

So a quick and dirty solution might be to catch the timeout, sleep for a short time, then re-run. That way the query will finish on the SPARQL end and we get the result.

The sleep would have to be short enough that the SPARQL endpoint doesn't decide to invalidate the data.
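A minimal sketch of the catch, sleep and re-run idea, not the code from the actual patch. The names `run_with_retry`, `run_query`, `retries`, `delay` and `timeout_errors` are hypothetical, and the built-in `TimeoutError` stands in for whatever exception the HTTP layer really raises.

```python
import time


def run_with_retry(run_query, retries=3, delay=10.0,
                   timeout_errors=(TimeoutError,), sleep=time.sleep):
    """Call run_query(); on a timeout, sleep briefly and try again.

    All names here are hypothetical and only illustrate the idea.
    """
    for attempt in range(retries + 1):
        try:
            return run_query()
        except timeout_errors:
            if attempt == retries:
                raise  # out of attempts: let the caller see the timeout
            # Keep the pause short so the endpoint has not yet
            # invalidated the result of the first attempt.
            sleep(delay)
```

If the query goes through `requests`, passing `timeout_errors=(requests.exceptions.Timeout,)` should cover the read timeout shown in the description. Whether that exception is the one that reaches the harvester when the call goes through pywikibot is an assumption to verify.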

Change 371602 had a related patch set uploaded (by Lokal Profil; owner: Lokal Profil):
[labs/tools/heritage@wikidata] Handle sparql http timeout issue

https://gerrit.wikimedia.org/r/371602

Change 371602 merged by jenkins-bot:
[labs/tools/heritage@wikidata] Handle sparql http timeout issue

https://gerrit.wikimedia.org/r/371602

Lokal_Profil closed this task as Resolved. · Aug 17 2017, 8:48 PM
Lokal_Profil claimed this task.

Additionally, an upstream fix in pywikibot has been made in
https://gerrit.wikimedia.org/r/371697

JeanFred reopened this task as Open. · Aug 26 2019, 5:47 PM

I am not sure the timeout catch actually works: testing nl_nl, I got the timeout in the Query Service UI, but on the pywikibot side I just got an empty result.

Could something have changed with the query service timeout reporting?
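If the timeout really does arrive as an empty result rather than as an exception, one possible workaround is to turn "no rows" into an explicit error that the caller can catch and retry. This is only a sketch: `EmptyResultError`, `require_rows` and `run_query` are hypothetical names, and it assumes the query helper returns `None` or an empty list in that case. A query that legitimately matches nothing would also trip this check.

```python
class EmptyResultError(Exception):
    """Raised when a query returns nothing (hypothetical, for this sketch)."""


def require_rows(run_query):
    """Return the rows from run_query(), or raise if there are none."""
    rows = run_query()
    if not rows:  # None or an empty list: possibly a silent timeout
        raise EmptyResultError("query returned no rows")
    return rows
```

For datasets such as nl_nl, where an empty harvest is never expected, treating it as a failure seems safer than silently continuing.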

Change 532414 had a related patch set uploaded (by Jean-Frédéric; owner: Jean-Frédéric):
[labs/tools/heritage@master] Improve SPARQL efficiency for nl_nl

https://gerrit.wikimedia.org/r/532414

Change 532414 merged by jenkins-bot:
[labs/tools/heritage@master] Improve SPARQL efficiency for nl_nl and au_en

https://gerrit.wikimedia.org/r/532414

Mentioned in SAL (#wikimedia-cloud) [2019-08-29T20:49:21Z] <JeanFred> Deploy latest from Git master: c518c02 (T172690)