
Performance problems with pywikibot's pagegenerator
Closed, Resolved (Public)

Description

Hi, I'm currently working on an open-source university project (https://openartbrowser.org/ , https://github.com/hochschule-darmstadt/openartbrowser).

We query various art-related datasets from Wikidata. In our ETL process we fetch the data we want with the help of the pywikibot library. The library loads the Wikidata pages by their QIDs, which we query beforehand via SPARQL (pagegenerators.WikidataSPARQLPageGenerator). This whole process of extracting around 150,000 entries took us 47 hours, last measured at the end of October.
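For illustration, a minimal sketch of that pipeline (simplified from our extract_artworks function; the SPARQL query and class QID are placeholders, not our actual ones):

```
import pywikibot
from pywikibot import pagegenerators

site = pywikibot.Site('wikidata', 'wikidata')
repo = site.data_repository()

# step 1: fetch the QIDs via SPARQL (here: instances of painting, Q3305213)
query = 'SELECT ?item WHERE { ?item wdt:P31 wd:Q3305213 }'
generator = pagegenerators.WikidataSPARQLPageGenerator(query, site=repo)

# step 2: load every entity page one by one
for item in generator:
    data = item.get()  # one API round trip per entity
    labels = data['labels']
```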

We want to improve our crawler performance so that we can test new features faster.

Our implementation can be viewed here: https://github.com/hochschule-darmstadt/openartbrowser/blob/staging/scripts/Wikidata%20crawler/ArtOntologyCrawler.py in the extract_artworks function. This function first queries all QIDs of paintings, drawings and sculptures. Next we iterate over the items returned by the page generator. When measuring the timings I found that item.get() takes anywhere from 0.5 to 3-4 seconds per item. I assume that this is a page load of all the data on a Wikidata entity's page.
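For reference, a rough reconstruction of how such a call can be timed (not our exact measurement code; Q12418, the Mona Lisa, is just a stand-in for any crawled entity):

```
import time
import pywikibot

repo = pywikibot.Site('wikidata', 'wikidata').data_repository()
item = pywikibot.ItemPage(repo, 'Q12418')

start = time.perf_counter()
item.get()  # full entity load: labels, claims, sitelinks, ...
print(f'item.get() took {time.perf_counter() - start:.2f}s')
```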

The only possibility I see at the moment to improve this data extraction is multi-threading, because Wikidata allows five queries in parallel (which equals five page loads); see the sketch below. Extracting the data with direct SPARQL queries instead does not seem possible because the requests are very limited (see https://www.mediawiki.org/wiki/Wikidata_Query_Service/User_Manual#Query_limits).
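A sketch of that multi-threading idea, assuming pywikibot tolerates concurrent read requests (the QID list and the fetch() helper are placeholders):

```
from concurrent.futures import ThreadPoolExecutor

import pywikibot

repo = pywikibot.Site('wikidata', 'wikidata').data_repository()

def fetch(qid):
    # hypothetical helper: load one entity completely
    return pywikibot.ItemPage(repo, qid).get()

qids = ['Q12418', 'Q45585', 'Q154194']  # placeholder QIDs
with ThreadPoolExecutor(max_workers=5) as pool:  # five parallel loads
    for data in pool.map(fetch, qids):
        print(len(data['claims']))
```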

Maybe there is another way of solving this performance issue.
Best regards.

Event Timeline

Have you tried playing around with the "SETTINGS TO AVOID SERVER OVERLOAD" options in Pywikibot's user-config.py?

I played around with them, but saw no change in performance. I set the following options to the following values because I thought they would affect performance:
minthrottle = 0 # default value
maxthrottle = 1 # default 60
put_throttle = 0 # default 10 (I only read pages, so this was probably unnecessary)
maxlag = 1 # default 5
step = -1 # default value

For 100 pages it takes about 1 minute and 20 seconds; for 1,000 it takes ~12 minutes. I got the same times when I tested with the default config.

Have you considered getting the data from dumps, if that's possible for you?

I thought about the data dumps as well, but I think it's a bit overkill to download 60 GB to query about 100 MB of data, so I wanted to check first whether there are any other possibilities regarding the SPARQL endpoint or pywikibot.
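For completeness, if we did go the dump route, the usual pattern would be to stream the compressed JSON dump line by line and keep only the matching entities, so the 60 GB never has to be held in memory at once (untested sketch; the file name is the standard latest-all.json.gz dump):

```
import gzip
import json

with gzip.open('latest-all.json.gz', 'rt') as f:
    for line in f:
        line = line.rstrip().rstrip(',')
        if line in ('[', ']', ''):
            continue  # skip the array brackets around the entities
        entity = json.loads(line)
        # collect the "instance of" (P31) targets of this entity
        p31 = [snak['mainsnak']['datavalue']['value']['id']
               for snak in entity.get('claims', {}).get('P31', [])
               if snak['mainsnak'].get('datavalue')]
        if 'Q3305213' in p31:  # instance of painting
            pass  # process the entity
```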

Perhaps try WikidataIntegrator instead of Pywikibot (if it offers the functionality you need). I'm not sure what its current state of development is, so make sure you don't throw away your original code. A performance comparison of the two libraries from your project would be interesting.
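Untested sketch of what fetching a single item with WikidataIntegrator looks like, going by the library's README (verify the names against the installed version; Q12418 is a placeholder):

```
from wikidataintegrator import wdi_core

item = wdi_core.WDItemEngine(wd_item_id='Q12418')
data = item.get_wd_json_representation()  # the raw entity JSON
print(data['labels']['en']['value'])
```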

Decreasing the maxlag parameter may actually decrease performance, because some queries then probably have to wait until the server lag drops below the threshold.

Thanks for the answers.
I also tried the default value and higher values like 10, 20 or 60 for maxlag, but unfortunately there was no improvement with any of them.
I'll take a look at WikidataIntegrator and compare it with pywikibot.

When measuring the timings I found that item.get() takes anywhere from 0.5 to 3-4 seconds per item.

Probably T226157: Data retrieval may be very long and heavy because of parsing Link during SiteLink initialization.

Makes sense, because the API call itself really only takes milliseconds. By comparison, fetching the text of a wiki article with pywikibot.Page is roughly 10x faster than getting a Wikidata item.

OK, interesting. So I guess I have to wait until this issue is resolved, or is there other functionality I can use that won't load and initialize sitelinks? For a new feature we use the Wikipedia sitelink, but the performance problem occurred already before we used sitelinks.

You may use the preload_sites.py maintenance script in the meantime, until the underlying problem has a better solution. The script was introduced with release 6.0.0, currently the master branch.
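A related angle for cutting the number of round trips, assuming DataSite.preload_entities is available in your pywikibot version (the QIDs below are placeholders): batch-load the entities via wbgetentities, after which item.get() is served from the local cache.

```
import pywikibot

repo = pywikibot.Site('wikidata', 'wikidata').data_repository()
items = [pywikibot.ItemPage(repo, qid)
         for qid in ('Q12418', 'Q45585', 'Q154194')]

# one wbgetentities request per batch of up to 50 items
for item in repo.preload_entities(items, groupsize=50):
    print(item.title(), item.get()['labels'].get('en'))
```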

Change 660809 had a related patch set uploaded (by Xqt; owner: Xqt):
[pywikibot/core@master] [IMPR] Create a SiteLink with getitem method

https://gerrit.wikimedia.org/r/660809

Xqt claimed this task.
Xqt triaged this task as High priority.

Change 660809 merged by jenkins-bot:
[pywikibot/core@master] [IMPR] Create a SiteLink with getitem method

https://gerrit.wikimedia.org/r/660809