
Performance problems with pywikibot's pagegenerator
Open, Needs Triage, Public

Description

Hi, I'm currently working on an open-source university project (https://openartbrowser.org/ , https://github.com/hochschule-darmstadt/openartbrowser).

We query different datasets about art from Wikidata. In our ETL process we fetch the data we need with the help of the Pywikibot library. The library loads the Wikidata pages by their QIDs, which we query beforehand with SPARQL (pagegenerators.WikidataSPARQLPageGenerator). The last time we measured, at the end of October, extracting around 150,000 entries took 47 hours.
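To make this concrete, the generator is built roughly like this (a simplified sketch restricted to one entity class; the real queries cover more and the names are illustrative):

import pywikibot
from pywikibot import pagegenerators

site = pywikibot.Site("wikidata", "wikidata")
repo = site.data_repository()

# Example: all instances of painting (Q3305213); our real queries also cover
# drawings and sculptures and return roughly 150,000 QIDs in total
query = "SELECT ?item WHERE { ?item wdt:P31 wd:Q3305213 }"
generator = pagegenerators.WikidataSPARQLPageGenerator(query, site=repo)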

We want to improve our crawler performance so that we can test new features faster.

Our implementation can be viewed here https://github.com/hochschule-darmstadt/openartbrowser/blob/staging/scripts/Wikidata%20crawler/ArtOntologyCrawler.py in the extract_artworks function. This function first queries all QIDs of paintings, drawings and sculptures. Next we iterate over the items returned by the page generator. When measuring the times I noticed that item.get() takes from 0.5 to 3-4 seconds per item. I assume this is a page load of all the data on the page of a Wikidata entity.
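Boiled down, the hot loop looks roughly like this (continuing the sketch above; the actual code is in extract_artworks):

import time

for item in generator:                 # item is a pywikibot.ItemPage
    start = time.time()
    data = item.get()                  # loads labels, descriptions, claims, sitelinks, ...
    print(item.getID(), time.time() - start)   # typically 0.5 to 3-4 seconds per item
    # ... extract labels, claims etc. from data ...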

The only possibility I see at the moment to improve this data extraction is multi-threading, because Wikidata allows five requests in parallel (which equals five page loads). Getting the full data through direct SPARQL queries does not seem feasible because those requests are heavily limited (see https://www.mediawiki.org/wiki/Wikidata_Query_Service/User_Manual#Query_limits).
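A rough sketch of what I have in mind (untested; whether Pywikibot's internal throttle actually lets one Site object issue requests from several threads in parallel is something I would still have to verify):

from concurrent.futures import ThreadPoolExecutor

def fetch(item):
    return item.getID(), item.get()

items = list(generator)                           # materialize the ItemPages first
with ThreadPoolExecutor(max_workers=5) as pool:   # five parallel requests allowed
    for qid, data in pool.map(fetch, items):
        pass  # ... process the entity data ...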

Maybe there is another way of solving this performance issue.
Best regards.

Event Timeline

Tilomi created this task. Nov 16 2019, 1:15 PM

Have you tried to play around with "SETTINGS TO AVOID SERVER OVERLOAD" options in Pywikibot's user-config.py?

Tilomi added a comment (edited). Nov 16 2019, 5:29 PM

I played around with them, but I saw no change in performance. I set the following options to the following values because I thought they would affect performance:
minthrottle = 0  # default value
maxthrottle = 1  # default 60
put_throttle = 0  # default 10 (I only read pages, so this was probably unnecessary)
maxlag = 1  # default 5
step = -1  # default value

For 100 pages it takes about 1 minute and 20 seconds; for 1,000 pages it takes ~12 minutes. I got the same times when I tested with the default config.

Mpaa added a subscriber: Mpaa. Nov 16 2019, 5:34 PM

Have you considered getting the data from dumps, if possible?

I thought about data dumps as well, but I think it's a bit of overkill to download 60 GB to query about 100 MB of data, so I wanted to check first whether there are any other possibilities regarding the SPARQL endpoint or Pywikibot.
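If we did go that route, the JSON dump could at least be streamed line by line instead of being loaded at once; a rough, untested sketch (the instance-of QIDs are from memory and would need double-checking):

import bz2
import json

wanted = {"Q3305213", "Q93184", "Q860861"}            # painting, drawing, sculpture
with bz2.open("latest-all.json.bz2", "rt") as dump:   # ~60 GB compressed
    for line in dump:
        line = line.rstrip(",\n")
        if line in ("[", "]"):                        # the dump is one big JSON array
            continue
        entity = json.loads(line)
        p31 = entity.get("claims", {}).get("P31", [])
        classes = {c["mainsnak"]["datavalue"]["value"]["id"]
                   for c in p31 if c["mainsnak"].get("snaktype") == "value"}
        if classes & wanted:
            pass  # keep this entity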

Dvorapa added a comment (edited). Nov 16 2019, 7:27 PM

Perhaps try WikidataIntegrator instead of Pywikibot (if it offers the functionality you need). I'm unsure what its current state of development is, so make sure you don't throw away your original code. A performance comparison of the two libraries from your project would be interesting.
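Fetching a single item there would look something like this, just as an untested sketch (method names from memory, please verify against the current WikidataIntegrator docs):

from wikidataintegrator import wdi_core

item = wdi_core.WDItemEngine(wd_item_id="Q12418")   # Q12418: Mona Lisa
data = item.get_wd_json_representation()
print(data["labels"]["en"]["value"])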

Xqt added a subscriber: Xqt. Nov 16 2019, 8:12 PM

Decreasing the maxlag parameter may actually decrease performance, because some queries will probably end up waiting for the server lag to drop below the threshold.

Thanks for the answers.
I also tried the default value and higher values like 10, 20 or 60 for maxlag, but unfortunately there were no improvements when maxlag was changed.
I'll take a look at WikidataIntegrator and compare it with pywikibot.

When measuring the times I noticed that item.get() takes from 0.5 to 3-4 seconds per item.

Probably T226157: Data retrieval may be very long and heavy because of SiteLink initialization.

Dvorapa added a comment (edited). Nov 17 2019, 12:56 PM

When measuring the times I noticed that item.get() takes from 0.5 to 3-4 seconds per item.

Probably T226157: Data retrieval may be very long and heavy because of SiteLink initialization.

Makes sense, because the API call itself really takes only milliseconds. For comparison, fetching the text of a wiki article via pywikibot.Page is roughly 10x faster than getting a Wikidata item.

OK, interesting. So I guess I have to wait until this issue is resolved, or is there another functionality I can use that won't load and initialize sitelinks? For a new feature we use the Wikipedia sitelink, but the performance problem occurred already before we used sitelinks.
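One workaround I'm considering is to skip ItemPage entirely and batch raw wbgetentities calls myself, since the sitelinks then come back as plain JSON without Pywikibot's SiteLink objects; a rough, untested sketch:

import requests

def get_entities(qids):
    # wbgetentities accepts up to 50 ids per request
    response = requests.get(
        "https://www.wikidata.org/w/api.php",
        params={
            "action": "wbgetentities",
            "ids": "|".join(qids),
            "props": "labels|descriptions|claims|sitelinks",
            "format": "json",
        },
        headers={"User-Agent": "openartbrowser crawler (university project)"},
    )
    response.raise_for_status()
    return response.json()["entities"]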