Hi, I'm currently working on an open-source university project (https://openartbrowser.org/, https://github.com/hochschule-darmstadt/openartbrowser).
We query different datasets about art from Wikidata. In our ETL process we fetch the data we want with the pywikibot library. This library loads the Wikidata pages by their QIDs, which we query beforehand with SPARQL (pagegenerators.WikidataSPARQLPageGenerator). Extracting around 150,000 entries took us 47 hours when we last measured it at the end of October.
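For reference, here is a simplified sketch of our setup (not the exact repo code; the QID for paintings, Q3305213, stands in for the full query):

```python
import pywikibot
from pywikibot import pagegenerators

site = pywikibot.Site("wikidata", "wikidata")

# Example query: all paintings (Q3305213); the real query also covers
# drawings and sculptures.
QUERY = "SELECT ?item WHERE { ?item wdt:P31 wd:Q3305213 . }"

for item in pagegenerators.WikidataSPARQLPageGenerator(QUERY, site=site):
    item.get()  # loads the full entity page
    # ... extract labels, descriptions, claims, etc.
```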
We want to improve our crawler performance so that we can test new features faster.
Our implementation can be viewed here https://github.com/hochschule-darmstadt/openartbrowser/blob/staging/scripts/Wikidata%20crawler/ArtOntologyCrawler.py in the extract_artworks function. This function first queries all QIDs of paintings, drawings and sculptures. Next we iterate over the items returned by the page generator. While timing the run I noticed that each item.get() call takes anywhere from 0.5 to 4 seconds. I assume this is a full page load of all the data on a Wikidata entity's page.
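Roughly how I measured it, reusing the generator from the sketch above:

```python
import time

for item in pagegenerators.WikidataSPARQLPageGenerator(QUERY, site=site):
    start = time.time()
    item.get()  # one full entity page load per item
    # With ~150,000 items at 0.5-4 s each, the 47-hour total adds up.
    print(f"{item.title()} took {time.time() - start:.2f}s")
```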
The only possibility I see at the moment to improve this data extraction is multi-threading, because Wikidata allows five queries in parallel (which equals five page loads). Fetching everything directly via SPARQL does not seem possible because requests are heavily rate-limited (see https://www.mediawiki.org/wiki/Wikidata_Query_Service/User_Manual#Query_limits).
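What I have in mind is something like this (a rough sketch; I haven't verified that pywikibot's session handling is thread-safe):

```python
from concurrent.futures import ThreadPoolExecutor

def load_item(item):
    # item.get() caches the entity data on the ItemPage object.
    item.get()
    return item

# Collecting the QIDs is fast; the per-item page loads are the bottleneck.
items = list(pagegenerators.WikidataSPARQLPageGenerator(QUERY, site=site))

# Five workers to stay within Wikidata's limit of five parallel requests.
with ThreadPoolExecutor(max_workers=5) as executor:
    loaded_items = list(executor.map(load_item, items))
```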
Maybe there is another way to solve this performance issue?