
Optimize querying for surname
Closed, Resolved · Public · 5 Estimated Story Points

Description

The first implementation of the surname matcher (parent task) is very slow. Because there are so many surname items on Wikidata, pre-caching all of them turned out to be impossible. Right now we therefore run a query for every (possible) surname, which is too slow and resource-heavy for a large-scale upload.

Event Timeline

@Lokal_Profil – this is the surname matching problem we talked about earlier today.

Since there are too many surnames on Wikidata for it to be practical (or, in fact, possible) to download them all when starting the script (in importer.py), querying is done per name from the WikidataItem object (as implemented in the parent task).
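
For reference, a minimal sketch of the kind of per-name lookup this describes: one SPARQL query per candidate surname against WDQS. The endpoint and the "family name" class (Q101352) are real, but the function name and exact query are illustrative, not the code from the parent task.

```
# One SPARQL query per candidate surname (illustrative sketch only).
import requests

WDQS = 'https://query.wikidata.org/sparql'
HEADERS = {'User-Agent': 'surname-matcher-sketch/0.1'}  # WDQS wants a UA


def find_surname_item(name):
    """Return the Q-id of a family name item with this exact label, or None."""
    query = (
        'SELECT ?item WHERE { '
        '?item wdt:P31 wd:Q101352 ; '
        'rdfs:label "%s"@en . } LIMIT 1' % name
    )
    response = requests.get(
        WDQS, params={'query': query, 'format': 'json'}, headers=HEADERS)
    bindings = response.json()['results']['bindings']
    if bindings:
        return bindings[0]['item']['value'].split('/')[-1]
    return None
```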

This is slow; the obvious next step would be implementing a cache, saving time on repeated surnames, both successfully matched and unmatched. The problem is that the cache would have to live in the parent importer.py, but the WikidataItem object has no way to communicate with it.

[[ https://github.com/lokal-profil/wikidata-stuff/blob/9e68dd008ffeb86aca8c55801bc20d41e334f849/wikidataStuff/helpers.py#L207 | wikidataStuff.helpers.match_name() ]] implements this and holds a local cache (global in helpers.py). It behaves differently when running on ToolForge (SQL on the replica table) than on any other machine (action API via pywikibot); the two return different values, since the SQL implementation isn't primarily intended for exact matches. You can disable the ToolForge behaviour using the no_wdss flag.

It might be worth comparing these two with your SPARQL implementation to see which is quicker. Back in the day (early 2016) SQL on the replica table was way quicker, but at that time there was no SPARQL implementation to compare against.
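
A quick comparison could look something like the sketch below; `match_via_sql` and `match_via_sparql` are placeholders for however the two lookups end up being wrapped, not existing functions.

```
# Hypothetical timing harness; match_via_sql and match_via_sparql are
# placeholders for the replica-table and SPARQL lookups respectively.
import timeit

SAMPLE_NAMES = ['Andersson', 'Lindqvist', 'Nilsson']


def time_matcher(matcher, names, repeat=3):
    """Return the best wall-clock time for matching all names once."""
    return min(timeit.repeat(
        lambda: [matcher(name) for name in names],
        number=1, repeat=repeat))

# print(time_matcher(match_via_sql, SAMPLE_NAMES))
# print(time_matcher(match_via_sparql, SAMPLE_NAMES))
```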

Cachewise you can either:

  • have a cache object in WikidataItem which you pass as a parameter to the function in importer.py,
  • use a global cache object in importer.py,
  • implement real caching using e.g. cachetools, where you would cache the actual function call (see the sketch after this list).
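
For the third option, a minimal sketch with cachetools; the decorated function is a stand-in for whatever ends up doing the actual query.

```
# Memoise the lookup call itself; find_surname_item is a stand-in
# for the actual query function.
from cachetools import cached, LRUCache


@cached(cache=LRUCache(maxsize=10000))
def find_surname_item(name):
    ...  # run the SPARQL/SQL lookup here; the result (incl. None) is cached
```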

Ended up implementing a cache object in WikidataItem that is handled in importer.py. Additionally, it is saved to a local file, so when re-running the script, names that were matched in a previous run don't have to be queried for again, which speeds up execution further.
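
An illustrative sketch of that approach (a name-to-Q-id cache persisted to a local JSON file between runs); the names and structure here are simplified, not the actual implementation.

```
# Name -> Q-id cache, persisted to a local JSON file between runs
# (illustrative sketch; file name and helpers are made up).
import json
import os

CACHE_FILE = 'surname_cache.json'


def load_cache(path=CACHE_FILE):
    """Load the surname cache from disk, or start with an empty one."""
    if os.path.exists(path):
        with open(path) as f:
            return json.load(f)
    return {}


def save_cache(cache, path=CACHE_FILE):
    """Write the surname cache back to disk at the end of a run."""
    with open(path, 'w') as f:
        json.dump(cache, f)


def match_surname(name, cache):
    """Return a cached match if present, otherwise query and remember it."""
    if name not in cache:
        cache[name] = find_surname_item(name)  # per-name lookup, as above
    return cache[name]
```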