
Optimize querying for surname
Closed, Resolved · Public · 5 Estimated Story Points

Description

The first implementation of the surname matcher (parent task) is very slow. Because there are so many surname items on Wikidata, pre-caching all of them turned out to be impossible. Right now we therefore run a query for every (possible) surname, which is too slow and resource-heavy for a large-scale upload.

Event Timeline

@Lokal_Profil – this is the surname matching problem we talked about earlier today.

Since there are too many surnames on Wikidata for it to be practical (or, in fact, possible) to download them all when starting the script (in importer.py), querying is done per name from the WikidataItem object (as implemented in the parent task).
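
For reference, a minimal sketch of the kind of per-name lookup this describes: one SPARQL query per candidate surname against WDQS. The endpoint and the "family name" class (Q101352) are real, but the function name and exact query are illustrative, not the code from the parent task.

```
# One SPARQL query per candidate surname (illustrative sketch only).
import requests

WDQS = 'https://query.wikidata.org/sparql'
HEADERS = {'User-Agent': 'surname-matcher-sketch/0.1'}  # WDQS wants a UA


def find_surname_item(name):
    """Return the Q-id of a family name item with this exact label, or None."""
    query = (
        'SELECT ?item WHERE { '
        '?item wdt:P31 wd:Q101352 ; '
        'rdfs:label "%s"@en . } LIMIT 1' % name
    )
    response = requests.get(
        WDQS, params={'query': query, 'format': 'json'}, headers=HEADERS)
    bindings = response.json()['results']['bindings']
    if bindings:
        return bindings[0]['item']['value'].split('/')[-1]
    return None
```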

This is slow; the obvious next step would be implementing a cache, saving time on repeated surnames, both successfully matched and unmatched. The problem is that the cache would have to live in the parent importer.py, but the WikidataItem object has no way to communicate with it.

[[ https://github.com/lokal-profil/wikidata-stuff/blob/9e68dd008ffeb86aca8c55801bc20d41e334f849/wikidataStuff/helpers.py#L207 | wikidataStuff.helpers.match_name() ]] implements this and holds a local cache (global in helpers.py). It behaves differently when running on ToolForge (SQL on the replica table) than on any other machine (action API via pywikibot); the two return different values, since the SQL implementation isn't primarily intended for exact matches. You can disable the ToolForge behaviour using the no_wdss flag.

It might be worth comparing these two with your SPARQL implementation to see which is quicker. Back in the day (early 2016) SQL on the replica table was way quicker, but at that time there was no SPARQL implementation to compare against.
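
A quick comparison could look something like the sketch below; `match_via_sql` and `match_via_sparql` are placeholders for however the two lookups end up being wrapped, not existing functions.

```
# Hypothetical timing harness; match_via_sql and match_via_sparql are
# placeholders for the replica-table and SPARQL lookups respectively.
import timeit

SAMPLE_NAMES = ['Andersson', 'Lindqvist', 'Nilsson']


def time_matcher(matcher, names, repeat=3):
    """Return the best wall-clock time for matching all names once."""
    return min(timeit.repeat(
        lambda: [matcher(name) for name in names],
        number=1, repeat=repeat))

# print(time_matcher(match_via_sql, SAMPLE_NAMES))
# print(time_matcher(match_via_sparql, SAMPLE_NAMES))
```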

Cachewise you can either:

  • have a cache object in WikidataItem which you pass as a parameter to the function in importer.py,
  • use a global cache object in importer.py,
  • implement real caching using e.g. cachetools, where you would cache the actual function call (see the sketch after this list).
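
For the third option, a minimal sketch with cachetools; the decorated function is a stand-in for whatever ends up doing the actual query.

```
# Memoise the lookup call itself; find_surname_item is a stand-in
# for the actual query function.
from cachetools import cached, LRUCache


@cached(cache=LRUCache(maxsize=10000))
def find_surname_item(name):
    ...  # run the SPARQL/SQL lookup here; the result (incl. None) is cached
```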

Ended up implementing a cache object in WikidataItem that is handled in importer.py. Additionally, it is saved to a local file, so when re-running the script, names that were matched in a previous run don't have to be queried for again, which speeds up execution further.
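
An illustrative sketch of that approach (a name-to-Q-id cache persisted to a local JSON file between runs); the names and structure here are simplified, not the actual implementation.

```
# Name -> Q-id cache, persisted to a local JSON file between runs
# (illustrative sketch; file name and helpers are made up).
import json
import os

CACHE_FILE = 'surname_cache.json'


def load_cache(path=CACHE_FILE):
    """Load the surname cache from disk, or start with an empty one."""
    if os.path.exists(path):
        with open(path) as f:
            return json.load(f)
    return {}


def save_cache(cache, path=CACHE_FILE):
    """Write the surname cache back to disk at the end of a run."""
    with open(path, 'w') as f:
        json.dump(cache, f)


def match_surname(name, cache):
    """Return a cached match if present, otherwise query and remember it."""
    if name not in cache:
        cache[name] = find_surname_item(name)  # per-name lookup, as above
    return cache[name]
```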