Opinions posted here are my own, unless stated otherwise
I think something along these lines makes sense; please comment here: https://github.com/wmde/doctrine-term-store/pull/8/files
I am also not happy with this task, as again it specifies a solution, not an outcome. I would much rather have "avoid expensive cleanup during the request to the degree this is possible" as an acceptance criterion in the story.
Tue, Apr 23
We talked about this (two weeks ago?) and concluded there likely is no need to introduce anything in wikibase/term-store. I'm pretty annoyed with this task now since it specifies a solution rather than a problem that needs to be solved.
WikibaseImport contains a limited number of items and properties
This means we have to go with the "smart update using diff" approach, since otherwise we do not know which terms have been removed. It is not clear to me that doing the cleanup post-request will make sense; we might end up delaying only a few percent of the cost. I suggest first making it work on write, and then seeing if we can gain a lot by moving stuff to a job.
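To make the diff idea concrete, here is a minimal sketch, assuming terms can be flattened into comparable strings (e.g. "label|en|Berlin"); the real term representation in the store differs:

```php
// Hypothetical helper: compute which flattened terms to delete and which
// to insert, given the currently stored set and the new set.
function diffTerms( array $oldTerms, array $newTerms ): array {
	return [
		'delete' => array_values( array_diff( $oldTerms, $newTerms ) ),
		'insert' => array_values( array_diff( $newTerms, $oldTerms ) ),
	];
}
```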
I'm calling it a day. My current guess is that the tables are not created correctly because we are not using this setting in mediawiki/doctrine-connection.
There are some issues, though: some properties result in an error, and on re-run many of them do.
With https://gerrit.wikimedia.org/r/c/mediawiki/extensions/Wikibase/+/501992/ the rebuilding works using the Doctrine Term Store.
Mon, Apr 22
@alaa_wmde what is the status of this?
AFAIK the script we currently have (https://gerrit.wikimedia.org/r/c/mediawiki/extensions/Wikibase/+/505670) is sufficient for this task. It has continuation based on page id rather than property id. I figure that won't fly for items, but it is likely OK for properties. Do we need continuation at all for properties?
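Roughly, page-id based continuation looks like this (a sketch using MediaWiki's IDatabase; $dbr, $propertyNamespaceId and $batchSize are placeholders, not the script's actual variables):

```php
$lastPageId = 0;
do {
	// Fetch the next batch of property pages, ordered by page id.
	$rows = $dbr->select(
		'page',
		[ 'page_id', 'page_title' ],
		[
			'page_namespace' => $propertyNamespaceId,
			'page_id > ' . (int)$lastPageId,
		],
		__METHOD__,
		[ 'ORDER BY' => 'page_id ASC', 'LIMIT' => $batchSize ]
	);
	foreach ( $rows as $row ) {
		// Rebuild the terms of the property stored on this page ...
		$lastPageId = (int)$row->page_id;
	}
} while ( $rows->numRows() === $batchSize );
```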
We already did some of this while working on the property script: https://gerrit.wikimedia.org/r/c/mediawiki/extensions/Wikibase/+/505679
Sun, Apr 21
Yesterday while thinking about design stuff I randomly realized that we might not need a script like this. Can't we just use https://github.com/Wikidata/WikibaseImport to import a bunch of real entities? If that is too slow, then perhaps we can use https://github.com/JeroenDeDauw/Replicator to import JSON dumps.
Tue, Apr 16
Sun, Apr 14
Fri, Apr 12
I don't think it is a good idea to modify the existing RebuildTermSqlIndex code; we can just create a new script. The existing code has things in there that we don't need, and keeping the wb_terms-specific script around might be useful to various users.
@alaa_wmde is this done?
Thu, Apr 11
Wed, Apr 10
"delete everything" means deleting all terms for an item/property in the item/property_terms table, rather than just those that actually need to be removed.
Tue, Apr 9
We won't be doing this as per https://phabricator.wikimedia.org/T220150
We figured we'd go with "delete and insert everything". Task description updated to reflect this.
While not the best from a design or flexibility perspective, I suspect the most pragmatic approach here is to just create a MW maintenance script in Wikibase (Repo?).
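A hypothetical skeleton of such a script; the class name, option and description are illustrative only:

```php
$maintenancePath = getenv( 'MW_INSTALL_PATH' ) !== false
	? getenv( 'MW_INSTALL_PATH' ) . '/maintenance/Maintenance.php'
	: __DIR__ . '/../../../maintenance/Maintenance.php';
require_once $maintenancePath;

class RebuildPropertyTerms extends Maintenance {
	public function __construct() {
		parent::__construct();
		$this->addDescription( 'Rebuild the new term store rows for all properties' );
		$this->addOption( 'batch-size', 'Number of properties to handle per batch', false, true );
	}

	public function execute() {
		$batchSize = (int)$this->getOption( 'batch-size', 100 );
		// Iterate over all property pages in batches and rebuild their terms ...
	}
}

$maintClass = RebuildPropertyTerms::class;
require_once RUN_MAINTENANCE_IF_MAIN;
```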
@alaa_wmde are you working on this? If so, please link the stuff you have so far.
Mon, Apr 8
Sun, Apr 7
Also part of this: https://github.com/wmde/wikibase-term-store/pull/7
Setting this up turned out to be easier than expected. The library has tests running on TravisCI and can transform Mysqli- and SQLite (PDO)-based MW Database objects into Doctrine connections.
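For the PDO case, the underlying trick is roughly this: Doctrine DBAL 2.x can wrap an already-open PDO handle. $pdo here is a placeholder for the handle extracted from the MW Database object, not the library's actual API:

```php
use Doctrine\DBAL\DriverManager;

// Build a Doctrine connection around an existing PDO handle (DBAL 2.x).
$connection = DriverManager::getConnection( [ 'pdo' => $pdo ] );
```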
Sat, Apr 6
The updating optimization ticket is relevant for this cleanup. We now have two main approaches:
I came to a similar conclusion after trying to write some code without looking at this ticket first :) It might still be worth it to do the diff because it helps https://phabricator.wikimedia.org/T220150. I'll comment more there.
@alaa_wmde seems to have a different idea of how the maintenance script would work than I do.
Fri, Apr 5
After a bunch of consideration I moderately prefer "2. Create dedicated library". That approach does not clutter anything and it gives us a building block we can use in other projects, which I think is worth the small initial investment.
So this is semi-blocked on figuring out what we do for labs, since that impacts the reasons for immediate cleanup.
I was wondering how much extra complexity the post-request approach (4) would bring; in particular, which info we need to give to the job. Giving the property id is not sufficient. You could give the ids of the text records and then, in the job, check if they are still unused, and do the same for the higher-level records that point to those text records. Thing is, if you already need to find the unused records during the request, then you might as well delete them right away. Either way you have a performance penalty. So I think the simpler approach (immediate cleanup (3)) makes more sense as a starting point. Does that make sense to you?
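To illustrate what I mean, a rough sketch of such a job; the job class, table and column names are hypothetical, not the actual schema:

```php
class CleanupUnusedTermTextJob extends Job {
	public function run() {
		$dbw = wfGetDB( DB_MASTER );
		foreach ( $this->params['textIds'] as $textId ) {
			// Re-check usage at job time: only delete if still unreferenced.
			$stillUsed = $dbw->selectField(
				'text_in_lang',
				'id',
				[ 'text_id' => $textId ],
				__METHOD__
			);
			if ( $stillUsed === false ) {
				$dbw->delete( 'term_text', [ 'id' => $textId ], __METHOD__ );
			}
		}
		return true;
	}
}
```

The point being: computing 'textIds' already requires finding the unused records during the request, which is the cost we were hoping to defer.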
I renamed this back to an implementation-agnostic task, since otherwise we don't have anything to track the completion of the service implementation, nor a parent task for more detailed things such as the cleanup logic.
Huh, why is the DBAL connection part of this checkpoint and not of checkpoint 2? That is where I was expecting it.
Wikibase Client also uses the TermIndex stuff, so I'm afraid we can't just put this into Wikibase Repository. Some options: