Maintenance script to be written in T270247.
The parent task contains two queries to collect deleted items; which one the script should use is still to be decided.
Status | Subtype | Assigned | Task
--- | --- | --- | ---
Resolved | | Addshore | T263730 Wikidata term store contains rows for deleted items
Resolved | | Lucas_Werkmeister_WMDE | T270249 Run maintenance script to remove deleted items from term store on production
The maintenance script is repo/maintenance/RemoveDeletedItemsFromTermStore.php and will be available in production starting with wmf/1.36.0-wmf.28 (next week).
Wait, I just realized the maintenance script isn’t enough. We need to do two things: remove the wbt_item_terms rows that belong to deleted items, and then clean up the term rows that are no longer referenced (which is what the maintenance script does).
If we just run the maintenance script on its own, it won’t do anything – all the term IDs we would pass to it are still referenced in wbt_item_terms, after all.
Should this be another maintenance script (or added to the same maintenance script?), or should I try to run the correct DELETE SQL statements directly? Something like:
SELECT DISTINCT wbit_item_id
FROM wbt_item_terms
JOIN logging ON log_namespace = 0 AND SUBSTRING(log_title, 2) = wbit_item_id
LEFT JOIN page ON page_namespace = 0 AND page_title = log_title
WHERE log_type = 'delete'
  AND log_action = 'delete'
  AND log_timestamp > '20190901000000'
  AND page_id IS NULL;

DELETE wbt_item_terms
FROM wbt_item_terms
LEFT JOIN page ON page_namespace = 0 AND page_title = CONCAT('Q', CAST(wbit_item_id AS CHAR))
WHERE page_id IS NULL
  AND wbit_item_id IN (...); -- list from stats machine
Given that there are 9593 distinct item IDs (T263730#6490963 has 9594 distinct results, but that’s because it includes the log_title and log_timestamp – apparently, one item was deleted twice) and 1145260 total wbt_item_terms rows that need to be deleted, we definitely need to batch those deletes and wait for replication in between. We might need another maintenance script after all…
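A minimal sketch of what such batched deletes could look like, assuming direct access through the mysql client (connection details are omitted and the item ID list is a placeholder taken from the test items mentioned further down), with a plain sleep standing in for a proper replication-lag check:

```bash
#!/bin/bash
# Hypothetical sketch (not the procedure actually used): delete the wbt_item_terms
# rows of some deleted items in batches of 100 rows, pausing between batches as a
# crude stand-in for a real replication-lag check.
ITEM_IDS='581768,739279,774383,852302'   # placeholder: comma-separated deleted item IDs

while true; do
    # delete at most 100 rows, then report how many rows the DELETE actually removed
    deleted=$(mysql -N -B -e \
        "DELETE FROM wbt_item_terms WHERE wbit_item_id IN ($ITEM_IDS) LIMIT 100; SELECT ROW_COUNT();" \
        wikidatawiki)
    echo "deleted ${deleted:-0} rows"
    if [ "${deleted:-0}" -eq 0 ]; then
        break   # no rows left for these item IDs
    fi
    sleep 1     # stand-in for waiting for replication
done
```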
So with a new maintenance script, the procedure could look like this: first collect the deleted item IDs (essentially the SELECT above), then have the new script delete their wbt_item_terms rows in batches (2.A), and finally run the existing maintenance script to remove the term rows that are no longer referenced (2.B).
Or we merge the responsibilities of 2.A and 2.B into one maintenance script after all.
In any case – at some point, we need to delete those 1.1 million wbt_item_terms rows belonging to 9593 deleted item IDs; assuming a batch size of 100 and a replication wait of 1 s in between (and no time for the deletes themselves), that is roughly 11,453 batches, i.e. over three hours. This definitely needs a scheduled maintenance window, and possibly we’ll want to break the run up into multiple batches of item IDs outside the script as well (a rough sketch follows below).
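One way the run could be broken up outside the script – sketched here assuming a T270249.ids file with one numeric item ID per line (as used in the runs further down) and a fixed pause between invocations; the actual runs below used hand-picked, growing batch sizes instead:

```bash
#!/bin/bash
# Hypothetical wrapper, for illustration only: feed the deleted item IDs to the
# maintenance script in fixed-size chunks, pausing between invocations.
split --lines=1000 --numeric-suffixes T270249.ids batch-   # batch-00, batch-01, ...

for chunk in batch-*; do
    ids=$(tr '\n' ',' < "$chunk" | sed 's/,$//')   # join the IDs with commas
    mwscript extensions/Wikibase/repo/maintenance/RemoveDeletedItemsFromTermStore.php \
        wikidatawiki --itemIds "$ids"
    sleep 10   # give replication some breathing room between chunks
done
```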
I created a new task, T274890: Make RemoveDeletedItemsFromTermStore.php maintenance script remove wbt_item_terms rows, to improve the maintenance script. Moving this to Stalled in the meantime.
I’ve added two entries for this on the deployments calendar: tomorrow (Thursday) during the EU backport window, I’ll test this using the four item IDs Q581768,Q739279,Q774383,Q852302; then, on Monday after the EU backport window, I’ll run the script with the remaining item IDs (probably still split into several batches).
Mentioned in SAL (#wikimedia-operations) [2021-03-11T12:13:44Z] <Lucas_WMDE> lucaswerkmeister-wmde@mwmaint1002:~$ mwscript extensions/Wikibase/repo/maintenance/RemoveDeletedItemsFromTermStore.php wikidatawiki --itemIds 581768,739279,774383,852302 # T270249, finished in 1.124s
Maintenance script seems to work fine – removing four items brought us down from 9593 items to 9589 items. I’ll call that a successful test and do the rest on Monday.
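For reference, a count like that can be re-checked with essentially the SELECT from earlier – a sketch, run against a replica such as the one used on the stats machines (connection details omitted):

```bash
# Sketch of the re-check: count distinct deleted item IDs that still have
# wbt_item_terms rows; 0 means the cleanup is complete.
mysql -N -B wikidatawiki <<'SQL'
SELECT COUNT(DISTINCT wbit_item_id)
FROM wbt_item_terms
JOIN logging ON log_namespace = 0 AND SUBSTRING(log_title, 2) = wbit_item_id
LEFT JOIN page ON page_namespace = 0 AND page_title = log_title
WHERE log_type = 'delete'
  AND log_action = 'delete'
  AND log_timestamp > '20190901000000'
  AND page_id IS NULL;
SQL
```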
Mentioned in SAL (#wikimedia-operations) [2021-03-15T12:10:04Z] <Lucas_WMDE> lucaswerkmeister-wmde@mwmaint1002:~$ time mwscript extensions/Wikibase/repo/maintenance/RemoveDeletedItemsFromTermStore.php wikidatawiki --itemIds "$(sed -n 5,54p T270249.ids | tr '\n' ',' | sed 's/,$//')" # T270249, 50 items
Mentioned in SAL (#wikimedia-operations) [2021-03-15T12:11:21Z] <Lucas_WMDE> lucaswerkmeister-wmde@mwmaint1002:~$ time mwscript extensions/Wikibase/repo/maintenance/RemoveDeletedItemsFromTermStore.php wikidatawiki --itemIds "$(sed -n 55,554p T270249.ids | tr '\n' ',' | sed 's/,$//')" # T270249, 500 items
Mentioned in SAL (#wikimedia-operations) [2021-03-15T12:13:52Z] <Lucas_WMDE> lucaswerkmeister-wmde@mwmaint1002:~$ time mwscript extensions/Wikibase/repo/maintenance/RemoveDeletedItemsFromTermStore.php wikidatawiki --itemIds "$(sed -n 555,5554p T270249.ids | tr '\n' ',' | sed 's/,$//')" # T270249, 5000 items
Mentioned in SAL (#wikimedia-operations) [2021-03-15T12:23:31Z] <Lucas_WMDE> lucaswerkmeister-wmde@mwmaint1002:~$ time mwscript extensions/Wikibase/repo/maintenance/RemoveDeletedItemsFromTermStore.php wikidatawiki --itemIds "$(sed -n 5555,9593p T270249.ids | tr '\n' ',' | sed 's/,$//')" # T270249, remaining 4039 items
Mentioned in SAL (#wikimedia-operations) [2021-03-15T12:31:07Z] <Lucas_WMDE> maintenance scripts for T270249 completed successfully, no more terms for deleted items found on stat1007
In the end, the maintenance script was thankfully substantially faster than that :)
You can see the impact of the script on the s8 master (db1104), by the way:
Each of those DELETE spikes or plateaus corresponds to one run of the maintenance script. (They never reached the level of INSERT or UPDATE, though.)
The changes in the I/O and Semaphores graphs might also be due to this, but I’m not sure.