Run maintenance script to remove deleted items from term store on production
Closed, ResolvedPublic

Description

Maintenance script to be written in T270247.

The parent task contains two queries to collect deleted items; which one should be used for the script is still to be decided.

Event Timeline

noarave updated the task description.

The maintenance script is repo/maintenance/RemoveDeletedItemsFromTermStore.php and will be available in production starting with wmf/1.36.0-wmf.28 (next week).

Wait, I just realized the maintenance script isn’t enough. We need to do two things:

  • delete wbt_item_terms rows corresponding to deleted items
  • run maintenance script to remove now-unused terms

If we just run the maintenance script on its own, it won’t do anything – all the term IDs we would pass to it are still referenced in wbt_item_terms, after all.

Should this be another maintenance script (or added to the same maintenance script?), or should I try to run the correct DELETE SQL statements directly? Something like:

-- on stats machine
SELECT DISTINCT wbit_item_id
FROM wbt_item_terms
JOIN logging
  ON log_namespace = 0
  AND SUBSTRING(log_title, 2) = wbit_item_id
LEFT JOIN page
  ON page_namespace = 0
  AND page_title = log_title
WHERE log_type = 'delete'
AND log_action = 'delete'
AND log_timestamp > '20190901000000'
AND page_id IS NULL;
-- on master db
DELETE wbt_item_terms
FROM wbt_item_terms
LEFT JOIN page
  ON page_namespace = 0
  AND page_title = CONCAT('Q', CAST(wbit_item_id AS CHAR))
WHERE page_id IS NULL
AND wbit_item_id IN (...); -- list from stats machine

Given that there are 9593 distinct item IDs (T263730#6490963 has 9594 distinct results, but that’s because it includes the log_title and log_timestamp – apparently, one item was deleted twice) and 1,145,260 total wbt_item_terms rows that need to be deleted, we definitely need to batch those deletes and wait for replication in between. We might need another maintenance script after all…

So with a new maintenance script, the procedure could look like this:

  1. on stats machine:
    1. get distinct wbit_item_id (9593)
    2. get distinct wbit_term_in_lang_id (15758)
  2. on maintenance host:
    1. run the to-be-created maintenance script with item IDs from 1.A
      • batches item IDs
        • checks that those items really don’t exist
        • selects affected rows (wbit_id)
        • batches those rows
          • deletes those rows, waiting for replication each time
    2. run RemoveDeletedItemsFromTermStore.php with term IDs from 1.B

Or we merge the responsibilities of 2.A and 2.B into one maintenance script after all:

  1. on stats machine:
    1. get distinct wbit_item_id
  2. on maintenance host:
    1. run RemoveDeletedItemsFromTermStore.php with item IDs from 1.A
      • batches item IDs
        • checks that those items really don’t exist
        • selects affected rows (wbit_id, wbit_term_in_lang_id)
        • batches those rows
          • deletes those rows, waiting for replication each time
      • batches the term_in_lang IDs
        • cleans those terms, waiting for replication each time
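The control flow of the merged procedure above can be sketched roughly as follows. This is a Python sketch, not the real script (which is PHP using MediaWiki’s database layer); all function names (item_exists, select_rows, delete_rows, clean_terms, wait_for_replication) are hypothetical stand-ins injected as parameters:

```python
def chunks(seq, size):
    """Yield consecutive slices of at most `size` elements."""
    for i in range(0, len(seq), size):
        yield seq[i:i + size]

def clean_term_store(item_ids, item_exists, select_rows, delete_rows,
                     clean_terms, wait_for_replication, batch_size=100):
    """Sketch of the merged maintenance script: batch item IDs, delete
    their wbt_item_terms rows, then clean the now-unused terms."""
    term_in_lang_ids = set()
    for id_batch in chunks(item_ids, batch_size):
        # Check that those items really don't exist before touching them.
        really_deleted = [i for i in id_batch if not item_exists(i)]
        # rows: list of (wbit_id, wbit_term_in_lang_id) pairs.
        rows = select_rows(really_deleted)
        term_in_lang_ids.update(tid for _, tid in rows)
        for row_batch in chunks(rows, batch_size):
            delete_rows([wbit_id for wbit_id, _ in row_batch])
            wait_for_replication()  # don't outrun the replicas
    # Finally, clean up term_in_lang IDs that may now be unused.
    for tid_batch in chunks(sorted(term_in_lang_ids), batch_size):
        clean_terms(tid_batch)
        wait_for_replication()
```

The key point is the two nested levels of batching (item IDs, then affected rows) with a replication wait after every write batch.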

In any case – at some point, we need to delete those 1.1 million wbt_item_terms rows belonging to 9593 deleted item IDs; assuming a batch size of 100 and a replication wait time of 1 s (and no time for the deletes themselves), that would take over three hours. This definitely needs a scheduled maintenance window, and possibly we’ll want to break it up into multiple batches of item IDs outside the script as well.
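A quick sanity check of that estimate (batch size 100, 1 s replication wait per batch, ignoring the time for the deletes themselves):

```python
import math

rows = 1_145_260    # wbt_item_terms rows to delete
batch_size = 100
wait_per_batch = 1  # seconds of replication wait after each batch

batches = math.ceil(rows / batch_size)
seconds = batches * wait_per_batch
print(batches)                    # 11453
print(round(seconds / 3600, 2))   # 3.18 hours
```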

I’ve added two entries for this on the deployments calendar: tomorrow (Thursday) during the EU backport window, I’ll test this using the four item IDs Q581768,Q739279,Q774383,Q852302; then, on Monday after the EU backport window, I’ll run the script with the remaining item IDs (probably still split into several batches).

Mentioned in SAL (#wikimedia-operations) [2021-03-11T12:13:44Z] <Lucas_WMDE> lucaswerkmeister-wmde@mwmaint1002:~$ mwscript extensions/Wikibase/repo/maintenance/RemoveDeletedItemsFromTermStore.php wikidatawiki --itemIds 581768,739279,774383,852302 # T270249, finished in 1.124s

Maintenance script seems to work fine – removing four items brought us down from 9593 items to 9589 items. I’ll call that a successful test and do the rest on Monday.

Mentioned in SAL (#wikimedia-operations) [2021-03-15T12:10:04Z] <Lucas_WMDE> lucaswerkmeister-wmde@mwmaint1002:~$ time mwscript extensions/Wikibase/repo/maintenance/RemoveDeletedItemsFromTermStore.php wikidatawiki --itemIds "$(sed -n 5,54p T270249.ids | tr '\n' ',' | sed 's/,$//')" # T270249, 50 items

Mentioned in SAL (#wikimedia-operations) [2021-03-15T12:11:21Z] <Lucas_WMDE> lucaswerkmeister-wmde@mwmaint1002:~$ time mwscript extensions/Wikibase/repo/maintenance/RemoveDeletedItemsFromTermStore.php wikidatawiki --itemIds "$(sed -n 55,554p T270249.ids | tr '\n' ',' | sed 's/,$//')" # T270249, 500 items

Mentioned in SAL (#wikimedia-operations) [2021-03-15T12:13:52Z] <Lucas_WMDE> lucaswerkmeister-wmde@mwmaint1002:~$ time mwscript extensions/Wikibase/repo/maintenance/RemoveDeletedItemsFromTermStore.php wikidatawiki --itemIds "$(sed -n 555,5554p T270249.ids | tr '\n' ',' | sed 's/,$//')" # T270249, 5000 items

Mentioned in SAL (#wikimedia-operations) [2021-03-15T12:23:31Z] <Lucas_WMDE> lucaswerkmeister-wmde@mwmaint1002:~$ time mwscript extensions/Wikibase/repo/maintenance/RemoveDeletedItemsFromTermStore.php wikidatawiki --itemIds "$(sed -n 5555,9593p T270249.ids | tr '\n' ',' | sed 's/,$//')" # T270249, remaining 4039 items
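The sed line ranges in the runs above split the ID file into batches of 50, 500, 5000, and the rest; assuming the first four lines of T270249.ids are the four item IDs already handled in the test run, they add up to all 9593 IDs:

```python
# Line ranges passed to `sed -n` in the SAL entries above.
ranges = [(5, 54), (55, 554), (555, 5554), (5555, 9593)]
batch_sizes = [hi - lo + 1 for lo, hi in ranges]
print(batch_sizes)            # [50, 500, 5000, 4039]
total = 4 + sum(batch_sizes)  # 4 test items + the batched runs
print(total)                  # 9593
```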

Mentioned in SAL (#wikimedia-operations) [2021-03-15T12:31:07Z] <Lucas_WMDE> maintenance scripts for T270249 completed successfully, no more terms for deleted items found on stat1007

In the end, the maintenance script was thankfully substantially faster than the estimated three hours :)

You can see the impact of the script on the s8 master (db1104), by the way:

Each of those DELETE spikes or plateaus corresponds to one run of the maintenance script. (They never reached the level of INSERT or UPDATE, though.)

The changes in the I/O and Semaphores graphs might also be due to this, I’m not sure:


https://grafana.wikimedia.org/d/000000273/mysql?orgId=1&var-server=db1104&var-port=9104&from=1615808395517&to=1615811995517