
Manual runs of the maintenance script to remove old drafts from CX database
Closed, Resolved · Public

Description

As part of the efforts to remove very old drafts from the CX database (T183890), manual runs of the script created in T189087 are needed.

In the first run, target only the oldest drafts, then bring the cutoff date step by step closer to one year (our target is 15 months, which is one year after the 3-month notification warning).

Before the first run, we may want to announce our intent in Tech News.

Tasks:

Event Timeline

KartikMistry removed KartikMistry as the assignee of this task.
KartikMistry claimed this task.

How many rows are planned to be deleted (approximately)? This will tell us if we have to optimize the table afterwards.

These are the statistics right now:

root@db1069[wikishared]> show table status like 'cx_%';
+-----------------+--------+---------+------------+----------+----------------+-------------+-----------------+--------------+
| Name            | Engine | Version | Row_format | Rows     | Avg_row_length | Data_length | Max_data_length | Index_length |
+-----------------+--------+---------+------------+----------+----------------+-------------+-----------------+--------------+
| cx_corpora      | InnoDB |      10 | Compact    | 11319084 |           4114 | 46571438080 |               0 |   1330610176 |
| cx_lists        | InnoDB |      10 | Compact    |     5951 |             68 |      409600 |               0 |       163840 |
| cx_suggestions  | InnoDB |      10 | Compact    |   294894 |             76 |    22593536 |               0 |     11943936 |
| cx_translations | InnoDB |      10 | Compact    |   451246 |            384 |   173719552 |               0 |     50495488 |
| cx_translators  | InnoDB |      10 | Compact    |   449199 |             47 |    21364736 |               0 |            0 |
+-----------------+--------+---------+------------+----------+----------------+-------------+-----------------+--------------+

root@db1069[wikishared]> SELECT count(*) FROM cx_corpora;
+----------+
| count(*) |
+----------+
| 10266545 |
+----------+
1 row in set (1 min 22.41 sec)

root@db1069:/srv/sqldata$ ls -lha wikishared/cx_*.ibd
-rw-rw---- 1 mysql mysql  46G May 30 08:16 wikishared/cx_corpora.ibd
-rw-rw---- 1 mysql mysql 656K May 30 02:51 wikishared/cx_lists.ibd
-rw-rw---- 1 mysql mysql  44M May 30 07:14 wikishared/cx_suggestions.ibd
-rw-rw---- 1 mysql mysql 224M May 30 08:16 wikishared/cx_translations.ibd
-rw-rw---- 1 mysql mysql  28M May 30 08:16 wikishared/cx_translators.ibd

Once we reach our target, I estimate around 500k rows will be dropped by this script.

SELECT count(*)
FROM cx_translations
LEFT JOIN cx_corpora ON translation_id = cxc_translation_id
WHERE translation_status = 'draft'
  AND translation_target_url IS NULL
  AND translation_last_updated_timestamp < '20170301000000';

Change 437176 had a related patch set uploaded (by KartikMistry; owner: KartikMistry):
[mediawiki/extensions/ContentTranslation@master] Add waitForReplication support

https://gerrit.wikimedia.org/r/437176

Change 437176 merged by jenkins-bot:
[mediawiki/extensions/ContentTranslation@master] Add waitForReplication support

https://gerrit.wikimedia.org/r/437176

As I mentioned in the email thread, I would like to see a configurable batch size for the number of rows that we are going to delete at a time.
I think it was said it would delete around 55k rows now? Having a configurable batch size plus the already-added waitForReplication check would make the script a lot safer, and it could probably be scheduled to run on a weekly basis (or something like that) without any human babysitting.
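A configurable batch size is easy to express with MediaWiki's core Maintenance class; the sketch below is illustrative only (the class name and default are assumptions, not the actual script), and it relies on the setBatchSize()/getBatchSize() helpers in core, which register the standard --batch-size option:

<?php
// Minimal sketch of a maintenance script with a configurable batch
// size; the class name and default value are illustrative.
$IP = getenv( 'MW_INSTALL_PATH' );
if ( $IP === false ) {
	$IP = __DIR__ . '/../../..';
}
require_once "$IP/maintenance/Maintenance.php";

class PurgeOldDraftsSketch extends Maintenance {
	public function __construct() {
		parent::__construct();
		$this->addDescription( 'Purge old unpublished CX drafts in batches' );
		// Registers a --batch-size option with a default of 1000 rows.
		$this->setBatchSize( 1000 );
	}

	public function execute() {
		$batchSize = $this->getBatchSize();
		// ... delete at most $batchSize rows per iteration, then wait
		// for replication before starting the next batch ...
	}
}

$maintClass = PurgeOldDraftsSketch::class;
require_once RUN_MAINTENANCE_IF_MAIN;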

For the first run, please add it to a deployment window of its own; or, if it is going to take more than an hour, just add it at the top of the week, under "Week of:", on the Deployment page. This is only informative, to avoid running schema changes and maintenance at the same time on the same tables; we centralize all DB maintenance there to avoid surprises. I know it is unlikely that this happens for x1, but it really only takes five seconds to add a comment there and let us know asynchronously.

If it is going to be run periodically in the future, we would need to puppetize it, but I guess it is too early to think about that now.

I think I can add that in a simple way. Related question: users can also delete their drafts individually from the web UI (the same as what we do inside the foreach loop); should that have batching too?

We can ask for a window for the first run; it should take at most a few minutes to run.

users can also delete their drafts individually from the web UI (the same as what we do inside the foreach loop); should that have batching too?

If they can delete more than 1000 rows at the same time, yes. 1000 is a magic number; the right value is highly variable and depends on many factors, but it is a good "we don't know better, but we believe it will take less than 1 second to execute" number.

I think I can add that in a simple way. Related question: users can also delete their drafts individually from the web UI (the same as what we do inside the foreach loop); should that have batching too?

You probably want that too, indeed; the idea is to avoid creating transactions that take long and can generate contention/lag. It is better to create smaller transactions that happen more often if needed.
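In practice that can be as simple as chunking the keys before deleting, so each transaction stays under the rule-of-thumb size; a rough sketch, where cxc_id as the primary key of cx_corpora and the $rowIds list are assumptions for illustration:

// Hypothetical sketch: delete in chunks keyed on the primary key so
// each DELETE stays under the ~1000-row rule of thumb.
foreach ( array_chunk( $rowIds, 1000 ) as $chunk ) {
	$dbw->delete( 'cx_corpora', [ 'cxc_id' => $chunk ], __METHOD__ );
	// Optionally wait for replication between chunks on large deletes.
}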

Actually, how is one supposed to do batching for deletes?

I tried this:

		$options[ 'LIMIT' ] = $batch;

		do {
			// delete() returns a success boolean, not the number of rows
			// affected, so $res stays truthy and the loop never ends.
			$res = $dbw->delete( 'cx_corpora', $conditions, __METHOD__, $options );
			// Wait for slaves
		} while ( $res );

But it goes into an infinite loop, and I am not sure whether DELETE with LIMIT is even safe with replication. Is the only option to do a SELECT first and then batch with 'cxc_translation_id' => [ array of values ]?

delete with LIMIT

DELETE with LIMIT is prohibited unless it goes with an ORDER BY. See https://www.mediawiki.org/wiki/Development_policy#Database_policy (non-deterministic queries).

Batches have to be controlled through a primary key or another unique key; if they take a long time (e.g. more than 1-5 seconds), they should be scheduled for background execution (job queue).
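Putting those answers together, a key-controlled batch delete might look roughly like the sketch below; treat it as an illustration rather than the actual patch, since it assumes cxc_id is the primary key of cx_corpora and reuses $dbw and $conditions from the earlier snippet:

$batchSize = 1000;
$lbFactory = \MediaWiki\MediaWikiServices::getInstance()->getDBLoadBalancerFactory();

do {
	// Select one batch of primary keys; ORDER BY keeps the query
	// deterministic, and LIMIT bounds the transaction size.
	$ids = $dbw->selectFieldValues(
		'cx_corpora',
		'cxc_id',
		$conditions,
		__METHOD__,
		[ 'ORDER BY' => 'cxc_id', 'LIMIT' => $batchSize ]
	);

	if ( $ids ) {
		// Delete through the unique key, never with a bare LIMIT.
		$dbw->delete( 'cx_corpora', [ 'cxc_id' => $ids ], __METHOD__ );
		// Let replicas catch up before starting the next batch.
		$lbFactory->waitForReplication();
	}
} while ( count( $ids ) === $batchSize );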

Change 439938 had a related patch set uploaded (by Nikerabbit; owner: Nikerabbit):
[mediawiki/extensions/ContentTranslation@master] Add TranslationStorageManager::deleteTranslationDataGently

https://gerrit.wikimedia.org/r/439938

Change 439938 merged by jenkins-bot:
[mediawiki/extensions/ContentTranslation@master] Add TranslationStorageManager::deleteTranslationDataGently

https://gerrit.wikimedia.org/r/439938

@jcrespo @Marostegui Are we good to go with the script after the last patch (https://gerrit.wikimedia.org/r/439938)? Let us know. We can do a dry run and schedule the script after that.

We are on an offsite till next Monday, so it is probably better to wait until we are fully back, just in case.

We are on an offsite till next Monday, so it is probably better to wait until we are fully back, just in case.

Any other blockers, @Marostegui? We can schedule the script run this week if not.

KartikMistry moved this task from Backlog to Done on the Language-2018-July-September board.
KartikMistry updated the task description. (Show Details)

We have finished the first run without any issues. 154 drafts were purged successfully. I'll schedule another run and create a new task for it.