
Adjust the field type of globalblocks timestamp columns to fixed binary on wmf wikis
Open, Stalled, Medium, Public

Description

Apply T306747 to wmf wikis

  1. ALTERs to run: https://gerrit.wikimedia.org/r/c/mediawiki/extensions/GlobalBlocking/+/785376/1/sql/mysql/patch-globalblocks-timestamps.sql
  2. Where to run those changes: centralauth database only
  3. When to run those changes: any time
  4. If the schema change is backwards compatible: Yes
  5. If the schema change has been tested already on some of the test/beta wikis: beta cluster is running with the new schema
  6. If the data should be made available on the labs replicas and/or dumps: no change to the existing rules

This makes gb_timestamp a fixed-length column and removes the empty-string default from gb_expiry.
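For reference, the linked patch boils down to the same ALTER that gets run by hand later in this thread:

-- Make gb_timestamp a fixed-length binary column and drop the
-- empty-string default on gb_expiry.
ALTER TABLE /*_*/globalblocks
  CHANGE gb_timestamp gb_timestamp BINARY(14) NOT NULL,
  CHANGE gb_expiry gb_expiry VARBINARY(14) NOT NULL;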

Event Timeline

I guess:

Where to run those changes: everywhere

Should be:

Where to run those changes: centralauth

Am I right?

Yes, it is a special table; it exists only once per "farm".

Marostegui claimed this task.

I ran this directly on the master with replication.

This has caused a massive outage and had to be reverted:

root@db1181.eqiad.wmnet[centralauth]> ALTER TABLE   globalblocks CHANGE  gb_timestamp gb_timestamp varchar(14) NOT NULL, CHANGE  gb_expiry gb_expiry VARBINARY(14) DEFAULT '' NOT NULL;
Query OK, 110397 rows affected (1.129 sec)
Records: 110397  Duplicates: 0  Warnings: 0

Checking the optimizer output for this query:

SELECT /* MediaWiki\Extension\GlobalBlocking\GlobalBlocking::getGlobalBlockingBlock  */  gb_id,gb_address,gb_by,gb_by_wiki,gb_reason,gb_timestamp,gb_anon_only,gb_expiry,gb_range_start,gb_range_end  FROM `globalblocks` WHERE (gb_range_start  LIKE '5B85%' ESCAPE '`' ) AND (gb_range_start <= '5B85B2D2') AND (gb_range_end >= '5B85B2D2') AND (gb_expiry > '20220505012805');

The result is the same on the original schema and on the altered schema. Still investigating.
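A sketch of how the two plans can be compared side by side on a given host: EXPLAIN shows the chosen plan with estimated rows, and MariaDB's ANALYZE FORMAT=JSON actually executes the query and adds real row counts and timings (the r_* fields in the output pasted further down):

-- Chosen plan and estimated rows:
EXPLAIN
SELECT gb_id, gb_address, gb_by, gb_by_wiki, gb_reason, gb_timestamp,
       gb_anon_only, gb_expiry, gb_range_start, gb_range_end
FROM globalblocks
WHERE gb_range_start LIKE '5B85%' ESCAPE '`'
  AND gb_range_start <= '5B85B2D2'
  AND gb_range_end >= '5B85B2D2'
  AND gb_expiry > '20220505012805';

-- Same query actually executed, with per-table timings and real row counts:
ANALYZE FORMAT=JSON
SELECT gb_id, gb_address, gb_by, gb_by_wiki, gb_reason, gb_timestamp,
       gb_anon_only, gb_expiry, gb_range_start, gb_range_end
FROM globalblocks
WHERE gb_range_start LIKE '5B85%' ESCAPE '`'
  AND gb_range_start <= '5B85B2D2'
  AND gb_range_end >= '5B85B2D2'
  AND gb_expiry > '20220505012805';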

Mentioned in SAL (#wikimedia-operations) [2022-05-05T10:30:10Z] <marostegui> Alter globalblocks on db1127 T307501

Altered db1127 and pooled it with very low weight to see if there's some slowness there

db1127 is now fully repooled, and I am not seeing any locks or high latency for those queries (or any). Still checking things

As another test I have altered db1101:3317 without depooling it (to sort of simulate what happened earlier today when the change arrived through replication). The change itself took around 8 seconds (so the table was locked for 8 seconds). There was no trace of pile-ups on the host.
However, during those 8 seconds, there were 8 reports of this query taking 5 seconds to run:

SELECT /* MediaWiki\Extension\GlobalBlocking\GlobalBlocking::getGlobalBlockingBlock  */  gb_id,gb_address,gb_by,gb_by_wiki,gb_reason,gb_timestamp,gb_anon_only,gb_expiry,gb_range_start,gb_range_end  FROM `globalblocks`    WHERE (gb_range_start  LIKE 'v6-2%' ESCAPE '`' ) AND (gb_range_start <= 'v6-2A02C207204105070000000000000099') AND (gb_range_end >= 'v6-2A02C207204105070000000000000099') AND (gb_expiry > '20220505113815')

That explain is:

root@db1101.eqiad.wmnet[centralauth]> explain SELECT /* MediaWiki\Extension\GlobalBlocking\GlobalBlocking::getGlobalBlockingBlock  */  gb_id,gb_address,gb_by,gb_by_wiki,gb_reason,gb_timestamp,gb_anon_only,gb_expiry,gb_range_start,gb_range_end  FROM `globalblocks`    WHERE (gb_range_start  LIKE 'v6-2%' ESCAPE '`' ) AND (gb_range_start <= 'v6-2A02C207204105070000000000000099') AND (gb_range_end >= 'v6-2A02C207204105070000000000000099') AND (gb_expiry > '20220505113815')  ;
+------+-------------+--------------+-------+--------------------+----------+---------+------+------+------------------------------------+
| id   | select_type | table        | type  | possible_keys      | key      | key_len | ref  | rows | Extra                              |
+------+-------------+--------------+-------+--------------------+----------+---------+------+------+------------------------------------+
|    1 | SIMPLE      | globalblocks | range | gb_range,gb_expiry | gb_range | 74      | NULL | 3552 | Using index condition; Using where |
+------+-------------+--------------+-------+--------------------+----------+---------+------+------+------------------------------------+
1 row in set (0.001 sec)

If I revert the change, the query plan is a lot better:

root@db1101.eqiad.wmnet[centralauth]> explain SELECT /* MediaWiki\Extension\GlobalBlocking\GlobalBlocking::getGlobalBlockingBlock  */  gb_id,gb_address,gb_by,gb_by_wiki,gb_reason,gb_timestamp,gb_anon_only,gb_expiry,gb_range_start,gb_range_end  FROM `globalblocks`    WHERE (gb_range_start  LIKE 'v6-2%' ESCAPE '`' ) AND (gb_range_start <= 'v6-2A02C207204105070000000000000099') AND (gb_range_end >= 'v6-2A02C207204105070000000000000099') AND (gb_expiry > '20220505113815')  ;
+------+-------------+--------------+-------+--------------------+-----------+---------+------+------+------------------------------------+
| id   | select_type | table        | type  | possible_keys      | key       | key_len | ref  | rows | Extra                              |
+------+-------------+--------------+-------+--------------------+-----------+---------+------+------+------------------------------------+
|    1 | SIMPLE      | globalblocks | range | gb_range,gb_expiry | gb_expiry | 16      | NULL | 1    | Using index condition; Using where |
+------+-------------+--------------+-------+--------------------+-----------+---------+------+------+------------------------------------+
1 row in set (0.000 sec)

So if this table is queried a lot, I can see how the slowdown would spread across all the hosts of s7.
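One way to double-check that gb_expiry really is the faster path on the altered schema would be to pin each index by hand and compare runtimes; just a sketch of that check, not something that was run above:

-- Hypothetical check: force the index the optimizer stopped choosing...
SELECT gb_id, gb_expiry, gb_range_start, gb_range_end
FROM globalblocks FORCE INDEX (gb_expiry)
WHERE gb_range_start LIKE 'v6-2%' ESCAPE '`'
  AND gb_range_start <= 'v6-2A02C207204105070000000000000099'
  AND gb_range_end >= 'v6-2A02C207204105070000000000000099'
  AND gb_expiry > '20220505113815';

-- ...and the one it now prefers, to compare timings directly.
SELECT gb_id, gb_expiry, gb_range_start, gb_range_end
FROM globalblocks FORCE INDEX (gb_range)
WHERE gb_range_start LIKE 'v6-2%' ESCAPE '`'
  AND gb_range_start <= 'v6-2A02C207204105070000000000000099'
  AND gb_range_end >= 'v6-2A02C207204105070000000000000099'
  AND gb_expiry > '20220505113815';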

Reverted the schema change on db1127 too (as the query plan changes there as well).

This looks like a classic optimizer bug, where sometimes it picks the right index and sometimes it doesn't:

Altering the host:

root@db1101.eqiad.wmnet[centralauth]> ALTER TABLE  /*_*/globalblocks
    -> CHANGE  gb_timestamp gb_timestamp BINARY(14) NOT NULL,
    -> CHANGE  gb_expiry gb_expiry VARBINARY(14) NOT NULL;
Query OK, 110271 rows affected (13.890 sec)
Records: 110271  Duplicates: 0  Warnings: 0

root@db1101.eqiad.wmnet[centralauth]> SELECT /* MediaWiki\Extension\GlobalBlocking\GlobalBlocking::getGlobalBlockingBlock  */  gb_id,gb_address,gb_by,gb_by_wiki,gb_reason,gb_timestamp,gb_anon_only,gb_expiry,gb_range_start,gb_range_end  FROM `globalblocks`    WHERE (gb_range_start  LIKE 'v6-2%' ESCAPE '`' ) AND (gb_range_start <= 'v6-2A02C207204105070000000000000099') AND (gb_range_end >= 'v6-2A02C207204105070000000000000099') AND (gb_expiry > '20220505113815')  ;

<snip>
1 row in set (6.787 sec)

And this is the optimizer ANALYZE output, where we can see it picking gb_range:

| {
  "query_block": {
    "select_id": 1,
    "r_loops": 1,
    "r_total_time_ms": 1.067,
    "table": {
      "table_name": "globalblocks",
      "access_type": "range",
      "possible_keys": ["gb_range", "gb_expiry"],
      "key": "gb_range",
      "key_length": "74",
      "used_key_parts": ["gb_range_start", "gb_range_end"],
      "r_loops": 1,
      "rows": 3552,
      "r_rows": 1,
      "r_total_time_ms": 1.0456,
      "filtered": 50,
      "r_filtered": 100,
      "index_condition": "globalblocks.gb_range_start like 'v6-2%' escape '`' and globalblocks.gb_range_start <= 'v6-2A02C207204105070000000000000099' and globalblocks.gb_range_end >= 'v6-2A02C207204105070000000000000099'",
      "attached_condition": "globalblocks.gb_expiry > '20220505113815'"
    }
  }
} |

And now let's revert, and the query is back to being super fast:

root@db1101.eqiad.wmnet[centralauth]> ALTER TABLE   globalblocks CHANGE  gb_timestamp gb_timestamp varchar(14) NOT NULL, CHANGE  gb_expiry gb_expiry VARBINARY(14) DEFAULT '' NOT NULL;
Query OK, 110271 rows affected (8.233 sec)
Records: 110271  Duplicates: 0  Warnings: 0

root@db1101.eqiad.wmnet[centralauth]> SELECT /* MediaWiki\Extension\GlobalBlocking\GlobalBlocking::getGlobalBlockingBlock  */  gb_id,gb_address,gb_by,gb_by_wiki,gb_reason,gb_timestamp,gb_anon_only,gb_expiry,gb_range_start,gb_range_end  FROM `globalblocks`    WHERE (gb_range_start  LIKE 'v6-2%' ESCAPE '`' ) AND (gb_range_start <= 'v6-2A02C207204105070000000000000099') AND (gb_range_end >= 'v6-2A02C207204105070000000000000099') AND (gb_expiry > '20220505113815')  ;
<snip>

However, the optimizer keeps saying it would pick gb_range (and it does), but the query is now super fast:

  "query_block": {
    "select_id": 1,
    "r_loops": 1,
    "r_total_time_ms": 0.8109,
    "table": {
      "table_name": "globalblocks",
      "access_type": "range",
      "possible_keys": ["gb_range", "gb_expiry"],
      "key": "gb_range",
      "key_length": "74",
      "used_key_parts": ["gb_range_start", "gb_range_end"],
      "r_loops": 1,
      "rows": 3552,
      "r_rows": 1,
      "r_total_time_ms": 0.7956,
      "filtered": 50,
      "r_filtered": 100,
      "index_condition": "globalblocks.gb_range_start like 'v6-2%' escape '`' and globalblocks.gb_range_start <= 'v6-2A02C207204105070000000000000099' and globalblocks.gb_range_end >= 'v6-2A02C207204105070000000000000099'",
      "attached_condition": "globalblocks.gb_expiry > '20220505113815'"
    }
  }
} |
"analyzing_range_alternatives": {
  "range_scan_alternatives": [
    {
      "index": "gb_range",
      "ranges": [
        "(v6-2,v6-2A02C207204105070000000000000099) <= (gb_range_start,gb_range_end) <= (v6-2A02C207204105070000000000000099)"
      ],
      "rowid_ordered": false,
      "using_mrr": false,
      "index_only": false,
      "rows": 3552,
      "cost": 4466.6,
      "chosen": true
    },
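The "range_scan_alternatives" section above comes from the optimizer trace rather than from EXPLAIN/ANALYZE; a sketch of how to capture it, assuming a MariaDB version with the optimizer trace available:

-- Enable the trace for this session, run the query, then read the trace back.
SET optimizer_trace='enabled=on';
SELECT gb_id, gb_expiry, gb_range_start, gb_range_end
FROM globalblocks
WHERE gb_range_start LIKE 'v6-2%' ESCAPE '`'
  AND gb_range_start <= 'v6-2A02C207204105070000000000000099'
  AND gb_range_end >= 'v6-2A02C207204105070000000000000099'
  AND gb_expiry > '20220505113815';
SELECT trace FROM information_schema.optimizer_trace\G
SET optimizer_trace='enabled=off';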
Marostegui raised the priority of this task from Low to Medium. May 6 2022, 1:09 PM

I don't think we should attempt this schema change again. There's not much benefit and it is very risky. We can see if the optimizer behaves better on 10.6 once we migrate to it (T301879). Once I have a testing host in s7, I can give it a go, and if it still behaves weirdly, we should probably decline this task.

Just pasting this here for the record, with one other test I have done on db2108 (handler counters for the query before and after the ALTER TABLE):
Before:

Variable_name	Value
Handler_commit	1
Handler_delete	0
Handler_discover	0
Handler_external_lock	0
Handler_icp_attempts	2225
Handler_icp_match	1
Handler_mrr_init	0
Handler_mrr_key_refills	0
Handler_mrr_rowid_refills	0
Handler_prepare	0
Handler_read_first	0
Handler_read_key	1
Handler_read_last	0
Handler_read_next	1
Handler_read_prev	0
Handler_read_retry	0
Handler_read_rnd	0
Handler_read_rnd_deleted	0
Handler_read_rnd_next	0
Handler_rollback	0
Handler_savepoint	0
Handler_savepoint_rollback	0
Handler_tmp_delete	0
Handler_tmp_update	0
Handler_tmp_write	0
Handler_update	0
Handler_write	0

After the schema change:

Variable_name	Value
Handler_commit	1
Handler_delete	0
Handler_discover	0
Handler_external_lock	0
Handler_icp_attempts	99552
Handler_icp_match	99552
Handler_mrr_init	0
Handler_mrr_key_refills	0
Handler_mrr_rowid_refills	0
Handler_prepare	0
Handler_read_first	0
Handler_read_key	23
Handler_read_last	0
Handler_read_next	99552
Handler_read_prev	0
Handler_read_retry	0
Handler_read_rnd	0
Handler_read_rnd_deleted	0
Handler_read_rnd_next	0
Handler_rollback	0
Handler_savepoint	0
Handler_savepoint_rollback	0
Handler_tmp_delete	0
Handler_tmp_update	0
Handler_tmp_write	0
Handler_update	0
Handler_write	0

I have left db2108 with the original table definition, that is, before this schema change.

And now we run it again:

Variable_name	Value
Handler_commit	1
Handler_delete	0
Handler_discover	0
Handler_external_lock	0
Handler_icp_attempts	2225
Handler_icp_match	1
Handler_mrr_init	0
Handler_mrr_key_refills	0
Handler_mrr_rowid_refills	0
Handler_prepare	0
Handler_read_first	0
Handler_read_key	23
Handler_read_last	0
Handler_read_next	1
Handler_read_prev	0
Handler_read_retry	0
Handler_read_rnd	0
Handler_read_rnd_deleted	0
Handler_read_rnd_next	0
Handler_rollback	0
Handler_savepoint	0
Handler_savepoint_rollback	0
Handler_tmp_delete	0
Handler_tmp_update	0
Handler_tmp_write	0
Handler_update	0
Handler_write	0
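The dumps above are per-session handler counters, presumably captured for the same getGlobalBlockingBlock query with the FLUSH STATUS / SHOW STATUS sequence that is spelled out later on db1127; as a sketch:

-- Reset the session counters, run the query (discarding its rows via the client
-- pager), then read the counters back.
FLUSH STATUS;
pager cat > /dev/null
SELECT gb_id, gb_address, gb_by, gb_by_wiki, gb_reason, gb_timestamp,
       gb_anon_only, gb_expiry, gb_range_start, gb_range_end
FROM globalblocks
WHERE gb_range_start LIKE 'v6-2%' ESCAPE '`'
  AND gb_range_start <= 'v6-2A02C207204105070000000000000099'
  AND gb_range_end >= 'v6-2A02C207204105070000000000000099'
  AND gb_expiry > '20220505113815';
nopager
SHOW STATUS LIKE 'Hand%';

The telling counter is Handler_read_next: it jumps from 1 on the fast runs to 99552 on the slow one, i.e. roughly 100k index entries get read one by one instead of a single row.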

Mentioned in SAL (#wikimedia-operations) [2022-05-12T05:34:45Z] <marostegui@cumin1001> dbctl commit (dc=all): 'Depool db2122 T307501', diff saved to https://phabricator.wikimedia.org/P27792 and previous config saved to /var/cache/conftool/dbconfig/20220512-053444-marostegui.json

I have migrated db2122 to 10.6, and it looks like the query, after the ALTER TABLE, runs fast and the optimizer behaves correctly:

mysql:root@localhost [centralauth]> ALTER TABLE  /*_*/globalblocks
    -> CHANGE  gb_timestamp gb_timestamp BINARY(14) NOT NULL,
    -> CHANGE  gb_expiry gb_expiry VARBINARY(14) NOT NULL;
Query OK, 97872 rows affected (7.521 sec)
Records: 97872  Duplicates: 0  Warnings: 0

mysql:root@localhost [centralauth]> SELECT /* MediaWiki\Extension\GlobalBlocking\GlobalBlocking::getGlobalBlockingBlock  */  gb_id,gb_address,gb_by,gb_by_wiki,gb_reason,gb_timestamp,gb_anon_only,gb_expiry,gb_range_start,gb_range_end  FROM `globalblocks`    WHERE (gb_range_start  LIKE 'v6-2%' ESCAPE '`' ) AND (gb_range_start <= 'v6-2A02C207204105070000000000000099') AND (gb_range_end >= 'v6-2A02C207204105070000000000000099') AND (gb_expiry > '20220505113815')  ;

<snip>
1 row in set (0.020 sec)



mysql:root@localhost [centralauth]>  SHOW STATUS like 'Hand%';
+----------------------------+-------+
| Variable_name              | Value |
+----------------------------+-------+
| Handler_commit             | 1     |
| Handler_delete             | 0     |
| Handler_discover           | 0     |
| Handler_external_lock      | 0     |
| Handler_icp_attempts       | 2224  |
| Handler_icp_match          | 1     |
| Handler_mrr_init           | 0     |
| Handler_mrr_key_refills    | 0     |
| Handler_mrr_rowid_refills  | 0     |
| Handler_prepare            | 0     |
| Handler_read_first         | 0     |
| Handler_read_key           | 1     |
| Handler_read_last          | 0     |
| Handler_read_next          | 1     |
| Handler_read_prev          | 0     |
| Handler_read_retry         | 0     |
| Handler_read_rnd           | 0     |
| Handler_read_rnd_deleted   | 0     |
| Handler_read_rnd_next      | 0     |
| Handler_rollback           | 0     |
| Handler_savepoint          | 0     |
| Handler_savepoint_rollback | 0     |
| Handler_tmp_delete         | 0     |
| Handler_tmp_update         | 0     |
| Handler_tmp_write          | 0     |
| Handler_update             | 0     |
| Handler_write              | 0     |
+----------------------------+-------+
27 rows in set (0.001 sec)

I will double-check this again with an eqiad host to make sure this was not just pure luck.

Mentioned in SAL (#wikimedia-operations) [2022-05-12T05:41:38Z] <marostegui@cumin1001> dbctl commit (dc=all): 'Repool db2122 T307501', diff saved to https://phabricator.wikimedia.org/P27793 and previous config saved to /var/cache/conftool/dbconfig/20220512-054138-marostegui.json

And of course, I altered db1127 before migrating it to 10.6, and now the optimizer behaves well...

mysql:root@localhost [centralauth]> FLUSH STATUS; pager cat > /dev/null; SELECT /* MediaWiki\Extension\GlobalBlocking\GlobalBlocking::getGlobalBlockingBlock  */  gb_id,gb_address,gb_by,gb_by_wiki,gb_reason,gb_timestamp,gb_anon_only,gb_expiry,gb_range_start,gb_range_end  FROM `globalblocks`    WHERE (gb_range_start  LIKE 'v6-2%' ESCAPE '`' ) AND (gb_range_start <= 'v6-2A02C207204105070000000000000099') AND (gb_range_end >= 'v6-2A02C207204105070000000000000099') AND (gb_expiry > '20220505113815')  ; nopager; SHOW STATUS like 'Hand%';
Query OK, 0 rows affected (0.000 sec)

PAGER set to 'cat > /dev/null'
1 row in set (0.001 sec)

PAGER set to stdout
+----------------------------+-------+
| Variable_name              | Value |
+----------------------------+-------+
| Handler_commit             | 1     |
| Handler_delete             | 0     |
| Handler_discover           | 0     |
| Handler_external_lock      | 0     |
| Handler_icp_attempts       | 2224  |
| Handler_icp_match          | 1     |
| Handler_mrr_init           | 0     |
| Handler_mrr_key_refills    | 0     |
| Handler_mrr_rowid_refills  | 0     |
| Handler_prepare            | 0     |
| Handler_read_first         | 0     |
| Handler_read_key           | 1     |
| Handler_read_last          | 0     |
| Handler_read_next          | 1     |
| Handler_read_prev          | 0     |
| Handler_read_retry         | 0     |
| Handler_read_rnd           | 0     |
| Handler_read_rnd_deleted   | 0     |
| Handler_read_rnd_next      | 0     |
| Handler_rollback           | 0     |
| Handler_savepoint          | 0     |
| Handler_savepoint_rollback | 0     |
| Handler_tmp_delete         | 0     |
| Handler_tmp_update         | 0     |
| Handler_tmp_write          | 0     |
| Handler_update             | 0     |
| Handler_write              | 0     |
+----------------------------+-------+
27 rows in set (0.000 sec)

The migration to 10.6, at least, keeps the optimizer working fine.

Reverted the alter table on db1127 and db2140 to leave those hosts consistent with the rest.

Marostegui changed the task status from Open to Stalled. May 27 2022, 12:40 PM

Going to stall this until we are on 10.6.