
Adjust the field type of globalblocks timestamp columns to fixed binary on wmf wikis
Open, Stalled, Medium, Public

Description

Apply T306747 to wmf wikis

  1. ALTERs to run: https://gerrit.wikimedia.org/r/c/mediawiki/extensions/GlobalBlocking/+/785376/1/sql/mysql/patch-globalblocks-timestamps.sql
  2. Where to run those changes: centralauth database only
  3. When to run those changes: any time
  4. If the schema change is backwards compatible: Yes
  5. If the schema change has been tested already on some of the test/beta wikis: beta cluster is running with the new schema
  6. If the data should be made available on the labs replicas and/or dumps: no change to the existing rules

This makes gb_timestamp a fixed-length column and removes the empty-string default from gb_expiry.
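For reference, the linked patch boils down to the same ALTER that gets run by hand later in this thread:

-- Make gb_timestamp a fixed-length binary column and drop the
-- empty-string default on gb_expiry.
ALTER TABLE /*_*/globalblocks
  CHANGE gb_timestamp gb_timestamp BINARY(14) NOT NULL,
  CHANGE gb_expiry gb_expiry VARBINARY(14) NOT NULL;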

Event Timeline

I guess:

Where to run those changes: everywhere

Should be:

Where to run those changes: centralauth

Am I right?

Yes, it is a special table; it exists only once per "farm".

Marostegui claimed this task.

I ran this directly on the master with replication.

This has caused a massive outage and had to be reverted:

root@db1181.eqiad.wmnet[centralauth]> ALTER TABLE   globalblocks CHANGE  gb_timestamp gb_timestamp varchar(14) NOT NULL, CHANGE  gb_expiry gb_expiry VARBINARY(14) DEFAULT '' NOT NULL;
Query OK, 110397 rows affected (1.129 sec)
Records: 110397  Duplicates: 0  Warnings: 0

Checking the optimizer output for this query:

SELECT /* MediaWiki\Extension\GlobalBlocking\GlobalBlocking::getGlobalBlockingBlock  */  gb_id,gb_address,gb_by,gb_by_wiki,gb_reason,gb_timestamp,gb_anon_only,gb_expiry,gb_range_start,gb_range_end  FROM `globalblocks` WHERE (gb_range_start  LIKE '5B85%' ESCAPE '`' ) AND (gb_range_start <= '5B85B2D2') AND (gb_range_end >= '5B85B2D2') AND (gb_expiry > '20220505012805');

The result is the same on the original schema and on the altered schema. Still investigating.
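A sketch of how the two plans can be compared side by side on a given host: EXPLAIN shows the chosen plan with estimated rows, and MariaDB's ANALYZE FORMAT=JSON actually executes the query and adds real row counts and timings (the r_* fields in the output pasted further down):

-- Chosen plan and estimated rows:
EXPLAIN
SELECT gb_id, gb_address, gb_by, gb_by_wiki, gb_reason, gb_timestamp,
       gb_anon_only, gb_expiry, gb_range_start, gb_range_end
FROM globalblocks
WHERE gb_range_start LIKE '5B85%' ESCAPE '`'
  AND gb_range_start <= '5B85B2D2'
  AND gb_range_end >= '5B85B2D2'
  AND gb_expiry > '20220505012805';

-- Same query actually executed, with per-table timings and real row counts:
ANALYZE FORMAT=JSON
SELECT gb_id, gb_address, gb_by, gb_by_wiki, gb_reason, gb_timestamp,
       gb_anon_only, gb_expiry, gb_range_start, gb_range_end
FROM globalblocks
WHERE gb_range_start LIKE '5B85%' ESCAPE '`'
  AND gb_range_start <= '5B85B2D2'
  AND gb_range_end >= '5B85B2D2'
  AND gb_expiry > '20220505012805';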

Mentioned in SAL (#wikimedia-operations) [2022-05-05T10:30:10Z] <marostegui> Alter globalblocks on db1127 T307501

Altered db1127 and pooled it with very low weight to see if there's some slowness there

db1127 is now fully repooled, and I am not seeing any locks or high latency for those queries (or any). Still checking things

As another test I have altered db1101:3317 without depooling it (to sort of simulate what happened earlier today when the change arrived through replication). The change itself took around 8 seconds (so the table was locked for 8 seconds). There was no trace of pile-ups on the host.
However, during those 8 seconds, there were 8 reports of this query taking 5 seconds to run:

SELECT /* MediaWiki\Extension\GlobalBlocking\GlobalBlocking::getGlobalBlockingBlock  */  gb_id,gb_address,gb_by,gb_by_wiki,gb_reason,gb_timestamp,gb_anon_only,gb_expiry,gb_range_start,gb_range_end  FROM `globalblocks`    WHERE (gb_range_start  LIKE 'v6-2%' ESCAPE '`' ) AND (gb_range_start <= 'v6-2A02C207204105070000000000000099') AND (gb_range_end >= 'v6-2A02C207204105070000000000000099') AND (gb_expiry > '20220505113815')

That explain is:

root@db1101.eqiad.wmnet[centralauth]> explain SELECT /* MediaWiki\Extension\GlobalBlocking\GlobalBlocking::getGlobalBlockingBlock  */  gb_id,gb_address,gb_by,gb_by_wiki,gb_reason,gb_timestamp,gb_anon_only,gb_expiry,gb_range_start,gb_range_end  FROM `globalblocks`    WHERE (gb_range_start  LIKE 'v6-2%' ESCAPE '`' ) AND (gb_range_start <= 'v6-2A02C207204105070000000000000099') AND (gb_range_end >= 'v6-2A02C207204105070000000000000099') AND (gb_expiry > '20220505113815')  ;
+------+-------------+--------------+-------+--------------------+----------+---------+------+------+------------------------------------+
| id   | select_type | table        | type  | possible_keys      | key      | key_len | ref  | rows | Extra                              |
+------+-------------+--------------+-------+--------------------+----------+---------+------+------+------------------------------------+
|    1 | SIMPLE      | globalblocks | range | gb_range,gb_expiry | gb_range | 74      | NULL | 3552 | Using index condition; Using where |
+------+-------------+--------------+-------+--------------------+----------+---------+------+------+------------------------------------+
1 row in set (0.001 sec)

If I revert the change, the query plan is a lot better:

root@db1101.eqiad.wmnet[centralauth]> explain SELECT /* MediaWiki\Extension\GlobalBlocking\GlobalBlocking::getGlobalBlockingBlock  */  gb_id,gb_address,gb_by,gb_by_wiki,gb_reason,gb_timestamp,gb_anon_only,gb_expiry,gb_range_start,gb_range_end  FROM `globalblocks`    WHERE (gb_range_start  LIKE 'v6-2%' ESCAPE '`' ) AND (gb_range_start <= 'v6-2A02C207204105070000000000000099') AND (gb_range_end >= 'v6-2A02C207204105070000000000000099') AND (gb_expiry > '20220505113815')  ;
+------+-------------+--------------+-------+--------------------+-----------+---------+------+------+------------------------------------+
| id   | select_type | table        | type  | possible_keys      | key       | key_len | ref  | rows | Extra                              |
+------+-------------+--------------+-------+--------------------+-----------+---------+------+------+------------------------------------+
|    1 | SIMPLE      | globalblocks | range | gb_range,gb_expiry | gb_expiry | 16      | NULL | 1    | Using index condition; Using where |
+------+-------------+--------------+-------+--------------------+-----------+---------+------+------+------------------------------------+
1 row in set (0.000 sec)

So if this table is queried a lot, I can see how the slowdown would spread across all the hosts of s7.
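One way to double-check that gb_expiry really is the faster path on the altered schema would be to pin each index by hand and compare runtimes; just a sketch of that check, not something that was run above:

-- Hypothetical check: force the index the optimizer stopped choosing...
SELECT gb_id, gb_expiry, gb_range_start, gb_range_end
FROM globalblocks FORCE INDEX (gb_expiry)
WHERE gb_range_start LIKE 'v6-2%' ESCAPE '`'
  AND gb_range_start <= 'v6-2A02C207204105070000000000000099'
  AND gb_range_end >= 'v6-2A02C207204105070000000000000099'
  AND gb_expiry > '20220505113815';

-- ...and the one it now prefers, to compare timings directly.
SELECT gb_id, gb_expiry, gb_range_start, gb_range_end
FROM globalblocks FORCE INDEX (gb_range)
WHERE gb_range_start LIKE 'v6-2%' ESCAPE '`'
  AND gb_range_start <= 'v6-2A02C207204105070000000000000099'
  AND gb_range_end >= 'v6-2A02C207204105070000000000000099'
  AND gb_expiry > '20220505113815';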

Reverted the schema change on db1127 too (as the query plan changes there as well).

This looks like a classic optimizer bug, where sometimes it picks the right index and sometimes it doesn't:

Altering the host:

root@db1101.eqiad.wmnet[centralauth]> ALTER TABLE  /*_*/globalblocks
    -> CHANGE  gb_timestamp gb_timestamp BINARY(14) NOT NULL,
    -> CHANGE  gb_expiry gb_expiry VARBINARY(14) NOT NULL;
Query OK, 110271 rows affected (13.890 sec)
Records: 110271  Duplicates: 0  Warnings: 0

root@db1101.eqiad.wmnet[centralauth]> SELECT /* MediaWiki\Extension\GlobalBlocking\GlobalBlocking::getGlobalBlockingBlock  */  gb_id,gb_address,gb_by,gb_by_wiki,gb_reason,gb_timestamp,gb_anon_only,gb_expiry,gb_range_start,gb_range_end  FROM `globalblocks`    WHERE (gb_range_start  LIKE 'v6-2%' ESCAPE '`' ) AND (gb_range_start <= 'v6-2A02C207204105070000000000000099') AND (gb_range_end >= 'v6-2A02C207204105070000000000000099') AND (gb_expiry > '20220505113815')  ;

<snip>
1 row in set (6.787 sec)

And this is the optimizer ANALYZE output, where we can see it picking gb_range:

| {
  "query_block": {
    "select_id": 1,
    "r_loops": 1,
    "r_total_time_ms": 1.067,
    "table": {
      "table_name": "globalblocks",
      "access_type": "range",
      "possible_keys": ["gb_range", "gb_expiry"],
      "key": "gb_range",
      "key_length": "74",
      "used_key_parts": ["gb_range_start", "gb_range_end"],
      "r_loops": 1,
      "rows": 3552,
      "r_rows": 1,
      "r_total_time_ms": 1.0456,
      "filtered": 50,
      "r_filtered": 100,
      "index_condition": "globalblocks.gb_range_start like 'v6-2%' escape '`' and globalblocks.gb_range_start <= 'v6-2A02C207204105070000000000000099' and globalblocks.gb_range_end >= 'v6-2A02C207204105070000000000000099'",
      "attached_condition": "globalblocks.gb_expiry > '20220505113815'"
    }
  }
} |

And now let's revert, and the query is back to being super fast:

root@db1101.eqiad.wmnet[centralauth]> ALTER TABLE   globalblocks CHANGE  gb_timestamp gb_timestamp varchar(14) NOT NULL, CHANGE  gb_expiry gb_expiry VARBINARY(14) DEFAULT '' NOT NULL;
Query OK, 110271 rows affected (8.233 sec)
Records: 110271  Duplicates: 0  Warnings: 0

root@db1101.eqiad.wmnet[centralauth]> SELECT /* MediaWiki\Extension\GlobalBlocking\GlobalBlocking::getGlobalBlockingBlock  */  gb_id,gb_address,gb_by,gb_by_wiki,gb_reason,gb_timestamp,gb_anon_only,gb_expiry,gb_range_start,gb_range_end  FROM `globalblocks`    WHERE (gb_range_start  LIKE 'v6-2%' ESCAPE '`' ) AND (gb_range_start <= 'v6-2A02C207204105070000000000000099') AND (gb_range_end >= 'v6-2A02C207204105070000000000000099') AND (gb_expiry > '20220505113815')  ;
<snip>

However, the optimizer keeps saying it would pick gb_range (and it does), but the query is now super fast:

  "query_block": {
    "select_id": 1,
    "r_loops": 1,
    "r_total_time_ms": 0.8109,
    "table": {
      "table_name": "globalblocks",
      "access_type": "range",
      "possible_keys": ["gb_range", "gb_expiry"],
      "key": "gb_range",
      "key_length": "74",
      "used_key_parts": ["gb_range_start", "gb_range_end"],
      "r_loops": 1,
      "rows": 3552,
      "r_rows": 1,
      "r_total_time_ms": 0.7956,
      "filtered": 50,
      "r_filtered": 100,
      "index_condition": "globalblocks.gb_range_start like 'v6-2%' escape '`' and globalblocks.gb_range_start <= 'v6-2A02C207204105070000000000000099' and globalblocks.gb_range_end >= 'v6-2A02C207204105070000000000000099'",
      "attached_condition": "globalblocks.gb_expiry > '20220505113815'"
    }
  }
} |
"analyzing_range_alternatives": {
  "range_scan_alternatives": [
    {
      "index": "gb_range",
      "ranges": [
        "(v6-2,v6-2A02C207204105070000000000000099) <= (gb_range_start,gb_range_end) <= (v6-2A02C207204105070000000000000099)"
      ],
      "rowid_ordered": false,
      "using_mrr": false,
      "index_only": false,
      "rows": 3552,
      "cost": 4466.6,
      "chosen": true
    },
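The "range_scan_alternatives" section above comes from the optimizer trace rather than from EXPLAIN/ANALYZE; a sketch of how to capture it, assuming a MariaDB version with the optimizer trace available:

-- Enable the trace for this session, run the query, then read the trace back.
SET optimizer_trace='enabled=on';
SELECT gb_id, gb_expiry, gb_range_start, gb_range_end
FROM globalblocks
WHERE gb_range_start LIKE 'v6-2%' ESCAPE '`'
  AND gb_range_start <= 'v6-2A02C207204105070000000000000099'
  AND gb_range_end >= 'v6-2A02C207204105070000000000000099'
  AND gb_expiry > '20220505113815';
SELECT trace FROM information_schema.optimizer_trace\G
SET optimizer_trace='enabled=off';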
Marostegui raised the priority of this task from Low to Medium. May 6 2022, 1:09 PM

I don't think we should attempt this schema change again. There's not much benefit and it is very risky. We can see if the optimizer behaves better on 10.6 once we migrate to it (T301879). Once I have a testing host in s7, I can give it a go, and if it still behaves weirdly, we should probably decline this task.

Just pasting this here for the record, with one other test I have done on db2108 (handler counters for the query before and after the ALTER TABLE):
Before:

Variable_name	Value
Handler_commit	1
Handler_delete	0
Handler_discover	0
Handler_external_lock	0
Handler_icp_attempts	2225
Handler_icp_match	1
Handler_mrr_init	0
Handler_mrr_key_refills	0
Handler_mrr_rowid_refills	0
Handler_prepare	0
Handler_read_first	0
Handler_read_key	1
Handler_read_last	0
Handler_read_next	1
Handler_read_prev	0
Handler_read_retry	0
Handler_read_rnd	0
Handler_read_rnd_deleted	0
Handler_read_rnd_next	0
Handler_rollback	0
Handler_savepoint	0
Handler_savepoint_rollback	0
Handler_tmp_delete	0
Handler_tmp_update	0
Handler_tmp_write	0
Handler_update	0
Handler_write	0

After the schema change:

Variable_name	Value
Handler_commit	1
Handler_delete	0
Handler_discover	0
Handler_external_lock	0
Handler_icp_attempts	99552
Handler_icp_match	99552
Handler_mrr_init	0
Handler_mrr_key_refills	0
Handler_mrr_rowid_refills	0
Handler_prepare	0
Handler_read_first	0
Handler_read_key	23
Handler_read_last	0
Handler_read_next	99552
Handler_read_prev	0
Handler_read_retry	0
Handler_read_rnd	0
Handler_read_rnd_deleted	0
Handler_read_rnd_next	0
Handler_rollback	0
Handler_savepoint	0
Handler_savepoint_rollback	0
Handler_tmp_delete	0
Handler_tmp_update	0
Handler_tmp_write	0
Handler_update	0
Handler_write	0

I have left db2108 with the original table definition, that is, before this schema change.

And now we run it again:

Variable_name	Value
Handler_commit	1
Handler_delete	0
Handler_discover	0
Handler_external_lock	0
Handler_icp_attempts	2225
Handler_icp_match	1
Handler_mrr_init	0
Handler_mrr_key_refills	0
Handler_mrr_rowid_refills	0
Handler_prepare	0
Handler_read_first	0
Handler_read_key	23
Handler_read_last	0
Handler_read_next	1
Handler_read_prev	0
Handler_read_retry	0
Handler_read_rnd	0
Handler_read_rnd_deleted	0
Handler_read_rnd_next	0
Handler_rollback	0
Handler_savepoint	0
Handler_savepoint_rollback	0
Handler_tmp_delete	0
Handler_tmp_update	0
Handler_tmp_write	0
Handler_update	0
Handler_write	0
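The dumps above are per-session handler counters, presumably captured for the same getGlobalBlockingBlock query with the FLUSH STATUS / SHOW STATUS sequence that is spelled out later on db1127; as a sketch:

-- Reset the session counters, run the query (discarding its rows via the client
-- pager), then read the counters back.
FLUSH STATUS;
pager cat > /dev/null
SELECT gb_id, gb_address, gb_by, gb_by_wiki, gb_reason, gb_timestamp,
       gb_anon_only, gb_expiry, gb_range_start, gb_range_end
FROM globalblocks
WHERE gb_range_start LIKE 'v6-2%' ESCAPE '`'
  AND gb_range_start <= 'v6-2A02C207204105070000000000000099'
  AND gb_range_end >= 'v6-2A02C207204105070000000000000099'
  AND gb_expiry > '20220505113815';
nopager
SHOW STATUS LIKE 'Hand%';

The telling counter is Handler_read_next: it jumps from 1 on the fast runs to 99552 on the slow one, i.e. roughly 100k index entries get read one by one instead of a single row.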

Mentioned in SAL (#wikimedia-operations) [2022-05-12T05:34:45Z] <marostegui@cumin1001> dbctl commit (dc=all): 'Depool db2122 T307501', diff saved to https://phabricator.wikimedia.org/P27792 and previous config saved to /var/cache/conftool/dbconfig/20220512-053444-marostegui.json

I have migrated db2122 to 10.6, and it looks like the query, after the ALTER TABLE, runs fast and the optimizer behaves correctly:

mysql:root@localhost [centralauth]> ALTER TABLE  /*_*/globalblocks
    -> CHANGE  gb_timestamp gb_timestamp BINARY(14) NOT NULL,
    -> CHANGE  gb_expiry gb_expiry VARBINARY(14) NOT NULL;
Query OK, 97872 rows affected (7.521 sec)
Records: 97872  Duplicates: 0  Warnings: 0

mysql:root@localhost [centralauth]> SELECT /* MediaWiki\Extension\GlobalBlocking\GlobalBlocking::getGlobalBlockingBlock  */  gb_id,gb_address,gb_by,gb_by_wiki,gb_reason,gb_timestamp,gb_anon_only,gb_expiry,gb_range_start,gb_range_end  FROM `globalblocks`    WHERE (gb_range_start  LIKE 'v6-2%' ESCAPE '`' ) AND (gb_range_start <= 'v6-2A02C207204105070000000000000099') AND (gb_range_end >= 'v6-2A02C207204105070000000000000099') AND (gb_expiry > '20220505113815')  ;

<snip>
1 row in set (0.020 sec)



mysql:root@localhost [centralauth]>  SHOW STATUS like 'Hand%';
+----------------------------+-------+
| Variable_name              | Value |
+----------------------------+-------+
| Handler_commit             | 1     |
| Handler_delete             | 0     |
| Handler_discover           | 0     |
| Handler_external_lock      | 0     |
| Handler_icp_attempts       | 2224  |
| Handler_icp_match          | 1     |
| Handler_mrr_init           | 0     |
| Handler_mrr_key_refills    | 0     |
| Handler_mrr_rowid_refills  | 0     |
| Handler_prepare            | 0     |
| Handler_read_first         | 0     |
| Handler_read_key           | 1     |
| Handler_read_last          | 0     |
| Handler_read_next          | 1     |
| Handler_read_prev          | 0     |
| Handler_read_retry         | 0     |
| Handler_read_rnd           | 0     |
| Handler_read_rnd_deleted   | 0     |
| Handler_read_rnd_next      | 0     |
| Handler_rollback           | 0     |
| Handler_savepoint          | 0     |
| Handler_savepoint_rollback | 0     |
| Handler_tmp_delete         | 0     |
| Handler_tmp_update         | 0     |
| Handler_tmp_write          | 0     |
| Handler_update             | 0     |
| Handler_write              | 0     |
+----------------------------+-------+
27 rows in set (0.001 sec)

I will double-check this again with an eqiad host to make sure this was not just pure luck.

Mentioned in SAL (#wikimedia-operations) [2022-05-12T05:41:38Z] <marostegui@cumin1001> dbctl commit (dc=all): 'Repool db2122 T307501', diff saved to https://phabricator.wikimedia.org/P27793 and previous config saved to /var/cache/conftool/dbconfig/20220512-054138-marostegui.json

And of course, I altered db1127 before migrating it to 10.6, and now the optimizer behaves well...

mysql:root@localhost [centralauth]> FLUSH STATUS; pager cat > /dev/null; SELECT /* MediaWiki\Extension\GlobalBlocking\GlobalBlocking::getGlobalBlockingBlock  */  gb_id,gb_address,gb_by,gb_by_wiki,gb_reason,gb_timestamp,gb_anon_only,gb_expiry,gb_range_start,gb_range_end  FROM `globalblocks`    WHERE (gb_range_start  LIKE 'v6-2%' ESCAPE '`' ) AND (gb_range_start <= 'v6-2A02C207204105070000000000000099') AND (gb_range_end >= 'v6-2A02C207204105070000000000000099') AND (gb_expiry > '20220505113815')  ; nopager; SHOW STATUS like 'Hand%';
Query OK, 0 rows affected (0.000 sec)

PAGER set to 'cat > /dev/null'
1 row in set (0.001 sec)

PAGER set to stdout
+----------------------------+-------+
| Variable_name              | Value |
+----------------------------+-------+
| Handler_commit             | 1     |
| Handler_delete             | 0     |
| Handler_discover           | 0     |
| Handler_external_lock      | 0     |
| Handler_icp_attempts       | 2224  |
| Handler_icp_match          | 1     |
| Handler_mrr_init           | 0     |
| Handler_mrr_key_refills    | 0     |
| Handler_mrr_rowid_refills  | 0     |
| Handler_prepare            | 0     |
| Handler_read_first         | 0     |
| Handler_read_key           | 1     |
| Handler_read_last          | 0     |
| Handler_read_next          | 1     |
| Handler_read_prev          | 0     |
| Handler_read_retry         | 0     |
| Handler_read_rnd           | 0     |
| Handler_read_rnd_deleted   | 0     |
| Handler_read_rnd_next      | 0     |
| Handler_rollback           | 0     |
| Handler_savepoint          | 0     |
| Handler_savepoint_rollback | 0     |
| Handler_tmp_delete         | 0     |
| Handler_tmp_update         | 0     |
| Handler_tmp_write          | 0     |
| Handler_update             | 0     |
| Handler_write              | 0     |
+----------------------------+-------+
27 rows in set (0.000 sec)

The migration to 10.6, at least, keeps the optimizer working fine.

Reverted the alter table on db1127 and db2140 to leave those hosts consistent with the rest.

Marostegui changed the task status from Open to Stalled. May 27 2022, 12:40 PM

Going to stall this until we are on 10.6.