Remove the compatibility layer of block schema in wikireplicas
Closed, Resolved · Public

Description

When doing T355034: Deploy new block_target schema, we added a compatibility layer using a view in wikireplicas to make sure tools don't break. That's good, but it shouldn't be kept forever. The main purpose of views in wikireplicas is to hide private data, and keeping this kind of added complexity around for a long time can cause info leaks, as has happened before (the data was properly hidden in the current views but was leaking private information in the compat views). To de-risk such info leaks, I suggest announcing the breaking change and dropping the compat layer.

Update: I'm running maintain-views manually, as the update-views cookbook always fails while waiting for table locks. I'm keeping track of which hosts have been updated in the list below:

  • an-redacteddb1001
  • clouddb1013
  • clouddb1014
  • clouddb1015
  • clouddb1016
  • clouddb1017
  • clouddb1018
  • clouddb1019
  • clouddb1020

Event Timeline

Restricted Application added a subscriber: Aklapper.
Aklapper renamed this task from Remove the compatability layer of block schema in wikireplicas to Remove the compatibility layer of block schema in wikireplicas. · Apr 2 2025, 9:01 AM

@Ladsgroup I can write an email to cloud-announce to inform users of the upcoming change. What is the change exactly? Can you prepare a patch and attach it to this task?

fnegri triaged this task as Medium priority. · Apr 16 2025, 4:01 PM

I'd appreciate it if you did it. My plate has been overflowing :( I can make the patch. The exact change is that the ipblocks, ipblocks_ipindex, and ipblocks_compat table views will be dropped. Users must query the block and block_target tables instead.
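For tool maintainers, the migration roughly amounts to joining block to block_target instead of reading ipblocks. The following is only a sketch: it builds a tiny in-memory SQLite database so that it runs anywhere, and the column names (bl_id, bl_target, bl_expiry, bt_id, bt_address, bt_user, bt_user_text) are assumed from the MediaWiki core schema and should be checked against the live replica views before use. The helper names and sample rows are hypothetical.

```python
import sqlite3

# Simplified stand-ins for the replica views. The column lists are
# assumptions and omit most of the real fields.
SCHEMA = """
CREATE TABLE block_target (
    bt_id INTEGER PRIMARY KEY,
    bt_address TEXT,      -- IP or range, NULL for registered users
    bt_user INTEGER,      -- user ID, NULL for IP targets
    bt_user_text TEXT     -- user name, NULL for IP targets
);
CREATE TABLE block (
    bl_id INTEGER PRIMARY KEY,
    bl_target INTEGER NOT NULL REFERENCES block_target(bt_id),
    bl_expiry TEXT
);
"""

# Old style, no longer available once the views are dropped:
#   SELECT ipb_id, ipb_address, ipb_expiry FROM ipblocks
NEW_QUERY = """
SELECT bl_id,
       COALESCE(bt_address, bt_user_text) AS target,
       bl_expiry
FROM block
JOIN block_target ON bt_id = bl_target
ORDER BY bl_id
"""


def demo_connection():
    """Create the toy schema and load two made-up blocks."""
    conn = sqlite3.connect(":memory:")
    conn.executescript(SCHEMA)
    conn.executemany(
        "INSERT INTO block_target VALUES (?, ?, ?, ?)",
        [(1, "192.0.2.7", None, None), (2, None, 42, "ExampleUser")],
    )
    conn.executemany(
        "INSERT INTO block VALUES (?, ?, ?)",
        [(10, 1, "infinity"), (11, 2, "20250701000000")],
    )
    return conn


def list_blocks(conn):
    """Return (block id, target, expiry) using the new two-table layout."""
    return conn.execute(NEW_QUERY).fetchall()


if __name__ == "__main__":
    for row in list_blocks(demo_connection()):
        print(row)
```

The COALESCE is there because the target of a block now lives in block_target, split between an address column and user columns, rather than in a single ipblocks column.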

Change #1137262 had a related patch set uploaded (by Ladsgroup; author: Amir Sarabadani):

[operations/puppet@production] maintain-views: Drop views on ipblocks*

https://gerrit.wikimedia.org/r/1137262

fnegri changed the task status from Open to In Progress. · Apr 17 2025, 1:45 PM
fnegri claimed this task.

Thanks @Ladsgroup, I'll send an email to cloud-announce and think of other venues where we should send the announcement (maybe Tech News?).

I think June 1st would be a good date to merge and apply your patch, so that users are notified about one month before the change.

Realized June 1st is a Sunday, so maybe June 2nd is a better date :)

This was also announced in Tech News: 2025-18, and I have added another reminder to Tech News: 2025-22, which goes out on May 26th.

I will merge the patch on June 3rd.

Reminder that I'm gonna merge and apply this later today, dropping the old ipblocks* views.

Change #1137262 merged by FNegri:

[operations/puppet@production] maintain-views: Drop views on ipblocks*

https://gerrit.wikimedia.org/r/1137262

Cookbook cookbooks.sre.wikireplicas.update-views run by fnegri: Started updating wiki replica views

Cookbook cookbooks.sre.wikireplicas.update-views started by fnegri executed with errors:

  • an-redacteddb1001.eqiad.wmnet (PASS)
    • Ran Puppet agent
    • Ran 'maintain-views --replace-all --auto-depool --clean --all-databases'
  • clouddb1017.eqiad.wmnet (PASS)
    • Ran Puppet agent
    • Ran 'maintain-views --replace-all --auto-depool --clean --all-databases'
  • clouddb1018.eqiad.wmnet (FAIL)
    • Ran Puppet agent
    • The maintain-views run failed, see OUTPUT of 'maintain-views ...' above for details

Cookbook cookbooks.sre.wikireplicas.update-views run by fnegri: Started updating wiki replica views

The cookbook crashed on clouddb1018 while creating the view cswiki_p.logging:

2025-06-03 10:41:28,289 INFO [cswiki_p.logging]
2025-06-03 10:42:28,290 WARNING Depooling s2 and retrying

Interestingly, I was watching it at the time and tried to check what was blocking it:

root@clouddb1018:s2[(none)]> SHOW PROCESSLIST;
+----------+---------------+----------------------+----------+-----------+---------+--------------------------------------------------------+------------------------------------------------------------------------------------------------------+----------+
| Id       | User          | Host                 | db       | Command   | Time    | State                                                  | Info                                                                                                 | Progress |
+----------+---------------+----------------------+----------+-----------+---------+--------------------------------------------------------+------------------------------------------------------------------------------------------------------+----------+
|        3 | orchestrator  | 208.80.155.103:33258 | NULL     | Sleep     |       4 |                                                        | NULL                                                                                                 |    0.000 |
|        4 | orchestrator  | 208.80.155.103:33266 | NULL     | Sleep     |       4 |                                                        | NULL                                                                                                 |    0.000 |
|        5 | orchestrator  | 208.80.155.103:33272 | NULL     | Sleep     |       4 |                                                        | NULL                                                                                                 |    0.000 |
|      522 | wmf-pt-kill   | localhost            | NULL     | Sleep     |       4 |                                                        | NULL                                                                                                 |    0.000 |
|    31658 | system user   |                      | NULL     | Slave_IO  | 9997283 | Waiting for master to send event                       | NULL                                                                                                 |    0.000 |
|    31659 | system user   |                      | NULL     | Slave_SQL |       0 | Slave has read all relay log; waiting for more updates | NULL                                                                                                 |    0.000 |
| 21964665 | u20855        | 10.64.150.4:37300    | thwiki_p | Sleep     |   11553 |                                                        | NULL                                                                                                 |    0.000 |
| 21973246 | s52741        | 10.64.150.4:38246    | cswiki_p | Sleep     |    7774 |                                                        | NULL                                                                                                 |    0.000 |
| 21982322 | s51592        | 10.64.150.4:52516    | svwiki_p | Sleep     |    3747 |                                                        | NULL                                                                                                 |    0.000 |
| 21990877 | s55753        | 10.64.150.4:37668    | zhwiki_p | Sleep     |     272 |                                                        | NULL                                                                                                 |    0.000 |
| 21991135 | s51592        | 10.64.150.4:55726    | svwiki_p | Sleep     |     152 |                                                        | NULL                                                                                                 |    0.000 |
| 21991212 | maintainviews | localhost            | NULL     | Query     |      36 | Waiting for table metadata lock                        | CREATE OR REPLACE
            DEFINER=viewmaster
            VIEW `cswiki_p`.`logging`
            A |    0.000 |
| 21991392 | root          | localhost            | NULL     | Query     |       0 | starting                                               | SHOW PROCESSLIST                                                                                     |    0.000 |
+----------+---------------+----------------------+----------+-----------+---------+--------------------------------------------------------+------------------------------------------------------------------------------------------------------+----------+
13 rows in set (0.000 sec)

I cannot see any running query that could be causing the table metadata lock.

> The cookbook crashed on clouddb1018 while creating the view cswiki_p.logging:

The same view failed to create in T395122#10853131; there is apparently something special about that view.

Cookbook cookbooks.sre.wikireplicas.update-views started by fnegri executed with errors:

  • an-redacteddb1001.eqiad.wmnet (PASS)
    • Ran Puppet agent
    • Ran 'maintain-views --replace-all --auto-depool --clean --all-databases'
  • clouddb1017.eqiad.wmnet (PASS)
    • Ran Puppet agent
    • Ran 'maintain-views --replace-all --auto-depool --clean --all-databases'
  • clouddb1018.eqiad.wmnet (FAIL)
    • Ran Puppet agent
    • The maintain-views run failed, see OUTPUT of 'maintain-views ...' above for details

This time it failed on a different table:

2025-06-03 11:07:08,328 INFO [thwiki_p.templatelinks]
2025-06-03 11:08:08,329 WARNING Depooling s2 and retrying
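Both failures show the same gap between the INFO line and the WARNING line, which suggests (but does not prove) a fixed wait of about 60 seconds before the script depools the section and retries. A quick check of the timestamps from the two log excerpts, as a sketch with hypothetical helper names:

```python
from datetime import datetime

# Timestamp layout used by the log lines quoted in this task.
LOG_FORMAT = "%Y-%m-%d %H:%M:%S,%f"


def log_time(line):
    """Parse the leading 'date time,millis' fields of a log line."""
    date, time = line.split()[:2]
    return datetime.strptime(f"{date} {time}", LOG_FORMAT)


def seconds_between(first, second):
    """Elapsed seconds between two log lines."""
    return (log_time(second) - log_time(first)).total_seconds()


runs = [
    ("2025-06-03 10:41:28,289 INFO [cswiki_p.logging]",
     "2025-06-03 10:42:28,290 WARNING Depooling s2 and retrying"),
    ("2025-06-03 11:07:08,328 INFO [thwiki_p.templatelinks]",
     "2025-06-03 11:08:08,329 WARNING Depooling s2 and retrying"),
]

if __name__ == "__main__":
    for start, retry in runs:
        print(round(seconds_between(start, retry), 3))
```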

I managed to run the script manually on clouddb1018, but I had to kill a few threads because the script got stuck on Waiting for table metadata lock. I'm not sure why, because all the threads that I killed were in Sleep status, but they were connected to the same database that was waiting for the lock, and the lock disappeared as soon as I killed them.
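The heuristic described above (look for sleeping sessions connected to the database that the stuck statement targets) can be sketched as follows. This is an illustration only: the rows are a subset of the SHOW PROCESSLIST output pasted earlier, reduced to the columns the sketch needs, and the function names are hypothetical. It lists candidates and does not prove which session actually holds the lock. One possible explanation is an idle connection with a transaction still open, since MariaDB keeps metadata locks until the transaction ends.

```python
import re

WAITING = "Waiting for table metadata lock"


def waited_database(info):
    """Extract the schema from a statement such as: VIEW `cswiki_p`.`logging`."""
    match = re.search(r"`([^`]+)`\.`[^`]+`", info or "")
    return match.group(1) if match else None


def candidate_blockers(rows):
    """Ids of sleeping sessions connected to a database some thread waits on."""
    wanted = {
        waited_database(r["Info"]) for r in rows if r["State"] == WAITING
    } - {None}
    return [
        r["Id"] for r in rows
        if r["Command"] == "Sleep" and r["db"] in wanted
    ]


# Subset of the process list shown earlier in this task.
processlist = [
    {"Id": 21964665, "db": "thwiki_p", "Command": "Sleep", "State": "", "Info": None},
    {"Id": 21973246, "db": "cswiki_p", "Command": "Sleep", "State": "", "Info": None},
    {"Id": 21982322, "db": "svwiki_p", "Command": "Sleep", "State": "", "Info": None},
    {"Id": 21990877, "db": "zhwiki_p", "Command": "Sleep", "State": "", "Info": None},
    {"Id": 21991212, "db": None, "Command": "Query", "State": WAITING,
     "Info": "CREATE OR REPLACE DEFINER=viewmaster VIEW `cswiki_p`.`logging` A"},
]

if __name__ == "__main__":
    print(candidate_blockers(processlist))
```

On the earlier process list this points at the single idle session connected to cswiki_p, which fits the observation that killing sleeping sessions on the same database released the lock.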

clouddb1019 is trickier because there are some old queries that are still there even after a kill:

| 33282161 | s52788       | 10.64.150.4:57404    | commonswiki_p | Killed    |  736365 | Sending data                                           | /*{"qrun": 980338, "user": "JayCubby"}*/ WITH revs AS (
  SELECT
    rev_page,
    rev_actor,
    re |    0.000 |
| 33298533 | s52788       | 10.64.150.4:59734    | commonswiki_p | Killed    |  732747 | Sending data                                           | /*{"qrun": 980338, "user": "JayCubby"}*/ WITH revs AS (
  SELECT
    rev_page,
    rev_actor,
    re |    0.000 |
| 33315092 | s52788       | 10.64.150.4:57640    | commonswiki_p | Killed    |  729146 | Sending data                                           | /*{"qrun": 980338, "user": "JayCubby"}*/ WITH revs AS (
  SELECT
    rev_page,
    rev_actor,
    re |    0.000 |
| 33502967 | s52788       | 10.64.150.4:35272    | commonswiki_p | Killed    |  690790 | Sending data                                           | /*{"qrun": 980512, "user": "JayCubby"}*/ WITH revs AS (
  SELECT
    rev_page,
    rev_actor,
    re |    0.000 |

I've depooled clouddb1019@s4 to prevent new queries from starting on that host. Let's see if those queries eventually terminate.

Those queries are badly stuck, with execution time exceeding 8 days. wmf-pt-kill also failed to kill them, and they are stuck in Killed status. Any ideas on how to safely terminate them? Maybe STOP SLAVE; followed by systemctl stop mariadb?
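For reference, the Time column in the process list is in seconds. Converting the values from the four Killed rows listed above puts the oldest at roughly 8.5 days at the time of that listing. A small sketch of the conversion, with a hypothetical helper name:

```python
SECONDS_PER_DAY = 86_400


def days(seconds):
    """Convert a process list Time value (seconds) to days."""
    return seconds / SECONDS_PER_DAY


# Thread id -> Time, copied from the Killed rows on clouddb1019.
killed_times = {
    33282161: 736365,
    33298533: 732747,
    33315092: 729146,
    33502967: 690790,
}

if __name__ == "__main__":
    for thread_id, secs in killed_times.items():
        print(thread_id, f"{days(secs):.2f} days")
```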

Mentioned in SAL (#wikimedia-operations) [2025-06-03T16:57:37Z] <fnegri@cumin1002> DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on clouddb1019.eqiad.wmnet with reason: Debugging stuck queryies T390767

STOP SLAVE worked fine. systemctl stop mariadb@s4 is stuck. This was logged in journald:

Jun 03 16:57:49 clouddb1019 systemd[1]: Stopping mariadb@s4.service - mariadb database server...
Jun 03 16:57:49 clouddb1019 mysqld[1889]: 2025-06-03 16:57:49 0 [Note] /opt/wmf-mariadb106/bin/mysqld (initiated by: unknown): Normal shutdown
Jun 03 16:57:49 clouddb1019 mysqld[1889]: 2025-06-03 16:57:49 35938890 [Warning] Aborted connection 35938890 to db: 'unconnected' user: 'unauthenticated' host: 'localhost' (This connection closed normally without authentication)
Jun 03 16:58:09 clouddb1019 mysqld[1889]: 2025-06-03 16:58:09 0 [Warning] /opt/wmf-mariadb106/bin/mysqld: Thread 33502967 (user : 's52788') did not exit
Jun 03 16:58:09 clouddb1019 mysqld[1889]: 2025-06-03 16:58:09 0 [Warning] /opt/wmf-mariadb106/bin/mysqld: Thread 33315092 (user : 's52788') did not exit
Jun 03 16:58:09 clouddb1019 mysqld[1889]: 2025-06-03 16:58:09 0 [Warning] /opt/wmf-mariadb106/bin/mysqld: Thread 33298533 (user : 's52788') did not exit
Jun 03 16:58:09 clouddb1019 mysqld[1889]: 2025-06-03 16:58:09 0 [Warning] /opt/wmf-mariadb106/bin/mysqld: Thread 33282161 (user : 's52788') did not exit
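The threads named in these warnings can be cross-checked against the sessions that were stuck in Killed status. A minimal sketch, assuming the journald message format shown above stays the same; the regular expression and function name are illustrative and not part of any existing tooling:

```python
import re

# Matches the "did not exit" warnings that mysqld logged during shutdown.
DID_NOT_EXIT = re.compile(r"Thread (\d+) \(user : '([^']+)'\) did not exit")


def threads_not_exited(journal_text):
    """Map thread id -> user for every 'did not exit' warning."""
    return {int(tid): user for tid, user in DID_NOT_EXIT.findall(journal_text)}


journal = """\
Jun 03 16:58:09 clouddb1019 mysqld[1889]: 2025-06-03 16:58:09 0 [Warning] /opt/wmf-mariadb106/bin/mysqld: Thread 33502967 (user : 's52788') did not exit
Jun 03 16:58:09 clouddb1019 mysqld[1889]: 2025-06-03 16:58:09 0 [Warning] /opt/wmf-mariadb106/bin/mysqld: Thread 33315092 (user : 's52788') did not exit
Jun 03 16:58:09 clouddb1019 mysqld[1889]: 2025-06-03 16:58:09 0 [Warning] /opt/wmf-mariadb106/bin/mysqld: Thread 33298533 (user : 's52788') did not exit
Jun 03 16:58:09 clouddb1019 mysqld[1889]: 2025-06-03 16:58:09 0 [Warning] /opt/wmf-mariadb106/bin/mysqld: Thread 33282161 (user : 's52788') did not exit
"""

# Ids of the sessions shown in Killed status in the earlier process list.
killed_in_processlist = {33282161, 33298533, 33315092, 33502967}

if __name__ == "__main__":
    stuck = threads_not_exited(journal)
    print(set(stuck) == killed_in_processlist)
```

The four thread ids match the four Killed sessions, so the shutdown was blocked by exactly the queries that could not be terminated earlier.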

After checking with @jcrespo and @Marostegui I did a kill -9 and restarted mariadb@s4 on clouddb1019. The server is now back in sync and repooled. I could also run maintain-views with no errors.

There are a few hosts left where I need to run maintain-views, see the checklist in the task description. I will complete those tomorrow.

fnegri updated the task description.
fnegri moved this task from In progress to Done on the cloud-services-team (FY2024/2025-Q3-Q4) board.