Edits not saving on beta cluster (db replication error, corrupted table)
Closed, Resolved | Public | BUG REPORT

Description

The behaviour described below is likely caused by these replication errors on deployment-db11:

Mar 02 12:13:09 deployment-db11 mysqld[14965]: 2023-03-02 12:13:09 27 [ERROR] Incorrect definition of table mysql.proc: expected column 'definer' at position 11 to have type varchar(, found type char(141).
Mar 02 12:13:12 deployment-db11 mysqld[14965]: 2023-03-02 12:13:12 27 [ERROR] Incorrect definition of table mysql.event: expected column 'definer' at position 3 to have type varchar(, found type char(141).
Mar 02 12:13:12 deployment-db11 mysqld[14965]: 2023-03-02 12:13:12 27 [ERROR] Slave SQL: Query caused different errors on master and slave.     Error on master: message (format)='Cannot load from %s.%s. The table is probably corrup>
Mar 02 12:13:12 deployment-db11 mysqld[14965]: 2023-03-02 12:13:12 27 [Warning] Slave: Cannot load from mysql.proc. The table is probably corrupted Error_code: 1728
Mar 02 12:13:12 deployment-db11 mysqld[14965]: 2023-03-02 12:13:12 27 [Warning] Slave: Failed to open mysql.event Error_code: 1545
Mar 02 12:13:12 deployment-db11 mysqld[14965]: 2023-03-02 12:13:12 27 [ERROR] Error running query, slave SQL thread aborted. Fix the problem, and restart the slave SQL thread with "SLAVE START". We stopped at log 'deployment-db09-b>
Mar 02 12:13:12 deployment-db11 mysqld[14965]: 2023-03-02 12:13:12 27 [Note] Slave SQL thread exiting, replication stopped in log 'deployment-db09-bin.000047' at position 873808641, master: deployment-db09.deployment-prep.eqiad1.wi>

Steps to replicate the issue (include links if applicable):

What happens?:
After (re)loading, the page content has not changed. The edit was not saved.

What should have happened instead?:
After (re)loading, the page content would reflect the changes. The edit would be saved.

Software version (skip for WMF-hosted wikis like Wikipedia):

Other information (browser name/version, screenshots, etc.):

Event Timeline

Huh, was also trying the above logged out (editing https://en.wikipedia.beta.wmflabs.org/w/index.php?title=Sandbox&action=edit) and got an edit conflict — the "conflicting edit" is not displayed in the page history.


And editing via the API gives

{
    "edit": {
        "result": "Success",
        "pageid": 191808,
        "title": "Sandbox",
        "contentmodel": "wikitext",
        "oldrevid": 550884,
        "newrevid": 575468,
        "newtimestamp": "2023-03-02T14:14:03Z"
    }
}

but the edit is not saved.
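
One way to check from the client side whether a reported "Success" actually persisted is to ask the API for the returned newrevid. The following is a minimal sketch, not a definitive check: it assumes the default (formatversion=1) JSON shape of action=query with prop=revisions, the api.php path is inferred from the index.php URL above, and the helper names are hypothetical. A lagging or stopped replica could also fail to return a revision that the primary holds, so a negative result only shows what the queried server can see.

```python
from urllib.parse import urlencode

# Assumed endpoint, inferred from the index.php URL quoted in this task.
API_URL = "https://en.wikipedia.beta.wmflabs.org/w/api.php"


def build_revision_query(revid: int) -> str:
    """Return a URL asking the API for the page that owns a revision ID."""
    params = {
        "action": "query",
        "format": "json",
        "prop": "revisions",
        "rvprop": "ids",
        "revids": revid,
    }
    return f"{API_URL}?{urlencode(params)}"


def revision_exists(query_response: dict, revid: int) -> bool:
    """Check a parsed action=query response for the given revision ID.

    Returns True only if some page in the response lists the revision.
    """
    pages = query_response.get("query", {}).get("pages", {})
    for page in pages.values():
        for rev in page.get("revisions", []):
            if rev.get("revid") == revid:
                return True
    return False
```

Fetching build_revision_query(575468), parsing the JSON, and passing it to revision_exists would then show whether the revision from the response above is visible to the server that answered.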

TheresNoTime triaged this task as Unbreak Now! priority.Mar 2 2023, 2:24 PM

UBN! given this interrupts testing (nb. wonder why more CI isn't complaining...)

Seeing consistent "Wikimedia\Rdbms\LoadMonitor::computeServerStates: host deployment-db12 is not replicating?" and "Wikimedia\Rdbms\LoadMonitor::computeServerStates: host deployment-db11 is not replicating?" errors: https://beta-logs.wmcloud.org/goto/a72ad7356f23e6fabfd863cce61d2530

Mar 02 12:13:09 deployment-db11 mysqld[14965]: 2023-03-02 12:13:09 27 [ERROR] Incorrect definition of table mysql.proc: expected column 'definer' at position 11 to have type varchar(, found type char(141).
Mar 02 12:13:12 deployment-db11 mysqld[14965]: 2023-03-02 12:13:12 27 [ERROR] Incorrect definition of table mysql.event: expected column 'definer' at position 3 to have type varchar(, found type char(141).
Mar 02 12:13:12 deployment-db11 mysqld[14965]: 2023-03-02 12:13:12 27 [ERROR] Slave SQL: Query caused different errors on master and slave.     Error on master: message (format)='Cannot load from %s.%s. The table is probably corrup>
Mar 02 12:13:12 deployment-db11 mysqld[14965]: 2023-03-02 12:13:12 27 [Warning] Slave: Cannot load from mysql.proc. The table is probably corrupted Error_code: 1728
Mar 02 12:13:12 deployment-db11 mysqld[14965]: 2023-03-02 12:13:12 27 [Warning] Slave: Failed to open mysql.event Error_code: 1545
Mar 02 12:13:12 deployment-db11 mysqld[14965]: 2023-03-02 12:13:12 27 [ERROR] Error running query, slave SQL thread aborted. Fix the problem, and restart the slave SQL thread with "SLAVE START". We stopped at log 'deployment-db09-b>
Mar 02 12:13:12 deployment-db11 mysqld[14965]: 2023-03-02 12:13:12 27 [Note] Slave SQL thread exiting, replication stopped in log 'deployment-db09-bin.000047' at position 873808641, master: deployment-db09.deployment-prep.eqiad1.wi>

🙃
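
For triage, the replica error codes and the binlog coordinates where replication stopped can be pulled out of journal lines like the ones above. This is a rough sketch that assumes the line format stays exactly as shown in this excerpt; the function name and regular expressions are illustrative, not part of any existing tooling.

```python
import re

# Matches e.g. "[Warning] Slave: Failed to open mysql.event Error_code: 1545"
ERROR_CODE_RE = re.compile(
    r"\[Warning\] Slave: (?P<msg>.*?) Error_code: (?P<code>\d+)"
)
# Matches e.g. "replication stopped in log 'x-bin.000047' at position 123"
STOP_RE = re.compile(
    r"replication stopped in log '(?P<log>[^']+)' at position (?P<pos>\d+)"
)


def summarize_replication_log(lines):
    """Collect replica error codes and the stop coordinates from log lines."""
    summary = {"errors": [], "stopped_at": None}
    for line in lines:
        match = ERROR_CODE_RE.search(line)
        if match:
            summary["errors"].append((int(match.group("code")), match.group("msg")))
        match = STOP_RE.search(line)
        if match:
            summary["stopped_at"] = (match.group("log"), int(match.group("pos")))
    return summary
```

Feeding it the deployment-db11 lines above would yield error codes 1728 and 1545, and the stop point deployment-db09-bin.000047 at position 873808641.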

TheresNoTime added a subscriber: Zabe.

Tagging DBA as I'm afraid my familiarity with repairing whatever happened there is zero — any assistance or suggestions would be greatly appreciated. Also adding @Zabe as iirc you recently rescued these databases after the WMCS issue a while back?

TheresNoTime renamed this task from Edits not saving on beta cluster to Edits not saving on beta cluster (db replication error, corrupted table).Mar 2 2023, 2:48 PM
TheresNoTime updated the task description.
Ladsgroup subscribed.

Sorry, DBAs don't maintain beta cluster DBs. I'm currently busy with UBN T330942; otherwise I would have helped.

Change 893791 had a related patch set uploaded (by Hashar; author: Hashar):

[integration/config@master] jjb: disable beta-update-databases-eqiad

https://gerrit.wikimedia.org/r/893791

Change 893791 merged by jenkins-bot:

[integration/config@master] jjb: disable beta-update-databases-eqiad

https://gerrit.wikimedia.org/r/893791

Change 893798 had a related patch set uploaded (by Zabe; author: Zabe):

[operations/mediawiki-config@master] beta: Promote deployment-db11 as master, decom deployment-db09

https://gerrit.wikimedia.org/r/893798

Change 893798 merged by jenkins-bot:

[operations/mediawiki-config@master] beta: Promote deployment-db11 as master, decom deployment-db09

https://gerrit.wikimedia.org/r/893798

Mentioned in SAL (#wikimedia-releng) [2023-03-02T16:13:53Z] <zabe> failover deployment-prep master from deployment-db09 to deployment-db11 # T331019

Zabe claimed this task.

Mentioned in SAL (#wikimedia-releng) [2023-03-02T16:29:58Z] <zabe> create deployment-db13 as g3.cores8.ram16.disk20 # T331019

Mentioned in SAL (#wikimedia-releng) [2023-03-02T16:35:03Z] <zabe> create volume db13 and attach to deployment-db13 # T331019

Mentioned in SAL (#wikimedia-releng) [2023-03-02T21:54:42Z] <zabe> install mariadb 10.6 via role::mariadb::beta on deployment-db12 # T331019

Mentioned in SAL (#wikimedia-releng) [2023-03-02T22:01:40Z] <zabe> enable read-only mode and create dump of all databases # T331019

Mentioned in SAL (#wikimedia-releng) [2023-03-03T00:37:21Z] <zabe> take deployment-prep out of read-only # T331019

Mentioned in SAL (#wikimedia-releng) [2023-03-03T19:52:36Z] <zabe> deployment-db13: import dump into mariadb # T331019

Mentioned in SAL (#wikimedia-releng) [2023-03-03T19:52:56Z] <zabe> deployment-db13: start replication from deployment-db11 # T331019

Change 894112 had a related patch set uploaded (by Zabe; author: Zabe):

[operations/mediawiki-config@master] beta: Add deployment-db13

https://gerrit.wikimedia.org/r/894112

Change 894112 merged by jenkins-bot:

[operations/mediawiki-config@master] beta: Add deployment-db13

https://gerrit.wikimedia.org/r/894112

Change 894114 had a related patch set uploaded (by Zabe; author: Zabe):

[operations/mediawiki-config@master] beta: Pool deployment-db13

https://gerrit.wikimedia.org/r/894114

Change 894114 merged by jenkins-bot:

[operations/mediawiki-config@master] beta: Pool deployment-db13

https://gerrit.wikimedia.org/r/894114

Mentioned in SAL (#wikimedia-releng) [2024-02-26T22:04:27Z] <James_F> Deleting deployment-db09, decommissioned 11 months ago but never deleted in T331019