Error Unknown column ipb_sitewide in field list on query
Closed, ResolvedPublic

Description

19:21 <+icinga-wm> PROBLEM - MariaDB Slave SQL: s3 on db2050 is CRITICAL: CRITICAL slave_sql_state Slave_SQL_Running: No,
                         Errno: 1054, Errmsg: Error Unknown column  ipb_sitewide in field list on query. Default database: ruwikiquote. [Query snipped

Got this after rolling forward 1.33.0-wmf.2 to group1

Restricted Application added a subscriber: Aklapper. Oct 31 2018, 7:41 PM
thcipriani triaged this task as Unbreak Now! priority.

Changing priority after adding as train blocker

Restricted Application added subscribers: Liuxinyu970226, TerraCodes. Oct 31 2018, 7:44 PM
dbarratt added a comment (edited). Oct 31 2018, 7:46 PM

Does this mean T204006 was not executed completely?

@Marostegui Could you verify that T204006 was executed successfully on ruwikiquote and possibly other dbs in the same group? This error might indicate that something didn't work quite right.

thcipriani lowered the priority of this task from Unbreak Now! to High.

<volans|off> ok all confirmed, the only one without that field in the core DBs is only db2050 only for ruwikiquote

Removing this as a train blocker and rolling forward. Leaving open pending further investigation.

Volans added subscribers: jcrespo, BBlack, Volans (edited). Oct 31 2018, 8:43 PM

I've quickly audited the ipblocks.frm on all core DBs in all shards (s1-s8) for all schemas, and the only one missing the column (apart from schemas that don't have it on the masters either, because they are not in all.dblist) is ruwikiquote on db2050.
To do it quickly (as I'm no longer familiar with the current tooling around DB stuff) I took the poor man's approach, running things like:

sudo cumin 'C:mariadb::heartbeat%shard = s3' "grep -c 'ipb_sitewide' /srv/sqldata*/*/ipblocks.frm"

I've then checked the dbstores with a similar approach, and again, only dbstore2002 for s3 has that field missing.
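An alternative to grepping the .frm files would be to ask the server itself via information_schema. A hypothetical per-host check (the table and column names come from this task; everything else is an assumption, and on hosts with hundreds of schemas this can be slower than grepping the data directory):

```sql
-- Hypothetical audit query: list every schema whose ipblocks table
-- is missing the ipb_sitewide column. Run once per replica.
SELECT t.TABLE_SCHEMA
FROM information_schema.TABLES t
LEFT JOIN information_schema.COLUMNS c
  ON  c.TABLE_SCHEMA = t.TABLE_SCHEMA
  AND c.TABLE_NAME   = t.TABLE_NAME
  AND c.COLUMN_NAME  = 'ipb_sitewide'
WHERE t.TABLE_NAME = 'ipblocks'
  AND c.COLUMN_NAME IS NULL;
```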

As agreed on IRC with @BBlack, who got in touch with @jcrespo, I'm not applying the ALTER TABLE on the failed hosts as they are both in the passive DC; Jaime will have a look later.

Restricted Application added a subscriber: MGChecker. · View Herald TranscriptOct 31 2018, 9:03 PM
Volans added a comment (edited). Oct 31 2018, 9:03 PM

Just for quick reference, the ALTER to add the column (confirmed also by the history on neodymium) should be:

set session sql_log_bin=0; ALTER TABLE ipblocks ADD ipb_sitewide bool NOT NULL default 1;
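For context, the `set session sql_log_bin=0` part matters: it keeps the ALTER out of the binary log, so the fix is applied only to the host it is run on and is not replicated to any downstream hosts. An annotated sketch of the same repair (comments are editorial, not from the task):

```sql
-- Disable binary logging for this session only, so the statement
-- stays local to this host and does not replicate further.
SET SESSION sql_log_bin = 0;

-- Add the column missing on this replica (same definition as the
-- original schema change in T204006).
ALTER TABLE ipblocks
  ADD ipb_sitewide bool NOT NULL DEFAULT 1;
```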
jcrespo added a subscriber: mark (edited). Nov 2 2018, 7:48 AM

I am going to take a shot in the dark and say that when schema changes are done under pressure and in a hurry, only to get features out early and without proper checks, they are prone to errors.

I wasn't involved in this change, but it was apparent to me that it wasn't done with enough time or proper planning and design (for example, the tables weren't renamed appropriately, as the MediaWiki platform team suggested), and we were incredibly lucky that something more important didn't break. I would like to see a retrospective on that. CC @mark @dbarratt

Mentioned in SAL (#wikimedia-operations) [2018-11-02T07:59:18Z] <jynus> performing alter table on db2050 T208462 T204006

Mentioned in SAL (#wikimedia-operations) [2018-11-02T08:01:55Z] <jynus> reducing consistency on db2050 to improve recovery time T208462

Mentioned in SAL (#wikimedia-operations) [2018-11-02T08:19:38Z] <jynus> performing alter table on dbstore2002 s3 and reducing consistency to improve recovery time T208462 T204006
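"Reducing consistency to improve recovery time" is not spelled out in the task; on MariaDB it typically means relaxing durability settings while the ALTER runs and replication catches up. A hedged sketch (the exact settings used on db2050 and dbstore2002 are an assumption):

```sql
-- Relax durability while the ALTER runs and replication catches up.
SET GLOBAL innodb_flush_log_at_trx_commit = 0;  -- flush redo log ~once per second
SET GLOBAL sync_binlog = 0;                     -- let the OS schedule binlog syncs

-- ... ALTER TABLE and replication catch-up happen here ...

-- Restore full durability once the host has caught up (the
-- "increase consistency" step logged in SAL on Nov 5).
SET GLOBAL innodb_flush_log_at_trx_commit = 1;
SET GLOBAL sync_binlog = 1;
```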

jcrespo moved this task from Triage to Done on the DBA board.

Work by DBAs here is done, except making the 2 altered hosts fully consistent again once they catch up.

jcrespo lowered the priority of this task from High to Normal. Nov 2 2018, 9:34 AM

Mentioned in SAL (#wikimedia-operations) [2018-11-05T13:57:30Z] <jynus_> increase consistency of db2050, dbstore2002 s3 after them catching up replication T208462

This is clearly a human error (my error) made when deploying the schema change on T204006.
As I (and Jaime) explained on that ticket and in a separate email thread, a schema change is a complex and dangerous operation, as we have to alter every single wiki in core.
Pushing and rushing a change can lead to this kind of issue, and it did.
I warned about this several times via Phabricator and email, and yet I failed to push back more strongly; I put myself under a lot of pressure to get this out the door and stop blocking the feature from being released. And this is the result.

We are lucky this only failed on that wiki and in codfw (the passive DC).
Hopefully this will be another lesson learned by everyone (including me) that pushing and rushing schema changes isn't a good idea.

Thanks to everyone who got this triaged and figured out in my absence.

Marostegui closed this task as Resolved. Thu, Nov 22, 1:33 PM