Maniphest T208462

Error Unknown column ipb_sitewide in field list on query
Closed, ResolvedPublic
Actions

Assigned To

Authored By

	thcipriani
	Oct 31 2018, 7:41 PM

Description

19:21 <+icinga-wm> PROBLEM - MariaDB Slave SQL: s3 on db2050 is CRITICAL: CRITICAL slave_sql_state Slave_SQL_Running: No,
                         Errno: 1054, Errmsg: Error Unknown column  ipb_sitewide in field list on query. Default database: ruwikiquote. [Query snipped

Got this after rolling forward 1.33.0-wmf.2 to group1

Related Objects
Search...

		Status	Subtype	Assigned	Task
		Resolved		jcrespo	T208462 Error Unknown column ipb_sitewide in field list on query
		Resolved		dmaza	T203821 After QA, enable Partial Blocks MVP on test.wikipedia.org and test.wikidata.org

Event Timeline

thcipriani created this task.Oct 31 2018, 7:41 PM

Restricted Application added a subscriber: Aklapper. · View Herald TranscriptOct 31 2018, 7:41 PM

Adding @dbarratt @Catrope @Mooeypoo based on a quick look at https://gerrit.wikimedia.org/r/q/ipb_sitewide

please retag as necessary.

Changing priority after adding as train blocker

Restricted Application added subscribers: Liuxinyu970226, TerraCodes. · View Herald TranscriptOct 31 2018, 7:44 PM

Does this mean T204006 was not executed completely?

Krenair subscribed.Oct 31 2018, 7:47 PM

@Marostegui Could you verify that T204006 was executed successfully on ruwikiquote and possibly other dbs in the same group? This error might indicate that something didn't work quite right.

<volans|off> ok all confirmed, the only one without that field in the core DBs is only db2050 only for ruwikiquote

Removing this as a train blocker and rolling forward. Leaving open pending further investigation.

I've quickly audited the ipblocks.frm on all cored DBs in all shards (s1-s8) for all schemas and the only one missing (apart schemas that don't have it either on the masters because not in all.dblist) is ruwikiquote on db2050.
To do it quickly (as I'm not anymore familiar with the current tooling around DB stuff) I did the poor's man approach running things like:

sudo cumin 'C:mariadb::heartbeat%shard = s3' "grep -c 'ipb_sitewide' /srv/sqldata*/*/ipblocks.frm"

I've then checked with a similar approach the dbstores, and again, only dbstore2002 for s3 has that field missing.

As agreed on IRC with @BBlack that got in touch with @jcrespo I'm not applying the alter table on the failed hosts as they are both in the passive DC, as Jaime will have a look later.

dbarratt edited projects, added Anti-Harassment; removed Anti-Harassment (AHT Sprint 32).Oct 31 2018, 9:03 PM

Restricted Application added a subscriber: MGChecker. · View Herald TranscriptOct 31 2018, 9:03 PM

dbarratt moved this task from Untriaged to Tracking work by others on the Anti-Harassment board.Oct 31 2018, 9:03 PM

Just for a quick reference the alter to create the table (confirmed also by the history on neodymium) should be:

set session sql_log_bin=0; ALTER TABLE  ipblocks   ADD ipb_sitewide bool NOT NULL default 1;

jcrespo claimed this task.Nov 2 2018, 7:41 AM

So I am going to give a shot in the dark and say that the schema changes are done under pressure and in a hurry as this was done only to have the features out early and without proper checks, they are prone to errors.

I wasn't involved on this change, but it was apparent to me that this wasn't done with enough time or proper planning and design (for example, the tables weren't renamed appropriately as it was suggested by the platform mediawiki team), and we were incredibly lucky something else more important didn't broke. I would like to see a retrospective on that CC @mark @dbarratt

Mentioned in SAL (#wikimedia-operations) [2018-11-02T07:59:18Z] <jynus> performing alter table on db2050 T208462 T204006

Stashbot mentioned this in T204006: Execute the schema change for Partial Blocks.Nov 2 2018, 7:59 AM

Mentioned in SAL (#wikimedia-operations) [2018-11-02T08:01:55Z] <jynus> reducing consitency on db2050 to improve recovery time T208462

Mentioned in SAL (#wikimedia-operations) [2018-11-02T08:19:38Z] <jynus> performing alter table on dbstore2002 s3 and reducing consistency to improve recovery time T208462 T204006

Work by dbas here is done except making the 2 altered hosts back fully consistent once they catch up.

jcrespo lowered the priority of this task from High to Medium.Nov 2 2018, 9:34 AM

• TBolliger added a parent task: T203821: After QA, enable Partial Blocks MVP on test.wikipedia.org and test.wikidata.org.Nov 2 2018, 4:47 PM

dbarratt removed a parent task: T203821: After QA, enable Partial Blocks MVP on test.wikipedia.org and test.wikidata.org.Nov 2 2018, 5:21 PM

dbarratt added a subtask: T203821: After QA, enable Partial Blocks MVP on test.wikipedia.org and test.wikidata.org.Nov 2 2018, 5:24 PM

Mentioned in SAL (#wikimedia-operations) [2018-11-05T13:57:30Z] <jynus_> increase consistency of db2050, dbstore2002 s3 after them catching up replication T208462

This is clearly an human error (my error) when deploying the schema change on T204006.
As I (and Jaime) explained on that ticket and on a separate email thread, a schema change is a complex and dangerous change as we have to alter every single wiki in core.
Pushing and rushing for a change can lead to these kind of issues, and it did.
I warned about it several times already via phab and email, and yet, I failed to push back in a stronger way and got myself into lots of pressure to get this out of the door and stop blocking the feature to be released. And this is the result.

We are lucky this only failed on that wiki and on codfw
(passive DC)
Hopefully this will be another lesson learned by everyone (including me) that pushing and rushing for schema changes isn't a good idea.

Thanks everyone who got this triagged and figured out in my absence.

• TBolliger reopened subtask T203821: After QA, enable Partial Blocks MVP on test.wikipedia.org and test.wikidata.org as Open.Nov 19 2018, 9:50 PM

• TBolliger closed subtask T203821: After QA, enable Partial Blocks MVP on test.wikipedia.org and test.wikidata.org as Resolved.Nov 19 2018, 10:00 PM

Marostegui closed this task as Resolved.Nov 22 2018, 1:33 PM

Liuxinyu970226 unsubscribed.Nov 23 2018, 12:13 PM

I upgraded to 1.37 from 1.31, and I got the error of ipb_sitewide missing.

I had to manually run "ALTER TABLE ipblocks ADD ipb_sitewide bool NOT NULL default 1;"