Fatal: Cannot block user at wikitech: Table 'labswiki.ipblocks_restrictions' doesn't exist
Closed, ResolvedPublic

Description

I tried to block a 1-page spammer, using default block settings, at https://wikitech.wikimedia.org/wiki/Special:Block/Morinakol
and got "Database error" - [d21d3f865fa10048eda3d560] 2018-11-16 06:23:47: Fatal exception of type "Wikimedia\Rdbms\DBQueryError"
I tried in different browser&machine and got [0e3d91b8b6c3cb9bce1e52ac] 2018-11-16 06:29:58: Fatal exception of type "Wikimedia\Rdbms\DBQueryError"

Possibly related to a schema change? cf T200963: Cannot save edits at Wikitech-wiki - exception - Wikimedia\Rdbms\DBQueryError

Quiddity created this task.Nov 16 2018, 6:36 AM
Restricted Application added a subscriber: Aklapper. · View Herald TranscriptNov 16 2018, 6:36 AM
Quiddity triaged this task as High priority.
exception.file	       	/srv/mediawiki/php-1.33.0-wmf.4/includes/libs/rdbms/database/Database.php:1506
exception.message	       	A database query error has occurred. Did you forget to run your application's database schema updater after upgrading? 
Query: SELECT  ir_ipb_id,ir_type,ir_value,page_namespace,page_title  FROM `ipblocks_restrictions` LEFT JOIN `page` ON (ir_type = '1' AND (ir_value=page_id))   WHERE ir_ipb_id = '1113'  
Function: MediaWiki\Block\BlockRestriction::loadByBlockId
Error: 1146 Table 'labswiki.ipblocks_restrictions' doesn't exist (10.64.16.79)
Aklapper renamed this task from Cannot block user at wikitech - fatal exception - Wikimedia\Rdbms\DBQueryError to Fatal: Cannot block user at wikitech: Table 'labswiki.ipblocks_restrictions' doesn't exist.Nov 16 2018, 9:46 AM
Bawolff added subscribers: aezell, dmaza, dbarratt and 2 others.

Definitely sounds partial block related. Cc'ing some AHT people.

Ensuring that neccesary schema changes are applied should be responsibility of people deploying the feature.

Wikitech experiences periods of significant vandalism, and is a bit of an island with a small pool of patrollers. It wouldn't be unreasonable for this to be UBN.

The CREATE TABLE SWAT was requested by me in T204151 and put it on the schedule here:
https://wikitech.wikimedia.org/w/index.php?title=Deployments&diff=1804762&oldid=1804728
I don't recall who performed the SWAT deploy but I can dig through and find out if that's important.

I asked the deployer to execute the CREATE TABLE on every wiki. As far as I know, they used a deployment script to loop through each wiki and create them. Perhaps this script needs to be updated to include wikitech?

This issue can be resolved by executing the CREATE TABLE sql, or by running the update script (if we run that on wikitech).

Since a table creation is not a schema change, then the folks on DBA might not be aware of this change.

However, please let me know if there is anything I or Anti-Harassment should have done differently.

Here is the SWAT deploy IRC log:

[23:00:04] <jouncebot>	 addshore, hashar, aude, MaxSem, twentyafterfour, RoanKattouw, Dereckson, thcipriani, Niharika, and zeljkof: #bothumor Q:How do functions break up? A:They stop calling each other. Rise for Evening SWAT (Max 6 patches) deploy. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20181001T2300).
[23:00:05] <jouncebot>	 davidwbarratt: A patch you scheduled for Evening SWAT (Max 6 patches) is about to be deployed. Please be around during the process. Note: If you break AND fix the wikis, you will be rewarded with a sticker.
[23:00:17] <davidwbarratt>	 I'm here!
[23:02:02] <davidwbarratt>	 who is SWATing?
[23:06:18] <davidwbarratt>	 hello?
[23:08:29] <dmaza>	 hello :p
[23:10:44] <davidwbarratt>	 ping addshore, hashar, aude, MaxSem, twentyafterfour, RoanKattouw, Dereckson, thcipriani, Niharika, and zeljkof:
[23:11:00] <MaxSem>	 I'm in a meeting, sorry
[23:11:03] <twentyafterfour>	 I can swat
[23:11:06] <davidwbarratt>	 no problem
[23:11:09] <wikibugs>	 10Operations, 10ops-ulsfo, 10netops: Interface errors on cr4-ulsfo:et-0/0/1 - https://phabricator.wikimedia.org/T205937 (10ayounsi) p:05Triage>03Normal
[23:11:11] <davidwbarratt>	 twentyafterfour thanks!
[23:11:27] <davidwbarratt>	 twentyafterfour it's a SQL patch to create a table
[23:12:04] <twentyafterfour>	 have we gotten any dba review of the change? 
[23:12:17] <davidwbarratt>	 yes, it's already merged into master
[23:12:51] <davidwbarratt>	 here's the task https://phabricator.wikimedia.org/T197144 and the DBA review(s) https://phabricator.wikimedia.org/T193449
[23:12:59] <davidwbarratt>	 but it is not a "schema change"
[23:13:17] <davidwbarratt>	 https://wikitech.wikimedia.org/wiki/Schema_changes#What_is_not_a_schema_change
[23:13:30] <twentyafterfour>	 ah I see
[23:13:43] <twentyafterfour>	 ok just creating a table should be fine, indeed 
[23:14:35] <RoanKattouw>	 Looks to me like you should run foreachwiki sql.php maintenance/postgres/archives/patch-ipblocks_restrictions-table.sql  , right?
[23:14:44] <RoanKattouw>	 Uh, except without the /postgres/ bit
[23:14:51] <wikibugs>	 (03CR) 10Cwhite: [C: 032] "NOOP on existing: https://puppet-compiler.wmflabs.org/compiler1001/12705/" [puppet] - 10https://gerrit.wikimedia.org/r/463868 (https://phabricator.wikimedia.org/T202782) (owner: 10Cwhite)
[23:14:52] <davidwbarratt>	 yep, just this: https://gerrit.wikimedia.org/r/c/mediawiki/core/+/440871/21/maintenance/archives/patch-ipblocks_restrictions-table.sql
[23:15:36] <twentyafterfour>	 ok do we need to cherry pick anything to the branch or was this already merged prior to branch cut?
[23:15:43] * twentyafterfour looks at date of merge
[23:15:54] <RoanKattouw>	 "Included in" says wmf.22 and wmf.23
[23:16:00] <davidwbarratt>	 it's already merged, it should be deployed already
[23:16:05] <twentyafterfour>	 yeah cool 
[23:16:13] <twentyafterfour>	 so just the foreachwiki should do it 
[23:16:34] <Reedy>	 gonna have to be run from a codfw mtx host though, not deploy1001
[23:16:51] <RoanKattouw>	 What is the codfw maintenance host? mwmaint2001?
[23:16:55] <Reedy>	 yeah
[23:17:06] <Reedy>	 codfw is the active, but deployment still is in eqiad
[23:17:06] <RoanKattouw>	 Also, for completeness, let me correct my mistake:  foreachwiki sql.php maintenance/archives/patch-ipblocks_restrictions-table.sql
[23:17:11] <twentyafterfour>	 not sure if I have access to that? 
[23:17:14] * twentyafterfour checks
[23:17:18] <Reedy>	 you will :)
[23:17:20] <RoanKattouw>	 OK cool. I've never been on mwmaint2001, only on mwmaint1001
[23:17:32] <Reedy>	 mwmaint2001.codfw.wmnet obvs
[23:18:12] <davidwbarratt>	 I assume that creating the tables will persist after the datacenter switch-backover ?
[23:18:21] <twentyafterfour>	 !log creating ipblocks_restrictions table (command run on mwmaint2001: foreachwiki sql.php maintenance/archives/patch-ipblocks_restrictions-table.sql)
[23:18:23] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[23:18:31] <twentyafterfour>	 davidwbarratt: it should
[23:18:42] <davidwbarratt>	 good
[23:18:48] <Reedy>	 we're still replicating across DC
[23:18:53] <davidwbarratt>	 ah
[23:18:54] <Reedy>	 just the masters are currently in codfw
[23:19:06] <Reedy>	 https://dbtree.wikimedia.org/ is a nice illustration of it
[23:19:35] <James_F>	 RoanKattouw: Welcome, newbie. ;-)
[23:19:50] <twentyafterfour>	 ok the queries are running
[23:20:00] <davidwbarratt>	 yay!
[23:20:09] <twentyafterfour>	 so far all say " Query OK, 0 row(s) affected"
[23:20:30] <Reedy>	 That's normal IIRC
[23:20:44] <twentyafterfour>	 is there an easy way to confirm that the table is created?
[23:21:01] <Reedy>	 sql aawiki
[23:21:07] <Reedy>	 explain ipblocks_restrictions;
[23:21:16] <Reedy>	 I've just confirmed it's there on the eqiad hosts
[23:21:40] <twentyafterfour>	 Cool, Thanks Reedy!
[23:21:46] <James_F>	 Table exists on codfw aawiki too.
[23:21:54] <twentyafterfour>	 still running, there are a lot of wikis ;) 
[23:21:56] <Reedy>	 James_F: I'd be amazed if they didn't :P
[23:22:03] <Reedy>	 That would be some witchcraft
[23:22:36] <twentyafterfour>	 replag == witchcraft
[23:22:48] <James_F>	 Reedy: I've seen some stuff you wouldn't believe. RepLag, C-beams glittering off the belt of Orion, etc.
[23:22:54] <James_F>	 Bah, twentyafterfour got there first.
[23:23:02] <twentyafterfour>	 hah 
[23:23:02] <Reedy>	 Are you planning on deploying code that uses that table soon after?
[23:23:30] <davidwbarratt>	 we can't until the schema change is complete (i.e. the patch that changes a column)
[23:24:00] <davidwbarratt>	 https://phabricator.wikimedia.org/T204006
[23:24:34] <Reedy>	 fair
[23:27:50] <icinga-wm>	 PROBLEM - puppet last run on wdqs1010 is CRITICAL: CRITICAL: Puppet has 1 failures. Last run 4 minutes ago with 1 failures. Failed resources (up to 3 shown): Exec[enforce-users-groups-cleanup]
[23:30:01] <davidwbarratt>	 twentyafterfour still running?
[23:30:21] <Reedy>	 probably...  over 900 wikis takes a while :)
[23:30:39] <twentyafterfour>	 davidwbarratt: yeah almost done I think 
[23:30:57] <twentyafterfour>	 at zawiki
[23:31:06] <twentyafterfour>	 and now it's done
[23:31:16] <mutante>	 herron: is the wdqs error about the new group?
[23:31:24] <twentyafterfour>	 !log finished creating database tables
[23:31:26] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[23:32:48] <wikibugs>	 (03Abandoned) 10MacFan4000: Remove MW 1.29 from ExtDist as it is now no longer supported [mediawiki-config] - 10https://gerrit.wikimedia.org/r/440745 (owner: 10MacFan4000)
[23:33:25] <mutante>	 herron: ignore it. it is definitely not. i see an unrelated issue
[23:34:05] <herron>	 mutante ok, I didn’t merge that yet fwiw
[23:34:26] <mutante>	 herron: yea, i see it. it just seemed like it because "groups-cleanup" 
[23:34:37] <davidwbarratt>	 YAY!
[23:34:46] <davidwbarratt>	 thanks twentyafterfour !
[23:34:51] <mutante>	 but the issue that somebody manually added a user or installed softare
[23:35:06] <twentyafterfour>	 davidwbarratt: You're welcome. Thanks to Reedy and RoanKattouw for helping out.

This table should've been created on labswiki and labtestwiki indeed. Probably the script needs to be upgraded.
I have created it now on db1073 for labswiki and labtestwiki.

Can you confirm if this works now?
Thanks,

Marostegui moved this task from Triage to Done on the DBA board.Nov 16 2018, 3:06 PM
Anomie added a subscriber: Anomie.Nov 16 2018, 3:58 PM

This table should've been created on labswiki and labtestwiki indeed. Probably the script needs to be upgraded.

foreachwiki uses all.dblist, which does include labswiki and labtestwiki.

When something is running through all 932 wikis, if there's an error on one wiki in the middle it can scroll off the screen and not be seen by humans. It can be useful to use script or screen -L to run the command, or pipe the output through 2>&1 | tee some-file.txt, to capture the output for later review.

So, for example, if the wikitech master (db2037 during the DC switch, I think?) was read-only for some reason when the maintenance script was run, that could have been the cause.

chasemp added subscribers: Andrew, Reedy.EditedNov 16 2018, 3:58 PM

@Reedy or @Andrew any idea why labswiki could be left behind here? @Bawolff and I were wondering if it could have had something to do with DC switchover things happening around the same time. Not sure if this should be held open to fix whatever caused this in the first place.

Edit: Thanks @Anomie, I had not seen your reply when I commented :)

[Making public is blocks work again]

Tested blocking on wikitech. It worked

Bawolff changed the visibility from "Custom Policy" to "Public (No Login Required)".Nov 16 2018, 4:17 PM
Restricted Application added a subscriber: MGChecker. · View Herald TranscriptNov 16 2018, 4:17 PM

I am removing the DBA tag as there is nothing else left for us here. I will remain subscribed to the task though.

There is a good follow up suggested by @Anomie at T209674#4754013 to make sure following iterations do not suffer this same problem

So, for example, if the wikitech master (db2037 during the DC switch, I think?) was read-only for some reason when the maintenance script was run, that could have been the cause.

Misc services (wikitech lives on a misc service) masters where not failed over, so wikitech master role remained on db1073 at all times, so maybe the script tried to write to codfw misc master and that was indeed on read-only (and it should be like that) - that could explain it.
We were on codfw from 12th Sept to 10th Oct - was the table created during those days?

Anomie added a comment.EditedNov 16 2018, 6:07 PM

Misc services (wikitech lives on a misc service) masters where not failed over, so wikitech master role remained on db1073 at all times, so maybe the script tried to write to codfw misc master and that was indeed on read-only (and it should be like that) - that could explain it.

Since db-codfw.php points to db2037, that's what maintenance from mwmaint2001 would have connected to.

We were on codfw from 12th Sept to 10th Oct - was the table created during those days?

2018-10-01 at around 23:18 UTC.

@chasemp: It looks like we have our explanation: labswiki and labtestwiki weren't part of the DC switch, so maintenance from mwmaint2001 couldn't write these wikis then (much like it can't write any wikis now when codfw is not the active DC). The humans running the maintenance probably just missed seeing the readonly error messages for those two wikis in the middle of the foreachwiki boilerplate and success messages for the other 930 wikis.

The humans running the maintenance probably just missed seeing the readonly error messages for those two wikis in the middle of the foreachwiki boilerplate and success messages for the other 930 wikis.

Would it be possible for the script to collect and present all of the errors at the end of the operation? This might make it more obvious there was a failure.

foreachwiki doesn't know anything about the maintenance script being run, it just prints a header and pipes stdout through sed to prefix each line. Depending on the script, that may be a lot of lines.

The best it might do would be to list the wikis for which the maintenance script exited with a non-zero exit code. Not all scripts do, although it looks like sql.php would have here. Anyway, it might be better to file a separate task for that if you want.

This comment was removed by chasemp.
chasemp closed this task as Resolved.Wed, Nov 21, 4:12 PM
chasemp claimed this task.

The best it might do would be to list the wikis for which the maintenance script exited with a non-zero exit code. Not all scripts do, although it looks like sql.php would have here. Anyway, it might be better to file a separate task for that if you want.

Yep, l'll resolve this as the original issue is good to go. Thanks all, a separate task to pursue the larger contributing issue seems ideal. @dbarratt would you mind making that?

Yep, l'll resolve this as the original issue is good to go. Thanks all, a separate task to pursue the larger contributing issue seems ideal. @dbarratt would you mind making that?

Done. T210474