Beta update.php broken since Nov 15th 19:20
Closed, ResolvedPublic

Description

update.php on the beta cluster is broken since Nov 15th 19:20. Running it manually on deployment-tin:

$ sudo su - mwdeploy
$ mwscript update.php --wiki=enwiki --quick
#!/usr/bin/env php
MediaWiki 1.29.0-alpha Updater

Your composer.lock file is up to date with current dependencies!
Going to run database updates for enwiki
Depending on the size of your database this may take a while!
...echo_subscription doesn't exist.
...vote_ip in table securepoll_votes already modified by patch /srv/mediawiki-staging/php-master/extensions/SecurePoll/patches/patch-vote_ip-extend.sql.
Turning off Content Handler DB fields for this part of upgrade.
Set $wgShowExceptionDetails = true; in LocalSettings.php to show detailed debugging information.

https://integration.wikimedia.org/ci/view/Beta/job/beta-update-databases-eqiad/

Restricted Application added a subscriber: Aklapper. · View Herald TranscriptNov 16 2016, 9:18 AM
hashar added a comment.EditedNov 16 2016, 9:20 AM
This comment has been deleted.

I havent tried to find the triggered exception yet. I am looking at T150833 which causes:

Fatal error: Stack overflow in /srv/mediawiki/php-master/includes/libs/objectcache/BagOStuff.php on line 754

That might be related.

I had the commit triggering T150833 reverted on beta cluster but update.php (mwscript update.php --wiki=enwiki --quick) still fails. So that is unrelated.

$ strace -f -y -s1024 /usr/local/bin/mwscript update.php --wiki=aawiki --quick
execve("/usr/local/bin/mwscript", ["/usr/local/bin/mwscript", "update.php", "--wiki=aawiki", "--quick"], [/* 14 vars */]) = 0
[pid 12630] execve("/usr/bin/sudo", ["sudo", "-u", "www-data", "php5", "/srv/mediawiki-staging/multiversion/MWScript.php", "update.php", "--wiki=aawiki", "--quick"], [/* 13 vars */]) = 0
[pid 12630] write(2</dev/pts/1>, "sudo: effective uid is not 0, is /usr/bin/sudo on a file system with the 'nosuid' option set or an NFS file system without root privileges?\n", 140sudo: effective uid is not 0, is /usr/bin/sudo on a file system with the 'nosuid' option set or an NFS file system without root privileges?
) = 140

Sudo broken somehow? :(

hashar triaged this task as Unbreak Now! priority.Nov 16 2016, 11:52 AM
Restricted Application added subscribers: Jay8g, Luke081515, TerraCodes. · View Herald TranscriptNov 16 2016, 11:52 AM

I could not figure out why $wgShowExceptionDetails is not true when running update.php so I have just live hacked the related PHP files:

  • includes/exception/MWException.php
  • includes/exception/MWExceptionRenderer.php

(rant: both files have a lot of copy pasted code in common).

The exception ends up being:

[74b7d066c02b82925901b350] [no req]   DBReplicationWaitError from line 357 of /srv/mediawiki-staging/php-master/includes/libs/rdbms/lbfactory/LBFactory.php: Could not wait for replica DBs to catch up to deployment-db03
Backtrace:
#0 /srv/mediawiki-staging/php-master/includes/installer/DatabaseUpdater.php(476): LBFactory->waitForReplication()
#1 /srv/mediawiki-staging/php-master/includes/installer/DatabaseUpdater.php(433): DatabaseUpdater->runUpdates(array, boolean)
#2 /srv/mediawiki-staging/php-master/maintenance/update.php(172): DatabaseUpdater->doUpdates(array)
#3 /srv/mediawiki-staging/php-master/maintenance/doMaintenance.php(111): UpdateMediaWiki->execute()
#4 /srv/mediawiki-staging/php-master/maintenance/update.php(217): require_once(string)
#5 /srv/mediawiki-staging/multiversion/MWScript.php(97): require_once(string)
#6 {main}

So maybe the slave replication is broken :(

On the slave we have

Error 'Can't drop database 'dewiktionary'; database doesn't exist' on query. Default database: 'dewiktionary'. Query: 'drop database dewiktionary'
$ echo 'SHOW SLAVE STATUS \G'|sql  aawiki
*************************** 1. row ***************************
               Slave_IO_State: Waiting for master to send event
                  Master_Host: deployment-db03.eqiad.wmflabs
                  Master_User: repl
                  Master_Port: 3306
                Connect_Retry: 60
              Master_Log_File: deployment-db03-bin.000009
          Read_Master_Log_Pos: 592967957
               Relay_Log_File: deployment-db04-relay-bin.000011
                Relay_Log_Pos: 592968119
        Relay_Master_Log_File: deployment-db03-bin.000009
             Slave_IO_Running: Yes
            Slave_SQL_Running: No
              Replicate_Do_DB: 
          Replicate_Ignore_DB: 
           Replicate_Do_Table: 
       Replicate_Ignore_Table: 
      Replicate_Wild_Do_Table: 
  Replicate_Wild_Ignore_Table: 
                   Last_Errno: 1008
                   Last_Error: Error 'Can't drop database 'dewiktionary'; database doesn't exist' on query. Default database: 'dewiktionary'. Query: 'drop database dewiktionary'
                 Skip_Counter: 0
          Exec_Master_Log_Pos: 592967822
              Relay_Log_Space: 592974211
              Until_Condition: None
               Until_Log_File: 
                Until_Log_Pos: 0
           Master_SSL_Allowed: No
           Master_SSL_CA_File: 
           Master_SSL_CA_Path: 
              Master_SSL_Cert: 
            Master_SSL_Cipher: 
               Master_SSL_Key: 
        Seconds_Behind_Master: NULL
Master_SSL_Verify_Server_Cert: No
                Last_IO_Errno: 0
                Last_IO_Error: 
               Last_SQL_Errno: 1008
               Last_SQL_Error: Error 'Can't drop database 'dewiktionary'; database doesn't exist' on query. Default database: 'dewiktionary'. Query: 'drop database dewiktionary'
  Replicate_Ignore_Server_Ids: 
             Master_Server_Id: 172234526
               Master_SSL_Crl: 
           Master_SSL_Crlpath: 
                   Using_Gtid: No
                  Gtid_IO_Pos: 

Mentioned in SAL (#wikimedia-releng) [2016-11-16T12:24:28Z] <hashar> beta: created dewiktionary table on the Database slave. Restarted replication with START SLAVE; T150834 T150764

hashar closed this task as Resolved.Nov 16 2016, 12:26 PM
hashar claimed this task.

That has been caused by the addition of the German Wiktionary ( T150764 ). The MySQL master/slave replication got broken because the dewiktionary database was not created on the slave and "drop database dewiktionary;" failed to replicate.

I have manually created the database on the slave, issued START SLAVE and replication caught up from there.

update.php works now! https://integration.wikimedia.org/ci/job/beta-update-databases-eqiad/12839/console

hashar reopened this task as Open.Nov 16 2016, 12:30 PM

And the job fails:

$ mwscript update.php --wiki=dewiktionary  --quick
#!/usr/bin/env php
MediaWiki 1.29.0-alpha Updater

Your composer.lock file is up to date with current dependencies!

dewiktionary is not on the master database. So gotta remove it from the dblist.

Change 321866 had a related patch set uploaded (by Hashar):
Revert "Add German Wiktionary in beta"

https://gerrit.wikimedia.org/r/321866

Change 321866 merged by jenkins-bot:
Revert "Add German Wiktionary in beta"

https://gerrit.wikimedia.org/r/321866

hashar closed this task as Resolved.Nov 16 2016, 12:54 PM

12:54 UTC Project beta-update-databases-eqiad build #12840: FIXED in 1 min 7 sec: https://integration.wikimedia.org/ci/job/beta-update-databases-eqiad/12840/