Page MenuHomePhabricator

beta-update-databases-eqiad being aborted at 45 minutes since June 9th
Closed, ResolvedPublic

Description

https://integration.wikimedia.org/ci/job/beta-update-databases-eqiad/

Seems to be getting stuck at a certain point (the same point from two random-ish runs I checked):

05:20:08 ...oathauth_users table already exists.
05:20:08 ...oathauth_users table does not contain secret_reset field.
05:20:08 ...nl_newsletters table already exists.
05:20:08 ...nl_issues table already exists.
05:20:08 ...nl_subscriptions table already exists.
05:20:08 ...nl_publishers table Build timed out (after 45 minutes). Marking the build as aborted.
06:05:00 Build was aborted
06:05:00 [PostBuildScript] - Execution post build scripts.
06:05:00 [beta-update-databases-eqiad] $ /bin/bash -xe /tmp/hudson9191744524502974067.sh
06:05:00 + /srv/deployment/integration/slave-scripts/bin/global-teardown.sh
06:05:00 IRC notifier plugin: Sending notification to: #wikimedia-releng
06:05:00 Finished: ABORTED

Event Timeline

Last good build was at Jun 9, 2016 4:20:00 PM

The currently stall build is attempting to update enwiki. Logstash shows a few queries have lost connection, probably due to a timeout that is enforced by hhvm/mysql or whatever. I would guess that some extension or core introduced an update that is wayy too long.

I have killed update.php and the last lines were:

Removing orphaned echo_event rows... 
Terminated

Running it manually:

$ mwscript update.php --wiki=enwiki --quick
...
Removing orphaned echo_event rows...

The database shows a bunch of:

SELECT /* BatchRowIterator::next  */  event_id  FROM `echo_event` LEFT JOIN `echo_notification` ON (

No clue whether it is related. Eventually all queries are idling and update.php is stall waiting. Via strace:

write(9, SELECT /* BatchRowIterator::next  */  event_id  FROM `echo_event` LEFT JOIN `echo_notification` ON ((notification_event=event_id))  WHERE notification_user IS NULL  ORDER BY event_id ASC LIMIT 500
read(9,

Which does not show in the SHOW PROCESSLIST;. Then:

Function: BatchRowIterator::next
Error: 2013 Lost connection to MySQL server during query (10.68.17.94)

) = 318

Query plan:

(wikiadmin@deployment-db1) [enwiki]> EXPLAIN SELECT /* BatchRowIterator::next  */  event_id  FROM `echo_event` LEFT JOIN `echo_notification` ON ((notification_event=event_id))  WHERE notification_user IS NULL  ORDER BY event_id ASC LIMIT 500;
+------+-------------+-------------------+-------+---------------+---------------------+---------+------+--------+--------------------------------------------------------------------------+
| id   | select_type | table             | type  | possible_keys | key                 | key_len | ref  | rows   | Extra                                                                    |
+------+-------------+-------------------+-------+---------------+---------------------+---------+------+--------+--------------------------------------------------------------------------+
|    1 | SIMPLE      | echo_event        | index | NULL          | echo_event_type     | 66      | NULL | 290477 | Using index; Using temporary; Using filesort                             |
|    1 | SIMPLE      | echo_notification | index | NULL          | echo_user_timestamp | 18      | NULL | 541505 | Using where; Using index; Not exists; Using join buffer (flat, BNL join) |
+------+-------------+-------------------+-------+---------------+---------------------+---------+------+--------+--------------------------------------------------------------------------+
2 rows in set (0.01 sec)

(wikiadmin@deployment-db1) [enwiki]>

Mentioned in SAL [2016-06-13T08:30:57Z] <hashar> Beta: mwscript extensions/Echo/maintenance/removeInvalidTargetPage.php --wiki=enwiki for T137615

Deleted 497 records from 500 records
Deleted 500 records from 500 records
Deleted 500 records from 500 records
Deleted 344 records from 500 records
Deleted 165 records from 500 records
Deleted 27 records from 500 records
Deleted 0 records from 500 records
Deleted 0 records from 500 records
Deleted 442 records from 500 records
Deleted 500 records from 500 records
Deleted 500 records from 500 records
Deleted 500 records from 500 records
Deleted 500 records from 500 records
Deleted 262 records from 500 records
Deleted 235 records from 500 records
Deleted 0 records from 500 records
Deleted 0 records from 500 records
Deleted 0 records from 500 records
Deleted 494 records from 500 records
Deleted 500 records from 500 records
Deleted 500 records from 500 records
Deleted 500 records from 500 records
Deleted 500 records from 500 records
Deleted 475 records from 500 records
Deleted 132 records from 500 records
Deleted 394 records from 500 records
Deleted 213 records from 500 records
Deleted 133 records from 500 records
Deleted 0 records from 500 records
Deleted 0 records from 500 records
Deleted 30 records from 500 records
Deleted 182 records from 500 records
Deleted 50 records from 308 records

This is still happening as of 17:05 UTC.

@Mattflaschen-WMF you added that collab team project, are you looking into this?

How is it possible that this is still happening? @hashar says he ran the script to completion, and it's a logged update, so it shouldn't run twice.

A workaround we could consider is to manually insert a row with ul_key='RemoveOrphanedEvents', ul_value=NULL into the updatelog table.

This being the aborted job, didn't look at logs

The Echo removeInvalidTargetPage.php was me blindly attempting to cleanup the beta cluster enwiki database, that does not clean up the orphaned events though.

I have looked a bit more at it, and one of the issue is the huge amounts of Flow related notifications which cause the join to be crazy. A behavior that might well exist on other wikis?

For now we have 300k echo_event and 528k echo_notifications which is way too large for the JOIN. The batch iterator executes it over and over with a LIMIT of 500 but that does not help there. I guess we would need a batch of event ids instead to limit the JOIN scope.

removeInvalidTargetPage.php is ancient, it was added in 2014. It's also not supposed to be run from update.php...

I guess update.php block on the same SQL as extensions/Echo/maintenance/removeOrphanedEvents.php

There is a difference in the query execution plan between beta and prod:

EXPLAIN SELECT /* BatchRowIterator::next  */  event_id  FROM `echo_event` LEFT JOIN `echo_notification` ON ((notification_event=event_id)) ;

Beta:

idselect_typetabletypepossible_keyskeykey_lenrefrowsExtra
1SIMPLEecho_eventindexNULLecho_event_type66NULL367397Using index
1SIMPLEecho_notificationindexNULLecho_user_timestamp18NULL476890Using where; Using index; Using join buffer (flat, BNL join)

Prod (mediawikiwiki)

idselect_typetabletypepossible_keyskeykey_lenrefrowsExtra
1SIMPLEecho_eventindexNULLevent_type66NULL340284Using index
1SIMPLEecho_notificationrefecho_notification_eventecho_notification_event4mediawikiwiki.echo_event.event_id3Using index

Looks like the echo_notification table in beta lacks an index on notification_event. It does not show up in Echo code for sure.

Mentioned in SAL [2016-06-13T19:22:00Z] <hashar> T137615 deployment-db2, enwiki > CREATE INDEX echo_notification_event ON echo_notification (notification_event);

Looks like the echo_notification table in beta lacks an index on notification_event. It does not show up in Echo code for sure.

Ah yes, that index was created recently in production, but in master it's introduced by a patch that's not merged yet. So it looks like we've accidentally let master get in the state where update.php runs a script that needs that index, without creating that index.

I'll fix that by splitting off this index creation.

On beta after INDEX echo_notification_event ON echo_notification (notification_event);

idselect_typetabletypepossible_keyskeykey_lenrefrowsExtra
1SIMPLEecho_eventindexNULLecho_event_type66NULL147930Using index
1SIMPLEecho_notificationrefecho_notification_eventecho_notification_event4NULL2633Using index

That is WAYYYYYY less rows and the query is now blazing fast.

I'll fix that by splitting off this index creation.

Done in https://gerrit.wikimedia.org/r/294111

Mentioned in SAL [2016-06-13T19:55:05Z] <hashar> T137615 deployment-db2, eswiki > CREATE INDEX echo_notification_event ON echo_notification (notification_event);

So the issue was the introduction in Echo of a script/db update: RemoveOrphanedEvents . It depends on index which for production has been deployed already but for which the Gerrit change is still pending. Hence on beta, the script/update got pushed without the index.

Manually creating the index on enwiki confirmed the update to be instant/fast https://integration.wikimedia.org/ci/job/beta-update-databases-eqiad/ is now green.

@Catrope kindly extracted the database schema update in a standalone patch https://gerrit.wikimedia.org/r/#/c/294111/ . Once merged, the index will be created on all beta cluster database and we will be clear.

So essentially solved, pending for Roan change to land/merge and then Notifications can mark this task solved! Thank you @Catrope !

So essentially solved, pending for Roan change to land/merge and then Notifications can mark this task solved! Thank you @Catrope !

Two last builds - #9182 and #9183 failed (https://integration.wikimedia.org/ci/job/beta-update-databases-eqiad/) - is it a different issue?

21:20:12     raise Exception("command: ", cmd, "output: ", f.read())
21:20:12 Exception: ('command: ', '/usr/local/bin/mwscript update.php --wiki=enwiki --quick', 'output: ',
.................
21:20:12 Build step 'Execute shell' marked build as failure
21:20:12 [PostBuildScript] - Execution post build scripts.
21:20:12 [beta-update-databases-eqiad] $ /bin/bash -xe /tmp/hudson7853251939077959123.sh
21:20:12 + /srv/deployment/integration/slave-scripts/bin/global-teardown.sh
21:20:13 IRC notifier plugin: Sending notification to: #wikimedia-releng
21:20:13 Finished: FAILURE

hmm, seeing a problem that's possibly related.

update.php is currently throwing:

[26e477f31cd24e48e880c2d3] [no req]   DBReplicationWaitError from line 393 of /mnt/srv/mediawiki-staging/php-master/includes/db/loadbalancer/LBFactory.php: Could not wait for slaves 
to catch up to deployment-db1

update.php evidently is breaking replication wit h a duplicate key error:

mysql> show slave status\G
*************************** 1. row ***************************
               Slave_IO_State: Waiting for master to send event
                  Master_Host: deployment-db1
                  Master_User: repl
                  Master_Port: 3306
                Connect_Retry: 60
              Master_Log_File: deployment-db1-bin.000076
          Read_Master_Log_Pos: 540987782
               Relay_Log_File: deployment-db2-relay-bin.000045
                Relay_Log_Pos: 716070
        Relay_Master_Log_File: deployment-db1-bin.000076
             Slave_IO_Running: Yes
            Slave_SQL_Running: No
              Replicate_Do_DB:
          Replicate_Ignore_DB:
           Replicate_Do_Table:
       Replicate_Ignore_Table:
      Replicate_Wild_Do_Table:
  Replicate_Wild_Ignore_Table:
                   Last_Errno: 1061
                   Last_Error: Error 'Duplicate key name 'echo_notification_event'' on query. Default database: 'eswiki'. Query: 'CREATE /* DatabaseBase::sourceFile( /mnt/srv/mediawiki-staging/php-master/extensions/Echo/db_patches/patch-add-notification_event-index.sql )  */ INDEX echo_notification_event ON `echo_notification` (notification_event)'
                 Skip_Counter: 0
          Exec_Master_Log_Pos: 540987214
              Relay_Log_Space: 717234
              Until_Condition: None
               Until_Log_File:
                Until_Log_Pos: 0
           Master_SSL_Allowed: No
           Master_SSL_CA_File:
           Master_SSL_CA_Path:
              Master_SSL_Cert:
            Master_SSL_Cipher:
               Master_SSL_Key:
        Seconds_Behind_Master: NULL
Master_SSL_Verify_Server_Cert: No
                Last_IO_Errno: 0
                Last_IO_Error:
               Last_SQL_Errno: 1061
               Last_SQL_Error: Error 'Duplicate key name 'echo_notification_event'' on query. Default database: 'eswiki'. Query: 'CREATE /* DatabaseBase::sourceFile( /mnt/srv/mediawiki-staging/php-master/extensions/Echo/db_patches/patch-add-notification_event-index.sql )  */ INDEX echo_notification_event ON `echo_notification` (notification_event)'
  Replicate_Ignore_Server_Ids:
             Master_Server_Id: 172232897
1 row in set (0.00 sec)

That's very strange. update.php is supposed to check if the key exists before trying to create it. Perhaps @hashar created the key on one of the slaves without creating it on the master?

hmm, key is definitely there now:

thcipriani@deployment-tin:~$ sql eswiki
Reading table information for completion of table and column names
You can turn off this feature to get a quicker startup with -A

Welcome to the MySQL monitor.  Commands end with ; or \g.
Your MySQL connection id is 22291637
Server version: 5.5.34-MariaDB-1~precise-log mariadb.org binary distribution

Copyright (c) 2000, 2016, Oracle and/or its affiliates. All rights reserved.

Oracle is a registered trademark of Oracle Corporation and/or its
affiliates. Other names may be trademarks of their respective
owners.

Type 'help;' or '\h' for help. Type '\c' to clear the current input statement.

mysql> show index from echo_notification where Key_name = 'echo_notification_event'\G
*************************** 1. row ***************************
        Table: echo_notification
   Non_unique: 1
     Key_name: echo_notification_event
 Seq_in_index: 1
  Column_name: notification_event
    Collation: A
  Cardinality: 231
     Sub_part: NULL
       Packed: NULL
         Null: 
   Index_type: BTREE
      Comment: 
Index_comment: 
1 row in set (0.00 sec)

Fixed replication on deployment-db2 and beta-update-databases seems(?) ok now.

Two last builds (#9235 and #9236) are successful.

That's very strange. update.php is supposed to check if the key exists before trying to create it. Perhaps @hashar created the key on one of the slaves without creating it on the master?

Most probably. I have definitely created the key manually for enwiki and eswiki (and maybe nlwiki as well). And I have done it via sql --write enwiki. So I guess I have only had them created on the master.

I guess I assumed the index creation to somehow replicate to the slaves :(

Last builds looks all ok. Thx for the slave fix!