
Wikiapiary: Database server down
Open, Needs Triage, Public

Description

I'm getting two errors when accessing WikiApiary:

Warning: Redis::connect(): connect() failed: Connection refused in /srv/www/mediawiki/public_html/w/includes/libs/redis/RedisConnectionPool.php on line 239

and

Sorry! This site is experiencing technical difficulties. Try waiting a few minutes and reloading. (Cannot access the database: Connection refused (db-local))

Event Timeline

Kghbln added a subscriber: Kghbln.

Cause: https://lists.wikimedia.org/pipermail/cloud/2019-February/000538.html
(WikiApiary is on the host named pub2 mentioned here.)

I haven't had a chance to work on this yet and probably won't until this weekend because of my workload. I'm asking people in MediaWiki-Stakeholders-Group to help out if they have cloud access and time.

I've updated the LocalSettings.php on the site to point to this ticket. If you have cloud access and want to help with the recovery of the site, I've copied the site to /srv/www/mediawiki/public_html/testing/ and it can be accessed here.

-- Copy of the end of /var/log/mysql/error.log.

Tgr added a subscriber: Tgr. Feb 19 2019, 7:06 PM

Redis says (after fixing a file permission issue):

26449:M 19 Feb 19:04:48.353 # Wrong signature trying to load DB from file
26449:M 19 Feb 19:04:48.353 # Fatal error loading the DB: Invalid argument. Exiting.
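As far as I understand, Redis logs "Wrong signature" when the dump file does not begin with the five-byte magic string `REDIS` (followed by a four-digit format version). A minimal sketch for checking a dump file by hand follows; the function names are hypothetical and the path passed in is whatever the local Redis configuration uses.

```python
RDB_MAGIC = b"REDIS"  # RDB dumps start with this, then a 4-digit version such as b"0008"


def rdb_signature_ok(path):
    """Return True if the file begins with the RDB magic string."""
    with open(path, "rb") as f:
        return f.read(len(RDB_MAGIC)) == RDB_MAGIC


def rdb_version(path):
    """Return the RDB format version as an int, or None if the header is not valid."""
    with open(path, "rb") as f:
        header = f.read(9)
    if len(header) < 9 or header[:5] != RDB_MAGIC or not header[5:].isdigit():
        return None
    return int(header[5:].decode("ascii"))
```

If `rdb_signature_ok` returns False, the file is truncated, zeroed, or not an RDB dump at all, which is consistent with discarding it and letting Redis start empty.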

MySQL says:

2019-02-19 13:50:36 140614250160704 [Note] InnoDB: Restoring possible half-written data pages from the doublewrite buffer...
InnoDB: 1 transaction(s) which must be rolled back or cleaned up
...
2019-02-19 13:50:36 140614250160704 [Note] InnoDB: Starting final batch to recover 101 pages from redo log
2019-02-19 13:50:36 7fe32d3fe700  InnoDB: Assertion failure in thread 140613693466368 in file fut0lst.ic line 83
InnoDB: Failing assertion: addr.page == FIL_NULL || addr.boffset >= FIL_PAGE_DATA
...
InnoDB: If you get repeated assertion failures or crashes, even
InnoDB: immediately after the mysqld startup, there may be
InnoDB: corruption in the InnoDB tablespace. Please refer to
InnoDB: http://dev.mysql.com/doc/refman/5.6/en/forcing-innodb-recovery.html
InnoDB: about forcing recovery.

For Redis, I guess we could just nuke the data and start from scratch.
The /srv/mysql directory is currently 78G and the disk has only a few GB of free space left, so if we want to make a backup before doing the forced recovery, someone has to copy a snapshot to their local machine.

Copying via rsync is in progress.

Tgr added a comment. Feb 20 2019, 2:00 AM

MariaDB needs at least force recovery level 3 (do not roll back aborted transactions) to start. If I start it like that, mariadbcheck --all-databases doesn't find anything wrong. So there is probably no data loss, but a half-finished transaction might have left the DB in an inconsistent state. Given the kind of data we store in Wikiapiary, we probably don't care about that. Just have to figure out how to permanently discard the unfinished transactions.
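For reference, the recovery level is set with the `innodb_force_recovery` server option. The sketch below only renders a config fragment for a given level; the level names are the ones I recall from the InnoDB documentation, and the helper name is hypothetical. It does not touch any server.

```python
# InnoDB forced-recovery levels; each level includes the effects of the lower ones.
FORCE_RECOVERY_LEVELS = {
    1: "SRV_FORCE_IGNORE_CORRUPT",
    2: "SRV_FORCE_NO_BACKGROUND",
    3: "SRV_FORCE_NO_TRX_UNDO",  # do not roll back transactions after crash recovery
    4: "SRV_FORCE_NO_IBUF_MERGE",
    5: "SRV_FORCE_NO_UNDO_LOG_SCAN",
    6: "SRV_FORCE_NO_LOG_REDO",
}


def force_recovery_cnf(level):
    """Render a my.cnf fragment that starts the server at the given recovery level."""
    if level not in FORCE_RECOVERY_LEVELS:
        raise ValueError(f"innodb_force_recovery must be 1..6, got {level!r}")
    name = FORCE_RECOVERY_LEVELS[level]
    return f"[mysqld]\n# {name}\ninnodb_force_recovery = {level}\n"


print(force_recovery_cnf(3))
```

The usual advice is to use the lowest level that lets the server start and to remove the setting again once the data has been dumped.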

Just have to figure out how to permanently discard the unfinished transactions.

During my quick search, this Percona information, which may or may not be helpful, was as close as I got.

Tgr added a comment. Feb 20 2019, 5:42 PM

So I guess the least bad option here is to bring up the DB in force recovery mode, dump the database to another machine (since the VM is out of space), nuke the data and restore from the dump. @MarkAHershberger do you have another 100G on that server to make a dump? :)

The server I copied to is a Debian server with the same version of MariaDB installed. Could I point that server at the DB that I copied, then dump that, restore it, and copy it back to wmflabs?

Ideally, I could give you an account on my server and you could do this there.

BTW: du reports that the db takes 66G on the server I copied it to. df reports 92G free in that partition. So, I think there is room.
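The du/df comparison above can be sketched as a small check. This is an illustration only: it sums apparent file sizes, which can differ from what `du` reports in disk blocks, and the 10% safety margin and function names are assumptions.

```python
import os
import shutil


def dir_size_bytes(root):
    """Sum the sizes of all regular files under root (symlinks are skipped)."""
    total = 0
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            path = os.path.join(dirpath, name)
            if not os.path.islink(path):
                total += os.path.getsize(path)
    return total


def fits_on(data_dir, target_dir, margin=1.1):
    """Return True if data_dir (times a safety margin) fits in target_dir's free space."""
    needed = dir_size_bytes(data_dir) * margin
    free = shutil.disk_usage(target_dir).free
    return needed <= free
```

For example, `fits_on("/srv/mysql", "/some/backup/dir")` would answer whether a copy of the data directory fits on the partition holding the (hypothetical) backup directory.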

Tgr added a comment. Feb 20 2019, 9:46 PM

That works too, if you don't use mariadb on that server otherwise.

debian9.nichework.com should have the whole dump now. Sent email with access instructions.

Tgr added a comment. Feb 23 2019, 4:16 AM

MariaDB [(none)]> show engine innodb status;
...
---TRANSACTION 331032999, ACTIVE 12 sec recovered trx
...

MariaDB [(none)]> select * from information_schema.innodb_trx \G
*************************** 1. row ***************************
                    trx_id: 331032999
                 trx_state: RUNNING
               trx_started: 2019-02-23 03:42:22
     trx_requested_lock_id: NULL
          trx_wait_started: NULL
                trx_weight: 0
       trx_mysql_thread_id: 0
                 trx_query: NULL
       trx_operation_state: NULL
         trx_tables_in_use: 0
         trx_tables_locked: 0
          trx_lock_structs: 0
     trx_lock_memory_bytes: 360
           trx_rows_locked: 0
         trx_rows_modified: 0
   trx_concurrency_tickets: 0
       trx_isolation_level: REPEATABLE READ
         trx_unique_checks: 1
    trx_foreign_key_checks: 1
trx_last_foreign_key_error: NULL
 trx_adaptive_hash_latched: 0
 trx_adaptive_hash_timeout: 10000
          trx_is_read_only: 0
trx_autocommit_non_locking: 0
1 row in set (0.00 sec)

So I can see the recovered transaction, but it's not connected to any process and thus there is no way to kill it. (Some articles mention XA RECOVER, but that didn't work here.)
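To illustrate the point about the transaction having no owning connection: in the `\G` output above, `trx_mysql_thread_id` is 0, so there is no thread id to pass to `KILL`. A rough sketch that parses such vertical output and picks out these rows follows; the parser and function names are hypothetical, and it assumes the plain `key: value` layout shown above.

```python
def parse_vertical(text):
    """Parse mysql client '\\G' output into a list of {column: value} dicts."""
    rows = []
    current = None
    for line in text.splitlines():
        stripped = line.strip()
        if stripped.startswith("***") and "row" in stripped:
            # A row separator such as "*** 1. row ***" starts a new record.
            current = {}
            rows.append(current)
        elif current is not None and ":" in stripped:
            key, _, value = stripped.partition(":")
            current[key.strip()] = value.strip()
    return rows


def unattached_transactions(rows):
    """Return rows whose transaction has no client thread (thread id 0)."""
    return [r for r in rows if r.get("trx_mysql_thread_id") == "0"]
```

Feeding it the output above would return the single row for transaction 331032999.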

Tgr added a comment. Feb 24 2019, 10:10 AM

Dumped the data in recovery level 3 using sudo mysqldump --add-drop-database --create-options --single-transaction --databases ... (--single-transaction is needed to avoid error 1449: "The user specified as a definer (...) does not exist" when using LOCK TABLES). Couldn't drop the tables (that gave ERROR 2013 (HY000): Lost connection to MySQL server during query). I deleted all the /srv/mysql/ib* files, but still couldn't drop the tables (ERROR 1030 (HY000): Got error 1 "Operation not permitted" from storage engine partition).

So in the end I deleted everything, ran mysql_install_db, and imported the tables (apiary, mw_wikiapiary, mysql). The site seems to be up now; I did not test it much.
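The dump step described above can be sketched as a helper that assembles the mysqldump command line with the flags quoted in the comment. It only builds and prints the command, it does not run it; the database name in the example is a placeholder, not the list used on the server.

```python
import shlex


def mysqldump_argv(databases):
    """Build a mysqldump command using the flags quoted in the comment above."""
    if not databases:
        raise ValueError("at least one database name is required")
    return [
        "mysqldump",
        "--add-drop-database",   # emit DROP DATABASE before each CREATE DATABASE
        "--create-options",      # keep table options in CREATE TABLE statements
        "--single-transaction",  # consistent snapshot instead of LOCK TABLES
        "--databases",
        *databases,
    ]


# "example_db" is a hypothetical name used only for illustration.
print(shlex.join(mysqldump_argv(["example_db"])))
```

Redirecting the output to a file on a machine with enough free space, then feeding that file to the mysql client after reinitialising the data directory, matches the dump-and-restore sequence described in the thread.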

I've done some edits and it seems to work fine. Many thanks!

I've found a little glitch, however, and I'm not sure if it's related to the outage. This page, which I created today, only has the option to edit it with the normal editor, but not the "form edit" that other pages have. I don't know if this may be caused by some pending jobs not being run, or by Redis failing to update something from the cache.

Tgr added a comment. Feb 24 2019, 7:46 PM

Yeah, the job queue is stalled. Redis is up, and I can't find any trace of a related cronjob or service, so I'm not sure how it was configured in the first place.
(Also, the Puppet service is stopped. Is that intentional due to recovery, or just because the machine is not its own Puppet master, or something that should be fixed?)