Page MenuHomePhabricator

db2069 storage crash
Closed, ResolvedPublic

Description

	10	 Repaired	Drive Array	04/10/2018 14:50	04/09/2018 19:58	1	Internal Storage Enclosure Device Failure (Bay 1, Box 1, Port 1I, Slot 0)
	9	 Repaired	Drive Array	04/09/2018 18:55	04/09/2018 18:55	1	Internal Storage Enclosure Device Failure (Bay 1, Box 1, Port 1I, Slot 0)
	8	 Repaired	Drive Array	04/09/2018 15:00	04/08/2018 12:30	1	Internal Storage Enclosure Device Failure (Bay 4, Box 1, Port 1I, Slot 0)
	7	 Caution	Drive Array	04/09/2018 18:52	04/06/2018 04:11	2	Internal Storage Enclosure Device Failure (Bay 1, Box 1, Port 1I, Slot 0)
	6	 Caution	POST Message	07/29/2016 07:01	07/29/2016 07:01	1	POST Error: 1792-Slot X Drive Array - Valid Data Found in Cache Module. Data will automatically be written to drive array.
	5	 Caution	POST Message	07/29/2016 07:01	07/29/2016 07:01	1	POST Error: 1719-A controller failure event occurred prior to this power-up
	4	 Critical	Drive Array	08/09/2018 10:05	07/29/2016 00:26	2	Drive Array Controller Failure (Slot 0)
	3	 Caution	POST Message	08/09/2014 01:14	08/09/2014 01:14	1	POST Error: 1785-Slot X Drive Array Not Configured
	2	 Caution	POST Message	08/03/2014 20:28	08/03/2014 20:28	1	POST Error: 1785-Slot X Drive Array Not Configured
	1	 Caution	POST Message	08/02/2014 00:05	08/02/2014 00:05	1	POST Error: 1785-Slot X Drive Array Not Configured
	0	 Informational	Maintenance	01/10/2015 02:17	01/10/2015 02:17	1	Maintenance note: IML cleared through hpasmcli
NVRAM data		 
Embedded Flash/SD-CARD		Controller firmware revision 2.09.00  
EEPROM		 
Host ROM		 
Supported host		 
Power Management Controller		Version 3.3 
CPLD - PAL0		ProLiant DL380p Gen8 System Programmable Logic Device version 0x32 
CPLD - PAL1		ProLiant DL380p Gen8 SAS Programmable Logic Device version 0x0C

On boot:

1719-Slot 0 Drive Array - A controller failure event occurred prior to thisve>
     power-up.  (Previous lock up code = 0x13)
1792-Slot 0 Drive Array - Valid Data Found in Write-Back Cache.
     Data will automatically be written to drive array.

Event Timeline

Restricted Application added a subscriber: Aklapper. · View Herald Transcript

Change 451606 had a related patch set uploaded (by Jcrespo; owner: Jcrespo):
[operations/mediawiki-config@master] mariadb: Depool db2069 due to crash

https://gerrit.wikimedia.org/r/451606

Change 451606 merged by jenkins-bot:
[operations/mediawiki-config@master] mariadb: Depool db2069 due to crash

https://gerrit.wikimedia.org/r/451606

jcrespo updated the task description. (Show Details)

mysql start log seems clean except for the mysql database, which is a non issue:

Aug 09 11:10:35 db2069 systemd[1]: Starting MariaDB database server...
Aug 09 11:10:35 db2069 mysqld[2964]: 2018-08-09 11:10:35 140389366982912 [Note] /opt/wmf-mariadb101/bin/mysqld (mysqld 10.1.32-MariaD
Aug 09 11:10:35 db2069 mysqld[2964]: 2018-08-09 11:10:35 140389366982912 [Warning] No argument was provided to --log-bin and neither 
Aug 09 11:10:35 db2069 mysqld[2964]: 2018-08-09 11:10:35 140389366982912 [ERROR] Plugin 'rpl_semi_sync_slave' already installed
Aug 09 11:10:35 db2069 mysqld[2964]: 2018-08-09 11:10:35 140389366982912 [Note] InnoDB: Using mutexes to ref count buffer pool pages
Aug 09 11:10:35 db2069 mysqld[2964]: 2018-08-09 11:10:35 140389366982912 [Note] InnoDB: The InnoDB memory heap is disabled
Aug 09 11:10:35 db2069 mysqld[2964]: 2018-08-09 11:10:35 140389366982912 [Note] InnoDB: Mutexes and rw_locks use GCC atomic builtins
Aug 09 11:10:35 db2069 mysqld[2964]: 2018-08-09 11:10:35 140389366982912 [Note] InnoDB: GCC builtin __atomic_thread_fence() is used f
Aug 09 11:10:35 db2069 mysqld[2964]: 2018-08-09 11:10:35 140389366982912 [Note] InnoDB: Compressed tables use zlib 1.2.3
Aug 09 11:10:35 db2069 mysqld[2964]: 2018-08-09 11:10:35 140389366982912 [Note] InnoDB: Using Linux native AIO
Aug 09 11:10:35 db2069 mysqld[2964]: 2018-08-09 11:10:35 140389366982912 [Note] InnoDB: Using SSE crc32 instructions
Aug 09 11:10:35 db2069 mysqld[2964]: 2018-08-09 11:10:35 140389366982912 [Note] InnoDB: Initializing buffer pool, size = 118.0G
Aug 09 11:10:41 db2069 mysqld[2964]: 2018-08-09 11:10:41 140389366982912 [Note] InnoDB: Completed initialization of buffer pool
Aug 09 11:10:42 db2069 mysqld[2964]: 2018-08-09 11:10:42 140389366982912 [Note] InnoDB: Highest supported file format is Barracuda.
Aug 09 11:10:42 db2069 mysqld[2964]: 2018-08-09 11:10:42 140389366982912 [Note] InnoDB: Starting crash recovery from checkpoint LSN=3
Aug 09 11:10:58 db2069 mysqld[2964]: 2018-08-09 11:10:58 140389366982912 [Note] InnoDB: Processed 3355 .ibd/.isl files
Aug 09 11:10:59 db2069 mysqld[2964]: 2018-08-09 11:10:59 140389366982912 [Note] InnoDB: Restoring possible half-written data pages fr
Aug 09 11:11:00 db2069 mysqld[2964]: 2018-08-09 11:11:00 140389366982912 [Note] InnoDB: Read redo log up to LSN=3296546275328
Aug 09 11:11:09 db2069 mysqld[2964]: InnoDB: Transaction 6075449528 was in the XA prepared state.
Aug 09 11:11:09 db2069 mysqld[2964]: InnoDB: Transaction 6075449528 was in the XA prepared state.
Aug 09 11:11:09 db2069 mysqld[2964]: InnoDB: 1 transaction(s) which must be rolled back or cleaned up
Aug 09 11:11:09 db2069 mysqld[2964]: InnoDB: in total 0 row operations to undo
Aug 09 11:11:09 db2069 mysqld[2964]: InnoDB: Trx id counter is 6075449856
Aug 09 11:11:09 db2069 mysqld[2964]: 2018-08-09 11:11:09 140389366982912 [Note] InnoDB: Starting final batch to recover 126646 pages 
Aug 09 11:11:15 db2069 mysqld[2964]: 2018-08-09 11:11:15 140252454823680 [Note] InnoDB: To recover: 96552 pages from log
Aug 09 11:11:30 db2069 mysqld[2964]: 2018-08-09 11:11:30 140252471609088 [Note] InnoDB: To recover: 48933 pages from log
Aug 09 11:11:40 db2069 mysqld[2964]: InnoDB: In a MySQL replication slave the last master binlog file
Aug 09 11:11:40 db2069 mysqld[2964]: InnoDB: position 0 506745989, file name db1029-bin.000714
Aug 09 11:11:40 db2069 mysqld[2964]: InnoDB: Last MySQL binlog file position 0 211697215, file name ./db2069-bin.000131
Aug 09 11:11:57 db2069 mysqld[2964]: 2018-08-09 11:11:57 140389366982912 [Note] InnoDB: 128 rollback segment(s) are active.
Aug 09 11:11:57 db2069 mysqld[2964]: 2018-08-09 11:11:57 140250374469376 [Note] InnoDB: Starting in background the rollback of recove
Aug 09 11:11:57 db2069 mysqld[2964]: 2018-08-09 11:11:57 140250374469376 [Note] InnoDB: Rollback of non-prepared transactions complet
Aug 09 11:11:57 db2069 mysqld[2964]: 2018-08-09 11:11:57 140389366982912 [Note] InnoDB: Waiting for purge to start
Aug 09 11:11:57 db2069 mysqld[2964]: 2018-08-09 11:11:57 140389366982912 [Note] InnoDB:  Percona XtraDB (http://www.percona.com) 5.6.
Aug 09 11:12:14 db2069 mysqld[2964]: 2018-08-09 11:12:14 140249778849536 [Note] InnoDB: Dumping buffer pool(s) not yet started
Aug 09 11:12:14 db2069 mysqld[2964]: 2018-08-09 11:12:14 7f8e723f8700 InnoDB: Loading buffer pool(s) from .//ib_buffer_pool
Aug 09 11:12:14 db2069 mysqld[2964]: 2018-08-09 11:12:14 140389366982912 [Note] Recovering after a crash using db2069-bin
Aug 09 11:12:24 db2069 mysqld[2964]: 2018-08-09 11:12:24 140389366982912 [Note] Starting crash recovery...
Aug 09 11:12:24 db2069 mysqld[2964]: 2018-08-09 11:12:24 7faef2596900  InnoDB: Starting recovery for XA transactions...
Aug 09 11:12:24 db2069 mysqld[2964]: 2018-08-09 11:12:24 7faef2596900  InnoDB: Transaction 6075449528 in prepared state after recover
Aug 09 11:12:24 db2069 mysqld[2964]: 2018-08-09 11:12:24 7faef2596900  InnoDB: Transaction contains changes to 3 rows
Aug 09 11:12:24 db2069 mysqld[2964]: 2018-08-09 11:12:24 7faef2596900  InnoDB: 1 transactions in prepared state after recovery
Aug 09 11:12:24 db2069 mysqld[2964]: 2018-08-09 11:12:24 140389366982912 [Note] Found 1 prepared transaction(s) in InnoDB
Aug 09 11:12:24 db2069 mysqld[2964]: 2018-08-09 11:12:24 140389366982912 [Note] Crash recovery finished.
Aug 09 11:12:24 db2069 mysqld[2964]: 2018-08-09 11:12:24 140389366982912 [Note] Server socket created on IP: '::'.
Aug 09 11:12:24 db2069 mysqld[2964]: 2018-08-09 11:12:24 140389366982912 [Note] Server socket created on IP: '::'.
Aug 09 11:12:24 db2069 mysqld[2964]: 2018-08-09 11:12:24 140389366982912 [ERROR] mysqld: Table './mysql/user' is marked as crashed an
Aug 09 11:12:24 db2069 mysqld[2964]: 2018-08-09 11:12:24 140389366982912 [Warning] Checking table:   './mysql/user'
Aug 09 11:12:24 db2069 mysqld[2964]: 2018-08-09 11:12:24 140389366982912 [ERROR] mysql.user: 1 client is using or hasn't closed the t
Aug 09 11:12:24 db2069 mysqld[2964]: 2018-08-09 11:12:24 140389366982912 [ERROR] mysqld: Table './mysql/db' is marked as crashed and 
Aug 09 11:12:24 db2069 mysqld[2964]: 2018-08-09 11:12:24 140389366982912 [Warning] Checking table:   './mysql/db'
Aug 09 11:12:24 db2069 mysqld[2964]: 2018-08-09 11:12:24 140389366982912 [ERROR] mysql.db: 1 client is using or hasn't closed the tab
Aug 09 11:12:24 db2069 mysqld[2964]: 2018-08-09 11:12:24 140389366982912 [Warning] 'proxies_priv' entry '@% root@db1029' ignored in -
Aug 09 11:12:24 db2069 mysqld[2964]: 2018-08-09 11:12:24 140389366982912 [ERROR] mysqld: Table './mysql/tables_priv' is marked as cra
Aug 09 11:12:24 db2069 mysqld[2964]: 2018-08-09 11:12:24 140389366982912 [Warning] Checking table:   './mysql/tables_priv'
Aug 09 11:12:24 db2069 mysqld[2964]: 2018-08-09 11:12:24 140389366982912 [ERROR] mysql.tables_priv: 1 client is using or hasn't close
Aug 09 11:12:24 db2069 mysqld[2964]: 2018-08-09 11:12:24 140389366982912 [ERROR] mysqld: Table './mysql/event' is marked as crashed a
Aug 09 11:12:24 db2069 mysqld[2964]: 2018-08-09 11:12:24 140389366982912 [Warning] Checking table:   './mysql/event'
Aug 09 11:12:24 db2069 mysqld[2964]: 2018-08-09 11:12:24 140389366982912 [ERROR] mysql.event: 1 client is using or hasn't closed the 
Aug 09 11:12:24 db2069 mysqld[2964]: 2018-08-09 11:12:24 140389366351616 [Note] Event Scheduler: scheduler thread started with id 2
Aug 09 11:12:24 db2069 mysqld[2964]: 2018-08-09 11:12:24 140389366982912 [Warning] Neither --relay-log nor --relay-log-index were use
Aug 09 11:12:24 db2069 mysqld[2964]: 2018-08-09 11:12:24 140389366982912 [Note] /opt/wmf-mariadb101/bin/mysqld: ready for connections
Aug 09 11:12:24 db2069 mysqld[2964]: Version: '10.1.32-MariaDB'  socket: '/run/mysqld/mysqld.sock'  port: 3306  MariaDB Server
Aug 09 11:12:24 db2069 systemd[1]: Started MariaDB database server

However, I wouldn't consider the data trustworthy without a row data check.

herron triaged this task as High priority.Aug 10 2018, 7:27 PM

I have compared the tables: echo_target_page, echo_event echo_notification across all wikis and no differences have been found.
So I believe we are good to go

Not sure exactly how you checked, but I saw one error in the first 20 wikis I checked:

echo angwikiquote | while read db; do echo "$db..."; ./compare.py wikidatawiki echo_event event_id db1069 db1064 db1120 db2034 db2033 db2069 --step=50000 --print-every=100 || break; done
2018-08-20T08:27:23.454539: row id 269950842/273742463, ETA: 00m22s, 0 chunk(s) found different
DIFFERENCE on db2034.codfw.wmnet:3306: WHERE event_id BETWEEN 273700842 AND 273742463
DIFFERENCE on db2033.codfw.wmnet:3306: WHERE event_id BETWEEN 273700842 AND 273742463
DIFFERENCE on db2069.codfw.wmnet:3306: WHERE event_id BETWEEN 273700842 AND 273742463
Execution ended, a total of 1 chunk(s) are different.

Let's fix the issue above and let me continue a full check- at the moment there is no rush to repool it, we can reevaluate later. We may find more issues, even if not relevant to the crash.

Yeah, looks like there might be inconsistencies eqiad <-> codfw for all hosts :(

Strange, maybe there is a bug or a race condition?

angwikisource...
2018-08-20T09:27:39.462984: row id 269950842/273790891, ETA: 00m22s, 0 chunk(s) found different
DIFFERENCE on db1064.eqiad.wmnet:3306: WHERE event_id BETWEEN 273750842 AND 273790891
DIFFERENCE on db1120.eqiad.wmnet:3306: WHERE event_id BETWEEN 273750842 AND 273790891
DIFFERENCE on db2034.codfw.wmnet:3306: WHERE event_id BETWEEN 273750842 AND 273790891
DIFFERENCE on db2033.codfw.wmnet:3306: WHERE event_id BETWEEN 273750842 AND 273790891
DIFFERENCE on db2069.codfw.wmnet:3306: WHERE event_id BETWEEN 273750842 AND 273790891
Execution ended, a total of 1 chunk(s) are different.

Maybe it was caught in the middle of a transaction or something?

I will get those differences fixed

I have fixed T201603#4513954 and I am going to re-run the checks across all the wikis again.
Will repool the db2069 and close this task and create a new ticket for x1 consistency as it will be easier to track there.

Mentioned in SAL (#wikimedia-operations) [2018-08-22T11:40:54Z] <marostegui@deploy1001> Synchronized wmf-config/db-codfw.php: Repool db2069 - T201603 (duration: 00m 56s)