
deployment-db03.deployment-prep.eqiad.wmflabs instance can not start
Closed, Resolved · Public

Description

deployment-db03.deployment-prep.eqiad.wmflabs can not start anymore. That is the instance hosting the beta cluster master database.

deployment-db04 is the slave and has some data corruption; that is T216067.

db03: https://horizon.wikimedia.org/project/instances/27460e9d-5548-4cd6-9472-548db6402294/

Request ID                               | Action | Start Time                | User ID   | Message
req-f05ae73e-5ae0-4099-8d04-9b401789d994 | Start  | Feb. 18, 2019, 11:44 a.m. | novaadmin | Error
req-849202a7-8029-46de-80c0-6ecbd4fb78cd | Start  | Feb. 18, 2019, 11:34 a.m. | krenair   | Error
req-19128734-227b-4987-a782-f57794bf6bf0 | Start  | Feb. 18, 2019, 11:27 a.m. | hashar    | Error
req-4bb3e498-1891-454b-9f84-6f9fbaf6be94 | Stop   | Feb. 13, 2019, 6:56 p.m.  | -         | -
req-ea3d5182-871c-4d55-a4f1-9d271e21d3a9 | Start  | Feb. 13, 2019, 6:12 p.m.  | novaadmin | -
req-8ae9a589-a783-45c3-9b71-bcdfb07d1d2a | Stop   | Feb. 13, 2019, 5:55 p.m.  | novaadmin | -
req-60bf4ecf-8a48-4d8f-8087-2ac8b7fd3c4a | Start  | Feb. 13, 2019, 3:34 p.m.  | novaadmin | -
req-71da83a0-32d5-4f38-a0c3-d8e87edf4943 | Stop   | Feb. 13, 2019, 3:29 p.m.  | -         | -
req-556d3f78-18b6-4f90-bfac-1fa0a616cebb | Start  | Feb. 13, 2019, 2:15 p.m.  | novaadmin | -
req-df32d30e-a324-4f97-9f90-2a27364885b0 | Stop   | Feb. 13, 2019, 1:38 p.m.  | -         | -
req-da85a36c-a91b-454f-bc61-10387643fafb | Reboot | Nov. 20, 2018, 10:53 a.m. | novaadmin | -

From IRC (time is UTC+1)

[12:31:27] <hasharAway> for when a WMCS admin is around,  deployment-db03.deployment-prep  can not start for some reason  https://horizon.wikimedia.org/project/instances/27460e9d-5548-4cd6-9472-548db6402294/
[12:31:44] <hasharAway> that is the instance holding the database for the beta cluster. 
[12:31:56] <hasharAway> https://phabricator.wikimedia.org/T216067 is slightly related but that is for the database slave ( deployment-db04 )
[12:32:02] <hasharAway> (i am not around today sorry)
[12:32:43] <Krenair> that's odd it was started a few days ago?
[12:33:53] <Krenair> looks like it was up and down like mad on Wednesday
[12:34:00] <Krenair> then failed to start this morning
[12:34:10] <Krenair> didn't realise it was stopped
[12:34:58] <Krenair> Yeah something is very very wrong
[12:35:03] <arturo> deployment-db servers were running in one of the failing cloudvirts
[12:35:33] <Krenair> yeah but I thought they did come back up eventually
[12:35:40] <Krenair> it appears that didn't last long
[12:35:41] <Krenair>  Request ID
[12:35:41] <Krenair>  Action
[12:35:41] <Krenair>  Start Time
[12:35:41] <Krenair>  User ID
[12:35:42] <Krenair>  Message
[12:35:45] <Krenair> req-849202a7-8029-46de-80c0-6ecbd4fb78cd  Start  18 Feb 2019, 11:34 a.m.  krenair  Error
[12:35:47] <Krenair> req-19128734-227b-4987-a782-f57794bf6bf0  Start  18 Feb 2019, 11:27 a.m.  hashar  Error 
[12:36:01] <Krenair> additionally the console log is broken
[12:36:04] <Krenair> can someone repair that?
[12:36:19] <arturo> I would suggest you create new VMs
[12:38:17] <Krenair> please try to start it anyway
[12:44:51] <arturo> Krenair: just issued the command
[12:45:48] <Krenair> power state: no state
[12:46:04] <Krenair> yeah no luck
[12:46:10] <Krenair> any more helpful console output?
[12:47:33] <arturo> I would be happy to debug and fix this under other circumstances. I would say we have 2 options: 1) create new VMs 2) wait a bit to see if today I can have some spare cycles to look into this
[12:49:15] <Krenair> luckily deployment-db04 is still up so we can probably retrieve the data from there
[12:49:54] <Krenair> I'm going to try doing the copy from there to db05 now
[12:51:07] <Krenair> I have to say though, we do need to figure out how db03 and db04 came to be on the same host so we can prevent that happening again
[12:51:44] <arturo> Krenair: right now the nova scheduler doesn't know about the roles of a given server
[12:52:12] <arturo> we have a script https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Maintenance#wmcs-spreadcheck that can be used to report on this
[12:52:15] <Krenair> sure but I expect they were originally created on different hosts, then at some point someone moved them?
[12:53:53] <arturo> I doubt we would be checking collocation if we are rushing to manually drain a cloudvirt (to prevent further data loss)
[12:58:10] <Krenair> maybe not immediately in an emergency but some time after
[13:00:36] <arturo> fair enough, but in this case, we chained several incidents since the original issue in that cloudvirt
[13:01:12] <arturo> we had very little time for cleanups, if any
[13:01:28] <Krenair> in this case I wouldn't've expected it yet
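
As discussed in the IRC log above, the scheduler does not know about instance roles, and wmcs-spreadcheck is the existing tool for reporting on this. Purely to illustrate the idea (this is a minimal sketch, not the real wmcs-spreadcheck script), a colocation report could look roughly like the following, assuming admin credentials, the standard openstack CLI, and the convention that a role is the instance name minus its trailing digits:

#!/bin/bash
# Illustrative sketch only -- not the actual wmcs-spreadcheck implementation.
# Requires admin credentials so that "Host" (the hypervisor) is populated.
PROJECT=deployment-prep   # example project for this incident

openstack server list --project "$PROJECT" --long -f value -c Name -c Host |
  awk '{
    role = $1; sub(/[0-9]+$/, "", role)   # deployment-db03 -> deployment-db
    key = role ":" $2                     # role + hypervisor host
    if (seen[key]++)
      print "WARNING: more than one " role " instance on " $2
  }'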

Event Timeline

GTirloni triaged this task as Medium priority.

root@cloudvirt1026:/var/lib/nova/instances/27460e9d-5548-4cd6-9472-548db6402294# qemu-img check ./disk
qemu-img: Could not open './disk': Could not open backing file: Could not open '/var/lib/nova/instances/_base/76a35d5edf0cd19144cac5d4b0a44e7f9212fa14': No such file or directory

Copied backing file from cloudvirt1022 and started VM.

CC @Andrew as this might be a problem with the cold-migrate script.
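
For reference, a minimal sketch of the kind of check and repair described above: verify which backing file the qcow2 overlay expects and, if it is missing locally, copy it from another cloudvirt. Paths come from the qemu-img output above; the source hostname and the rsync invocation are illustrative assumptions, not the commands that were actually run.

#!/bin/bash
# Sketch only: confirm the expected backing file and restore it if absent.
INSTANCE=27460e9d-5548-4cd6-9472-548db6402294
DISK=/var/lib/nova/instances/$INSTANCE/disk

# Show the backing file the overlay points at (reads the top image header only).
qemu-img info "$DISK" | grep '^backing file'

# Path reported missing in the qemu-img check output above.
BACKING=/var/lib/nova/instances/_base/76a35d5edf0cd19144cac5d4b0a44e7f9212fa14

if [ ! -f "$BACKING" ]; then
    # Hypothetical copy; in this incident cloudvirt1022 still had the file.
    rsync -av cloudvirt1022:"$BACKING" "$BACKING"
fi

# Re-run the consistency check before starting the VM again.
qemu-img check "$DISK"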

The instance is reachable but the service is not happy:

The log sequence numbers 212565209189 and 212565209189 in ibdata files do not match the log sequence number 233682420105 in the ib_logfiles!
[ERROR] InnoDB: Tried to read 16384 bytes at offset 0. Was only able to read 0.
InnoDB: Operating system error number 5 in a file operation.
InnoDB: Error number 5 means 'Input/output error'.

-- Logs begin at Mon 2019-02-18 13:24:47 UTC, end at Mon 2019-02-18 17:29:08 UTC. --
Feb 18 13:24:48 mysql[652]: /opt/wmf-mariadb10
Feb 18 13:24:48 mysql[652]: Starting MySQL
Feb 18 13:24:48 mysql[652]: .
Feb 18 13:24:48 mysqld[837]: 190218 13:24:48 [Note] /opt/wmf-mariadb10/bin/mysqld (mysqld 10.0.34-MariaDB) starting as process 836 ... 
Feb 18 13:24:53 mysqld[837]: 190218 13:24:53 [Warning] No argument was provided to --log-bin and neither --log-basename or --log-bin-index where used;  This may cause repliction to break when this server acts as a master and has its hostname changed! Please use '--log-basename=deployment-db03' or '--log-bin=deployment-db03-bin' to avoid this problem.
Feb 18 13:24:53 mysqld[837]: 190218 13:24:53 [Note] InnoDB: Using mutexes to ref count buffer pool pages
Feb 18 13:24:53 mysqld[837]: 190218 13:24:53 [Note] InnoDB: The InnoDB memory heap is disabled
Feb 18 13:24:53 mysqld[837]: 190218 13:24:53 [Note] InnoDB: Mutexes and rw_locks use GCC atomic builtins 
Feb 18 13:24:53 mysqld[837]: 190218 13:24:53 [Note] InnoDB: GCC builtin __atomic_thread_fence() is used for memory barrier
Feb 18 13:24:53 mysqld[837]: 190218 13:24:53 [Note] InnoDB: Compressed tables use zlib 1.2.3
Feb 18 13:24:53 mysqld[837]: 190218 13:24:53 [Note] InnoDB: Using Linux native AIO
Feb 18 13:24:53 mysqld[837]: 190218 13:24:53 [Note] InnoDB: Using CPU crc32 instructions
Feb 18 13:24:53 mysqld[837]: 190218 13:24:53 [Note] InnoDB: Initializing buffer pool, size = 12.0G
Feb 18 13:24:53 mysqld[837]: 190218 13:24:53 [Note] InnoDB: Completed initialization of buffer pool
Feb 18 13:24:54 mysqld[837]: 190218 13:24:54 [Note] InnoDB: Highest supported file format is Barracuda.
Feb 18 13:24:54 mysqld[837]: 190218 13:24:54 [Note] InnoDB: The log sequence numbers 212565209189 and 212565209189 in ibdata files do not match the log sequence number 233682420105 in the ib_logfiles!
Feb 18 13:24:54 mysqld[837]: 190218 13:24:54 [ERROR] InnoDB: Tried to read 16384 bytes at offset 0. Was only able to read 0. 
Feb 18 13:24:54 mysqld[837]: 2019-02-18 13:24:54 7ff6015b4780  InnoDB: Operating system error number 5 in a file operation. 
Feb 18 13:24:54 mysqld[837]: InnoDB: Error number 5 means 'Input/output error'.
Feb 18 13:24:54 mysqld[837]: InnoDB: Some operating system error numbers are described at
Feb 18 13:24:54 mysqld[837]: InnoDB: http://dev.mysql.com/doc/refman/5.6/en/operating-system-error-codes.html
Feb 18 13:24:54 mysqld[837]: 190218 13:24:54 [ERROR] InnoDB: File (unknown): 'read' returned OS error 105. Cannot continue operation
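
The LSN mismatch plus OS error 5 (EIO) above means the redo log and data files on that disk no longer agree and part of them cannot even be read. A common salvage approach in that situation, sketched below only as an illustration and not as the procedure actually followed, is to start mysqld with a low innodb_force_recovery level and dump whatever is still readable. The config include path, service name, and credentials handling are assumptions (this host runs a custom /opt/wmf-mariadb10 build, so the paths may differ):

#!/bin/bash
# Illustrative InnoDB salvage sketch. innodb_force_recovery is a real
# MariaDB/MySQL option; values of 4 or greater can permanently corrupt data,
# so start at 1 and only read/dump while it is set.

# 1. Add a temporary override (include path assumed; adjust for this build).
sudo tee /etc/mysql/conf.d/force-recovery.cnf >/dev/null <<'EOF'
[mysqld]
innodb_force_recovery = 1
EOF

# 2. Start the service (unit name assumed from the journal above).
sudo systemctl start mysql

# 3. Dump everything that is still readable (credentials omitted here).
mysqldump --all-databases --single-transaction > /srv/recovery-dump.sql

# 4. Remove the override before rebuilding or re-importing the data.
sudo rm /etc/mysql/conf.d/force-recovery.cnf
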
Addshore raised the priority of this task from Medium to High. (Feb 19 2019, 3:02 PM)
Addshore subscribed.

I'm not sure why this is only "Normal"; I'm going to at least bump this to High.
We are essentially ignoring one of our QA/CI steps here, which will increase the chances of incidents in production.
Perhaps this should even be UBN?

Addshore added a project: Wikidata.
Addshore moved this task from incoming to monitoring on the Wikidata board.
Addshore moved this task from Unsorted 💣 to Watching 👀 on the User-Addshore board.
bd808 added subscribers: GTirloni, bd808.

I'm not sure why this is only "Normal"; I'm going to at least bump this to High.
We are essentially ignoring one of our QA/CI steps here, which will increase the chances of incidents in production.
Perhaps this should even be UBN?

Who are you trying to push to move faster? What is the desired outcome of this task?

The instance is reachable but the service is not happy:

The log sequence numbers 212565209189 and 212565209189 in ibdata files do not match the log sequence number 233682420105 in the ib_logfiles!
[ERROR] InnoDB: Tried to read 16384 bytes at offset 0. Was only able to read 0.
InnoDB: Operating system error number 5 in a file operation.
InnoDB: Error number 5 means 'Input/output error'.

Fixing these errors is beyond the core services offered by the Cloud Services team.

Who are you trying to push to move faster? What is the desired outcome of this task?
Fixing these errors is beyond the core services offered by the Cloud Services team.

I guess this would be on Release-Engineering-Team's plate then?
I'm glad that I commented, as clarifying that this will not be done by cloud-services-team is already a step in the right direction.
I personally want to avoid having to rely on hindsight if something goes wrong with the train this week that could have been spotted in beta.

hashar claimed this task.

Closing this task since it is going off topic. The original intent was to get the instance up again and @GTirloni fixed it yesterday (thank you).

I filed T216635 for the InnoDB recovery.