
deployment-db03.deployment-prep.eqiad.wmflabs instance can not start
Closed, Resolved · Public

Description

deployment-db03.deployment-prep.eqiad.wmflabs can not start anymore. That is the instance hosting the beta cluster master database.

deployment-db04 is the slave and has some data corruption; that is T216067.

db03: https://horizon.wikimedia.org/project/instances/27460e9d-5548-4cd6-9472-548db6402294/

Request ID                               | Action | Start Time                | User ID   | Message
req-f05ae73e-5ae0-4099-8d04-9b401789d994 | Start  | Feb. 18, 2019, 11:44 a.m. | novaadmin | Error
req-849202a7-8029-46de-80c0-6ecbd4fb78cd | Start  | Feb. 18, 2019, 11:34 a.m. | krenair   | Error
req-19128734-227b-4987-a782-f57794bf6bf0 | Start  | Feb. 18, 2019, 11:27 a.m. | hashar    | Error
req-4bb3e498-1891-454b-9f84-6f9fbaf6be94 | Stop   | Feb. 13, 2019, 6:56 p.m.  | -         | -
req-ea3d5182-871c-4d55-a4f1-9d271e21d3a9 | Start  | Feb. 13, 2019, 6:12 p.m.  | novaadmin | -
req-8ae9a589-a783-45c3-9b71-bcdfb07d1d2a | Stop   | Feb. 13, 2019, 5:55 p.m.  | novaadmin | -
req-60bf4ecf-8a48-4d8f-8087-2ac8b7fd3c4a | Start  | Feb. 13, 2019, 3:34 p.m.  | novaadmin | -
req-71da83a0-32d5-4f38-a0c3-d8e87edf4943 | Stop   | Feb. 13, 2019, 3:29 p.m.  | -         | -
req-556d3f78-18b6-4f90-bfac-1fa0a616cebb | Start  | Feb. 13, 2019, 2:15 p.m.  | novaadmin | -
req-df32d30e-a324-4f97-9f90-2a27364885b0 | Stop   | Feb. 13, 2019, 1:38 p.m.  | -         | -
req-da85a36c-a91b-454f-bc61-10387643fafb | Reboot | Nov. 20, 2018, 10:53 a.m. | novaadmin | -

From IRC (time is UTC+1)

[12:31:27] <hasharAway> for when a WMCS admin is around,  deployment-db03.deployment-prep  can not start for some reason  https://horizon.wikimedia.org/project/instances/27460e9d-5548-4cd6-9472-548db6402294/
[12:31:44] <hasharAway> that is the instance holding the database for the beta cluster. 
[12:31:56] <hasharAway> https://phabricator.wikimedia.org/T216067 is slightly related but that is for the database slave ( deployment-db04 )
[12:32:02] <hasharAway> (i am not around today sorry)
[12:32:43] <Krenair> that's odd it was started a few days ago?
[12:33:53] <Krenair> looks like it was up and down like mad on Wednesday
[12:34:00] <Krenair> then failed to start this morning
[12:34:10] <Krenair> didn't realise it was stopped
[12:34:58] <Krenair> Yeah something is very very wrong
[12:35:03] <arturo> deployment-db servers were running in one of the failing cloudvirts
[12:35:33] <Krenair> yeah but I thought they did come back up eventually
[12:35:40] <Krenair> it appears that didn't last long
[12:35:41] <Krenair>  Request ID
[12:35:41] <Krenair>  Action
[12:35:41] <Krenair>  Start Time
[12:35:41] <Krenair>  User ID
[12:35:42] <Krenair>  Message
[12:35:45] <Krenair> req-849202a7-8029-46de-80c0-6ecbd4fb78cd  Start  18 Feb 2019, 11:34 a.m.  krenair  Error
[12:35:47] <Krenair> req-19128734-227b-4987-a782-f57794bf6bf0  Start  18 Feb 2019, 11:27 a.m.  hashar  Error 
[12:36:01] <Krenair> additionally the console log is broken
[12:36:04] <Krenair> can someone repair that?
[12:36:19] <arturo> I would suggest you create new VMs
[12:38:17] <Krenair> please try to start it anyway
[12:44:51] <arturo> Krenair: just issued the command
[12:45:48] <Krenair> power state: no state
[12:46:04] <Krenair> yeah no luck
[12:46:10] <Krenair> any more helpful console output?
[12:47:33] <arturo> I would be happy to debug and fix this under other circumstances. I would say we have 2 options: 1) create new VMs 2) wait a bit to see if today I can have some spare cycles to look into this
[12:49:15] <Krenair> luckily deployment-db04 is still up so we can probably retrieve the data from there
[12:49:54] <Krenair> I'm going to try doing the copy from there to db05 now
[12:51:07] <Krenair> I have to say though, we do need to figure out how db03 and db04 came to be on the same host so we can prevent that happening again
[12:51:44] <arturo> Krenair: right now the nova scheduler doesn't know about the roles of a given server
[12:52:12] <arturo> we have a script https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Maintenance#wmcs-spreadcheck that can be used to report on this
[12:52:15] <Krenair> sure but I expect they were originally created on different hosts, then at some point someone moved them?
[12:53:53] <arturo> I doubt we would be checking collocation if we are rushing to manually drain a cloudvirt (to prevent further data loss)
[12:58:10] <Krenair> maybe not immediately in an emergency but some time after
[13:00:36] <arturo> fair enough, but in this case, we chained several incidents since the original issue in that cloudvirt
[13:01:12] <arturo> we had very little time for cleanups, if any
[13:01:28] <Krenair> in this case I wouldn't've expected it yet
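
As discussed in the IRC log above, the scheduler does not know about instance roles, and wmcs-spreadcheck is the existing tool for reporting on this. Purely to illustrate the idea (this is a minimal sketch, not the real wmcs-spreadcheck script), a colocation report could look roughly like the following, assuming admin credentials, the standard openstack CLI, and the convention that a role is the instance name minus its trailing digits:

#!/bin/bash
# Illustrative sketch only -- not the actual wmcs-spreadcheck implementation.
# Requires admin credentials so that "Host" (the hypervisor) is populated.
PROJECT=deployment-prep   # example project for this incident

openstack server list --project "$PROJECT" --long -f value -c Name -c Host |
  awk '{
    role = $1; sub(/[0-9]+$/, "", role)   # deployment-db03 -> deployment-db
    key = role ":" $2                     # role + hypervisor host
    if (seen[key]++)
      print "WARNING: more than one " role " instance on " $2
  }'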

Event Timeline

GTirloni triaged this task as Medium priority.

root@cloudvirt1026:/var/lib/nova/instances/27460e9d-5548-4cd6-9472-548db6402294# qemu-img check ./disk
qemu-img: Could not open './disk': Could not open backing file: Could not open '/var/lib/nova/instances/_base/76a35d5edf0cd19144cac5d4b0a44e7f9212fa14': No such file or directory

Copied backing file from cloudvirt1022 and started VM.

CC @Andrew as this might be a problem with the cold-migrate script.
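
For reference, a minimal sketch of the kind of check and repair described above: verify which backing file the qcow2 overlay expects and, if it is missing locally, copy it from another cloudvirt. Paths come from the qemu-img output above; the source hostname and the rsync invocation are illustrative assumptions, not the commands that were actually run.

#!/bin/bash
# Sketch only: confirm the expected backing file and restore it if absent.
INSTANCE=27460e9d-5548-4cd6-9472-548db6402294
DISK=/var/lib/nova/instances/$INSTANCE/disk

# Show the backing file the overlay points at (reads the top image header only).
qemu-img info "$DISK" | grep '^backing file'

# Path reported missing in the qemu-img check output above.
BACKING=/var/lib/nova/instances/_base/76a35d5edf0cd19144cac5d4b0a44e7f9212fa14

if [ ! -f "$BACKING" ]; then
    # Hypothetical copy; in this incident cloudvirt1022 still had the file.
    rsync -av cloudvirt1022:"$BACKING" "$BACKING"
fi

# Re-run the consistency check before starting the VM again.
qemu-img check "$DISK"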

The instance is reachable but the service is not happy:

The log sequence numbers 212565209189 and 212565209189 in ibdata files do not match the log sequence number 233682420105 in the ib_logfiles!
[ERROR] InnoDB: Tried to read 16384 bytes at offset 0. Was only able to read 0.
InnoDB: Operating system error number 5 in a file operation.
InnoDB: Error number 5 means 'Input/output error'.

-- Logs begin at Mon 2019-02-18 13:24:47 UTC, end at Mon 2019-02-18 17:29:08 UTC. --
Feb 18 13:24:48 mysql[652]: /opt/wmf-mariadb10
Feb 18 13:24:48 mysql[652]: Starting MySQL
Feb 18 13:24:48 mysql[652]: .
Feb 18 13:24:48 mysqld[837]: 190218 13:24:48 [Note] /opt/wmf-mariadb10/bin/mysqld (mysqld 10.0.34-MariaDB) starting as process 836 ... 
Feb 18 13:24:53 mysqld[837]: 190218 13:24:53 [Warning] No argument was provided to --log-bin and neither --log-basename or --log-bin-index where used;  This may cause repliction to break when this server acts as a master and has its hostname changed! Please use '--log-basename=deployment-db03' or '--log-bin=deployment-db03-bin' to avoid this problem.
Feb 18 13:24:53 mysqld[837]: 190218 13:24:53 [Note] InnoDB: Using mutexes to ref count buffer pool pages
Feb 18 13:24:53 mysqld[837]: 190218 13:24:53 [Note] InnoDB: The InnoDB memory heap is disabled
Feb 18 13:24:53 mysqld[837]: 190218 13:24:53 [Note] InnoDB: Mutexes and rw_locks use GCC atomic builtins 
Feb 18 13:24:53 mysqld[837]: 190218 13:24:53 [Note] InnoDB: GCC builtin __atomic_thread_fence() is used for memory barrier
Feb 18 13:24:53 mysqld[837]: 190218 13:24:53 [Note] InnoDB: Compressed tables use zlib 1.2.3
Feb 18 13:24:53 mysqld[837]: 190218 13:24:53 [Note] InnoDB: Using Linux native AIO
Feb 18 13:24:53 mysqld[837]: 190218 13:24:53 [Note] InnoDB: Using CPU crc32 instructions
Feb 18 13:24:53 mysqld[837]: 190218 13:24:53 [Note] InnoDB: Initializing buffer pool, size = 12.0G
Feb 18 13:24:53 mysqld[837]: 190218 13:24:53 [Note] InnoDB: Completed initialization of buffer pool
Feb 18 13:24:54 mysqld[837]: 190218 13:24:54 [Note] InnoDB: Highest supported file format is Barracuda.
Feb 18 13:24:54 mysqld[837]: 190218 13:24:54 [Note] InnoDB: The log sequence numbers 212565209189 and 212565209189 in ibdata files do not match the log sequence number 233682420105 in the ib_logfiles!
Feb 18 13:24:54 mysqld[837]: 190218 13:24:54 [ERROR] InnoDB: Tried to read 16384 bytes at offset 0. Was only able to read 0. 
Feb 18 13:24:54 mysqld[837]: 2019-02-18 13:24:54 7ff6015b4780  InnoDB: Operating system error number 5 in a file operation. 
Feb 18 13:24:54 mysqld[837]: InnoDB: Error number 5 means 'Input/output error'.
Feb 18 13:24:54 mysqld[837]: InnoDB: Some operating system error numbers are described at
Feb 18 13:24:54 mysqld[837]: InnoDB: http://dev.mysql.com/doc/refman/5.6/en/operating-system-error-codes.html
Feb 18 13:24:54 mysqld[837]: 190218 13:24:54 [ERROR] InnoDB: File (unknown): 'read' returned OS error 105. Cannot continue operation
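
The LSN mismatch plus OS error 5 (EIO) above means the redo log and data files on that disk no longer agree and part of them cannot even be read. A common salvage approach in that situation, sketched below only as an illustration and not as the procedure actually followed, is to start mysqld with a low innodb_force_recovery level and dump whatever is still readable. The config include path, service name, and credentials handling are assumptions (this host runs a custom /opt/wmf-mariadb10 build, so the paths may differ):

#!/bin/bash
# Illustrative InnoDB salvage sketch. innodb_force_recovery is a real
# MariaDB/MySQL option; values of 4 or greater can permanently corrupt data,
# so start at 1 and only read/dump while it is set.

# 1. Add a temporary override (include path assumed; adjust for this build).
sudo tee /etc/mysql/conf.d/force-recovery.cnf >/dev/null <<'EOF'
[mysqld]
innodb_force_recovery = 1
EOF

# 2. Start the service (unit name assumed from the journal above).
sudo systemctl start mysql

# 3. Dump everything that is still readable (credentials omitted here).
mysqldump --all-databases --single-transaction > /srv/recovery-dump.sql

# 4. Remove the override before rebuilding or re-importing the data.
sudo rm /etc/mysql/conf.d/force-recovery.cnf
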
Addshore raised the priority of this task from Medium to High. (Feb 19 2019, 3:02 PM)
Addshore subscribed.

I'm not sure why this is only "Normal"; I'm going to at least bump this to High.
We are essentially ignoring one of our QA/CI steps here, which will increase the chances of incidents in production.
Perhaps this should even be UBN?

Addshore added a project: Wikidata.
Addshore moved this task from incoming to monitoring on the Wikidata board.
Addshore moved this task from Unsorted 💣 to Watching 👀 on the User-Addshore board.
bd808 added subscribers: GTirloni, bd808.

I'm not sure why this is only "Normal"; I'm going to at least bump this to High.
We are essentially ignoring one of our QA/CI steps here, which will increase the chances of incidents in production.
Perhaps this should even be UBN?

Who are you trying to push to move faster? What is the desired outcome of this task?

The instance is reachable but the service is not happy:

The log sequence numbers 212565209189 and 212565209189 in ibdata files do not match the log sequence number 233682420105 in the ib_logfiles!
[ERROR] InnoDB: Tried to read 16384 bytes at offset 0. Was only able to read 0.
InnoDB: Operating system error number 5 in a file operation.
InnoDB: Error number 5 means 'Input/output error'.

Fixing these errors is beyond the core services offered by the Cloud Services team.

Who are you trying to push to move faster? What is the desired outcome of this task?
Fixing these errors is beyond the core services offered by the Cloud Services team.

I guess this would be on Release-Engineering-Team's plate then?
I'm glad that I commented, as clarifying that this will not be done by cloud-services-team is already a step in the right direction.
I personally want to avoid having to rely on hindsight if something goes wrong with the train this week that could have been spotted in beta.

hashar claimed this task.

Closing this task since it is going off topic. The original intent was to get the instance up again and @GTirloni fixed it yesterday (thank you).

I filed T216635 for the InnoDB recovery.