deployment-db03.deployment-prep.eqiad.wmflabs cannot start anymore. That is the instance hosting the beta cluster master database.
deployment-db04 is the slave; it has data corruption of its own, tracked in T216067.
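For reference, the slave's replication and corruption state can be checked along these lines (a minimal sketch only; it assumes shell access to deployment-db04 and MariaDB root auth via the unix socket, neither of which is recorded in this task):

```
# Sketch: inspect replication health on the slave (deployment-db04).
# Assumes local shell access and unix-socket root auth; not taken from the task.
sudo mysql -e "SHOW SLAVE STATUS\G" \
  | grep -E 'Slave_IO_Running|Slave_SQL_Running|Seconds_Behind_Master|Last_Error'
# Verify table integrity where corruption is suspected (T216067):
sudo mysqlcheck --all-databases --check
```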
db03: https://horizon.wikimedia.org/project/instances/27460e9d-5548-4cd6-9472-548db6402294/
| Request ID | Action | Start Time | User ID | Message |
| --- | --- | --- | --- | --- |
| req-f05ae73e-5ae0-4099-8d04-9b401789d994 | Start | Feb. 18, 2019, 11:44 a.m. | novaadmin | Error |
| req-849202a7-8029-46de-80c0-6ecbd4fb78cd | Start | Feb. 18, 2019, 11:34 a.m. | krenair | Error |
| req-19128734-227b-4987-a782-f57794bf6bf0 | Start | Feb. 18, 2019, 11:27 a.m. | hashar | Error |
| req-4bb3e498-1891-454b-9f84-6f9fbaf6be94 | Stop | Feb. 13, 2019, 6:56 p.m. | - | - |
| req-ea3d5182-871c-4d55-a4f1-9d271e21d3a9 | Start | Feb. 13, 2019, 6:12 p.m. | novaadmin | - |
| req-8ae9a589-a783-45c3-9b71-bcdfb07d1d2a | Stop | Feb. 13, 2019, 5:55 p.m. | novaadmin | - |
| req-60bf4ecf-8a48-4d8f-8087-2ac8b7fd3c4a | Start | Feb. 13, 2019, 3:34 p.m. | novaadmin | - |
| req-71da83a0-32d5-4f38-a0c3-d8e87edf4943 | Stop | Feb. 13, 2019, 3:29 p.m. | - | - |
| req-556d3f78-18b6-4f90-bfac-1fa0a616cebb | Start | Feb. 13, 2019, 2:15 p.m. | novaadmin | - |
| req-df32d30e-a324-4f97-9f90-2a27364885b0 | Stop | Feb. 13, 2019, 1:38 p.m. | - | - |
| req-da85a36c-a91b-454f-bc61-10387643fafb | Reboot | Nov. 20, 2018, 10:53 a.m. | novaadmin | - |
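The same action history (and the console log discussed in the IRC log below) can also be pulled with the OpenStack CLI; a sketch, assuming admin credentials for the project are sourced:

```
# Sketch: query the instance's action history, power state and console log
# from the OpenStack CLI (assumes project credentials are sourced).
openstack server event list 27460e9d-5548-4cd6-9472-548db6402294
openstack server show 27460e9d-5548-4cd6-9472-548db6402294 \
    -c status -c 'OS-EXT-STS:power_state'
openstack console log show 27460e9d-5548-4cd6-9472-548db6402294
```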
From IRC (times are UTC+1):

[12:31:27] <hasharAway> for when a WMCS admin is around, deployment-db03.deployment-prep can not start for some reason https://horizon.wikimedia.org/project/instances/27460e9d-5548-4cd6-9472-548db6402294/
[12:31:44] <hasharAway> that is the instance holding the database for the beta cluster.
[12:31:56] <hasharAway> https://phabricator.wikimedia.org/T216067 is slightly related but that is for the database slave ( deployment-db04 )
[12:32:02] <hasharAway> (i am not around today sorry)
[12:32:43] <Krenair> that's odd it was started a few days ago?
[12:33:53] <Krenair> looks like it was up and down like mad on Wednesday
[12:34:00] <Krenair> then failed to start this morning
[12:34:10] <Krenair> didn't realise it was stopped
[12:34:58] <Krenair> Yeah something is very very wrong
[12:35:03] <arturo> deployment-db servers were running in one of the failing cloudvirts
[12:35:33] <Krenair> yeah but I thought they did come back up eventually
[12:35:40] <Krenair> it appears that didn't last long
[12:35:41] <Krenair> Request ID
[12:35:41] <Krenair> Action
[12:35:41] <Krenair> Start Time
[12:35:41] <Krenair> User ID
[12:35:42] <Krenair> Message
[12:35:45] <Krenair> req-849202a7-8029-46de-80c0-6ecbd4fb78cd Start 18 Feb 2019, 11:34 a.m. krenair Error
[12:35:47] <Krenair> req-19128734-227b-4987-a782-f57794bf6bf0 Start 18 Feb 2019, 11:27 a.m. hashar Error
[12:36:01] <Krenair> additionally the console log is broken
[12:36:04] <Krenair> can someone repair that?
[12:36:19] <arturo> I would suggest you create new VMs
[12:38:17] <Krenair> please try to start it anyway
[12:44:51] <arturo> Krenair: just issued the command
[12:45:48] <Krenair> power state: no state
[12:46:04] <Krenair> yeah no luck
[12:46:10] <Krenair> any more helpful console output?
[12:47:33] <arturo> I would be happy to debug and fix this under other circumstances. I would say we have 2 options: 1) create new VMs 2) wait a bit to see if today I can have some spare cycles to look into this
[12:49:15] <Krenair> luckily deployment-db04 is still up so we can probably retrieve the data from there
[12:49:54] <Krenair> I'm going to try doing the copy from there to db05 now
[12:51:07] <Krenair> I have to say though, we do need to figure out how db03 and db04 came to be on the same host so we can prevent that happening again
[12:51:44] <arturo> Krenair: right now the nova scheduler doesn't know about the roles of a given server
[12:52:12] <arturo> we have a script https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Maintenance#wmcs-spreadcheck that can be used to report on this
[12:52:15] <Krenair> sure but I expect they were originally created on different hosts, then at some point someone moved them?
[12:53:53] <arturo> I doubt we would be checking collocation if we are rushing to manually drain a cloudvirt (to prevent further data loss)
[12:58:10] <Krenair> maybe not immediately in an emergency but some time after
[13:00:36] <arturo> fair enough, but in this case, we chained several incidents since the original issue in that cloudvirt
[13:01:12] <arturo> we had very little time for cleanups, if any
[13:01:28] <Krenair> in this case I wouldn't've expected it yet
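The db04 → db05 copy Krenair mentions could be done with a logical dump along these lines (a sketch only; the actual method used is not recorded in the log, and deployment-db05 is assumed to be a freshly built replacement instance):

```
# Sketch of the db04 -> db05 copy discussed above; the real procedure is not
# recorded in the log, and credentials are omitted. Dumps the surviving
# slave into the assumed replacement instance.
mysqldump --all-databases --single-transaction --events --routines \
    -h deployment-db04.deployment-prep.eqiad.wmflabs \
  | mysql -h deployment-db05.deployment-prep.eqiad.wmflabs
# Note: with corrupt tables (T216067) the dump may abort; --force makes
# mysqldump continue past SQL errors at the cost of an incomplete dump.
```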
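On the co-location point: wmcs-spreadcheck (linked above) only reports on spread after the fact, but plain Nova does offer anti-affinity server groups that keep members on distinct hypervisors at scheduling time. A sketch; whether these are exposed in this Cloud VPS setup is not stated in the log, and the group, flavor, and image names here are illustrative:

```
# Sketch: an anti-affinity server group keeps its members on different
# hypervisors. Names and flavor/image values are illustrative only.
openstack server group create --policy anti-affinity deployment-db
openstack server create --flavor m1.medium --image debian-9.0 \
    --hint group=<server-group-uuid> deployment-db06
```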