
compiler1002.puppet-diffs.eqiad.wmflabs instance is down
Closed, Resolved (Public)


In the aftermath of the WMCS outage last week, the instance compiler1002.puppet-diffs.eqiad.wmflabs is unreachable.

The CI Jenkins cannot reach it over SSH.
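
For anyone wanting to confirm the symptom by hand, here is a minimal sketch of a reachability check. It only tests whether a TCP connection to the SSH port can be opened; it is not what Jenkins itself does, and the function name, the default port 22, and the timeout are assumptions made for illustration.

```python
import socket


def ssh_port_reachable(host: str, port: int = 22, timeout: float = 5.0) -> bool:
    """Return True if a TCP connection to host:port opens within `timeout` seconds.

    Hypothetical helper: it checks TCP reachability only and does not
    attempt an SSH handshake or authentication.
    """
    try:
        # create_connection resolves the name and connects; DNS failures,
        # refused connections, and timeouts are all subclasses of OSError.
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False
```

Calling `ssh_port_reachable("compiler1002.puppet-diffs.eqiad.wmflabs")` from a host inside the same network would be expected to return False while the instance is shut off, though a False result alone cannot distinguish a stopped instance from a DNS or firewall problem.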

Horizon page:


| Request ID | Action | Start Time | User ID | Message |
| --- | --- | --- | --- | --- |
| req-10c22ffc-0340-4bd6-9b28-749b103d02af | Stop | Feb. 13, 2019, 6:39 p.m. | novaadmin | - |
| req-28587666-19d4-4034-846c-98a44af7793e | Start | Feb. 13, 2019, 6:37 p.m. | novaadmin | - |
| req-307304fc-ba65-494e-8671-80a9288bea54 | Stop | Feb. 13, 2019, 5:01 p.m. | novaadmin | - |
| req-2923193e-53ad-4392-891e-721ae8d528c9 | Start | Feb. 13, 2019, 3:33 p.m. | novaadmin | - |
| req-71da83a0-32d5-4f38-a0c3-d8e87edf4943 | Stop | Feb. 13, 2019, 3:29 p.m. | - | - |
| req-dbd3ed92-ca73-44e5-bb97-a1e1b240f8eb | Start | Feb. 13, 2019, 2:24 p.m. | novaadmin | - |
| req-df32d30e-a324-4f97-9f90-2a27364885b0 | Stop | Feb. 13, 2019, 1:38 p.m. | - | - |
| req-aea6fa12-5834-4328-a644-8f976eafc6c7 | Reboot | Nov. 9, 2018, 3:58 a.m. | novaadmin | - |

Event Timeline

Mentioned in SAL (#wikimedia-operations) [2019-02-19T15:25:47Z] <hashar> Started instance compiler1002.puppet-diffs.eqiad.wmflabs via Horizon. It was in shutoff state | T216513

hashar claimed this task.

I started the instance via Horizon and it got attached to Jenkins again.

Mentioned in SAL (#wikimedia-operations) [2019-02-19T17:00:18Z] <hashar> Offlined compiler1002.puppet-diffs.eqiad.wmflabs from Jenkins. Its disk is corrupt | T216513

Joe said that it is /var/lib/catalog-differ that is corrupted. That might be easy to recover :)

I've re-created compiler1002 from scratch and am working to bring the puppet compiler service up on the host and validate a few builds locally. I estimate this will take until tonight (eastern time) or tomorrow morning since the local puppetdb takes a while to populate.

Compiler1002 is back online and successfully ran through a few local test catalog compiles. populate-puppetdb is running now so we should be in good shape to re-enable this host tomorrow morning. Will follow up when that completes.

That finished a bit faster than I was expecting! Ready to re-enable in the morning. And FWIW here's an example of a successful manual run

@herron restored the instance, not me :] It will be verified in a few hours and added back to the CI pool if it works fine.

hashar triaged this task as High priority. Feb 20 2019, 10:22 AM
hashar moved this task from Backlog to In-progress on the Release-Engineering-Team (Kanban) board.

compiler1002 is ready to be re-enabled at your earliest convenience

Thank you Keith for the complete rebuild! ;-]