Puppet labs/private.git data loss incident affecting some projects
Closed, ResolvedPublic

Description

At 2020-06-04T11:12 UTC a change was merged to the operations/puppet.git repository which resulted in data loss for Cloud VPS projects using a local Puppetmaster (role::puppetmaster::standalone). The specific data loss is the removal of any commits local to the Puppetmaster instance that were overlaid on the upstream labs/private.git repository. These commits would have contained passwords, SSH keys, TLS certificates, and similar authentication information for Puppet-managed configuration.

Incident doc: https://wikitech.wikimedia.org/wiki/Incidents/2020-06-04_cloud-private-repo

Event Timeline

bd808 triaged this task as Unbreak Now! priority. Jun 4 2020, 3:12 PM
bd808 moved this task from Inbox to Doing on the cloud-services-team (Kanban) board.
[19:21]  <andrewbogott> One investigative tool is: 1) edit crontab to comment out the periodic puppet run 2) enable puppet 3) puppet agent --noop -tv

The crontab is /etc/cron.d/puppet
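
The three steps quoted above can be sketched as a small shell helper. This is only an illustration under stated assumptions: the function names `disable_puppet_cron` and `noop_run`, the `CRON_FILE` override, and the `.bak` backup suffix are hypothetical and do not come from this task. Only the cron path and the `puppet agent` invocations are taken from the comments above. `sed -i` assumes GNU sed.

```shell
#!/bin/sh
# Sketch of the investigative steps from the IRC quote above.
# CRON_FILE defaults to the path named in this task; it can be
# overridden (an assumption of this sketch) to try things on a copy.
CRON_FILE="${CRON_FILE:-/etc/cron.d/puppet}"

# Step 1: comment out every active line of the given cron file so the
# periodic puppet run no longer fires. A backup is kept next to it
# (the ".bak" suffix is a choice made for this sketch).
disable_puppet_cron() {
    cp "$1" "$1.bak" &&
    sed -i 's/^\([^#]\)/#\1/' "$1"
}

# Steps 2 and 3: enable the agent, then do a no-op run that reports
# what would change without applying anything.
noop_run() {
    puppet agent --enable &&
    puppet agent --noop -tv
}
```

Run as root, the sequence would be `disable_puppet_cron "$CRON_FILE" && noop_run`. The cron file can be restored from the backup copy once the investigation is finished.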

Mentioned in SAL (#wikimedia-cloud) [2020-06-04T20:07:49Z] <bd808> Checked all hosts and found no missing secrets; puppet re-enabled (T254491)

Mentioned in SAL (#wikimedia-cloud) [2020-06-04T20:57:25Z] <bd808> Checked all hosts and found no missing secrets; puppet re-enabled (T254491)

bd808 added a subtask: Restricted Task. Jun 4 2020, 9:40 PM
bd808 lowered the priority of this task from Unbreak Now! to High. Jun 4 2020, 9:47 PM
bd808 added a subscriber: Krenair.

Lowering priority from UBN to High.

@Andrew, @aborrero, @jbond, @hashar, @Krenair, and @bd808 have worked through all of the projects where we felt that secret loss was likely. We believe that we have managed to recover almost all lost secrets. There are a few things left to do before we re-enable Puppet everywhere, but that list is short and non-urgent. Out of an abundance of caution we are not going to do these last steps until more of us are fresh and simultaneously available. We will also work on a proper incident report at that time.

bd808 removed bd808 as the assignee of this task. Jun 4 2020, 10:56 PM

I have made a first pass at the incident report. The main sections which could use some more input are:

  • impact
  • detection
  • documentation
  • actionable

The conclusion and timeline are likely also missing some points.

I have added a few bits:

  • fixed date (was May 11)
  • details on how it was handled for deployment-prep / integration
  • added reference links to the IRC logs for #wikimedia-cloud, #wikimedia-operations, and #wikimedia-releng

The incident page <code>20200605-cloud-private-repo</code> is titled with the date the page was created, which is a day after the incident occurred (June 4th). That caused me a bit of confusion; I would recommend moving the page to <code>Incident_documentation/20200604-cloud-private-repo</code>.

{{Done}} Page moved with redirect from original title to new title which matches the incident date.

Mentioned in SAL (#wikimedia-cloud) [2020-06-09T21:24:50Z] <bd808> Credentials missing from es7 cluster for tool access. Fallout from T254491

This was the bash tool. Its password did not make it into the recovery done by @aborrero. I went ahead and generated a new password for it, added that to the local commit on the puppetmaster, and updated the credentials in the tool.

Change 604696 had a related patch set uploaded (by Jbond; owner: John Bond):
[operations/puppet@production] labtestpuppet: puppet::servers

https://gerrit.wikimedia.org/r/604696

Change 604696 merged by Jbond:
[operations/puppet@production] labtestpuppet: puppet::servers

https://gerrit.wikimedia.org/r/604696

aborrero claimed this task.