
Icinga/Check for VMs leaked by the nova-fullstack test
Closed, Resolved (Public)

Description


From alertmanager:

Icinga/Check for VMs leaked by the nova-fullstack test
summary: 7 instances in the admin-monitoring project

Event Timeline

dcaro triaged this task as High priority. Aug 25 2021, 9:08 AM
dcaro created this task.

Change 714722 had a related patch set uploaded (by David Caro; author: David Caro):

[operations/puppet@production] nova_fullstack: rephrase log message

https://gerrit.wikimedia.org/r/714722

Change 714733 had a related patch set uploaded (by David Caro; author: David Caro):

[operations/puppet@production] nova_fullstack: Add last error output when timing out puppet check

https://gerrit.wikimedia.org/r/714733

I started a new VM with the same image, and while booting (before the first puppet run), I connected to the virsh
console and was able to print the puppet config, showing that the state dir is not the one we are looking at
(/var/lib/puppet/state):

agent_catalog_run_lockfile = /var/cache/puppet/state/agent_catalog_run.lock
agent_disabled_lockfile = /var/cache/puppet/state/agent_disabled.lock
classfile = /var/cache/puppet/state/classes.txt
graphdir = /var/cache/puppet/state/graphs
lastrunfile = /var/cache/puppet/state/last_run_summary.yaml
lastrunreport = /var/cache/puppet/state/last_run_report.yaml
resourcefile = /var/cache/puppet/state/resources.txt
statedir = /var/cache/puppet/state
statefile = /var/cache/puppet/state/state.yaml
statettl = 2764800
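
To make the mismatch easy to check automatically, here is a minimal sketch that parses `key = value` output like the above and compares the reported `statedir` with the path the test expects. The function name `parse_puppet_config` and the constant `EXPECTED_STATEDIR` are hypothetical, introduced only for illustration; the sample text is a subset of the output pasted above.

```python
# Hypothetical helper: parse 'key = value' lines such as the output above.
def parse_puppet_config(output: str) -> dict:
    """Return a dict of settings, skipping lines without an '=' sign."""
    settings = {}
    for line in output.splitlines():
        key, sep, value = line.partition("=")
        if not sep:
            continue  # blank line or anything that is not 'key = value'
        settings[key.strip()] = value.strip()
    return settings


# Subset of the config printed on the freshly booted VM.
sample = """\
lastrunfile = /var/cache/puppet/state/last_run_summary.yaml
statedir = /var/cache/puppet/state
statettl = 2764800
"""

# Assumption: this is the directory the test currently looks at.
EXPECTED_STATEDIR = "/var/lib/puppet/state"

settings = parse_puppet_config(sample)
statedir_matches = settings["statedir"] == EXPECTED_STATEDIR
print(f"statedir={settings['statedir']} matches expected: {statedir_matches}")
```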

So my current hypothesis is that the first puppet run changes the state dir, but puppet does not use the new path to
store its state until the second run, and that run depends on the cron getting triggered.
Sometimes that takes too long and the test just times out.

I'll adapt the script to look in both places, because if the above is correct, even 'puppet config print' will
show the wrong path after the first run.
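
A rough sketch of the "look in both places" idea, assuming the check only needs to find the `last_run_summary.yaml` file in whichever state dir exists. This is not the actual patch: `find_last_run_summary` and `CANDIDATE_STATE_DIRS` are hypothetical names, the order of the two directories is an assumption, and the sketch reads the local filesystem rather than running on the VM under test.

```python
from pathlib import Path
from typing import Iterable, Optional

# Both state dirs seen in this task; the order here is an assumption.
CANDIDATE_STATE_DIRS = ("/var/lib/puppet/state", "/var/cache/puppet/state")
# File name taken from the 'lastrunfile' setting printed above.
LAST_RUN_SUMMARY = "last_run_summary.yaml"


def find_last_run_summary(
    state_dirs: Iterable[str] = CANDIDATE_STATE_DIRS,
) -> Optional[Path]:
    """Return the first existing last-run summary file, or None if none exists."""
    for state_dir in state_dirs:
        candidate = Path(state_dir) / LAST_RUN_SUMMARY
        if candidate.is_file():
            return candidate
    return None


if __name__ == "__main__":
    found = find_last_run_summary()
    print(found if found is not None else "no puppet run summary found yet")
```

Returning `None` instead of raising lets a polling loop keep retrying until its timeout, which matches how the test waits for the first puppet run.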

Change 714761 had a related patch set uploaded (by David Caro; author: David Caro):

[operations/puppet@production] nova_fullstack: try to get the puppet state from a couple places

https://gerrit.wikimedia.org/r/714761

Change 714722 merged by Andrew Bogott:

[operations/puppet@production] nova_fullstack: rephrase log message

https://gerrit.wikimedia.org/r/714722

The main curse on VM creation these days is the puppet-agent. Cloud-init starts the puppet agent (non-optionally) and then the puppet-agent may or may not start a puppet sync while the firstboot script is running.

That race causes no end of headaches, so /probably/ that is still what's causing this. I'm going to look at that next.

Change 714831 had a related patch set uploaded (by Andrew Bogott; author: Andrew Bogott):

[operations/puppet@production] nova vendor-data: another mild attempt to avoid races with the puppet agent

https://gerrit.wikimedia.org/r/714831

Change 714831 merged by Andrew Bogott:

[operations/puppet@production] nova vendor-data: another mild attempt to avoid races with the puppet agent

https://gerrit.wikimedia.org/r/714831

That last patch maybe helped, or maybe we've just had a lucky streak.

@Andrew, handing this over to you, as it's not clear to me whether you tried the other patches (the ones about checking different puppet state file paths). Feel free to abandon them if they are not needed and close the task.

It turns out it was just a lucky streak :(

Change 714733 merged by Andrew Bogott:

[operations/puppet@production] nova_fullstack: Add last error output when timing out puppet check

https://gerrit.wikimedia.org/r/714733

Change 714761 merged by Andrew Bogott:

[operations/puppet@production] nova_fullstack: try to get the puppet state from a couple places

https://gerrit.wikimedia.org/r/714761

Change 715026 had a related patch set uploaded (by Andrew Bogott; author: Andrew Bogott):

[operations/puppet@production] Added cloud-wide default for profile::debdeploy::client::filter_services:

https://gerrit.wikimedia.org/r/715026

Change 715026 merged by Andrew Bogott:

[operations/puppet@production] Added cloud-wide default for profile::debdeploy::client::filter_services:

https://gerrit.wikimedia.org/r/715026

Change 714858 had a related patch set uploaded (by Andrew Bogott; author: Andrew Bogott):

[operations/puppet@production] Revert "nova_fullstack: try to get the puppet state from a couple places"

https://gerrit.wikimedia.org/r/714858

Change 714858 merged by Andrew Bogott:

[operations/puppet@production] Revert "nova_fullstack: try to get the puppet state from a couple places"

https://gerrit.wikimedia.org/r/714858

Change 715045 had a related patch set uploaded (by Andrew Bogott; author: Andrew Bogott):

[operations/puppet@production] nova vendordata: try to have cloud-init perform the first puppet run

https://gerrit.wikimedia.org/r/715045

Change 715050 had a related patch set uploaded (by Andrew Bogott; author: Andrew Bogott):

[operations/puppet@production] nova_fullstack_test.py: capture output on successful puppet check

https://gerrit.wikimedia.org/r/715050

Change 715050 merged by Andrew Bogott:

[operations/puppet@production] nova_fullstack_test.py: capture output on successful puppet check

https://gerrit.wikimedia.org/r/715050

Change 715045 merged by Andrew Bogott:

[operations/puppet@production] nova vendordata: try to have cloud-init perform the first puppet run

https://gerrit.wikimedia.org/r/715045

This particular issue should be resolved. I'm not going to keep this task open for that upstream task, since we won't be able to deploy it until after bullseye.