Page MenuHomePhabricator

wmf-auto-reimage: 'execution expired' on first puppet run
Closed, DeclinedPublic

Description

When running wmf-auto-reimage for both cache and lvs hosts, the first puppet run seems to consistently fail at a certain point with Could not evaluate: Could not retrieve file metadata for [...]: execution expired.

Puppet needs to be re-run manually on the host being reinstalled in order for wmf-auto-reimage to be made to continue. The execution expires while retrieving the metadata for wmf_ca_2017_2020.crt on cache hosts and /etc/rsyslog.d/10-puppet-agent.conf on lvs hosts.

See for example the following occurrences of 'execution expired' on neodymium:/var/log/wmf-auto-reimage:

201804180747_vgutierrez_29945_lvs3003_esams_wmnet_cumin.out:^M          ^MESC[AESC[1;31mError: /Stage[main]/Base::Puppet/Rsyslog::Conf[puppet-agent]/File[/etc/rsyslog.d/10-puppet-agent.conf]: Could not evaluate: Could not retrieve file metadata for puppet:///modules/base/rsyslog.d/puppet-agent.conf: execution expiredESC[0m
201808021947_bblack_13294_cp1079_eqiad_wmnet_cumin.out:^M          ^MESC[AESC[1;31mError: /Stage[main]/Profile::Base::Certificates/Sslcert::Ca[wmf_ca_2017_2020]/File[/usr/local/share/ca-certificates/wmf_ca_2017_2020.crt]: Could not evaluate: Could not retrieve file metadata for puppet:///modules/base/ca/wmf_ca_2017_2020.crt: execution expiredESC[0m
201805021045_vgutierrez_23768_lvs2002_codfw_wmnet_cumin.out:^M          ^MESC[AESC[1;31mError: /Stage[main]/Base::Puppet/Rsyslog::Conf[puppet-agent]/File[/etc/rsyslog.d/10-puppet-agent.conf]: Could not evaluate: Could not retrieve file metadata for puppet:///modules/base/rsyslog.d/puppet-agent.conf: execution expiredESC[0m
201808022051_bblack_30024_cp1089_eqiad_wmnet_cumin.out:^M          ^MESC[AESC[1;31mError: /Stage[main]/Profile::Base::Certificates/Sslcert::Ca[wmf_ca_2017_2020]/File[/usr/local/share/ca-certificates/wmf_ca_2017_2020.crt]: Could not evaluate: Could not retrieve file metadata for puppet:///modules/base/ca/wmf_ca_2017_2020.crt: execution expiredESC[0m
201804180936_vgutierrez_28436_lvs2005_codfw_wmnet_cumin.out:^M          ^MESC[AESC[1;31mError: /Stage[main]/Base::Puppet/Rsyslog::Conf[puppet-agent]/File[/etc/rsyslog.d/10-puppet-agent.conf]: Could not evaluate: Could not retrieve file metadata for puppet:///modules/base/rsyslog.d/puppet-agent.conf: execution expiredESC[0m
201804300923_vgutierrez_12807_lvs3001_esams_wmnet_cumin.out:^M          ^MESC[AESC[1;31mError: /Stage[main]/Base::Puppet/Rsyslog::Conf[puppet-agent]/File[/etc/rsyslog.d/10-puppet-agent.conf]: Could not evaluate: Could not retrieve file metadata for puppet:///modules/base/rsyslog.d/puppet-agent.conf: execution expiredESC[0m
201808021947_bblack_13297_cp1082_eqiad_wmnet_cumin.out:^M          ^MESC[AESC[1;31mError: /Stage[main]/Profile::Base::Certificates/Sslcert::Ca[wmf_ca_2017_2020]/File[/usr/local/share/ca-certificates/wmf_ca_2017_2020.crt]: Could not evaluate: Could not retrieve file metadata for puppet:///modules/base/ca/wmf_ca_2017_2020.crt: execution expiredESC[0m
201808021722_bblack_13689_cp1076_eqiad_wmnet_cumin.out:^M          ^MESC[AESC[1;31mError: /Stage[main]/Profile::Base::Certificates/Sslcert::Ca[wmf_ca_2017_2020]/File[/usr/local/share/ca-certificates/wmf_ca_2017_2020.crt]: Could not evaluate: Could not retrieve file metadata for puppet:///modules/base/ca/wmf_ca_2017_2020.crt: execution expiredESC[0m
201808022051_bblack_30021_cp1086_eqiad_wmnet_cumin.out:^M          ^MESC[AESC[1;31mError: /Stage[main]/Profile::Base::Certificates/Sslcert::Ca[wmf_ca_2017_2020]/File[/usr/local/share/ca-certificates/wmf_ca_2017_2020.crt]: Could not evaluate: Could not retrieve file metadata for puppet:///modules/base/ca/wmf_ca_2017_2020.crt: execution expiredESC[0m
201804171415_vgutierrez_28018_lvs3004_esams_wmnet_cumin.out:^M          ^MESC[AESC[1;31mError: /Stage[main]/Base::Puppet/Rsyslog::Conf[puppet-agent]/File[/etc/rsyslog.d/10-puppet-agent.conf]: Could not evaluate: Could not retrieve file metadata for puppet:///modules/base/rsyslog.d/puppet-agent.conf: execution expiredESC[0m
201808060952_ema_27699_cp5003_eqsin_wmnet_cumin.out:^M          ^MESC[AESC[1;31mError: /Stage[main]/Profile::Base::Certificates/Sslcert::Ca[wmf_ca_2017_2020]/File[/usr/local/share/ca-certificates/wmf_ca_2017_2020.crt]: Could not evaluate: Could not retrieve file metadata for puppet:///modules/base/ca/wmf_ca_2017_2020.crt: execution expiredESC[0m
201808022051_bblack_30022_cp1087_eqiad_wmnet_cumin.out:^M          ^MESC[AESC[1;31mError: /Stage[main]/Profile::Base::Certificates/Sslcert::Ca[wmf_ca_2017_2020]/File[/usr/local/share/ca-certificates/wmf_ca_2017_2020.crt]: Could not evaluate: Could not retrieve file metadata for puppet:///modules/base/ca/wmf_ca_2017_2020.crt: execution expiredESC[0m
201805021250_vgutierrez_20476_lvs2001_codfw_wmnet_cumin.out:^M          ^MESC[AESC[1;31mError: /Stage[main]/Base::Puppet/Rsyslog::Conf[puppet-agent]/File[/etc/rsyslog.d/10-puppet-agent.conf]: Could not evaluate: Could not retrieve file metadata for puppet:///modules/base/rsyslog.d/puppet-agent.conf: execution expiredESC[0m
201808022051_bblack_30018_cp1083_eqiad_wmnet_cumin.out:^M          ^MESC[AESC[1;31mError: /Stage[main]/Profile::Base::Certificates/Sslcert::Ca[wmf_ca_2017_2020]/File[/usr/local/share/ca-certificates/wmf_ca_2017_2020.crt]: Could not evaluate: Could not retrieve file metadata for puppet:///modules/base/ca/wmf_ca_2017_2020.crt: execution expiredESC[0m
201804190753_vgutierrez_21775_lvs4006_ulsfo_wmnet_cumin.out:^M          ^MESC[AESC[1;31mError: /Stage[main]/Base::Puppet/Rsyslog::Conf[puppet-agent]/File[/etc/rsyslog.d/10-puppet-agent.conf]: Could not evaluate: Could not retrieve file metadata for puppet:///modules/base/rsyslog.d/puppet-agent.conf: execution expiredESC[0m
201807311510_ema_11657_cp3044_esams_wmnet_cumin.out:^M          ^MESC[AESC[1;31mError: /Stage[main]/Profile::Base::Certificates/Sslcert::Ca[wmf_ca_2017_2020]/File[/usr/local/share/ca-certificates/wmf_ca_2017_2020.crt]: Could not evaluate: Could not retrieve file metadata for puppet:///modules/base/ca/wmf_ca_2017_2020.crt: execution expiredESC[0m
201808022051_bblack_30019_cp1084_eqiad_wmnet_cumin.out:^M          ^MESC[AESC[1;31mError: /Stage[main]/Profile::Base::Certificates/Sslcert::Ca[wmf_ca_2017_2020]/File[/usr/local/share/ca-certificates/wmf_ca_2017_2020.crt]: Could not evaluate: Could not retrieve file metadata for puppet:///modules/base/ca/wmf_ca_2017_2020.crt: execution expiredESC[0m
201804181047_vgutierrez_12072_lvs2004_codfw_wmnet_cumin.out:^M          ^MESC[AESC[1;31mError: /Stage[main]/Base::Puppet/Rsyslog::Conf[puppet-agent]/File[/etc/rsyslog.d/10-puppet-agent.conf]: Could not evaluate: Could not retrieve file metadata for puppet:///modules/base/rsyslog.d/puppet-agent.conf: execution expiredESC[0m
201808011535_ema_20473_cp5007_eqsin_wmnet_cumin.out:^M          ^MESC[AESC[1;31mError: /Stage[main]/Profile::Base::Certificates/Sslcert::Ca[wmf_ca_2017_2020]/File[/usr/local/share/ca-certificates/wmf_ca_2017_2020.crt]: Could not evaluate: Could not retrieve file metadata for puppet:///modules/base/ca/wmf_ca_2017_2020.crt: execution expiredESC[0m
201808021947_bblack_13293_cp1078_eqiad_wmnet_cumin.out:^M          ^MESC[AESC[1;31mError: /Stage[main]/Profile::Base::Certificates/Sslcert::Ca[wmf_ca_2017_2020]/File[/usr/local/share/ca-certificates/wmf_ca_2017_2020.crt]: Could not evaluate: Could not retrieve file metadata for puppet:///modules/base/ca/wmf_ca_2017_2020.crt: execution expiredESC[0m
201808022051_bblack_30023_cp1088_eqiad_wmnet_cumin.out:^M          ^MESC[AESC[1;31mError: /Stage[main]/Profile::Base::Certificates/Sslcert::Ca[wmf_ca_2017_2020]/File[/usr/local/share/ca-certificates/wmf_ca_2017_2020.crt]: Could not evaluate: Could not retrieve file metadata for puppet:///modules/base/ca/wmf_ca_2017_2020.crt: execution expiredESC[0m
201808021947_bblack_13292_cp1077_eqiad_wmnet_cumin.out:^M          ^MESC[AESC[1;31mError: /Stage[main]/Profile::Base::Certificates/Sslcert::Ca[wmf_ca_2017_2020]/File[/usr/local/share/ca-certificates/wmf_ca_2017_2020.crt]: Could not evaluate: Could not retrieve file metadata for puppet:///modules/base/ca/wmf_ca_2017_2020.crt: execution expiredESC[0m
201808021947_bblack_13296_cp1081_eqiad_wmnet_cumin.out:^M          ^MESC[AESC[1;31mError: /Stage[main]/Profile::Base::Certificates/Sslcert::Ca[wmf_ca_2017_2020]/File[/usr/local/share/ca-certificates/wmf_ca_2017_2020.crt]: Could not evaluate: Could not retrieve file metadata for puppet:///modules/base/ca/wmf_ca_2017_2020.crt: execution expiredESC[0m
201804300706_vgutierrez_14698_lvs3002_esams_wmnet_cumin.out:^M          ^MESC[AESC[1;31mError: /Stage[main]/Base::Puppet/Rsyslog::Conf[puppet-agent]/File[/etc/rsyslog.d/10-puppet-agent.conf]: Could not evaluate: Could not retrieve file metadata for puppet:///modules/base/rsyslog.d/puppet-agent.conf: execution expiredESC[0m
201808031349_bblack_13823_cp1085_eqiad_wmnet_cumin.out:^M          ^MESC[AESC[1;31mError: /Stage[main]/Profile::Base::Certificates/Sslcert::Ca[wmf_ca_2017_2020]/File[/usr/local/share/ca-certificates/wmf_ca_2017_2020.crt]: Could not evaluate: Could not retrieve file metadata for puppet:///modules/base/ca/wmf_ca_2017_2020.crt: execution expiredESC[0m
201808022051_bblack_30025_cp1090_eqiad_wmnet_cumin.out:^M          ^MESC[AESC[1;31mError: /Stage[main]/Profile::Base::Certificates/Sslcert::Ca[wmf_ca_2017_2020]/File[/usr/local/share/ca-certificates/wmf_ca_2017_2020.crt]: Could not evaluate: Could not retrieve file metadata for puppet:///modules/base/ca/wmf_ca_2017_2020.crt: execution expiredESC[0m
201808011535_ema_20472_cp5006_eqsin_wmnet_cumin.out:^M          ^MESC[AESC[1;31mError: /Stage[main]/Profile::Base::Certificates/Sslcert::Ca[wmf_ca_2017_2020]/File[/usr/local/share/ca-certificates/wmf_ca_2017_2020.crt]: Could not evaluate: Could not retrieve file metadata for puppet:///modules/base/ca/wmf_ca_2017_2020.crt: execution expiredESC[0m

I could not find anything interesting while inspecting the apache logs on the relevant puppetmasters.

Event Timeline

ema triaged this task as Medium priority.Aug 6 2018, 1:15 PM
ema added a project: SRE.

I've modified wmf_auto_reimage_lib to run puppet with --debug at the first run. Here is what happens around the failure (logs cleaned up from various junk):

Debug: Executing: '/sbin/ip token set ::10:132:0:109 dev enp5s0f0 && /sbin/ip -6 addr flush dev enp5s0f0 dynamic'
Debug: Closing connection for https://puppet:8140
Debug: Creating new connection for https://puppet:8140
Debug: Starting connection for https://puppet:8140
Error: /Stage[main]/Profile::Base::Certificates/Sslcert::Ca[wmf_ca_2017_2020]/File[/usr/local/share/ca-certificates/wmf_ca_2017_2020.crt]: Could not evaluate: Could not retrieve file metadata for puppet:///modules/base/ca/wmf_ca_2017_2020.crt: execution expired

The connection to the puppetmaster gets closed due to the ip -6 addr flush command apparently.

@ema do you know if this is still happening? If so it seems more of a puppetization issue than a reimage one, should we remove the SRE-tools tag?

ema moved this task from Backlog to Doing on the User-ema board.

Cookbook cookbooks.sre.hosts.reimage was started by ema@cumin2002 for host cp4021.ulsfo.wmnet with OS buster

Cookbook cookbooks.sre.hosts.reimage started by ema@cumin2002 for host cp4021.ulsfo.wmnet with OS buster completed:

  • cp4021 (WARN)
    • Downtimed on Icinga
    • Set pooled=inactive for the following services on confctl:

{"cp4021.ulsfo.wmnet": {"weight": 100, "pooled": "yes"}, "tags": "dc=ulsfo,cluster=cache_upload,service=ats-be"}
{"cp4021.ulsfo.wmnet": {"weight": 1, "pooled": "yes"}, "tags": "dc=ulsfo,cluster=cache_upload,service=ats-tls"}
{"cp4021.ulsfo.wmnet": {"weight": 1, "pooled": "yes"}, "tags": "dc=ulsfo,cluster=cache_upload,service=varnish-fe"}

  • Disabled Puppet
  • Removed from Puppet and PuppetDB if present
  • Deleted any existing Puppet certificate
  • Removed from Debmonitor if present
  • Forced PXE for next reboot
  • Host rebooted via IPMI
  • Host up (Debian installer)
  • Host up (new fresh buster OS)
  • Generated Puppet certificate
  • Signed new Puppet certificate
  • Run Puppet in NOOP mode to populate exported resources in PuppetDB
  • Found Nagios_host resource for this host in PuppetDB
  • Downtimed the new host on Icinga
  • First Puppet run completed and logged in /var/log/spicerack/sre/hosts/reimage/202110130922_ema_2817687_cp4021.out
  • Checked BIOS boot parameters are back to normal
  • Rebooted
  • Automatic Puppet run was successful
  • Forced a re-check of all Icinga services for the host
  • Icinga status is not optimal, downtime not removed
  • Services in confctl are not automatically pooled, to restore the previous state you have to run the following commands:

sudo confctl select 'dc=ulsfo,cluster=cache_upload,service=ats-be' set/pooled=yes
sudo confctl select 'dc=ulsfo,cluster=cache_upload,service=ats-tls' set/pooled=yes
sudo confctl select 'dc=ulsfo,cluster=cache_upload,service=varnish-fe' set/pooled=yes

  • Updated Netbox data from PuppetDB

Trying another reimage as follows:

root@cumin2002:~# cookbook sre.hosts.reimage --os buster --conftool -t T201317 cp4021 2>&1 | ts | tee reimage-cp4021-take-2.log

Cookbook cookbooks.sre.hosts.reimage was started by ema@cumin2002 for host cp4021.ulsfo.wmnet with OS buster

Cookbook cookbooks.sre.hosts.reimage started by ema@cumin2002 for host cp4021.ulsfo.wmnet with OS buster completed:

  • cp4021 (WARN)
    • Downtimed on Icinga
    • Set pooled=inactive for the following services on confctl:

{"cp4021.ulsfo.wmnet": {"weight": 100, "pooled": "yes"}, "tags": "dc=ulsfo,cluster=cache_upload,service=ats-be"}
{"cp4021.ulsfo.wmnet": {"weight": 1, "pooled": "yes"}, "tags": "dc=ulsfo,cluster=cache_upload,service=ats-tls"}
{"cp4021.ulsfo.wmnet": {"weight": 1, "pooled": "yes"}, "tags": "dc=ulsfo,cluster=cache_upload,service=varnish-fe"}

  • Disabled Puppet
  • Removed from Puppet and PuppetDB if present
  • Deleted any existing Puppet certificate
  • Removed from Debmonitor if present
  • Forced PXE for next reboot
  • Host rebooted via IPMI
  • Host up (Debian installer)
  • Host up (new fresh buster OS)
  • Generated Puppet certificate
  • Signed new Puppet certificate
  • Run Puppet in NOOP mode to populate exported resources in PuppetDB
  • Found Nagios_host resource for this host in PuppetDB
  • Downtimed the new host on Icinga
  • First Puppet run completed and logged in /var/log/spicerack/sre/hosts/reimage/202110131133_ema_2833937_cp4021.out
  • Checked BIOS boot parameters are back to normal
  • Rebooted
  • Automatic Puppet run was successful
  • Forced a re-check of all Icinga services for the host
  • Icinga status is not optimal, downtime not removed
  • Services in confctl are not automatically pooled, to restore the previous state you have to run the following commands:

sudo confctl select 'dc=ulsfo,cluster=cache_upload,service=ats-be' set/pooled=yes
sudo confctl select 'dc=ulsfo,cluster=cache_upload,service=ats-tls' set/pooled=yes
sudo confctl select 'dc=ulsfo,cluster=cache_upload,service=varnish-fe' set/pooled=yes

  • Updated Netbox data from PuppetDB

Mentioned in SAL (#wikimedia-operations) [2021-10-13T14:52:47Z] <ema> repool cp4021, further testing can be performed on sretest1001 T201317

Too long has passed since then and doesn't seem to happen anymore.