PuppetAgentStaleLastRun - cloud-puppetmaster-03
Closed, ResolvedPublic

Description

From alertmanager https://prometheus-alerts.wmcloud.org:

alertname: PuppetAgentStaleLastRun
project: cloudinfra
summary: Last Puppet run was over 24 hours ago on instance cloud-puppetmaster-03 in project cloudinfra
2 hours ago
instance: cloud-puppetmaster-03
job: node
severity: warn

Running puppet manually fails, complaining that the lockfile already exists:

root@cloud-puppetmaster-03:~# run-puppet-agent
Notice: Run of Puppet configuration client already in progress; skipping  (/var/lib/puppet/state/agent_catalog_run.lock exists)

But running ls on the file itself reports that it does not exist:

root@cloud-puppetmaster-03:~# ls -la /var/lib/puppet/state/agent_catalog_run.lock
ls: cannot access '/var/lib/puppet/state/agent_catalog_run.lock': No such file or directory

The weird thing is that listing the directory does show the entry, though with unreadable metadata:

root@cloud-puppetmaster-03:~# ls -l /var/lib/puppet/state
ls: cannot access '/var/lib/puppet/state/agent_catalog_run.lock': No such file or directory
total 2640
-????????? ? ?    ?          ?            ? agent_catalog_run.lock
...

This might be a filesystem issue; looking into it.
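
The symptom above (a name that is listed in the directory but cannot be stat'ed) can be checked for programmatically. The following is only a minimal sketch, not something that was run on the host; the function name `find_unstattable_entries` is made up for illustration.

```python
import os


def find_unstattable_entries(directory):
    """Return (name, error) pairs for entries that are listed in
    `directory` but for which lstat() fails.

    On a healthy filesystem this is normally empty. Note that a file
    deleted between listdir() and lstat() would also show up here, so
    a hit is a hint to investigate, not proof of corruption.
    """
    broken = []
    for name in os.listdir(directory):
        path = os.path.join(directory, name)
        try:
            os.lstat(path)  # lstat: do not follow symlinks
        except OSError as exc:
            broken.append((name, exc.strerror))
    return sorted(broken)
```

Calling `find_unstattable_entries("/var/lib/puppet/state")` on the affected host would be expected to report `agent_catalog_run.lock`, assuming the directory is still in the state shown in the listing above.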

Event Timeline

dcaro triaged this task as High priority.Jul 20 2022, 6:47 AM
dcaro created this task.

Yep, something has been broken there since yesterday:

[Tue Jul 19 04:42:22 2022] EXT4-fs error (device vda2): ext4_validate_inode_bitmap:100: comm utils.rb:110: Corrupt inode bitmap - block_group = 34, inode_bitmap = 1048594
[Tue Jul 19 04:45:10 2022] EXT4-fs error (device vda2): ext4_validate_block_bitmap:384: comm apt-show-versio: bg 18: bad block bitmap checksum
[Tue Jul 19 04:45:10 2022] EXT4-fs error (device vda2) in ext4_free_blocks:4973: Filesystem failed CRC
[Tue Jul 19 04:47:46 2022] EXT4-fs error (device vda2): ext4_validate_block_bitmap:384: comm utils.rb:110: bg 30: bad block bitmap checksum
[Tue Jul 19 04:47:46 2022] EXT4-fs (vda2): Delayed block allocation failed for inode 746559 at logical offset 2048 with max blocks 57 with error 74
[Tue Jul 19 04:47:46 2022] EXT4-fs (vda2): This should not happen!! Data will be lost
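
To get a quick per-device summary of kernel messages like the ones above, something along these lines could be used. It is a rough sketch: the function name `count_ext4_errors` is hypothetical, and the pattern only assumes the `EXT4-fs error (device ...)` prefix seen in this log, so other message formats are not counted.

```python
import re
from collections import Counter

# Matches the "EXT4-fs error (device vda2)" prefix seen in the log above.
EXT4_ERROR = re.compile(r"EXT4-fs error \(device (?P<device>[^)]+)\)")


def count_ext4_errors(lines):
    """Count 'EXT4-fs error' kernel log lines per block device."""
    counts = Counter()
    for line in lines:
        match = EXT4_ERROR.search(line)
        if match:
            counts[match.group("device")] += 1
    return counts
```

The lines would typically come from saved `dmesg -T` output, for example `count_ext4_errors(open("dmesg.txt"))`, where `dmesg.txt` is a placeholder file name.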

Mentioned in SAL (#wikimedia-cloud) [2022-07-20T07:03:31Z] <dcaro> manually running fsck through console on puppetmaster-03 due to disk errors (T313380)

Rebooted the host, manually ran fsck on the disk, and it started up again OK. Tested by running puppet from a fullstack VM:

root@fullstackd-20220720052914:~# run-puppet-agent
2022-07-20 07:30:45.597402 WARN  puppetlabs.facter - locale environment variables were bad; continuing with LANG=C LC_ALL=C
2022-07-20 07:30:46.198613 WARN  puppetlabs.facter - locale environment variables were bad; continuing with LANG=C LC_ALL=C
Info: Using configured environment 'production'
Info: Retrieving pluginfacts
Info: Retrieving plugin
Info: Retrieving locales
Info: Loading facts
Info: Caching catalog for fullstackd-20220720052914.admin-monitoring.eqiad1.wikimedia.cloud
Info: Applying configuration version '(35f3b429e7) Filippo Giunchedi - prometheus: enable x509 CN validation in blackbox'
Notice: The LDAP client stack for this host is: sssd/sudo
Notice: /Stage[main]/Profile::Ldap::Client::Labs/Notify[LDAP client stack]/message: defined 'message' as 'The LDAP client stack for this host is: sssd/sudo'
Notice: Applied catalog in 3.98 seconds

root@fullstackd-20220720052914:~# grep server /etc/puppet/puppet.conf
server = puppetmaster.cloudinfra.wmflabs.org

root@fullstackd-20220720052914:~# dig +short -x  $(dig +short puppetmaster.cloudinfra.wmflabs.org)
cloud-puppetmaster-03.cloudinfra.eqiad1.wikimedia.cloud.
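
The forward-then-reverse lookup done with dig above could also be scripted. This is a sketch only: `reverse_of_forward` is a hypothetical helper, and it goes through the system resolver (which may consult /etc/hosts) rather than querying DNS directly as dig does, so results can differ; the returned name also usually lacks the trailing dot.

```python
import socket


def reverse_of_forward(hostname,
                       resolve=socket.gethostbyname,
                       reverse=socket.gethostbyaddr):
    """Resolve `hostname` to an IPv4 address, then return the name that
    the reverse lookup of that address gives back.

    `resolve` and `reverse` are injectable so the logic can be
    exercised without network access.
    """
    address = resolve(hostname)
    name, _aliases, _addresses = reverse(address)
    return name
```

For example, `reverse_of_forward("puppetmaster.cloudinfra.wmflabs.org")` would be expected to return the cloud-puppetmaster-03 name if the service name still points at that host.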

Closing

dcaro moved this task from To refine to Done on the User-dcaro board.