
Receiving warning emails about failed puppet runs - multiple instances
Closed, Resolved (Public)

Description

I'm getting all these emails:

Alert: puppet failed on fastcci-worker1.fastcci.eqiad.wmflabs
Alert: puppet failed on fastcci-puppetmaster.fastcci.eqiad.wmflabs
Alert: puppet failed on maps-tiles1.maps.eqiad.wmflabs
Alert: puppet failed on maps-tiles2.maps.eqiad.wmflabs
Alert: puppet failed on fastcci-main.fastcci.eqiad.wmflabs
Alert: puppet failed on maps-warper.maps.eqiad.wmflabs
Alert: puppet failed on fastcci-master.fastcci.eqiad.wmflabs

A look at /var/log/puppet.log on fastcci-worker1.fastcci.eqiad.wmflabs does not help me much :-/ It contains blocks like this:

Ignoring stale puppet agent lock for pid 16421
Info: Sleeping for 31 seconds (splay is enabled)
Info: Retrieving plugin
Info: Loading facts in /var/lib/puppet/lib/facter/puppet_config_dir.rb
Info: Loading facts in /var/lib/puppet/lib/facter/physicalcorecount.rb
Info: Loading facts in /var/lib/puppet/lib/facter/puppet_vardir.rb
Info: Loading facts in /var/lib/puppet/lib/facter/lldp.rb
Info: Loading facts in /var/lib/puppet/lib/facter/initsystem.rb
Info: Loading facts in /var/lib/puppet/lib/facter/pe_version.rb
Info: Loading facts in /var/lib/puppet/lib/facter/labsprojectfrommetadata.rb
Info: Loading facts in /var/lib/puppet/lib/facter/ganeti.rb
Info: Loading facts in /var/lib/puppet/lib/facter/root_home.rb
Info: Loading facts in /var/lib/puppet/lib/facter/apt.rb
Info: Caching catalog for fastcci-worker1.fastcci.eqiad.wmflabs
Info: Applying configuration version '1463921528'
Notice: /Stage[main]/Standard::Ntp::Client/Ntp::Daemon[client]/Service[ntp]/ensure: ensure changed 'stopped' to 'running'
Info: /Stage[main]/Standard::Ntp::Client/Ntp::Daemon[client]/Service[ntp]: Unscheduling refresh on Service[ntp]
Skipping this run, puppet agent already running at pid 18206

Event Timeline

Run puppet manually?

Also, you need to put this in the appropriate projects for maps and fastcci instances.

When I run it manually, it seems to hang after:

Info: /Stage[main]/Standard::Ntp::Client/Ntp::Daemon[client]/Service[ntp]: Unscheduling refresh on Service[ntp]

Could that be the reason for the stale lock? (next scheduled run encounters the previous hanging run?)
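One way to test that hypothesis is to check whether the pid recorded in the agent lock file still belongs to a running process. The following is only a minimal sketch: the lock file path is an assumption (the thread never names it, and it can differ between Puppet versions and configurations), and `pid_is_alive` and `lock_status` are hypothetical helper names.

```python
import os

# Assumed location of the agent lock file; not stated anywhere in this task.
LOCK_PATH = "/var/lib/puppet/state/agent_catalog_run.lock"


def pid_is_alive(pid: int) -> bool:
    """Return True if a process with this pid currently exists."""
    if pid <= 0:
        return False
    try:
        os.kill(pid, 0)  # signal 0 only checks for existence, nothing is sent
    except ProcessLookupError:
        return False
    except PermissionError:
        return True  # the process exists but belongs to another user
    return True


def lock_status(lock_path: str = LOCK_PATH) -> str:
    """Classify the lock file as missing, unreadable, stale, or held."""
    try:
        with open(lock_path) as handle:
            text = handle.read().strip()
    except FileNotFoundError:
        return "no lock"
    if not text.isdigit():
        return "unreadable lock"
    if pid_is_alive(int(text)):
        return "held by running process"
    return "stale lock"


if __name__ == "__main__":
    print(lock_status())
```

If this reports "held by running process" long after a run started, the earlier run is probably still hanging, which would match the "Skipping this run" line in the log. Reading the lock file may require root.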

I'm seeing the same issue with "Alert: puppet failed on wigi.wikidumpparse.eqiad.wmflabs". There's no indication of how to fix this.

In my /var/log/puppet.log I have many lines like these:

Notice: /Stage[main]/Ssh::Server/File[/etc/ssh/userkeys/ubuntu/.ssh/authorized_keys /public/keys/ubuntu/.ssh]: Not removing directory; use 'force' to override
Notice: /Stage[main]/Ssh::Server/File[/etc/ssh/userkeys/ubuntu/.ssh/authorized_keys /public/keys/ubuntu/.ssh]/ensure: removed

Andrew subscribed.

It looks to me like this is hanging because of /public/dumps issues. We've been facing similar issues over the last few weeks. A reboot will almost certainly fix everything; if any of the instances should not be rebooted let me know and I will pursue other means.
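To confirm that a mount such as /public/dumps is the thing that hangs, one can probe it from a child process with a timeout, so that the calling shell does not get stuck as well. This is a rough sketch, not an established diagnostic: `probe_path` is a hypothetical helper, and the 5-second timeout is an arbitrary assumption.

```python
import subprocess
import sys


def probe_path(path: str, timeout: float = 5.0) -> str:
    """Stat a path in a child process; return 'ok', 'error', or 'hung'."""
    # The stat happens in a child so that a blocked filesystem call
    # cannot freeze the calling process.
    child = subprocess.Popen(
        [sys.executable, "-c", "import os, sys; os.stat(sys.argv[1])", path],
        stdout=subprocess.DEVNULL,
        stderr=subprocess.DEVNULL,
    )
    try:
        return_code = child.wait(timeout=timeout)
    except subprocess.TimeoutExpired:
        # Best effort only: a process stuck in uninterruptible I/O may not
        # exit, so we deliberately do not wait for it again.
        child.kill()
        return "hung"
    return "ok" if return_code == 0 else "error"


if __name__ == "__main__":
    # /public/dumps is the path named in this task.
    print(probe_path("/public/dumps"))
```

A result of "hung" suggests the mount is not responding; "error" only means the path could not be stat'ed (for example, it does not exist on that instance).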

@notconfusing rebooting the instances took care of it. Sorry about the slow answer. I rebooted a single instance first and wanted to make sure it helped before rebooting all of them.

Yes, rebooting through the wikitech web interface took care of it for me too.

That's great! As always, this is due to various terrible NFS problems which are difficult to deal with comprehensively. In theory, after the reboot you should be less subject to NFS-related hangs.