Puppet broken on restricted.bastion.wmcloud.org
Closed, Resolved · Public

Description

Hi folks,

Not sure if this is the correct tag, but Puppet has been broken on restricted.bastion.wmcloud.org for a while:

bastion-restricted-eqiad1-01 is a Cloud VPS bastion host (with mosh enabled) (labs::bastion)
The last Puppet run was at Mon Apr  5 09:39:10 UTC 2021 (30668 minutes ago). 
Last puppet commit: 
Last login: Mon Apr 26 16:45:36 2021 from 93.34.115.23


Apr 26 16:38:31 bastion-restricted-eqiad1-01 puppet-agent[2389]: Could not retrieve catalog from remote server: Error 500 on SERVER: Server Error: Evaluation Error: Error while evaluating a Function Call, Class[Ssh::Server]: parameter 'max_startups' expects a value of type Undef or Integer, got String (file: /etc/puppet/modules/profile/manifests/base.pp, line: 120, column: 5) on node bastion-restricted-eqiad1-01.bastion.eqiad.wmflabs
Apr 26 16:38:31 bastion-restricted-eqiad1-01 puppet-agent[2389]: Not using cache on failed catalog
Apr 26 16:38:31 bastion-restricted-eqiad1-01 puppet-agent[2389]: Could not retrieve catalog; skipping run

Event Timeline

Restricted Application added a subscriber: Aklapper.

Yeah, 30668 minutes is about 21 days, which matches the change above. cc: @jbond
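For reference, the arithmetic can be checked with a few lines of Python. This is only a sketch: it takes the timestamp and the minute count from the login banner quoted in the description and assumes both are in UTC.

```python
from datetime import datetime, timedelta

# Values copied from the login banner in the task description.
last_run = datetime(2021, 4, 5, 9, 39, 10)  # "Mon Apr  5 09:39:10 UTC 2021"
age = timedelta(minutes=30668)              # "30668 minutes ago"

days = age.days
hours, remainder = divmod(age.seconds, 3600)
minutes = remainder // 60

# Moment at which the banner was generated, if both values are accurate.
checked_at = last_run + age

print(f"{days} days, {hours} h, {minutes} min")  # 21 days, 7 h, 8 min
print(checked_at.isoformat())                    # 2021-04-26T16:47:10
```

The result lands on the afternoon of Apr 26, in line with the log timestamps in the description.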

Looks like the fix would be in Hiera (Horizon Hiera?), where "max_startups" is set to a string somewhere it should be an integer. Probably the quotes just need to be removed.

Change 682693 had a related patch set uploaded (by Jbond; author: John Bond):

[operations/puppet@production] C:ssh::server: Update the type checking to String

https://gerrit.wikimedia.org/r/682693

Thanks @Dzahn, I have created a CR to update the supported types, as the bastion host passes max_startups: 35:30:60, which should be supported. I'll take this forward tomorrow, but feel free to merge the change yourself if it's blocking.

confirmed the format of max_startups in sshd docs
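As an illustration of why a string has to be accepted: sshd's MaxStartups takes either a single number or a "start:rate:full" triple such as 35:30:60. The sketch below is a hypothetical Python validator, not the Puppet type used in the merged change. The function name is made up, and the range checks (rate between 1 and 100, start not greater than full) are assumptions about what sshd treats as sane.

```python
import re

# "start:rate:full", e.g. "35:30:60"
_TRIPLE = re.compile(r"^(\d+):(\d+):(\d+)$")


def is_valid_max_startups(value):
    """Return True if value looks like an acceptable MaxStartups setting.

    Hypothetical helper for illustration only.
    """
    if value is None:
        return True  # unset: leave sshd's default in place
    if isinstance(value, bool):
        return False  # bool is an int subclass in Python, reject it explicitly
    if isinstance(value, int):
        return value > 0
    if not isinstance(value, str):
        return False
    if value.isdigit():
        return int(value) > 0
    match = _TRIPLE.match(value)
    if not match:
        return False
    start, rate, full = (int(group) for group in match.groups())
    # Assumed sanity checks: rate is a percentage, start must not exceed full.
    return 1 <= rate <= 100 and start <= full


print(is_valid_max_startups("35:30:60"))  # True
print(is_valid_max_startups(10))          # True
print(is_valid_max_startups("60:30:35"))  # False
```

With an Integer-only check, the bastion's 35:30:60 value is rejected even though sshd itself accepts it, which is the failure shown in the description.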

compiled on bast1003 to make sure; no change there

Change 682693 merged by Dzahn:

[operations/puppet@production] C:ssh::server: Update the type checking to String

https://gerrit.wikimedia.org/r/682693

Mentioned in SAL (#wikimedia-cloud) [2021-04-26T18:06:30Z] <mutante> running puppet on bastion-restricted-eqiad1-01 after deploying fix for T281176

Mentioned in SAL (#wikimedia-cloud) [2021-04-26T18:08:36Z] <mutante> bastion-restricted-eqiad1-01 puppet runs again but still has unrelated issues, like "failed to fetch python3-dateutil" T281176 on stretch

Thanks @jbond I compiled, merged the change and ran puppet on the host.

@elukey Puppet run now finishes again but there are still unrelated issues like this:

The following NEW packages will be installed:
  python3-dateutil
0 upgraded, 1 newly installed, 0 to remove and 139 not upgraded.
Need to get 39.4 kB of archives.
After this operation, 228 kB of additional disk space will be used.
Err:1 http://apt.wikimedia.org/wikimedia jessie-wikimedia/openstack-mitaka-jessie amd64 python3-dateutil all 2.4.2-1~bpo8+1
  404  Not Found [IP: 208.80.154.30 80]
E: Failed to fetch http://apt.wikimedia.org/wikimedia/pool/openstack-mitaka-jessie/p/python-dateutil/python3-dateutil_2.4.2-1~bpo8+1_all.deb  404  Not Found [IP: 208.80.154.30 80]
E: Unable to fetch some archives, maybe run apt-get update or try with --fix-missing?
Notice: /Stage[main]/Base::Debdeploy/File[/usr/local/bin/apt-upgrade-activity]: Dependency Package[python3-dateutil] has failures: true
Warning: /Stage[main]/Base::Debdeploy/File[/usr/local/bin/apt-upgrade-activity]: Skipping because of failed dependencies
Warning: /Stage[main]/Base::Debdeploy/File[/etc/debdeploy-client]: Skipping because of failed dependencies

Mentioned in SAL (#wikimedia-cloud) [2021-04-26T18:14:29Z] <mutante> bastion-restricted-eqiad1-01, a stretch host, has openstack-mitaka-jessie.list in APT sources. mixed stretch/jessie sources cause issues T281176

Mentioned in SAL (#wikimedia-cloud) [2021-04-26T18:16:39Z] <mutante> bastion-restricted-eqiad1-01 - /etc/apt/sources.list.d# mv openstack-mitaka-jessie.list /root/ ; puppet agent -tv - fixed puppet run T281176

Dzahn claimed this task.

I moved the "openstack-mitaka-jessie" (should have never existed on a stretch host!) file out of sources.list.d and ran puppet again.

That did the trick:

Notice: /Stage[main]/Packages::Python3_dateutil/Package[python3-dateutil]/ensure: created

and puppet is now running without issues again.
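A quick way to spot this kind of mixed-release setup is to scan the one-line-style APT source entries for suites that do not belong to the host's release. The following is a rough sketch, not an existing tool: the function name is hypothetical, the sample lines are assumptions reconstructed from the apt error above (the real contents of openstack-mitaka-jessie.list are not shown in this task), and matching on the codename as a substring is a simplification.

```python
from typing import Iterable, List


def foreign_suites(lines: Iterable[str], codename: str) -> List[str]:
    """Return suites from deb/deb-src lines that do not mention codename."""
    found = []
    for raw in lines:
        line = raw.split("#", 1)[0].strip()  # drop comments and whitespace
        if not line:
            continue
        fields = line.split()
        if fields[0] not in ("deb", "deb-src"):
            continue
        rest = fields[1:]
        # Skip an optional "[option=value ...]" block before the URI.
        if rest and rest[0].startswith("["):
            while rest and not rest[0].endswith("]"):
                rest = rest[1:]
            rest = rest[1:]
        if len(rest) < 2:
            continue  # malformed entry: no suite after the URI
        suite = rest[1]
        if codename not in suite:
            found.append(suite)
    return found


# Assumed sample entries, for illustration only.
sample = [
    "deb http://apt.wikimedia.org/wikimedia stretch-wikimedia main",
    "deb http://apt.wikimedia.org/wikimedia jessie-wikimedia openstack-mitaka-jessie",
]
print(foreign_suites(sample, "stretch"))  # ['jessie-wikimedia']
```

On a real host the lines would come from /etc/apt/sources.list and the *.list files under /etc/apt/sources.list.d/.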

That apt sources.list bug can't die quickly enough.

ah, a known bug! gotcha, thanks