Timeouts on puppetserver1002 past reboot
Closed, Resolved (Public)

Description

puppetserver1002 was rebooted earlier today, and since then some Puppet runs have been failing. This might be a hardware issue unveiled by the reboot, maybe a broken SFP or cable?

Two random samples:

14:21:08	info	Using environment 'production'	
14:22:18	info	Retrieving pluginfacts	
14:22:38	info	Retrieving plugin	
14:22:57	info	Loading facts	
14:23:01	warning	The current total number of facts: 2998 exceeds the number of facts limit: 2048	
14:24:37	info	Caching catalog for wikikube-worker1155.eqiad.wmnet	
14:24:38	info	Applying configuration version '(b58328f57c) JMeybohm - kubernetes: Remove absent rsyslog config: block-docker-mount-spam'	
14:31:17	err	/Stage[main]/Geoip::Data::Puppet/File[/usr/share/GeoIP]

Failed to generate additional resources using 'eval_generate': Request to https://puppetserver1002.eqiad.wmnet:8140/puppet/v3/file_metadatas/volatile/GeoIP?recurse=true&max_files=0&links=manage&checksum_type=sha256&source_permissions=ignore&environment=production interrupted after 38.513 seconds
Wrapped exception:
end of file reached	/srv/puppet_code/environments/production/modules/geoip/manifests/data/puppet.pp:21
14:32:17	err	/Stage[main]/Geoip::Data::Puppet/File[/usr/share/GeoIP]

Could not evaluate: Could not retrieve file metadata for puppet:///volatile/GeoIP: Request to https://puppetserver1002.eqiad.wmnet:8140/puppet/v3/file_metadata/volatile/GeoIP?links=manage&checksum_type=sha256&source_permissions=ignore&environment=production timed out connect operation after 60.062 seconds	/srv/puppet_code/environments/production/modules/geoip/manifests/data/puppet.pp:21
14:33:20	err	/Stage[main]/Systemd/File[/usr/local/bin/systemd-timer-mail-wrapper]

Could not evaluate: Could not retrieve file metadata for puppet:///modules/systemd/systemd-timer-mail-wrapper.py: Request to https://puppetserver1002.eqiad.wmnet:8140/puppet/v3/file_metadata/modules/systemd/systemd-timer-mail-wrapper.py?links=manage&checksum_type=sha256&source_permissions=ignore&environment=production interrupted after 26.133 seconds	/srv/puppet_code/environments/production/modules/systemd/manifests/init.pp:32
14:34:02	err	/Stage[main]/Profile::Puppet::Agent/Motd::Script[last-puppet-run]/File[/etc/update-motd.d/97-last-puppet-run]

Could not evaluate: Could not retrieve file metadata for puppet:///modules/profile/puppet/97-last-puppet-run: Request to https://puppetserver1002.eqiad.wmnet:8140/puppet/v3/file_metadata/modules/profile/puppet/97-last-puppet-run?links=manage&checksum_type=sha256&source_permissions=ignore&environment=production failed after 42.187 seconds: SSL_connect returned=1 errno=0 peeraddr=10.64.16.19:8140 state=error: unexpected eof while reading	/srv/puppet_code/environments/production/modules/motd/manifests/script.pp:43
14:39:13	notice	/Stage[main]/Profile::Systemd::Timesyncd/Systemd::Unit[systemd-timesyncd.service]/File[/etc/systemd/system/systemd-timesyncd.service.d]

Dependency File[/usr/local/bin/systemd-timer-mail-wrapper] has failures: true	/srv/puppet_code/environments/production/modules/systemd/manifests/unit.pp:61
14:39:13	warning	/Stage[main]/Profile::Systemd::Timesyncd/Systemd::Unit[systemd-timesyncd.service]/File[/etc/systemd/system/systemd-timesyncd.service.d]

Skipping because of failed dependencies	/srv/puppet_code/environments/production/modules/systemd/manifests/unit.pp:61
14:39:13	warning	/Stage[main]/Profile::Systemd::Timesyncd/Systemd::Unit[systemd-timesyncd.service]/File[/etc/systemd/system/systemd-timesyncd.service.d/puppet-override.conf]

Skipping because of failed dependencies	/srv/puppet_code/environments/production/modules/systemd/manifests/unit.pp:78
14:39:13	warning	/Stage[main]/Profile::Systemd::Timesyncd/Systemd::Unit[systemd-timesyncd.service]/Exec[systemd daemon-reload for systemd-timesyncd.service (systemd-timesyncd.service)]

Skipping because of failed dependencies	/srv/puppet_code/environments/production/modules/systemd/manifests/unit.pp:88
14:39:13	warning	/Stage[main]/Profile::Systemd::Timesyncd/Systemd::Unit[systemd-timedated.service]/File[/etc/systemd/system/systemd-timedated.service.d]

Skipping because of failed dependencies	/srv/puppet_code/environments/production/modules/systemd/manifests/unit.pp:61
14:39:13	warning	/Stage[main]/Profile::Systemd::Timesyncd/Systemd::Unit[systemd-timedated.service]/File[/etc/systemd/system/systemd-timedated.service.d/puppet-override.conf]

Skipping because of failed dependencies	/srv/puppet_code/environments/production/modules/systemd/manifests/unit.pp:78
14:39:13	warning	/Stage[main]/Profile::Systemd::Timesyncd/Systemd::Unit[systemd-timedated.service]/Exec[systemd daemon-reload for systemd-timedated.service (systemd-timedated.service)]

Skipping because of failed dependencies	/srv/puppet_code/environments/production/modules/systemd/manifests/unit.pp:88
14:39:13	warning	/Stage[main]/Logrotate/Systemd::Unit[logrotate.timer:hourly-override]/File[/etc/systemd/system/logrotate.timer.d/puppet-override.conf]

Skipping because of failed dependencies	/srv/puppet_code/environments/production/modules/systemd/manifests/unit.pp:78
14:39:13	warning	/Stage[main]/Logrotate/Systemd::Unit[logrotate.timer:hourly-override]/Exec[systemd daemon-reload for logrotate.timer (logrotate.timer:hourly-override)]

Skipping because of failed dependencies	/srv/puppet_code/environments/production/modules/systemd/manifests/unit.pp:88
14:39:13	warning	/Stage[main]/Profile::Puppet::Client_bucket/Systemd::Timer::Job[clean_puppet_client_bucket]/Systemd::Unit[clean_puppet_client_bucket.service]/File[/lib/systemd/system/clean_puppet_client_bucket.service]

Skipping because of failed dependencies	/srv/puppet_code/environments/production/modules/systemd/manifests/unit.pp:78
14:39:13	warning	/Stage[main]/Profile::Puppet::Client_bucket/Systemd::Timer::Job[clean_puppet_client_bucket]/Systemd::Unit[clean_puppet_client_bucket.service]/Exec[systemd daemon-reload for clean_puppet_client_bucket.service (clean_puppet_client_bucket.service)]

Skipping because of failed dependencies	/srv/puppet_code/environments/production/modules/systemd/manifests/unit.pp:88
14:39:13	warning	/Stage[main]/Profile::Puppet::Agent/Systemd::Timer::Job[puppet-agent-timer]/Systemd::Unit[puppet-agent-timer.service]/File[/lib/systemd/system/puppet-agent-timer.service]

Skipping because of failed dependencies	/srv/puppet_code/environments/production/modules/systemd/manifests/unit.pp:78
14:39:13	warning	/Stage[main]/Profile::Puppet::Agent/Systemd::Timer::Job[puppet-agent-timer]/Systemd::Unit[puppet-agent-timer.service]/Exec[systemd daemon-reload for puppet-agent-timer.service (puppet-agent-timer.service)]

Skipping because of failed dependencies	/srv/puppet_code/environments/production/modules/systemd/manifests/unit.pp:88
14:42:03	err	/Stage[main]/Profile::Environment/File[/etc/vim/vimrc.local]

Could not evaluate: Could not retrieve file metadata for puppet:///modules/base/environment/vimrc.local: Request to https://puppetserver1002.eqiad.wmnet:8140/puppet/v3/file_metadata/modules/base/environment/vimrc.local?links=manage&checksum_type=sha256&source_permissions=ignore&environment=production interrupted after 29.058 seconds	/srv/puppet_code/environments/production/modules/profile/manifests/environment.pp:133
14:44:12	warning	/Stage[main]/Prometheus::Node_puppet_agent/Systemd::Unit[prometheus-puppet-agent-stats]/File[/lib/systemd/system/prometheus-puppet-agent-stats.service]

Skipping because of failed dependencies	/srv/puppet_code/environments/production/modules/systemd/manifests/unit.pp:78
14:44:12	warning	/Stage[main]/Prometheus::Node_puppet_agent/Systemd::Unit[prometheus-puppet-agent-stats]/Exec[systemd daemon-reload for prometheus-puppet-agent-stats.service (prometheus-puppet-agent-stats)]

Skipping because of failed dependencies	/srv/puppet_code/environments/production/modules/systemd/manifests/unit.pp:88
14:44:12	warning	/Stage[main]/Prometheus::Node_puppet_agent/Exec[enable prometheus-puppet-agent-stats]

Skipping because of failed dependencies	/srv/puppet_code/environments/production/modules/prometheus/manifests/node_puppet_agent.pp:47
14:44:46	err	/Stage[main]/Profile::Puppet::Agent/Rsyslog::Conf[puppet-agent]/File[/etc/rsyslog.d/10-puppet-agent.conf]

Could not evaluate: Could not retrieve file metadata for puppet:///modules/profile/puppet/rsyslog.conf: Request to https://puppetserver1002.eqiad.wmnet:8140/puppet/v3/file_metadata/modules/profile/puppet/rsyslog.conf?links=manage&checksum_type=sha256&source_permissions=ignore&environment=production interrupted after 34.195 seconds	/srv/puppet_code/environments/production/modules/rsyslog/manifests/conf.pp:55
14:49:58	warning	/Stage[main]/Prometheus::Node_puppet_agent/Systemd::Timer::Job[prometheus_puppet_agent_stats]/Systemd::Unit[prometheus_puppet_agent_stats.service]/File[/lib/systemd/system/prometheus_puppet_agent_stats.service]

Skipping because of failed dependencies	/srv/puppet_code/environments/production/modules/systemd/manifests/unit.pp:78
14:49:58	warning	/Stage[main]/Prometheus::Node_puppet_agent/Systemd::Timer::Job[prometheus_puppet_agent_stats]/Systemd::Unit[prometheus_puppet_agent_stats.service]/Exec[systemd daemon-reload for prometheus_puppet_agent_stats.service (prometheus_puppet_agent_stats.service)]

Skipping because of failed dependencies	/srv/puppet_code/environments/production/modules/systemd/manifests/unit.pp:88
14:49:58	warning	/Stage[main]/Smart/Systemd::Override[systemd-wait-longer-for-smartd]/Systemd::Unit[smartmontools-systemd-wait-longer-for-smartd]/File[/etc/systemd/system/smartmontools.service.d]

Skipping because of failed dependencies	/srv/puppet_code/environments/production/modules/systemd/manifests/unit.pp:61
14:49:58	warning	/Stage[main]/Smart/Systemd::Override[systemd-wait-longer-for-smartd]/Systemd::Unit[smartmontools-systemd-wait-longer-for-smartd]/File[/etc/systemd/system/smartmontools.service.d/systemd-wait-longer-for-smartd.conf]

Skipping because of failed dependencies	/srv/puppet_code/environments/production/modules/systemd/manifests/unit.pp:78
14:49:58	warning	/Stage[main]/Smart/Systemd::Override[systemd-wait-longer-for-smartd]/Systemd::Unit[smartmontools-systemd-wait-longer-for-smartd]/Exec[systemd daemon-reload for smartmontools.service (smartmontools-systemd-wait-longer-for-smartd)]

Skipping because of failed dependencies	/srv/puppet_code/environments/production/modules/systemd/manifests/unit.pp:88
14:49:58	warning	/Stage[main]/Smart/Systemd::Timer::Job[export_smart_data_dump]/Systemd::Unit[export_smart_data_dump.service]/File[/lib/systemd/system/export_smart_data_dump.service]

Skipping because of failed dependencies	/srv/puppet_code/environments/production/modules/systemd/manifests/unit.pp:78
14:49:58	warning	/Stage[main]/Smart/Systemd::Timer::Job[export_smart_data_dump]/Systemd::Unit[export_smart_data_dump.service]/Exec[systemd daemon-reload for export_smart_data_dump.service (export_smart_data_dump.service)]

Skipping because of failed dependencies	/srv/puppet_code/environments/production/modules/systemd/manifests/unit.pp:88
14:49:58	warning	/Stage[main]/Prometheus::Cadvisor/Systemd::Service[cadvisor]/Systemd::Unit[cadvisor]/File[/etc/systemd/system/cadvisor.service.d]

Skipping because of failed dependencies	/srv/puppet_code/environments/production/modules/systemd/manifests/unit.pp:61
14:49:58	warning	/Stage[main]/Prometheus::Cadvisor/Systemd::Service[cadvisor]/Systemd::Unit[cadvisor]/File[/etc/systemd/system/cadvisor.service.d/puppet-override.conf]

Skipping because of failed dependencies	/srv/puppet_code/environments/production/modules/systemd/manifests/unit.pp:78
14:49:58	warning	/Stage[main]/Prometheus::Cadvisor/Systemd::Service[cadvisor]/Systemd::Unit[cadvisor]/Exec[systemd daemon-reload for cadvisor.service (cadvisor)]

Skipping because of failed dependencies	/srv/puppet_code/environments/production/modules/systemd/manifests/unit.pp:88
14:49:58	warning	/Stage[main]/Prometheus::Cadvisor/Systemd::Service[cadvisor]/Service[cadvisor]

Skipping because of failed dependencies	/srv/puppet_code/environments/production/modules/systemd/manifests/service.pp:59
14:49:58	warning	/Stage[main]/Prometheus::Ethtool_exporter/Systemd::Service[prometheus-ethtool-exporter]/Systemd::Unit[prometheus-ethtool-exporter]/File[/etc/systemd/system/prometheus-ethtool-exporter.service.d]

Skipping because of failed dependencies	/srv/puppet_code/environments/production/modules/systemd/manifests/unit.pp:61
14:49:58	warning	/Stage[main]/Prometheus::Ethtool_exporter/Systemd::Service[prometheus-ethtool-exporter]/Systemd::Unit[prometheus-ethtool-exporter]/File[/etc/systemd/system/prometheus-ethtool-exporter.service.d/puppet-override.conf]

Skipping because of failed dependencies	/srv/puppet_code/environments/production/modules/systemd/manifests/unit.pp:78
14:49:58	warning	/Stage[main]/Prometheus::Ethtool_exporter/Systemd::Service[prometheus-ethtool-exporter]/Systemd::Unit[prometheus-ethtool-exporter]/Exec[systemd daemon-reload for prometheus-ethtool-exporter.service (prometheus-ethtool-exporter)]

Skipping because of failed dependencies	/srv/puppet_code/environments/production/modules/systemd/manifests/unit.pp:88
14:49:58	warning	/Stage[main]/Prometheus::Ethtool_exporter/Systemd::Service[prometheus-ethtool-exporter]/Service[prometheus-ethtool-exporter]

Skipping because of failed dependencies	/srv/puppet_code/environments/production/modules/systemd/manifests/service.pp:59
14:50:04	warning	/Stage[main]/Base::Kernel/Systemd::Timer::Job[kernel-purge]/Systemd::Unit[kernel-purge.service]/File[/lib/systemd/system/kernel-purge.service]

Skipping because of failed dependencies	/srv/puppet_code/environments/production/modules/systemd/manifests/unit.pp:78
14:50:04	warning	/Stage[main]/Base::Kernel/Systemd::Timer::Job[kernel-purge]/Systemd::Unit[kernel-purge.service]/Exec[systemd daemon-reload for kernel-purge.service (kernel-purge.service)]

Skipping because of failed dependencies	/srv/puppet_code/environments/production/modules/systemd/manifests/unit.pp:88
14:50:04	warning	/Stage[main]/Prometheus::Node_debian_version/Systemd::Timer::Job[prometheus-debian-version-textfile]/Systemd::Unit[prometheus-debian-version-textfile.service]/File[/lib/systemd/system/prometheus-debian-version-textfile.service]

Skipping because of failed dependencies	/srv/puppet_code/environments/production/modules/systemd/manifests/unit.pp:78
14:50:04	warning	/Stage[main]/Prometheus::Node_debian_version/Systemd::Timer::Job[prometheus-debian-version-textfile]/Systemd::Unit[prometheus-debian-version-textfile.service]/Exec[systemd daemon-reload for prometheus-debian-version-textfile.service (prometheus-debian-version-textfile.service)]

Skipping because of failed dependencies	/srv/puppet_code/environments/production/modules/systemd/manifests/unit.pp:88
14:50:04	warning	/Stage[main]/Prometheus::Node_dpkg_success/Systemd::Timer::Job[prometheus-dpkg-success-textfile]/Systemd::Unit[prometheus-dpkg-success-textfile.service]/File[/lib/systemd/system/prometheus-dpkg-success-textfile.service]

Skipping because of failed dependencies	/srv/puppet_code/environments/production/modules/systemd/manifests/unit.pp:78
14:50:04	warning	/Stage[main]/Prometheus::Node_dpkg_success/Systemd::Timer::Job[prometheus-dpkg-success-textfile]/Systemd::Unit[prometheus-dpkg-success-textfile.service]/Exec[systemd daemon-reload for prometheus-dpkg-success-textfile.service (prometheus-dpkg-success-textfile.service)]

Skipping because of failed dependencies	/srv/puppet_code/environments/production/modules/systemd/manifests/unit.pp:88
14:51:07	notice	Caught TERM; exiting

and

14:09:12	info	Using environment 'production'	
14:10:03	info	Retrieving pluginfacts	
14:10:04	info	Retrieving plugin	
14:10:11	info	Loading facts	
14:10:40	info	Caching catalog for es1044.eqiad.wmnet	
14:10:40	info	Applying configuration version '(7029a8a2c0) MVernon - partman: also add ms-be206[8-9] to partman_early_command'	
14:11:43	err	/Stage[main]/Puppet::Agent/File[/etc/facter/facter.conf]

Could not evaluate: Could not retrieve file metadata for puppet:///modules/puppet/facter.conf: Request to https://puppetserver1002.eqiad.wmnet:8140/puppet/v3/file_metadata/modules/puppet/facter.conf?links=manage&checksum_type=sha256&source_permissions=ignore&environment=production timed out connect operation after 60.061 seconds	/srv/puppet_code/environments/production/modules/puppet/manifests/agent.pp:42
14:23:39	err	/Stage[main]/Apt/Apt::Repository[wikimedia-private]/Concat[/etc/apt/sources.list.d/wikimedia-private.sources]/Concat_file[/etc/apt/sources.list.d/wikimedia-private.sources]

Failed to generate additional resources using 'eval_generate': Request to https://puppetserver1002.eqiad.wmnet:8140/puppet/v3/file_metadata/modules/apt/sources-deb822-header.txt?environment=production interrupted after 29.966 seconds
Wrapped exception:
end of file reached	/srv/puppet_code/environments/production/vendor_modules/concat/manifests/init.pp:122
14:24:39	err	/Stage[main]/Apt/Apt::Repository[debian-backports]/Concat[/etc/apt/sources.list.d/debian-backports.sources]/Concat_file[/etc/apt/sources.list.d/debian-backports.sources]

Failed to generate additional resources using 'eval_generate': Request to https://puppetserver1002.eqiad.wmnet:8140/puppet/v3/file_metadata/modules/apt/sources-deb822-header.txt?environment=production timed out connect operation after 60.064 seconds
Wrapped exception:
Net::OpenTimeout	/srv/puppet_code/environments/production/vendor_modules/concat/manifests/init.pp:122
14:26:29	err	/Stage[main]/Profile::Base::Certificates/Sslcert::Ca[GlobalSign_ECC_OV_SSL_CA_2018.crt]/File[/usr/local/share/ca-certificates/GlobalSign_ECC_OV_SSL_CA_2018.crt.crt]

Could not evaluate: Could not retrieve file metadata for puppet:///modules/base/ca/GlobalSign_ECC_OV_SSL_CA_2018.crt: Request to https://puppetserver1002.eqiad.wmnet:8140/puppet/v3/file_metadata/modules/base/ca/GlobalSign_ECC_OV_SSL_CA_2018.crt?links=manage&checksum_type=sha256&source_permissions=ignore&environment=production interrupted after 32.646 seconds	/srv/puppet_code/environments/production/modules/sslcert/manifests/ca.pp:40
14:27:29	err	/Stage[main]/Profile::Base::Certificates/Sslcert::Ca[GlobalSign_ECC_Root_CA_R5_R3_Cross.crt]/File[/usr/local/share/ca-certificates/GlobalSign_ECC_Root_CA_R5_R3_Cross.crt.crt]

Could not evaluate: Could not retrieve file metadata for puppet:///modules/base/ca/GlobalSign_ECC_Root_CA_R5_R3_Cross.crt: Request to https://puppetserver1002.eqiad.wmnet:8140/puppet/v3/file_metadata/modules/base/ca/GlobalSign_ECC_Root_CA_R5_R3_Cross.crt?links=manage&checksum_type=sha256&source_permissions=ignore&environment=production timed out connect operation after 60.063 seconds	/srv/puppet_code/environments/production/modules/sslcert/manifests/ca.pp:40
14:27:29	notice	/Stage[main]/Sslcert/Exec[update-ca-certificates]

Dependency File[/usr/local/share/ca-certificates/GlobalSign_ECC_OV_SSL_CA_2018.crt.crt] has failures: true	/srv/puppet_code/environments/production/modules/sslcert/manifests/init.pp:16
14:27:29	notice	/Stage[main]/Sslcert/Exec[update-ca-certificates]

Dependency File[/usr/local/share/ca-certificates/GlobalSign_ECC_Root_CA_R5_R3_Cross.crt.crt] has failures: true	/srv/puppet_code/environments/production/modules/sslcert/manifests/init.pp:16
14:27:29	warning	/Stage[main]/Sslcert/Exec[update-ca-certificates]

Skipping because of failed dependencies	/srv/puppet_code/environments/production/modules/sslcert/manifests/init.pp:16
14:27:29	warning	/Stage[main]/Profile::Pki::Client/Concat[/etc/cfssl/mutual_tls_client_cert.pem]/Concat_file[/etc/cfssl/mutual_tls_client_cert.pem]

Skipping because of failed dependencies	/srv/puppet_code/environments/production/vendor_modules/concat/manifests/init.pp:122
14:27:29	warning	/Stage[main]/Profile::Pki::Client/Concat[/etc/cfssl/mutual_tls_client_cert.pem]/File[/etc/cfssl/mutual_tls_client_cert.pem]

Skipping because of failed dependencies	
14:27:29	warning	/Stage[main]/Profile::Pki::Client/Concat::Fragment[mtls_client_cert_leaf]/Concat_fragment[mtls_client_cert_leaf]

Skipping because of failed dependencies	/srv/puppet_code/environments/production/vendor_modules/concat/manifests/fragment.pp:50
14:27:29	warning	/Stage[main]/Profile::Pki::Client/Concat::Fragment[mtls_client_cert_chain]/Concat_fragment[mtls_client_cert_chain]

Skipping because of failed dependencies	/srv/puppet_code/environments/production/vendor_modules/concat/manifests/fragment.pp:50
14:28:29	err	/Stage[main]/Profile::Rsyslog::Kafka_shipper/Rsyslog::Conf[template_syslog_json]/File[/etc/rsyslog.d/10-template-syslog-json.conf]

Could not evaluate: Could not retrieve file metadata for puppet:///modules/profile/rsyslog/template_syslog_json.conf: Request to https://puppetserver1002.eqiad.wmnet:8140/puppet/v3/file_metadata/modules/profile/rsyslog/template_syslog_json.conf?links=manage&checksum_type=sha256&source_permissions=ignore&environment=production timed out connect operation after 60.02 seconds	/srv/puppet_code/environments/production/modules/rsyslog/manifests/conf.pp:55
14:29:10	err	/Stage[main]/Profile::Firewall/Ferm::Conf[main]/File[/etc/ferm/conf.d/02_main]

Could not evaluate: Could not retrieve file metadata for puppet:///modules/base/firewall/main-input-default-drop.conf: Request to https://puppetserver1002.eqiad.wmnet:8140/puppet/v3/file_metadata/modules/base/firewall/main-input-default-drop.conf?links=manage&checksum_type=sha256&source_permissions=ignore&environment=production failed after 40.305 seconds: SSL_connect returned=1 errno=0 peeraddr=10.64.16.19:8140 state=error: unexpected eof while reading	/srv/puppet_code/environments/production/modules/ferm/manifests/conf.pp:14
14:30:57	err	/Stage[main]/Admin/Admin::Hashuser[filippo]/Admin::User[filippo]/File[/home/filippo]

Failed to generate additional resources using 'eval_generate': Request to https://puppetserver1002.eqiad.wmnet:8140/puppet/v3/file_metadatas/modules/admin/home/filippo?recurse=true&max_files=0&links=manage&checksum_type=sha256&source_permissions=ignore&environment=production interrupted after 28.624 seconds
Wrapped exception:
end of file reached	/srv/puppet_code/environments/production/modules/admin/manifests/user.pp:89
14:38:07	err	/Stage[main]/Admin/Admin::Hashuser[hnowlan]/Admin::User[hnowlan]/File[/home/hnowlan]

Failed to generate additional resources using 'eval_generate': Request to https://puppetserver1002.eqiad.wmnet:8140/puppet/v3/file_metadatas/modules/admin/home/skel?recurse=true&max_files=0&links=manage&checksum_type=sha256&source_permissions=ignore&environment=production interrupted after 29.987 seconds
Wrapped exception:
end of file reached	/srv/puppet_code/environments/production/modules/admin/manifests/user.pp:89
14:39:04	err	/Stage[main]/Admin/Admin::Hashuser[hnowlan]/Admin::User[hnowlan]/File[/home/hnowlan]

Could not evaluate: Could not retrieve file metadata for puppet:///modules/admin/home/skel: Request to https://puppetserver1002.eqiad.wmnet:8140/puppet/v3/file_metadata/modules/admin/home/skel?links=manage&checksum_type=sha256&source_permissions=ignore&environment=production failed after 56.26 seconds: SSL_connect returned=1 errno=0 peeraddr=10.64.16.19:8140 state=error: unexpected eof while reading	/srv/puppet_code/environments/production/modules/admin/manifests/user.pp:89
14:39:10	notice	Caught TERM; exiting
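
The failed requests in both samples fall into three shapes: connect timeouts at about 60 seconds, requests interrupted after roughly 26 to 39 seconds, and TLS handshakes ending in an unexpected EOF. A minimal Python sketch for tallying them from agent log messages follows; the category names and function names are hypothetical labels chosen for illustration, and the patterns assume only the message wording quoted above.

```python
import re
from collections import Counter

# Patterns copied from the agent messages quoted above. The category names
# are illustrative labels, not Puppet terminology.
PATTERNS = {
    "connect_timeout": re.compile(r"timed out connect operation after ([\d.]+) seconds"),
    "interrupted": re.compile(r"interrupted after ([\d.]+) seconds"),
    "tls_eof": re.compile(r"failed after ([\d.]+) seconds: SSL_connect .*unexpected eof"),
}


def classify(message):
    """Return (category, seconds) for a failed-request message, or None."""
    for name, pattern in PATTERNS.items():
        match = pattern.search(message)
        if match:
            return name, float(match.group(1))
    return None


def summarize(messages):
    """Count failures per category across an iterable of log messages."""
    counts = Counter()
    for message in messages:
        result = classify(message)
        if result is not None:
            counts[result[0]] += 1
    return counts
```

Feeding the `err` lines of a run through `summarize` gives a quick view of whether a host mostly fails to connect at all or gets cut off mid-request.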

Event Timeline

Some exception examples from the puppetserver logs (truncated, as they are pretty long):

2026-04-14T12:57:10.834Z ERROR [qtp1196799668-113293] [p.r.core] Internal Server Error: java.io.IOException: java.util.concurrent.TimeoutException: Idle timeout expired: 30410/30000 ms
	at org.eclipse.jetty.server.HttpInput$ErrorState.noContent(HttpInput.java:1137)
	at org.eclipse.jetty.server.HttpInput.read(HttpInput.java:335)
	at java.base/sun.nio.cs.StreamDecoder.readBytes(StreamDecoder.java:287)
	at java.base/sun.nio.cs.StreamDecoder.implRead(StreamDecoder.java:330)
	at java.base/sun.nio.cs.StreamDecoder.read(StreamDecoder.java:190)
--
2026-04-14T12:57:52.995Z ERROR [qtp1196799668-113293] [p.r.core] Internal Server Error: org.eclipse.jetty.io.EofException: Early EOF
	at org.eclipse.jetty.server.HttpInput$3.getError(HttpInput.java:1195)
	at org.eclipse.jetty.server.HttpInput$3.noContent(HttpInput.java:1183)
	at org.eclipse.jetty.server.HttpInput.read(HttpInput.java:335)
	at java.base/sun.nio.cs.StreamDecoder.readBytes(StreamDecoder.java:287)
	at java.base/sun.nio.cs.StreamDecoder.implRead(StreamDecoder.java:330)
--
2026-04-14T13:02:25.968Z ERROR [qtp1196799668-114384] [p.r.core] Internal Server Error: java.io.IOException: java.util.concurrent.TimeoutException: Idle timeout expired: 32185/30000 ms
	at org.eclipse.jetty.server.HttpInput$ErrorState.noContent(HttpInput.java:1137)
	at org.eclipse.jetty.server.HttpInput.read(HttpInput.java:335)
	at java.base/sun.nio.cs.StreamDecoder.readBytes(StreamDecoder.java:287)
	at java.base/sun.nio.cs.StreamDecoder.implRead(StreamDecoder.java:330)
	at java.base/sun.nio.cs.StreamDecoder.read(StreamDecoder.java:190)
--
2026-04-14T13:04:35.337Z ERROR [qtp1196799668-109465] [p.r.core] Internal Server Error: org.eclipse.jetty.io.EofException: Early EOF
	at org.eclipse.jetty.server.HttpInput$3.getError(HttpInput.java:1195)
	at org.eclipse.jetty.server.HttpInput$3.noContent(HttpInput.java:1183)
	at org.eclipse.jetty.server.HttpInput.read(HttpInput.java:335)
	at java.base/sun.nio.cs.StreamDecoder.readBytes(StreamDecoder.java:287)
	at java.base/sun.nio.cs.StreamDecoder.implRead(StreamDecoder.java:330)
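
The "Idle timeout expired" lines carry both the elapsed idle time and the configured limit (30000 ms in these excerpts). A minimal Python sketch for pulling those numbers out of the log, for example to compare them against the durations reported on the agent side; the function names are hypothetical and the regex assumes only the message format shown above.

```python
import re

# Matches the Jetty message format seen in the excerpts above.
IDLE_TIMEOUT = re.compile(r"Idle timeout expired: (\d+)/(\d+) ms")


def idle_timeout(line):
    """Return (elapsed_ms, limit_ms) if the line reports an idle timeout, else None."""
    match = IDLE_TIMEOUT.search(line)
    if match is None:
        return None
    return int(match.group(1)), int(match.group(2))


def overrun_ms(line):
    """Milliseconds by which the connection exceeded the idle limit, or None."""
    parsed = idle_timeout(line)
    if parsed is None:
        return None
    elapsed, limit = parsed
    return elapsed - limit
```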

I tried to reproduce these errors with two hosts which formerly had failing Puppet runs (by explicitly running them against the depooled puppetserver1002), but I could not reproduce them in something like 20 manual attempts each. The puppetserver is still in the state from yesterday; it has not been restarted yet.

Initially I suspected that this could be a subtle hardware error (we once had connectivity issues on a Ganeti node after a reboot; either the SFP or the cable was faulty and needed to be replaced, and the error only showed up after the reboot). This could still be the case, but then the errors would have shown up earlier: I had rebooted the server around 6:45 UTC.

It also can't be a kernel issue: 4 out of 6 Puppet servers run that kernel, and three of them are the same Dell server type. The software stack is also identical.

Around that time we had two other maintenance operations: the refresh of the debmonitor intermediate and the esams network maintenance. The Puppetserver dashboard shows substantially elevated compile times for 1002 only: https://grafana.wikimedia.org/goto/ffj5ov5ymn6rkb?orgId=1 (up to 20 minutes for the p99).

I'm inclined to restart Puppetserver on 1002 and simply repool it; most probably this was a temporary issue due to external circumstances (and if it still occurs, we can probe deeper).

Change #1272559 had a related patch set uploaded (by Muehlenhoff; author: Muehlenhoff):

[operations/dns@master] Revert "Depool puppetserver1002"

https://gerrit.wikimedia.org/r/1272559

Change #1272559 merged by Muehlenhoff:

[operations/dns@master] Revert "Depool puppetserver1002"

https://gerrit.wikimedia.org/r/1272559

MoritzMuehlenhoff claimed this task.

puppetserver1002 was repooled 45 minutes ago and has since been stable, marking as resolved.

Change #1273788 had a related patch set uploaded (by Muehlenhoff; author: Muehlenhoff):

[operations/dns@master] Depool puppetserver1002

https://gerrit.wikimedia.org/r/1273788

Change #1273788 merged by Muehlenhoff:

[operations/dns@master] Depool puppetserver1002

https://gerrit.wikimedia.org/r/1273788

This started again and I've just depooled 1002 again.

@MoritzMuehlenhoff I tried to reproduce the issue on Friday afternoon, but I was unable to trigger it with simulated loads via cumin. I ratcheted up the concurrency, but once I went over around 25 simultaneous puppet runs, the throughput decreased significantly. However, even when the runs were taking a very long time no similar errors appeared in the logs. Perhaps we need to do more live debugging on Monday.

Mentioned in SAL (#wikimedia-operations) [2026-04-21T12:28:32Z] <moritzm> update firmware on puppetserver1002: idrac from 6.10.30.20 to 7.20.80.50 T423282

Mentioned in SAL (#wikimedia-operations) [2026-04-21T12:29:04Z] <moritzm> update firmware on puppetserver1002: BIOS from 1.9.2 to 1.20.2 T423282

Mentioned in SAL (#wikimedia-operations) [2026-04-21T12:53:40Z] <moritzm> update firmware on puppetserver1002: NIC from 22.31.6 to 23.21.6 T423282

Same, I also could not reproduce this organically with manual Puppet runs. Since this doesn't really reproduce, but still strangely seems limited to puppetserver1002 (if this were caused by a specific Puppet run breaking the server, we should see it on 1001/1003 now that 1002 is depooled), I think a sensible next step would be to systematically rule out all issues with the host. And since the software state is identical, this could still be subtly hardware-related.

As such, I went ahead and upgraded the BIOS, iDRAC and NIC firmware:

  • iDRAC from 6.10.30.20 to 7.20.80.50
  • BIOS from 1.9.2 to 1.20.2
  • NIC from 22.31.6 to 23.21.6

With limited availability tomorrow I won't repool it yet; we can do that either on Thu or maybe rather on Monday, so that we can react if this happens to break even with the updated firmware. Does that sound like a good plan for the next step?

Poking at this further, I also noticed one other discrepancy: for some reason puppetserver1002 has the jdk variant of OpenJDK installed. This predates the current retention time of the apt logs on that server, so it has probably been like that since the first setup of that host (debdeploy operates on source packages, so it simply upgrades whatever is installed on each host). A fresh installation only installs the jre-headless flavour, so I suppose this was done manually at some point by John when debugging something.

jmm@cumin2002:~$ sudo cumin puppetserver* 'dpkg --list | grep openjdk'
6 hosts will be targeted:
puppetserver[2001-2002,2004].codfw.wmnet,puppetserver[1001-1003].eqiad.wmnet
OK to proceed on 6 hosts? Enter the number of affected hosts to confirm or "q" to quit: 6
===== NODE GROUP =====
(1) puppetserver1002.eqiad.wmnet
----- OUTPUT for command #1: 'dpkg --list | grep openjdk' -----
ii  openjdk-17-jdk:amd64                        17.0.18+8-1~deb12u1                  amd64        OpenJDK Development Kit (JDK)
ii  openjdk-17-jdk-headless:amd64               17.0.18+8-1~deb12u1                  amd64        OpenJDK Development Kit (JDK) (headless)
ii  openjdk-17-jre:amd64                        17.0.18+8-1~deb12u1                  amd64        OpenJDK Java runtime, using Hotspot JIT
ii  openjdk-17-jre-headless:amd64               17.0.18+8-1~deb12u1                  amd64        OpenJDK Java runtime, using Hotspot JIT (headless)
===== NODE GROUP =====
(5) puppetserver[2001-2002,2004].codfw.wmnet,puppetserver[1001,1003].eqiad.wmnet
----- OUTPUT for command #1: 'dpkg --list | grep openjdk' -----
ii  openjdk-17-jre-headless:amd64               17.0.18+8-1~deb12u1                  amd64        OpenJDK Java runtime, using Hotspot JIT (headless)
================
PASS |████████████████████| 100% (6/6) [00:01<00:00,  4.85hosts/s]
FAIL |                    |   0% (0/6) [00:01<?, ?hosts/s]
100.0% (6/6) success ratio (>= 100.0% threshold) for command #1: 'dpkg --list | grep openjdk'.
100.0% (6/6) success ratio (>= 100.0% threshold) of nodes successfully executed all commands.
jmm@cumin2002:~$
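
The comparison above (one host with four OpenJDK packages, five hosts with only `openjdk-17-jre-headless`) can also be done mechanically once the package names per host are collected. The following is a minimal Python sketch under that assumption; `odd_hosts` and its input format are hypothetical and not part of cumin or debdeploy.

```python
from collections import Counter


def odd_hosts(packages_by_host):
    """Return {host: (extra, missing)} for hosts that deviate from the
    most common package set across the fleet."""
    sets = {host: frozenset(pkgs) for host, pkgs in packages_by_host.items()}
    # Use the most frequent package set as the baseline to compare against.
    baseline, _ = Counter(sets.values()).most_common(1)[0]
    report = {}
    for host, pkgs in sets.items():
        if pkgs != baseline:
            report[host] = (sorted(pkgs - baseline), sorted(baseline - pkgs))
    return report


# Package names taken from the cumin output above.
fleet = {
    "puppetserver1001": ["openjdk-17-jre-headless"],
    "puppetserver1002": [
        "openjdk-17-jdk",
        "openjdk-17-jdk-headless",
        "openjdk-17-jre",
        "openjdk-17-jre-headless",
    ],
    "puppetserver1003": ["openjdk-17-jre-headless"],
    "puppetserver2001": ["openjdk-17-jre-headless"],
    "puppetserver2002": ["openjdk-17-jre-headless"],
    "puppetserver2004": ["openjdk-17-jre-headless"],
}
print(odd_hosts(fleet))
```

With this input only puppetserver1002 is reported, with the three jdk/jre packages listed as extra and nothing missing.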

I'll revert this and switch 1002 to only run jre-headless like the other Puppet servers.

nice, I didn't notice the jdk vs jre piece. I'm happy to test today.

This has been stable for a week, so I'm boldly resolving. It's not really clear whether the root cause was the jdk/jre mismatch or the outdated system firmware somehow causing a stability issue.