Puppet failures on many hosts
Closed, ResolvedPublic

Description

I have been getting emails about Puppet failures on deployment-prep hosts since around 2020-10-30.

e.g.

Puppet is failing to run on the "deployment-ores01.deployment-prep.eqiad.wmflabs" instance in Wikimedia Cloud VPS.

Working Puppet runs are needed to maintain instance security and logins.
As long as Puppet continues to fail, this system is in danger of becoming
unreachable.

You are receiving this email because you are listed as member for the
project that contains this instance.  Please take steps to repair
this instance or contact a Cloud VPS admin for assistance.

For further support, visit #wikimedia-cloud on freenode or
<https://wikitech.wikimedia.org>

List of hosts I noticed were:

  • deployment-ores01
  • deployment-mdb01
  • deployment-cache-text06
  • deployment-kafka-jumbo-2
  • deployment-mx02 - T267831
  • deployment-kafka-main-1
  • deployment-kafka-main-2
  • deployment-cache-upload06
  • deployment-memc08 - T267388
  • deployment-wdqs01
  • deployment-logstash03
  • deployment-kafka-jumbo-1

Event Timeline

The email notifications had been broken and were recently fixed, which is why we are receiving them again. For each affected host, one can look at the last few entries in the Puppet log (sudo tail -n50 /var/log/puppet.log), copy and paste the output into a subtask, and then subscribe whoever might know about the issue (that is the hard part).

Some of the fixes are probably straightforward. For deployment-memc08, the fix was just removing the quotes surrounding a Hiera value, because the Puppet class was changed to enforce float/int types for its parameters (T267388).
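As a rough illustration of the deployment-memc08 kind of breakage, the sketch below scans Hiera-style `key: value` lines for numeric values that are wrapped in quotes. The key names in the sample are hypothetical (not taken from T267388), the line parsing is deliberately naive rather than a real YAML parser, and whether a quoted number is actually wrong depends on the type the Puppet class declares for that parameter.

```python
import re

# A value that is only a quoted integer or float, e.g. '1.05' or "64".
QUOTED_NUMBER = re.compile(r"""^(['"])(-?\d+(?:\.\d+)?)\1$""")


def find_quoted_numbers(hiera_text):
    """Return (key, number) pairs for values that look numeric but are quoted."""
    hits = []
    for line in hiera_text.splitlines():
        # Split on the last ": " so that "::" inside the key is left alone.
        key, sep, value = line.strip().rpartition(": ")
        if not sep:
            continue
        match = QUOTED_NUMBER.match(value.strip())
        if match:
            hits.append((key, match.group(2)))
    return hits


# Hypothetical keys, for illustration only.
sample = """\
profile::example::growth_factor: '1.05'
profile::example::port: 11211
profile::example::name: 'memc'
"""

for key, number in find_quoted_numbers(sample):
    print(f"{key}: quoted number {number}; check the declared parameter type")
```

Intentionally quoted strings such as `'no'` are not flagged, since only values that parse as a number are reported.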

Mentioned in SAL (#wikimedia-releng) [2020-11-09T14:18:06Z] <hashar> deployment-prep: removing role::beta::availability_collector from deployment-cache06 since that broke puppet. The role no more exists # T267006

There is literally nobody tasked with fixing those. The way I do it is:

  • log in to the instance
  • check the Puppet log: sudo tail -n 100 /var/log/puppet.log
  • create a task and add related projects, possibly subscribing directly the authors of patches that might have broken it
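The steps above could be partly scripted. The following is only a sketch: the helper names are hypothetical, it assumes SSH access to the instance, and it assumes that failure lines in the log contain the text "Error:", which may not hold for every log format. The sample log text is a placeholder, not real Puppet output.

```python
def tail_command(host, lines=100):
    """Build the ssh command that tails the Puppet log on a host."""
    return ["ssh", host, "sudo", "tail", "-n", str(lines), "/var/log/puppet.log"]


def error_lines(log_text, limit=5):
    """Return the last few log lines that look like errors."""
    errors = [line for line in log_text.splitlines() if "Error:" in line]
    return errors[-limit:]


# Placeholder log text; a real run would capture the output of tail_command().
placeholder_log = """\
Notice: placeholder informational line
Error: placeholder failure message
Notice: another placeholder line
"""

host = "deployment-ores01.deployment-prep.eqiad.wmflabs"
print(" ".join(tail_command(host)))
for line in error_lines(placeholder_log):
    print(line)
```

The filtered lines are what would be pasted into the subtask; deciding whom to subscribe remains a manual step.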

Adding the following Hiera settings to match production fixed deployment-ores01:

profile::tlsproxy::envoy::ensure: present
profile::tlsproxy::envoy::global_cert_name: ores.discovery.wmnet
profile::tlsproxy::envoy::services:
- port: 8081
  server_names:
  - '*'
profile::tlsproxy::envoy::sni_support: 'no'

Change 641167 had a related patch set uploaded (by Jbond; owner: John Bond):
[operations/puppet@production] mariadb: create empty section mappings for cloud

https://gerrit.wikimedia.org/r/641167

The deployment-kafka-* servers were all working fine when tested.

Change 641167 merged by Jbond:
[operations/puppet@production] mariadb: create empty section mappings for cloud

https://gerrit.wikimedia.org/r/641167

Change 641169 had a related patch set uploaded (by Jbond; owner: John Bond):
[operations/puppet@production] profile::query_service: add default for username

https://gerrit.wikimedia.org/r/641169

Change 641169 merged by Jbond:
[operations/puppet@production] profile::query_service: add default for username

https://gerrit.wikimedia.org/r/641169

Change 641171 had a related patch set uploaded (by Jbond; owner: John Bond):
[operations/puppet@production] profile::query_service: add federation_user_agent to deployment-prep

https://gerrit.wikimedia.org/r/641171

Change 641171 merged by Jbond:
[operations/puppet@production] profile::query_service: add federation_user_agent to deployment-prep

https://gerrit.wikimedia.org/r/641171

deployment-wdqs01 should be fixed by https://gerrit.wikimedia.org/r/641171; however, there is now a conflict with a local commit, which needs deleting:
c503964991 [LOCAL HACK] Fill placeholder for etcd::autogen_pwd_seed from root@deployment-puppetmaster04:/var/lib/git/operations/puppet

To fix deployment-logstash03:

  • rename role::logstash::apifeatureusage to profile::logstash::apifeatureusage in Horizon
  • rename role::logstash::collector to profile::logstash::collector in Horizon
  • add modules/secret/secrets/certificates/kafka_logstash-eqiad_broker/truststore.jks to the private repo
  • add the following to the kafka_clusters global Hiera value in the deployment-prep Puppet project config:
kafka_clusters:
  logging-eqiad:
    brokers:
      deployment-kafka-jumbo-3.deployment-prep.eqiad1.wikimedia.cloud:
        id: 3
        rack: C

The deployment-wdqs01 conflict with the local commit has been corrected; the host also needed a few extra YAML settings.

Regarding deployment-cache-upload06: Puppet runs successfully, but the following command fails on each execution. @Vgutierrez may be able to quickly spot the issue:

root@deployment-cache-upload06:~# /usr/local/sbin/reload-vcl -n frontend -f /etc/varnish/wikimedia_upload-frontend.vcl -d 2 -a
Executing: "/usr/bin/varnishadm -n frontend vcl.load vcl-a456afcf-3a67-4b97-a1ac-79cc88e69a86 /etc/varnish/wikimedia_upload-frontend.vcl"
Cannot open /var/lib/varnish/frontend/_.vsm: No such file or directory
Traceback (most recent call last):
  File "/usr/local/sbin/reload-vcl", line 179, in <module>
    main()
  File "/usr/local/sbin/reload-vcl", line 162, in main
    main_vcl_id = load(vadm_cmd, args.vcl_file)
  File "/usr/local/sbin/reload-vcl", line 123, in load
    do_cmd(vcl_load_cmd)
  File "/usr/local/sbin/reload-vcl", line 63, in do_cmd
    subprocess.check_call(cmd)
  File "/usr/lib/python3.7/subprocess.py", line 347, in check_call
    raise CalledProcessError(retcode, cmd)
subprocess.CalledProcessError: Command '['/usr/bin/varnishadm', '-n', 'frontend', 'vcl.load', 'vcl-a456afcf-3a67-4b97-a1ac-79cc88e69a86', '/etc/varnish/wikimedia_upload-frontend.vcl']' returned non-zero exit status 2.
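For readers less familiar with the traceback above: `subprocess.check_call` raises `CalledProcessError` whenever the child process exits with a non-zero status, so the Python error is only relaying the exit status 2 from varnishadm (whose own message is the "Cannot open ... _.vsm" line). The sketch below reproduces that behaviour with a stand-in child process; `run_and_report` is a hypothetical helper and is not part of reload-vcl.

```python
import subprocess
import sys


def run_and_report(cmd):
    """Run cmd; return 0 on success, or the child's exit status on failure."""
    try:
        subprocess.check_call(cmd)
    except subprocess.CalledProcessError as err:
        # err.cmd and err.returncode carry the details shown in the traceback.
        print(f"command {err.cmd!r} exited with status {err.returncode}",
              file=sys.stderr)
        return err.returncode
    return 0


# Stand-in for the failing varnishadm call: a child process that exits with 2.
failing = [sys.executable, "-c", "import sys; sys.exit(2)"]
status = run_and_report(failing)
print(status)
```

The useful diagnostic is therefore the child's own output, not the Python stack frames.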

All other systems are running Puppet successfully; however, someone more familiar with deployment-prep should check that everything is running as expected.

It looks like deployment-cache-upload06 was omitted from T267561:

vgutierrez@deployment-cache-upload06:/etc/varnish$ dpkg -l |grep varnish
ii  libvarnishapi1                       5.1.3-1wm15                  amd64        shared libraries for Varnish
ii  libvarnishapi2:amd64                 6.1.1-1+deb10u1              amd64        shared libraries for Varnish
ii  prometheus-varnish-exporter          1.4.1-1                      amd64        Prometheus exporter for Varnish
ii  prometheus-varnishkafka-exporter     0.1-1                        all          A Prometheus exporter daemon that exports metrics from varnishkafka logs.
ii  varnish                              5.1.3-1wm15                  amd64        state of the art, high-performance web accelerator
ii  varnish-dbg                          5.1.3-1wm15                  amd64        debugging symbols for varnish
ii  varnish-modules                      0.12.1-1+wmf2                amd64        Varnish module collection
ii  varnishkafka                         1.0.14-1                     amd64        Varnish to Kafka log streamer
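The listing above mixes Varnish 5 and Varnish 6 packages. A small sketch of how such a mismatch could be spotted automatically is shown below; the function names are hypothetical, the major version is taken naively as the text before the first dot (so versions with an epoch prefix would need extra handling), and the sample is a subset of the dpkg output above.

```python
from collections import defaultdict


def installed_versions(dpkg_output):
    """Map package name to version for installed ("ii") lines of dpkg -l output."""
    versions = {}
    for line in dpkg_output.splitlines():
        fields = line.split()
        if len(fields) >= 3 and fields[0] == "ii":
            name = fields[1].split(":")[0]  # drop an ":amd64" style qualifier
            versions[name] = fields[2]
    return versions


def group_by_major(versions, packages):
    """Group the given package names by the major part of their version."""
    groups = defaultdict(list)
    for name in packages:
        if name in versions:
            groups[versions[name].split(".")[0]].append(name)
    return dict(groups)


sample = """\
ii  libvarnishapi1                       5.1.3-1wm15                  amd64        shared libraries for Varnish
ii  libvarnishapi2:amd64                 6.1.1-1+deb10u1              amd64        shared libraries for Varnish
ii  varnish                              5.1.3-1wm15                  amd64        state of the art, high-performance web accelerator
"""

groups = group_by_major(
    installed_versions(sample),
    ["varnish", "libvarnishapi1", "libvarnishapi2"],
)
if len(groups) > 1:
    print("mixed Varnish major versions:", groups)
```

More than one key in the result means the host carries packages from different Varnish major versions.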

@Vgutierrez definitely. T267561 only focused on the text06 instance, and indeed we should have fixed this one as well. Maybe reopen that task to follow up? Some packages might be marked on hold (check with apt-mark showhold).

Puppet seems to be happy on deployment-cache-upload06 after upgrading to Varnish 6.

Thank you to everyone that acted on this task.

All the instances mentioned in this task are passing fine now! There are a couple more breakages, but they can be handled in independent tasks.