Various puppet issues in deployment-prep
Closed, Declined · Public

Authored By

	hashar
	Nov 20 2017, 9:26 AM

Description

Filed from a mail by @Andrew to Release-Engineering-Team

@Andrew is in the process of upgrading our various puppetmasters to modern versions. Before he takes the next step (https://gerrit.wikimedia.org/r/#/c/392172/) he is trying to get a grip on the current level of breakage so he can tell what (if anything) is broken additionally by the new parser.

As of November 19th, there are 11 VMs in deployment-prep showing up with puppet errors or failures.
<snip>

deployment-cache-text04
deployment-cache-upload04

Error: /usr/share/varnish/reload-vcl  && (rm /var/tmp/reload-vcl-failed; true) returned 1 instead of one of [0]
Error: /Stage[main]/Cacheproxy::Instance_pair/Varnish::Instance[text-backend]/Exec[retry-load-new-vcl-file]/returns: change from notrun to 0 failed: /usr/share/varnish/reload-vcl && (rm /var/tmp/reload-vcl-failed; true) returned 1 instead of one of [0]

In hiera, the key between_bytes_timeout was missing from cache::app_def_be_opts for both hosts.

deployment-changeprop
deployment-redis06
Puppet got disabled as part of T179684

puppet disabled 'Testing changeprop-redis issue T179684'
Puppet has been disabled for 5248 minutes

puppet enabled again.

deployment-mx

Error: Could not retrieve catalog from remote server: Error 400 on SERVER: You can only use systemd resources on systems with systemd, got upstart at /etc/puppet/modules/systemd/manifests/init.pp:8 on node deployment-mx.deployment-prep.eqiad.wmflabs

Now requires systemd, must be reimaged to jessie or stretch.
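
The error above comes from a puppet-side check on the init system. As a rough way to confirm the same thing on the host itself, here is a minimal sketch that assumes the directory test documented for sd_booted(3); it is not the check the puppet systemd module performs, and the function name is made up for illustration.

```python
import os


def booted_with_systemd(marker_dir="/run/systemd/system"):
    """Return True if the systemd runtime directory exists.

    sd_booted(3) documents this directory test; the puppet module in the
    error message relies on its own facts instead (assumption: the two agree).
    """
    return os.path.isdir(marker_dir)


# On an upstart host such as deployment-mx this is expected to print False.
print(booted_with_systemd())
```

A host where this returns False cannot take the systemd resources until it is reimaged.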

deployment-phab

can be ignored for now, will be recreated eventually

Can't log into this one; maybe it never finished building. From https://horizon.wikimedia.org/project/instances/5152bd16-e455-402e-829f-57ea4097f4f6/console

Could not retrieve catalog from remote server: Error 400 on SERVER:
Could not find data item phabricator_cluster_search in any Hiera data file and no default supplied at /etc/puppet/modules/profile/manifests/phabricator/main.pp:19
on node deployment-phab.deployment-prep.eqiad.wmflabs

The instance was created on October 26th by @mmodell.

deployment-tin

Various apt/packaging dependency issues

The scap package had a wrong version number. See T180935#3774826

deployment-kafka-jumbo-1 and -2

package dependency issues breaking apt

Related Objects

		Status	Subtype	Assigned	Task
		Resolved		EddieGP	T132259 Deployment-prep hosts with puppet errors (tracking)
		Declined		hashar	T180935 Various puppet issues in deployment-prep

Event Timeline

hashar created this task. Nov 20 2017, 9:26 AM

Restricted Application added a subscriber: Aklapper. Nov 20 2017, 9:26 AM

hashar triaged this task as High priority. Nov 20 2017, 9:27 AM

hashar added a parent task: T132259: Deployment-prep hosts with puppet errors (tracking).

Mentioned in SAL (#wikimedia-releng) [2017-11-20T09:29:36Z] <hashar> deployment-tin: apt-mark hold scap | the apt-repo on deployment-tin is out of date | T180935

deployment-cache-text04 and deployment-cache-upload04 are broken because hieradata for roles is not applied on labs (T120165). The values are set via the Horizon Prefix Puppet page instead.

cache::app_def_be_opts is missing between_bytes_timeout. I have synced the settings from puppet.git. For the prefix deployment-cache-text:

before:

cache::app_def_be_opts:
  connect_timeout: 5s
  first_byte_timeout: 180s
  max_connections: 1000
  port: 80

after:

cache::app_def_be_opts:
  port: 80
  connect_timeout: '3s'
  first_byte_timeout: '63s'     
  between_bytes_timeout: '31s'  
  max_connections: 1000

For the prefix deployment-cache-upload:

before:

cache::app_def_be_opts:
  connect_timeout: 5s
  first_byte_timeout: 35s
  max_connections: 10000
  port: 80

after:

cache::app_def_be_opts:
  port: 80
  connect_timeout: '5s'
  first_byte_timeout: '35s'
  between_bytes_timeout: '60s'
  max_connections: 10000

That fixed puppet on both deployment-cache-text04 and deployment-cache-upload04.
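
To catch this kind of drift before puppet fails, a small check over the prefix hiera values could look like the sketch below. The list of required keys is an assumption inferred from the synced "after" settings above, not read from the puppet manifests, and the function name is hypothetical.

```python
# Keys assumed to be required in cache::app_def_be_opts (inferred from the
# "after" blocks above; not taken from the varnish manifests themselves).
REQUIRED_KEYS = (
    "port",
    "connect_timeout",
    "first_byte_timeout",
    "between_bytes_timeout",
    "max_connections",
)


def missing_backend_opts(opts):
    """Return the required keys absent from opts, in REQUIRED_KEYS order."""
    return [key for key in REQUIRED_KEYS if key not in opts]


# Values copied from the deployment-cache-text prefix, before and after.
before_text = {
    "connect_timeout": "5s",
    "first_byte_timeout": "180s",
    "max_connections": 1000,
    "port": 80,
}
after_text = {
    "port": 80,
    "connect_timeout": "3s",
    "first_byte_timeout": "63s",
    "between_bytes_timeout": "31s",
    "max_connections": 1000,
}

print(missing_backend_opts(before_text))  # ['between_bytes_timeout']
print(missing_backend_opts(after_text))   # []
```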

Mentioned in SAL (#wikimedia-releng) [2017-11-20T09:39:06Z] <hashar> deployment-prep added missing key between_bytes_timeout to cache::app_def_be_opts for deployment-cache-text04 and deployment-cache-upload04 | T180935

hashar updated the task description. Nov 20 2017, 9:40 AM

hashar added a subtask: T179684: Kafka sometimes misses to rebalance topics properly.

hashar updated the task description. Nov 20 2017, 9:44 AM

hashar updated the task description. Nov 20 2017, 9:58 AM

hashar updated the task description.

hashar added a subscriber: • mmodell.

Mentioned in SAL (#wikimedia-releng) [2017-11-20T10:05:40Z] <hashar> deployment-phab : set hiera 'phabricator_cluster_search: []' trying to unblock puppet and soft rebooted the instance | T180935

hashar removed a subtask: T179684: Kafka sometimes misses to rebalance topics properly. Nov 20 2017, 10:35 AM

hashar updated the task description.

If deployment-mx is still in use/needed, it should be reimaged to jessie or stretch.

deployment-tin seems to be failing because scap is on hold. Since 3.7.3 is also on apt.wikimedia.org, "apt-mark unhold scap" should fix it.

I have marked scap on hold to get the version from apt.wikimedia.org, else it tries to get an outdated version generated by CI (from deployment-prep which has priority 1500).

$ apt-cache policy scap
scap:
  Installed: 3.7.3-1
  Candidate: 3.6.0-1~20171117182426.238
  Version table:
 *** 3.7.3-1 0
       1001 http://apt.wikimedia.org/wikimedia/ jessie-wikimedia/main amd64 Packages
        100 /var/lib/dpkg/status
     3.6.0-1~20171117182426.238 0
       1500 http://deployment-tin.deployment-prep.eqiad.wmflabs/repo/ jessie-deployment-prep/main amd64 Packages

Would have to sort that out with @mmodell
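
The output above shows why apt wants the older-looking package: the deployment-prep repository is pinned at 1500, higher than the 1001 of apt.wikimedia.org, and apt prefers the version with the highest pin priority (a priority of 1000 or more permits a downgrade). A minimal sketch of that selection follows. It assumes the jessie-era `apt-cache policy` layout pasted above, and it ignores the tie-break on version number that apt applies when priorities are equal. The function names are hypothetical.

```python
POLICY_OUTPUT = """\
scap:
  Installed: 3.7.3-1
  Candidate: 3.6.0-1~20171117182426.238
  Version table:
 *** 3.7.3-1 0
       1001 http://apt.wikimedia.org/wikimedia/ jessie-wikimedia/main amd64 Packages
        100 /var/lib/dpkg/status
     3.6.0-1~20171117182426.238 0
       1500 http://deployment-tin.deployment-prep.eqiad.wmflabs/repo/ jessie-deployment-prep/main amd64 Packages
"""


def highest_priority_per_version(policy_text):
    """Map each version in the version table to its highest source priority."""
    priorities = {}
    current = None
    in_table = False
    for line in policy_text.splitlines():
        if line.strip() == "Version table:":
            in_table = True
            continue
        if not in_table:
            continue
        tokens = line.replace("***", "").split()
        if len(tokens) == 2 and tokens[1].isdigit() and not tokens[0].isdigit():
            # Version line, e.g. "3.7.3-1 0"
            current = tokens[0]
            priorities[current] = 0
        elif tokens and tokens[0].isdigit() and current is not None:
            # Source line, e.g. "1001 http://..."; keep the highest priority
            priorities[current] = max(priorities[current], int(tokens[0]))
    return priorities


def pick_candidate(policy_text):
    """Return the version with the highest pin priority (ties not handled)."""
    priorities = highest_priority_per_version(policy_text)
    return max(priorities, key=priorities.get)


print(highest_priority_per_version(POLICY_OUTPUT))
print(pick_candidate(POLICY_OUTPUT))  # 3.6.0-1~20171117182426.238
```

The result matches the Candidate line reported by apt-cache, which is why putting scap on hold kept 3.7.3-1 in place.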

for deployment-phab we could do https://phabricator.wikimedia.org/P6353 (syntax untested but should work)

In T180935#3774154, @hashar wrote:
I have marked scap on hold to get the version from apt.wikimedia.org, else it tries to get an outdated version generated by CI (from deployment-prep which has priority 1500).
$ apt-cache policy scap
scap:
  Installed: 3.7.3-1
  Candidate: 3.6.0-1~20171117182426.238
  Version table:
 *** 3.7.3-1 0
       1001 http://apt.wikimedia.org/wikimedia/ jessie-wikimedia/main amd64 Packages
        100 /var/lib/dpkg/status
     3.6.0-1~20171117182426.238 0
       1500 http://deployment-tin.deployment-prep.eqiad.wmflabs/repo/ jessie-deployment-prep/main amd64 Packages
Would have to sort that out with @mmodell

@mmodell is out this week from the looks of it. The package built by CI is actually a more recent version of scap. CI builds scap from master, the branch where scap development is done, rather than from release, the branch into which master is merged for production deployments after cooking on beta for a while.

The version numbers looked wrong because although scap/version.py was being updated as part of development, the debian/changelog in the master branch was missing a bunch of stuff. I pushed up rMSCAfa9df0ce262b982bc0c3a5f0bdfb34836774902e and ran apt-mark unhold scap so this should be fixed now.
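
A mismatch like this between debian/changelog and the version in the code can be detected mechanically. The sketch below parses the topmost changelog entry and compares its upstream version with a version string. The sample changelog text and the "3.7.3" value passed in are hypothetical illustrations, not the actual contents of the scap repository, and the function names are made up.

```python
import re

# First line of a Debian changelog entry:
#   package (version) distribution; urgency=...
HEADER = re.compile(r"^(?P<package>\S+) \((?P<version>[^)]+)\) (?P<dists>[^;]+);")


def changelog_upstream_version(changelog_text):
    """Return the upstream part of the version in the topmost changelog entry."""
    first_line = changelog_text.lstrip().splitlines()[0]
    match = HEADER.match(first_line)
    if match is None:
        raise ValueError("unrecognised changelog header: %r" % first_line)
    version = match.group("version")
    version = version.split(":", 1)[-1]      # drop an epoch such as "1:"
    if "-" in version:
        version = version.rsplit("-", 1)[0]  # drop the Debian revision such as "-1"
    return version


def changelog_is_stale(changelog_text, code_version):
    """True when the changelog's upstream version differs from the code's."""
    return changelog_upstream_version(changelog_text) != code_version


# Hypothetical changelog content, for illustration only.
SAMPLE_CHANGELOG = """\
scap (3.6.0-1) unstable; urgency=low

  * Hypothetical entry for illustration.

 -- Example Maintainer <maintainer@example.org>  Fri, 17 Nov 2017 18:24:26 +0000
"""

print(changelog_upstream_version(SAMPLE_CHANGELOG))   # 3.6.0
print(changelog_is_stale(SAMPLE_CHANGELOG, "3.7.3"))  # True
```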

hashar updated the task description. Nov 21 2017, 7:55 AM

Thanks @thcipriani and indeed deployment-tin works just fine now :]

deployment-mx

Error: Could not retrieve catalog from remote server: Error 400 on SERVER: You can only use systemd resources on systems with systemd, got upstart at /etc/puppet/modules/systemd/manifests/init.pp:8 on node deployment-mx.deployment-prep.eqiad.wmflabs

I did a git bisect, which points at 052e3a87c143a3e736f7a9b84c140a8cb0ad7f22 (T179565). That commit adds the class mtail::program to role::mail::mx, and that class only comes with a systemd template.

Looking at past puppet logs, it seems LetsEncrypt/Nginx has been broken for a while.

Anyway as Moritz said, we should rebuild it to Jessie/Stretch.

hashar updated the task description. Nov 21 2017, 8:55 AM

Per Mukunda, deployment-phab will be recreated anyway. So we can ignore it for now.

@hashar Let's break out the remaining issues from this task into subtasks of T132259: Deployment-prep hosts with puppet errors (tracking) and close this one (they're kind of redundant tasks).

The original purpose of this task was to have puppet upgraded on all the beta cluster instances. I have created T184114 to upgrade the last few remaining ones.

As for puppet being broken on several instances, indeed we could use some new tasks. The reasons listed in this task are no longer accurate, so I am declining it as outdated.

In T180935#3872752, @hashar wrote:

As for puppet being broken on several instances, indeed we could use some new tasks. The reasons listed in this task are no longer accurate, so I am declining it as outdated.

What a mess. It was disappointing to find deployment-prep in this state. I've opened some more tasks.
