Puppet broken on several vms in toolsbeta
Closed, Resolved (Public)

Description

As the DNS changes highlight nicely, puppet is broken on several toolsbeta servers, and not all for the same reason. This needs fixing.

  • toolsbeta-docker-registry-01.toolsbeta.eqiad.wmflabs
  • toolsbeta-k8s-lb-01.toolsbeta.eqiad.wmflabs (deleted for now)
  • toolsbeta-proxy-01.toolsbeta.eqiad.wmflabs
  • toolsbeta-puppetdb-01.toolsbeta.eqiad.wmflabs
  • toolsbeta-sgegrid-master.toolsbeta.eqiad.wmflabs

Event Timeline

Bstorm triaged this task as Medium priority. Apr 23 2019, 10:04 PM
Bstorm created this task.

Change 504968 had a related patch set uploaded (by Bstorm; owner: Bstorm):
[operations/puppet@production] puppetdb: adapt the module so it works on Cloud VPS simply again

https://gerrit.wikimedia.org/r/504968

Change 504968 merged by Bstorm:
[operations/puppet@production] puppetdb: adapt the module so it works on Cloud VPS simply again

https://gerrit.wikimedia.org/r/504968

I hand-edited resolv.conf on these hosts so that they will survive the upcoming nameserver change.

toolsbeta-docker-registry-01 is failing on

Function lookup() did not find a value for the name 'profile::toolforge::docker::registry::standby_node' at /etc/puppet/modules/profile/manifests/toolforge/docker/registry.pp:1

toolsbeta-k8s-lb-01 isn't working because I didn't finish it: profile::toolforge::k8s::api_servers is empty and needs values. I may just delete that instance.

toolsbeta-proxy-01 seems kind of half-baked. I think it was someone's work toward making toolsbeta more like tools to test things against the proxy.
Function lookup() did not find a value for the name 'profile::toolforge::toolviews::mysql_password' at /etc/puppet/modules/profile/manifests/toolforge/toolviews.pp:4

I imagine this is a distant reflection of T101651.

For purposes here and now, I think I should delete it until someone is ready to babysit it. A proxy could break a lot of things if the config is messed up, so it'd be best to have it when people are actually working on it.

Mentioned in SAL (#wikimedia-cloud) [2019-06-14T15:55:12Z] <bstorm_> T221721 deleted toolsbeta-proxy-01 until it can be actively worked on.

Mentioned in SAL (#wikimedia-cloud) [2019-06-14T16:03:28Z] <bstorm_> T221721 hard rebooted toolsbeta-sgegrid-master because it had oomkilled basically everything

Sigh, toolsbeta-sgegrid-master cannot install jobutils, most likely because of weirdness in aptly. I think we ended up making the stretch toolsbeta repo there an actual repo, so jobutils would need to actually be there for it to work:

E: Unable to locate package jobutils

I'm not sure that matters much since that doesn't flat out break puppet, but it's one to note if testing things later.

Puppet is now able to run cleanly on the grid master.

The gridmaster service isn't running because apparently the shadow is!

Jun 14 16:09:07 toolsbeta-sgegrid-master sge_qmaster[16056]: critical error: qmaster on host "toolsbeta-sgegrid-shadow.toolsbeta.eqiad.wmflabs" is still running - terminating

Manually changed the /data/project/.system_sge/gridengine/default/common/act_qmaster file and got the process going. The beta grid is healthy again.
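
For anyone hitting this again later, a quick check of which host the shared file currently names as the active qmaster might look like the sketch below. It assumes act_qmaster holds just a hostname on its first non-empty line. The helper names (`active_qmaster`, `points_at`) are made up for illustration and are not part of any gridengine tooling.

```python
from pathlib import Path

# Path taken from the comment above; the file is assumed to hold one hostname.
ACT_QMASTER = Path("/data/project/.system_sge/gridengine/default/common/act_qmaster")


def active_qmaster(path: Path = ACT_QMASTER) -> str:
    """Return the hostname recorded in act_qmaster (first non-empty line)."""
    for line in path.read_text().splitlines():
        if line.strip():
            return line.strip()
    raise ValueError(f"{path} is empty")


def points_at(expected_host: str, path: Path = ACT_QMASTER) -> bool:
    """True if act_qmaster names expected_host (case-insensitive compare)."""
    return active_qmaster(path).lower() == expected_host.lower()
```

Calling `points_at("toolsbeta-sgegrid-master.toolsbeta.eqiad.wmflabs")` on the master before starting the service would show whether the file still names the shadow host.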

Change 517110 had a related patch set uploaded (by Bstorm; owner: Bstorm):
[operations/puppet@production] toolforge: make backup registry optional (for toolsbeta)

https://gerrit.wikimedia.org/r/517110

Change 517110 merged by Bstorm:
[operations/puppet@production] toolforge: make backup registry optional (for toolsbeta)

https://gerrit.wikimedia.org/r/517110
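
The failure mode and the idea behind making the key optional can be modelled roughly as below. This is a Python stand-in for the lookup behaviour, not the Puppet code from the patch. The `hiera` dictionary and the `None` default are assumptions for illustration only.

```python
_MISSING = object()  # sentinel: distinguishes "no default given" from default=None


def lookup(data: dict, key: str, default=_MISSING):
    """Return data[key]; fall back to default if given, else fail like a required key."""
    if key in data:
        return data[key]
    if default is _MISSING:
        raise KeyError(f"Function lookup() did not find a value for the name '{key}'")
    return default


# Hypothetical project data with no standby node configured, as in toolsbeta.
hiera = {}
key = "profile::toolforge::docker::registry::standby_node"

# Required key: the run fails when the value is absent.
try:
    lookup(hiera, key)
except KeyError as err:
    print(err)

# Optional key: an absent value becomes None and the caller can skip the standby setup.
standby = lookup(hiera, key, default=None)
```

With a default in place, a project that has no standby registry no longer needs a placeholder value just to get a clean run.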

Puppet now runs on the registry node, but the registry itself doesn't work yet: it needs the SSL cert placed in the private repo (like in tools), and there is some odd prometheus error. However, puppet now functions, so this ticket is done. Whoever was working on making a beta registry can continue now.

Bstorm updated the task description.