toolforge: clush can't connect to some instances
Closed, Resolved, Public

Description

Today I checked clush connectivity to the other nodes, and this is the result:

aborrero@tools-clushmaster-01:~$ clush -q -w @all ':'
tools-static-12.tools.eqiad.wmflabs: Permission denied (publickey,hostbased).
clush: tools-static-12.tools.eqiad.wmflabs: exited with exit code 255
tools-services-01.tools.eqiad.wmflabs: ssh: connect to host tools-services-01.tools.eqiad.wmflabs port 22: Connection timed out
clush: tools-services-01.tools.eqiad.wmflabs: exited with exit code 255

Neither tools-static-12.tools.eqiad.wmflabs nor tools-services-01.tools.eqiad.wmflabs can be reached by clush, apparently for different reasons: the first rejects the public key, the second times out on port 22.
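
To keep the two failure modes apart when the node list is longer, the output can be grouped by error type. This is only a rough sketch: the function name and the category labels ("auth", "network", "other") are made up here, and it only recognizes the two messages seen in the run above.

```python
# Sample output copied from the clush run above.
CLUSH_OUTPUT = """\
tools-static-12.tools.eqiad.wmflabs: Permission denied (publickey,hostbased).
clush: tools-static-12.tools.eqiad.wmflabs: exited with exit code 255
tools-services-01.tools.eqiad.wmflabs: ssh: connect to host tools-services-01.tools.eqiad.wmflabs port 22: Connection timed out
clush: tools-services-01.tools.eqiad.wmflabs: exited with exit code 255
"""


def classify_failures(output):
    """Map each failing host to a coarse, hypothetical failure category."""
    failures = {}
    for line in output.splitlines():
        # Skip blank lines and clush's own exit-code summary lines.
        if not line.strip() or line.startswith("clush: "):
            continue
        # Per-host lines have the form "<host>: <message from ssh>".
        host, _, message = line.partition(": ")
        if "Permission denied" in message:
            failures[host] = "auth"
        elif "Connection timed out" in message:
            failures[host] = "network"
        else:
            failures[host] = "other"
    return failures
```

With the output above, tools-static-12 lands in "auth" and tools-services-01 in "network", which suggests looking at SSH keys on the first and at firewalling on the second.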

Event Timeline

On tools-services-01.tools.eqiad.wmflabs there is a file /etc/ferm/conf.d/10_role::toollabs::clush::target with this content:

# Autogenerated by puppet. DO NOT EDIT BY HAND!
#
#
&R_SERVICE(tcp, 22, @resolve((tools-puppetmaster-01.tools.eqiad.wmflabs)));

This seems wrong: the rule only allows SSH from tools-puppetmaster-01, but the clushmaster is tools-clushmaster-01.tools.eqiad.wmflabs. The file was generated by the puppet code in modules/role/manifests/toollabs/clush/target.pp.
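
The mismatch can be checked mechanically by pulling the hostnames out of the rule line. A minimal sketch, assuming the rule keeps the `@resolve((...))` shape shown above; `allowed_hosts` and `EXPECTED_MASTER` are illustrative names, not anything from the puppet module.

```python
import re

# Rule line copied from the ferm file quoted above.
FERM_RULE = "&R_SERVICE(tcp, 22, @resolve((tools-puppetmaster-01.tools.eqiad.wmflabs)));"
# Assumption: the host that should be allowed is the clushmaster named in this task.
EXPECTED_MASTER = "tools-clushmaster-01.tools.eqiad.wmflabs"


def allowed_hosts(rule):
    """Return the hostnames listed inside @resolve((...)) in a ferm rule line."""
    match = re.search(r"@resolve\(\(([^)]*)\)\)", rule)
    if match is None:
        return []
    # ferm lists are whitespace-separated, so split() also covers several hosts.
    return match.group(1).split()


def master_is_allowed(rule, master=EXPECTED_MASTER):
    """True if the expected clush master appears in the rule's host list."""
    return master in allowed_hosts(rule)
```

For the rule above, `master_is_allowed(FERM_RULE)` is false, which matches the connection timeout seen from the clushmaster.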

From https://wikitech.wikimedia.org/wiki/Hiera:Tools I see:

"role::toollabs::clush::target::master": tools-clushmaster-01.tools.eqiad.wmflabs

Is puppet running on tools-services-01?

Good question. It seems to be running with no issues.

I tested this:

  • removed the /etc/ferm/conf.d/10_role::toollabs::clush::target file by hand
  • ran puppet to see whether it would be recreated
  • nothing happened: no errors from puppet, but the file wasn't recreated
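
The test above can be written as a small repeatable check. This is a sketch, not what was actually run on the host: `recreated_by` is a hypothetical helper, and invoking puppet as `puppet agent --test` is an assumption.

```python
import os
import subprocess


def recreated_by(path, command):
    """Delete `path`, run `command`, and report whether `path` exists afterwards.

    Returns (exists, returncode). The exit status is returned rather than
    raised, because a nonzero status from the command is not necessarily
    a failure.
    """
    if os.path.exists(path):
        os.remove(path)
    result = subprocess.run(command)
    return os.path.exists(path), result.returncode
```

On the affected host this would be called as `recreated_by("/etc/ferm/conf.d/10_role::toollabs::clush::target", ["puppet", "agent", "--test"])`. A result of `False` for the first value means puppet no longer manages the file, which is what was observed here.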

Is ferm somehow disabled on this host from puppet's point of view?

Mentioned in SAL (#wikimedia-cloud) [2018-02-15T13:54:35Z] <arturo> cleanup ferm (deinstall) in tools-services-01 for T187435

After this cleanup, the state is:

aborrero@tools-clushmaster-01:~$ clush -q -w @all ':'
tools-static-12.tools.eqiad.wmflabs: Permission denied (publickey,hostbased).
clush: tools-static-12.tools.eqiad.wmflabs: exited with exit code 255

We can close this task now, given that tools-static-12 is a work in progress (it has puppet disabled right now).