pontoon: ever more Cloud VPS VMs unreachable via puppet or cumin
Closed, Resolved · Public

Description

Right now there are 71 VMs that I can't reach via cloud cumin:

a11y.reading-web-staging.eqiad1.wikimedia.cloud,backend.wikicommunityhealth.eqiad1.wikimedia.cloud,canary[1027,1036]-01.cloudvirt-canary.eqiad1.wikimedia.cloud,canary-wdqs1003-01.cloudvirt-canary.eqiad1.wikimedia.cloud,client-[05,09].swift.eqiad1.wikimedia.cloud,client-a.monitoring.eqiad1.wikimedia.cloud,commonsarchive-mwtest.commonsarchive.eqiad1.wikimedia.cloud,cumin.mariadb104-test.eqiad1.wikimedia.cloud,fullstackd-20210723161838.admin-monitoring.eqiad1.wikimedia.cloud,gerrit-prod-1001.devtools.eqiad1.wikimedia.cloud,lb-01.swift.eqiad1.wikimedia.cloud,locality.trove.eqiad1.wikimedia.cloud,logging-cassandra-01.logging.eqiad1.wikimedia.cloud,logging-elastic7-[02-03].logging.eqiad1.wikimedia.cloud,logging-grafana-01.logging.eqiad1.wikimedia.cloud,logging-logstash7-01.logging.eqiad1.wikimedia.cloud,logging-loki-[01-02].logging.eqiad1.wikimedia.cloud,logging-lts-01.logging.eqiad1.wikimedia.cloud,logging-puppet-05.logging.eqiad1.wikimedia.cloud,logging-puppetdb-03.logging.eqiad1.wikimedia.cloud,logging-sts-01.logging.eqiad1.wikimedia.cloud,maria1.trove.eqiad1.wikimedia.cloud,mariadb104-test1.mariadb104-test.eqiad1.wikimedia.cloud,metricsinfra-db-1.trove.eqiad1.wikimedia.cloud,ms-be-[01-02].swift.eqiad1.wikimedia.cloud,ms-fe-[01-03].swift.eqiad1.wikimedia.cloud,mwv-builder-03.mediawiki-vagrant.eqiad1.wikimedia.cloud,nehpets.reading-web-staging.eqiad1.wikimedia.cloud,pawsdb-1.trove.eqiad1.wikimedia.cloud,pki-01.swift.eqiad1.wikimedia.cloud,pontoon-acmechief-01.monitoring.eqiad1.wikimedia.cloud,pontoon-cumin-01.monitoring.eqiad1.wikimedia.cloud,pontoon-elastic7-02.monitoring.eqiad1.wikimedia.cloud,pontoon-frontend-02.monitoring.eqiad1.wikimedia.cloud,pontoon-grafana-01.monitoring.eqiad1.wikimedia.cloud,pontoon-graphite-03.monitoring.eqiad1.wikimedia.cloud,pontoon-icinga-01.monitoring.eqiad1.wikimedia.cloud,pontoon-kafka-01.monitoring.eqiad1.wikimedia.cloud,pontoon-kafkamon-01.monitoring.eqiad1.wikimedia.cloud,pontoon-log-[01-02].monitoring.eqiad1.wikimedia.cloud,pontoon-logstash7-03.monitoring.eqiad1.wikimedia.cloud,pontoon-ms-be-[01-02].monitoring.eqiad1.wikimedia.cloud,pontoon-mwlog-01.monitoring.eqiad1.wikimedia.cloud,pontoon-netmon-01.monitoring.eqiad1.wikimedia.cloud,pontoon-prometheus-01.monitoring.eqiad1.wikimedia.cloud,pontoon-puppet-[01,05].monitoring.eqiad1.wikimedia.cloud,pontoon-puppetdb-01.monitoring.eqiad1.wikimedia.cloud,pontoon-thanos-[01-02].monitoring.eqiad1.wikimedia.cloud,puppet-[01,03].swift.eqiad1.wikimedia.cloud,puppetdb.mariadb104-test.eqiad1.wikimedia.cloud,puppetmaster.mariadb104-test.eqiad1.wikimedia.cloud,relforge-search.search.eqiad1.wikimedia.cloud,server-[02-03].swift.eqiad1.wikimedia.cloud,server-a.monitoring.eqiad1.wikimedia.cloud,slave[1-2].mariadb104-test.eqiad1.wikimedia.cloud,zarcillo[0-1].mariadb104-test.eqiad1.wikimedia.cloud
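
(For context, a list like this can be produced from the cloud-wide cumin host by running a no-op against each project and collecting the failures; a minimal sketch, modeled on the invocation shown later in this task, with 'swift' as an example project:)

$ sudo cumin --force "O{project:swift}" "true"
# Hosts that fail the ssh connection end up in cumin's failure
# report; repeating this per project yields a list like the above.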

Many of those hosts seem to have puppet disabled or broken because they're managed by pontoon. I expected this to be true exclusively in the 'pontoon' project, but it seems to also be true at least in 'swift', and possibly elsewhere.

At the moment, it is a requirement of Cloud VPS usage that VMs be puppetized and accessible with Cumin. A good example of why this matters turned up today: the Libera people were on the verge of blocking cloud access to their servers due to misbehaving IRC clients; forensics were difficult on the cumin-less hosts, and ultimately one of the culprits turned out to be a host in the pontoon project. See T287265 for details.

My preference is to cut SRE staff some slack regarding the puppet/cumin requirements, but these gaps in our access and observability are an increasing issue. If pontoon work continues on Cloud VPS, please prioritize restoring standard puppet configs, or at the very least not breaking cumin access on these VMs.

Thanks!

Event Timeline

@Andrew thanks for raising this issue; we will discuss internally in our Monday team meeting and will follow up with a better response.

I suspect the culprit is that the cumin_masters setting used by default in Pontoon is production's, not Cloud VPS'; in other words, the public key is correct but the from= option is not:

# cat /etc/ssh/userkeys/root.d/cumin 
# Cumin Masters. TODO: use 'restrict' once available across the fleet (> jessie)
from="10.64.32.25,2620:0:861:103:10:64:32:25,10.192.48.16,2620:0:860:104:10:192:48:16,10.192.32.49,2620:0:860:103:10:192:32:49",no-agent-forwarding,no-port-forwarding,no-x11-forwarding,no-user-rc ssh-ed25519 AAAAC3NzaC1lZDI1NTE5AAAAICcav+ECiF6hW2XRuP7R8nqDw4hPlD0OChsGvB6K27jK root@cloudinfra-internal-puppetmaster-02
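
(sshd matches the client's source address against the from= list before it even considers the key, so connections from the cloud cumin masters are rejected despite the key being correct. A quick way to eyeball the allowed addresses on a target:)

$ grep -o 'from="[^"]*"' /etc/ssh/userkeys/root.d/cumin
# Prints only the production addresses above; no cloud cumin
# master appears in the list.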

I'll send out a review shortly to fix this discrepancy

Change 708042 had a related patch set uploaded (by Filippo Giunchedi; author: Filippo Giunchedi):

[operations/puppet@production] hieradata: point cumin_masters in pontoon to cloudinfra hosts

https://gerrit.wikimedia.org/r/708042

Change 708042 merged by Filippo Giunchedi:

[operations/puppet@production] hieradata: point cumin_masters in pontoon to cloudinfra hosts

https://gerrit.wikimedia.org/r/708042

Something else to consider is that cumin_masters might be overridden if, e.g., a pontoon stack runs its own cumin; in other words, the full set of cumin_masters must be cloudinfra plus the stack's own cumin masters. I'll send a review.
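
(Once such a change lands, the value a given host will resolve can be sanity-checked with puppet lookup on the stack's puppetmaster; a sketch, assuming the hiera key is cumin_masters as named in the patches here, using a monitoring-stack host from the list above:)

$ sudo puppet lookup --node pontoon-cumin-01.monitoring.eqiad1.wikimedia.cloud cumin_masters
# Should resolve to the cloudinfra masters plus the stack's own
# cumin host(s).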

Change 708243 had a related patch set uploaded (by Filippo Giunchedi; author: Filippo Giunchedi):

[operations/puppet@production] pontoon: allow access to cloud-cumin-* hosts as cumin_masters

https://gerrit.wikimedia.org/r/708243

Change 708243 merged by Filippo Giunchedi:

[operations/puppet@production] pontoon: allow access to cloud-cumin-* hosts as cumin_masters

https://gerrit.wikimedia.org/r/708243

The number of unreachable VMs has dropped somewhat, from 71 to 55. Here's the new list:

a11y.reading-web-staging.eqiad1.wikimedia.cloud,backend.wikicommunityhealth.eqiad1.wikimedia.cloud,canary[1027,1036]-01.cloudvirt-canary.eqiad1.wikimedia.cloud,canary-wdqs1003-01.cloudvirt-canary.eqiad1.wikimedia.cloud,commonsarchive-mwtest.commonsarchive.eqiad1.wikimedia.cloud,cumin.mariadb104-test.eqiad1.wikimedia.cloud,gerrit-prod-1001.devtools.eqiad1.wikimedia.cloud,locality.trove.eqiad1.wikimedia.cloud,logging-cassandra-01.logging.eqiad1.wikimedia.cloud,logging-elastic7-[02-03].logging.eqiad1.wikimedia.cloud,logging-grafana-01.logging.eqiad1.wikimedia.cloud,logging-logstash7-01.logging.eqiad1.wikimedia.cloud,logging-loki-[01-02].logging.eqiad1.wikimedia.cloud,logging-lts-01.logging.eqiad1.wikimedia.cloud,logging-puppet-05.logging.eqiad1.wikimedia.cloud,logging-puppetdb-03.logging.eqiad1.wikimedia.cloud,logging-sts-01.logging.eqiad1.wikimedia.cloud,maria1.trove.eqiad1.wikimedia.cloud,mariadb104-test1.mariadb104-test.eqiad1.wikimedia.cloud,metricsinfra-db-1.trove.eqiad1.wikimedia.cloud,ms-be-[01-02].swift.eqiad1.wikimedia.cloud,ms-fe-[01-03].swift.eqiad1.wikimedia.cloud,mwv-builder-03.mediawiki-vagrant.eqiad1.wikimedia.cloud,nehpets.reading-web-staging.eqiad1.wikimedia.cloud,pawsdb-1.trove.eqiad1.wikimedia.cloud,pontoon-cumin-01.monitoring.eqiad1.wikimedia.cloud,pontoon-frontend-02.monitoring.eqiad1.wikimedia.cloud,pontoon-grafana-01.monitoring.eqiad1.wikimedia.cloud,pontoon-graphite-03.monitoring.eqiad1.wikimedia.cloud,pontoon-icinga-01.monitoring.eqiad1.wikimedia.cloud,pontoon-kafkamon-01.monitoring.eqiad1.wikimedia.cloud,pontoon-log-[01-02].monitoring.eqiad1.wikimedia.cloud,pontoon-logstash7-03.monitoring.eqiad1.wikimedia.cloud,pontoon-mwlog-01.monitoring.eqiad1.wikimedia.cloud,pontoon-prometheus-01.monitoring.eqiad1.wikimedia.cloud,pontoon-puppet-[01,05].monitoring.eqiad1.wikimedia.cloud,pontoon-puppetdb-01.monitoring.eqiad1.wikimedia.cloud,pontoon-thanos-[01-02].monitoring.eqiad1.wikimedia.cloud,puppet-01.swift.eqiad1.wikimedia.cloud,puppetdb.mariadb104-test.eqiad1.wikimedia.cloud,puppetmaster.mariadb104-test.eqiad1.wikimedia.cloud,relforge-search.search.eqiad1.wikimedia.cloud,slave[1-2].mariadb104-test.eqiad1.wikimedia.cloud,zarcillo[0-1].mariadb104-test.eqiad1.wikimedia.cloud

I can't speak for other projects, but the projects currently running pontoon are:

  • monitoring (access should be fixed within the next puppet run)
  • logging (access also should be possible shortly)
  • swift (should be fixed within the next puppet run)
  • mariadb104-test (kormat is out ATM, though should be easy to get access if urgent)

Please try again in the next hour, thank you!

OK, limiting my search to the above four projects, cumin still can't reach these 19 VMs:

logging-logstash7-01.logging.eqiad1.wikimedia.cloud
ms-be-[01-02].swift.eqiad1.wikimedia.cloud
pontoon-kafkamon-01.monitoring.eqiad1.wikimedia.cloud
pontoon-log-[01-02].monitoring.eqiad1.wikimedia.cloud
pontoon-logstash7-03.monitoring.eqiad1.wikimedia.cloud
pontoon-puppet-05.monitoring.eqiad1.wikimedia.cloud
pontoon-puppetdb-01.monitoring.eqiad1.wikimedia.cloud
pontoon-thanos-[01-02].monitoring.eqiad1.wikimedia.cloud
cumin.mariadb104-test.eqiad1.wikimedia.cloud
mariadb104-test1.mariadb104-test.eqiad1.wikimedia.cloud
puppetdb.mariadb104-test.eqiad1.wikimedia.cloud
puppetmaster.mariadb104-test.eqiad1.wikimedia.cloud
slave[1-2].mariadb104-test.eqiad1.wikimedia.cloud
zarcillo[0-1].mariadb104-test.eqiad1.wikimedia.cloud

Hosts in logging/swift/monitoring should have access now. I'll let @Kormat comment once she's back, but it should be equally straightforward to fix.

I've rebased my puppet branch and pushed to the mariadb104-test pontoon cluster. They all seem to have the correct setting now:

===== NODE GROUP =====                                                                                                                  
(8) cumin.mariadb104-test.eqiad1.wikimedia.cloud,mariadb104-test1.mariadb104-test.eqiad1.wikimedia.cloud,puppetdb.mariadb104-test.eqiad1.wikimedia.cloud,puppetmaster.mariadb104-test.eqiad1.wikimedia.cloud,slave[1-2].mariadb104-test.eqiad1.wikimedia.cloud,zarcillo[0-1].mariadb104-test.eqiad1.wikimedia.cloud
----- OUTPUT of 'grep CUMIN_MASTE...m/conf.d/00_defs' -----                                                                             
@def $CUMIN_MASTERS = (172.16.4.46 172.16.6.133 172.16.2.34 );

and the set of IPs in /etc/ssh/userkeys/root.d/cumin also matches.
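
(Worth noting: two independent gates must both admit a cumin master, the ferm firewall definition and sshd's from= restriction; a quick check of both on a target, using the paths already shown in this task:)

$ grep CUMIN_MASTERS /etc/ferm/conf.d/00_defs
# 1) firewall: which source addresses may reach the host at all
$ grep -o 'from="[^"]*"' /etc/ssh/userkeys/root.d/cumin
# 2) sshd: which source addresses may use the cumin root key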

Please let me know if they are still problematic.

Thanks @Kormat and @fgiunchedi. The following hosts still don't respond to my cloud-wide cumin actions:

cumin.mariadb104-test.eqiad1.wikimedia.cloud
mariadb104-test1.mariadb104-test.eqiad1.wikimedia.cloud
puppetdb.mariadb104-test.eqiad1.wikimedia.cloud
puppetmaster.mariadb104-test.eqiad1.wikimedia.cloud
slave[1-2].mariadb104-test.eqiad1.wikimedia.cloud
zarcillo[0-1].mariadb104-test.eqiad1.wikimedia.cloud

So something else still needs doing in that project.

@Andrew: well, the good news is that I've figured out what the problem is. My (pontoon) cumin host has an ssh key that's installed in /etc/ssh/userkeys/root.d/cumin on all hosts in my project. This means the cloud-cumin key isn't listed as allowed at all.

The bad news is, well, I'm not sure how to fix this without making the pontoon env completely pointless.

The real cloud puppetmasters will have /srv/private/modules/secret/secrets/keyholder/cumin_master.pub containing the cloud-cumin public key. This is used by profile::cumin::target to populate /etc/ssh/userkeys/root.d/cumin. They will also have the private key, which gets installed into /etc/keyholder.d/ on the cloud-cumin nodes.
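
(For illustration, the armed key can be inspected on a cumin master with the keyholder tooling; a sketch, assuming the standard WMF keyholder CLI:)

$ sudo keyholder status
# Lists keyholder-managed keys and whether they are armed.
$ ssh-keygen -lf /etc/keyholder.d/cumin_master.pub
# Fingerprint of the public half that targets must carry in
# /etc/ssh/userkeys/root.d/ (the .pub path here is an assumption).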

In my pontoon env, I cannot (and do not want to) have a copy of the real private repo. I'm currently using labs/private, with customisations done in a branch, to provide a cumin_master ssh private+public keypair specific to my env.
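
(A sketch of that workflow, with the branch name and checkout path purely illustrative; the secrets path follows the layout referenced above:)

$ cd /path/to/labs-private-checkout   # the puppetmaster's labs/private clone
$ git checkout -b my-env
$ ssh-keygen -t ed25519 -N '' -f modules/secret/secrets/keyholder/cumin_master
$ git add modules/secret/secrets/keyholder/cumin_master modules/secret/secrets/keyholder/cumin_master.pub
$ git commit -m 'my-env: env-specific cumin_master keypair'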

So to make this work we need:

  • keyholder/cumin_master.pub in (my branch of) labs/private to be the upstream cloud-cumin public key.
  • keyholder/cumin_master in (my branch of) labs/private to be my env's cumin key.
  • some mechanism to additionally add the public key of my env's cumin key to /etc/ssh/userkeys/root.d/<something>

Correction: the relevant profile is actually profile::openstack::eqiad1::cumin::target.

I have a minimal patch which seems to work:

diff --git modules/profile/manifests/openstack/eqiad1/cumin/target.pp modules/profile/manifests/openstack/eqiad1/cumin/target.pp
index 91e1b5c08f..3f7c6f2831 100644
--- modules/profile/manifests/openstack/eqiad1/cumin/target.pp
+++ modules/profile/manifests/openstack/eqiad1/cumin/target.pp
@@ -42,6 +42,13 @@ class profile::openstack::eqiad1::cumin::target(
         content => template('profile/openstack/eqiad1/cumin/userkey.erb'),
     }
 
+    ssh::userkey { 'root-cloud-cumin':
+        ensure  => present,
+        user    => 'root',
+        skey    => 'cloud_cumin',
+        content => secret('keyholder/cloud-cumin_master.pub'),
+    }
+
     if $ssh_project_ferm_sources != '' {
         ::ferm::service { 'ssh-from-cumin-project-masters':
             proto  => 'tcp',

@Andrew: can you please test and let me know if this fixes things from your perspective?

At the moment these hosts are still unreachable:

$ sudo cumin --force "O{project:mariadb104-test}" "true"
8 hosts will be targeted:
cumin.mariadb104-test.eqiad1.wikimedia.cloud,mariadb104-test1.mariadb104-test.eqiad1.wikimedia.cloud,puppetdb.mariadb104-test.eqiad1.wikimedia.cloud,puppetmaster.mariadb104-test.eqiad1.wikimedia.cloud,slave[1-2].mariadb104-test.eqiad1.wikimedia.cloud,zarcillo[0-1].mariadb104-test.eqiad1.wikimedia.cloud
FORCE mode enabled, continuing without confirmation
===== NODE GROUP =====                                                          
(8) cumin.mariadb104-test.eqiad1.wikimedia.cloud,mariadb104-test1.mariadb104-test.eqiad1.wikimedia.cloud,puppetdb.mariadb104-test.eqiad1.wikimedia.cloud,puppetmaster.mariadb104-test.eqiad1.wikimedia.cloud,slave[1-2].mariadb104-test.eqiad1.wikimedia.cloud,zarcillo[0-1].mariadb104-test.eqiad1.wikimedia.cloud
----- OUTPUT of 'true' -----                                                    
Permission denied (publickey,keyboard-interactive).                             
================                                                                
PASS:  |                                   |   0% (0/8) [00:00<?, ?hosts/s]     
FAIL:  |███████████████████████████| 100% (8/8) [00:00<00:00,  5.51hosts/s]     
100.0% (8/8) of nodes failed to execute command 'true': cumin.mariadb104-test.eqiad1.wikimedia.cloud,mariadb104-test1.mariadb104-test.eqiad1.wikimedia.cloud,puppetdb.mariadb104-test.eqiad1.wikimedia.cloud,puppetmaster.mariadb104-test.eqiad1.wikimedia.cloud,slave[1-2].mariadb104-test.eqiad1.wikimedia.cloud,zarcillo[0-1].mariadb104-test.eqiad1.wikimedia.cloud
0.0% (0/8) success ratio (< 100.0% threshold) for command: 'true'. Aborting.
0.0% (0/8) success ratio (< 100.0% threshold) of nodes successfully executed all commands. Aborting.

I'm not entirely following the list of issues above but perhaps @fgiunchedi will weigh in.
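
(When cumin reports Permission denied like this, the target-side sshd log is usually the quickest way to see which key was offered and why it was rejected; a generic sketch:)

$ sudo tail -f /var/log/auth.log
# or, via the systemd journal on the ssh unit:
$ sudo journalctl -u ssh -f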

This is very weird. An example failed login:

Aug  4 16:00:13 puppetmaster sshd[2548]: Connection from 172.16.4.46 port 57374 on 172.16.2.52 port 22
Aug  4 16:00:14 puppetmaster sshd[2548]: Failed publickey for root from 172.16.4.46 port 57374 ssh2: ED25519 SHA256:XnIP7kLtSjPg/VdZQiuyMs5YWl0UtUM8EOfchEBPPEE
Aug  4 16:00:14 puppetmaster sshd[2548]: Connection closed by authenticating user root 172.16.4.46 port 57374 [preauth]

Here's the public key for cloud-cumin:

root@puppetmaster:~# cat /etc/ssh/userkeys/root.d/cloud_cumin
ssh-ed25519 AAAAC3NzaC1lZDI1NTE5AAAAICcav+ECiF6hW2XRuP7R8nqDw4hPlD0OChsGvB6K27jK root@cloudinfra-internal-puppetmaster-02

And its hash exactly matches the one in the sshd error:

root@puppetmaster:~# ssh-keygen -lf /etc/ssh/userkeys/root.d/cloud_cumin
256 SHA256:XnIP7kLtSjPg/VdZQiuyMs5YWl0UtUM8EOfchEBPPEE root@cloudinfra-internal-puppetmaster-02 (ED25519)

Oh, argh. /etc/ssh/sshd_config contains:

sshd_config:AuthorizedKeysFile  /etc/ssh/userkeys/%u /etc/ssh/userkeys/%u.d/cumin

so it hard-codes a cumin entry :/
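
(sshd only consults the paths listed in AuthorizedKeysFile, with %u expanded to the username at auth time, so the new cloud_cumin file is never read for root logins. The effective value can be confirmed with sshd's test mode; a minimal sketch:)

$ sudo sshd -T | grep -i '^authorizedkeysfile'
# Prints the resolved key file list; anything not on it is ignored.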

Ok, new patch:

diff --git modules/profile/manifests/openstack/eqiad1/cumin/target.pp modules/profile/manifests/openstack/eqiad1/cumin/target.pp
index 91e1b5c08f..3f7c6f2831 100644
--- modules/profile/manifests/openstack/eqiad1/cumin/target.pp
+++ modules/profile/manifests/openstack/eqiad1/cumin/target.pp
@@ -42,6 +42,13 @@ class profile::openstack::eqiad1::cumin::target(
         content => template('profile/openstack/eqiad1/cumin/userkey.erb'),
     }
 
+    ssh::userkey { 'root-cloud-cumin':
+        ensure  => present,
+        user    => 'root',
+        skey    => 'cloud_cumin',
+        content => secret('keyholder/cloud-cumin_master.pub'),
+    }
+
     if $ssh_project_ferm_sources != '' {
         ::ferm::service { 'ssh-from-cumin-project-masters':
             proto  => 'tcp',
diff --git modules/ssh/manifests/server.pp modules/ssh/manifests/server.pp
index 77ded2ab1c..372fe149dc 100644
--- modules/ssh/manifests/server.pp
+++ modules/ssh/manifests/server.pp
@@ -16,7 +16,7 @@ class ssh::server (
     Stdlib::Port                 $listen_port              = 22,
     Array[Stdlib::IP::Address]   $listen_addresses         = [],
     Ssh::Config::PermitRootLogin $permit_root              = true,
-    Array[Stdlib::Unixpath]      $authorized_keys_file     = ['/etc/ssh/userkeys/%u', '/etc/ssh/userkeys/%u.d/cumin'],
+    Array[Stdlib::Unixpath]      $authorized_keys_file     = ['/etc/ssh/userkeys/%u', '/etc/ssh/userkeys/%u.d/cumin', '/etc/ssh/userkeys/%u.d/cloud_cumin'],
     Stdlib::Unixpath             $authorized_keys_command  = '/usr/sbin/ssh-key-ldap-lookup',
     Boolean                      $disable_nist_kex         = true,
     Boolean                      $explicit_macs            = true,

Pretty it ain't. But I think it should work.

@Andrew: can you run your test again, please?

That did it! I can reach the mariadb104-test VMs now. Thanks @Kormat!

The only remaining unreachable VM in this category is pontoon.traffic.eqiad1.wikimedia.cloud.

Thank you @Andrew. I think this might have been bad timing while the host was still enrolling; I'm now seeing this:

pontoon:~# cat /etc/ssh/userkeys/root.d/cumin 
# Cumin Masters. TODO: use 'restrict' once available across the fleet (> jessie)
from="172.16.4.46,172.16.6.133",no-agent-forwarding,no-port-forwarding,no-x11-forwarding,no-user-rc ssh-ed25519 AAAAC3NzaC1lZDI1NTE5AAAAICcav+ECiF6hW2XRuP7R8nqDw4hPlD0OChsGvB6K27jK root@cloudinfra-internal-puppetmaster-02
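
(A one-host smoke test from the cloud cumin master, using cumin's direct D{} backend, is a quick way to confirm; a sketch:)

$ sudo cumin --force 'D{pontoon.traffic.eqiad1.wikimedia.cloud}' 'true'
# Success here means the from= list and key now admit the cloud master.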

You're right, cumin works for all pontoon hosts now. Thank you!

Next, puppet failures! This is a shorter list:

mariadb104-test1.mariadb104-test.eqiad1.wikimedia.cloud
ms-be-01.swift.eqiad1.wikimedia.cloud
ms-fe-[01-03].swift.eqiad1.wikimedia.cloud
pki-01.swift.eqiad1.wikimedia.cloud
pontoon-conf-01.monitoring.eqiad1.wikimedia.cloud
pontoon-icinga-01.monitoring.eqiad1.wikimedia.cloud
pontoon-lb-01.monitoring.eqiad1.wikimedia.cloud
pontoon-logstash7-03.monitoring.eqiad1.wikimedia.cloud
pontoon-netmon-01.monitoring.eqiad1.wikimedia.cloud
puppet-03.swift.eqiad1.wikimedia.cloud

I'm glad we could fix cumin access across the board, though puppet failures seem out of scope to me for this task. Also, some of these hosts have known puppet failures and/or are still under development, where puppet failures are to be expected.

Andrew claimed this task.

> seems out of scope to me for this task

Agreed; I'll keep an eye on those and can open special-purpose tickets as needed.

Thank you for your work on this, everyone!