Create in-cloud puppetmaster for codfw1dev
Closed, Resolved (Public)

Description

We should be mostly ready to set up an in-cloud puppetmaster for codfw1dev and decom labtestpuppetmaster2001. @Krenair has graciously offered to work on this.

The puppetmaster will live in the 'cloudinfra-codfw1dev' project -- I've made 'labtestkrenair' a project admin there. LMK if you need to use a different account.

I've also created a bastion for codfw1dev at 185.15.57.2 -- I'm not sure that we have proper public DNS yet so planning to just set that in my /etc/hosts for now.

No doubt I'm forgetting about other pieces we need for this; let me know if there's anything else I should set up.

Event Timeline

For now I'm using this SSH config:

Host *.codfw1dev.cloud
User root
ProxyCommand ssh -a -W %h:%p -I /usr/lib/x86_64-linux-gnu/opensc-pkcs11.so root@185.15.57.3
PKCS11Provider /usr/lib/x86_64-linux-gnu/opensc-pkcs11.so

(PKCS11 stuff is just for the yubikey my Cloud VPS root key lives on - don't have working LDAP-based SSH on these things yet)

Created a small instance:

$ ssh root@puppetmaster-codfw1dev-01.cloudinfra-codfw1dev.codfw1dev.cloud
Linux puppetmaster-codfw1dev-01 4.19.0-6-amd64 #1 SMP Debian 4.19.67-2+deb10u2 (2019-11-11) x86_64
Debian GNU/Linux 10 (buster)
The last Puppet run was at Tue Jan 14 00:28:06 UTC 2020 (20 minutes ago). 
Last puppet commit: (6b9d61fd99) Dzahn - codesearch: Install docker-ce from thirdparty/kubeadm-k8s component
Last login: Tue Jan 14 00:18:18 2020 from 172.16.128.32
root@puppetmaster-codfw1dev-01:~# 

Also, I vaguely recall using a public DNS record in eqiad1, possibly so the puppetmaster can distinguish between requests from clients to the frontend and requests from the frontend to the backend service running on the frontend instance? I forget exactly. We should probably have working public DNS anyway.

Change 566369 had a related patch set uploaded (by Alex Monk; owner: Alex Monk):
[operations/puppet@production] openstack: Create codfw1dev puppetmaster frontend_vm role to mirror eqiad1

https://gerrit.wikimedia.org/r/566369

Change 566369 merged by Andrew Bogott:
[operations/puppet@production] openstack: Create codfw1dev puppetmaster frontend_vm role to mirror eqiad1

https://gerrit.wikimedia.org/r/566369

Here's a fun one:
Error: Could not retrieve catalog from remote server: Error 500 on SERVER: Server Error: Evaluation Error: Error while evaluating a Resource Statement, Evaluation Error: Error while evaluating a Resource Statement, Duplicate declaration: Class[Profile::Openstack::Base::Observerenv] is already declared in file /etc/puppet/modules/profile/manifests/openstack/eqiad1/observerenv.pp:7; cannot redeclare at /etc/puppet/modules/profile/manifests/openstack/codfw1dev/observerenv.pp:7 at /etc/puppet/modules/profile/manifests/openstack/codfw1dev/observerenv.pp:7:5 on node puppetmaster-codfw1dev-01.cloudinfra-codfw1dev.codfw1dev.cloud
What's that eqiad1 observerenv class doing there? My suspicion is this:
modules/role/manifests/wmcs/instance.pp: include ::profile::openstack::eqiad1::observerenv

I suppose some of the next questions (@Andrew?) may include:

  • How do we make modules/role/manifests/wmcs/instance.pp include the right classes depending on region?
  • Do we even need observerenv and co. to be split based on region?
  • Should manifests/site.pp include a different version of role::wmcs::instance depending on region?
  • Instances don't seem to know what region they're in (even curl http://169.254.169.254/latest/meta-data/placement/availability-zone just says nova). Does that mean we need to resort to a hostname-based mapping, i.e. .eqiad.wmflabs -> eqiad1 and codfw1dev.cloud -> codfw1dev?
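To make the hostname-based option in the last bullet concrete, here is a minimal sketch in shell. The function name deployment_for_fqdn is made up for illustration, and the two suffixes are just the ones mentioned above. This is not what the puppet repo does, only a picture of the mapping.

```shell
#!/bin/sh
# Hypothetical helper (not part of operations/puppet): map an instance FQDN
# to a deployment name using only the two domain suffixes named above.
deployment_for_fqdn() {
    case "$1" in
        *.eqiad.wmflabs)   echo "eqiad1" ;;
        *.codfw1dev.cloud) echo "codfw1dev" ;;
        *)                 echo "unknown"; return 1 ;;
    esac
}

# Example call with the instance created earlier in this task.
deployment_for_fqdn puppetmaster-codfw1dev-01.cloudinfra-codfw1dev.codfw1dev.cloud
```

A suffix match like this is brittle: any FQDN outside those two patterns (for example one ending in codfw1dev.wikimedia.cloud) falls through to "unknown", which is part of why it feels hacky.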

I assume puppetmasters need observerenv config due to validatelabsfqdn or some encapi integration with the OpenStack APIs or something.

I can take a look at this and make a proposal if you want.

I could just split modules/role/manifests/wmcs/instance.pp into two and conditionally include the right one from manifests/site.pp based on hostname, but it feels hacky. Better ideas welcome.

Change 566735 had a related patch set uploaded (by Arturo Borrero Gonzalez; owner: Arturo Borrero Gonzalez):
[operations/puppet@production] realm: introduce global variable $wmcs_deployment

https://gerrit.wikimedia.org/r/566735

Change 566736 had a related patch set uploaded (by Arturo Borrero Gonzalez; owner: Arturo Borrero Gonzalez):
[operations/puppet@production] wmcs: instance: introduce per-deployment openstack profiles

https://gerrit.wikimedia.org/r/566736

These 2 patches are my proposal. I think @bd808 will like the approach too based on comments I saw on IRC backlog from yesterday:

Review, comments and testing welcome!

Change 566735 merged by Arturo Borrero Gonzalez:
[operations/puppet@production] realm: introduce global variable $wmcs_deployment

https://gerrit.wikimedia.org/r/566735

Change 566736 merged by Arturo Borrero Gonzalez:
[operations/puppet@production] wmcs: instance: introduce per-deployment openstack profiles

https://gerrit.wikimedia.org/r/566736

Bstorm triaged this task as Medium priority. Feb 11 2020, 4:13 PM

Change 572421 had a related patch set uploaded (by Alex Monk; owner: Alex Monk):
[operations/puppet@production] Misc work to make puppet run in codfw1dev again following Icad66f70

https://gerrit.wikimedia.org/r/572421

I wanted to put this down in writing since it's not particularly obvious: this work is currently blocked on puppet functioning with the existing labtestpuppetmaster2001.wikimedia.org master. I've done what I can by setting hieradata in horizon, but I'm at the stage now where I can't continue without either control of the master or https://gerrit.wikimedia.org/r/572421

Actually, I realised I could hack around the problem by doing something roughly equivalent to the patch. It works now:

labtestkrenair@puppetmaster-01:~$ sudo puppet agent -tv
Warning: Downgrading to PSON for future requests
Info: Using configured environment 'production'
Info: Retrieving pluginfacts
Info: Retrieving plugin
Info: Loading facts
Info: Caching catalog for puppetmaster-01.cloudinfra-codfw1dev.codfw1dev.wikimedia.cloud
Info: Applying configuration version '(e7b256f089) Andrew Bogott - keystone fernet key rotation: delete files during rsync'
Notice: The LDAP client stack for this host is: sssd/sudo
Notice: /Stage[main]/Profile::Ldap::Client::Labs/Notify[LDAP client stack]/message: defined 'message' as 'The LDAP client stack for this host is: sssd/sudo'
Notice: Applied catalog in 3.94 seconds

Think I'm gonna let that get reviewed before trying to actually install this thing as a puppetmaster though.

Change 572421 merged by Arturo Borrero Gonzalez:
[operations/puppet@production] Misc work to make puppet run in codfw1dev again following Icad66f70

https://gerrit.wikimedia.org/r/572421

Change 574136 had a related patch set uploaded (by Alex Monk; owner: Alex Monk):
[operations/puppet@production] More fixes for codfw1dev puppet

https://gerrit.wikimedia.org/r/574136

Change 574136 merged by Andrew Bogott:
[operations/puppet@production] More fixes for codfw1dev puppet

https://gerrit.wikimedia.org/r/574136

@aborrero: Hi, I went to make a cloudinfra-db-01.cloudinfra-codfw1dev.codfw1dev.wikimedia.cloud instance but found that puppet errors because of an unknown package, and apt update is broken because it seems that instances in codfw1dev without floating IPs now can't talk to the outside world (e.g. deb.debian.org). Is that some sort of NAT problem in codfw1dev? I understand you've been working on networking there.

Set up cloudinfra-db-01 (with a floating IP for the moment to bypass the above issue), in particular the following (basically my steps on T216067 but for buster with wmf-mariadb104 instead of stretch with wmf-mariadb101):

  • Moved /srv/sqldata out of the way
  • /opt/wmf-mariadb104/scripts/mysql_install_db --user=mysql --basedir=/opt/wmf-mariadb104 --datadir=/srv/sqldata
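For anyone repeating this, the two steps above could be wrapped roughly as follows. This is only a sketch: the run and init_datadir helpers, the DRY_RUN switch and the ".old" suffix are assumptions of mine, and by default it only prints the commands rather than touching the data directory.

```shell
#!/bin/sh
# Sketch of the two steps listed above. Commands are only printed unless
# DRY_RUN=0 is set, because running them for real replaces the data directory.
DRY_RUN="${DRY_RUN:-1}"
BASEDIR=/opt/wmf-mariadb104
DATADIR=/srv/sqldata

# Print the command in dry-run mode, otherwise execute it.
run() {
    if [ "$DRY_RUN" = "1" ]; then
        echo "would run: $*"
    else
        "$@"
    fi
}

init_datadir() {
    # The ".old" suffix is an assumption; the task only says the directory
    # was moved out of the way.
    run mv "$DATADIR" "$DATADIR.old"
    run "$BASEDIR/scripts/mysql_install_db" --user=mysql --basedir="$BASEDIR" --datadir="$DATADIR"
}

init_datadir
```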

Puppet is upset that the heartbeat schema hasn't been created, not sure what has responsibility for creating that yet.
Edit: Did that manually with the schema from https://wikitech.wikimedia.org/wiki/MariaDB/Backups#Enabling_replication_on_the_recovered_server

TODO: Arrange copying of current encapi database to here

Got a new puppet problem: Package[python-novaclient] is declared in both modules/openstack/manifests/clientpackages/vms/common.pp (note the vms) and modules/openstack/manifests/clientpackages/pike/buster.pp (note the lack of vms). The latter presumably comes in through modules/profile/manifests/openstack/codfw1dev/puppetmaster/frontend.pp including the non-vm variant. Note that modules/openstack/manifests/clientpackages/mitaka/buster.pp has no config, and that's what the eqiad1 equivalent host will be using.
Setting profile::openstack::codfw1dev::version: mitaka to be equivalent to eqiad1. We should really sort out the inconsistent mess that is our clientpackages puppetisation :(

Change 575719 had a related patch set uploaded (by Alex Monk; owner: Alex Monk):
[operations/puppet@production] Move more eqiad1.yaml hieradata to cloud.yaml

https://gerrit.wikimedia.org/r/575719

Change 575744 had a related patch set uploaded (by Alex Monk; owner: Alex Monk):
[operations/puppet@production] profile::mariadb::cloudinfra: Allow overriding of hostnames

https://gerrit.wikimedia.org/r/575744

root@puppetmaster-01:~# puppet agent -tv
Info: Using configured environment 'production'
Info: Retrieving pluginfacts
Info: Retrieving plugin
Info: Retrieving locales
Info: Loading facts
Info: Caching catalog for puppetmaster-01.cloudinfra-codfw1dev.codfw1dev.wikimedia.cloud
Info: Applying configuration version '(0297e53bc1) Luca Toscano - role::analytics_test_cluster::hadoop::ui: add BigTop/Hive settings'
Notice: The LDAP client stack for this host is: sssd/sudo
Notice: /Stage[main]/Profile::Ldap::Client::Labs/Notify[LDAP client stack]/message: defined 'message' as 'The LDAP client stack for this host is: sssd/sudo'
Notice: Applied catalog in 4.99 seconds
root@test-t242607:~# grep server /etc/puppet/puppet.conf
server = puppetmaster.cloudinfra-codfw1dev.codfw1dev.wmcloud.org
root@test-t242607:~# puppet agent -tv
Info: Using configured environment 'production'
Info: Retrieving pluginfacts
Info: Retrieving plugin
Info: Retrieving locales
Info: Loading facts
Info: Caching catalog for test-t242607.cloudinfra-codfw1dev.codfw1dev.wikimedia.cloud
Info: Applying configuration version '(99650242c2) root - codfw1dev: Register our bastion hosts'
Notice: The LDAP client stack for this host is: sssd/sudo
Notice: /Stage[main]/Profile::Ldap::Client::Labs/Notify[LDAP client stack]/message: defined 'message' as 'The LDAP client stack for this host is: sssd/sudo'
Notice: Applied catalog in 3.74 seconds
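The grep above is enough for a quick check. If you want just the configured value (for example to compare several test instances), a small sketch like this would do. The puppet_server function name is hypothetical, and it assumes the setting is written as a plain "server = ..." line as in the output above.

```shell
#!/bin/sh
# Hypothetical helper: print the value of the first "server = ..." line in a
# puppet.conf-style file. Lines such as "server_list = ..." are not matched.
puppet_server() {
    conf="${1:-/etc/puppet/puppet.conf}"
    sed -n 's/^[[:space:]]*server[[:space:]]*=[[:space:]]*//p' "$conf" | head -n 1
}

# Only query the default location if it is readable on this host.
if [ -r /etc/puppet/puppet.conf ]; then
    puppet_server
fi
```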

So,

  • Merge gerrit changes. Will need reviews.
  • Tidy up and commit hieradata for puppetmaster-01, cloudinfra-db-01, and cloudinfra-internal-puppetmaster-01 as appropriate. Will need reviews.
  • Test it all still works
  • Update hieradata/codfw/profile/openstack/codfw1dev/pdns.yaml to make 172.16.128.20 the new puppet..
  • Copy encapi data across (done the schema to get stuff working already). Will need help for this.
  • Update modules/puppetmaster/files/labtest.hiera.yaml to make http://puppetmaster.cloudinfra-codfw1dev.codfw1dev.wmcloud.org:8100/v1/ the new httpyaml :url_prefix (already done this in root@cloudinfra-internal-puppetmaster-01:/var/lib/git/operations/puppet to get things working)
  • TODO: Update puppet stuff that horizon/designate use to talk in, also set up security group ingress rules for them.
  • Shut down labtestpuppetmaster2001.wikimedia.org.

Change 575719 merged by Arturo Borrero Gonzalez:
[operations/puppet@production] Move more eqiad1.yaml hieradata to cloud.yaml

https://gerrit.wikimedia.org/r/575719

Change 575744 merged by Arturo Borrero Gonzalez:
[operations/puppet@production] profile::mariadb::cloudinfra: Allow overriding of hostnames

https://gerrit.wikimedia.org/r/575744

Change 612674 had a related patch set uploaded (by Andrew Bogott; owner: Andrew Bogott):
[operations/puppet@production] cloud puppetmaster frontend: don't double-include packages on VMs

https://gerrit.wikimedia.org/r/612674

Change 612674 merged by Andrew Bogott:
[operations/puppet@production] cloud puppetmaster frontend: don't double-include packages on VMs

https://gerrit.wikimedia.org/r/612674

Change 612697 had a related patch set uploaded (by Andrew Bogott; owner: Andrew Bogott):
[operations/puppet@production] codfw1dev: switch to an in-cloud puppetmaster

https://gerrit.wikimedia.org/r/612697

Change 612697 merged by Andrew Bogott:
[operations/puppet@production] codfw1dev: switch to an in-cloud puppetmaster

https://gerrit.wikimedia.org/r/612697

Change 612877 had a related patch set uploaded (by Andrew Bogott; owner: Andrew Bogott):
[operations/puppet@production] codfw1dev: fix 'puppet' dns hack to point to cloud-internal puppetmaster

https://gerrit.wikimedia.org/r/612877

Change 612877 merged by Andrew Bogott:
[operations/puppet@production] codfw1dev: fix 'puppet' dns hack to point to cloud-internal puppetmaster

https://gerrit.wikimedia.org/r/612877

This is done now; the puppetmaster is puppetmaster-01.cloudinfra-codfw1dev.codfw1dev.wikimedia.cloud

(For future cert troubleshooting, this might be useful: https://phabricator.wikimedia.org/P11917 )