
Deploy new bullseye elastic cluster nodes on deployment-prep
Closed, Resolved | Public | 5 Estimated Story Points

Description

In preparation for the upcoming Elasticsearch migration (and Stretch EOL) we need to stand up at least 3 new bullseye-based elastic nodes in deployment prep.

Creating this ticket to prepare.

  • Create new servers
  • Switch configuration to new servers
  • Delete old servers

Event Timeline

  • Looking at https://phabricator.wikimedia.org/T278689 as a reference point for code changes required for new nodes. Will need to check with @Majavah for all details.
  • DNS records are described here, but again I need more specifics.

Booted instance ID 48ba77ab-3c6d-46ca-93fd-7a0785d7f45c with hostname 'deployment-elastic00.' I could ping and get an SSH prompt, but no login. Checking other instances for user-data and other possible methods for converge (puppet related metadata?).

A DNS record was also not automatically created as stated here, but it's not clear if the automatic DNS setup applies to deployment-prep. It might also require some pre-configuration in puppet.

I don't know anything about how to set up new Elastic nodes, I imagine it's quite different to new deployment nodes. Setting up new base VMs in the deployment-prep project should be as simple as creating them via Horizon and then following steps 2-4 from https://wikitech.wikimedia.org/wiki/Help:Standalone_puppetmaster#Step_2:_Setup_a_puppet_client to configure the instance to use the project-local puppet master (deployment-puppetmaster04.deployment-prep.eqiad1.wikimedia.cloud).

> Booted instance ID 48ba77ab-3c6d-46ca-93fd-7a0785d7f45c with hostname 'deployment-elastic00.' I could ping and get an SSH prompt, but no login. Checking other instances for user-data and other possible methods for converge (puppet related metadata?).

I don't see the instance anywhere. Did you delete it already? It might take 5-10 minutes after the VM is created for it to become accessible via SSH.

> A DNS record was also not automatically created as stated here, but it's not clear if the automatic DNS setup applies to deployment-prep. And it also might require some pre-configuration in puppet.

It applies to all projects, not sure what went wrong there (I see no traces of that VM in the relevant logs).

Second attempt: created instance ID 48d468a8-7733-47fd-a078-cc4d931d1545 with hostname 'deployment-elastic00' (same as before). This time, the DNS record was created. I could get an SSH prompt, but could not log in.

Set the puppet vars based on Majavah's post above and the hiera config of an existing elastic host. Maybe puppet will converge this instance, but it sounds like I need to either include the CFSSL profile as referenced here or manually provision certs and add them to the puppetmaster repo (also referenced there). Will pick this up again on Monday.

> Second attempt, created instance ID 48d468a8-7733-47fd-a078-cc4d931d1545 with hostname 'deployment-elastic00' (same as before). This time, the DNS record was created. I could get an SSH prompt, but could not login.

Out of curiosity, how did you select that instance name? If the last used name was deployment-elastic08, I'd expect the next one to be deployment-elastic09 (and the next one deployment-elastic10 and so on)
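The sequential naming convention described above can be sketched as a small shell helper. This is only an illustration: the function name `next_instance_name` is hypothetical and not part of any deployment-prep tooling, and it assumes names of the form prefix plus a zero-padded number.

```bash
#!/usr/bin/env bash
# Hypothetical helper (not part of any Wikimedia tooling): print the next
# sequential instance name, given a prefix and the names already in use.
# Usage: next_instance_name PREFIX NAME...
next_instance_name() {
    local prefix=$1
    shift
    local max=0 name num
    for name in "$@"; do
        # Only consider names of the form PREFIX followed by digits.
        [[ $name =~ ^${prefix}([0-9]+)$ ]] || continue
        # 10# forces base 10 so that "08" is not rejected as invalid octal.
        num=$((10#${BASH_REMATCH[1]}))
        if (( num > max )); then
            max=$num
        fi
    done
    # Zero-pad to two digits, matching names like deployment-elastic08.
    printf '%s%02d\n' "$prefix" "$((max + 1))"
}
```

For example, `next_instance_name deployment-elastic deployment-elastic07 deployment-elastic08` prints `deployment-elastic09`.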

Looks like you're hitting an annoying bug where if the first Puppet run fails (usually due to the classes applied to it via horizon), the instance will not manage to change auth configuration to let you log into it. For this instance, I logged in with my cloud vps root key and manually adjusted /etc/security/access.conf and /etc/sssd/sssd.conf so that you should be able to log in now. I also looked if there's an easy way to fix that first Puppet run bug, but didn't see anything.

This is how the Puppet runs are failing:

taavi@deployment-elastic00:~$ sudo -i run-puppet-agent
Info: Using configured environment 'production'
Info: Retrieving pluginfacts
Info: Retrieving plugin
Info: Retrieving locales
Info: Loading facts
Error: Could not retrieve catalog from remote server: Error 500 on SERVER: Server Error: Evaluation Error: Error while evaluating a Resource Statement, Evaluation Error: Missing title. The title expression resulted in undef (file: /etc/puppet/modules/elasticsearch/manifests/init.pp, line: 145, column: 35) on node deployment-elastic00.deployment-prep.eqiad1.wikimedia.cloud
Warning: Not using cache on failed catalog
Error: Could not retrieve catalog; skipping run

The error is coming from the Elasticsearch puppetization, which I don't really know anything about, so I can't really help with figuring out how to solve it :/ My best guess is that the new instance needs to be added to a list of hosts somewhere in puppet hiera. Looks like deployment-prep has elasticsearch hiera spread out between horizon configured hiera and hieradata/cloud/eqiad1/deployment-prep on the main puppet repo (I said deployment-prep had a ton of tech debt! I hope this hiera mess makes my point clear), but I'm not exactly sure what to change.

Change 756643 had a related patch set uploaded (by Bking; author: Bking):

[operations/puppet@production] deployment-prep: add 3 elastic nodes in preparation for bullseye upgrade

https://gerrit.wikimedia.org/r/756643

Change 756643 merged by Bking:

[operations/puppet@production] deployment-prep: add 3 elastic nodes in preparation for bullseye upgrade

https://gerrit.wikimedia.org/r/756643

Change 757699 had a related patch set uploaded (by Bking; author: Bking):

[labs/private@master] deployment-prep: add cergen config for elastic service

https://gerrit.wikimedia.org/r/757699

Change 757699 merged by Bking:

[labs/private@master] deployment-prep: add cergen config for elastic service

https://gerrit.wikimedia.org/r/757699

Mentioned in SAL (#wikimedia-operations) [2022-02-04T23:02:48Z] <inflatador> bking@deployment-puppetmaster04 local commit to public/private repo, see T299797 for more details

Finally got the certs in place, but running into login issues (likely the same issue that @Majavah mentioned above). I couldn't figure out how to add the puppet classes during build, and I may have also copy/pasted bad info into the Horizon puppet config box.

Will try again Monday with user-data so I can log in even if puppet is borked.

MPhamWMF set the point value for this task to 5. Feb 7 2022, 4:56 PM
MPhamWMF moved this task from Incoming to In Progress on the Discovery-Search (Current work) board.

Update: the cloud image does not allow SSH access except from cumin, so adding a user via cloud-init does not help.

If I log in locally, I see the following error message when I try to 'run-puppet-agent':

Error: Could not retrieve catalog from remote server: Error 500 on SERVER: Server Error: Evaluation Error: Error while evaluating a Resource Statement, Evaluation Error: Error while evaluating a Function Call, secret(): invalid secret ssl/deployment-elastic11.deployment-prep.eqiad1.wikimedia.cloud.key (file: /etc/puppet/modules/sslcert/manifests/certificate.pp, line: 91, column: 26) (file: /etc/puppet/modules/tlsproxy/manifests/localssl.pp, line: 160) on node deployment-elastic11.deployment-prep.eqiad1.wikimedia.cloud

This suggests my recent local commit to the public/private repo on deployment-puppetmaster04 did not work correctly. Will continue to troubleshoot...

bking renamed this task from Deploy new elastic cluster nodes on deployment-prep to Deploy new bullseye elastic cluster nodes on deployment-prep. Feb 9 2022, 9:19 PM
bking updated the task description.

Checking the cert and key via this method (https://security.stackexchange.com/questions/73127/how-can-you-check-if-a-private-key-and-certificate-match-in-openssl-with-ecdsa) suggests that they don't actually match.
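A minimal sketch of that kind of check, assuming `openssl` is installed: extract the public key from both the certificate and the private key and compare them. The function name `cert_matches_key` is made up for illustration, and this is not necessarily the exact command sequence used in this task.

```bash
#!/usr/bin/env bash
# Hypothetical helper: succeed (exit 0) only if the certificate and the
# private key carry the same public key. Works for ECDSA and RSA keys.
# Usage: cert_matches_key CERT_FILE KEY_FILE
cert_matches_key() {
    local cert=$1 key=$2 cert_pub key_pub
    # Public key embedded in the certificate, as PEM.
    cert_pub=$(openssl x509 -in "$cert" -noout -pubkey) || return 2
    # Public key derived from the private key, as PEM.
    key_pub=$(openssl pkey -in "$key" -pubout) || return 2
    [[ -n $cert_pub && $cert_pub == "$key_pub" ]]
}
```

Usage would look like `cert_matches_key host.crt host.key && echo match || echo MISMATCH`, where `host.crt` and `host.key` stand in for the files under the paths discussed in this task.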

On the deployment-prep puppet master, the cert is in /var/lib/puppet/server/ssl/ca/signed (it may also need to be in /var/lib/git/operations/puppet/files/ssl/deployment-elastic11.deployment-prep.eqiad1.wikimedia.cloud.crt), and the key is in /var/lib/puppet/ssl/private_keys (it may also need to be in /etc/puppet/private/modules/secret/secrets/ssl).

Still verifying whether we have the right key/cert combination in the right place.

Confirmed, we have a cert/key mismatch. I created the keypair using puppet-ecdsacert. However, the key in puppet's running config (/var/lib/puppet) is NOT the correct key. The correct key is in /etc/puppet/private/modules/secret/secrets/ssl on deployment-puppetmaster04.

If I attempt to regenerate the cert/key using puppet-ecdsacert, I get the following error:

Signing request to deployment-puppetmaster04.deployment-prep.eqiad.wmflabs failed with code 500: {"message":"Server Error: deployment-elastic11.deployment-prep.eqiad1.wikimedia.cloud already has a signed certificate; ignoring certificate request","issue_kind":"RUNTIME_ERROR"}

I'm not sure if it's safe to replace this key "live", or if there is a better/safer way to proceed. Tagging @jbond for advice.

Thanks all, please let me know if you need more info.

Per IRC advice from herron, did the following:

  • Remove the cert for deployment-elastic11.deployment-prep.eqiad1.wikimedia.cloud using puppet cert --clean $fqdn
  • Generate new cert/key via the puppet-ecdsacert script

Next steps will be to commit the new certs and key into source code

Change 762006 had a related patch set uploaded (by Bking; author: Bking):

[operations/puppet@production] Elastic: add deployment-prep cert

https://gerrit.wikimedia.org/r/762006

Change 762006 merged by Bking:

[operations/puppet@production] Elastic: add deployment-prep cert

https://gerrit.wikimedia.org/r/762006

Elasticsearch is installed on our new Bullseye VM, but fails with error: Unrecognized VM option 'PrintGCDateStamps'

A quick web search suggests that this option no longer works in Java 11. In addition to moving between Stretch and Bullseye, we are also moving between Java 8 and Java 11, so we need to review all our JVM options.
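As a rough illustration of the Java 8 versus Java 11 difference, the sketch below picks GC logging flags from the Java major version. The function names and the log path are hypothetical, and the flag sets are a small subset taken from the stock Elasticsearch jvm.options conventions, not necessarily what the puppet module emits.

```bash
#!/usr/bin/env bash
# Hypothetical helpers; names and flag selection are illustrative only.

# Map a Java version string to its major version:
# "1.8.0_312" gives 8, "11.0.14" gives 11.
java_major() {
    local v=$1
    v=${v#1.}       # drop the legacy "1." prefix used up to Java 8
    v=${v%%[._-]*}  # keep only the leading number
    printf '%s\n' "$v"
}

# Print one GC logging flag per line for the given major version.
# Java 8 uses the PrintGC* family; Java 9+ replaced it with unified
# logging (-Xlog) and rejects PrintGCDateStamps at startup.
gc_logging_flags() {
    local major=$1 logfile=$2
    if (( major <= 8 )); then
        printf '%s\n' \
            "-XX:+PrintGCDetails" \
            "-XX:+PrintGCDateStamps" \
            "-Xloggc:${logfile}"
    else
        printf '%s\n' "-Xlog:gc*:file=${logfile}:utctime,pid,tags"
    fi
}
```

For instance, `gc_logging_flags "$(java_major 11.0.14)" /var/log/elasticsearch/gc.log` (path assumed) prints a single `-Xlog` line and no `PrintGC*` options.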

> Elasticsearch is installed on our new Bullseye VM, but fails with error: Unrecognized VM option 'PrintGCDateStamps'
>
> A quick web search suggests that this option no longer works in Java 11. In addition to moving between Stretch and Bullseye, we are also moving between Java 8 and Java 11, so we need to review all our jvm options.

https://github.com/wikimedia/puppet/blob/e1787fadbb2764a4036f2c4fcab79000861696c7/modules/elasticsearch/manifests/instance.pp#L173-L195

and here's where these flags are defined

Change 787106 had a related patch set uploaded (by Bking; author: Bking):

[operations/puppet@production] Elastic: test puppet logic

https://gerrit.wikimedia.org/r/787106

OpenSearch is configured at modules/opensearch/templates/jvm.options.erb. Checking to see what version of the JDK they use and which JVM options.

Change 787505 had a related patch set uploaded (by Bking; author: Bking):

[operations/puppet@production] Elastic: test puppet logic

https://gerrit.wikimedia.org/r/787505

Change 787505 merged by Bking:

[operations/puppet@production] Elastic: Use OS major version for GC flags

https://gerrit.wikimedia.org/r/787505

Change 788768 had a related patch set uploaded (by Bking; author: Bking):

[operations/puppet@production] elastic: use java version to choose GC flags

https://gerrit.wikimedia.org/r/788768

Change 787106 abandoned by Ryan Kemper:

[operations/puppet@production] Elastic: test puppet logic

Reason:

obsoleted by https://gerrit.wikimedia.org/r/c/operations/puppet/+/788768

https://gerrit.wikimedia.org/r/787106

Change 789877 had a related patch set uploaded (by Bking; author: Bking):

[operations/mediawiki-config@master] elastic: update deployment-prep hostnames

https://gerrit.wikimedia.org/r/789877

Change 789877 merged by jenkins-bot:

[operations/mediawiki-config@master] [Beta Cluster] LabsServices: Switch elastic hosts to bullseye hosts

https://gerrit.wikimedia.org/r/789877

Jdforrester-WMF subscribed.

FWICT everything seems to be working in Beta Cluster since the switch. Does that mean we "just" need to shut off and delete the old instances?

Change 791050 had a related patch set uploaded (by Bking; author: Bking):

[operations/puppet@production] Elastic: Use OS major version for GC flags

https://gerrit.wikimedia.org/r/791050

@Jdforrester-WMF I wasn't sure if the code was in production yet or not. If you can confirm that it is in production, let us know and we will decommission the old instances.

I checked deployment-mediawiki12 and it does indeed look like the above config is in production (for beta anyway, which is all we care about).

Thus, I will start the decommissioning process for deployment-elastic05-07 immediately.

Mentioned in SAL (#wikimedia-cloud) [2022-05-12T22:09:48Z] <inflatador> bking@deployment-elastic05 banned deployment-elastic05 from beta ES cluster in preparation for decom T299797
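The log entry above does not say how the ban was performed. One common way to drain a node before decommissioning is Elasticsearch's cluster-level shard allocation filtering; the sketch below assumes exclusion by node name and a reachable cluster URL, and both function names are hypothetical.

```bash
#!/usr/bin/env bash
# Hypothetical helpers; the task does not record the actual tooling used.

# Build the cluster-settings body that excludes nodes whose name matches
# the given pattern from shard allocation (wildcards are allowed).
ban_payload() {
    local node_pattern=$1
    printf '{"transient":{"cluster.routing.allocation.exclude._name":"%s"}}\n' \
        "$node_pattern"
}

# Send the exclusion to the cluster. ES_URL is an assumption, for example
# http://localhost:9200; adjust host and port for the real cluster.
# Usage: ban_node ES_URL NODE_PATTERN
ban_node() {
    local es_url=$1 node_pattern=$2
    curl -fsS -X PUT "${es_url}/_cluster/settings" \
        -H 'Content-Type: application/json' \
        -d "$(ban_payload "$node_pattern")"
}
```

A call such as `ban_node http://localhost:9200 'deployment-elastic05*'` would ask the cluster to move shards off matching nodes; the node can be halted once it holds no shards.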

Mentioned in SAL (#wikimedia-cloud) [2022-05-13T18:58:21Z] <inflatador> bking@deployment-elastic05 halted deployment-elastic05 in beta ES cluster; will decom in 1 wk T299797

Change 791666 had a related patch set uploaded (by Bking; author: Bking):

[operations/puppet@production] elastic: remove decommissioned hosts in beta

https://gerrit.wikimedia.org/r/791666

> @Jdforrester-WMF I wasn't sure if the code was in production yet or not. If you can confirm that is in production, let us know and we will decommission the old instances.

Hey, sorry for the slow reply (was at an off-site). Good to know you found out manually, but sorry I wasn't more explicit to assure you so you didn't need to!

Mentioned in SAL (#wikimedia-cloud) [2022-05-16T19:02:56Z] <inflatador> bking@deployment-elastic06 halted deployment-elastic06 in beta ES cluster; will decom on Friday T299797

Mentioned in SAL (#wikimedia-cloud) [2022-05-16T19:30:59Z] <inflatador> bking@deployment-elastic07 halted deployment-elastic07 in beta ES cluster; will decom on Friday T299797

Mentioned in SAL (#wikimedia-cloud) [2022-05-23T19:21:36Z] <inflatador> Deleted deployment-elastic0[5-7] in favor of newer bullseye hosts T299797

Stretch servers have been deleted, deployment-prep elastic hosts are now all on Bullseye. Closing...

Change 791050 abandoned by Bking:

[operations/puppet@production] Elastic: Use OS major version for GC flags

Reason:

Using the same versions of ES everywhere now, this is no longer needed

https://gerrit.wikimedia.org/r/791050