Deploy new bullseye elastic cluster nodes on deployment-prep
Closed, Resolved | Public | 5 Estimated Story Points

Description

In preparation for the upcoming Elasticsearch migration (and Stretch EOL), we need to stand up at least 3 new bullseye-based elastic nodes in deployment-prep.

Creating this ticket to prepare.

Create new servers
Switch configuration to new servers
Delete old servers

Details

Subject	Repo	Branch	Lines +/-
Elastic: Use OS major version for GC flags	operations/puppet	production	+10 -30
elastic: remove decommissioned hosts in beta	operations/puppet	production	+1 -1
[Beta Cluster] LabsServices: Switch elastic hosts to bullseye hosts	operations/mediawiki-config	master	+9 -5
Elastic: test puppet logic	operations/puppet	production	+24 -4
Elastic: Use OS major version for GC flags	operations/puppet	production	+21 -8
Elastic: add deployment-prep cert	operations/puppet	production	+35 -0
deployment-prep: add cergen config for elastic service	labs/private	master	+15 -0
deployment-prep: add 3 elastic nodes in preparation for bullseye upgrade	operations/puppet	production	+8 -1


Related Objects

Status	Assigned	Task
Resolved	None	T248925 Make MediaWiki release tarball compatible with PHP 8.0
Resolved	Jdforrester-WMF	T300463 Make PHP 8.0 voting on MW master
Resolved	None	T283275 Make MW master tests pass on PHP 8.0
Resolved	Reedy	T268861 CirrusSearch uses Elastica's Match class
Resolved	Reedy	T268863 Translate uses Elastica's Match class
Resolved	matthiasmullie	T268866 WikibaseMediaInfo uses Elastica's Match class
Invalid	None	T268864 WikibaseCirrusSearch uses Elastica's Match class
Resolved	Reedy	T268865 WikibaseLexemeCirrusSearch uses Elastica's Match class
Resolved	EBernhardson	T271777 Bump rufin/elastica (and related libraries) to versions that support PHP 8.0
Resolved	Gehel	T263142 [EPIC] Upgrade Elasticsearch to version 7.10
Stalled	None	T302086 Set scap minimum python version to 3.7
Resolved	None	T247045 Migrate all of production metal and VMs to Buster or later
Declined	None	T244736 Migrate Elasticsearch to Debian Buster
Resolved	None	T306068 Cloud VPS "deployment-prep" project Stretch deprecation
Resolved	taavi	T278641 Migrate deployment-prep away from Debian Stretch to Buster/Bullseye
Open	None	T291916 Tracking task for Bullseye migrations in production
Resolved	bking	T289135 Upgrade Cirrus Elasticsearch clusters to Debian Bullseye
Resolved	bking	T299797 Deploy new bullseye elastic cluster nodes on deployment-prep
Resolved	bking	T301408 Enable search team SRE access to deployment-prep VMs
Resolved	bking	T307510 Update tls proxy config for Bullseye

Event Timeline

bking created this task. Jan 21 2022, 7:46 PM

bking mentioned this in T278641: Migrate deployment-prep away from Debian Stretch to Buster/Bullseye. Jan 21 2022, 7:58 PM

taavi edited parent tasks, added: T298253: Upgrade deployment-prep Swift cluster to Debian Buster or newer; removed: T278641: Migrate deployment-prep away from Debian Stretch to Buster/Bullseye. Jan 21 2022, 7:59 PM

taavi edited parent tasks, added: T298252: Upgrade deployment-prep Elastic cluster to Debian Buster or newer; removed: T298253: Upgrade deployment-prep Swift cluster to Debian Buster or newer.

bking added a project: Discovery-ARCHIVED. Jan 21 2022, 8:05 PM

Looking at https://phabricator.wikimedia.org/T278689 as a reference point for code changes required for new nodes. Will need to check with @Majavah for all details.
DNS records are described here, but again I need more specifics.

Booted instance ID 48ba77ab-3c6d-46ca-93fd-7a0785d7f45c with hostname 'deployment-elastic00.' I could ping and get an SSH prompt, but no login. Checking other instances for user-data and other possible methods for converge (puppet related metadata?).

A DNS record was also not automatically created as stated here, but it's not clear if the automatic DNS setup applies to deployment-prep. It might also require some pre-configuration in puppet.

In T299797#7641306, @bking wrote:

Looking at https://phabricator.wikimedia.org/T278689 as a reference point for code changes required for new nodes. Will need to check with @Majavah for all details.

I don't know anything about how to set up new Elastic nodes, I imagine it's quite different to new deployment nodes. Setting up new base VMs in the deployment-prep project should be as simple as creating them via Horizon and then following steps 2-4 from https://wikitech.wikimedia.org/wiki/Help:Standalone_puppetmaster#Step_2:_Setup_a_puppet_client to configure the instance to use the project-local puppet master (deployment-puppetmaster04.deployment-prep.eqiad1.wikimedia.cloud).

In T299797#7641410, @bking wrote:

Booted instance ID 48ba77ab-3c6d-46ca-93fd-7a0785d7f45c with hostname 'deployment-elastic00.' I could ping and get an SSH prompt, but no login. Checking other instances for user-data and other possible methods for converge (puppet related metadata?).

I don't see the instance anywhere. Did you delete it already? It might take 5-10 minutes after the VM is created for it to become accessible via SSH.

A DNS record was also not automatically created as stated here, but it's not clear if the automatic DNS setup applies to deployment-prep. It might also require some pre-configuration in puppet.

It applies to all projects, not sure what went wrong there (I see no traces of that VM in the relevant logs).

Second attempt: created instance ID 48d468a8-7733-47fd-a078-cc4d931d1545 with hostname 'deployment-elastic00' (same as before). This time, the DNS record was created. I could get an SSH prompt, but could not log in.

Set the puppet vars based on Majavah's post above and the hiera config of an existing elastic host. Maybe puppet will converge this instance, but it sounds like I need to either include the CFSSL profile as referenced here or manually provision certs and add them to the puppetmaster repo (also referenced there). Will pick this up again on Monday.

In T299797#7641787, @bking wrote:

Second attempt: created instance ID 48d468a8-7733-47fd-a078-cc4d931d1545 with hostname 'deployment-elastic00' (same as before). This time, the DNS record was created. I could get an SSH prompt, but could not log in.

Out of curiosity, how did you select that instance name? If the last used name was deployment-elastic08, I'd expect the next one to be deployment-elastic09 (and the next one deployment-elastic10 and so on)

Looks like you're hitting an annoying bug where if the first Puppet run fails (usually due to the classes applied to it via horizon), the instance will not manage to change auth configuration to let you log into it. For this instance, I logged in with my cloud vps root key and manually adjusted /etc/security/access.conf and /etc/sssd/sssd.conf so that you should be able to log in now. I also looked if there's an easy way to fix that first Puppet run bug, but didn't see anything.

This is how the Puppet runs are failing:

taavi@deployment-elastic00:~$ sudo -i run-puppet-agent
Info: Using configured environment 'production'
Info: Retrieving pluginfacts
Info: Retrieving plugin
Info: Retrieving locales
Info: Loading facts
Error: Could not retrieve catalog from remote server: Error 500 on SERVER: Server Error: Evaluation Error: Error while evaluating a Resource Statement, Evaluation Error: Missing title. The title expression resulted in undef (file: /etc/puppet/modules/elasticsearch/manifests/init.pp, line: 145, column: 35) on node deployment-elastic00.deployment-prep.eqiad1.wikimedia.cloud
Warning: Not using cache on failed catalog
Error: Could not retrieve catalog; skipping run

The error is coming from the Elasticsearch puppetization, which I don't really know anything about, so I can't really help with figuring out how to solve it :/ My best guess is that the new instance needs to be added to a list of hosts somewhere in puppet hiera. Looks like deployment-prep has elasticsearch hiera spread out between horizon configured hiera and hieradata/cloud/eqiad1/deployment-prep on the main puppet repo (I said deployment-prep had a ton of tech debt! I hope this hiera mess makes my point clear), but I'm not exactly sure what to change.

Change 756643 had a related patch set uploaded (by Bking; author: Bking):

[operations/puppet@production] deployment-prep: add 3 elastic nodes in preparation for bullseye upgrade

https://gerrit.wikimedia.org/r/756643

gerritbot added a project: Patch-For-Review. Jan 24 2022, 6:58 PM

Change 756643 merged by Bking:

[operations/puppet@production] deployment-prep: add 3 elastic nodes in preparation for bullseye upgrade

https://gerrit.wikimedia.org/r/756643

Maintenance_bot removed a project: Patch-For-Review. Jan 25 2022, 3:10 PM

Change 757699 had a related patch set uploaded (by Bking; author: Bking):

[labs/private@master] deployment-prep: add cergen config for elastic service

https://gerrit.wikimedia.org/r/757699

gerritbot added a project: Patch-For-Review. Jan 27 2022, 6:13 PM

Change 757699 merged by Bking:

[labs/private@master] deployment-prep: add cergen config for elastic service

https://gerrit.wikimedia.org/r/757699

bking mentioned this in rLPRI9ba40ca54703: deployment-prep: add cergen config for elastic service. Jan 28 2022, 8:16 PM

Maintenance_bot removed a project: Patch-For-Review. Jan 28 2022, 9:10 PM

bking added a project: Discovery-Search (Current work). Jan 31 2022, 4:45 PM

Mentioned in SAL (#wikimedia-operations) [2022-02-04T23:02:48Z] <inflatador> bking@deployment-puppetmaster04 local commit to public/private repo, see T299797 for more details

Finally got the certs in place, but I'm running into login issues (likely the same issue that @Majavah mentioned above). I couldn't figure out how to add the puppet classes during build, and I may have also copy/pasted bad info into the Horizon puppet config box.

Will try again Monday with user-data so I can log in even if puppet is borked.

MPhamWMF set the point value for this task to 5. Feb 7 2022, 4:56 PM

MPhamWMF moved this task from Incoming to In Progress on the Discovery-Search (Current work) board.

Update: the cloud image does not allow SSH access except from cumin, so adding a user via cloud-init does not help.

If I log in locally, I see the following error message when I try to run 'run-puppet-agent':

Error: Could not retrieve catalog from remote server: Error 500 on SERVER: Server Error: Evaluation Error: Error while evaluating a Resource Statement, Evaluation Error: Error while evaluating a Function Call, secret(): invalid secret ssl/deployment-elastic11.deployment-prep.eqiad1.wikimedia.cloud.key (file: /etc/puppet/modules/sslcert/manifests/certificate.pp, line: 91, column: 26) (file: /etc/puppet/modules/tlsproxy/manifests/localssl.pp, line: 160) on node deployment-elastic11.deployment-prep.eqiad1.wikimedia.cloud

This suggests my recent local commit to the public/private repo on deployment-puppetmaster04 did not work correctly. Will continue to troubleshoot...

RhinosF1 subscribed. Feb 9 2022, 8:58 PM

bd808 added a subtask: T301408: Enable search team SRE access to deployment-prep VMs. Feb 9 2022, 8:59 PM

bking renamed this task from Deploy new elastic cluster nodes on deployment-prep to Deploy new bullseye elastic cluster nodes on deployment-prep. Feb 9 2022, 9:19 PM

bking updated the task description.

bking closed subtask T301408: Enable search team SRE access to deployment-prep VMs as Resolved. Feb 10 2022, 3:27 PM

Checking the cert and key via this method (https://security.stackexchange.com/questions/73127/how-can-you-check-if-a-private-key-and-certificate-match-in-openssl-with-ecdsa) suggests that they don't actually match.

On the deployment-prep puppet master, the cert is in /var/lib/puppet/server/ssl/ca/signed (it may also need to be in /var/lib/git/operations/puppet/files/ssl/deployment-elastic11.deployment-prep.eqiad1.wikimedia.cloud.crt) and the key is in /var/lib/puppet/ssl/private_keys (it may also need to be in /etc/puppet/private/modules/secret/secrets/ssl).

Still verifying whether we have the right key/cert combination in the right place.
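
A minimal sketch of the kind of check described above, assuming `openssl` is available on the host. It extracts the public key from the certificate and derives it from the private key, then compares the two. The function name `cert_matches_key` is hypothetical and not part of the puppet tooling.

```shell
# Hypothetical helper: exit 0 only if the certificate and the private key
# share the same public key. Works for RSA and ECDSA keys alike.
cert_matches_key() {
    cmk_cert="$1"
    cmk_key="$2"
    # Public key embedded in the certificate (PEM).
    cmk_cert_pub="$(openssl x509 -in "$cmk_cert" -noout -pubkey 2>/dev/null)" || return 2
    # Public key derived from the private key (PEM).
    cmk_key_pub="$(openssl pkey -in "$cmk_key" -pubout 2>/dev/null)" || return 2
    # An empty result must never count as a match.
    [ -n "$cmk_cert_pub" ] && [ "$cmk_cert_pub" = "$cmk_key_pub" ]
}
```

Usage would look like `cert_matches_key host.crt host.key && echo match || echo MISMATCH`, where the two file names stand in for whichever cert/key pair from the directories above is being compared.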

Confirmed, we have a cert/key mismatch. I created the keypair using puppet-ecdsacert. However, the key in puppet's running config (/var/lib/puppet) is NOT the correct key. The correct key is in /etc/puppet/private/modules/secret/secrets/ssl on deployment-puppetmaster04.

If I attempt to regenerate the cert/key using puppet-ecdsacert, I get the following error: Signing request to deployment-puppetmaster04.deployment-prep.eqiad.wmflabs failed with code 500: {"message":"Server Error: deployment-elastic11.deployment-prep.eqiad1.wikimedia.cloud already has a signed certificate; ignoring certificate request","issue_kind":"RUNTIME_ERROR"}

I'm not sure if it's safe to replace this key "live", or if there is a better/safer way to proceed. Tagging @jbond for advice on how best to proceed.

Thanks all, please let me know if you need more info.

Per IRC advice from herron, did the following:

Removed deployment-elastic11.deployment-prep.eqiad1.wikimedia.cloud's cert using puppet cert --clean $fqdn
Generated a new cert/key via the puppet-ecdsacert script

The next step is to commit the new cert and key to source control.

Change 762006 had a related patch set uploaded (by Bking; author: Bking):

[operations/puppet@production] Elastic: add deployment-prep cert

https://gerrit.wikimedia.org/r/762006

gerritbot added a project: Patch-For-Review. Feb 11 2022, 10:43 PM

Change 762006 merged by Bking:

[operations/puppet@production] Elastic: add deployment-prep cert

https://gerrit.wikimedia.org/r/762006

Maintenance_bot removed a project: Patch-For-Review. Feb 12 2022, 12:10 AM

Gehel moved this task from In Progress to Waiting on the Discovery-Search (Current work) board. Feb 22 2022, 8:21 PM

Gehel merged a task: T298252: Upgrade deployment-prep Elastic cluster to Debian Buster or newer.

Jdforrester-WMF added a parent task: T278641: Migrate deployment-prep away from Debian Stretch to Buster/Bullseye. Feb 24 2022, 2:19 PM

Jdforrester-WMF edited parent tasks, added: T289135: Upgrade Cirrus Elasticsearch clusters to Debian Bullseye; removed: T298252: Upgrade deployment-prep Elastic cluster to Debian Buster or newer.

bking mentioned this in T306847: Figure out solution for certificate_name/hostname overloading in deployment-prep. Apr 25 2022, 10:42 PM

bking mentioned this in T306907: Determine what to do about missing bullseye packages. Apr 26 2022, 2:36 PM

Elasticsearch is installed on our new Bullseye VM, but fails with error: Unrecognized VM option 'PrintGCDateStamps'

A quick web search suggests that this option no longer works in Java 11. In addition to moving from Stretch to Bullseye, we are also moving from Java 8 to Java 11, so we need to review all our JVM options.
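
To illustrate the difference, here is a hedged sketch of choosing GC logging flags by Java major version. The flag sets are modeled on the Java 8 and Java 9+ GC logging lines in upstream Elasticsearch's default `jvm.options`. The function name `gc_flags_for_java` and the default log path are assumptions, and this is not the puppet change that was eventually merged.

```shell
# Hypothetical helper: print GC logging flags for a given Java major version.
# $1: Java major version (e.g. 8 or 11)
# $2: GC log path (assumed default, not taken from the puppet code)
gc_flags_for_java() {
    gcf_major="$1"
    gcf_log="${2:-/var/log/elasticsearch/gc.log}"
    if [ "$gcf_major" -le 8 ]; then
        # Legacy Java 8 flags. PrintGCDateStamps is rejected by Java 9+
        # with "Unrecognized VM option", as seen in the error above.
        printf '%s\n' \
            "-XX:+PrintGCDetails" \
            "-XX:+PrintGCDateStamps" \
            "-Xloggc:${gcf_log}"
    else
        # Java 9+ unified logging replaces the flags above with -Xlog.
        printf '%s\n' "-Xlog:gc*:file=${gcf_log}:utctime,pid,tags:filecount=32,filesize=64m"
    fi
}
```

Calling `gc_flags_for_java 11` prints a single `-Xlog` line, while `gc_flags_for_java 8` prints the legacy flags one per line.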

In T299797#7886111, @bking wrote:

Elasticsearch is installed on our new Bullseye VM, but fails with error: Unrecognized VM option 'PrintGCDateStamps'

A quick web search suggests that this option no longer works in Java 11. In addition to moving from Stretch to Bullseye, we are also moving from Java 8 to Java 11, so we need to review all our JVM options.

https://github.com/wikimedia/puppet/blob/e1787fadbb2764a4036f2c4fcab79000861696c7/modules/elasticsearch/manifests/instance.pp#L173-L195

and here's where these flags are defined

Change 787106 had a related patch set uploaded (by Bking; author: Bking):

[operations/puppet@production] Elastic: test puppet logic

https://gerrit.wikimedia.org/r/787106

gerritbot added a project: Patch-For-Review. Apr 27 2022, 10:06 PM

OpenSearch is configured at modules/opensearch/templates/jvm.options.erb. Checking to see what version of the JDK they use and which JVM options.

Change 787505 had a related patch set uploaded (by Bking; author: Bking):

[operations/puppet@production] Elastic: test puppet logic

https://gerrit.wikimedia.org/r/787505

Change 787505 merged by Bking:

[operations/puppet@production] Elastic: Use OS major version for GC flags

https://gerrit.wikimedia.org/r/787505

Change 788768 had a related patch set uploaded (by Bking; author: Bking):

[operations/puppet@production] elastic: use java version to choose GC flags

https://gerrit.wikimedia.org/r/788768

Change 787106 abandoned by Ryan Kemper:

[operations/puppet@production] Elastic: test puppet logic

Reason:

obsoleted by https://gerrit.wikimedia.org/r/c/operations/puppet/+/788768

https://gerrit.wikimedia.org/r/787106

bking added a subtask: T307510: Update tls proxy config for Bullseye. May 3 2022, 9:05 PM

bking mentioned this in T307795: Review java installation method in our puppet code. May 6 2022, 2:57 PM

Change 789877 had a related patch set uploaded (by Bking; author: Bking):

[operations/mediawiki-config@master] elastic: update deployment-prep hostnames

https://gerrit.wikimedia.org/r/789877

Gehel moved this task from Waiting to In Progress on the Discovery-Search (Current work) board. May 9 2022, 3:18 PM

Change 789877 merged by jenkins-bot:

[operations/mediawiki-config@master] [Beta Cluster] LabsServices: Switch elastic hosts to bullseye hosts

https://gerrit.wikimedia.org/r/789877

FWICT everything seems to be working in Beta Cluster since the switch. Does that mean we "just" need to shut off and delete the old instances?

Change 791050 had a related patch set uploaded (by Bking; author: Bking):

[operations/puppet@production] Elastic: Use OS major version for GC flags

https://gerrit.wikimedia.org/r/791050

@Jdforrester-WMF I wasn't sure if the code was in production yet or not. If you can confirm that it is in production, let us know and we will decommission the old instances.

I checked deployment-mediawiki12 and it does indeed look like the above config is in production (for beta anyway, which is all we care about).

Thus, I will start the decommissioning process for deployment-elastic05-07 immediately.

Mentioned in SAL (#wikimedia-cloud) [2022-05-12T22:09:48Z] <inflatador> bking@deployment-elastic05 banned deployment-elastic05 from beta ES cluster in preparation for decom T299797

Mentioned in SAL (#wikimedia-cloud) [2022-05-13T18:58:21Z] <inflatador> bking@deployment-elastic05 halted deployment-elastic05 in beta ES cluster; will decom in 1 wk T299797
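
The log entries above do not say how the ban was applied. One common way to do it is Elasticsearch's cluster settings API with an allocation exclusion, sketched below. The helper names, the node name, and the `http://localhost:9200` endpoint are all assumptions for illustration.

```shell
# Hypothetical helper: build the cluster-settings body that excludes a node
# (matched by node name) from shard allocation.
ban_payload() {
    printf '{"transient":{"cluster.routing.allocation.exclude._name":"%s"}}\n' "$1"
}

# Hypothetical helper: send the exclusion to the cluster.
# $1: node name, $2: cluster URL (assumed default: local HTTP endpoint)
ban_node() {
    curl -sS -X PUT "${2:-http://localhost:9200}/_cluster/settings" \
        -H 'Content-Type: application/json' \
        -d "$(ban_payload "$1")"
}
```

With an exclusion like this in place, Elasticsearch moves shards off the named node, so the node can be halted once it no longer holds any.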

Change 791666 had a related patch set uploaded (by Bking; author: Bking):

[operations/puppet@production] elastic: remove decommissioned hosts in beta

https://gerrit.wikimedia.org/r/791666

In T299797#7922213, @bking wrote:

@Jdforrester-WMF I wasn't sure if the code was in production yet or not. If you can confirm that it is in production, let us know and we will decommission the old instances.

Hey, sorry for the slow reply (was at an off-site). Good to know you found out manually, but sorry I wasn't more explicit to assure you so you didn't need to!

Mentioned in SAL (#wikimedia-cloud) [2022-05-16T19:02:56Z] <inflatador> bking@deployment-elastic06 halted deployment-elastic06 in beta ES cluster; will decom on Friday T299797

Mentioned in SAL (#wikimedia-cloud) [2022-05-16T19:30:59Z] <inflatador> bking@deployment-elastic07 halted deployment-elastic07 in beta ES cluster; will decom on Friday T299797

Mentioned in SAL (#wikimedia-cloud) [2022-05-23T19:21:36Z] <inflatador> Deleted deployment-elastic0[5-7] in favor of newer bullseye hosts T299797

Stretch servers have been deleted; deployment-prep elastic hosts are now all on Bullseye. Closing...

bking closed this task as Resolved. May 23 2022, 8:10 PM

Gehel closed subtask T307510: Update tls proxy config for Bullseye as Resolved. Jul 20 2022, 3:06 PM

Change 791050 abandoned by Bking:

[operations/puppet@production] Elastic: Use OS major version for GC flags

Reason:

Using the same versions of ES everywhere now, this is no longer needed

https://gerrit.wikimedia.org/r/791050
