
Migrate Relforge to Opensearch
Closed, Resolved · Public · 8 Estimated Story Points

Description

Relforge (Search Platform's Elasticsearch playground environment) will be the first environment in which we deploy OpenSearch.

Creating this ticket to migrate the Relforge cluster from Elasticsearch 7.10 to OpenSearch.

AC:

  • Relforge is running OpenSearch, with a feature set similar to that of the current Elasticsearch deployment
  • Additional security / authentication is NOT part of this ticket

Details

Related Changes in Gerrit:
Repo                   Branch      Lines +/-
operations/puppet      production  +5 -1
operations/puppet      production  +2 -6
operations/cookbooks   master      +10 -2
operations/puppet      production  +70 -0
operations/puppet      production  +50 -1
operations/puppet      production  +3 -3
operations/puppet      production  +5 -5
operations/puppet      production  +25 -0
operations/puppet      production  +2 -2
operations/puppet      production  +1 -1
operations/puppet      production  +3 -1
operations/puppet      production  +3 -0
operations/puppet      production  +3 -0
operations/puppet      production  +8 -2
operations/puppet      production  +2 -1
operations/puppet      production  +7 -0
operations/puppet      production  +12 -18
operations/puppet      production  +28 -27
operations/puppet      production  +357 -4
operations/cookbooks   master      +4 -0
operations/puppet      production  +116 -0
operations/puppet      production  +81 -0

Event Timeline

There are a very large number of changes, so older changes are hidden.

Change #1090529 merged by Bking:

[operations/puppet@production] Transition relforge to OpenSearch

https://gerrit.wikimedia.org/r/1090529

Mentioned in SAL (#wikimedia-operations) [2025-02-12T20:01:22Z] <bking@cumin2002> DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on relforge1004.eqiad.wmnet with reason: T380752

Change #1119227 had a related patch set uploaded (by Bking; author: Bking):

[operations/puppet@production] cirrus: disable opensearch-madvise while we debate its future

https://gerrit.wikimedia.org/r/1119227

Change #1119227 merged by Bking:

[operations/puppet@production] cirrus: disable opensearch-madvise while we debate its future

https://gerrit.wikimedia.org/r/1119227

We found out that relforge1004 cannot be reimaged via cookbook, as it's an HP chassis (WMF hasn't bought them for years; the relforge hosts themselves will be replaced very soon).

We talked about our options in Slack and I ultimately decided to repurpose some Elastic hosts as Relforge hosts (see T386357).

The next steps will involve working our way through the Puppet errors recorded here.

bking mentioned this in Unknown Object (Task). Feb 13 2025, 2:37 PM

Change #1119520 had a related patch set uploaded (by Bking; author: Bking):

[operations/puppet@production] relforge/elastic: repurpose elastic hosts for relforge

https://gerrit.wikimedia.org/r/1119520

Change #1119520 merged by Bking:

[operations/puppet@production] relforge/elastic: repurpose elastic hosts for relforge

https://gerrit.wikimedia.org/r/1119520

Change #1120140 had a related patch set uploaded (by Brouberol; author: Brouberol):

[operations/puppet@production] opensearch: include the minor version in the apt component name

https://gerrit.wikimedia.org/r/1120140

Mentioned in SAL (#wikimedia-operations) [2025-02-17T15:59:24Z] <bking@cumin2002> DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 4 days, 0:00:00 on relforge1004.eqiad.wmnet with reason: T380752

Change #1120654 had a related patch set uploaded (by Bking; author: Bking):

[operations/puppet@production] opensearch-cirrus: add repository before attempting plugin install

https://gerrit.wikimedia.org/r/1120654

Change #1120654 merged by Bking:

[operations/puppet@production] opensearch-cirrus: add repository before attempting plugin install

https://gerrit.wikimedia.org/r/1120654

Change #1120140 abandoned by Brouberol:

[operations/puppet@production] opensearch: include the minor version in the apt component name

Reason:

superseded by https://gerrit.wikimedia.org/r/c/operations/puppet/+/1120654

https://gerrit.wikimedia.org/r/1120140

Change #1120900 had a related patch set uploaded (by Brouberol; author: Brouberol):

[operations/puppet@production] cirrus: add the opensearch.motd file

https://gerrit.wikimedia.org/r/1120900

Change #1120900 merged by Brouberol:

[operations/puppet@production] cirrus: add the opensearch.motd file

https://gerrit.wikimedia.org/r/1120900

Change #1120903 had a related patch set uploaded (by Brouberol; author: Brouberol):

[operations/puppet@production] opensearcgh:cirrus: include the diffie-hellman parameter file

https://gerrit.wikimedia.org/r/1120903

Change #1120903 merged by Brouberol:

[operations/puppet@production] opensearcgh:cirrus: include the diffie-hellman parameter file

https://gerrit.wikimedia.org/r/1120903

Change #1120908 had a related patch set uploaded (by Brouberol; author: Brouberol):

[operations/puppet@production] opensearch:cirrus: install curator for opensearch

https://gerrit.wikimedia.org/r/1120908

Change #1120908 merged by Brouberol:

[operations/puppet@production] opensearch:cirrus: install curator for opensearch

https://gerrit.wikimedia.org/r/1120908

Change #1120914 had a related patch set uploaded (by Brouberol; author: Brouberol):

[operations/puppet@production] opensearch:cirrus: pin elasticsearch-curator version

https://gerrit.wikimedia.org/r/1120914

Change #1120914 merged by Brouberol:

[operations/puppet@production] opensearch:cirrus: pin elasticsearch-curator version

https://gerrit.wikimedia.org/r/1120914

Change #1120969 had a related patch set uploaded (by Bking; author: Bking):

[operations/puppet@production] relforge: define opensearch datadir as 'opensearch'

https://gerrit.wikimedia.org/r/1120969

Change #1120969 merged by Bking:

[operations/puppet@production] relforge: define opensearch datadir as 'opensearch'

https://gerrit.wikimedia.org/r/1120969

Change #1121087 had a related patch set uploaded (by Bking; author: Bking):

[operations/puppet@production] cirrus: add commands to configure opensearch keystore

https://gerrit.wikimedia.org/r/1121087

Change #1121087 merged by Bking:

[operations/puppet@production] cirrus: add commands to configure opensearch keystore

https://gerrit.wikimedia.org/r/1121087

Change #1121101 had a related patch set uploaded (by Bking; author: Bking):

[operations/puppet@production] cirrus: rename s3 resources

https://gerrit.wikimedia.org/r/1121101

Change #1121101 merged by Bking:

[operations/puppet@production] cirrus: rename s3 resources

https://gerrit.wikimedia.org/r/1121101

Change #1121312 had a related patch set uploaded (by Brouberol; author: Brouberol):

[operations/puppet@production] opensearch:cirrus: add the opensearch- prefix to some plugins

https://gerrit.wikimedia.org/r/1121312

Change #1121312 merged by Brouberol:

[operations/puppet@production] opensearch:cirrus: add the opensearch- prefix to some plugins

https://gerrit.wikimedia.org/r/1121312

Mentioned in SAL (#wikimedia-operations) [2025-02-20T14:46:59Z] <inflatador> bking@apt1002:~/pkg$ sudo -E reprepro -C component/opensearch13 include bullseye-wikimedia $HOME/pkg/wmf-opensearch-search-plugins_1.3.20-1_amd64.changes T380752

Mentioned in SAL (#wikimedia-operations) [2025-02-20T15:20:28Z] <inflatador> bking@apt1002:~/pkg$ sudo -E reprepro -C component/opensearch13 remove bullseye-wikimedia wmf-opensearch-search-plugins T380752

Mentioned in SAL (#wikimedia-operations) [2025-02-20T15:20:43Z] <inflatador> bking@apt1002:~/pkg$ sudo -E reprepro -C component/opensearch13 include bullseye-wikimedia $HOME/pkg/wmf-opensearch-search-plugins_1.3.20-1_amd64.changes (again) T380752

Mentioned in SAL (#wikimedia-operations) [2025-02-20T15:51:30Z] <bking@cumin2002> START - Cookbook sre.elasticsearch.rolling-operation Operation.RESTART (1 nodes at a time) for ElasticSearch cluster relforge: test operations in mixed opensearch/elasticsearch cluster - bking@cumin2002 - T380752

Mentioned in SAL (#wikimedia-operations) [2025-02-20T15:51:34Z] <bking@cumin2002> END (FAIL) - Cookbook sre.elasticsearch.rolling-operation (exit_code=99) Operation.RESTART (1 nodes at a time) for ElasticSearch cluster relforge: test operations in mixed opensearch/elasticsearch cluster - bking@cumin2002 - T380752

Mentioned in SAL (#wikimedia-operations) [2025-02-20T18:42:07Z] <bking@cumin2002> START - Cookbook sre.elasticsearch.rolling-operation Operation.RESTART (1 nodes at a time) for ElasticSearch cluster relforge: test operations in mixed opensearch/elasticsearch cluster - bking@cumin2002 - T380752:

Mentioned in SAL (#wikimedia-operations) [2025-02-20T18:42:11Z] <bking@cumin2002> END (FAIL) - Cookbook sre.elasticsearch.rolling-operation (exit_code=99) Operation.RESTART (1 nodes at a time) for ElasticSearch cluster relforge: test operations in mixed opensearch/elasticsearch cluster - bking@cumin2002 - T380752:

bking changed the task status from Open to In Progress. Feb 20 2025, 9:38 PM
bking triaged this task as Medium priority.

Change #1121711 had a related patch set uploaded (by Bking; author: Bking):

[operations/puppet@production] relforge: reassign relforge1005 to Opensearch role

https://gerrit.wikimedia.org/r/1121711

Cookbook cookbooks.sre.hosts.reimage was started by brouberol@cumin2002 for host relforge1005.eqiad.wmnet with OS bullseye

Change #1121711 merged by Brouberol:

[operations/puppet@production] relforge: reassign relforge1005 to Opensearch role

https://gerrit.wikimedia.org/r/1121711

Cookbook cookbooks.sre.hosts.reimage started by brouberol@cumin2002 for host relforge1005.eqiad.wmnet with OS bullseye completed:

  • relforge1005 (WARN)
    • Downtimed on Icinga/Alertmanager
    • Disabled Puppet
    • Removed from Puppet and PuppetDB if present and deleted any certificates
    • Removed from Debmonitor if present
    • Forced PXE for next reboot
    • Host rebooted via IPMI
    • Host up (Debian installer)
    • Add puppet_version metadata to Debian installer
    • Checked BIOS boot parameters are back to normal
    • Host up (new fresh bullseye OS)
    • Generated Puppet certificate
    • Signed new Puppet certificate
    • Run Puppet in NOOP mode to populate exported resources in PuppetDB
    • Found Nagios_host resource for this host in PuppetDB
    • Downtimed the new host on Icinga/Alertmanager
    • Removed previous downtime on Alertmanager (old OS)
    • First Puppet run completed and logged in /var/log/spicerack/sre/hosts/reimage/202502241445_brouberol_2558491_relforge1005.out
    • configmaster.wikimedia.org updated with the host new SSH public key for wmf-update-known-hosts-production
    • Rebooted
    • Automatic Puppet run was successful
    • Forced a re-check of all Icinga services for the host
    • Icinga status is not optimal, downtime not removed
    • Updated Netbox data from PuppetDB

Change #1122900 had a related patch set uploaded (by Brouberol; author: Brouberol):

[operations/puppet@production] global_config: add external services for opensearch clusters

https://gerrit.wikimedia.org/r/1122900

bking changed the task status from In Progress to Open. Mar 4 2025, 10:01 PM
bking added a subscriber: dcausse.

Per IRC conversation in Wikimedia-Search, the cluster is currently in a mixed state (relforge1003 is on Elastic, relforge1004 is on OpenSearch).

We've opted to leave it this way as both relforge hosts will be replaced in the near future (ref T382906).

Moving to 'needs review' so @dcausse has a chance to give feedback before we close this out.
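
For anyone wanting to confirm the mixed state described above, here is a minimal Python sketch. It relies on the fact that OpenSearch reports `version.distribution: "opensearch"` in its root (`GET /`) response, while Elasticsearch 7.x omits that field. The function names and the sample payloads are hypothetical and only illustrate the check; they are not taken from our tooling, and fetching the responses from the hosts is left out.

```python
from typing import Any, Mapping


def classify_distribution(root_response: Mapping[str, Any]) -> str:
    """Classify a parsed `GET /` response body as 'opensearch' or 'elasticsearch'."""
    version = root_response.get("version", {})
    # OpenSearch sets version.distribution to "opensearch";
    # Elasticsearch 7.x does not include the field at all.
    if version.get("distribution") == "opensearch":
        return "opensearch"
    return "elasticsearch"


def is_mixed(responses: Mapping[str, Mapping[str, Any]]) -> bool:
    """Return True when the given nodes do not all run the same distribution."""
    return len({classify_distribution(body) for body in responses.values()}) > 1


# Hypothetical, trimmed payloads; real responses carry many more fields.
sample = {
    "relforge1003": {"version": {"number": "7.10.2"}},
    "relforge1004": {"version": {"distribution": "opensearch", "number": "1.3.20"}},
}

for host, body in sample.items():
    print(host, classify_distribution(body))
print("mixed:", is_mixed(sample))
```

Treating a missing `distribution` field as Elasticsearch is a simplification that holds for the two versions involved here, but should not be assumed to hold in general.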

brouberol updated Other Assignee, removed: bking.
brouberol subscribed.

It's been two weeks, so I'm gonna close this one out. Feel free to reopen if we need to address anything else.

Change #1122900 abandoned by Brouberol:

[operations/puppet@production] global_config: add external services for opensearch clusters

https://gerrit.wikimedia.org/r/1122900

Change #1119058 merged by jenkins-bot:

[operations/cookbooks@master] ES/rolling-operation: add a optional flag to ask for confirmation before running operation

https://gerrit.wikimedia.org/r/1119058

bking reopened this task as In Progress. Mar 20 2025, 3:52 PM

Change #1129877 had a related patch set uploaded (by Bking; author: Bking):

[operations/puppet@production] relforge: move relforge1003 into OpenSearch role

https://gerrit.wikimedia.org/r/1129877

Change #1129877 merged by Bking:

[operations/puppet@production] relforge: move relforge1003 into OpenSearch role

https://gerrit.wikimedia.org/r/1129877

Mentioned in SAL (#wikimedia-operations) [2025-03-20T17:23:01Z] <bking@cumin2002> START - Cookbook sre.elasticsearch.ban Banning hosts: relforge1003* for ban host prior to reimage - bking@cumin2002 - T380752

Mentioned in SAL (#wikimedia-operations) [2025-03-20T17:23:05Z] <bking@cumin2002> END (PASS) - Cookbook sre.elasticsearch.ban (exit_code=0) Banning hosts: relforge1003* for ban host prior to reimage - bking@cumin2002 - T380752

Mentioned in SAL (#wikimedia-operations) [2025-03-20T18:51:14Z] <bking@cumin2002> START - Cookbook sre.elasticsearch.ban Banning hosts: relforge1003* for ban host to test reimage - bking@cumin2002 - T380752

Mentioned in SAL (#wikimedia-operations) [2025-03-20T18:51:18Z] <bking@cumin2002> END (PASS) - Cookbook sre.elasticsearch.ban (exit_code=0) Banning hosts: relforge1003* for ban host to test reimage - bking@cumin2002 - T380752

Change #1129915 had a related patch set uploaded (by Bking; author: Bking):

[operations/puppet@production] relforge: add relforge1004 as master eligible

https://gerrit.wikimedia.org/r/1129915

Change #1129915 merged by Bking:

[operations/puppet@production] relforge: add relforge1004 as master eligible

https://gerrit.wikimedia.org/r/1129915

Mentioned in SAL (#wikimedia-operations) [2025-03-20T19:21:47Z] <bking@cumin2002> START - Cookbook sre.elasticsearch.ban Banning hosts: relforge1004* for ban host to test reimage - bking@cumin2002 - T380752

Mentioned in SAL (#wikimedia-operations) [2025-03-20T19:21:51Z] <bking@cumin2002> END (PASS) - Cookbook sre.elasticsearch.ban (exit_code=0) Banning hosts: relforge1004* for ban host to test reimage - bking@cumin2002 - T380752

Mentioned in SAL (#wikimedia-operations) [2025-03-20T19:27:34Z] <bking@cumin2002> START - Cookbook sre.elasticsearch.ban Banning hosts: relforge1003* for ban host to test puppet code - bking@cumin2002 - T380752

Mentioned in SAL (#wikimedia-operations) [2025-03-20T19:27:37Z] <bking@cumin2002> END (PASS) - Cookbook sre.elasticsearch.ban (exit_code=0) Banning hosts: relforge1003* for ban host to test puppet code - bking@cumin2002 - T380752
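
The ban runs logged above move shards off a host before it is touched. How the sre.elasticsearch.ban cookbook does this is not shown in this task, so the following is only a sketch of one common approach, assuming shard allocation filtering through the cluster settings API (`PUT _cluster/settings`). The helper names are hypothetical, and matching on `_name` (rather than `_host` or `_ip`) is an assumption.

```python
import json
from typing import Any, Dict

# Allocation-filtering setting that excludes nodes by node name.
# Assumption: the real cookbook may filter on a different attribute.
EXCLUDE_SETTING = "cluster.routing.allocation.exclude._name"


def build_ban_settings(host_pattern: str) -> Dict[str, Any]:
    """Build a cluster settings body that moves shards off matching nodes."""
    return {"transient": {EXCLUDE_SETTING: host_pattern}}


def build_unban_settings() -> Dict[str, Any]:
    """Build a cluster settings body that clears the exclusion (null resets it)."""
    return {"transient": {EXCLUDE_SETTING: None}}


# Bodies that would be sent with PUT _cluster/settings.
print(json.dumps(build_ban_settings("relforge1003*")))
print(json.dumps(build_unban_settings()))
```

A transient setting is used here so the exclusion does not survive a full cluster restart; a persistent setting would be the alternative if the ban needs to outlive one.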

relforge1004 is now using OpenSearch. Since our reimage automation does not work with this extremely old chassis, I one-offed the host. This should be OK until we can add the new relforge hosts (labeled elastic112[3-5] in T384966), which will probably happen in the next week or so. Closing...