
Rebuild Toolforge elasticsearch cluster with Stretch or Buster
Closed, Resolved, Public

Description

  • Build 3 new Buster nodes and join them to the existing cluster.
  • Dump the indexes from the old nodes and load them on the new nodes.
  • Set up a service name and lb to front the 3-node Buster cluster.
  • Migrate user accounts to the 3-node Buster cluster.
  • Contact the maintainers and tell them to switch to the service name instead of round-robin to the jessie cluster nodes.
  • Shut down tools-elastic-01.tools.eqiad.wmflabs
  • Shut down tools-elastic-02.tools.eqiad.wmflabs
  • Shut down tools-elastic-03.tools.eqiad.wmflabs
  • Delete the tools-elastic-01,02,03.tools.eqiad.wmflabs virtual machine instances

Event Timeline

https://wikitech.wikimedia.org/wiki/User:Jhedden/notes/keepalived -- consider introducing a service name and lb in front of the cluster so that the renaming burden on connected tools is a one time problem.

Change 566704 had a related patch set uploaded (by Arturo Borrero Gonzalez; owner: Arturo Borrero Gonzalez):
[operations/puppet@production] toolforge: refactor elastic role/profile into modern layout

https://gerrit.wikimedia.org/r/566704

Change 566704 merged by Arturo Borrero Gonzalez:
[operations/puppet@production] toolforge: refactor elastic role/profile into modern layout

https://gerrit.wikimedia.org/r/566704

Change 567022 had a related patch set uploaded (by Arturo Borrero Gonzalez; owner: Arturo Borrero Gonzalez):
[operations/puppet@production] toolforge: elasticsearch: nginx: fix dependency cycle

https://gerrit.wikimedia.org/r/567022

Change 567022 merged by Arturo Borrero Gonzalez:
[operations/puppet@production] toolforge: elasticsearch: nginx: fix dependency cycle

https://gerrit.wikimedia.org/r/567022

Bstorm triaged this task as High priority. Feb 11 2020, 4:20 PM
Bstorm added a subscriber: Bstorm.

Setting most jessie deprecation things to "high priority".

bd808 added a subscriber: JHedden.

@JHedden Unlicking this cookie myself and handing off to you based on our brief discussion in the WMCS team meeting.

My general idea about doing this was that I would:

  • Build 3 new Buster nodes and join them to the existing cluster
  • Deal with whatever madness is required to get the existing indexes to sync to those nodes, probably at least involving upping the replica count. I haven't checked to see if we have completely different elasticsearch versions in apt for Jessie and Buster. If we do, it may not actually be possible to mix them together depending on the versions and the internal API changes between them.
    • Worst case scenario: make the new nodes an independent cluster and then dump the indexes from the old nodes and load on the new nodes. There is not a ton of data in the cluster in practice.
  • Set up a service name and lb to front the cluster of 6 nodes (or just the 3-node Buster cluster if that's what ends up working)
  • Contact the maintainers of tools with credentials to write to the cluster (see local patches on the Toolforge puppetmaster for that) and tell them to switch to the service name instead of round-robin to the jessie cluster nodes
  • Shut down the jessie nodes
  • Possibly set up DNS using the old Jessie node names as CNAMEs for the new service name if there were tool maintainers I could not reach
  • Profit!!!

I'm the maintainer of 2 of the tools that write to the cluster (stashbot & bash), so I would be glad to be the first tester of the service name gateway.

> @JHedden Unlicking this cookie myself and handing off to you based on our brief discussion in the WMCS team meeting.
>
> My general idea about doing this was that I would:
>
>   • Build 3 new Buster nodes and join them to the existing cluster
>   • Deal with whatever madness is required to get the existing indexes to sync to those nodes, probably at least involving upping the replica count. I haven't checked to see if we have completely different elasticsearch versions in apt for Jessie and Buster. If we do, it may not actually be possible to mix them together depending on the versions and the internal API changes between them.

Doesn't look like this is possible. We have 5.5.2 on Jessie and 7.4.2 on Buster. https://www.elastic.co/guide/en/elasticsearch/reference/7.4/setup-upgrade.html

>     • Worst case scenario: make the new nodes an independent cluster and then dump the indexes from the old nodes and load on the new nodes. There is not a ton of data in the cluster in practice.

This is probably our best approach.

>   • Set up a service name and lb to front the cluster of 6 nodes (or just the 3-node Buster cluster if that's what ends up working)
>   • Contact the maintainers of tools with credentials to write to the cluster (see local patches on the Toolforge puppetmaster for that) and tell them to switch to the service name instead of round-robin to the jessie cluster nodes
>   • Shut down the jessie nodes
>   • Possibly set up DNS using the old Jessie node names as CNAMEs for the new service name if there were tool maintainers I could not reach
>   • Profit!!!
>
> I'm the maintainer of 2 of the tools that write to the cluster (stashbot & bash), so I would be glad to be the first tester of the service name gateway.

Sounds like a good plan.

>     • Worst case scenario: make the new nodes an independent cluster and then dump the indexes from the old nodes and load on the new nodes. There is not a ton of data in the cluster in practice.
>
> This is probably our best approach.

https://www.elastic.co/guide/en/elasticsearch/reference/7.4/reindex-upgrade-remote.html may or may not be useful in transferring indices.

Splitting to a new cluster and setting up the lb/service name there actually may be useful for client migration as well. Jumping from 5.5.2 to 7.4.2 may also require client code upgrades leading to a need for a testing cycle on the tools themselves as well.

Change 574063 had a related patch set uploaded (by Jhedden; owner: Jhedden):
[operations/puppet@production] haproxy: update systemd service for buster

https://gerrit.wikimedia.org/r/574063

Change 574063 merged by Jhedden:
[operations/puppet@production] haproxy: update systemd service for buster

https://gerrit.wikimedia.org/r/574063

Change 574527 had a related patch set uploaded (by Jhedden; owner: Jhedden):
[operations/puppet@production] toolforge: upgrade elasticsearch and add debian buster support

https://gerrit.wikimedia.org/r/574527

Patch is up for review; if it looks good I'll create a test cluster in toolsbeta for further testing.

@bd808 I kept the configuration close to our existing install. I'm curious: should we set up HTTPS for these endpoints?

> @bd808 I kept the configuration close to our existing install. I'm curious: should we set up HTTPS for these endpoints?

Adding TLS would be fine if it is easy. The traffic (both east-west in the elastic cluster and north-south to the using code) is all internal to the tools project, so if it is not trivially easy to set up TLS and ensure that it is using a server cert that is client-code friendly (LE rather than Puppet certs or something where the signing CA is untrusted by default), then I think we can leave that out of the upgrade/migration.

Change 574527 merged by Jhedden:
[operations/puppet@production] toolforge: upgrade elasticsearch and add debian buster support

https://gerrit.wikimedia.org/r/574527

Mentioned in SAL (#wikimedia-cloud) [2020-02-27T20:19:55Z] <jeh> update elasticsearch VPS security group to allow toolsbeta-elastic7-1 access on tcp 80 T236606

Mentioned in SAL (#wikimedia-cloud) [2020-02-27T21:03:13Z] <jeh> add reindex service account to elasticsearch for data migration T236606

I've set up a new Elasticsearch v7 test cluster in toolsbeta and opened up a connection to the existing tools Elasticsearch cluster. Using this, I was able to verify that the data migration process works between v5 and v7 by reindexing from a remote source. [0]

Steps to configure and reindex from the existing cluster to a new one.

Temporarily add the remote (old) cluster to the reindex whitelist on the new cluster:

prepare new cluster
$ sudo vi /etc/elasticsearch/labs-tools/elasticsearch.yml
    # add: reindex.remote.whitelist: tools-elastic-01.tools.eqiad.wmflabs:80
$ sudo systemctl restart elasticsearch_7@labs-tools.service

Reindex <index_name>

on new cluster
$ curl -HContent-Type:application/json -XPOST 127.0.0.1:9200/_reindex?pretty -d'{
  "source": {
    "remote": {
      "host": "http://tools-elastic-01.tools.eqiad.wmflabs:80",
      "username": "reindex",
      "password": "MASKED"
    },
    "index": "<index_name>"
  },
  "dest": {
    "index": "<index_name>"
  }
}'

Once the index has been created, change the number of replicas to 2

$ curl -HContent-Type:application/json -XPUT 'localhost:9200/<index_name>/_settings' -d '{
    "index.number_of_replicas" : 2
}'

Confirm that the document counts match between the old and new index:

old index
$ curl 'http://tools-elastic-01.tools.eqiad.wmflabs/_cat/indices/<index_name>?v&s=index'
health status index         uuid                   pri rep docs.count docs.deleted store.size pri.store.size
green  open   <index_name>  zszLDjngSTKMfWBo0vyQgQ   1   2          3            4     32.7kb         10.9kb
new index
$ curl -XGET "http://localhost:9200/_cat/indices/<index_name>?v&s=index"
health status index         uuid                   pri rep docs.count docs.deleted store.size pri.store.size
green  open   <index_name>  SLREKcaKSv65LIB_I809oA   1   2          3            0     26.6kb          8.8kb

[0] https://www.elastic.co/guide/en/elasticsearch/reference/current/reindex-upgrade-remote.html
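
For reference, the per-index calls above can be wrapped in a small shell loop. This is only a sketch of the approach, not the exact commands that were run; it assumes the old cluster's index names can be listed with _cat/indices and that the same reindex service account works for every index.

on new cluster, loop over every index on the old cluster (sketch)
$ for idx in $(curl -s 'http://tools-elastic-01.tools.eqiad.wmflabs/_cat/indices?h=index'); do
    # copy one index at a time from the remote (old) cluster into the local one
    curl -HContent-Type:application/json -XPOST '127.0.0.1:9200/_reindex?pretty' -d"{
      \"source\": {
        \"remote\": {
          \"host\": \"http://tools-elastic-01.tools.eqiad.wmflabs:80\",
          \"username\": \"reindex\",
          \"password\": \"MASKED\"
        },
        \"index\": \"${idx}\"
      },
      \"dest\": { \"index\": \"${idx}\" }
    }"
  done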

Mentioned in SAL (#wikimedia-cloud) [2020-02-28T15:09:37Z] <jeh> create 3 new elasticsearch VMs tools-elastic-[1,2,3] T236606

Mentioned in SAL (#wikimedia-cloud) [2020-02-28T15:28:48Z] <jeh> create OpenStack server group tools-elastic with anti-affinty policy enabled T236606

Mentioned in SAL (#wikimedia-cloud) [2020-03-02T22:26:36Z] <jeh> starting first pass of elasticsearch data migration to new cluster T236606

The data migration process successfully copied 84 indexes over; only 2 indexes had problems.

health status index                                    uuid                   pri rep docs.count docs.deleted store.size pri.store.size
green  open   s53402__wmde-uca-test_content_1502297469 neldQFYoQOmEO5O4hkOaDg   4   2        358           10     60.7mb         20.2mb
green  open   s53402__wmde-uca-test_general_1502297503 cW7QuoLkT9SJYbVCDCTiDw   4   2        110            0     14.6mb          4.8mb

These indexes failed to copy due to boolean typing errors: "Failed to parse value [<value>] as only [true] or [false] are allowed."
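
Not what was actually run, but for the record: one way to work around this kind of error would be to pass a painless script to _reindex so the offending values are coerced to real booleans during the copy. The field name wmde_flag below is purely hypothetical; the error message does not say which field failed.

on new cluster, reindex with a coercion script (sketch only)
$ curl -HContent-Type:application/json -XPOST '127.0.0.1:9200/_reindex?pretty' -d'{
  "source": {
    "remote": {
      "host": "http://tools-elastic-01.tools.eqiad.wmflabs:80",
      "username": "reindex",
      "password": "MASKED"
    },
    "index": "s53402__wmde-uca-test_content_1502297469"
  },
  "dest": {
    "index": "s53402__wmde-uca-test_content_1502297469"
  },
  "script": {
    "lang": "painless",
    "source": "if (ctx._source.wmde_flag instanceof String) { ctx._source.wmde_flag = ctx._source.wmde_flag == \"true\" || ctx._source.wmde_flag == \"yes\" || ctx._source.wmde_flag == \"1\"; }"
  }
}'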

Change 576395 had a related patch set uploaded (by Jhedden; owner: Jhedden):
[operations/puppet@production] keepalived: add initial module and toolforge profile

https://gerrit.wikimedia.org/r/576395

Mentioned in SAL (#wikimedia-cloud) [2020-03-03T17:31:14Z] <jeh> create a OpenStack virtual ip address for the new elasticsearch cluster T236606

Mentioned in SAL (#wikimedia-cloud) [2020-03-03T18:02:49Z] <jeh> create OpenStack DNS record for elasticsearch.svc.tools.eqiad.wikimedia.cloud T236606

Mentioned in SAL (#wikimedia-cloud) [2020-03-03T18:16:30Z] <jeh> create OpenStack DNS record for elasticsearch.svc.tools.eqiad1.wikimedia.cloud (eqiad1 subdomain change) T236606

Change 576395 merged by Jhedden:
[operations/puppet@production] keepalived: add initial module and toolforge profile

https://gerrit.wikimedia.org/r/576395

The new cluster is online and ready for early user testing.

All existing Elasticsearch user accounts and data have been migrated over, and the new service address is running active/standby across the 3 Elasticsearch nodes.

$ curl http://elasticsearch.svc.tools.eqiad1.wikimedia.cloud/_cat/health?v
epoch      timestamp cluster status node.total node.data shards pri relo init unassign pending_tasks max_task_wait_time active_shards_percent
1583271911 21:45:11  tools   green           3         3    255  85    0    0        0             0                  -                100.0%

Keepalived failover testing logs:

Mar 03 21:44:46 tools-elastic-2 Keepalived_vrrp[12176]: (VRRP1) Entering BACKUP STATE (init)
Mar 03 21:44:49 tools-elastic-2 Keepalived_vrrp[12176]: (VRRP1) Entering MASTER STATE
Mar 03 21:44:49 tools-elastic-2 Keepalived_vrrp[12176]: (VRRP1) Master received advert from 172.16.1.102 with higher priority 130, ours 110
Mar 03 21:44:49 tools-elastic-2 Keepalived_vrrp[12176]: (VRRP1) Entering BACKUP STATE
Mar 03 21:46:14 tools-elastic-2 Keepalived_vrrp[12176]: (VRRP1) Backup received priority 0 advertisement
Mar 03 21:46:14 tools-elastic-2 Keepalived_vrrp[12176]: (VRRP1) Entering MASTER STATE
Mar 03 21:46:37 tools-elastic-2 Keepalived_vrrp[12176]: (VRRP1) Master received advert from 172.16.1.102 with higher priority 130, ours 110
Mar 03 21:46:37 tools-elastic-2 Keepalived_vrrp[12176]: (VRRP1) Entering BACKUP STATE
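
For context, a minimal keepalived stanza of the kind that would produce the VRRP1 state transitions above might look like the sketch below. The interface name, virtual_router_id and VIP are placeholders; the real configuration comes from the keepalived puppet module/profile added in change 576395.

# illustrative only, not the deployed config
vrrp_instance VRRP1 {
    state BACKUP
    interface eth0              # placeholder interface name
    virtual_router_id 51        # placeholder VRID
    priority 110                # 130 on the node preferred as master
    advert_int 1
    virtual_ipaddress {
        172.16.x.x              # the VIP behind elasticsearch.svc.tools.eqiad1.wikimedia.cloud
    }
}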

Next I'll be working on documentation and a plan to cut over production.
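
As an illustration of what the cut-over means for a tool (a sketch, not the agreed procedure): instead of writing to an individual tools-elastic-0N host, a tool points its client at the service name, using the same kind of per-tool credentials it has today. The account and index name below are placeholders.

write through the new service name instead of an individual node (sketch)
$ curl -u '<tool_account>:MASKED' -HContent-Type:application/json \
    -XPOST 'http://elasticsearch.svc.tools.eqiad1.wikimedia.cloud/<index_name>/_doc?pretty' \
    -d'{"message": "connectivity test via the service name"}'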

Change 577352 had a related patch set uploaded (by Jhedden; owner: Jhedden):
[operations/puppet@production] toolforge: increase elasticsearch timeout on haproxy

https://gerrit.wikimedia.org/r/577352

Change 577352 merged by Jhedden:
[operations/puppet@production] toolforge: increase elasticsearch timeout on haproxy

https://gerrit.wikimedia.org/r/577352

Mentioned in SAL (#wikimedia-cloud) [2020-04-20T13:28:11Z] <jeh> shutdown elasticsearch v5 cluster running Jessie T236606

Mentioned in SAL (#wikimedia-cloud) [2020-05-04T22:08:23Z] <bstorm_> deleting tools-elastic-01/2/3 T236606

Bstorm updated the task description. (Show Details)

Change 598082 had a related patch set uploaded (by BryanDavis; owner: Bryan Davis):
[operations/puppet@production] toolforge: remove elasticsearch5 role and profile manifests

https://gerrit.wikimedia.org/r/598082

Change 598082 merged by Andrew Bogott:
[operations/puppet@production] toolforge: remove elasticsearch5 role and profile manifests

https://gerrit.wikimedia.org/r/598082