- Build 3 new Buster nodes and join them to the existing cluster.
- Dump the indexes from the old nodes and load them on the new nodes.
- Set up a service name and load balancer (lb) to front the Buster 3-node cluster.
- Migrate user accounts to the Buster 3-node cluster.
- Contact the maintainers and tell them to switch to the service name instead of round-robin to the Jessie cluster nodes.
- Shut down tools-elastic-01.tools.eqiad.wmflabs
- Shut down tools-elastic-02.tools.eqiad.wmflabs
- Shut down tools-elastic-03.tools.eqiad.wmflabs
- Delete the tools-elastic-01,02,03.tools.eqiad.wmflabs virtual machine instances
Description
Details
Status | Assigned | Task
---|---|---
Resolved | Bstorm | T236565 "tools" Cloud VPS project jessie deprecation
Resolved | JHedden | T236606 Rebuild Toolforge elasticsearch cluster with Stretch or Buster
Resolved | JHedden | T246688 Confirm tools.similarity elasticsearch indexes
Resolved | JHedden | T247098 tools.bash elasticsearch migration
Resolved | Cirdan | T247524 denkmalbot tool elasticsearch migration
Resolved | Legoktm | T247525 flaky-ci tool elasticsearch migration
Resolved | Surlycyborg | T247526 similarity tool elasticsearch migration
Resolved | Hjfocs | T247527 strephit tool elasticsearch migration
Declined | Tarrow | T247528 wikifactmine-pipeline tool elasticsearch migration
Resolved | WMDE-Fisch | T247529 wmde-uca-test tool elasticsearch migration
Resolved | Cyberpower678 | T247530 refill-api tool elasticsearch migration
Resolved | bd808 | T247715 stashbot elasticsearch migration
Event Timeline
https://wikitech.wikimedia.org/wiki/User:Jhedden/notes/keepalived -- consider introducing a service name and lb in front of the cluster so that the renaming burden on connected tools is a one time problem.
Change 566704 had a related patch set uploaded (by Arturo Borrero Gonzalez; owner: Arturo Borrero Gonzalez):
[operations/puppet@production] toolforge: refactor elastic role/profile into modern layout
Change 566704 merged by Arturo Borrero Gonzalez:
[operations/puppet@production] toolforge: refactor elastic role/profile into modern layout
Change 567022 had a related patch set uploaded (by Arturo Borrero Gonzalez; owner: Arturo Borrero Gonzalez):
[operations/puppet@production] toolforge: elasticsearch: nginx: fix dependency cycle
Change 567022 merged by Arturo Borrero Gonzalez:
[operations/puppet@production] toolforge: elasticsearch: nginx: fix dependency cycle
@JHedden Unlicking this cookie myself and handing off to you based on our brief discussion in the WMCS team meeting.
My general idea about doing this was that I would:
- Build 3 new Buster nodes and join them to the existing cluster
- Deal with whatever madness is required to get the existing indexes to sync to those nodes, probably at least involving upping the replica count. I haven't checked to see if we have completely different elasticsearch versions in apt for Jessie and Buster. If we do it may not actually be possible to mix them together depending on the versions and the internal API changes between them.
- Worst case scenario: make the new nodes an independent cluster and then dump the indexes from the old nodes and load on the new nodes. There is not a ton of data in the cluster in practice.
- Set up a service name and lb to front the cluster of 6 nodes (or just the Buster 3 node cluster if that's what ends up working)
- Contact the maintainers of tools with credentials to write to the cluster (see local patches on the Toolforge puppetmaster for that) and tell them to switch to the service name instead of round-robin to the Jessie cluster nodes
- Shut down the Jessie nodes
- Possibly set up DNS using the old Jessie node names as CNAMEs for the new service name if there were tool maintainers I could not reach
- Profit!!!
I'm the maintainer of 2 of the tools that write to the cluster (stashbot & bash), so I would be glad to be the first tester of the service name gateway.
Joining the new nodes to the existing cluster doesn't look possible. We have Elasticsearch 5.5.2 on Jessie and 7.4.2 on Buster. https://www.elastic.co/guide/en/elasticsearch/reference/7.4/setup-upgrade.html
> - Worst case scenario: make the new nodes an independent cluster and then dump the indexes from the old nodes and load on the new nodes. There is not a ton of data in the cluster in practice.
This is probably our best approach.
> - Set up a service name and lb to front the cluster of 6 nodes (or just the Buster 3 node cluster if that's what ends up working)
> - Contact the maintainers of tools with credentials to write to the cluster (see local patches on the Toolforge puppetmaster for that) and tell them to switch to the service name instead of round-robin to the Jessie cluster nodes
> - Shut down the Jessie nodes
> - Possibly set up DNS using the old Jessie node names as CNAMEs for the new service name if there were tool maintainers I could not reach
> - Profit!!!
> I'm the maintainer of 2 of the tools that write to the cluster (stashbot & bash), so I would be glad to be the first tester of the service name gateway.
Sounds like a good plan.
https://www.elastic.co/guide/en/elasticsearch/reference/7.4/reindex-upgrade-remote.html may or may not be useful in transferring indices.
Splitting to a new cluster and setting up the lb/service name there may actually be useful for client migration as well. Jumping from 5.5.2 to 7.4.2 may also require client code upgrades, which would mean a testing cycle for the tools themselves.
Change 574063 had a related patch set uploaded (by Jhedden; owner: Jhedden):
[operations/puppet@production] haproxy: update systemd service for buster
Change 574063 merged by Jhedden:
[operations/puppet@production] haproxy: update systemd service for buster
Change 574527 had a related patch set uploaded (by Jhedden; owner: Jhedden):
[operations/puppet@production] toolforge: upgrade elasticsearch and add debian buster support
Patch is up for review. If it looks good, I'll create a test cluster in toolsbeta for further testing.
@bd808 I kept the configuration close to our existing install. I'm curious whether we should set up HTTPS for these endpoints?
Adding TLS would be fine if it is easy. The traffic (both east-west in the elastic cluster and north-south to the using code) is all internal to the tools project. If it is not trivially easy to set up TLS and ensure that it uses a server cert that is client code friendly (LE rather than Puppet certs or something where the signing CA is untrusted by default), then I think we can leave that out of the upgrade/migration.
Change 574527 merged by Jhedden:
[operations/puppet@production] toolforge: upgrade elasticsearch and add debian buster support
Mentioned in SAL (#wikimedia-cloud) [2020-02-27T20:19:55Z] <jeh> update elasticsearch VPS security group to allow toolsbeta-elastic7-1 access on tcp 80 T236606
Mentioned in SAL (#wikimedia-cloud) [2020-02-27T21:03:13Z] <jeh> add reindex service account to elasticsearch for data migration T236606
I've set up a new elasticsearch v7 test cluster in toolsbeta and opened up a connection between it and the existing tools elasticsearch cluster. Using this, I was able to verify that the data migration process works between v5 and v7 by reindexing from a remote source. [0]
Steps to configure and reindex from the existing cluster to a new one.
Temporarily add the remote cluster to the remote reindex whitelist
$ sudo vi /etc/elasticsearch/labs-tools/elasticsearch.yml
# add - reindex.remote.whitelist: tools-elastic-01.tools.eqiad.wmflabs:80
$ sudo systemctl restart elasticsearch_7@labs-tools.service
Reindex <index_name>
$ curl -HContent-Type:application/json -XPOST 127.0.0.1:9200/_reindex?pretty -d'{
  "source": {
    "remote": {
      "host": "http://tools-elastic-01.tools.eqiad.wmflabs:80",
      "username": "reindex",
      "password": "MASKED"
    },
    "index": "<index_name>"
  },
  "dest": {
    "index": "<index_name>"
  }
}'
Once the index has been created, change the number of replicas to 2
$ curl -HContent-Type:application/json -XPUT 'localhost:9200/<index_name>/_settings' -d '{ "index.number_of_replicas" : 2 }'
Confirm the document count between the old and new index
$ curl 'http://tools-elastic-01.tools.eqiad.wmflabs/_cat/indices/<index_name>?v&s=index'
health status index        uuid                   pri rep docs.count docs.deleted store.size pri.store.size
green  open   <index_name> zszLDjngSTKMfWBo0vyQgQ   1   2          3            4     32.7kb         10.9kb
$ curl -XGET "http://localhost:9200/_cat/indices/<index_name>?v&s=index"
health status index        uuid                   pri rep docs.count docs.deleted store.size pri.store.size
green  open   <index_name> SLREKcaKSv65LIB_I809oA   1   2          3            0     26.6kb          8.8kb
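The steps above could be scripted when many indexes need to be copied. The following is a minimal Python sketch, not the procedure that was actually run: it builds the same `_reindex` request body as the curl command and compares `docs.count` values parsed from `_cat/indices?v` output. The function names (`build_reindex_body`, `post_json`, `parse_doc_counts`) are hypothetical, and the parser assumes every listed index is open so that no column is blank.

```python
import json
import urllib.request


def build_reindex_body(index, remote_host, username, password):
    """Return the _reindex body that copies one index from a remote cluster."""
    return {
        "source": {
            "remote": {
                "host": remote_host,
                "username": username,
                "password": password,
            },
            "index": index,
        },
        "dest": {"index": index},
    }


def post_json(url, body, timeout=600):
    """POST a JSON body and return the decoded JSON response."""
    request = urllib.request.Request(
        url,
        data=json.dumps(body).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return json.loads(response.read().decode("utf-8"))


def parse_doc_counts(cat_indices_output):
    """Map index name -> docs.count from `_cat/indices?v` text output.

    Assumes a header line followed by one whitespace-separated row per
    open index (closed indexes leave columns blank and would break this).
    """
    lines = [line.split() for line in cat_indices_output.strip().splitlines()]
    header, rows = lines[0], lines[1:]
    name_col = header.index("index")
    count_col = header.index("docs.count")
    return {row[name_col]: int(row[count_col]) for row in rows}
```

A caller would pass `build_reindex_body(...)` to `post_json("http://127.0.0.1:9200/_reindex", ...)`, then fetch `_cat/indices` from both clusters and compare the two dictionaries. Only `docs.count` is compared because `docs.deleted` and the store sizes legitimately differ after a reindex, as the sample output shows.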
[0] https://www.elastic.co/guide/en/elasticsearch/reference/current/reindex-upgrade-remote.html
Mentioned in SAL (#wikimedia-cloud) [2020-02-28T15:09:37Z] <jeh> create 3 new elasticsearch VMs tools-elastic-[1,2,3] T236606
Mentioned in SAL (#wikimedia-cloud) [2020-02-28T15:28:48Z] <jeh> create OpenStack server group tools-elastic with anti-affinity policy enabled T236606
Mentioned in SAL (#wikimedia-cloud) [2020-03-02T22:26:36Z] <jeh> starting first pass of elasticsearch data migration to new cluster T236606
The data migration process successfully copied 84 indexes. Only 2 indexes had problems:
health status index                                    uuid                   pri rep docs.count docs.deleted store.size pri.store.size
green  open   s53402__wmde-uca-test_content_1502297469 neldQFYoQOmEO5O4hkOaDg   4   2        358           10     60.7mb         20.2mb
green  open   s53402__wmde-uca-test_general_1502297503 cW7QuoLkT9SJYbVCDCTiDw   4   2        110            0     14.6mb          4.8mb
These indexes failed to copy due to boolean typing errors: "Failed to parse value [<value>] as only [true] or [false] are allowed."
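The error message suggests that some documents hold boolean-like values that Elasticsearch 5 tolerated but Elasticsearch 7 rejects. The thread does not record how these two indexes were eventually handled. As one possible approach, and assuming the affected field names are known, documents could be normalised client-side before being written to the new cluster. The sketch below is hypothetical: the `normalize_booleans` helper, the example field name, and the sets of legacy spellings are assumptions, not taken from the task.

```python
# Assumed legacy spellings; adjust to whatever the source documents contain.
LEGACY_FALSE = {"false", "off", "no", "0", ""}
LEGACY_TRUE = {"true", "on", "yes", "1"}


def normalize_booleans(document, boolean_fields):
    """Return a copy of `document` with the named fields coerced to real booleans.

    Fields that are missing, null, or already boolean are left untouched.
    Values that match neither assumed set raise ValueError so that nothing
    is silently guessed.
    """
    cleaned = dict(document)
    for field in boolean_fields:
        value = cleaned.get(field)
        if value is None or isinstance(value, bool):
            continue
        text = str(value).strip().lower()
        if text in LEGACY_FALSE:
            cleaned[field] = False
        elif text in LEGACY_TRUE:
            cleaned[field] = True
        else:
            raise ValueError(
                f"cannot interpret {value!r} in field {field!r} as a boolean"
            )
    return cleaned
```

For example, `normalize_booleans({"redirect": "yes"}, ["redirect"])` yields a document whose `redirect` field is the boolean `True`. Raising on unknown values is a deliberate choice here: a failed copy is easier to notice than a wrongly coerced field.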
Change 576395 had a related patch set uploaded (by Jhedden; owner: Jhedden):
[operations/puppet@production] keepalived: add initial module and toolforge profile
Mentioned in SAL (#wikimedia-cloud) [2020-03-03T17:31:14Z] <jeh> create a OpenStack virtual ip address for the new elasticsearch cluster T236606
Mentioned in SAL (#wikimedia-cloud) [2020-03-03T18:02:49Z] <jeh> create OpenStack DNS record for elasticsearch.svc.tools.eqiad.wikimedia.cloud T236606
Mentioned in SAL (#wikimedia-cloud) [2020-03-03T18:16:30Z] <jeh> create OpenStack DNS record for elasticsearch.svc.tools.eqiad1.wikimedia.cloud (eqiad1 subdomain change) T236606
Change 576395 merged by Jhedden:
[operations/puppet@production] keepalived: add initial module and toolforge profile
The new cluster is online and ready for early user testing.
All existing Elasticsearch user accounts and data have been migrated over, and the new service address is running in active/standby mode across the 3 Elasticsearch nodes.
$ curl http://elasticsearch.svc.tools.eqiad1.wikimedia.cloud/_cat/health?v
epoch      timestamp cluster status node.total node.data shards pri relo init unassign pending_tasks max_task_wait_time active_shards_percent
1583271911 21:45:11  tools   green           3         3    255  85    0    0        0             0                  -                100.0%
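A check like the one above can be automated by parsing the `_cat/health?v` text. This is a minimal Python sketch under the assumption that the output is one header line followed by one value line, as in the sample; the function names `parse_cat_health` and `cluster_is_healthy` are hypothetical.

```python
def parse_cat_health(output):
    """Parse `_cat/health?v` text (header line + one value line) into a dict."""
    header_line, value_line = output.strip().splitlines()[:2]
    return dict(zip(header_line.split(), value_line.split()))


def cluster_is_healthy(health, expected_nodes):
    """True when status is green, all nodes are present, and no shard is unassigned."""
    return (
        health["status"] == "green"
        and int(health["node.total"]) == expected_nodes
        and int(health["unassign"]) == 0
    )
```

Passing the sample output and `expected_nodes=3` to these functions reports a healthy cluster. Checking the node count as well as the colour guards against a cluster that is green only because a node has dropped out and its shards were reallocated.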
Keepalived failover testing logs:
Mar 03 21:44:46 tools-elastic-2 Keepalived_vrrp[12176]: (VRRP1) Entering BACKUP STATE (init)
Mar 03 21:44:49 tools-elastic-2 Keepalived_vrrp[12176]: (VRRP1) Entering MASTER STATE
Mar 03 21:44:49 tools-elastic-2 Keepalived_vrrp[12176]: (VRRP1) Master received advert from 172.16.1.102 with higher priority 130, ours 110
Mar 03 21:44:49 tools-elastic-2 Keepalived_vrrp[12176]: (VRRP1) Entering BACKUP STATE
Mar 03 21:46:14 tools-elastic-2 Keepalived_vrrp[12176]: (VRRP1) Backup received priority 0 advertisement
Mar 03 21:46:14 tools-elastic-2 Keepalived_vrrp[12176]: (VRRP1) Entering MASTER STATE
Mar 03 21:46:37 tools-elastic-2 Keepalived_vrrp[12176]: (VRRP1) Master received advert from 172.16.1.102 with higher priority 130, ours 110
Mar 03 21:46:37 tools-elastic-2 Keepalived_vrrp[12176]: (VRRP1) Entering BACKUP STATE
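The log shows tools-elastic-2 (priority 110) taking the MASTER role when the peer at 172.16.1.102 (priority 130) advertises priority 0, and returning to BACKUP once that peer advertises again. To summarise such a test run, the state changes can be pulled out of the log. The Python sketch below is illustrative only: the regular expression is written against the line format shown here, and the name `state_transitions` is hypothetical.

```python
import re

# Matches lines such as:
#   Mar 03 21:44:49 tools-elastic-2 Keepalived_vrrp[12176]: (VRRP1) Entering MASTER STATE
STATE_RE = re.compile(
    r"^(?P<timestamp>\w{3}\s+\d{1,2} \d{2}:\d{2}:\d{2}) (?P<host>\S+) "
    r"Keepalived_vrrp\[\d+\]: \((?P<instance>\w+)\) "
    r"Entering (?P<state>MASTER|BACKUP) STATE"
)


def state_transitions(log_lines):
    """Return (timestamp, state) tuples for every VRRP state change in the log."""
    transitions = []
    for line in log_lines:
        match = STATE_RE.match(line)
        if match:
            transitions.append((match["timestamp"], match["state"]))
    return transitions
```

Applied to the log above, this keeps only the "Entering ... STATE" lines and skips the advertisement messages, which makes the failover and failback times easy to read off.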
Next I'll be working on documentation and a plan to cut-over production.
Change 577352 had a related patch set uploaded (by Jhedden; owner: Jhedden):
[operations/puppet@production] toolforge: increase elasticsearch timeout on haproxy
Change 577352 merged by Jhedden:
[operations/puppet@production] toolforge: increase elasticsearch timeout on haproxy
Mentioned in SAL (#wikimedia-cloud) [2020-04-20T13:28:11Z] <jeh> shutdown elasticsearch v5 cluster running Jessie T236606
Mentioned in SAL (#wikimedia-cloud) [2020-05-04T22:08:23Z] <bstorm_> deleting tools-elastic-01/2/3 T236606
Change 598082 had a related patch set uploaded (by BryanDavis; owner: Bryan Davis):
[operations/puppet@production] toolforge: remove elasticsearch5 role and profile manifests
Change 598082 merged by Andrew Bogott:
[operations/puppet@production] toolforge: remove elasticsearch5 role and profile manifests