
Rebuild Toolforge elasticsearch cluster with Stretch or Buster
Closed, Resolved, Public

Description

  • Build 3 new Buster nodes and join them to the existing cluster.
  • Dump the indexes from the old nodes and load them on the new nodes.
  • Set up a service name and lb to front the 3-node Buster cluster.
  • Migrate user accounts to the 3-node Buster cluster.
  • Contact the maintainers and tell them to switch to the service name instead of round-robin to the jessie cluster nodes.
  • Shut down tools-elastic-01.tools.eqiad.wmflabs
  • Shut down tools-elastic-02.tools.eqiad.wmflabs
  • Shut down tools-elastic-03.tools.eqiad.wmflabs
  • Delete the tools-elastic-01,02,03.tools.eqiad.wmflabs virtual machine instances

Event Timeline

https://wikitech.wikimedia.org/wiki/User:Jhedden/notes/keepalived -- consider introducing a service name and lb in front of the cluster so that the renaming burden on connected tools is a one time problem.

Change 566704 had a related patch set uploaded (by Arturo Borrero Gonzalez; owner: Arturo Borrero Gonzalez):
[operations/puppet@production] toolforge: refactor elastic role/profile into modern layout

https://gerrit.wikimedia.org/r/566704

Change 566704 merged by Arturo Borrero Gonzalez:
[operations/puppet@production] toolforge: refactor elastic role/profile into modern layout

https://gerrit.wikimedia.org/r/566704

Change 567022 had a related patch set uploaded (by Arturo Borrero Gonzalez; owner: Arturo Borrero Gonzalez):
[operations/puppet@production] toolforge: elasticsearch: nginx: fix dependency cycle

https://gerrit.wikimedia.org/r/567022

Change 567022 merged by Arturo Borrero Gonzalez:
[operations/puppet@production] toolforge: elasticsearch: nginx: fix dependency cycle

https://gerrit.wikimedia.org/r/567022

Bstorm triaged this task as High priority. Feb 11 2020, 4:20 PM
Bstorm added a subscriber: Bstorm.

Setting most jessie deprecation things to "high priority".

bd808 added a subscriber: JHedden.

@JHedden Unlicking this cookie myself and handing off to you based on our brief discussion in the WMCS team meeting.

My general idea about doing this was that I would:

  • Build 3 new Buster nodes and join them to the existing cluster
  • Deal with whatever madness is required to get the existing indexes to sync to those nodes, probably at least involving upping the replica count. I haven't checked to see if we have completely different elasticsearch versions in apt for Jessie and Buster. If we do, it may not actually be possible to mix them together depending on the versions and the internal API changes between them.
    • Worst case scenario: make the new nodes an independent cluster and then dump the indexes from the old nodes and load on the new nodes. There is not a ton of data in the cluster in practice.
  • Set up a service name and lb to front the cluster of 6 nodes (or just the 3-node Buster cluster if that's what ends up working)
  • Contact the maintainers of tools with credentials to write to the cluster (see local patches on the Toolforge puppetmaster for that) and tell them to switch to the service name instead of round-robin to the jessie cluster nodes
  • Shut down the jessie nodes
  • Possibly set up DNS using the old Jessie node names as CNAMEs for the new service name if there were tool maintainers I could not reach
  • Profit!!!

I'm the maintainer of 2 of the tools that write to the cluster (stashbot & bash), so I would be glad to be the first tester of the service name gateway.

> @JHedden Unlicking this cookie myself and handing off to you based on our brief discussion in the WMCS team meeting.
>
> My general idea about doing this was that I would:
>
>   • Build 3 new Buster nodes and join them to the existing cluster
>   • Deal with whatever madness is required to get the existing indexes to sync to those nodes, probably at least involving upping the replica count. I haven't checked to see if we have completely different elasticsearch versions in apt for Jessie and Buster. If we do, it may not actually be possible to mix them together depending on the versions and the internal API changes between them.

Doesn't look like this is possible. We have 5.5.2 on Jessie and 7.4.2 on Buster. https://www.elastic.co/guide/en/elasticsearch/reference/7.4/setup-upgrade.html

>     • Worst case scenario: make the new nodes an independent cluster and then dump the indexes from the old nodes and load on the new nodes. There is not a ton of data in the cluster in practice.

This is probably our best approach.

>   • Set up a service name and lb to front the cluster of 6 nodes (or just the 3-node Buster cluster if that's what ends up working)
>   • Contact the maintainers of tools with credentials to write to the cluster (see local patches on the Toolforge puppetmaster for that) and tell them to switch to the service name instead of round-robin to the jessie cluster nodes
>   • Shut down the jessie nodes
>   • Possibly set up DNS using the old Jessie node names as CNAMEs for the new service name if there were tool maintainers I could not reach
>   • Profit!!!
>
> I'm the maintainer of 2 of the tools that write to the cluster (stashbot & bash), so I would be glad to be the first tester of the service name gateway.

Sounds like a good plan.

>     • Worst case scenario: make the new nodes an independent cluster and then dump the indexes from the old nodes and load on the new nodes. There is not a ton of data in the cluster in practice.
>
> This is probably our best approach.

https://www.elastic.co/guide/en/elasticsearch/reference/7.4/reindex-upgrade-remote.html may or may not be useful in transferring indices.

Splitting to a new cluster and setting up the lb/service name there actually may be useful for client migration as well. Jumping from 5.5.2 to 7.4.2 may also require client code upgrades leading to a need for a testing cycle on the tools themselves as well.

Change 574063 had a related patch set uploaded (by Jhedden; owner: Jhedden):
[operations/puppet@production] haproxy: update systemd service for buster

https://gerrit.wikimedia.org/r/574063

Change 574063 merged by Jhedden:
[operations/puppet@production] haproxy: update systemd service for buster

https://gerrit.wikimedia.org/r/574063

Change 574527 had a related patch set uploaded (by Jhedden; owner: Jhedden):
[operations/puppet@production] toolforge: upgrade elasticsearch and add debian buster support

https://gerrit.wikimedia.org/r/574527

Patch is up for review; if it looks good I'll create a test cluster in toolsbeta for further testing.

@bd808 I kept the configuration close to our existing install. I'm curious: should we set up HTTPS for these endpoints?

> @bd808 I kept the configuration close to our existing install. I'm curious: should we set up HTTPS for these endpoints?

Adding TLS would be fine if it is easy. The traffic (both east-west in the elastic cluster and north-south to the using code) is all internal to the tools project, so if it is not trivially easy to set up TLS and ensure that it is using a server cert that is client-code friendly (LE rather than Puppet certs or something where the signing CA is untrusted by default), then I think we can leave that out of the upgrade/migration.

Change 574527 merged by Jhedden:
[operations/puppet@production] toolforge: upgrade elasticsearch and add debian buster support

https://gerrit.wikimedia.org/r/574527

Mentioned in SAL (#wikimedia-cloud) [2020-02-27T20:19:55Z] <jeh> update elasticsearch VPS security group to allow toolsbeta-elastic7-1 access on tcp 80 T236606

Mentioned in SAL (#wikimedia-cloud) [2020-02-27T21:03:13Z] <jeh> add reindex service account to elasticsearch for data migration T236606

I've set up a new Elasticsearch v7 test cluster in toolsbeta and opened up a connection to the existing tools Elasticsearch cluster. Using this, I was able to verify that the data migration process works between v5 and v7 by reindexing from a remote source. [0]

Steps to configure and reindex from the existing cluster to a new one.

Temporarily add the remote (old) cluster to the reindex whitelist on the new cluster:

prepare new cluster
$ sudo vi /etc/elasticsearch/labs-tools/elasticsearch.yml
    # add: reindex.remote.whitelist: tools-elastic-01.tools.eqiad.wmflabs:80
$ sudo systemctl restart elasticsearch_7@labs-tools.service

Reindex <index_name>

on new cluster
$ curl -HContent-Type:application/json -XPOST 127.0.0.1:9200/_reindex?pretty -d'{
  "source": {
    "remote": {
      "host": "http://tools-elastic-01.tools.eqiad.wmflabs:80",
      "username": "reindex",
      "password": "MASKED"
    },
    "index": "<index_name>"
  },
  "dest": {
    "index": "<index_name>"
  }
}'

Once the index has been created, change the number of replicas to 2

$ curl -HContent-Type:application/json -XPUT 'localhost:9200/<index_name>/_settings' -d '{
    "index.number_of_replicas" : 2
}'

Confirm that the document counts match between the old and new index:

old index
$ curl 'http://tools-elastic-01.tools.eqiad.wmflabs/_cat/indices/<index_name>?v&s=index'
health status index         uuid                   pri rep docs.count docs.deleted store.size pri.store.size
green  open   <index_name>  zszLDjngSTKMfWBo0vyQgQ   1   2          3            4     32.7kb         10.9kb
new index
$ curl -XGET "http://localhost:9200/_cat/indices/<index_name>?v&s=index"
health status index         uuid                   pri rep docs.count docs.deleted store.size pri.store.size
green  open   <index_name>  SLREKcaKSv65LIB_I809oA   1   2          3            0     26.6kb          8.8kb

[0] https://www.elastic.co/guide/en/elasticsearch/reference/current/reindex-upgrade-remote.html
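
For reference, the per-index calls above can be wrapped in a small shell loop. This is only a sketch of the approach, not the exact commands that were run; it assumes the old cluster's index names can be listed with _cat/indices and that the same reindex service account works for every index.

on new cluster, loop over every index on the old cluster (sketch)
$ for idx in $(curl -s 'http://tools-elastic-01.tools.eqiad.wmflabs/_cat/indices?h=index'); do
    # copy one index at a time from the remote (old) cluster into the local one
    curl -HContent-Type:application/json -XPOST '127.0.0.1:9200/_reindex?pretty' -d"{
      \"source\": {
        \"remote\": {
          \"host\": \"http://tools-elastic-01.tools.eqiad.wmflabs:80\",
          \"username\": \"reindex\",
          \"password\": \"MASKED\"
        },
        \"index\": \"${idx}\"
      },
      \"dest\": { \"index\": \"${idx}\" }
    }"
  done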

Mentioned in SAL (#wikimedia-cloud) [2020-02-28T15:09:37Z] <jeh> create 3 new elasticsearch VMs tools-elastic-[1,2,3] T236606

Mentioned in SAL (#wikimedia-cloud) [2020-02-28T15:28:48Z] <jeh> create OpenStack server group tools-elastic with anti-affinty policy enabled T236606

Mentioned in SAL (#wikimedia-cloud) [2020-03-02T22:26:36Z] <jeh> starting first pass of elasticsearch data migration to new cluster T236606

The data migration process successfully copied 84 indexes over; only 2 indexes had problems.

health status index                                    uuid                   pri rep docs.count docs.deleted store.size pri.store.size
green  open   s53402__wmde-uca-test_content_1502297469 neldQFYoQOmEO5O4hkOaDg   4   2        358           10     60.7mb         20.2mb
green  open   s53402__wmde-uca-test_general_1502297503 cW7QuoLkT9SJYbVCDCTiDw   4   2        110            0     14.6mb          4.8mb

These indexes failed to copy due to boolean typing errors: "Failed to parse value [<value>] as only [true] or [false] are allowed."
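
Not what was actually run, but for the record: one way to work around this kind of error would be to pass a painless script to _reindex so the offending values are coerced to real booleans during the copy. The field name wmde_flag below is purely hypothetical; the error message does not say which field failed.

on new cluster, reindex with a coercion script (sketch only)
$ curl -HContent-Type:application/json -XPOST '127.0.0.1:9200/_reindex?pretty' -d'{
  "source": {
    "remote": {
      "host": "http://tools-elastic-01.tools.eqiad.wmflabs:80",
      "username": "reindex",
      "password": "MASKED"
    },
    "index": "s53402__wmde-uca-test_content_1502297469"
  },
  "dest": {
    "index": "s53402__wmde-uca-test_content_1502297469"
  },
  "script": {
    "lang": "painless",
    "source": "if (ctx._source.wmde_flag instanceof String) { ctx._source.wmde_flag = ctx._source.wmde_flag == \"true\" || ctx._source.wmde_flag == \"yes\" || ctx._source.wmde_flag == \"1\"; }"
  }
}'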

Change 576395 had a related patch set uploaded (by Jhedden; owner: Jhedden):
[operations/puppet@production] keepalived: add initial module and toolforge profile

https://gerrit.wikimedia.org/r/576395

Mentioned in SAL (#wikimedia-cloud) [2020-03-03T17:31:14Z] <jeh> create a OpenStack virtual ip address for the new elasticsearch cluster T236606

Mentioned in SAL (#wikimedia-cloud) [2020-03-03T18:02:49Z] <jeh> create OpenStack DNS record for elasticsearch.svc.tools.eqiad.wikimedia.cloud T236606

Mentioned in SAL (#wikimedia-cloud) [2020-03-03T18:16:30Z] <jeh> create OpenStack DNS record for elasticsearch.svc.tools.eqiad1.wikimedia.cloud (eqiad1 subdomain change) T236606

Change 576395 merged by Jhedden:
[operations/puppet@production] keepalived: add initial module and toolforge profile

https://gerrit.wikimedia.org/r/576395

The new cluster is online and ready for early user testing.

All existing Elasticsearch user accounts and data have been migrated over, and the new service address is running active/standby across the 3 Elasticsearch nodes.

$ curl http://elasticsearch.svc.tools.eqiad1.wikimedia.cloud/_cat/health?v
epoch      timestamp cluster status node.total node.data shards pri relo init unassign pending_tasks max_task_wait_time active_shards_percent
1583271911 21:45:11  tools   green           3         3    255  85    0    0        0             0                  -                100.0%

Keepalived failover testing logs:

Mar 03 21:44:46 tools-elastic-2 Keepalived_vrrp[12176]: (VRRP1) Entering BACKUP STATE (init)
Mar 03 21:44:49 tools-elastic-2 Keepalived_vrrp[12176]: (VRRP1) Entering MASTER STATE
Mar 03 21:44:49 tools-elastic-2 Keepalived_vrrp[12176]: (VRRP1) Master received advert from 172.16.1.102 with higher priority 130, ours 110
Mar 03 21:44:49 tools-elastic-2 Keepalived_vrrp[12176]: (VRRP1) Entering BACKUP STATE
Mar 03 21:46:14 tools-elastic-2 Keepalived_vrrp[12176]: (VRRP1) Backup received priority 0 advertisement
Mar 03 21:46:14 tools-elastic-2 Keepalived_vrrp[12176]: (VRRP1) Entering MASTER STATE
Mar 03 21:46:37 tools-elastic-2 Keepalived_vrrp[12176]: (VRRP1) Master received advert from 172.16.1.102 with higher priority 130, ours 110
Mar 03 21:46:37 tools-elastic-2 Keepalived_vrrp[12176]: (VRRP1) Entering BACKUP STATE
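
For context, a minimal keepalived stanza of the kind that would produce the VRRP1 state transitions above might look like the sketch below. The interface name, virtual_router_id and VIP are placeholders; the real configuration comes from the keepalived puppet module/profile added in change 576395.

# illustrative only, not the deployed config
vrrp_instance VRRP1 {
    state BACKUP
    interface eth0              # placeholder interface name
    virtual_router_id 51        # placeholder VRID
    priority 110                # 130 on the node preferred as master
    advert_int 1
    virtual_ipaddress {
        172.16.x.x              # the VIP behind elasticsearch.svc.tools.eqiad1.wikimedia.cloud
    }
}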

Next I'll be working on documentation and a plan to cut over production.
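
As an illustration of what the cut-over means for a tool (a sketch, not the agreed procedure): instead of writing to an individual tools-elastic-0N host, a tool points its client at the service name, using the same kind of per-tool credentials it has today. The account and index name below are placeholders.

write through the new service name instead of an individual node (sketch)
$ curl -u '<tool_account>:MASKED' -HContent-Type:application/json \
    -XPOST 'http://elasticsearch.svc.tools.eqiad1.wikimedia.cloud/<index_name>/_doc?pretty' \
    -d'{"message": "connectivity test via the service name"}'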

Change 577352 had a related patch set uploaded (by Jhedden; owner: Jhedden):
[operations/puppet@production] toolforge: increase elasticsearch timeout on haproxy

https://gerrit.wikimedia.org/r/577352

Change 577352 merged by Jhedden:
[operations/puppet@production] toolforge: increase elasticsearch timeout on haproxy

https://gerrit.wikimedia.org/r/577352

Mentioned in SAL (#wikimedia-cloud) [2020-04-20T13:28:11Z] <jeh> shutdown elasticsearch v5 cluster running Jessie T236606

Mentioned in SAL (#wikimedia-cloud) [2020-05-04T22:08:23Z] <bstorm_> deleting tools-elastic-01/2/3 T236606

Bstorm updated the task description. (Show Details)

Change 598082 had a related patch set uploaded (by BryanDavis; owner: Bryan Davis):
[operations/puppet@production] toolforge: remove elasticsearch5 role and profile manifests

https://gerrit.wikimedia.org/r/598082

Change 598082 merged by Andrew Bogott:
[operations/puppet@production] toolforge: remove elasticsearch5 role and profile manifests

https://gerrit.wikimedia.org/r/598082