Migrate toolhub indices from production OpenSearch to OpenSearch on k8s
Open, In Progress, Medium, Public

Description

Per parent ticket, we need to migrate the following indices from the production CirrusSearch eqiad cluster to a net-new OpenSearch on K8s cluster:

curl -s https://search.svc.eqiad.wmnet:9243/_cat/indices | grep toolhub
green open toolhub_tools                              LClpNguQSpmGof0eIVsk8A  1 2      4260       32   29.3mb    9.7mb
green open toolhub_lists                              6rSaAMvoQ_anrDVv2esAfA  1 2        76        5    3.7mb    1.2mb

Event Timeline

bking changed the task status from Open to In Progress. Tue, May 12, 6:56 PM
bking triaged this task as Medium priority.

I talked with @bking about this today and it sounds like our general plan will be something like:

  • Data Platform sets up new containerized opensearch cluster for Toolhub use
  • Data Platform sets up an envoy proxy that can be used to reach the new cluster from the eqiad and codfw WikiKube clusters
  • @bd808 uses his hacky "maintenance environment" (container running on local laptop with ssh tunnels to reach backend services) to populate and test indexes in the new cluster
  • @bd808 updates the helmfile settings to point the Wikikube deployment away from search-chi-eqiad and towards the new cluster
  • @bd808 deploys the config changes & rebuilds the indexes to match the latest canonical data
  • Profit!

^^ Basically what he said 🙃

  • Data Platform sets up new containerized opensearch cluster for Toolhub use. Done, see this page for how to access.
  • Data Platform sets up an envoy proxy that can be used to reach the new cluster from the eqiad and codfw WikiKube clusters. DPE SRE will need a series of patches similar to the ones associated with T421293.
  • @bd808 uses his hacky "maintenance environment" (container running on local laptop with ssh tunnels to reach backend services) to populate and test indexes in the new cluster
  • @bd808 updates the helmfile settings to point the Wikikube deployment away from search-chi-eqiad and towards the new cluster
  • @bd808 deploys the config changes & rebuilds the indexes to match the latest canonical data
  • Profit!

Change #1287388 had a related patch set uploaded (by Atsuko; author: Atsuko):

[operations/puppet@production] services_proxy: isetting up toolhub and ttmserver

https://gerrit.wikimedia.org/r/1287388

Change #1287388 merged by Atsuko:

[operations/puppet@production] services_proxy: isetting up toolhub and ttmserver

https://gerrit.wikimedia.org/r/1287388

Change #1287428 had a related patch set uploaded (by Atsuko; author: Atsuko):

[operations/puppet@production] services_proxy: enabling toolhub and ttmserver

https://gerrit.wikimedia.org/r/1287428

Change #1287428 merged by Atsuko:

[operations/puppet@production] services_proxy: enabling toolhub and ttmserver

https://gerrit.wikimedia.org/r/1287428

atsuko subscribed.

@bd808 hi! the instance is available at
https://opensearch-toolhub-test.svc.eqiad.wmnet:30443/
https://opensearch-toolhub-test.svc.codfw.wmnet:30443/

and globally at
https://opensearch-toolhub-test.discovery.wmnet:30443/
http://localhost:6047

it will need authentication, see /etc/helmfile-defaults/private/dse-k8s_services/opensearch-toolhub-test/dse-k8s-eqiad.yaml

I will need to extend the application itself and its helm chart in order to implement authentication for the opensearch connections. Is this new requirement for the migration negotiable?

@bd808 I believe we are forced by the OpenSearch operator to use basic and/or mutual TLS auth. I'll check again and have an answer for you by this time next week.

It seems it is possible to completely disable the security plugin, which would remove the password requirement as well as the double TLS. However, I don't think opensearch-operator supports bootstrapping such clusters; here are the options I was experimenting with.

I think we can do something around anonymous user permissions as well.

cc @bking

Change #1287846 had a related patch set uploaded (by Atsuko; author: Atsuko):

[operations/deployment-charts@master] draft: disabling security config for opensearch-toolhub-test

https://gerrit.wikimedia.org/r/1287846

Change #1287920 had a related patch set uploaded (by Atsuko; author: Atsuko):

[operations/deployment-charts@master] opensearch-cluster: full permission for anonymous users

https://gerrit.wikimedia.org/r/1287920

Change #1287846 abandoned by Atsuko:

[operations/deployment-charts@master] draft: disabling security config for opensearch-toolhub-test

Reason:

We are not ready to take this approach

https://gerrit.wikimedia.org/r/1287846

@bd808 I'll roll out the new version without a requirement for authentication on Monday morning.

Change #1287920 merged by jenkins-bot:

[operations/deployment-charts@master] opensearch-cluster: full permission for anonymous users

https://gerrit.wikimedia.org/r/1287920

Hi, index access should now work without HTTP auth. If the test cluster is working, I'll provision the prod cluster as well.

bd808@deploy1003:~$ curl -s 'https://opensearch-toolhub-test.discovery.wmnet:30443/_cat/indices?v&h=health,status,index,pri,rep,docs.count,pri.store.size,store.size&s=index' | grep toolhub
green  open   toolhub_lists                  1   2         76            1mb      3.2mb
green  open   toolhub_tools                  1   2       4283         11.7mb     35.2mb
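
For anyone repeating this check, the listing above can be compared against the one in the task description programmatically. The following is a minimal sketch, not part of Toolhub: `parse_cat_indices` and `doc_count_diff` are hypothetical helper names, and the column list for the old listing assumes the default `_cat/indices` column order, while the new one follows the `h=` parameter used in the command above.

```python
def parse_cat_indices(text, columns):
    """Parse whitespace-separated _cat/indices rows into {index_name: row_dict}."""
    rows = {}
    for line in text.strip().splitlines():
        fields = line.split()
        if fields == list(columns):
            continue  # header row, as printed when the request includes ?v
        if len(fields) != len(columns):
            continue  # blank or malformed row
        row = dict(zip(columns, fields))
        rows[row["index"]] = row
    return rows


def doc_count_diff(old, new):
    """Return {index_name: new docs.count - old docs.count} for shared indices."""
    return {
        name: int(new[name]["docs.count"]) - int(old[name]["docs.count"])
        for name in old
        if name in new
    }


# Assumed default column order for the listing in the task description.
OLD_COLUMNS = ["health", "status", "index", "uuid", "pri", "rep",
               "docs.count", "docs.deleted", "store.size", "pri.store.size"]
# Column order requested via h= in the command above.
NEW_COLUMNS = ["health", "status", "index", "pri", "rep",
               "docs.count", "pri.store.size", "store.size"]

OLD_OUTPUT = """
green open toolhub_tools LClpNguQSpmGof0eIVsk8A 1 2 4260 32 29.3mb 9.7mb
green open toolhub_lists 6rSaAMvoQ_anrDVv2esAfA 1 2 76 5 3.7mb 1.2mb
"""
NEW_OUTPUT = """
green  open   toolhub_lists  1   2         76            1mb      3.2mb
green  open   toolhub_tools  1   2       4283         11.7mb     35.2mb
"""

old = parse_cat_indices(OLD_OUTPUT, OLD_COLUMNS)
new = parse_cat_indices(NEW_OUTPUT, NEW_COLUMNS)
diff = doc_count_diff(old, new)
```

A non-zero difference is not necessarily a problem: the rebuild reads the latest canonical data, which may have changed since the original listing was taken.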

For the benefit of the future, I built the indices using the script from T290357#7578866 with a few changes:

  • connect search index ssh tunnel to localhost:6047 on the WMF network side (I had TLS problems trying to use opensearch-toolhub-test.discovery.wmnet:30443 that I assume were related to SNI)
  • use ES_HOSTS=http://host.docker.internal:19200 (no TLS)

Then I ran poetry run ./manage.py search_index --rebuild inside the connected container. As noted in other tasks about using the maintenance rig from my local laptop, this was sloooooow. That's all about speed of light and data round trips though. It shouldn't be a problem if I initialize the "prod" cluster before switching the public deploy over.
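
To illustrate the `ES_HOSTS` override mentioned above, here is a rough sketch of how a settings module could turn such a value into connection parameters and make the "no TLS" choice explicit. It assumes the variable holds a comma-separated list of URLs; Toolhub's real settings handling may differ, and `parse_es_hosts` and `load_es_hosts` are hypothetical names.

```python
import os
from urllib.parse import urlsplit


def parse_es_hosts(raw):
    """Split a comma-separated URL list into dicts with host, port and use_tls."""
    hosts = []
    for item in raw.split(","):
        item = item.strip()
        if not item:
            continue  # tolerate trailing commas and empty entries
        parts = urlsplit(item)
        if parts.scheme not in ("http", "https") or not parts.hostname:
            raise ValueError(f"not an http(s) URL: {item!r}")
        default_port = 443 if parts.scheme == "https" else 80
        hosts.append({
            "host": parts.hostname,
            "port": parts.port or default_port,
            "use_tls": parts.scheme == "https",
        })
    return hosts


def load_es_hosts(environ=os.environ):
    """Read ES_HOSTS from the environment; the fallback value is an assumption."""
    return parse_es_hosts(environ.get("ES_HOSTS", "http://localhost:9200"))
```

With the value from the list above, `parse_es_hosts("http://host.docker.internal:19200")` yields a single entry with `use_tls` set to `False`, matching the plain-HTTP tunnel setup.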

Back to you for the second cluster setup @atsuko. :)

Change #1295901 had a related patch set uploaded (by Atsuko; author: Atsuko):

[operations/puppet@production] service: services_proxy: prod opensearch-on-k8s services

https://gerrit.wikimedia.org/r/1295901

Change #1295901 merged by Atsuko:

[operations/puppet@production] service: services_proxy: prod opensearch-on-k8s services

https://gerrit.wikimedia.org/r/1295901

Change #1296539 had a related patch set uploaded (by Atsuko; author: Atsuko):

[operations/deployment-charts@master] opensearch-cluster: anonymous access for ttmsearch and toolhub

https://gerrit.wikimedia.org/r/1296539

Change #1296539 merged by jenkins-bot:

[operations/deployment-charts@master] opensearch-cluster: anonymous access for ttmsearch and toolhub

https://gerrit.wikimedia.org/r/1296539

Change #1296608 had a related patch set uploaded (by Atsuko; author: Atsuko):

[operations/puppet@production] services_proxy: switch to prod opensearch-on-k8s services

https://gerrit.wikimedia.org/r/1296608

Configured prod instance, per-DC URLs are
https://opensearch-toolhub.svc.eqiad.wmnet:30443/
https://opensearch-toolhub.svc.codfw.wmnet:30443/

and global are
https://opensearch-toolhub.discovery.wmnet:30443/
http://localhost:6519

Updated test instance services_proxy URL, too
http://localhost:6517

I will now proceed with populating production DB.

Change #1296608 merged by Atsuko:

[operations/puppet@production] services_proxy: switch to prod opensearch-on-k8s services

https://gerrit.wikimedia.org/r/1296608

@bd808 I noticed that the opensearch-toolhub-test.svc.codfw.wmnet wasn't populated, I've done it myself using your script.

I found it easiest to connect through the services_proxy endpoint to the desired part of the cluster (eqiad or codfw) via the local deployment server (deploy1003.eqiad.wmnet or deploy2002.codfw.wmnet respectively).

I noticed that the current setup uses different ports for eqiad and codfw. Do you want me to mimic this, or is it better to use the active-active DNS endpoint?

Does that mean that the data got lost between my T426073#11961684 work and whenever you did it again?

I think that if you did a single upload to localhost:6047 forwarded from deploy1003.eqiad.wmnet, you only hit the opensearch-toolhub-test.svc.eqiad.wmnet cluster. opensearch-toolhub-test.svc.eqiad.wmnet and opensearch-toolhub-test.svc.codfw.wmnet are completely separate and, as far as I understand, the Translate extension connects to both of them for writing and to the dnsdisc endpoint for reading.

So no, it wasn't data loss; the data was just populated on a single cluster only.

Ack. My brain ignored the codfw vs eqiad difference. Bad brain.
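
For future reference, the eqiad/codfw distinction discussed above can be written down as a small model: each datacenter has its own independent cluster that must be populated separately, while reads can go through the discovery endpoint. This is only a sketch with hypothetical names (`SearchTopology`, `unpopulated`); the URLs are the test instance endpoints posted earlier in this task.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class SearchTopology:
    """Independent per-datacenter clusters plus a discovery endpoint."""
    per_dc: dict    # datacenter name -> cluster base URL
    discovery: str  # DNS discovery base URL

    def write_targets(self):
        # The clusters do not replicate to each other, so an index
        # rebuild has to be run once against every entry in this list.
        return [self.per_dc[dc] for dc in sorted(self.per_dc)]

    def read_target(self):
        # Reads can use the discovery name, which resolves to one of the clusters.
        return self.discovery

    def unpopulated(self, populated_dcs):
        """Return the datacenters whose cluster has not been populated yet."""
        return sorted(set(self.per_dc) - set(populated_dcs))


TEST_TOPOLOGY = SearchTopology(
    per_dc={
        "eqiad": "https://opensearch-toolhub-test.svc.eqiad.wmnet:30443",
        "codfw": "https://opensearch-toolhub-test.svc.codfw.wmnet:30443",
    },
    discovery="https://opensearch-toolhub-test.discovery.wmnet:30443",
)
```

After a single rebuild through a tunnel that ends at deploy1003.eqiad.wmnet, `TEST_TOPOLOGY.unpopulated(["eqiad"])` returns `["codfw"]`, which is the situation described in this thread.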