We have a variety of processes that are manually performed, such as recovering after an outage. These are documented on wikitech. Review the existing processes and write up new processes for the streaming/k8s world.
Description
Details
| Status | Subtype | Assigned | Task |
|---|---|---|---|
| Resolved | | EBernhardson | T356303 Review wikitech:Search and write processes for k8s world |
| Resolved | | Gehel | T356803 Develop recovery/reindex procedures for new Search Update Pipeline |
| Resolved | | bking | T356806 Document review/refresh for https://wikitech.wikimedia.org/wiki/Search |
| Declined | | bking | T364936 Move inline code snippets from https://wikitech.wikimedia.org/wiki/Search to a repo |
Event Timeline
First pass review of the administration processes listed on wikitech that will be changing. This started out as being only about the streaming updater, but I added a second section on outdated topics. Perhaps that should be another ticket?
SUP
- Monitoring the job queue - should mention the jobqueue dashboard and streaming updater dashboards
- Indexing - This section gives a description of what the application does. We should update it to describe how data flows through the new system
- Recovering from an Elasticsearch outage/interruption in updates - We need to design, document, and test a process for this
- In place reindex - The catchup indexing after a reindex needs the same attention as above.
- Full reindex - Again, a process needs to be worked out. This can't entirely be done from the new updater, will have to work out how this interacts with the read-only clusters in cirrus config.
- Multi-DC Operations should perhaps talk about streaming updater multi-dc operations. Also this section is really out of date.
- No updates section should talk about where to look to verify streaming updater operation
Other
- Restarting a node is outdated; it talks about using es-tool. Should be updated to reflect cookbooks?
- Pool Counter rejections could probably point to somewhere on the
- Docs mention scripts/check_indices.py from CirrusSearch; this works off the list of writable clusters and thus doesn't work in the new system. Likely the script needs updating.
Change 1003502 had a related patch set uploaded (by Ebernhardson; author: Ebernhardson):
[operations/deployment-charts@master] cirrus updater: Introduce backfill releases
Change 1003502 merged by jenkins-bot:
[operations/deployment-charts@master] cirrus updater: Introduce backfill releases
I've been reviewing our options for backfilling and trying to come up with a plan. I think the following will work:
- We define matching -backfill releases in helm for the consumers. So consumer-search will also have a consumer-search-backfill release
- Backfill releases will read the same values files as the normal release, plus an extra values-backfill.yaml which supplies the default values for kafka-source-{start,end}-time.
- Default values should be set far in the past, so when the backfill release starts up it immediately completes. Tested in staging and this works reasonably well: the taskmanager is shut down, and the jobmanager idles on and reports the job as finished via the REST api.
- In the current reindexing procedure replace the ForceSearchIndex.php invocations with:
- A helmfile invocation that submits custom start/end/wiki filters for the related -backfill release
- Something that waits around polling the -backfill REST api until /v1/jobs reports the job as finished (a rough sketch of this polling follows this list).
- This will necessitate running the reindexing process from the deployment host (and not mwmaint as we have historically done). Reindexing doesn't really use any local compute; it's all requests to other services and waiting. Should be reasonable.
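As a rough illustration of the polling step, here is a minimal sketch (not the actual script) of what it could look like, reusing the kubectl exec / REST api query shown further down. The kubeconfig path, pod name, function names, and polling interval are illustrative placeholders:

```python
import json
import os
import subprocess
import time

# Placeholders for illustration; the real kubeconfig and pod name come from
# the environment being backfilled.
KUBECONFIG = "/etc/kubernetes/cirrus-streaming-updater-deploy-staging.config"
POD = "flink-app-consumer-search-backfill-5b9f979487-dsqsb"


def job_states(pod):
    # Ask the jobmanager's REST api for /v1/jobs via kubectl exec (the same
    # query as the example further down) and collect each job's status.
    raw = subprocess.check_output(
        ["kubectl", "exec", pod, "-c", "flink-main-container", "--",
         "python3", "-c",
         'import urllib.request; print(urllib.request.urlopen('
         '"http://localhost:8081/v1/jobs").read().decode("utf8"))'],
        env={**os.environ, "KUBECONFIG": KUBECONFIG},
    )
    return [job["status"] for job in json.loads(raw)["jobs"]]


def wait_for_backfill(pod, poll_seconds=60):
    # Block until every job reported by the jobmanager is FINISHED,
    # raising if any of them ends up FAILED instead.
    while True:
        states = job_states(pod)
        if states and all(s == "FINISHED" for s in states):
            return
        if "FAILED" in states:
            raise RuntimeError("backfill job failed: %s" % states)
        time.sleep(poll_seconds)


if __name__ == "__main__":
    wait_for_backfill(POD)
```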
Caveats:
- The integration from top to bottom is a bit weak. The helmfile invocation has to be just right, but it can be done. We can kinda-sorta check that the job isn't doing anything before we submit the custom values, but there is no control against race conditions. We are probably mostly ok, though, since reindexing is run by a person and not kicked off in an automated manner. That said, the full-cluster reindexing process takes most of a week, which gives ample opportunity to accidentally start a reindex on a cluster while one is already running.
Example helmfile invocation:
helmfile -e staging \
  -i apply \
  --selector name=consumer-search-backfill \
  --set 'app.config_files.app\.config\.yaml.kafka-source-start-time=2024-02-11T01:23:45Z' \
  --set 'app.config_files.app\.config\.yaml.kafka-source-end-time=2024-02-11T03:45:12Z' \
  --set 'app.config_files.app\.config\.yaml.wikis=testwiki'
Example query of the rest api (could be nicer if we installed curl or wget, or exposed the rest api directly):
KUBECONFIG=/etc/kubernetes/cirrus-streaming-updater-deploy-staging.config kubectl \
  exec \
  flink-app-consumer-search-backfill-5b9f979487-dsqsb \
  -c flink-main-container \
  -- \
  python3 -c 'import urllib.request; print(urllib.request.urlopen("http://localhost:8081/v1/jobs").read().decode("utf8"))'
On further review, simply documenting the various commands to run seemed error-prone. The attached patch adds a python script that wraps most of the reindexing and backfill steps to ease future burden.
Change 1005635 had a related patch set uploaded (by Ebernhardson; author: Ebernhardson):
[operations/deployment-charts@master] cirrus: Add script to orchestrate reindexing
Example query of the rest api (could be nicer if we installed curl or wget, or exposed the rest api directly):
KUBECONFIG=/etc/kubernetes/cirrus-streaming-updater-deploy-staging.config kubectl \
  exec \
  flink-app-consumer-search-backfill-5b9f979487-dsqsb \
  -c flink-main-container \
  -- \
  python3 -c 'import urllib.request; print(urllib.request.urlopen("http://localhost:8081/v1/jobs").read().decode("utf8"))'
We found that this information is also updated into the flinkdeployment resource. It can be queried more simply with:
kubectl get -o json --selector release=consumer-cloudelastic-backfill flinkdeployment | jq .items[0].status.jobStatus.state
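For completeness, a minimal sketch of how a script might wait on that flinkdeployment state instead of going through the REST api; the function names and the None-for-nothing-deployed convention are illustrative assumptions, and KUBECONFIG is assumed to already point at the right cluster:

```python
import json
import subprocess
import time


def backfill_job_state(release):
    # The same query as the kubectl/jq one-liner above; returns None when
    # no flinkdeployment exists for the release.
    raw = subprocess.check_output(
        ["kubectl", "get", "-o", "json",
         "--selector", "release=%s" % release, "flinkdeployment"])
    items = json.loads(raw)["items"]
    if not items:
        return None
    return items[0]["status"]["jobStatus"]["state"]


def wait_until_finished(release, poll_seconds=60):
    # Poll the flinkdeployment until the backfill job reports FINISHED.
    while True:
        state = backfill_job_state(release)
        if state == "FINISHED":
            return
        if state in ("FAILED", "CANCELED"):
            raise RuntimeError("%s ended in state %s" % (release, state))
        time.sleep(poll_seconds)


# e.g. wait_until_finished("consumer-cloudelastic-backfill")
```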
To review the documentation changes (there are also two revisions from bking mixed in there): https://wikitech.wikimedia.org/w/index.php?title=Search&diff=2153071&oldid=2127290
> Monitoring the job queue - should mention the jobqueue dashboard and streaming updater dashboards
Updated this section to link to the JobQueue Job dashboard, and to a list of configured cirrussearch jobs. Added a new section on monitoring the streaming updater that links to the relevant grafana dashboards and a logstash dashboard that has all the application logging.
> Indexing - This section gives a description of what the application does. We should update it to describe how data flows through the new system
Wrote two sections here, one on streaming updater and a shorter one on streaming updater backfilling. These sections give a high level overview of what they are and how they work, with hopefully enough information to enable a curious reader to find relevant implementations.
Added a note at the top of sections for streaming and classic updater noting that both exist and are in transition.
> Recovering from an Elasticsearch outage/interruption in updates - We need to design, document, and test a process for this
> In place reindex - The catchup indexing after a reindex needs the same attention as above.
> Full reindex - Again, a process needs to be worked out. This can't entirely be done from the new updater, will have to work out how this interacts with the read-only clusters in cirrus config.
This is in progress; see the patch above. Documentation for backfilling and reindexing with the script introduced by that patch has been added, but is not yet fully complete. It's a bit awkward to keep both the old and new documentation there without a clear process for delineating which applies when.
> Multi-DC Operations should perhaps talk about streaming updater multi-dc operations. Also this section is really out of date.
Rewrote the whole section; it wasn't clear what point the old text was trying to get across. I've used it to give a high-level description of how multi-dc / multi-cluster is implemented.
> No updates section should talk about where to look to verify streaming updater operation
Added mention of streaming updater. Added a few notes for where to look to see if various bits are acting up.
Yesterday on IRC the question was raised:
> This is probably the wrong way around, but I have a python script that uses helmfile apply --set ... to deploy a special backfilling release that is not part of the normal release process. This release runs to completion, but the related custom operator (flink) only understands things that run forever, so my python script also does a helm destroy to clean up afterwards.
> I guess my question is: is there a reasonable way to ensure I'm deleting the thing I think I'm deleting? I was considering perhaps adjusting the chart so I can provide a backfill_id label with --set and then use that id in a selector when destroying.
From what I understood (and please correct me if I'm wrong! :)) the process is as follows:
- You deploy a separate helmfile release "...-backfill" that creates a separate FlinkDeployment which launches a job that runs to completion (may take a long time, though)
- The jobmanager Pod then keeps lingering around (blocking resources, 500m CPU, 100Mi Memory) because the flink-operator configures SHUTDOWN_ON_APPLICATION_FINISH=false in any case, for internal reasons
- You destroy the helmfile release to clean up the jobmanager (by removing the FlinkDeployment object)
One question that comes to mind immediately, and I might be completely off here: Isn't this what a Flink session cluster is for? Having just one Jobmanager that controls multiple Jobs (e.g. the generic one plus backfill) that can be submitted at runtime?
Anyway, I'm not 100% sure what you meant by "is there a reasonable way to ensure I'm deleting the thing I think I'm deleting?". In general a release name is unique per namespace, so if you created consumer-search-backfill and you destroy consumer-search-backfill afterwards, that will be exactly the thing.
If you require support for running multiple backfill jobs in parallel, you will have to use a different release name for each of them. @RLazarus invented something similar for mwscript_k8s in https://gerrit.wikimedia.org/r/c/operations/deployment-charts/+/988850/7/helmfile.d/services/mw-script/helmfile.yaml#47 (basically reading the release name from an environment variable). With that your release name could include the backfill_id you mentioned.
A Flink session cluster can do that, although I suppose I was trying to keep all the flink things managed in the same way. I thought it would be operationally more complex if we were deploying flink in two different ways.
> Anyway, I'm not 100% sure what you meant by "is there a reasonable way to ensure I'm deleting the thing I think I'm deleting?". In general a release name is unique per namespace, so if you created consumer-search-backfill and you destroy consumer-search-backfill afterwards, that will be exactly the thing.
> If you require support for running multiple backfill jobs in parallel, you will have to use a different release name for each of them. @RLazarus invented something similar for mwscript_k8s in https://gerrit.wikimedia.org/r/c/operations/deployment-charts/+/988850/7/helmfile.d/services/mw-script/helmfile.yaml#47 (basically reading the release name from an environment variable). With that your release name could include the backfill_id you mentioned.
I suppose I'm mixing up a few requirements here; what led to that thought was:
- When the script starts up, the -backfill release has a current state. It should generally be unreleased, but it could be in a running, finished, or failed state. If the backfill script starts up and the -backfill release is not in an unreleased state, can it proceed and replace the existing release?
- It could try and progress forward if the state is finished or failed. In both those cases the release is in a completed state and will not do anything further.
- If the release is in a finished or failed state, something was supposed to clean that up. If we progress forward and replace the release there is a possibility that whatever was going to cleanup the old release is now going to clean up the new one.
- For now I print some instructions on how to manually monitor and clean up the -backfill release if it's already deployed, and then bail (roughly the check sketched after this list).
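A rough sketch of that startup check, under the assumption that "already deployed" can be detected via the flinkdeployment resource mentioned earlier; the function names and printed messages are illustrative only, not the real script's output:

```python
import json
import subprocess
import sys


def deployed_job_state(release):
    # Same flinkdeployment query as in the earlier comments; None means
    # nothing is currently deployed under that release.
    raw = subprocess.check_output(
        ["kubectl", "get", "-o", "json",
         "--selector", "release=%s" % release, "flinkdeployment"])
    items = json.loads(raw)["items"]
    return items[0]["status"]["jobStatus"]["state"] if items else None


def preflight(release):
    # Only proceed when the -backfill release is not deployed at all;
    # otherwise print instructions for manual cleanup and bail, since
    # replacing an existing release risks racing whatever (or whoever)
    # was going to clean it up.
    state = deployed_job_state(release)
    if state is None:
        return
    print("%s is already deployed (job state: %s); not replacing it." % (release, state))
    print("Inspect it with: kubectl get flinkdeployment --selector release=%s" % release)
    print("Clean it up with: helmfile -i destroy --selector name=%s" % release)
    sys.exit(1)
```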
Related unstated constraints:
- One use case for this process is to run in a for loop over 1000 wikis across most of a week. Having it print instructions for manual intervention on wiki 314 is not particularly helpful; in that case all wikis would fail with the same error, so wiki 314 and every wiki after it print the same instructions. It is manageable at least. Historically we write the output of all of this to one log per wiki, grep the logs for error messages, and repeat the process on the smaller list of wikis that errored the previous time (roughly the loop sketched after this list). There will be other errors anyway.
- Backfilling can (depending on the selected wikis / time range) generate fairly significant load on both the mediawiki application servers and the elasticsearch servers. We generally like the constraint, provided by having a single -backfill release per cluster, that only a single backfill can be running at a time per search cluster. I initially discarded the idea of per-wiki or otherwise one-off release names to maintain the invariant that we won't accidentally schedule many backfills in parallel. There are probably other ways to provide this, though.
- The backfill operation itself can backfill many (or all) wikis in parallel; the unique part of these backfills is really the start/end timestamps.
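For illustration only, the kind of driver loop described above (one log per wiki, collect the failures, rerun on the smaller list); the script name and layout are made up:

```python
import subprocess
from pathlib import Path


def run_backfills(wikis, log_dir="logs"):
    # Illustrative driver loop only: run some per-wiki reindex/backfill
    # command ("cirrus_reindex.py" is a made-up name), keep one log per
    # wiki, and return the wikis that failed so the run can be repeated
    # on just that smaller list.
    Path(log_dir).mkdir(exist_ok=True)
    failed = []
    for wiki in wikis:
        with open("%s/%s.log" % (log_dir, wiki), "w") as log:
            result = subprocess.run(
                ["./cirrus_reindex.py", "--wiki", wiki],
                stdout=log, stderr=subprocess.STDOUT)
        if result.returncode != 0:
            failed.append(wiki)
    return failed  # e.g. rerun run_backfills(failed) after investigating
```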
Now of course, the apply/destroy interaction between two scripts is very edge-case-y, but I couldn't help pondering it while thinking about where this script might run into errors. We can probably not handle it and everything will be fine.
The idea about using a session cluster is interesting. I don't think we want to use a session cluster for the normal job, but perhaps backfilling can run in a session cluster which has a clearer cleanup. I had previously not seriously considered the session cluster as I wasn't too interested in having the applications run in different ways. In my experience running things in the same ways helps them have the same errors, instead of unique problems per method. I'm not super worried about cleanup, we have to wait for the backfill to finish either way. Invoking helmfile destroy isn't too involved. But I suspect this will be a source of tedium in the future. It would be nice to have a better solution and it might be the session cluster.
Change 1005635 merged by jenkins-bot:
[operations/deployment-charts@master] cirrus: Add script to orchestrate reindexing
I'm not very familiar with running Flink in general, so I really can't speak to that; "we want to run N related things" just sounded to me like what the session-cluster idea is for.
I'm not exactly sure now if there are any open questions regarding the script that started the conversation. :)