
Fix slow super_detect_noop code and monitor for future Elastic hangs
Closed, Resolved (Public)

Description

We had a significant jump in search thread pool counter rejections today.

We tracked it down to the super_detect_noop code. We're already capturing "increase(elasticsearch_indices_indexing_index_total{exported_cluster="production-search-eqiad", index=""}[30m]) < 500" in Prometheus; we could potentially use this to send alerts.

Creating this ticket to discuss this and other improvements: detecting/alerting on ES deadlocks, adding proper docs, etc.
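
As a concrete illustration of that check, here is a minimal sketch (not the deployed alert) of the condition the expression encodes: compare two scrapes of each node's index_total counter taken 30 minutes apart and flag any node that advanced by fewer than 500 documents. The class, node names, and values below are hypothetical, and the real increase() function also handles counter resets, which this ignores.

// IndexingStallCheck.java: illustrative only, mirrors the
// "increase(...[30m]) < 500" alert condition from the description.
import java.util.HashMap;
import java.util.Map;

public class IndexingStallCheck {

    // Returns each node whose index_total counter advanced by less than
    // minIncrease between the two scrapes, mapped to the observed increase.
    static Map<String, Long> stalledNodes(Map<String, Long> before,
                                          Map<String, Long> after,
                                          long minIncrease) {
        Map<String, Long> stalled = new HashMap<>();
        for (Map.Entry<String, Long> entry : after.entrySet()) {
            long increase = entry.getValue() - before.getOrDefault(entry.getKey(), 0L);
            if (increase < minIncrease) {
                stalled.put(entry.getKey(), increase);
            }
        }
        return stalled;
    }

    public static void main(String[] args) {
        // Hypothetical index_total values scraped 30 minutes apart.
        Map<String, Long> before = Map.of("elastic1071", 1_000_000L, "elastic1072", 2_000_000L);
        Map<String, Long> after = Map.of("elastic1071", 1_000_120L, "elastic1072", 2_450_000L);
        // Prints {elastic1071=120}: that node indexed fewer than 500 docs in the window.
        System.out.println(stalledNodes(before, after, 500L));
    }
}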

Event Timeline

Order of tracking down the deadlock was:

  1. Cirrus error rate increased
  2. One node had its write queue building in https://search.svc.eqiad.wmnet:9243/_cat/thread_pool/write?v
  3. That node's index_total value in https://search.svc.eqiad.wmnet:9243/_nodes/elastic1071-production-search-eqiad/stats/indices/indexing?pretty wasn't increasing
  4. Pulled a jstack of the write threads with: sudo nsenter -t 68132 -m sudo -u elasticsearch jstack 68132 | grep -A 7 '\[write]'
  5. All threads reported by jstack showed:
"elasticsearch[elastic1072-production-search-eqiad][write][T#1]" #420 daemon prio=5 os_prio=0 tid=0x00007fa7c401a800 nid=0x110f8 runnable [0x00007f99dd459000]
   java.lang.Thread.State: RUNNABLE
        at java.util.ArrayList.indexOf(ArrayList.java:323)
        at java.util.ArrayList.contains(ArrayList.java:306)
        at java.util.AbstractCollection.containsAll(AbstractCollection.java:318)
        at org.wikimedia.search.extra.superdetectnoop.MultiListHandler$MultiList.equalsIgnoreOrder(MultiListHandler.java:98)
        at org.wikimedia.search.extra.superdetectnoop.MultiListHandler$MultiList.replaceFrom(MultiListHandler.java:76)
        at org.wikimedia.search.extra.superdetectnoop.MultiListHandler.handle(MultiListHandler.java:31)
  6. This code includes the comment "Expecting n <= 10, a fancier comparison doesn't seem justified".

Change 818214 had a related patch set uploaded (by Ryan Kemper; author: Ryan Kemper):

[operations/alerts@master] elastic: alert on per-node indexing not occurring

https://gerrit.wikimedia.org/r/818214

RKemper changed the task status from Open to In Progress. Jul 28 2022, 7:57 PM
RKemper claimed this task.
RKemper triaged this task as High priority.

Here's where the relevant comparison code from the stack trace above lives (before we patched it): https://github.com/wikimedia/search-extra/blob/d5ccb77c5bbdef181ee4839c343750f95b2ecd5f/extra/src/main/java/org/wikimedia/search/extra/superdetectnoop/MultiListHandler.java#L97-L98
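
For context on why those write threads were spinning: ArrayList.containsAll does a linear scan of the list for every element of its argument, so checking containment in both directions is O(n*m); fine at the expected n <= 10, but painful for very large lists. Below is a hedged sketch of the slow shape and a set-based alternative; the class and method names are illustrative, not the actual patch.

// UnorderedListCompare.java: illustrative sketch, not the search-extra patch.
import java.util.HashSet;
import java.util.List;

public class UnorderedListCompare {

    // Quadratic: ArrayList.containsAll scans the whole list once per element
    // of its argument, which is the loop the jstack trace above is stuck in.
    static boolean equalsIgnoreOrderSlow(List<String> a, List<String> b) {
        return a.size() == b.size() && a.containsAll(b) && b.containsAll(a);
    }

    // Linear: building HashSets up front turns the per-element scans into
    // O(1) hash lookups, at the cost of a little extra allocation.
    static boolean equalsIgnoreOrderFast(List<String> a, List<String> b) {
        return a.size() == b.size() && new HashSet<>(a).equals(new HashSet<>(b));
    }
}

Both versions treat the values as unordered collections of equal length and agree on the result; only the cost changes.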

bking renamed this task from Improve deadlock monitoring for Elastic to Fix slow super_detect_noop code and monitor for future Elastic hangs. Jul 28 2022, 11:24 PM

The new elastic plugin is deployed to apt1001; we will update the plugin and restart the elastic hosts tomorrow.

Change 818507 had a related patch set uploaded (by Ryan Kemper; author: Ryan Kemper):

[operations/software/elasticsearch/plugins@master] 6.8.23-wmf2 search-extra for bullseye

https://gerrit.wikimedia.org/r/818507

Created wmf-elasticsearch-search-plugins_6.8.23-5 for bullseye and built/uploaded like so:

# Starting from plugins repo
# (1) Build locally and scp over to build host
./debian/rules prepare_build
cd ..
ssh 'build2001.codfw.wmnet' 'sudo rm -rfv ~/plugins'

scp -r plugins/ build2001.codfw.wmnet:~

ssh build2001.codfw.wmnet
cd plugins
DIST=bullseye-wikimedia pdebuild
exit

# (2) Transfer from build host to repo (apt) host, using local machine as intermediary
rm -rfv ~/wmf/elastic-plugins && mkdir -p ~/wmf/elastic-plugins
scp 'build2001.codfw.wmnet:/var/cache/pbuilder/result/bullseye-amd64/wmf-elasticsearch-search-plugins_6.8.23-5*' ~/wmf/elastic-plugins
ssh 'apt1001.wikimedia.org' 'sudo rm -rfv ~/Elastic_Plugins' && scp -r ~/wmf/elastic-plugins 'apt1001.wikimedia.org:~/Elastic_Plugins'

# (3) Upload newly built package
ssh apt1001.wikimedia.org

GNUPGHOME=/root/.gnupg
REPREPRO_BASE_DIR=/srv/wikimedia
export GNUPGHOME
export REPREPRO_BASE_DIR


# replace `main` with `experimental` to deploy to experimental repo instead
# Check https://apt.wikimedia.org/wikimedia/pool/component/ to see if elastic68 is still the latest
# See https://wikitech.wikimedia.org/wiki/Reprepro#Building_an_unmodified_third-party_package_for_import for some general notes on the reprepro process
sudo -E reprepro -C component/elastic68 include bullseye-wikimedia /home/ryankemper/Elastic_Plugins/wmf-elasticsearch-search-plugins_6.8.23-5_amd64.changes && sudo rm -rfv ~/Elastic_Plugins

exit
rm -rfv ~/wmf/elastic-plugins && ssh 'build2001.codfw.wmnet' 'sudo rm -rfv ~/plugins' && echo all done!

Mentioned in SAL (#wikimedia-operations) [2022-08-01T17:18:30Z] <ryankemper> T289135 T314078 Manually reimaging remaining codfw stretch hosts (elastic[2025,2031,2054,2059-2060]) to bullseye, one host at a time, waiting for green cluster status to return between each run. ryankemper@cumin1001 tmux session codfw_reimage

Mentioned in SAL (#wikimedia-operations) [2022-08-02T18:46:26Z] <bking@cumin1001> START - Cookbook sre.elasticsearch.rolling-operation Operation.UPGRADE (3 nodes at a time) for ElasticSearch cluster search_eqiad: deploy wmf-elasticsearch-search-plugins pkg - bking@cumin1001 - T314078

Mentioned in SAL (#wikimedia-operations) [2022-08-02T21:27:34Z] <bking@cumin1001> END (PASS) - Cookbook sre.elasticsearch.rolling-operation (exit_code=0) Operation.UPGRADE (3 nodes at a time) for ElasticSearch cluster search_eqiad: deploy wmf-elasticsearch-search-plugins pkg - bking@cumin1001 - T314078

Mentioned in SAL (#wikimedia-operations) [2022-08-03T19:38:52Z] <ryankemper@cumin1001> START - Cookbook sre.elasticsearch.rolling-operation Operation.UPGRADE (3 nodes at a time) for ElasticSearch cluster search_codfw: codfw cluster plugin upgrade - ryankemper@cumin1001 - T314078

Mentioned in SAL (#wikimedia-operations) [2022-08-03T19:39:41Z] <ryankemper> T314078 Rolling upgrade of codfw hosts; after this all of eqiad/codfw will have the new plugin version and we can resume the search-loader instances: sudo -E cookbook sre.elasticsearch.rolling-operation search_codfw "codfw cluster plugin upgrade" --upgrade --nodes-per-run 3 --start-datetime 2022-08-03T19:38:10 --task-id T314078

Mentioned in SAL (#wikimedia-operations) [2022-08-03T19:40:41Z] <ryankemper> T314078 Forgot to mention, restart is at ryankemper@cumin1001 tmux session codfw_restarts

Mentioned in SAL (#wikimedia-operations) [2022-08-03T21:37:40Z] <ryankemper@cumin1001> END (PASS) - Cookbook sre.elasticsearch.rolling-operation (exit_code=0) Operation.UPGRADE (3 nodes at a time) for ElasticSearch cluster search_codfw: codfw cluster plugin upgrade - ryankemper@cumin1001 - T314078

Both search-loader instances are re-enabled now; eqiad has been running since yesterday. They are still processing the backlog of updates that were generated while they were paused, including the weekly update. Eqiad should hopefully finish its backlog by tomorrow; codfw might take two days.

Change 818507 abandoned by Ryan Kemper:

[operations/software/elasticsearch/plugins@master] 6.8.23-wmf2 search-extra for bullseye

Reason:

not needed anymore

https://gerrit.wikimedia.org/r/818507

Change 818214 merged by jenkins-bot:

[operations/alerts@master] elastic: alert on per-node indexing not occurring

https://gerrit.wikimedia.org/r/818214