
Fix slow super_detect_noop code and monitor for future Elastic hangs
Closed, Resolved (Public)

Description

We had a significant jump in search thread pool counter rejections today.

We tracked it down to the super_detect_noop code. We're already capturing "increase(elasticsearch_indices_indexing_index_total{exported_cluster="production-search-eqiad", index=""}[30m]) < 500" in Prometheus; we could potentially use this to send alerts.

Creating this ticket to discuss this and other improvements: detecting/alerting on ES deadlocks, adding proper docs, etc.
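
As a concrete illustration of that check, here is a minimal sketch (not the deployed alert) of the condition the expression encodes: compare two scrapes of each node's index_total counter taken 30 minutes apart and flag any node that advanced by fewer than 500 documents. The class, node names, and values below are hypothetical, and the real increase() function also handles counter resets, which this ignores.

// IndexingStallCheck.java: illustrative only, mirrors the
// "increase(...[30m]) < 500" alert condition from the description.
import java.util.HashMap;
import java.util.Map;

public class IndexingStallCheck {

    // Returns each node whose index_total counter advanced by less than
    // minIncrease between the two scrapes, mapped to the observed increase.
    static Map<String, Long> stalledNodes(Map<String, Long> before,
                                          Map<String, Long> after,
                                          long minIncrease) {
        Map<String, Long> stalled = new HashMap<>();
        for (Map.Entry<String, Long> entry : after.entrySet()) {
            long increase = entry.getValue() - before.getOrDefault(entry.getKey(), 0L);
            if (increase < minIncrease) {
                stalled.put(entry.getKey(), increase);
            }
        }
        return stalled;
    }

    public static void main(String[] args) {
        // Hypothetical index_total values scraped 30 minutes apart.
        Map<String, Long> before = Map.of("elastic1071", 1_000_000L, "elastic1072", 2_000_000L);
        Map<String, Long> after = Map.of("elastic1071", 1_000_120L, "elastic1072", 2_450_000L);
        // Prints {elastic1071=120}: that node indexed fewer than 500 docs in the window.
        System.out.println(stalledNodes(before, after, 500L));
    }
}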

Event Timeline

Order of tracking down the deadlock was:

  1. Cirrus error rate increased
  2. One node had its write queue building in https://search.svc.eqiad.wmnet:9243/_cat/thread_pool/write?v
  3. That node's index_total value in https://search.svc.eqiad.wmnet:9243/_nodes/elastic1071-production-search-eqiad/stats/indices/indexing?pretty wasn't increasing
  4. Pulled a jstack of the write threads with: sudo nsenter -t 68132 -m sudo -u elasticsearch jstack 68132 | grep -A 7 '\[write]'
  5. All threads reported by jstack showed:
"elasticsearch[elastic1072-production-search-eqiad][write][T#1]" #420 daemon prio=5 os_prio=0 tid=0x00007fa7c401a800 nid=0x110f8 runnable [0x00007f99dd459000]
   java.lang.Thread.State: RUNNABLE
        at java.util.ArrayList.indexOf(ArrayList.java:323)
        at java.util.ArrayList.contains(ArrayList.java:306)
        at java.util.AbstractCollection.containsAll(AbstractCollection.java:318)
        at org.wikimedia.search.extra.superdetectnoop.MultiListHandler$MultiList.equalsIgnoreOrder(MultiListHandler.java:98)
        at org.wikimedia.search.extra.superdetectnoop.MultiListHandler$MultiList.replaceFrom(MultiListHandler.java:76)
        at org.wikimedia.search.extra.superdetectnoop.MultiListHandler.handle(MultiListHandler.java:31)
  6. This code includes the comment "Expecting n <= 10, a fancier comparison doesn't seem justified".

Change 818214 had a related patch set uploaded (by Ryan Kemper; author: Ryan Kemper):

[operations/alerts@master] elastic: alert on per-node indexing not occurring

https://gerrit.wikimedia.org/r/818214

RKemper changed the task status from Open to In Progress. Jul 28 2022, 7:57 PM
RKemper claimed this task.
RKemper triaged this task as High priority.

Here's where the relevant comparison code from the stack trace above lives (before we patched it): https://github.com/wikimedia/search-extra/blob/d5ccb77c5bbdef181ee4839c343750f95b2ecd5f/extra/src/main/java/org/wikimedia/search/extra/superdetectnoop/MultiListHandler.java#L97-L98
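
For context on why those write threads were spinning: ArrayList.containsAll does a linear scan of the list for every element of its argument, so checking containment in both directions is O(n*m); fine at the expected n <= 10, but painful for very large lists. Below is a hedged sketch of the slow shape and a set-based alternative; the class and method names are illustrative, not the actual patch.

// UnorderedListCompare.java: illustrative sketch, not the search-extra patch.
import java.util.HashSet;
import java.util.List;

public class UnorderedListCompare {

    // Quadratic: ArrayList.containsAll scans the whole list once per element
    // of its argument, which is the loop the jstack trace above is stuck in.
    static boolean equalsIgnoreOrderSlow(List<String> a, List<String> b) {
        return a.size() == b.size() && a.containsAll(b) && b.containsAll(a);
    }

    // Linear: building HashSets up front turns the per-element scans into
    // O(1) hash lookups, at the cost of a little extra allocation.
    static boolean equalsIgnoreOrderFast(List<String> a, List<String> b) {
        return a.size() == b.size() && new HashSet<>(a).equals(new HashSet<>(b));
    }
}

Both versions treat the values as unordered collections of equal length and agree on the result; only the cost changes.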

bking renamed this task from Improve deadlock monitoring for Elastic to Fix slow super_detect_noop code and monitor for future Elastic hangs. Jul 28 2022, 11:24 PM

The new elastic plugin is deployed to apt1001; we will update the plugin and restart the elastic hosts tomorrow.

Change 818507 had a related patch set uploaded (by Ryan Kemper; author: Ryan Kemper):

[operations/software/elasticsearch/plugins@master] 6.8.23-wmf2 search-extra for bullseye

https://gerrit.wikimedia.org/r/818507

Created wmf-elasticsearch-search-plugins_6.8.23-5 for bullseye and built/uploaded like so:

# Starting from plugins repo
# (1) Build locally and scp over to build host
./debian/rules prepare_build
cd ..
ssh 'build2001.codfw.wmnet' 'sudo rm -rfv ~/plugins'

scp -r plugins/ build2001.codfw.wmnet:~

ssh build2001.codfw.wmnet
cd plugins
DIST=bullseye-wikimedia pdebuild
exit

# (2) Transfer from build host to repo (apt) host, using local machine as intermediary
rm -rfv ~/wmf/elastic-plugins && mkdir -p ~/wmf/elastic-plugins
scp 'build2001.codfw.wmnet:/var/cache/pbuilder/result/bullseye-amd64/wmf-elasticsearch-search-plugins_6.8.23-5*' ~/wmf/elastic-plugins
ssh 'apt1001.wikimedia.org' 'sudo rm -rfv ~/Elastic_Plugins' && scp -r ~/wmf/elastic-plugins 'apt1001.wikimedia.org:~/Elastic_Plugins'

# (3) Upload newly built package
ssh apt1001.wikimedia.org

GNUPGHOME=/root/.gnupg
REPREPRO_BASE_DIR=/srv/wikimedia
export GNUPGHOME
export REPREPRO_BASE_DIR


# replace `main` with `experimental` to deploy to experimental repo instead
# Check https://apt.wikimedia.org/wikimedia/pool/component/ to see if elastic68 is still the latest
# See https://wikitech.wikimedia.org/wiki/Reprepro#Building_an_unmodified_third-party_package_for_import for some general notes on the reprepro process
sudo -E reprepro -C component/elastic68 include bullseye-wikimedia /home/ryankemper/Elastic_Plugins/wmf-elasticsearch-search-plugins_6.8.23-5_amd64.changes && sudo rm -rfv ~/Elastic_Plugins

exit
rm -rfv ~/wmf/elastic-plugins && ssh 'build2001.codfw.wmnet' 'sudo rm -rfv ~/plugins' && echo all done!

Mentioned in SAL (#wikimedia-operations) [2022-08-01T17:18:30Z] <ryankemper> T289135 T314078 Manually reimaging remaining codfw stretch hosts (elastic[2025,2031,2054,2059-2060]) to bullseye, one host at a time, waiting for green cluster status to return between each run. ryankemper@cumin1001 tmux session codfw_reimage

Mentioned in SAL (#wikimedia-operations) [2022-08-02T18:46:26Z] <bking@cumin1001> START - Cookbook sre.elasticsearch.rolling-operation Operation.UPGRADE (3 nodes at a time) for ElasticSearch cluster search_eqiad: deploy wmf-elasticsearch-search-plugins pkg - bking@cumin1001 - T314078

Mentioned in SAL (#wikimedia-operations) [2022-08-02T21:27:34Z] <bking@cumin1001> END (PASS) - Cookbook sre.elasticsearch.rolling-operation (exit_code=0) Operation.UPGRADE (3 nodes at a time) for ElasticSearch cluster search_eqiad: deploy wmf-elasticsearch-search-plugins pkg - bking@cumin1001 - T314078

Mentioned in SAL (#wikimedia-operations) [2022-08-03T19:38:52Z] <ryankemper@cumin1001> START - Cookbook sre.elasticsearch.rolling-operation Operation.UPGRADE (3 nodes at a time) for ElasticSearch cluster search_codfw: codfw cluster plugin upgrade - ryankemper@cumin1001 - T314078

Mentioned in SAL (#wikimedia-operations) [2022-08-03T19:39:41Z] <ryankemper> T314078 Rolling upgrade of codfw hosts; after this all of eqiad/codfw will have the new plugin version and we can resume the search-loader instances: sudo -E cookbook sre.elasticsearch.rolling-operation search_codfw "codfw cluster plugin upgrade" --upgrade --nodes-per-run 3 --start-datetime 2022-08-03T19:38:10 --task-id T314078

Mentioned in SAL (#wikimedia-operations) [2022-08-03T19:40:41Z] <ryankemper> T314078 Forgot to mention, restart is at ryankemper@cumin1001 tmux session codfw_restarts

Mentioned in SAL (#wikimedia-operations) [2022-08-03T21:37:40Z] <ryankemper@cumin1001> END (PASS) - Cookbook sre.elasticsearch.rolling-operation (exit_code=0) Operation.UPGRADE (3 nodes at a time) for ElasticSearch cluster search_codfw: codfw cluster plugin upgrade - ryankemper@cumin1001 - T314078

Both search-loader instances are re-enabled now; eqiad has been running since yesterday. They are still processing the backlog of updates that were generated while they were paused, including the weekly update. Eqiad should hopefully finish its backlog by tomorrow; codfw might take two days.

Change 818507 abandoned by Ryan Kemper:

[operations/software/elasticsearch/plugins@master] 6.8.23-wmf2 search-extra for bullseye

Reason:

not needed anymore

https://gerrit.wikimedia.org/r/818507

Change 818214 merged by jenkins-bot:

[operations/alerts@master] elastic: alert on per-node indexing not occurring

https://gerrit.wikimedia.org/r/818214