
Deploy new version of Extra Plugin (with Khmer filter) to Elasticsearch cluster
Closed, Resolved · Public · 2 Estimated Story Points

Description

Once the new extra plugin with extra-analysis-khmer is available on Maven Central, we need to deploy it to the Elasticsearch cluster and restart the machines with it enabled, so we can reindex the Khmer-language wikis (see subtask).

(This is important but not urgent, so if it makes sense to wait a bit and bundle it with other tasks that require restarting the cluster, that'd be fine.)

  • All elastic* hosts have completed a rolling upgrade from wmf-elasticsearch-search-plugins/stretch-wikimedia 6.5.4-4~stretch to wmf-elasticsearch-search-plugins/stretch-wikimedia 6.5.4-6~stretch (a quick spot-check sketch follows the list)
    • relforge
    • codfw
    • eqiad
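
For reference, the installed plugin package version on an individual host can be spot-checked with something like the following (a sketch; package name as above):

apt-cache policy wmf-elasticsearch-search-plugins   # installed vs. candidate version per apt
dpkg -l wmf-elasticsearch-search-plugins            # just the installed version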

Event Timeline

TJones updated the task description.
Gehel set the point value for this task to 2. (Feb 15 2021, 4:24 PM)
RKemper renamed this task from "Deploy new version of Extra Pugin (with Khmer filter) to Elasticsearch cluster" to "Deploy new version of Extra Plugin (with Khmer filter) to Elasticsearch cluster". (Feb 19 2021, 8:52 AM)

Mentioned in SAL (#wikimedia-operations) [2021-02-24T23:17:44Z] <ryankemper> T274204 Beginning rolling-upgrade of eqiad CirrusSearch cluster to upgrade to wmf-elasticsearch-search-plugins/stretch-wikimedia 6.5.4-5~stretch, see tmux session elastic_rolling_upgrade on ryankemper@cumin1001

Mentioned in SAL (#wikimedia-operations) [2021-02-24T23:18:40Z] <ryankemper> T274204 sudo -i cookbook sre.elasticsearch.rolling-upgrade search_eqiad "eqiad cluster restarts" --task-id T274204 --nodes-per-run 3

Mentioned in SAL (#wikimedia-operations) [2021-02-25T00:05:35Z] <ryankemper> T274204 Ctrl+C'd out of the current rolling-upgrade; the 3 hosts that have their elasticsearch systemd units in a failing state are running the latest plugin version, meaning the new version is likely the cause of the failures

Mentioned in SAL (#wikimedia-operations) [2021-02-25T00:29:12Z] <ryankemper> T274204 Restored service health on elastic106[0,4,5] via sudo apt-get remove --purge wmf-elasticsearch-search-plugins --yes && sudo dpkg -i /var/cache/apt/archives/wmf-elasticsearch-search-plugins_6.5.4-4~stretch_all.deb && sudo puppet agent -tv. There's some sort of issue with 6.5.4-5~stretch that we will need to circle back and investigate; for now the fleet is staying on 6.5.4-4~stretch

I started doing restarts in eqiad, but hit a show-stopper: any node with the new plugin version had its elasticsearch systemd units stuck in a failure state that persisted across restarts. The most suspicious log-line by far is java.nio.file.AccessDeniedException: /var/run/elasticsearch:

ryankemper@elastic1065:~$ sudo journalctl -u elasticsearch_6@production-search-eqiad.service
...
Feb 24 23:20:29 elastic1065 elasticsearch[113877]: Exception in thread "main" org.elasticsearch.bootstrap.BootstrapException: java.nio.file.AccessDeniedException: /var/run/elasticsearch
Feb 24 23:20:29 elastic1065 elasticsearch[113877]: Likely root cause: java.nio.file.AccessDeniedException: /var/run/elasticsearch
Feb 24 23:20:29 elastic1065 elasticsearch[113877]:         at sun.nio.fs.UnixException.translateToIOException(UnixException.java:84)
Feb 24 23:20:31 elastic1065 systemd[1]: elasticsearch_6@production-search-eqiad.service: Main process exited, code=exited, status=1/FAILURE
Feb 24 23:20:31 elastic1065 systemd[1]: elasticsearch_6@production-search-eqiad.service: Unit entered failed state.
Feb 24 23:20:31 elastic1065 systemd[1]: elasticsearch_6@production-search-eqiad.service: Failed with result 'exit-code'.
Feb 24 23:40:26 elastic1065 systemd[1]: Started Elasticsearch (cluster production-search-eqiad).
Feb 24 23:40:43 elastic1065 elasticsearch[116212]: Exception in thread "main" org.elasticsearch.bootstrap.BootstrapException: java.nio.file.AccessDeniedException: /var/run/elasticsearch
Feb 24 23:40:43 elastic1065 elasticsearch[116212]: Likely root cause: java.nio.file.AccessDeniedException: /var/run/elasticsearch
Feb 24 23:40:43 elastic1065 elasticsearch[116212]:         at sun.nio.fs.UnixException.translateToIOException(UnixException.java:84)
Feb 24 23:40:45 elastic1065 systemd[1]: elasticsearch_6@production-search-eqiad.service: Main process exited, code=exited, status=1/FAILURE
Feb 24 23:40:45 elastic1065 systemd[1]: elasticsearch_6@production-search-eqiad.service: Unit entered failed state.
Feb 24 23:40:45 elastic1065 systemd[1]: elasticsearch_6@production-search-eqiad.service: Failed with result 'exit-code'.

@RKemper, I meant to bring this up in today's meeting, but it slipped my mind. Anything I can do to help?

AccessDeniedException: /var/run/elasticsearch looks like some kind of permissions problem, but I have no idea, and that doesn't quite make sense, unless the plugin itself ships files with bad permissions, maybe?
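
For what it's worth, the obvious first checks on an affected host would be something like this (hypothetical commands, not something I've run on these machines):

ls -ld /var/run/elasticsearch                   # who owns the runtime directory, and can the elasticsearch user write to it?
namei -l /var/run/elasticsearch                 # walk the whole path in case a parent directory is the problem
ls -lR /usr/share/elasticsearch/plugins/extra/  # ownership/permissions of the installed plugin files themselves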

Anyway, if you come across anything that you need me to fix in the plugin code, or anything else I might be able to help with, just let me know. (BTW, thanks! I appreciate you jumping on this last week!)

@TJones Thanks, I'll tap in David or Zbyszko to see if they can find the error.

Some context for our hypothetical investigators:

Here's an example of what the (systemd-level) logs look like:

Feb 26 07:53:12 elastic2045 systemd[1]: Started Elasticsearch (cluster production-search-codfw).
Feb 26 07:53:27 elastic2045 elasticsearch[24034]: Exception in thread "main" org.elasticsearch.bootstrap.BootstrapException: java.nio.file.AccessDeniedException: /var/run/elasticsearch
Feb 26 07:53:27 elastic2045 elasticsearch[24034]: Likely root cause: java.nio.file.AccessDeniedException: /var/run/elasticsearch
Feb 26 07:53:27 elastic2045 elasticsearch[24034]:         at sun.nio.fs.UnixException.translateToIOException(UnixException.java:84)
Feb 26 07:53:27 elastic2045 elasticsearch[24034]:         at sun.nio.fs.UnixException.rethrowAsIOException(UnixException.java:102)
Feb 26 07:53:27 elastic2045 elasticsearch[24034]:         at sun.nio.fs.UnixException.rethrowAsIOException(UnixException.java:107)
Feb 26 07:53:27 elastic2045 elasticsearch[24034]:         at sun.nio.fs.UnixFileSystemProvider.createDirectory(UnixFileSystemProvider.java:384)
Feb 26 07:53:27 elastic2045 elasticsearch[24034]:         at java.nio.file.Files.createDirectory(Files.java:674)
Feb 26 07:53:27 elastic2045 elasticsearch[24034]:         at java.nio.file.Files.createAndCheckIsDirectory(Files.java:781)
Feb 26 07:53:27 elastic2045 elasticsearch[24034]:         at java.nio.file.Files.createDirectories(Files.java:767)
Feb 26 07:53:27 elastic2045 elasticsearch[24034]:         at org.elasticsearch.common.PidFile.create(PidFile.java:69)
Feb 26 07:53:27 elastic2045 elasticsearch[24034]:         at org.elasticsearch.common.PidFile.create(PidFile.java:55)
Feb 26 07:53:27 elastic2045 elasticsearch[24034]:         at org.elasticsearch.bootstrap.Bootstrap.init(Bootstrap.java:308)
Feb 26 07:53:27 elastic2045 elasticsearch[24034]:         at org.elasticsearch.bootstrap.Elasticsearch.init(Elasticsearch.java:136)
Feb 26 07:53:27 elastic2045 elasticsearch[24034]:         at org.elasticsearch.bootstrap.Elasticsearch.execute(Elasticsearch.java:127)
Feb 26 07:53:27 elastic2045 elasticsearch[24034]:         at org.elasticsearch.cli.EnvironmentAwareCommand.execute(EnvironmentAwareCommand.java:86)
Feb 26 07:53:27 elastic2045 elasticsearch[24034]:         at org.elasticsearch.cli.Command.mainWithoutErrorHandling(Command.java:124)
Feb 26 07:53:27 elastic2045 elasticsearch[24034]:         at org.elasticsearch.cli.Command.main(Command.java:90)
Feb 26 07:53:27 elastic2045 elasticsearch[24034]:         at org.elasticsearch.bootstrap.Elasticsearch.main(Elasticsearch.java:93)
Feb 26 07:53:27 elastic2045 elasticsearch[24034]:         at org.elasticsearch.bootstrap.Elasticsearch.main(Elasticsearch.java:86)
Feb 26 07:53:27 elastic2045 elasticsearch[24034]: Refer to the log for complete error details.
Feb 26 07:53:28 elastic2045 systemd[1]: elasticsearch_6@production-search-codfw.service: Main process exited, code=exited, status=1/FAILURE
Feb 26 07:53:28 elastic2045 systemd[1]: elasticsearch_6@production-search-codfw.service: Unit entered failed state.
Feb 26 07:53:28 elastic2045 systemd[1]: elasticsearch_6@production-search-codfw.service: Failed with result 'exit-code'.

elastic2045 is a freshly re-imaged instance, and the problem persists across service restarts, so we know it's an actual problem related specifically to the new plugin version.

If for whatever reason testing on a production host needs to be done: once the instance is banned from the Elasticsearch cluster and depooled, upgrade wmf-elasticsearch-search-plugins to 6.5.4-5~stretch (the latest version, so just a normal apt-get upgrade), and then manually re-install the previous version when ready to revert: sudo apt-get remove --purge wmf-elasticsearch-search-plugins --yes && sudo dpkg -i /var/cache/apt/archives/wmf-elasticsearch-search-plugins_6.5.4-4~stretch_all.deb && sudo puppet agent -tv. A rough end-to-end sketch follows.
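
Roughly, the full sequence might look like this (the depool/pool wrappers and the port are assumptions; the allocation-exclude call is the standard Elasticsearch cluster-settings API):

# Ban the node from shard allocation (replace <hostname> and adjust the port for the cluster under test)
curl -s -XPUT 'http://localhost:9200/_cluster/settings' -H 'Content-Type: application/json' \
  -d '{"transient": {"cluster.routing.allocation.exclude._name": "<hostname>*"}}'

# Depool the host from live traffic (assumed WMF conftool wrapper)
sudo depool

# Install the plugin version under test
sudo apt-get update && sudo apt-get install wmf-elasticsearch-search-plugins=6.5.4-5~stretch

# ... test ...

# Revert to the previous version and re-pool when done
sudo apt-get remove --purge wmf-elasticsearch-search-plugins --yes
sudo dpkg -i /var/cache/apt/archives/wmf-elasticsearch-search-plugins_6.5.4-4~stretch_all.deb
sudo puppet agent -tv
sudo pool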

The new plugin seems faulty: multiple versions of the same jar are present in the Debian package, causing Elasticsearch to fail (a quick duplicate-jar check is sketched after the trace):

class: org.wikimedia.search.extra.ExtraCorePlugin
jar1: /usr/share/elasticsearch/plugins/extra/extra-6.5.4-wmf11.jar
jar2: /usr/share/elasticsearch/plugins/extra/extra-6.5.4-wmf13.jar
        at org.elasticsearch.bootstrap.JarHell.checkClass(JarHell.java:277) ~[elasticsearch-core-6.5.4.jar:6.5.4]
        at org.elasticsearch.bootstrap.JarHell.checkJarHell(JarHell.java:190) ~[elasticsearch-core-6.5.4.jar:6.5.4]
        at org.elasticsearch.plugins.PluginsService.checkBundleJarHell(PluginsService.java:503) ~[elasticsearch-6.5.4.jar:6.5.4]
        ... 14 more

Then, for some unknown reason, Elasticsearch deletes /var/run/elasticsearch (see T276198), causing the failure mentioned in earlier comments.
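
To see what is responsible for (re)creating that directory on an affected host, something like the following could help (a sketch; the unit name is taken from the eqiad logs above):

# Does the systemd unit declare a RuntimeDirectory that would recreate /var/run/elasticsearch?
systemctl cat elasticsearch_6@production-search-eqiad.service | grep -i runtimedirectory
# Is there a tmpfiles.d entry covering /var/run/elasticsearch?
grep -ri elasticsearch /etc/tmpfiles.d/ /usr/lib/tmpfiles.d/ 2>/dev/null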

@Gehel suggested building a fresh Debian package version without the duplicated jar (https://gerrit.wikimedia.org/r/c/operations/software/elasticsearch/plugins/+/667837).

Mentioned in SAL (#wikimedia-operations) [2021-03-24T01:49:53Z] <ryankemper> T274204 sudo -i cookbook sre.elasticsearch.rolling-upgrade-reboot relforge "relforge cluster restarts" --task-id T274204 --nodes-per-run 3 --start-datetime 2021-03-24T01:45:59+00:00 on ryankemper@cumin1001 tmux session elasticsearch_rolling_upgrade_reboots

Mentioned in SAL (#wikimedia-operations) [2021-03-24T01:58:59Z] <ryankemper> T274204 ctrl+c'd out of run; relforge is relying on outdated config that is trying to talk to relforge1002 which no longer exists. Need to refactor so that config no longer lives in spicerack

Mentioned in SAL (#wikimedia-operations) [2021-03-24T01:59:24Z] <ryankemper> T274204 For now I'll proceed to the reboots of codfw

Mentioned in SAL (#wikimedia-operations) [2021-03-24T02:38:46Z] <ryankemper> T274204 sudo -i cookbook sre.elasticsearch.rolling-upgrade search_codfw "codfw cluster reboot" --task-id T274204 --nodes-per-run 3 --start-datetime 2021-03-24T02:29:39 on ryankemper@cumin1001 tmux session elasticsearch_rolling_upgrade_reboots

Mentioned in SAL (#wikimedia-operations) [2021-03-24T03:39:45Z] <ryankemper> T274204 Timed out waiting for write queues to empty: [59/60, retrying in 60.00s] Attempt to run 'spicerack.elasticsearch_cluster.ElasticsearchClusters.wait_for_all_write_queues_empty' raised: Write queue not empty (had value of 241631) for partition 0 of topic codfw.cpjobqueue.partitioned.mediawiki.job.cirrusSearchElasticaWrite.

Mentioned in SAL (#wikimedia-operations) [2021-03-24T03:41:08Z] <ryankemper> T274204 Restarting codfw restart; the timestamp argument should prevent it from wasting time on nodes that have been rebooted already

Mentioned in SAL (#wikimedia-operations) [2021-03-24T03:41:15Z] <ryankemper> T274204 sudo -i cookbook sre.elasticsearch.rolling-upgrade search_codfw "codfw cluster reboot" --task-id T274204 --nodes-per-run 3 --start-datetime 2021-03-24T02:29:39 on ryankemper@cumin1001 tmux session elasticsearch_rolling_upgrade_reboots

Mentioned in SAL (#wikimedia-operations) [2021-03-25T00:05:50Z] <ryankemper> T274204 sudo -i cookbook sre.elasticsearch.rolling-upgrade search_eqiad "eqiad cluster reboot" --task-id T274204 --nodes-per-run 3 --start-datetime 2021-03-24T23:55:35 on ryankemper@cumin1001 tmux session elasticsearch_rolling_upgrade_reboots

Mentioned in SAL (#wikimedia-operations) [2023-05-18T17:59:39Z] <bking@cumin1001> START - Cookbook sre.elasticsearch.rolling-operation Operation.UPGRADE (1 nodes at a time) for ElasticSearch cluster relforge: relforge elasticsearch and plugin upgrade - bking@cumin1001 - T274204

Mentioned in SAL (#wikimedia-operations) [2023-05-18T18:07:35Z] <bking@cumin1001> END (PASS) - Cookbook sre.elasticsearch.rolling-operation (exit_code=0) Operation.UPGRADE (1 nodes at a time) for ElasticSearch cluster relforge: relforge elasticsearch and plugin upgrade - bking@cumin1001 - T274204