
Restart Elastic/Blazegraph services to pick up JRE updates
Closed, Resolved | Public | 2 Estimated Story Points

Description

Per email with @MoritzMuehlenhoff :

I've gone ahead and deployed the updates on the relforge and cloudelastic clusters, can you take care
of rolling restarts to pick up the updated JRE?

Yes, we will take care of these and record updates in the ticket

I can also directly apply the updates on the main elastic* cluster or
we can wait until relforge/cloudelastic are done. I'm fine either way.

You can apply these at your convenience.

Thanks Moritz! Ping me here or in IRC if you have any other questions or directions.

Details

Other Assignee
RKemper

Event Timeline

Mentioned in SAL (#wikimedia-operations) [2023-02-17T17:06:13Z] <bking@cumin1001> START - Cookbook sre.elasticsearch.rolling-operation Operation.RESTART (1 nodes at a time) for ElasticSearch cluster relforge: relforge cluster restart - bking@cumin1001 - T329957

Mentioned in SAL (#wikimedia-operations) [2023-02-17T17:06:31Z] <bking@cumin1001> END (FAIL) - Cookbook sre.elasticsearch.rolling-operation (exit_code=99) Operation.RESTART (1 nodes at a time) for ElasticSearch cluster relforge: relforge cluster restart - bking@cumin1001 - T329957

Mentioned in SAL (#wikimedia-operations) [2023-02-17T17:06:58Z] <bking@cumin1001> START - Cookbook sre.elasticsearch.rolling-operation Operation.RESTART (1 nodes at a time) for ElasticSearch cluster relforge: relforge cluster restart - bking@cumin1001 - T329957

Mentioned in SAL (#wikimedia-operations) [2023-02-17T17:10:38Z] <bking@cumin1001> END (PASS) - Cookbook sre.elasticsearch.rolling-operation (exit_code=0) Operation.RESTART (1 nodes at a time) for ElasticSearch cluster relforge: relforge cluster restart - bking@cumin1001 - T329957

Mentioned in SAL (#wikimedia-operations) [2023-02-17T22:05:32Z] <bking@cumin1001> START - Cookbook sre.elasticsearch.rolling-operation Operation.RESTART (1 nodes at a time) for ElasticSearch cluster cloudelastic: cloudelastic cluster restart - bking@cumin1001 - T329957

Mentioned in SAL (#wikimedia-operations) [2023-02-17T22:06:22Z] <bking@cumin1001> END (FAIL) - Cookbook sre.elasticsearch.rolling-operation (exit_code=99) Operation.RESTART (1 nodes at a time) for ElasticSearch cluster cloudelastic: cloudelastic cluster restart - bking@cumin1001 - T329957

Mentioned in SAL (#wikimedia-operations) [2023-02-17T22:09:36Z] <bking@cumin1001> START - Cookbook sre.elasticsearch.rolling-operation Operation.RESTART (1 nodes at a time) for ElasticSearch cluster cloudelastic: cloudelastic cluster restart - bking@cumin1001 - T329957

Mentioned in SAL (#wikimedia-operations) [2023-02-17T22:45:04Z] <bking@cumin1001> END (PASS) - Cookbook sre.elasticsearch.rolling-operation (exit_code=0) Operation.RESTART (1 nodes at a time) for ElasticSearch cluster cloudelastic: cloudelastic cluster restart - bking@cumin1001 - T329957

As shown above, I restarted the elasticsearch service on all cluster nodes in the relforge and cloudelastic environments.
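For anyone auditing the runs later, the SAL entries above can be condensed into one result and duration per cookbook run. The following is only a sketch: it assumes the entries keep the exact wording shown in this task, and the names `ENTRY` and `summarize` are made up for illustration, not part of the cookbook or of any SAL tooling.

```python
import re
from datetime import datetime

# Matches the SAL wording seen in this task, e.g.
# "[2023-02-17T17:06:13Z] <bking@cumin1001> START - Cookbook ... cluster relforge: ..."
ENTRY = re.compile(
    r"\[(?P<ts>[^\]]+)\] <[^>]+> (?P<kind>START|END \((?P<result>PASS|FAIL)\))"
    r" - Cookbook \S+ .*?for ElasticSearch cluster (?P<cluster>\w+):"
)


def summarize(lines):
    """Pair each START with the next END for the same cluster.

    Returns a list of (cluster, result, duration) tuples, where result is
    "PASS" or "FAIL" and duration is a datetime.timedelta.
    """
    started = {}
    runs = []
    for line in lines:
        m = ENTRY.search(line)
        if not m:
            continue  # not a cookbook START/END entry
        ts = datetime.strptime(m["ts"], "%Y-%m-%dT%H:%M:%SZ")
        cluster = m["cluster"]
        if m["kind"] == "START":
            started[cluster] = ts
        elif cluster in started:
            runs.append((cluster, m["result"], ts - started.pop(cluster)))
    return runs
```

Fed the four relforge entries above, this pairs the first START with the FAIL that followed 18 seconds later, and the second START with the PASS that followed 3 minutes 40 seconds later.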

Mentioned in SAL (#wikimedia-operations) [2023-02-20T08:08:48Z] <moritzm> updating openjdk-11 on elastic* servers T329957


Ack, thanks. I've just updated OpenJDK on the elastic* nodes as well.

One thing I noticed is that after the run of the sre.elasticsearch.rolling-operation cookbook, there are still logstash* processes using the old JRE on the cloudelastic* nodes, e.g. on cloudelastic1002:

logstash     786  0.6  0.4 5144604 567316 ?      SNsl  2022 1169:09 /bin/java -Xms192m -Xmx192m -XX:CMSInitiatingOccupancyFraction=75 -XX:+UseCMSInitiatingOccupancyOnly -XX:+DisableExplicitGC -Djava.awt.headless=true -Dfile.encoding=UTF-8 -Dlog4j2.formatMsgNoLookups=true -XX:+HeapDumpOnOutOfMemoryError -Xlog:gc*:file=/var/log/logstash/logstash_jvm_gc.%p.log::filecount=10,filesize=20000 -Xlog:gc+age=trace -cp /usr/share/logstash/logstash-core/lib/jars/animal-sniffer-annotations-1.14.jar:/usr/share/logstash/logstash-core/lib/jars/commons-codec-1.11.jar:/usr/share/logstash/logstash-core/lib/jars/commons-compiler-3.0.8.jar:/usr/share/logstash/logstash-core/lib/jars/error_prone_annotations-2.0.18.jar:/usr/share/logstash/logstash-core/lib/jars/google-java-format-1.1.jar:/usr/share/logstash/logstash-core/lib/jars/gradle-license-report-0.7.1.jar:/usr/share/logstash/logstash-core/lib/jars/guava-22.0.jar:/usr/share/logstash/logstash-core/lib/jars/j2objc-annotations-1.1.jar:/usr/share/logstash/logstash-core/lib/jars/jackson-annotations-2.9.10.jar:/usr/share/logstash/logstash-core/lib/jars/jackson-core-2.9.10.jar:/usr/share/logstash/logstash-core/lib/jars/jackson-databind-2.9.10.1.jar:/usr/share/logstash/logstash-core/lib/jars/jackson-dataformat-cbor-2.9.10.jar:/usr/share/logstash/logstash-core/lib/jars/janino-3.0.8.jar:/usr/share/logstash/logstash-core/lib/jars/javassist-3.22.0-GA.jar:/usr/share/logstash/logstash-core/lib/jars/jruby-complete-9.2.7.0.jar:/usr/share/logstash/logstash-core/lib/jars/jsr305-1.3.9.jar:/usr/share/logstash/logstash-core/lib/jars/log4j-api-2.17.1.jar:/usr/share/logstash/logstash-core/lib/jars/log4j-core-2.17.1.jar:/usr/share/logstash/logstash-core/lib/jars/log4j-slf4j-impl-2.17.1.jar:/usr/share/logstash/logstash-core/lib/jars/logstash-core.jar:/usr/share/logstash/logstash-core/lib/jars/org.eclipse.core.commands-3.6.0.jar:/usr/share/logstash/logstash-core/lib/jars/org.eclipse.core.contenttype-3.4.100.jar:/usr/share/logstash/logstash-core/lib/jars/org.eclipse.core.expressions-3.4.300.jar:/usr/share/logstash/logstash-core/lib/jars/org.eclipse.core.filesystem-1.3.100.jar:/usr/share/logstash/logstash-core/lib/jars/org.eclipse.core.jobs-3.5.100.jar:/usr/share/logstash/logstash-core/lib/jars/org.eclipse.core.resources-3.7.100.jar:/usr/share/logstash/logstash-core/lib/jars/org.eclipse.core.runtime-3.7.0.jar:/usr/share/logstash/logstash-core/lib/jars/org.eclipse.equinox.app-1.3.100.jar:/usr/share/logstash/logstash-core/lib/jars/org.eclipse.equinox.common-3.6.0.jar:/usr/share/logstash/logstash-core/lib/jars/org.eclipse.equinox.preferences-3.4.1.jar:/usr/share/logstash/logstash-core/lib/jars/org.eclipse.equinox.registry-3.5.101.jar:/usr/share/logstash/logstash-core/lib/jars/org.eclipse.jdt.core-3.10.0.jar:/usr/share/logstash/logstash-core/lib/jars/org.eclipse.osgi-3.7.1.jar:/usr/share/logstash/logstash-core/lib/jars/org.eclipse.text-3.5.101.jar:/usr/share/logstash/logstash-core/lib/jars/slf4j-api-1.7.25.jar org.logstash.Logstash --path.settings /etc/logstash

Is that something we're missing in the cookbook?
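One way to spot such leftovers, independent of the cookbook, is to look for processes that still map a JVM library whose file has been replaced on disk. On Linux the kernel marks these mappings with a trailing "(deleted)" in /proc/<pid>/maps. The sketch below assumes a Linux host, read access to other users' maps files (typically root), and that the JVM library is named libjvm.so. The function names `stale_jvm_libs` and `scan_proc` are hypothetical and not part of any existing tooling.

```python
import re
from pathlib import Path

# A maps line ends with the backing file path; the kernel appends
# " (deleted)" when that file has been unlinked or replaced since it was mapped.
DELETED_JVM = re.compile(r"\s(/\S*libjvm\S*\.so) \(deleted\)$")


def stale_jvm_libs(maps_text):
    """Return the sorted, unique JVM library paths that are mapped but deleted."""
    found = set()
    for line in maps_text.splitlines():
        m = DELETED_JVM.search(line)
        if m:
            found.add(m.group(1))
    return sorted(found)


def scan_proc(proc_root="/proc"):
    """Return {pid: [stale JVM library paths]} for every readable process."""
    result = {}
    for entry in Path(proc_root).iterdir():
        if not entry.name.isdigit():
            continue  # skip non-process entries such as /proc/meminfo
        try:
            libs = stale_jvm_libs((entry / "maps").read_text())
        except OSError:
            continue  # process exited meanwhile, or permission denied
        if libs:
            result[int(entry.name)] = libs
    return result
```

Running `scan_proc()` as root on a host after the package upgrade would list the PIDs that still need a restart. A process like the logstash one above would be expected to show up until its service is restarted.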

Let's also use this task for the Blazegraph restarts? I've just deployed the updated debs on wcqs* and wdqs*.

MoritzMuehlenhoff renamed this task from Restart Elastic services to pick up JRE updates to Restart Elastic/Blazegraph services to pick up JRE updates.Feb 23 2023, 3:25 PM

Mentioned in SAL (#wikimedia-operations) [2023-02-23T19:54:32Z] <bking@cumin1001> START - Cookbook sre.elasticsearch.rolling-operation Operation.RESTART (3 nodes at a time) for ElasticSearch cluster search_codfw: apply JRE updates - bking@cumin1001 - T329957

We're in the process of restarting wcqs* and elastic codfw. elastic-eqiad still remains to be done [after codfw]. All of wdqs is done, as well as relforge and cloudelastic.

Mentioned in SAL (#wikimedia-operations) [2023-02-23T21:32:31Z] <bking@cumin1001> END (PASS) - Cookbook sre.elasticsearch.rolling-operation (exit_code=0) Operation.RESTART (3 nodes at a time) for ElasticSearch cluster search_codfw: apply JRE updates - bking@cumin1001 - T329957

Elastic codfw is done; all Search Platform hosts should have the updates applied now, *except* elastic eqiad.

Mentioned in SAL (#wikimedia-operations) [2023-02-23T22:34:23Z] <bking@cumin1001> START - Cookbook sre.elasticsearch.rolling-operation Operation.RESTART (3 nodes at a time) for ElasticSearch cluster search_eqiad: apply JRE updates - bking@cumin1001 - T329957

Mentioned in SAL (#wikimedia-operations) [2023-02-24T00:13:18Z] <bking@cumin1001> END (PASS) - Cookbook sre.elasticsearch.rolling-operation (exit_code=0) Operation.RESTART (3 nodes at a time) for ElasticSearch cluster search_eqiad: apply JRE updates - bking@cumin1001 - T329957

Actually, there's one additional Search service I missed: apifeatureusage*. I've just updated the JRE packages on those (two) servers.

Mentioned in SAL (#wikimedia-operations) [2023-02-27T22:31:11Z] <ryankemper> [apifeatureusage] T329957 Restarted logstash on apifeatureusage[1-2]001

Should be all done here.

Indeed, looks all good, thanks!