Restart Elastic/Blazegraph services to pick up JRE updates
Closed, Resolved · Public · 2 Estimated Story Points

Assigned To

Authored By

	bking
	Feb 17 2023, 4:49 PM

Description

Per email with @MoritzMuehlenhoff :

I've gone ahead and deployed the updates on the relforge and cloudelastic clusters. Can you take care of rolling restarts to pick up the updated JRE?

Yes, we will take care of these and record updates in the ticket.

I can also directly apply the updates on the main elastic* cluster, or we can wait until relforge/cloudelastic are done; I'm fine either way.

You can apply these at your convenience.

Thanks Moritz! Ping me here or in IRC if you have any other questions or directions.

Details

Other Assignee: RKemper

Related Objects

Mentioned In: T283746: Allow multiple "assigned to"

Event Timeline

bking created this task. · Feb 17 2023, 4:49 PM

Restricted Application added a subscriber: Aklapper. · Feb 17 2023, 4:49 PM

bking updated Other Assignee, added: RKemper. · Feb 17 2023, 4:50 PM

Mentioned in SAL (#wikimedia-operations) [2023-02-17T17:06:13Z] <bking@cumin1001> START - Cookbook sre.elasticsearch.rolling-operation Operation.RESTART (1 nodes at a time) for ElasticSearch cluster relforge: relforge cluster restart - bking@cumin1001 - T329957

Mentioned in SAL (#wikimedia-operations) [2023-02-17T17:06:31Z] <bking@cumin1001> END (FAIL) - Cookbook sre.elasticsearch.rolling-operation (exit_code=99) Operation.RESTART (1 nodes at a time) for ElasticSearch cluster relforge: relforge cluster restart - bking@cumin1001 - T329957

Mentioned in SAL (#wikimedia-operations) [2023-02-17T17:06:58Z] <bking@cumin1001> START - Cookbook sre.elasticsearch.rolling-operation Operation.RESTART (1 nodes at a time) for ElasticSearch cluster relforge: relforge cluster restart - bking@cumin1001 - T329957

Mentioned in SAL (#wikimedia-operations) [2023-02-17T17:10:38Z] <bking@cumin1001> END (PASS) - Cookbook sre.elasticsearch.rolling-operation (exit_code=0) Operation.RESTART (1 nodes at a time) for ElasticSearch cluster relforge: relforge cluster restart - bking@cumin1001 - T329957

Ammarpad mentioned this in T283746: Allow multiple "assigned to". · Feb 17 2023, 5:32 PM

Mentioned in SAL (#wikimedia-operations) [2023-02-17T22:05:32Z] <bking@cumin1001> START - Cookbook sre.elasticsearch.rolling-operation Operation.RESTART (1 nodes at a time) for ElasticSearch cluster cloudelastic: cloudelastic cluster restart - bking@cumin1001 - T329957

Mentioned in SAL (#wikimedia-operations) [2023-02-17T22:06:22Z] <bking@cumin1001> END (FAIL) - Cookbook sre.elasticsearch.rolling-operation (exit_code=99) Operation.RESTART (1 nodes at a time) for ElasticSearch cluster cloudelastic: cloudelastic cluster restart - bking@cumin1001 - T329957

Mentioned in SAL (#wikimedia-operations) [2023-02-17T22:09:36Z] <bking@cumin1001> START - Cookbook sre.elasticsearch.rolling-operation Operation.RESTART (1 nodes at a time) for ElasticSearch cluster cloudelastic: cloudelastic cluster restart - bking@cumin1001 - T329957

Mentioned in SAL (#wikimedia-operations) [2023-02-17T22:45:04Z] <bking@cumin1001> END (PASS) - Cookbook sre.elasticsearch.rolling-operation (exit_code=0) Operation.RESTART (1 nodes at a time) for ElasticSearch cluster cloudelastic: cloudelastic cluster restart - bking@cumin1001 - T329957

As shown above, I restarted the elasticsearch service on all cluster nodes in the relforge and cloudelastic environments.
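For readers unfamiliar with the "N nodes at a time" wording in the SAL entries above, here is a minimal sketch of the general rolling-restart pattern. It is not the code of the sre.elasticsearch.rolling-operation cookbook; `restart` and `cluster_is_green` are hypothetical callables standing in for whatever the real tooling does to restart a node and to check cluster health.

```python
from typing import Callable, List, Sequence


def batches(nodes: Sequence[str], size: int) -> List[List[str]]:
    """Split the node list into consecutive groups of at most `size` nodes."""
    if size < 1:
        raise ValueError("batch size must be at least 1")
    return [list(nodes[i:i + size]) for i in range(0, len(nodes), size)]


def rolling_restart(
    nodes: Sequence[str],
    size: int,
    restart: Callable[[str], None],         # hypothetical: restarts one node
    cluster_is_green: Callable[[], bool],   # hypothetical: health check
) -> List[str]:
    """Restart nodes batch by batch, stopping if the cluster is unhealthy."""
    done: List[str] = []
    for batch in batches(nodes, size):
        for node in batch:
            restart(node)
        # Only move on to the next batch once the cluster has recovered.
        if not cluster_is_green():
            raise RuntimeError(f"cluster not healthy after restarting {batch}")
        done.extend(batch)
    return done
```

With `size=1` this restarts one node at a time; a larger `size` trades some safety margin for a shorter total run.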

Mentioned in SAL (#wikimedia-operations) [2023-02-20T08:08:48Z] <moritzm> updating openjdk-11 on elastic* servers T329957

In T329957#8626564, @bking wrote:

As shown above, I restarted the elasticsearch service on all cluster nodes in the relforge and cloudelastic environments.

Ack, thanks. I've just updated OpenJDK on the elastic* nodes as well.

One thing I noticed is that after the run of the sre.elasticsearch.rolling-operation cookbook, there are still logstash* processes using the old JRE on the cloudelastic* nodes, e.g. on cloudelastic1002:

logstash     786  0.6  0.4 5144604 567316 ?      SNsl  2022 1169:09 /bin/java -Xms192m -Xmx192m -XX:CMSInitiatingOccupancyFraction=75 -XX:+UseCMSInitiatingOccupancyOnly -XX:+DisableExplicitGC -Djava.awt.headless=true -Dfile.encoding=UTF-8 -Dlog4j2.formatMsgNoLookups=true -XX:+HeapDumpOnOutOfMemoryError -Xlog:gc*:file=/var/log/logstash/logstash_jvm_gc.%p.log::filecount=10,filesize=20000 -Xlog:gc+age=trace -cp /usr/share/logstash/logstash-core/lib/jars/animal-sniffer-annotations-1.14.jar:/usr/share/logstash/logstash-core/lib/jars/commons-codec-1.11.jar:/usr/share/logstash/logstash-core/lib/jars/commons-compiler-3.0.8.jar:/usr/share/logstash/logstash-core/lib/jars/error_prone_annotations-2.0.18.jar:/usr/share/logstash/logstash-core/lib/jars/google-java-format-1.1.jar:/usr/share/logstash/logstash-core/lib/jars/gradle-license-report-0.7.1.jar:/usr/share/logstash/logstash-core/lib/jars/guava-22.0.jar:/usr/share/logstash/logstash-core/lib/jars/j2objc-annotations-1.1.jar:/usr/share/logstash/logstash-core/lib/jars/jackson-annotations-2.9.10.jar:/usr/share/logstash/logstash-core/lib/jars/jackson-core-2.9.10.jar:/usr/share/logstash/logstash-core/lib/jars/jackson-databind-2.9.10.1.jar:/usr/share/logstash/logstash-core/lib/jars/jackson-dataformat-cbor-2.9.10.jar:/usr/share/logstash/logstash-core/lib/jars/janino-3.0.8.jar:/usr/share/logstash/logstash-core/lib/jars/javassist-3.22.0-GA.jar:/usr/share/logstash/logstash-core/lib/jars/jruby-complete-9.2.7.0.jar:/usr/share/logstash/logstash-core/lib/jars/jsr305-1.3.9.jar:/usr/share/logstash/logstash-core/lib/jars/log4j-api-2.17.1.jar:/usr/share/logstash/logstash-core/lib/jars/log4j-core-2.17.1.jar:/usr/share/logstash/logstash-core/lib/jars/log4j-slf4j-impl-2.17.1.jar:/usr/share/logstash/logstash-core/lib/jars/logstash-core.jar:/usr/share/logstash/logstash-core/lib/jars/org.eclipse.core.commands-3.6.0.jar:/usr/share/logstash/logstash-core/lib/jars/org.eclipse.core.contenttype-3.4.100.jar:/usr/share/logstash/logstash-core/lib/jars/org.eclipse.core.expressions-3.4.300.jar:/usr/share/logstash/logstash-core/lib/jars/org.eclipse.core.filesystem-1.3.100.jar:/usr/share/logstash/logstash-core/lib/jars/org.eclipse.core.jobs-3.5.100.jar:/usr/share/logstash/logstash-core/lib/jars/org.eclipse.core.resources-3.7.100.jar:/usr/share/logstash/logstash-core/lib/jars/org.eclipse.core.runtime-3.7.0.jar:/usr/share/logstash/logstash-core/lib/jars/org.eclipse.equinox.app-1.3.100.jar:/usr/share/logstash/logstash-core/lib/jars/org.eclipse.equinox.common-3.6.0.jar:/usr/share/logstash/logstash-core/lib/jars/org.eclipse.equinox.preferences-3.4.1.jar:/usr/share/logstash/logstash-core/lib/jars/org.eclipse.equinox.registry-3.5.101.jar:/usr/share/logstash/logstash-core/lib/jars/org.eclipse.jdt.core-3.10.0.jar:/usr/share/logstash/logstash-core/lib/jars/org.eclipse.osgi-3.7.1.jar:/usr/share/logstash/logstash-core/lib/jars/org.eclipse.text-3.5.101.jar:/usr/share/logstash/logstash-core/lib/jars/slf4j-api-1.7.25.jar org.logstash.Logstash --path.settings /etc/logstash

Is that something we're missing in the cookbook?
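One way to spot leftover processes like this on a host is to look for memory-mapped files that have been replaced on disk: on Linux, /proc/<pid>/maps marks such mappings with a trailing "(deleted)". The sketch below is an illustration of that check, not part of the cookbook. The default `needle` of "libjvm" and the helper names are assumptions; reading other users' maps files generally needs root.

```python
import os
import re

# Pathname column of a /proc/<pid>/maps line that the kernel has flagged
# as unlinked, e.g. ".../libjvm.so (deleted)".
_DELETED_RE = re.compile(r"\s(/\S.*) \(deleted\)$")


def stale_mappings(maps_text, needle="libjvm"):
    """Return sorted unique deleted mapped paths that contain `needle`."""
    found = set()
    for line in maps_text.splitlines():
        match = _DELETED_RE.search(line)
        if match and needle in match.group(1):
            found.add(match.group(1))
    return sorted(found)


def scan_proc(needle="libjvm"):
    """Map PID -> stale paths for every readable process under /proc."""
    results = {}
    for entry in os.listdir("/proc"):
        if not entry.isdigit():
            continue
        try:
            with open(f"/proc/{entry}/maps") as handle:
                hits = stale_mappings(handle.read(), needle)
        except OSError:
            # Process exited in the meantime, or we lack permission.
            continue
        if hits:
            results[int(entry)] = hits
    return results
```

Any PID returned by `scan_proc()` is still running against a library file that a package upgrade has since replaced, so its service needs a restart.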

Let's also use this task for the Blazegraph restarts? I've just deployed the updated debs on wcqs* and wdqs*.

MoritzMuehlenhoff renamed this task from Restart Elastic services to pick up JRE updates to Restart Elastic/Blazegraph services to pick up JRE updates. · Feb 23 2023, 3:25 PM

Mentioned in SAL (#wikimedia-operations) [2023-02-23T19:54:32Z] <bking@cumin1001> START - Cookbook sre.elasticsearch.rolling-operation Operation.RESTART (3 nodes at a time) for ElasticSearch cluster search_codfw: apply JRE updates - bking@cumin1001 - T329957

We're in the process of restarting wcqs* and elastic codfw. elastic-eqiad still remains to be done [after codfw]. All of wdqs is done, as well as relforge and cloudelastic.

Mentioned in SAL (#wikimedia-operations) [2023-02-23T21:32:31Z] <bking@cumin1001> END (PASS) - Cookbook sre.elasticsearch.rolling-operation (exit_code=0) Operation.RESTART (3 nodes at a time) for ElasticSearch cluster search_codfw: apply JRE updates - bking@cumin1001 - T329957

Elastic codfw is done; all Search Platform hosts should have the updates applied now, *except* elastic eqiad.

Mentioned in SAL (#wikimedia-operations) [2023-02-23T22:34:23Z] <bking@cumin1001> START - Cookbook sre.elasticsearch.rolling-operation Operation.RESTART (3 nodes at a time) for ElasticSearch cluster search_eqiad: apply JRE updates - bking@cumin1001 - T329957

Mentioned in SAL (#wikimedia-operations) [2023-02-24T00:13:18Z] <bking@cumin1001> END (PASS) - Cookbook sre.elasticsearch.rolling-operation (exit_code=0) Operation.RESTART (3 nodes at a time) for ElasticSearch cluster search_eqiad: apply JRE updates - bking@cumin1001 - T329957
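The START/END pairs in the SAL entries above can be turned into run durations, which is handy when planning the next round of restarts. This is a small sketch that assumes the line format stays exactly as it appears in this ticket; `SAL_RE` and `run_durations` are names made up for the example.

```python
import re
from datetime import datetime

# Matches the cookbook START/END lines as they appear in this ticket.
SAL_RE = re.compile(
    r"\[(?P<ts>\d{4}-\d{2}-\d{2}T\d{2}:\d{2}:\d{2}Z)\] <[^>]+> "
    r"(?P<phase>START|END \((?P<result>PASS|FAIL)\)) - Cookbook \S+ "
    r".*? for ElasticSearch cluster (?P<cluster>\w+):"
)


def run_durations(lines):
    """Pair each START with the next END for the same cluster.

    Returns a list of (cluster, result, seconds) tuples in END order.
    """
    started = {}
    runs = []
    for line in lines:
        match = SAL_RE.search(line)
        if not match:
            continue
        ts = datetime.strptime(match["ts"], "%Y-%m-%dT%H:%M:%SZ")
        cluster = match["cluster"]
        if match["phase"] == "START":
            started[cluster] = ts
        elif cluster in started:
            begin = started.pop(cluster)
            seconds = int((ts - begin).total_seconds())
            runs.append((cluster, match["result"], seconds))
    return runs
```

Feeding it the lines of this ticket gives one tuple per cookbook run, including the short failed attempts that preceded the successful relforge and cloudelastic runs.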

Thanks! These all look fine.

Actually, there's one additional Search service I missed: apifeatureusage*. I've just updated the JRE packages on those (two) servers.

MPhamWMF set the point value for this task to 2. · Feb 27 2023, 4:33 PM

MPhamWMF moved this task from Incoming to Ready for Dev -- SRE/Ops on the Discovery-Search (Current work) board.

RKemper moved this task from Ready for Dev -- SRE/Ops to In Progress on the Discovery-Search (Current work) board. · Feb 27 2023, 4:35 PM

Mentioned in SAL (#wikimedia-operations) [2023-02-27T22:31:11Z] <ryankemper> [apifeatureusage] T329957 Restarted logstash on apifeatureusage[1-2]001

Should be all done here.

In T329957#8650955, @RKemper wrote:

Should be all done here.

Indeed, looks all good, thanks!

Gehel closed this task as Resolved. · Mar 10 2023, 2:12 PM
