
Restart Search Platform-owned services for Java 8 / Java 11 security updates
Closed, Resolved · Public

Description

Per email with @MoritzMuehlenhoff:

There are new Java 11 security updates. I've already rolled out the
updated debs; can you please take care of restarting:

  • relforge complete!
  • cloudelastic complete!
  • production eqiad complete!
  • production codfw complete!
  • wcqs
  • wdqs/test (ideally not before T347504)
  • wdqs/internal
  • wdqs/public
  • logstash services on all hosts, if applicable complete!
  • apifeatureusage

AC: Roll-restart services in the above environments, for all clusters.

Event Timeline

Gehel triaged this task as High priority. Nov 7 2023, 7:54 PM
Gehel moved this task from Incoming to Ready for Work on the Data-Platform-SRE board.

Mentioned in SAL (#wikimedia-operations) [2023-11-07T21:36:51Z] <bking@cumin2002> START - Cookbook sre.elasticsearch.rolling-operation Operation.RESTART (1 nodes at a time) for ElasticSearch cluster cloudelastic: cloudelastic cluster restart - bking@cumin2002 - T350703

Mentioned in SAL (#wikimedia-operations) [2023-11-07T22:00:54Z] <bking@cumin2002> END (PASS) - Cookbook sre.elasticsearch.rolling-operation (exit_code=0) Operation.RESTART (1 nodes at a time) for ElasticSearch cluster cloudelastic: cloudelastic cluster restart - bking@cumin2002 - T350703

Mentioned in SAL (#wikimedia-operations) [2023-11-08T22:08:56Z] <ryankemper@cumin1001> START - Cookbook sre.elasticsearch.rolling-operation Operation.RESTART (3 nodes at a time) for ElasticSearch cluster search_codfw: codfw cluster restart (java 11 sec updates) - ryankemper@cumin1001 - T350703

Mentioned in SAL (#wikimedia-operations) [2023-11-08T23:28:18Z] <ryankemper@cumin1001> END (PASS) - Cookbook sre.elasticsearch.rolling-operation (exit_code=0) Operation.RESTART (3 nodes at a time) for ElasticSearch cluster search_codfw: codfw cluster restart (java 11 sec updates) - ryankemper@cumin1001 - T350703
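For context, the SAL entries above correspond to cookbook runs started from a cumin host. A rough sketch of such an invocation follows; the flag names are an approximation of the cookbook's CLI, not verified here, so check sudo cookbook sre.elasticsearch.rolling-operation --help before copying:

# Hedged sketch: restart the codfw search cluster 3 nodes at a time,
# logging against this task. Flag names are assumptions.
sudo cookbook sre.elasticsearch.rolling-operation \
    --restart --nodes-per-run 3 --task-id T350703 \
    search_codfw "codfw cluster restart (java 11 sec updates)"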

I've also extended this task to cover the restarts for WCQS and WDQS, as I've just rolled out the respective Java 8 security updates.

Also, for cloudelastic* I'm still seeing some logstash processes using the old JRE; maybe that's something that needs to be covered in the cookbook as well?
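A quick way to spot processes still running on a replaced JRE is to look for open file descriptors whose backing files have been deleted; this is the same check used near the end of this task. A minimal per-host sketch:

# Any process holding deleted files under a jdk/jre path is still on the old runtime.
sudo lsof -nXd DEL | grep jdk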

MoritzMuehlenhoff renamed this task from Restart Elasticsearch services for java 11 updates to Restart Elasticsearch services for Java 8/11 updates. Nov 9 2023, 7:39 AM
MoritzMuehlenhoff updated the task description. (Show Details)

@MoritzMuehlenhoff is it possible to delay the restart of blazegraph on nodes running a data import: wdqs1022, wdqs1023 and wdqs1024 (T347504)?

> @MoritzMuehlenhoff is it possible to delay the restart of blazegraph on nodes running a data import: wdqs1022, wdqs1023 and wdqs1024 (T347504)?

By all means yes, don't let this interfere with the ongoing import.

Aklapper renamed this task from Restart Elasticsearch services for Java 8/11 updates to Restart Elasticsearch services for Java 2023-11-08 updates. Nov 9 2023, 10:50 AM
MoritzMuehlenhoff renamed this task from Restart Elasticsearch services for Java 2023-11-08 updates to Restart Elasticsearch services for Java 8 / Java 11 security updates. Nov 9 2023, 11:57 AM
bking renamed this task from Restart Elasticsearch services for Java 8 / Java 11 security updates to Restart Search Platform-owned services for Java 8 / Java 11 security updates. Nov 9 2023, 2:25 PM
bking updated the task description. (Show Details)

> I've also extended this task to cover the restarts for WCQS and WDQS, as I've just rolled out the respective Java 8 security updates.

> Also, for cloudelastic* I'm still seeing some logstash processes using the old JRE; maybe that's something that needs to be covered in the cookbook as well?

Thanks! I'll get a task started for adding that; in the meantime, I have restarted the logstash services on all elastic hosts. Let me know if I missed anything.
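For the record, a fleet-wide restart like this is typically a cumin one-liner; a hedged sketch, assuming the systemd unit on these hosts is simply named "logstash":

# Hypothetical: restart logstash across the elastic fleet.
sudo cumin 'elastic*' 'systemctl restart logstash'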

bking moved this task from In Progress to Done on the Data-Platform-SRE board.

I believe this work is complete. Closing, but please reopen if we missed anything.

> I believe this work is complete. Closing, but please reopen if we missed anything.

The restarts for the Elastic-based roles seem all good to me, but it seems there are still some issues with the cookbooks for Blazegraph?

E.g. looking at wdqs1007, there's a blazegraph process (pid 2106782) dating back to October 26, and streaming-updater-consumer-0.3.129-jar-with-dependencies.jar also hasn't been restarted.
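One way to verify this is to compare java process start times against the deb rollout date; a minimal sketch, run directly on a host such as wdqs1007:

# List start time and command line for blazegraph / updater processes;
# anything started before the security update is still on the old JRE.
ps -eo pid,lstart,args | grep -E 'blazegraph|streaming-updater' | grep -v grep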

There is one more Search service I had initially missed: the apifeatureusage* Logstash cluster. I've extended the task description to add it.

Mentioned in SAL (#wikimedia-operations) [2024-01-04T20:41:11Z] <ryankemper> [apifeatureusage] T350703 Restarted logstash on apifeatureusage[1,2]001

@MoritzMuehlenhoff This should be all done. Let us know if you see any rogue java processes hanging around!

> @MoritzMuehlenhoff This should be all done. Let us know if you see any rogue java processes hanging around!

Actually, there are :-) See below for a list of blazegraph processes, hostnames and PIDs as currently found. But it may also be a case of Blazegraph completing an ongoing query before terminating; I'll recheck the status on Monday.

wdqs1011.eqiad.wmnet:       3751057 blazegraph
wdqs1011.eqiad.wmnet:       3752211 blazegraph
wdqs1011.eqiad.wmnet:       3758645 blazegraph
wdqs1012.eqiad.wmnet:        323958 blazegraph
wdqs1015.eqiad.wmnet:        766658 blazegraph
wdqs1023.eqiad.wmnet:       3347193 blazegraph
wdqs2008.codfw.wmnet:       3638763 blazegraph
wdqs2008.codfw.wmnet:       3639435 blazegraph
wdqs2008.codfw.wmnet:       3639896 blazegraph
wdqs2014.codfw.wmnet:       3743403 blazegraph
wdqs2014.codfw.wmnet:       3747222 blazegraph
wdqs2014.codfw.wmnet:       3748731 blazegraph
wdqs2015.codfw.wmnet:       3760276 blazegraph
wdqs2015.codfw.wmnet:       3761897 blazegraph
wdqs2015.codfw.wmnet:       3765913 blazegraph
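For reference, a list in this "PID name" shape can be collected fleet-wide with cumin; a hedged guess at the command used, since pgrep -l prints exactly this format:

# Hypothetical reconstruction; the actual command may have differed.
sudo cumin 'wdqs*' 'pgrep -l blazegraph'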

@MoritzMuehlenhoff Oops, it appears we made the same mistake twice :P Can you do one more check for us? I think everything is all set now:

ryankemper@cumin2002:~$ sudo -E cumin wdqs* 'lsof -nXd DEL | grep jdk & true'
33 hosts will be targeted:
wdqs[2007-2025].codfw.wmnet,wdqs[1011-1024].eqiad.wmnet
OK to proceed on 33 hosts? Enter the number of affected hosts to confirm or "q" to quit: 33
===== NO OUTPUT =====
PASS |█████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 100% (33/33) [00:01<00:00, 23.57hosts/s]
FAIL |                                                                                                                                                                                                                  |   0% (0/33) [00:01<?, ?hosts/s]
100.0% (33/33) success ratio (>= 100.0% threshold) for command: 'lsof -nXd DEL | grep jdk & true'.
100.0% (33/33) success ratio (>= 100.0% threshold) of nodes successfully executed all commands.
ryankemper@cumin2002:~$ sudo -E cumin api* 'lsof -nXd DEL | grep jdk & true'
2 hosts will be targeted:
apifeatureusage2001.codfw.wmnet,apifeatureusage1001.eqiad.wmnet
OK to proceed on 2 hosts? Enter the number of affected hosts to confirm or "q" to quit: 2
===== NO OUTPUT =====
PASS |███████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 100% (2/2) [00:00<00:00,  2.20hosts/s]
FAIL |                                                                                                                                                                                                                   |   0% (0/2) [00:00<?, ?hosts/s]
100.0% (2/2) success ratio (>= 100.0% threshold) for command: 'lsof -nXd DEL | grep jdk & true'.
100.0% (2/2) success ratio (>= 100.0% threshold) of nodes successfully executed all commands.
ryankemper@cumin2002:~$ sudo -E cumin elastic* 'lsof -nXd DEL | grep jdk & true'
127 hosts will be targeted:
elastic[2037-2048,2050-2109].codfw.wmnet,elastic[1053-1107].eqiad.wmnet
OK to proceed on 127 hosts? Enter the number of affected hosts to confirm or "q" to quit: 127
===== NO OUTPUT =====
PASS |███████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 100% (127/127) [00:02<00:00, 44.67hosts/s]
FAIL |                                                                                                                                                                                                                 |   0% (0/127) [00:02<?, ?hosts/s]
100.0% (127/127) success ratio (>= 100.0% threshold) for command: 'lsof -nXd DEL | grep jdk & true'.
100.0% (127/127) success ratio (>= 100.0% threshold) of nodes successfully executed all commands.
ryankemper@cumin2002:~$ sudo -E cumin wcqs* 'lsof -nXd DEL | grep jdk & true'
6 hosts will be targeted:
wcqs[2001-2003].codfw.wmnet,wcqs[1001-1003].eqiad.wmnet
OK to proceed on 6 hosts? Enter the number of affected hosts to confirm or "q" to quit: 6
===== NO OUTPUT =====
PASS |███████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 100% (6/6) [00:01<00:00,  6.00hosts/s]
FAIL |                                                                                                                                                                                                                   |   0% (0/6) [00:00<?, ?hosts/s]
100.0% (6/6) success ratio (>= 100.0% threshold) for command: 'lsof -nXd DEL | grep jdk & true'.
100.0% (6/6) success ratio (>= 100.0% threshold) of nodes successfully executed all commands.
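For anyone reading along, here is an annotated copy of the check used above (same command, comments only; the flag reading is my interpretation of the lsof man page):

# -n        don't resolve hostnames (faster)
# -X        skip TCP/socket detail, a Linux-specific speedup
# -d DEL    only descriptors whose backing file has been deleted,
#           i.e. mappings of a JRE that was replaced on disk by the update
# grep jdk  narrow the output to Java runtime files
# "& true"  background the pipeline and exit 0 even when grep matches
#           nothing, so cumin counts a clean host as PASS
lsof -nXd DEL | grep jdk & true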