
Restart Search Platform-owned services for Java 8 / Java 11 security updates
Closed, Resolved · Public

Description

Per email with @MoritzMuehlenhoff:

There are new Java 11 security updates. I've already rolled out the
updated debs; can you please take care of restarting:

  • relforge complete!
  • cloudelastic complete!
  • production eqiad complete!
  • production codfw complete!
  • wcqs
  • wdqs/test (ideally not before T347504)
  • wdqs/internal
  • wdqs/public
  • logstash services on all hosts, if applicable complete!
  • apifeatureusage

AC: Roll-restart services in the above environments, for all clusters.

Event Timeline

Gehel triaged this task as High priority. Nov 7 2023, 7:54 PM
Gehel moved this task from Incoming to Ready for Work on the Data-Platform-SRE board.

Mentioned in SAL (#wikimedia-operations) [2023-11-07T21:36:51Z] <bking@cumin2002> START - Cookbook sre.elasticsearch.rolling-operation Operation.RESTART (1 nodes at a time) for ElasticSearch cluster cloudelastic: cloudelastic cluster restart - bking@cumin2002 - T350703

Mentioned in SAL (#wikimedia-operations) [2023-11-07T22:00:54Z] <bking@cumin2002> END (PASS) - Cookbook sre.elasticsearch.rolling-operation (exit_code=0) Operation.RESTART (1 nodes at a time) for ElasticSearch cluster cloudelastic: cloudelastic cluster restart - bking@cumin2002 - T350703

Mentioned in SAL (#wikimedia-operations) [2023-11-08T22:08:56Z] <ryankemper@cumin1001> START - Cookbook sre.elasticsearch.rolling-operation Operation.RESTART (3 nodes at a time) for ElasticSearch cluster search_codfw: codfw cluster restart (java 11 sec updates) - ryankemper@cumin1001 - T350703

Mentioned in SAL (#wikimedia-operations) [2023-11-08T23:28:18Z] <ryankemper@cumin1001> END (PASS) - Cookbook sre.elasticsearch.rolling-operation (exit_code=0) Operation.RESTART (3 nodes at a time) for ElasticSearch cluster search_codfw: codfw cluster restart (java 11 sec updates) - ryankemper@cumin1001 - T350703
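For context, the SAL entries above correspond to cookbook runs started from a cumin host. A rough sketch of such an invocation follows; the flag names are an approximation of the cookbook's CLI, not verified here, so check sudo cookbook sre.elasticsearch.rolling-operation --help before copying:

# Hedged sketch: restart the codfw search cluster 3 nodes at a time,
# logging against this task. Flag names are assumptions.
sudo cookbook sre.elasticsearch.rolling-operation \
    --restart --nodes-per-run 3 --task-id T350703 \
    search_codfw "codfw cluster restart (java 11 sec updates)"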

I've also extended this task to cover the restarts for WCQS and WDQS, as I've just rolled out the respective Java 8 security updates.

Also, for cloudelastic* I'm still seeing some logstash processes using the old JRE; maybe that's something that needs to be covered in the cookbook as well?
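A quick way to spot processes still running on a replaced JRE is to look for open file descriptors whose backing files have been deleted; this is the same check used near the end of this task. A minimal per-host sketch:

# Any process holding deleted files under a jdk/jre path is still on the old runtime.
sudo lsof -nXd DEL | grep jdk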

MoritzMuehlenhoff renamed this task from Restart Elasticsearch services for java 11 updates to Restart Elasticsearch services for Java 8/11 updates. Nov 9 2023, 7:39 AM
MoritzMuehlenhoff updated the task description. (Show Details)

@MoritzMuehlenhoff is it possible to delay the restart of blazegraph on nodes running a data import: wdqs1022, wdqs1023 and wdqs1024 (T347504)?

> @MoritzMuehlenhoff is it possible to delay the restart of blazegraph on nodes running a data import: wdqs1022, wdqs1023 and wdqs1024 (T347504)?

By all means yes, don't let this interfere with the ongoing import.

Aklapper renamed this task from Restart Elasticsearch services for Java 8/11 updates to Restart Elasticsearch services for Java 2023-11-08 updates. Nov 9 2023, 10:50 AM
MoritzMuehlenhoff renamed this task from Restart Elasticsearch services for Java 2023-11-08 updates to Restart Elasticsearch services for Java 8 / Java 11 security updates. Nov 9 2023, 11:57 AM
bking renamed this task from Restart Elasticsearch services for Java 8 / Java 11 security updates to Restart Search Platform-owned services for Java 8 / Java 11 security updates. Nov 9 2023, 2:25 PM
bking updated the task description. (Show Details)

> I've also extended this task to cover the restarts for WCQS and WDQS, as I've just rolled out the respective Java 8 security updates.

> Also, for cloudelastic* I'm still seeing some logstash processes using the old JRE; maybe that's something that needs to be covered in the cookbook as well?

Thanks! I'll get a task started for adding that; in the meantime, I have restarted the logstash services on all elastic hosts. Let me know if I missed anything.
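For the record, a fleet-wide restart like this is typically a cumin one-liner; a hedged sketch, assuming the systemd unit on these hosts is simply named "logstash":

# Hypothetical: restart logstash across the elastic fleet.
sudo cumin 'elastic*' 'systemctl restart logstash'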

bking moved this task from In Progress to Done on the Data-Platform-SRE board.

I believe this work is complete. Closing, but please reopen if we missed anything.

> I believe this work is complete. Closing, but please reopen if we missed anything.

The restarts for the Elastic-based roles seem all good to me, but it seems there are still some issues with the cookbooks for Blazegraph?

E.g. looking at wdqs1007, there's a blazegraph process (pid 2106782) dating back to October 26, and streaming-updater-consumer-0.3.129-jar-with-dependencies.jar also hasn't been restarted.
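One way to verify this is to compare java process start times against the deb rollout date; a minimal sketch, run directly on a host such as wdqs1007:

# List start time and command line for blazegraph / updater processes;
# anything started before the security update is still on the old JRE.
ps -eo pid,lstart,args | grep -E 'blazegraph|streaming-updater' | grep -v grep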

There is one more Search service I had initially missed: the apifeatureusage* Logstash cluster. I've extended the task description to add it.

Mentioned in SAL (#wikimedia-operations) [2024-01-04T20:41:11Z] <ryankemper> [apifeatureusage] T350703 Restarted logstash on apifeatureusage[1,2]001

@MoritzMuehlenhoff This should be all done. Let us know if you see any rogue java processes hanging around!

> @MoritzMuehlenhoff This should be all done. Let us know if you see any rogue java processes hanging around!

Actually, there are :-) See below for a list of blazegraph processes, hostnames and PIDs as currently found. But it may also be a case of Blazegraph completing an ongoing query before terminating; I'll recheck the status on Monday.

wdqs1011.eqiad.wmnet:       3751057 blazegraph
wdqs1011.eqiad.wmnet:       3752211 blazegraph
wdqs1011.eqiad.wmnet:       3758645 blazegraph
wdqs1012.eqiad.wmnet:        323958 blazegraph
wdqs1015.eqiad.wmnet:        766658 blazegraph
wdqs1023.eqiad.wmnet:       3347193 blazegraph
wdqs2008.codfw.wmnet:       3638763 blazegraph
wdqs2008.codfw.wmnet:       3639435 blazegraph
wdqs2008.codfw.wmnet:       3639896 blazegraph
wdqs2014.codfw.wmnet:       3743403 blazegraph
wdqs2014.codfw.wmnet:       3747222 blazegraph
wdqs2014.codfw.wmnet:       3748731 blazegraph
wdqs2015.codfw.wmnet:       3760276 blazegraph
wdqs2015.codfw.wmnet:       3761897 blazegraph
wdqs2015.codfw.wmnet:       3765913 blazegraph
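For reference, a list in this "PID name" shape can be collected fleet-wide with cumin; a hedged guess at the command used, since pgrep -l prints exactly this format:

# Hypothetical reconstruction; the actual command may have differed.
sudo cumin 'wdqs*' 'pgrep -l blazegraph'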

@MoritzMuehlenhoff Oops, it appears we made the same mistake twice :P Can you do one more check for us? I think everything is all set now:

ryankemper@cumin2002:~$ sudo -E cumin wdqs* 'lsof -nXd DEL | grep jdk & true'
33 hosts will be targeted:
wdqs[2007-2025].codfw.wmnet,wdqs[1011-1024].eqiad.wmnet
OK to proceed on 33 hosts? Enter the number of affected hosts to confirm or "q" to quit: 33
===== NO OUTPUT =====
PASS |█████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 100% (33/33) [00:01<00:00, 23.57hosts/s]
FAIL |                                                                                                                                                                                                                  |   0% (0/33) [00:01<?, ?hosts/s]
100.0% (33/33) success ratio (>= 100.0% threshold) for command: 'lsof -nXd DEL | grep jdk & true'.
100.0% (33/33) success ratio (>= 100.0% threshold) of nodes successfully executed all commands.
ryankemper@cumin2002:~$ sudo -E cumin api* 'lsof -nXd DEL | grep jdk & true'
2 hosts will be targeted:
apifeatureusage2001.codfw.wmnet,apifeatureusage1001.eqiad.wmnet
OK to proceed on 2 hosts? Enter the number of affected hosts to confirm or "q" to quit: 2
===== NO OUTPUT =====
PASS |███████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 100% (2/2) [00:00<00:00,  2.20hosts/s]
FAIL |                                                                                                                                                                                                                   |   0% (0/2) [00:00<?, ?hosts/s]
100.0% (2/2) success ratio (>= 100.0% threshold) for command: 'lsof -nXd DEL | grep jdk & true'.
100.0% (2/2) success ratio (>= 100.0% threshold) of nodes successfully executed all commands.
ryankemper@cumin2002:~$ sudo -E cumin elastic* 'lsof -nXd DEL | grep jdk & true'
127 hosts will be targeted:
elastic[2037-2048,2050-2109].codfw.wmnet,elastic[1053-1107].eqiad.wmnet
OK to proceed on 127 hosts? Enter the number of affected hosts to confirm or "q" to quit: 127
===== NO OUTPUT =====
PASS |███████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 100% (127/127) [00:02<00:00, 44.67hosts/s]
FAIL |                                                                                                                                                                                                                 |   0% (0/127) [00:02<?, ?hosts/s]
100.0% (127/127) success ratio (>= 100.0% threshold) for command: 'lsof -nXd DEL | grep jdk & true'.
100.0% (127/127) success ratio (>= 100.0% threshold) of nodes successfully executed all commands.
ryankemper@cumin2002:~$ sudo -E cumin wcqs* 'lsof -nXd DEL | grep jdk & true'
6 hosts will be targeted:
wcqs[2001-2003].codfw.wmnet,wcqs[1001-1003].eqiad.wmnet
OK to proceed on 6 hosts? Enter the number of affected hosts to confirm or "q" to quit: 6
===== NO OUTPUT =====
PASS |███████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 100% (6/6) [00:01<00:00,  6.00hosts/s]
FAIL |                                                                                                                                                                                                                   |   0% (0/6) [00:00<?, ?hosts/s]
100.0% (6/6) success ratio (>= 100.0% threshold) for command: 'lsof -nXd DEL | grep jdk & true'.
100.0% (6/6) success ratio (>= 100.0% threshold) of nodes successfully executed all commands.
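For anyone reading along, here is an annotated copy of the check used above (same command, comments only; the flag reading is my interpretation of the lsof man page):

# -n        don't resolve hostnames (faster)
# -X        skip TCP/socket detail, a Linux-specific speedup
# -d DEL    only descriptors whose backing file has been deleted,
#           i.e. mappings of a JRE that was replaced on disk by the update
# grep jdk  narrow the output to Java runtime files
# "& true"  background the pipeline and exit 0 even when grep matches
#           nothing, so cumin counts a clean host as PASS
lsof -nXd DEL | grep jdk & true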