
opensearch OOM on logstash102[34]
Closed, Resolved (Public)

Description

Today we have observed opensearch OOM'ing on logstash102[34] (only!)

09:56 -wikibugs:#wikimedia-operations- (CR) Clément Goubert: [V: +1 C: +2] P:logstash::production: mediawiki-php-fpm-slowlog [puppet] - https://gerrit.wikimedia.org/r/879417 
          (https://phabricator.wikimedia.org/T326794) (owner: Clément Goubert)
09:58 -icinga-wm:#wikimedia-operations- PROBLEM - Check unit status of httpbb_kubernetes_hourly on cumin2002 is CRITICAL: CRITICAL: Status of the systemd unit httpbb_kubernetes_hourly 
          https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state
09:59 -icinga-wm:#wikimedia-operations- RECOVERY - Check systemd state on thanos-fe1001 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
10:07 -icinga-wm:#wikimedia-operations- RECOVERY - Check systemd state on cumin2002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
10:08 -icinga-wm:#wikimedia-operations- PROBLEM - Check systemd state on logstash1023 is CRITICAL: CRITICAL - degraded: The following units failed: opensearch_2@production-elk7-eqiad.service 
          https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
10:08 -icinga-wm:#wikimedia-operations- PROBLEM - OpenSearch health check for shards on 9200 on logstash1023 is CRITICAL: CRITICAL - elasticsearch http://localhost:9200/_cluster/health error while fetching: 
          HTTPConnectionPool(host=localhost, port=9200): Max retries exceeded with url: /_cluster/health (Caused by NewConnectionError(urllib3.connection.HTTPConnection object at 0x7fb16ae6d280: Failed to 
          establish a new connection: [Errno 111] Connection refused)) https://wikitech.wi
10:08 -icinga-wm:#wikimedia-operations- org/wiki/Search%23Administration
10:08 -icinga-wm:#wikimedia-operations- RECOVERY - Check unit status of httpbb_kubernetes_hourly on cumin2002 is OK: OK: Status of the systemd unit httpbb_kubernetes_hourly 
          https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state
10:10 -icinga-wm:#wikimedia-operations- PROBLEM - OpenSearch health check for shards on 9200 on logstash1024 is CRITICAL: CRITICAL - elasticsearch http://localhost:9200/_cluster/health error while fetching: 
          HTTPConnectionPool(host=localhost, port=9200): Max retries exceeded with url: /_cluster/health (Caused by NewConnectionError(urllib3.connection.HTTPConnection object at 0x7fa688a4a280: Failed to 
          establish a new connection: [Errno 111] Connection refused)) https://wikitech.wi
10:10 -icinga-wm:#wikimedia-operations- org/wiki/Search%23Administration
10:10 -icinga-wm:#wikimedia-operations- PROBLEM - Check systemd state on logstash1024 is CRITICAL: CRITICAL - degraded: The following units failed: opensearch_2@production-elk7-eqiad.service 
          https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
10:11 -logmsgbot:#wikimedia-operations- !log jnuche@deploy1002 Finished scap: testwikis wikis to 1.40.0-wmf.19  refs T325582 (duration: 42m 26s)
10:11 -stashbot:#wikimedia-operations- T325582: 1.40.0-wmf.19 deployment blockers - https://phabricator.wikimedia.org/T325582
10:12 -logmsgbot:#wikimedia-operations- !log jnuche@deploy1002 scap failed: average error rate on 9/9 canaries increased by 10x (rerun with --force to override this check, see https://logstash.wikimedia.org for 
          details)
10:12 -jinxer-wm:#wikimedia-operations- (LogstashNoLogsIndexed) firing: Logstash logs are not being indexed by Elasticsearch - https://wikitech.wikimedia.org/wiki/Logstash#No_logs_indexed - 
          https://grafana.wikimedia.org/d/000000561/logstash?var-datasource=eqiad%20prometheus/ops - https://alerts.wikimedia.org/?q=alertname%3DLogstashNoLogsIndexed
10:13  <godog> mmhh not sure what happened yet with logstash there, cc jnuche as it might be related to the canaries check
10:14 -icinga-wm:#wikimedia-operations- PROBLEM - Check systemd state on mwmaint1002 is CRITICAL: CRITICAL - degraded: The following units failed: mediawiki_job_startupregistrystats-testwiki.service 
          https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
10:16  <godog> !log restart opensearch_2@production-elk7-eqiad.service on logstash102[34]
10:16 -stashbot:#wikimedia-operations- Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
10:16  <godog> (OOM)
10:16 -jinxer-wm:#wikimedia-operations- (LogstashKafkaConsumerLag) firing: Too many messages in kafka logging - https://wikitech.wikimedia.org/wiki/Logstash#Kafka_consumer_lag - 
          https://grafana.wikimedia.org/d/000000484/kafka-consumer-lag?var-cluster=logging-codfw&var-datasource=codfw%20prometheus/ops - https://alerts.wikimedia.org/?q=alertname%3DLogstashKafkaConsumerLag
10:17 -icinga-wm:#wikimedia-operations- RECOVERY - Check systemd state on logstash1023 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
10:18 -icinga-wm:#wikimedia-operations- RECOVERY - OpenSearch health check for shards on 9200 on logstash1023 is OK: OK - elasticsearch status production-elk7-eqiad: cluster_name: production-elk7-eqiad, status: 
          green, timed_out: False, number_of_nodes: 16, number_of_data_nodes: 10, discovered_master: True, discovered_cluster_manager: True, active_primary_shards: 660, active_shards: 1489, relocating_shards: 0, 
          initializing_shards: 0, unassigned_shards: 0, delayed_unassigned_shar
10:18 -icinga-wm:#wikimedia-operations- umber_of_pending_tasks: 0, number_of_in_flight_fetch: 0, task_max_waiting_in_queue_millis: 0, active_shards_percent_as_number: 100.0 
          https://wikitech.wikimedia.org/wiki/Search%23Administration

I'm not yet sure of the exact cause, though puppet applying https://gerrit.wikimedia.org/r/879417 seems correlated with an increase in memory usage; maybe that pushed logstash102[34] over the edge, though it didn't affect logstash1025, which applied the same change half an hour later:

logstash1023/syslog.log:Jan 17 10:05:26 logstash1023 puppet-agent[3407472]: (/Stage[main]/Profile::Logstash::Production/Logstash::Input::Kafka[mediawiki-php-fpm-slowlog-eqiad]/Logstash::Conf[input-kafka-mediawiki-php-fpm-slowlog-eqiad]/File[/etc/logstash/conf.d/10-input-kafka-mediawiki-php-fpm-slowlog-eqiad.conf]) Scheduling refresh of Service[logstash]
logstash1024/syslog.log:Jan 17 10:06:48 logstash1024 puppet-agent[3414332]: (/Stage[main]/Profile::Logstash::Production/Logstash::Input::Kafka[mediawiki-php-fpm-slowlog-eqiad]/Logstash::Conf[input-kafka-mediawiki-php-fpm-slowlog-eqiad]/File[/etc/logstash/conf.d/10-input-kafka-mediawiki-php-fpm-slowlog-eqiad.conf]) Scheduling refresh of Service[logstash]
logstash1025/syslog.log:Jan 17 10:26:24 logstash1025 puppet-agent[3438606]: (/Stage[main]/Profile::Logstash::Production/Logstash::Input::Kafka[mediawiki-php-fpm-slowlog-eqiad]/Logstash::Conf[input-kafka-mediawiki-php-fpm-slowlog-eqiad]/File[/etc/logstash/conf.d/10-input-kafka-mediawiki-php-fpm-slowlog-eqiad.conf]) Scheduling refresh of Service[logstash]
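
For the record, a quick (hypothetical) way to confirm it was the kernel OOM killer that took opensearch down on those two hosts, rather than something else; these are commands one could run on logstash1023/1024, not output from the incident:

# look for OOM-killer activity in the kernel log around the failure window,
# then confirm which unit systemd marked as failed
journalctl -k --since "2023-01-17 09:50" --until "2023-01-17 10:20" \
  | grep -i -E 'out of memory|oom-kill'
systemctl status opensearch_2@production-elk7-eqiad.service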

Screenshot attachment: 2023-01-17-121246_1203x512_scrot.png (1203×512 px, 83 KB)

I'm filing a task to keep track of this, although I'm not sure there's anything immediately actionable.

Event Timeline

A significant difference I found between logstash102[345] and the rest of the logstash collectors (logstash103[012]) is that the former are pooled for dashboards traffic while the latter are not. As expected, there's more memory usage on the pooled hosts; my suspicion is that that's enough to push them over the edge:

root@logstash1030:~# ps fwwwaux | grep -i dashboards
opensea+   61526  0.4  1.2 965504 99992 ?        Ssl   2022 278:17 /usr/share/opensearch-dashboards/bin/../node/bin/node /usr/share/opensearch-dashboards/bin/../src/cli/dist
root     3633789  0.0  0.0   6244   652 pts/1    S+   10:48   0:00  \_ grep -i dashboards
root@logstash1023:~# ps fwwwaux | grep -i dashboards
opensea+ 1649202  0.9 10.4 1703876 853664 ?      Ssl   2022 501:59 /usr/share/opensearch-dashboards/bin/../node/bin/node /usr/share/opensearch-dashboards/bin/../src/cli/dist
root     3585178  0.0  0.0   6244   644 pts/2    S+   10:48   0:00  \_ grep -i dashboards
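
(Hypothetically, to compare in one shot rather than per-host, something along these lines should print the dashboards/node resident memory across all collectors; the ps invocation is my assumption, not a command from the incident.)

cumin -o txt 'O:logging::opensearch::collector AND *.eqiad.wmnet' \
  "ps -o rss=,cmd= -C node | grep opensearch-dashboards"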

It doesn't explain why logstash1024 OOM'd, but the apparent regularity of the logstash1023 OOMs coincides with the dashboards backup run (00:30Z).
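
A hypothetical way to check that correlation (assuming the backup is driven by a systemd timer on the collectors; adjust if it's cron):

# see whether anything dashboards/backup-related fires around 00:30Z
systemctl list-timers --all | grep -i -E 'dashboard|backup'
# compare against the timestamps of recent OOM kills in the kernel log
journalctl -k | grep -i oom-kill | tail -n 20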

Mentioned in SAL (#wikimedia-operations) [2023-01-19T15:42:14Z] <godog> bounce opensearch on logstash102[34] - T327161

I noticed that opensearch-dashboards / node is using considerably more memory (about 2x) on logstash1023 and logstash1024, but not on logstash1025 (also pooled for traffic). Since at the moment I don't have many more ideas, I'm going to test a roll-restart of opensearch-dashboards across the logstash collectors in eqiad, and we can keep monitoring memory usage.
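
For reference, the roll-restart can be done with cumin one host at a time, e.g. something like the sketch below (batch/sleep flags are my assumption, not the exact invocation used):

cumin -b 1 -s 60 'O:logging::opensearch::collector AND *.eqiad.wmnet' \
  'systemctl restart opensearch-dashboards'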

Mentioned in SAL (#wikimedia-operations) [2023-01-19T16:48:22Z] <godog> roll-restart opensearch-dashboards in logstash collectors eqiad - T327161

For reference this is the memory usage post-restart:

# cumin 'O:logging::opensearch::collector AND *.eqiad.wmnet'   'systemctl show opensearch-dashboards -p MemoryCurrent'
6 hosts will be targeted:
logstash[1023-1025,1030-1032].eqiad.wmnet
OK to proceed on 6 hosts? Enter the number of affected hosts to confirm or "q" to quit: 6
===== NODE GROUP =====
(1) logstash1032.eqiad.wmnet
----- OUTPUT of 'systemctl show o...-p MemoryCurrent' -----
MemoryCurrent=141156352
===== NODE GROUP =====
(1) logstash1030.eqiad.wmnet
----- OUTPUT of 'systemctl show o...-p MemoryCurrent' -----
MemoryCurrent=140750848
===== NODE GROUP =====
(1) logstash1031.eqiad.wmnet
----- OUTPUT of 'systemctl show o...-p MemoryCurrent' -----
MemoryCurrent=140951552
===== NODE GROUP =====
(1) logstash1024.eqiad.wmnet
----- OUTPUT of 'systemctl show o...-p MemoryCurrent' -----
MemoryCurrent=154124288
===== NODE GROUP =====
(1) logstash1025.eqiad.wmnet
----- OUTPUT of 'systemctl show o...-p MemoryCurrent' -----
MemoryCurrent=139943936
===== NODE GROUP =====
(1) logstash1023.eqiad.wmnet
----- OUTPUT of 'systemctl show o...-p MemoryCurrent' -----
MemoryCurrent=128262144
================

Usage seems relatively stable so far.

# cumin 'O:logging::opensearch::collector AND *.eqiad.wmnet'   'systemctl show opensearch-dashboards -p MemoryCurrent'
6 hosts will be targeted:
logstash[1023-1025,1030-1032].eqiad.wmnet
OK to proceed on 6 hosts? Enter the number of affected hosts to confirm or "q" to quit: 6
===== NODE GROUP =====                                                                                    
(1) logstash1030.eqiad.wmnet                                                                              
----- OUTPUT of 'systemctl show o...-p MemoryCurrent' -----                                               
MemoryCurrent=101892096                                                                                   
===== NODE GROUP =====                                                                                    
(1) logstash1032.eqiad.wmnet                                                                              
----- OUTPUT of 'systemctl show o...-p MemoryCurrent' -----                                               
MemoryCurrent=95625216                                                                                    
===== NODE GROUP =====                                                                                    
(1) logstash1031.eqiad.wmnet                                                                              
----- OUTPUT of 'systemctl show o...-p MemoryCurrent' -----                                               
MemoryCurrent=100487168                                                                                   
===== NODE GROUP =====                                                                                    
(1) logstash1023.eqiad.wmnet                                                                              
----- OUTPUT of 'systemctl show o...-p MemoryCurrent' -----                                               
MemoryCurrent=105054208                                                                                   
===== NODE GROUP =====                                                                                    
(1) logstash1024.eqiad.wmnet                                                                              
----- OUTPUT of 'systemctl show o...-p MemoryCurrent' -----                                               
MemoryCurrent=120676352                                                                                   
===== NODE GROUP =====                                                                                    
(1) logstash1025.eqiad.wmnet                                                                              
----- OUTPUT of 'systemctl show o...-p MemoryCurrent' -----                                               
MemoryCurrent=118587392                                                                                   
================

We completed the upgrade to 2.4.1 yesterday. No OOMs thus far.

We can definitely see the difference in memory usage between pooled and unpooled hosts, and perhaps a slow creep in memory usage?

# cumin -o txt 'O:logging::opensearch::collector AND *.eqiad.wmnet'   'systemctl show opensearch-dashboards -p MemoryCurrent' 
6 hosts will be targeted:
logstash[1023-1025,1030-1032].eqiad.wmnet
OK to proceed on 6 hosts? Enter the number of affected hosts to confirm or "q" to quit: 6
PASS |██████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 100% (6/6) [00:00<00:00,  9.39hosts/s]
FAIL |                                                                                                                                                                              |   0% (0/6) [00:00<?, ?hosts/s]
100.0% (6/6) success ratio (>= 100.0% threshold) for command: 'systemctl show o...-p MemoryCurrent'.
100.0% (6/6) success ratio (>= 100.0% threshold) of nodes successfully executed all commands.
_____FORMATTED_OUTPUT_____
logstash1023.eqiad.wmnet: MemoryCurrent=137084928
logstash1024.eqiad.wmnet: MemoryCurrent=145866752
logstash1025.eqiad.wmnet: MemoryCurrent=139403264
logstash1030.eqiad.wmnet: MemoryCurrent=97632256
logstash1031.eqiad.wmnet: MemoryCurrent=99639296
logstash1032.eqiad.wmnet: MemoryCurrent=94941184
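
(To track the apparent creep without pasting snapshots by hand, a hypothetical loop like the following could record a timestamped MemoryCurrent sample per host once a day; hostnames and the output path are illustrative.)

hosts="logstash1023 logstash1024 logstash1025 logstash1030 logstash1031 logstash1032"
while true; do
  for h in $hosts; do
    printf '%s %s %s\n' "$(date -u +%FT%TZ)" "$h" \
      "$(ssh "$h.eqiad.wmnet" systemctl show opensearch-dashboards -p MemoryCurrent)"
  done >> /tmp/dashboards-mem.log
  sleep 86400   # one sample per day
done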

It's easy to see now that there's indeed a (smallish) leak in opensearch-dashboards: logstash102[345] are pooled for dashboards queries whereas logstash103[012] are not.

# cumin -o txt 'O:logging::opensearch::collector AND *.eqiad.wmnet'   'systemctl show opensearch-dashboards -p MemoryCurrent' 
6 hosts will be targeted:
logstash[1023-1025,1030-1032].eqiad.wmnet
OK to proceed on 6 hosts? Enter the number of affected hosts to confirm or "q" to quit: 6
PASS |█████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 100% (6/6) [00:00<00:00,  8.65hosts/s]
FAIL |                                                                                                                         |   0% (0/6) [00:00<?, ?hosts/s]
100.0% (6/6) success ratio (>= 100.0% threshold) for command: 'systemctl show o...-p MemoryCurrent'.
100.0% (6/6) success ratio (>= 100.0% threshold) of nodes successfully executed all commands.
_____FORMATTED_OUTPUT_____
logstash1023.eqiad.wmnet: MemoryCurrent=174604288
logstash1024.eqiad.wmnet: MemoryCurrent=179200000
logstash1025.eqiad.wmnet: MemoryCurrent=204603392
logstash1030.eqiad.wmnet: MemoryCurrent=105046016
logstash1031.eqiad.wmnet: MemoryCurrent=105181184
logstash1032.eqiad.wmnet: MemoryCurrent=104992768

Dashboards has been running since the 19th. Another datapoint for today.

logstash1023.eqiad.wmnet: MemoryCurrent=194437120
logstash1024.eqiad.wmnet: MemoryCurrent=203722752
logstash1025.eqiad.wmnet: MemoryCurrent=252903424
logstash1030.eqiad.wmnet: MemoryCurrent=100864000
logstash1031.eqiad.wmnet: MemoryCurrent=103632896
logstash1032.eqiad.wmnet: MemoryCurrent=131219456

Another datapoint for today. Still running since the 19th.

logstash1023.eqiad.wmnet: MemoryCurrent=218132480
logstash1024.eqiad.wmnet: MemoryCurrent=237465600
logstash1025.eqiad.wmnet: MemoryCurrent=308187136
logstash1030.eqiad.wmnet: MemoryCurrent=106106880
logstash1031.eqiad.wmnet: MemoryCurrent=108367872
logstash1032.eqiad.wmnet: MemoryCurrent=109477888

Change 887767 had a related patch set uploaded (by Filippo Giunchedi; author: Filippo Giunchedi):

[operations/puppet@production] opensearch_dashboards: enforce memory limit

https://gerrit.wikimedia.org/r/887767

Change 887767 merged by Filippo Giunchedi:

[operations/puppet@production] opensearch_dashboards: enforce memory limit

https://gerrit.wikimedia.org/r/887767
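
(For context, a hypothetical sketch of what enforcing the limit could look like as a plain systemd drop-in; the actual change is puppet-managed, and the 350M figure is taken from the follow-up comment below.)

mkdir -p /etc/systemd/system/opensearch-dashboards.service.d
cat <<'EOF' > /etc/systemd/system/opensearch-dashboards.service.d/memory.conf
[Service]
MemoryMax=350M
EOF
systemctl daemon-reload
systemctl restart opensearch-dashboards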

Mentioned in SAL (#wikimedia-operations) [2023-02-09T09:32:31Z] <godog> roll-restart opensearch-dashboards to apply memory limit - T327161

fgiunchedi changed the task status from Open to Stalled. Feb 9 2023, 9:50 AM

I have applied the memory limit and restarted the service:

logstash1023.eqiad.wmnet: MemoryCurrent=166506496
logstash1024.eqiad.wmnet: MemoryCurrent=167198720
logstash1025.eqiad.wmnet: MemoryCurrent=169095168
logstash1030.eqiad.wmnet: MemoryCurrent=161697792
logstash1031.eqiad.wmnet: MemoryCurrent=167231488
logstash1032.eqiad.wmnet: MemoryCurrent=164466688

Stalling the task; I'll check back in three weeks or so.

The current memory limit of 350M causes systemd to kill opensearch-dashboards when it tries to run backups. In testing, 650M allows the backup to succeed.
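
A hypothetical way to verify the bumped limit on one host before landing the puppet change: raise MemoryMax at runtime, trigger a backup, and watch the journal for oom-kill events.

systemctl set-property --runtime opensearch-dashboards MemoryMax=650M
systemctl show opensearch-dashboards -p MemoryMax -p MemoryCurrent
# run/trigger the dashboards backup here, then look for kills:
journalctl -u opensearch-dashboards --since "15 min ago" | grep -i -E 'oom|memory'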

Change 888165 had a related patch set uploaded (by Filippo Giunchedi; author: Filippo Giunchedi):

[operations/puppet@production] opensearch_dashboards: bump memory limit

https://gerrit.wikimedia.org/r/888165

Change 888165 merged by Filippo Giunchedi:

[operations/puppet@production] opensearch_dashboards: bump memory limit

https://gerrit.wikimedia.org/r/888165

The MemoryMax setting appears to regularly interfere with the dashboards backup process. Until we can dig in and properly find the leak, maybe lift the max memory constraint and try restarting on an interval? Will try this on Beta first.
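
A minimal sketch of what the interval restart could look like as a systemd timer (hypothetical unit names and schedule; the actual implementation lives in the linked puppet change):

cat <<'EOF' > /etc/systemd/system/restart-opensearch-dashboards.service
[Unit]
Description=Restart opensearch-dashboards (memory leak mitigation)

[Service]
Type=oneshot
ExecStart=/bin/systemctl try-restart opensearch-dashboards.service
EOF

cat <<'EOF' > /etc/systemd/system/restart-opensearch-dashboards.timer
[Unit]
Description=Weekly restart of opensearch-dashboards

[Timer]
OnCalendar=Wed *-*-* 00:00:00
RandomizedDelaySec=1h

[Install]
WantedBy=timers.target
EOF

systemctl daemon-reload
systemctl enable --now restart-opensearch-dashboards.timer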

Change 891394 had a related patch set uploaded (by Cwhite; author: Cwhite):

[operations/puppet@production] opensearch_dashboards: restart every 7days

https://gerrit.wikimedia.org/r/891394

Change 891394 merged by Cwhite:

[operations/puppet@production] toil: restart opensearch-dashboards every wednesday

https://gerrit.wikimedia.org/r/891394

fgiunchedi claimed this task.

I'm optimistically resolving this since the current mitigation, while imperfect, has worked fine.