Today we observed OpenSearch OOM'ing on logstash102[34] (only!):
09:56 -wikibugs:#wikimedia-operations- (CR) Clément Goubert: [V: +1 C: +2] P:logstash::production: mediawiki-php-fpm-slowlog [puppet] - https://gerrit.wikimedia.org/r/879417 (https://phabricator.wikimedia.org/T326794) (owner: Clément Goubert)
09:58 -icinga-wm:#wikimedia-operations- PROBLEM - Check unit status of httpbb_kubernetes_hourly on cumin2002 is CRITICAL: CRITICAL: Status of the systemd unit httpbb_kubernetes_hourly https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state
09:59 -icinga-wm:#wikimedia-operations- RECOVERY - Check systemd state on thanos-fe1001 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
10:07 -icinga-wm:#wikimedia-operations- RECOVERY - Check systemd state on cumin2002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
10:08 -icinga-wm:#wikimedia-operations- PROBLEM - Check systemd state on logstash1023 is CRITICAL: CRITICAL - degraded: The following units failed: opensearch_2@production-elk7-eqiad.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
10:08 -icinga-wm:#wikimedia-operations- PROBLEM - OpenSearch health check for shards on 9200 on logstash1023 is CRITICAL: CRITICAL - elasticsearch http://localhost:9200/_cluster/health error while fetching: HTTPConnectionPool(host=localhost, port=9200): Max retries exceeded with url: /_cluster/health (Caused by NewConnectionError(urllib3.connection.HTTPConnection object at 0x7fb16ae6d280: Failed to establish a new connection: [Errno 111] Connection refused)) https://wikitech.wi
10:08 -icinga-wm:#wikimedia-operations- org/wiki/Search%23Administration
10:08 -icinga-wm:#wikimedia-operations- RECOVERY - Check unit status of httpbb_kubernetes_hourly on cumin2002 is OK: OK: Status of the systemd unit httpbb_kubernetes_hourly https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state
10:10 -icinga-wm:#wikimedia-operations- PROBLEM - OpenSearch health check for shards on 9200 on logstash1024 is CRITICAL: CRITICAL - elasticsearch http://localhost:9200/_cluster/health error while fetching: HTTPConnectionPool(host=localhost, port=9200): Max retries exceeded with url: /_cluster/health (Caused by NewConnectionError(urllib3.connection.HTTPConnection object at 0x7fa688a4a280: Failed to establish a new connection: [Errno 111] Connection refused)) https://wikitech.wi
10:10 -icinga-wm:#wikimedia-operations- org/wiki/Search%23Administration
10:10 -icinga-wm:#wikimedia-operations- PROBLEM - Check systemd state on logstash1024 is CRITICAL: CRITICAL - degraded: The following units failed: opensearch_2@production-elk7-eqiad.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
10:11 -logmsgbot:#wikimedia-operations- !log jnuche@deploy1002 Finished scap: testwikis wikis to 1.40.0-wmf.19 refs T325582 (duration: 42m 26s)
10:11 -stashbot:#wikimedia-operations- T325582: 1.40.0-wmf.19 deployment blockers - https://phabricator.wikimedia.org/T325582
10:12 -logmsgbot:#wikimedia-operations- !log jnuche@deploy1002 scap failed: average error rate on 9/9 canaries increased by 10x (rerun with --force to override this check, see https://logstash.wikimedia.org for details)
10:12 -jinxer-wm:#wikimedia-operations- (LogstashNoLogsIndexed) firing: Logstash logs are not being indexed by Elasticsearch - https://wikitech.wikimedia.org/wiki/Logstash#No_logs_indexed - https://grafana.wikimedia.org/d/000000561/logstash?var-datasource=eqiad%20prometheus/ops - https://alerts.wikimedia.org/?q=alertname%3DLogstashNoLogsIndexed
10:13 <godog> mmhh not sure what happened yet with logstash there, cc jnuche as it might be related to the canaries check
10:14 -icinga-wm:#wikimedia-operations- PROBLEM - Check systemd state on mwmaint1002 is CRITICAL: CRITICAL - degraded: The following units failed: mediawiki_job_startupregistrystats-testwiki.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
10:16 <godog> !log restart opensearch_2@production-elk7-eqiad.service on logstash102[34]
10:16 -stashbot:#wikimedia-operations- Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
10:16 <godog> (OOM)
10:16 -jinxer-wm:#wikimedia-operations- (LogstashKafkaConsumerLag) firing: Too many messages in kafka logging - https://wikitech.wikimedia.org/wiki/Logstash#Kafka_consumer_lag - https://grafana.wikimedia.org/d/000000484/kafka-consumer-lag?var-cluster=logging-codfw&var-datasource=codfw%20prometheus/ops - https://alerts.wikimedia.org/?q=alertname%3DLogstashKafkaConsumerLag
10:17 -icinga-wm:#wikimedia-operations- RECOVERY - Check systemd state on logstash1023 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
10:18 -icinga-wm:#wikimedia-operations- RECOVERY - OpenSearch health check for shards on 9200 on logstash1023 is OK: OK - elasticsearch status production-elk7-eqiad: cluster_name: production-elk7-eqiad, status: green, timed_out: False, number_of_nodes: 16, number_of_data_nodes: 10, discovered_master: True, discovered_cluster_manager: True, active_primary_shards: 660, active_shards: 1489, relocating_shards: 0, initializing_shards: 0, unassigned_shards: 0, delayed_unassigned_shar
10:18 -icinga-wm:#wikimedia-operations- umber_of_pending_tasks: 0, number_of_in_flight_fetch: 0, task_max_waiting_in_queue_millis: 0, active_shards_percent_as_number: 100.0 https://wikitech.wikimedia.org/wiki/Search%23Administration
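For reference, the OpenSearch health check that flapped above boils down to a GET against the local cluster health endpoint. Here is a minimal Python sketch of what it does; this is not the actual Icinga plugin, and the status-to-exit-code mapping is illustrative:

```
#!/usr/bin/env python3
"""Minimal sketch of the OpenSearch cluster health probe seen above.

Not the real check: it just shows what the probe amounts to, i.e. a GET
of /_cluster/health on the local node mapped to a Nagios-style exit code.
"""
import sys

import requests  # the traceback above shows a urllib3-based HTTP client

EXIT_CODES = {"green": 0, "yellow": 1, "red": 2}


def main() -> int:
    try:
        resp = requests.get("http://localhost:9200/_cluster/health", timeout=5)
        resp.raise_for_status()
    except requests.exceptions.RequestException as exc:
        # The failure mode in the log: the JVM was OOM-killed, nothing is
        # listening on 9200, so the TCP connect is refused.
        print(f"CRITICAL - error while fetching: {exc}")
        return 2
    health = resp.json()
    status = health.get("status", "unknown")
    label = {"green": "OK", "yellow": "WARNING"}.get(status, "CRITICAL")
    print(f"{label} - cluster {health.get('cluster_name')} status: {status}, "
          f"active_shards_percent: {health.get('active_shards_percent_as_number')}")
    return EXIT_CODES.get(status, 3)


if __name__ == "__main__":
    sys.exit(main())
```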
I'm not yet sure of the exact cause. Puppet applying https://gerrit.wikimedia.org/r/879417 does line up with an increase in memory usage, and that may have pushed logstash102[34] over the edge; logstash1025, however, applied the same change about twenty minutes later (10:26) without OOM'ing:
logstash1023/syslog.log:Jan 17 10:05:26 logstash1023 puppet-agent[3407472]: (/Stage[main]/Profile::Logstash::Production/Logstash::Input::Kafka[mediawiki-php-fpm-slowlog-eqiad]/Logstash::Conf[input-kafka-mediawiki-php-fpm-slowlog-eqiad]/File[/etc/logstash/conf.d/10-input-kafka-mediawiki-php-fpm-slowlog-eqiad.conf]) Scheduling refresh of Service[logstash]
logstash1024/syslog.log:Jan 17 10:06:48 logstash1024 puppet-agent[3414332]: (/Stage[main]/Profile::Logstash::Production/Logstash::Input::Kafka[mediawiki-php-fpm-slowlog-eqiad]/Logstash::Conf[input-kafka-mediawiki-php-fpm-slowlog-eqiad]/File[/etc/logstash/conf.d/10-input-kafka-mediawiki-php-fpm-slowlog-eqiad.conf]) Scheduling refresh of Service[logstash]
logstash1025/syslog.log:Jan 17 10:26:24 logstash1025 puppet-agent[3438606]: (/Stage[main]/Profile::Logstash::Production/Logstash::Input::Kafka[mediawiki-php-fpm-slowlog-eqiad]/Logstash::Conf[input-kafka-mediawiki-php-fpm-slowlog-eqiad]/File[/etc/logstash/conf.d/10-input-kafka-mediawiki-php-fpm-slowlog-eqiad.conf]) Scheduling refresh of Service[logstash]
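To pin down the memory correlation, the per-host memory curves around these puppet runs could be pulled from Prometheus. A rough sketch follows; the Prometheus URL is a placeholder (not our actual ops endpoint), and it assumes the standard node_exporter node_memory_MemAvailable_bytes metric:

```
#!/usr/bin/env python3
"""Sketch: pull available memory around the puppet runs to eyeball the
correlation between the config refresh and the OOMs.

Assumptions: a reachable Prometheus with node_exporter data at PROM_URL
(placeholder) and the standard node_memory_MemAvailable_bytes metric.
Illustrative only, not the dashboard we actually use.
"""
import requests

PROM_URL = "http://prometheus.example.org/api/v1/query_range"  # placeholder


def mem_available(host: str, start: str, end: str) -> list:
    """Return [timestamp, bytes] samples for one host over a time window."""
    params = {
        "query": f'node_memory_MemAvailable_bytes{{instance="{host}:9100"}}',
        "start": start,  # RFC3339 or unix timestamp, per the Prometheus HTTP API
        "end": end,
        "step": "60s",
    }
    resp = requests.get(PROM_URL, params=params, timeout=10)
    resp.raise_for_status()
    result = resp.json()["data"]["result"]
    return result[0]["values"] if result else []


# Puppet refreshed logstash on each host at these times (from syslog above);
# compare the minutes before/after on 1023/1024 (OOM'd) vs 1025 (did not).
for host, refresh in [("logstash1023", "10:05"), ("logstash1024", "10:06"),
                      ("logstash1025", "10:26")]:
    samples = mem_available(host, "2023-01-17T09:45:00Z", "2023-01-17T10:45:00Z")
    print(host, f"(refresh {refresh}):", len(samples), "samples")
```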
I'm filing this task to keep track of the issue, although I'm not sure there's anything immediately actionable.