logstash-beta.wmflabs.org does not receive any mediawiki events
Closed, Resolved · Public

Description

https://logstash-beta.wmflabs.org/ is mostly empty. If you've come here looking for how to find log events on beta: you can ssh to deployment-fluorine02.deployment-prep.eqiad.wmflabs!

Looking at hosts deployment-logstash2 and deployment-logstash03, the logstash process there started failing around 2019-08-16 13:25.

The first error I found:

[2019-08-16T13:26:00,667][ERROR][logstash.plugins.registry] Problems loading a plugin with {:type=>"filter", :name=>"truncate", :path=>"logstash/filters/truncate", :error_message=>"NameError", :error_class=>NameError, :error_backtrace=>["/usr/share/logstash/logstash-core/lib/logstash/plugins/registry.rb:226:in `namespace_lookup'",
"/usr/share/logstash/logstash-core/lib/logstash/plugins/registry.rb:162:in `legacy_lookup'",
"/usr/share/logstash/logstash-core/lib/logstash/plugins/registry.rb:138:in `lookup'",
"/usr/share/logstash/logstash-core/lib/logstash/plugins/registry.rb:180:in `lookup_pipeline_plugin'",
"/usr/share/logstash/logstash-core/lib/logstash/plugin.rb:140:in `lookup'",
"/usr/share/logstash/logstash-core/lib/logstash/pipeline.rb:103:in `plugin'",
"(eval):157:in `initialize'",
"org/jruby/RubyKernel.java:1079:in `eval'",
"/usr/share/logstash/logstash-core/lib/logstash/pipeline.rb:75:in `initialize'",
"/usr/share/logstash/logstash-core/lib/logstash/pipeline.rb:165:in `initialize'",
"/usr/share/logstash/logstash-core/lib/logstash/agent.rb:296:in `create_pipeline'",
"/usr/share/logstash/logstash-core/lib/logstash/agent.rb:95:in `register_pipeline'",
"/usr/share/logstash/logstash-core/lib/logstash/runner.rb:313:in `execute'",
"/usr/share/logstash/vendor/bundle/jruby/1.9/gems/clamp-0.6.5/lib/clamp/command.rb:67:in `run'",
"/usr/share/logstash/logstash-core/lib/logstash/runner.rb:204:in `run'",
"/usr/share/logstash/vendor/bundle/jruby/1.9/gems/clamp-0.6.5/lib/clamp/command.rb:132:in `run'",
"/usr/share/logstash/lib/bootstrap/environment.rb:71:in `(root)'"]}

A more recent error:

[2019-09-17T17:24:33,329][INFO ][logstash.inputs.udp      ] Starting UDP listener {:address=>"0.0.0.0:11514"}
[2019-09-17T17:24:28,325][ERROR][logstash.inputs.udp      ] UDP listener died {
:exception=>#<Errno::EADDRINUSE: Address already in use - bind - Address already in use>,
:backtrace=>[
"org/jruby/ext/socket/RubyUDPSocket.java:161:in `bind'",
"/usr/share/logstash/vendor/bundle/jruby/1.9/gems/logstash-input-udp-3.3.4/lib/logstash/inputs/udp.rb:116:in `udp_listener'",
"/usr/share/logstash/vendor/bundle/jruby/1.9/gems/logstash-input-udp-3.3.4/lib/logstash/inputs/udp.rb:68:in `run'",
"/usr/share/logstash/logstash-core/lib/logstash/pipeline.rb:470:in `inputworker'",
"/usr/share/logstash/logstash-core/lib/logstash/pipeline.rb:463:in `start_input'"]
}

rsyslog is already listening on 0.0.0.0:11514
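
To confirm which process actually holds the port, a minimal check (assuming ss from iproute2 is available on the host):

# Show what is bound to UDP port 11514, with the owning process
sudo ss -ulnp | grep 11514
# In this case the owner was rsyslogd rather than logstash's java process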

deployment-logstash2 is the instance serving https://logstash-beta.wmflabs.org/

https://horizon.wikimedia.org/project/instances/15f1bf2e-30e0-487a-ab2a-f6fcc3e4c170/

Puppet classes:

role::logstash::apifeatureusage
role::logstash::collector prometheus_nodes: hiera('prometheus_nodes', [])
role::logstash::puppetreports
role::labs::lvm::srv
role::kafka::logging
role::kibana auth_type: None; vhost: None; auth_realm: undef; auth_file: undef; ldap_authurl: undef; ldap_binddn: undef; require_ssl: true; ldap_groups: []; serveradmin: None
profile::elastalert

Hiera:

elasticsearch::graylog_hosts: none
elasticsearch::version: 5
profile::elasticsearch::common_settings:
  auto_create_index: +logstash-*,-*
  cluster_hosts:
  - deployment-logstash2.deployment-prep.eqiad.wmflabs
  cluster_name: labs-logstash-eqiad
  expected_nodes: 1
  heap_memory: 5G
  minimum_master_nodes: 1
  recover_after_nodes: 1
  recover_after_time: 1m
  short_cluster_name: labs-logstash-eqiad
  unicast_hosts:
  - deployment-logstash2.deployment-prep.eqiad.wmflabs
profile::kafka::broker::kafka_cluster_name: logging-eqiad
profile::kafka::broker::ssl_enabled: true
profile::kafka::broker::ssl_password: qwerty
profile::logstash::collector::jmx_exporter_port: 7801
puppetmaster: deployment-puppetmaster03.deployment-prep.eqiad.wmflabs

Related Objects

Event Timeline


Change 540189 had a related patch set uploaded (by Herron; owner: Herron):
[operations/puppet@production] elasticsearch: add toggle for rsyslog udp logback compat include

https://gerrit.wikimedia.org/r/540189

Change 540189 abandoned by Herron:
elasticsearch: add toggle for rsyslog udp logback compat include

Reason:
implemented in a preferred way with I278486fb96878a7a45a608d785625d7db74e9953

https://gerrit.wikimedia.org/r/540189

herron claimed this task.

This occurred in prod as well after host reboots, and a fix has been deployed in puppet. The fix moves include ::profile::rsyslog::udp_json_logback_compat from profile::elasticsearch to profile::elasticsearch::cirrus, which prevents the rsyslog UDP 11514 listener from being deployed to logstash collectors, where it conflicts with logstash. I think we're in good shape here now; setting this to resolved.

Reopening since:

re: mediawiki events, searching for type:mediawiki does yield results for me (e.g. https://logstash-beta.wmflabs.org/goto/e617fd053ba33c26815ed92f1e51ba07 ). Is there a more specific query that does not yield the expected results?

re: deployment-logstash03, I believe its migration is tracked at T218729: Migrate deployment-prep away from Debian Jessie to Debian Stretch/Buster, but AFAICS it seems stalled.

Ran into what is probably this issue this morning while trying to investigate T238253 and T238256. I don't see the exceptions mentioned in those tasks either on deployment-fluorine02 or in logstash-beta.wmflabs.org.

logstash-beta now points at deployment-logstash03; deployment-logstash2 is still accessible at https://old-logstash-beta.wmflabs.org/ for the time being (I expect people will want to copy some stuff over; not sure what to do about that yet). Is this working better now?
I do see results at https://logstash-beta.wmflabs.org/app/kibana#/discover?_g=()&_a=(columns:!(_source),index:AW6GShqJw8E2oQN1xWTl,interval:auto,query:(query_string:(query:'type:mediawiki')),sort:!('@timestamp',desc))

Krinkle added subscribers: Krinkle, colewhite.

Looks like there are still, or again, no messages under type:mediawiki. No specific filter, just type:mediawiki overall. I tried e.g. enabling WikimediaDebug with verbose logging; that also didn't make anything show up there:

https://logstash-beta.wmflabs.org/app/kibana#/dashboard/mediawiki-errors

Seeing rsyslog complaining about "omkafka: kafka delivery FAIL" on deployment-prep hosts.

That was me. Will have another look at this today

Looking more closely, the problem was due to a "Broker: Leader not available" issue in the deployment-prep kafka logging cluster. After starting deployment-logstash2 back up (the instance had been stopped), logs are flowing again. Longer term we'll likely need another logstash stretch instance, and to migrate the broker id over from deployment-logstash2 to the new instance.
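
For reference, a "Leader not available" state can be confirmed with the stock Kafka CLI from a broker host. A sketch, assuming a 1.x-era Kafka where kafka-topics.sh talks to zookeeper, and that zookeeper runs locally:

# List partitions that currently have no leader
kafka-topics.sh --zookeeper localhost:2181 --describe --unavailable-partitions
# List partitions that are lagging on replication
kafka-topics.sh --zookeeper localhost:2181 --describe --under-replicated-partitions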

What exactly is needed? Do we need to make a second logstash stretch host?
Do we need to move something else over from the old instance? How does the
Kafka cluster interact with this beyond firing logs at it?

Do we just need to point some more things at the new instance we made (see also T238707)? Do we just need to remove deployment-logstash2 from the kafka_clusters / logging-eqiad hiera data set project-wide in horizon?

Yes, we will need a second logstash stretch instance, and to migrate the Kafka broker ID from deployment-logstash2 to the new host.

Well, we have deployment-logstash03; do we need a deployment-logstash04 as well? Why?
How does one migrate a Kafka broker ID?
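
For what it's worth, the broker id is a per-host setting, so a migration is roughly: stop the old broker, give the new broker the old host's id, and let it rejoin and re-replicate. A rough sketch, assuming stock Debian config paths:

# On deployment-logstash2: stop the old broker
sudo systemctl stop kafka
# On the new host: check the configured id, then set broker.id to the old
# host's value in the same file
grep broker.id /etc/kafka/server.properties
# Note: Kafka also caches the id in meta.properties under its data
# directory; a stale cached id there will prevent startup
sudo systemctl start kafka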

Mentioned in SAL (#wikimedia-releng) [2020-09-15T17:49:34Z] <Krinkle> Logstash in beta cluster appears broken. No messages from Scap or MediaWiki in several days. ref T233134

Can this be closed now? It's happened a number of times since this was originally reported, and is now working.

I don't know if the issues we saw this week and last month are the same as the one detailed in this task. If they are, then it is not resolved, as it keeps coming back (and this task is not about just the one incident in 2019).

If the root cause from then was solved and we are dealing with a new recurring issue, then we could close this and start another. I don't know if that's the case.

This happened again; it broke when I ran puppet on logstash03 while investigating T276521: deployment-logstash03 puppet errors. I tried restarting logstash, and it managed to bind port 11514, but still no events :(

Found this error message in /var/log/logstash/logstash-plain-2021-03-05.log:

Plugin: <LogStash::Inputs::Kafka tags=>["input-kafka-eventgate-main-validation-error-eqiad", "kafka", "es", "eventgate"], topics=>["eqiad.eventgate-main.error.validation"], type=>"eventgate_validation_error", codec=><LogStash::Codecs::JSON id=>"json_670029b3-b883-4e89-ab57-4e6e3a793d68", enable_metric=>true, charset=>"UTF-8">, consumer_threads=>3, bootstrap_servers=>"deployment-logstash03.deployment-prep.eqiad.wmflabs:9093", id=>"input/kafka/eventgate-main-validation-error-eqiad", security_protocol=>"SSL", ssl_truststore_location=>"/etc/logstash/kafka_logstash-eqiad.truststore.jks", ssl_truststore_password=><password>, enable_metric=>true, auto_commit_interval_ms=>"5000", client_id=>"logstash", enable_auto_commit=>"true", group_id=>"logstash", key_deserializer_class=>"org.apache.kafka.common.serialization.StringDeserializer", value_deserializer_class=>"org.apache.kafka.common.serialization.StringDeserializer", poll_timeout_ms=>100, ssl=>false, sasl_mechanism=>"GSSAPI", decorate_events=>false>
Error: uncaught throw Failed to construct kafka consumer in thread 0x6af5f10e

I'm not exactly sure what's causing it to break, but that looks like the reason no events are arriving.
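
"Failed to construct kafka consumer" at startup is frequently an SSL/truststore problem with this input. Two quick probes (a sketch, not specific to this incident):

# Does the broker's SSL listener answer and present a certificate?
openssl s_client -connect deployment-logstash03.deployment-prep.eqiad.wmflabs:9093 </dev/null 2>/dev/null | openssl x509 -noout -subject -dates
# Is the truststore referenced by the input present and readable?
# (keytool prompts for the truststore password)
sudo keytool -list -keystore /etc/logstash/kafka_logstash-eqiad.truststore.jks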

Mentioned in SAL (#wikimedia-releng) [2021-04-09T18:41:20Z] <Krinkle> Logstash in beta shows no messages from mediawiki for 24 hours, T233134.

Change 683695 had a related patch set uploaded (by Herron; author: Herron):

[operations/puppet@production] logstash: move kafka input configs to profile::logstash::kafka_inputs

https://gerrit.wikimedia.org/r/683695

Change 683695 abandoned by Herron:

[operations/puppet@production] logstash: move kafka input configs to profile::logstash::kafka_inputs

Reason:

did not work as expected

https://gerrit.wikimedia.org/r/683695

I'm hoping this was fixed for good with the ELK7 upgrade (T283013)

taavi removed taavi as the assignee of this task. Aug 8 2021, 5:48 PM
taavi removed a project: User-Majavah.

These days this is caused by ELK7 nodes filling their disks with useless messages in logstash's own log.

Here's a recent entry from deployment-logstash04.deployment-prep.eqiad1.wikimedia.cloud. There is an endless stream of these.

% tail -1 /var/log/logstash/logstash-json.log | jq .logEvent.message -r

A plugin had an unrecoverable error. Will restart this plugin.
  Pipeline_id:main
  Plugin: <LogStash::Inputs::Kafka codec=><LogStash::Codecs::JSON id=>"json_a9bc0ea4-d94e-418b-9880-21f1e4048406", enable_metric=>true, charset=>"UTF-8">, topics=>["codfw.eventgate-analytics.error.validation"], ssl_truststore_location=>"/etc/logstash/kafka_jumbo-eqiad.truststore.jks", ssl_truststore_password=><password>, consumer_threads=>3, security_protocol=>"SSL", id=>"input/kafka/eventgate-analytics-validation-error-codfw", type=>"eventgate_validation_error", bootstrap_servers=>"deployment-kafka-jumbo-1.deployment-prep.eqiad1.wikimedia.cloud:9093,deployment-kafka-jumbo-2.deployment-prep.eqiad1.wikimedia.cloud:9093,deployment-kafka-jumbo-3.deployment-prep.eqiad1.wikimedia.cloud:9093", tags=>["input-kafka-eventgate-analytics-validation-error-codfw", "kafka", "es", "eventgate"], enable_metric=>true, auto_commit_interval_ms=>5000, check_crcs=>true, client_dns_lookup=>"default", client_id=>"logstash", connections_max_idle_ms=>540000, enable_auto_commit=>true, fetch_max_bytes=>52428800, fetch_max_wait_ms=>500, group_id=>"logstash", heartbeat_interval_ms=>3000, isolation_level=>"read_uncommitted", key_deserializer_class=>"org.apache.kafka.common.serialization.StringDeserializer", max_poll_interval_ms=>300000, max_partition_fetch_bytes=>1048576, max_poll_records=>500, metadata_max_age_ms=>300000, receive_buffer_bytes=>32768, reconnect_backoff_ms=>50, request_timeout_ms=>40000, retry_backoff_ms=>100, send_buffer_bytes=>131072, session_timeout_ms=>10000, value_deserializer_class=>"org.apache.kafka.common.serialization.StringDeserializer", poll_timeout_ms=>100, sasl_mechanism=>"GSSAPI", decorate_events=>false>
  Error: Failed to construct kafka consumer
  Exception: Java::OrgApacheKafkaCommon::KafkaException
  Stack: org.apache.kafka.clients.consumer.KafkaConsumer.<init>(org/apache/kafka/clients/consumer/KafkaConsumer.java:820)
org.apache.kafka.clients.consumer.KafkaConsumer.<init>(org/apache/kafka/clients/consumer/KafkaConsumer.java:666)
org.apache.kafka.clients.consumer.KafkaConsumer.<init>(org/apache/kafka/clients/consumer/KafkaConsumer.java:646)
jdk.internal.reflect.GeneratedConstructorAccessor64.newInstance(jdk/internal/reflect/GeneratedConstructorAccessor64)
jdk.internal.reflect.DelegatingConstructorAccessorImpl.newInstance(jdk/internal/reflect/DelegatingConstructorAccessorImpl.java:45)
java.lang.reflect.Constructor.newInstance(java/lang/reflect/Constructor.java:490)
org.jruby.javasupport.JavaConstructor.newInstanceDirect(org/jruby/javasupport/JavaConstructor.java:285)
org.jruby.RubyClass.newInstance(org/jruby/RubyClass.java:918)
org.jruby.RubyClass$INVOKER$i$newInstance.call(org/jruby/RubyClass$INVOKER$i$newInstance.gen)
usr.share.logstash.vendor.bundle.jruby.$2_dot_5_dot_0.gems.logstash_minus_integration_minus_kafka_minus_10_dot_5_dot_1_minus_java.lib.logstash.inputs.kafka.invokeOther152:new(usr/share/logstash/vendor/bundle/jruby/$2_dot_5_dot_0/gems/logstash_minus_integration_minus_kafka_minus_10_dot_5_dot_1_minus_java/lib/logstash/inputs//usr/share/logstash/vendor/bundle/jruby/2.5.0/gems/logstash-integration-kafka-10.5.1-java/lib/logstash/inputs/kafka.rb:346)
usr.share.logstash.vendor.bundle.jruby.$2_dot_5_dot_0.gems.logstash_minus_integration_minus_kafka_minus_10_dot_5_dot_1_minus_java.lib.logstash.inputs.kafka.create_consumer(/usr/share/logstash/vendor/bundle/jruby/2.5.0/gems/logstash-integration-kafka-10.5.1-java/lib/logstash/inputs/kafka.rb:346)
usr.share.logstash.vendor.bundle.jruby.$2_dot_5_dot_0.gems.logstash_minus_integration_minus_kafka_minus_10_dot_5_dot_1_minus_java.lib.logstash.inputs.kafka.invokeOther1:create_consumer(usr/share/logstash/vendor/bundle/jruby/$2_dot_5_dot_0/gems/logstash_minus_integration_minus_kafka_minus_10_dot_5_dot_1_minus_java/lib/logstash/inputs//usr/share/logstash/vendor/bundle/jruby/2.5.0/gems/logstash-integration-kafka-10.5.1-java/lib/logstash/inputs/kafka.rb:243)
usr.share.logstash.vendor.bundle.jruby.$2_dot_5_dot_0.gems.logstash_minus_integration_minus_kafka_minus_10_dot_5_dot_1_minus_java.lib.logstash.inputs.kafka.run(/usr/share/logstash/vendor/bundle/jruby/2.5.0/gems/logstash-integration-kafka-10.5.1-java/lib/logstash/inputs/kafka.rb:243)
org.jruby.RubyEnumerable$22.call(org/jruby/RubyEnumerable.java:902)
org.jruby.RubyEnumerator$2.call(org/jruby/RubyEnumerator.java:404)
org.jruby.RubyFixnum.times(org/jruby/RubyFixnum.java:291)
org.jruby.RubyInteger$INVOKER$i$0$0$times.call(org/jruby/RubyInteger$INVOKER$i$0$0$times.gen)
org.jruby.RubyClass.finvokeWithRefinements(org/jruby/RubyClass.java:514)
org.jruby.RubyClass.finvoke(org/jruby/RubyClass.java:502)
org.jruby.RubyBasicObject.callMethod(org/jruby/RubyBasicObject.java:393)
org.jruby.RubyEnumerator.__each__(org/jruby/RubyEnumerator.java:400)
org.jruby.RubyEnumerator.each(org/jruby/RubyEnumerator.java:396)
org.jruby.RubyEnumerator$INVOKER$i$each.call(org/jruby/RubyEnumerator$INVOKER$i$each.gen)
org.jruby.RubyClass.finvokeWithRefinements(org/jruby/RubyClass.java:497)
org.jruby.RubyClass.finvoke(org/jruby/RubyClass.java:487)
org.jruby.RubyEnumerable.callEach19(org/jruby/RubyEnumerable.java:119)
org.jruby.RubyEnumerable.collectCommon(org/jruby/RubyEnumerable.java:894)
org.jruby.RubyEnumerable.map(org/jruby/RubyEnumerable.java:886)
org.jruby.RubyEnumerable$INVOKER$s$0$0$map.call(org/jruby/RubyEnumerable$INVOKER$s$0$0$map.gen)
usr.share.logstash.vendor.bundle.jruby.$2_dot_5_dot_0.gems.logstash_minus_integration_minus_kafka_minus_10_dot_5_dot_1_minus_java.lib.logstash.inputs.kafka.invokeOther6:map(usr/share/logstash/vendor/bundle/jruby/$2_dot_5_dot_0/gems/logstash_minus_integration_minus_kafka_minus_10_dot_5_dot_1_minus_java/lib/logstash/inputs//usr/share/logstash/vendor/bundle/jruby/2.5.0/gems/logstash-integration-kafka-10.5.1-java/lib/logstash/inputs/kafka.rb:243)
usr.share.logstash.vendor.bundle.jruby.$2_dot_5_dot_0.gems.logstash_minus_integration_minus_kafka_minus_10_dot_5_dot_1_minus_java.lib.logstash.inputs.kafka.run(/usr/share/logstash/vendor/bundle/jruby/2.5.0/gems/logstash-integration-kafka-10.5.1-java/lib/logstash/inputs/kafka.rb:243)
RUBY.inputworker(/usr/share/logstash/logstash-core/lib/logstash/java_pipeline.rb:405)
RUBY.start_input(/usr/share/logstash/logstash-core/lib/logstash/java_pipeline.rb:396)
org.jruby.RubyProc.call(org/jruby/RubyProc.java:318)
java.lang.Thread.run(java/lang/Thread.java:834)
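
Since it's the restart loop itself that fills the disk, a quick triage on an affected node looks something like this (assuming the log paths shown above):

# How full is the filesystem holding logstash's own logs, and who's to blame?
df -h /var/log
sudo du -sh /var/log/logstash/
# Reclaim space without disturbing the open file handle
sudo truncate -s 0 /var/log/logstash/logstash-json.log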

As part of T288618 work, we've set up a separate cluster that ingests deployment-prep's logs here: https://beta-logs.wmcloud.org/

See https://wikitech.wikimedia.org/wiki/Logstash#Beta_Cluster_Logstash for how to get login details (same login as logstash-beta.wmcloud.org).

Thanks @colewhite! I don't yet understand the beta logging "stuff" very well. Can you explain what needs to happen next with deployment-logstash04? Can it be turned off? It's unclear to me whether anything references it. Its root filesystem fills up every day, making it impossible to run error-free commands across all deployment-prep nodes.

Mention: T288989

colewhite claimed this task.

This task is fairly old, and the landscape has changed since it was submitted. If this problem recurs on the new cluster, please file a new task.