
logstash-beta.wmflabs.org does not receive any mediawiki events
Open, High, Public

Description

https://logstash-beta.wmflabs.org/ is mostly empty. If you've come here looking for how to find log events on beta: you can ssh to deployment-fluorine02.deployment-prep.eqiad.wmflabs!

On hosts deployment-logstash2 and deployment-logstash03, the logstash process started failing around 2019-08-16 13:25.

The first error I found:

[2019-08-16T13:26:00,667][ERROR][logstash.plugins.registry] Problems loading a plugin with {:type=>"filter", :name=>"truncate", :path=>"logstash/filters/truncate", :error_message=>"NameError", :error_class=>NameError, :error_backtrace=>["/usr/share/logstash/logstash-core/lib/logstash/plugins/registry.rb:226:in `namespace_lookup'",
"/usr/share/logstash/logstash-core/lib/logstash/plugins/registry.rb:162:in `legacy_lookup'",
"/usr/share/logstash/logstash-core/lib/logstash/plugins/registry.rb:138:in `lookup'",
"/usr/share/logstash/logstash-core/lib/logstash/plugins/registry.rb:180:in `lookup_pipeline_plugin'",
"/usr/share/logstash/logstash-core/lib/logstash/plugin.rb:140:in `lookup'",
"/usr/share/logstash/logstash-core/lib/logstash/pipeline.rb:103:in `plugin'",
"(eval):157:in `initialize'",
"org/jruby/RubyKernel.java:1079:in `eval'",
"/usr/share/logstash/logstash-core/lib/logstash/pipeline.rb:75:in `initialize'",
"/usr/share/logstash/logstash-core/lib/logstash/pipeline.rb:165:in `initialize'",
"/usr/share/logstash/logstash-core/lib/logstash/agent.rb:296:in `create_pipeline'",
"/usr/share/logstash/logstash-core/lib/logstash/agent.rb:95:in `register_pipeline'",
"/usr/share/logstash/logstash-core/lib/logstash/runner.rb:313:in `execute'",
"/usr/share/logstash/vendor/bundle/jruby/1.9/gems/clamp-0.6.5/lib/clamp/command.rb:67:in `run'",
"/usr/share/logstash/logstash-core/lib/logstash/runner.rb:204:in `run'",
"/usr/share/logstash/vendor/bundle/jruby/1.9/gems/clamp-0.6.5/lib/clamp/command.rb:132:in `run'",
"/usr/share/logstash/lib/bootstrap/environment.rb:71:in `(root)'"]}

More recent errors:

[2019-09-17T17:24:33,329][INFO ][logstash.inputs.udp      ] Starting UDP listener {:address=>"0.0.0.0:11514"}
[2019-09-17T17:24:28,325][ERROR][logstash.inputs.udp      ] UDP listener died {
:exception=>#<Errno::EADDRINUSE: Address already in use - bind - Address already in use>,
:backtrace=>[
"org/jruby/ext/socket/RubyUDPSocket.java:161:in `bind'",
"/usr/share/logstash/vendor/bundle/jruby/1.9/gems/logstash-input-udp-3.3.4/lib/logstash/inputs/udp.rb:116:in `udp_listener'",
"/usr/share/logstash/vendor/bundle/jruby/1.9/gems/logstash-input-udp-3.3.4/lib/logstash/inputs/udp.rb:68:in `run'",
"/usr/share/logstash/logstash-core/lib/logstash/pipeline.rb:470:in `inputworker'",
"/usr/share/logstash/logstash-core/lib/logstash/pipeline.rb:463:in `start_input'"]
}

rsyslog is already listening on port 11514, which prevents logstash from binding 0.0.0.0:11514.
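
A quick way to confirm which process holds the port (a sketch; assumes ss from iproute2 is available on the instance):

sudo ss -ulpn | grep 11514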

deployment-logstash2 is the instance serving https://logstash-beta.wmflabs.org/

https://horizon.wikimedia.org/project/instances/15f1bf2e-30e0-487a-ab2a-f6fcc3e4c170/

Puppet classes:

role::logstash::apifeatureusage
role::logstash::collector prometheus_nodes: hiera('prometheus_nodes', [])
role::logstash::puppetreports
role::labs::lvm::srv
role::kafka::logging
role::kibana auth_type: None; vhost: None; auth_realm: undef; auth_file: undef; ldap_authurl: undef; ldap_binddn: undef; require_ssl: true; ldap_groups: []; serveradmin: None
profile::elastalert

Hiera:

elasticsearch::graylog_hosts: none
elasticsearch::version: 5
profile::elasticsearch::common_settings:
  auto_create_index: +logstash-*,-*
  cluster_hosts:
  - deployment-logstash2.deployment-prep.eqiad.wmflabs
  cluster_name: labs-logstash-eqiad
  expected_nodes: 1
  heap_memory: 5G
  minimum_master_nodes: 1
  recover_after_nodes: 1
  recover_after_time: 1m
  short_cluster_name: labs-logstash-eqiad
  unicast_hosts:
  - deployment-logstash2.deployment-prep.eqiad.wmflabs
profile::kafka::broker::kafka_cluster_name: logging-eqiad
profile::kafka::broker::ssl_enabled: true
profile::kafka::broker::ssl_password: qwerty
profile::logstash::collector::jmx_exporter_port: 7801
puppetmaster: deployment-puppetmaster03.deployment-prep.eqiad.wmflabs

Related Objects

Mentioned In
T285978: AbuseFilterCachingParser: PHP Warning: mb_strpos(): Offset not contained in string
T264604: Enable "/*/mw-with-onhost-tier/" route for MediaWiki where safe
T281152: 1.37.0-wmf.11 deployment blockers
T278641: Migrate deployment-prep away from Debian Stretch to Buster/Bullseye
T275975: Search broken on beta cluster wikis
T274593: Logstash beta is not getting any events
T273238: Write a script to parse StopForumSpam logs and make some stats
T263617: Cannot login to beta cluster: "There seems to be a problem with your login session..."
T256722: Logstash-beta doesn't include any deprecation notices
T239868: Logstash in Beta Cluster stopped ingesting messages from MediaWiki
T238707: Migrate from deployment-logstash2 (jessie) to deployment-logstash03 (stretch)
T218729: Migrate deployment-prep away from Debian Jessie to Debian Stretch/Buster
T228088: BetaCluster: ExternalStoreException - Unable to store text to external storage
T233610: Internal error from MachineVision on action=info for files: argument to RandomWikidataIdHandler must be WikidataDepictsSetter, LabelResolver given
T232796: [betalabs] memcached listens solely on 127.0.0.1 (was: Cannot create a new user account)
Mentioned Here
T283013: Migrate beta cluster to ELK7
T276521: deployment-logstash03 puppet errors
T211984: Logstash in beta fails periodically
T242770: Logstash for MediaWiki is down in Beta Cluster
T238707: Migrate from deployment-logstash2 (jessie) to deployment-logstash03 (stretch)
T238253: [Regression pre-wmf.6] Error appears while trying to undo an edit on Beta cluster
T238256: [Regression pre-wmf.6 ]Error "[Xcxc@awQBGoAAGq1uCYAAAAM] Exception caught: Unable to store text to external storage" appears on Beta cluster while trying to edit a page after undoing an edit
T218729: Migrate deployment-prep away from Debian Jessie to Debian Stretch/Buster
T213899: Migrate at least 3 existing Logstash inputs and associated producers to the new Kafka-logging pipeline, and remove the associated non-Kafka Logstash inputs
rOPUP23e26e9ca399: logstash: add udp json logback localhost compatibility endpoint

Event Timeline

There are a very large number of changes, so older changes are hidden.
hashar added subscribers: herron, Krenair.

So the logstash UDP input tries to bind port 11514, but rsyslogd is already listening there.

# netstat -ulnp|grep :11514
udp        0      0 127.0.0.1:11514         0.0.0.0:*                           9929/rsyslogd       
udp        0      0 127.0.0.1:11514         0.0.0.0:*                           9929/rsyslogd       
udp6       0      0 ::1:11514               :::*                                9929/rsyslogd       
udp6       0      0 ::1:11514               :::*                                9929/rsyslogd
grep -R 11514 /etc/rsyslog.d/
/etc/rsyslog.d/50-udp-json-logback-compat.conf:input(type="imudp" port="11514" address="localhost" ruleset="ruleset_udp_json_logback_compat")
/etc/rsyslog.d/50-udp-localhost-compat.conf:input(type="imudp" port="11514" address="localhost" ruleset="udp_localhost_to_kafka")

And the same port is used by logstash:

/etc/logstash/conf.d/10-input-udp-logback.conf:    port => 11514
/etc/logstash/conf.d/10-input-udp-logback.conf:    id => "input/udp/11514"
/etc/logstash/conf.d/10-input-udp-logback.conf:    tags => ["input-udp-logback-11514", "deprecated-input"]
/etc/logstash/conf.d/10-input-tcp-json_lines.conf:    port => 11514
/etc/logstash/conf.d/10-input-tcp-json_lines.conf:        id => "input/tcp/11514"
/etc/logstash/conf.d/10-input-tcp-json_lines.conf:    tags => ["input-tcp-json_lines-11514", "deprecated-input"]

Which of those are meant to relay to Kafka? The config files lead me to 23e26e9ca39911305e8af033450f1964a3ade4f5 ("logstash: add udp json logback localhost compatibility endpoint") and T213899.

So maybe @herron would know more about why deployment-logstash03 (Horizon) is broken, or maybe @Krenair, who created it in May 2019.

Note that deployment-logstash2 is in similar shape.

The details of where we got to with this logstash03 instance are in the thread from T218729#5153739. I'm slightly surprised this host is logstash-beta.wmflabs.org?

In fact, according to openstack-browser, logstash-beta.wmflabs.org still points at deployment-logstash2 and not deployment-logstash03 at all?

I had not checked which instance was used for logstash-beta.wmflabs.org. One thing is sure: they are both broken in the same way, due to rsyslog and the logstash UDP input conflicting on port 11514 :]

Mentioned in SAL (#wikimedia-releng) [2019-09-24T15:27:36Z] <hashar> deployment-logstash2: removing class profile::elastalert , does not exist and breaks puppet # T233134

On deployment-logstash2 I have killed rsyslogd to let logstash listen on UDP port 11514.

rsyslog listens on that port due to config files:

/etc/rsyslog.d/50-udp-json-logback-compat.conf
/etc/rsyslog.d/50-udp-localhost-compat.conf

Which I guess should replace the logstash input which is configured by:

/etc/logstash/conf.d/10-input-udp-logback.conf
/etc/logstash/conf.d/10-input-tcp-json_lines.conf

I have no idea how those files are provisioned; it sounds to me like one of the roles applied to the instance should be removed, or maybe the logstash one does not purge old config files.

That is all I get for now. Leaving the instance with logstash listening on 11514.
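
For reference, a sketch of the kind of commands behind that manual workaround (assuming the standard rsyslog and logstash systemd units):

# Free the UDP port held by rsyslogd and let logstash grab it
sudo systemctl stop rsyslog
sudo systemctl restart logstash
# Confirm the logstash (java) process now owns UDP 11514
sudo ss -ulpn | grep 11514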

hashar triaged this task as High priority. Sep 24 2019, 3:43 PM

My two cents: since deployment-logstash03 has been set up in T218729: Migrate deployment-prep away from Debian Jessie to Debian Stretch/Buster as the 02 stretch replacement, my suggestion would be to bring up 03 and ditch 02 in the process.

Change 540189 had a related patch set uploaded (by Herron; owner: Herron):
[operations/puppet@production] elasticsearch: add toggle for rsyslog udp logback compat include

https://gerrit.wikimedia.org/r/540189

Change 540189 abandoned by Herron:
elasticsearch: add toggle for rsyslog udp logback compat include

Reason:
implemented in a preferred way with I278486fb96878a7a45a608d785625d7db74e9953

https://gerrit.wikimedia.org/r/540189

herron claimed this task.

This occurred in prod as well after host reboots, and a fix has been deployed in puppet. The fix moves include ::profile::rsyslog::udp_json_logback_compat from profile::elasticsearch to profile::elasticsearch::cirrus, which prevents the rsyslog udp 11514 listener from being deployed to logstash collectors, where it conflicts with logstash. I think we're in good shape here now; setting this to resolved.
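
A minimal sketch of how one could verify the fix on a logstash collector after the change lands (assumes a normal puppet agent setup on the instance):

sudo puppet agent -t          # pick up the updated catalog
sudo ss -ulpn | grep 11514    # expect a java/logstash process here, and no rsyslogd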

Reopening since:

re: mediawiki events, searching for type:mediawiki does yield results for me (e.g. https://logstash-beta.wmflabs.org/goto/e617fd053ba33c26815ed92f1e51ba07). Is there a more specific query that does not yield the expected results?

re: deployment-logstash03, I believe its migration is tracked at T218729: Migrate deployment-prep away from Debian Jessie to Debian Stretch/Buster, but AFAICS it seems stalled.

Ran into what is probably this issue this morning while trying to investigate T238253 and T238256. I don't see the exceptions mentioned in those tasks either on deployment-fluorine02 or in logstash-beta.wmflabs.org.

logstash-beta now points at deployment-logstash03; deployment-logstash2 is still accessible at https://old-logstash-beta.wmflabs.org/ for the time being (I expect people will want to copy some stuff over, not sure what to do about that yet). Is stuff better now?
I do see results at https://logstash-beta.wmflabs.org/app/kibana#/discover?_g=()&_a=(columns:!(_source),index:AW6GShqJw8E2oQN1xWTl,interval:auto,query:(query_string:(query:'type:mediawiki')),sort:!('@timestamp',desc))

Krinkle added subscribers: Krinkle, colewhite.

Looks like there are still, or again, no messages under type:mediawiki. No specific filter, just type:mediawiki overall. I tried e.g. enabling WikimediaDebug with verbose logging; that also didn't make anything show up there:

https://logstash-beta.wmflabs.org/app/kibana#/dashboard/mediawiki-errors

seeing rsyslog complaining about "omkafka: kafka delivery FAIL" on deployment-prep hosts.

seeing rsyslog complaining about "omkafka: kafka delivery FAIL" on deployment-prep hosts.

That was me. Will have another look at this today

Looking more closely, the problem was due to a "Broker: Leader not available" issue in the deployment-prep Kafka logging cluster. After starting deployment-logstash2 back up (the instance had been stopped), logs are flowing again. Longer term we'll likely need another logstash stretch instance, and to migrate the broker id over from deployment-logstash2 to the new instance.
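
For "Broker: Leader not available", a sketch of how to inspect partition leadership on the deployment-prep logging cluster (the tool name, host and plaintext port 9092 are assumptions; older Kafka versions need --zookeeper instead of --bootstrap-server):

kafka-topics.sh --bootstrap-server deployment-logstash2.deployment-prep.eqiad.wmflabs:9092 \
  --describe | grep -Ei 'Leader: (none|-1)'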

What exactly is needed? Do we need to make a second logstash stretch host? Do we need to move something else over from the old instance? How does the Kafka cluster interact with this beyond firing logs at it?

Do we just need to point some more things at the new instance we made (see also T238707)? Do we just need to remove deployment-logstash2 from the kafka_clusters / logging-eqiad hiera data set project-wide in horizon?

Yes, we will need a second logstash stretch instance, and to migrate the Kafka broker ID from deployment-logstash2 to the new host.

Yes, we will need a second logstash stretch instance, and to migrate the Kafka broker ID from deployment-logstash2 to the new host.

Well, we have deployment-logstash03; do we need a deployment-logstash04 as well? Why?
How does one migrate a Kafka broker ID?
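
For the broker ID question, a minimal sketch of the usual approach (the config path is a guess about this install): the ID is just a setting in server.properties, so migrating it generally means stopping Kafka on the old host and starting the new host with the same broker.id, after which it takes over (and, with replication, re-syncs) the partitions registered to that ID.

# Where the broker identity lives (path may differ on this install)
grep '^broker.id' /etc/kafka/server.properties
# To "migrate" the ID: stop kafka on the old broker, set the same broker.id in the
# new host's server.properties, then start kafka there.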

Mentioned in SAL (#wikimedia-releng) [2020-09-15T17:49:34Z] <Krinkle> Logstash in beta cluster appears broken. No messages from Scap or MediaWiki in several days. ref T233134

Can this be closed now? It's happened a number of times since this was originally reported, and is now working.

I don't know if the issues we saw this week and last month are the same as the one detailed in this task. If it is the same, then it is not resolved, as it keeps coming back (and this task is not about the one incident in 2019).

If the root cause from then was solved and we are dealing with a new recurring issue, then we could close this and start another. I don't know if that's the case.

This happened again; it broke when I ran puppet on logstash03 while investigating T276521: deployment-logstash03 puppet errors. I tried restarting logstash and it managed to bind port 11514, but still no events :(

Found this error message in /var/log/logstash/logstash-plain-2021-03-05.log:

Plugin: <LogStash::Inputs::Kafka tags=>["input-kafka-eventgate-main-validation-error-eqiad", "kafka", "es", "eventgate"], topics=>["eqiad.eventgate-main.error.validation"], type=>"eventgate_validation_error", codec=><LogStash::Codecs::JSON id=>"json_670029b3-b883-4e89-ab57-4e6e3a793d68", enable_metric=>true, charset=>"UTF-8">, consumer_threads=>3, bootstrap_servers=>"deployment-logstash03.deployment-prep.eqiad.wmflabs:9093", id=>"input/kafka/eventgate-main-validation-error-eqiad", security_protocol=>"SSL", ssl_truststore_location=>"/etc/logstash/kafka_logstash-eqiad.truststore.jks", ssl_truststore_password=><password>, enable_metric=>true, auto_commit_interval_ms=>"5000", client_id=>"logstash", enable_auto_commit=>"true", group_id=>"logstash", key_deserializer_class=>"org.apache.kafka.common.serialization.StringDeserializer", value_deserializer_class=>"org.apache.kafka.common.serialization.StringDeserializer", poll_timeout_ms=>100, ssl=>false, sasl_mechanism=>"GSSAPI", decorate_events=>false>
Error: uncaught throw Failed to construct kafka consumer in thread 0x6af5f10e

I'm not exactly sure what's causing it to break, but that looks like the cause of the missing events.
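
"Failed to construct kafka consumer" is commonly a consumer configuration problem (for example an unreadable or expired truststore) rather than a broker outage. A couple of quick checks, sketched from the input config above (assumes the service runs as the logstash user; keytool ships with the JRE and will prompt for the truststore password):

# Is the truststore readable by the logstash user?
sudo -u logstash ls -l /etc/logstash/kafka_logstash-eqiad.truststore.jks
# Have the certificates in it expired?
sudo keytool -list -v -keystore /etc/logstash/kafka_logstash-eqiad.truststore.jks | grep -i valid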

Mentioned in SAL (#wikimedia-releng) [2021-04-09T18:41:20Z] <Krinkle> Logstash in beta shows no messages from mediawiki for 24 hours, T233134.

Change 683695 had a related patch set uploaded (by Herron; author: Herron):

[operations/puppet@production] logstash: move kafka input configs to profile::logstash::kafka_inputs

https://gerrit.wikimedia.org/r/683695

Change 683695 abandoned by Herron:

[operations/puppet@production] logstash: move kafka input configs to profile::logstash::kafka_inputs

Reason:

did not work as expected

https://gerrit.wikimedia.org/r/683695

I'm hoping this was fixed for good with the ELK7 upgrade (T283013).