
logstash-beta.wmflabs.org does not receive any mediawiki events
Open, High, Public

Description

https://logstash-beta.wmflabs.org/ is mostly empty.

Looking for events from hosts deployment-logstash2 or deployment-logstash03, I found that the logstash process there started failing around 2019-08-16 13:25.
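For reference, the errors below come from the logstash log on the instance. A quick way to pull them (assuming the Logstash 5 default log location, /var/log/logstash/logstash-plain.log):

sudo grep ERROR /var/log/logstash/logstash-plain.log | head -n 20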

The first error I found:

[2019-08-16T13:26:00,667][ERROR][logstash.plugins.registry] Problems loading a plugin with {:type=>"filter", :name=>"truncate", :path=>"logstash/filters/truncate", :error_message=>"NameError", :error_class=>NameError, :error_backtrace=>["/usr/share/logstash/logstash-core/lib/logstash/plugins/registry.rb:226:in `namespace_lookup'",
"/usr/share/logstash/logstash-core/lib/logstash/plugins/registry.rb:162:in `legacy_lookup'",
"/usr/share/logstash/logstash-core/lib/logstash/plugins/registry.rb:138:in `lookup'",
"/usr/share/logstash/logstash-core/lib/logstash/plugins/registry.rb:180:in `lookup_pipeline_plugin'",
"/usr/share/logstash/logstash-core/lib/logstash/plugin.rb:140:in `lookup'",
"/usr/share/logstash/logstash-core/lib/logstash/pipeline.rb:103:in `plugin'",
"(eval):157:in `initialize'",
"org/jruby/RubyKernel.java:1079:in `eval'",
"/usr/share/logstash/logstash-core/lib/logstash/pipeline.rb:75:in `initialize'",
"/usr/share/logstash/logstash-core/lib/logstash/pipeline.rb:165:in `initialize'",
"/usr/share/logstash/logstash-core/lib/logstash/agent.rb:296:in `create_pipeline'",
"/usr/share/logstash/logstash-core/lib/logstash/agent.rb:95:in `register_pipeline'",
"/usr/share/logstash/logstash-core/lib/logstash/runner.rb:313:in `execute'",
"/usr/share/logstash/vendor/bundle/jruby/1.9/gems/clamp-0.6.5/lib/clamp/command.rb:67:in `run'",
"/usr/share/logstash/logstash-core/lib/logstash/runner.rb:204:in `run'",
"/usr/share/logstash/vendor/bundle/jruby/1.9/gems/clamp-0.6.5/lib/clamp/command.rb:132:in `run'",
"/usr/share/logstash/lib/bootstrap/environment.rb:71:in `(root)'"]}

More recent errors:

[2019-09-17T17:24:33,329][INFO ][logstash.inputs.udp      ] Starting UDP listener {:address=>"0.0.0.0:11514"}
[2019-09-17T17:24:28,325][ERROR][logstash.inputs.udp      ] UDP listener died {
:exception=>#<Errno::EADDRINUSE: Address already in use - bind - Address already in use>,
:backtrace=>[
"org/jruby/ext/socket/RubyUDPSocket.java:161:in `bind'",
"/usr/share/logstash/vendor/bundle/jruby/1.9/gems/logstash-input-udp-3.3.4/lib/logstash/inputs/udp.rb:116:in `udp_listener'",
"/usr/share/logstash/vendor/bundle/jruby/1.9/gems/logstash-input-udp-3.3.4/lib/logstash/inputs/udp.rb:68:in `run'",
"/usr/share/logstash/logstash-core/lib/logstash/pipeline.rb:470:in `inputworker'",
"/usr/share/logstash/logstash-core/lib/logstash/pipeline.rb:463:in `start_input'"]
}

rsyslog is already listening on UDP port 11514 (bound to localhost), which prevents logstash from binding to 0.0.0.0:11514.
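A quick way to confirm which process holds the port (assuming ss is available on the instance):

sudo ss -ulnp | grep 11514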

deployment-logstash2 is the instance serving https://logstash-beta.wmflabs.org/

https://horizon.wikimedia.org/project/instances/15f1bf2e-30e0-487a-ab2a-f6fcc3e4c170/

Puppet classes:

role::logstash::apifeatureusage
role::logstash::collector prometheus_nodes: hiera('prometheus_nodes', [])
role::logstash::puppetreports
role::labs::lvm::srv
role::kafka::logging
role::kibana auth_type: None; vhost: None; auth_realm: undef; auth_file: undef; ldap_authurl: undef; ldap_binddn: undef; require_ssl: true; ldap_groups: []; serveradmin: None
profile::elastalert

Hiera:

elasticsearch::graylog_hosts: none
elasticsearch::version: 5
profile::elasticsearch::common_settings:
  auto_create_index: +logstash-*,-*
  cluster_hosts:
  - deployment-logstash2.deployment-prep.eqiad.wmflabs
  cluster_name: labs-logstash-eqiad
  expected_nodes: 1
  heap_memory: 5G
  minimum_master_nodes: 1
  recover_after_nodes: 1
  recover_after_time: 1m
  short_cluster_name: labs-logstash-eqiad
  unicast_hosts:
  - deployment-logstash2.deployment-prep.eqiad.wmflabs
profile::kafka::broker::kafka_cluster_name: logging-eqiad
profile::kafka::broker::ssl_enabled: true
profile::kafka::broker::ssl_password: qwerty
profile::logstash::collector::jmx_exporter_port: 7801
puppetmaster: deployment-puppetmaster03.deployment-prep.eqiad.wmflabs

Related Objects

Mentioned In
T263617: Cannot login to beta cluster: "There seems to be a problem with your login session..."
T256722: Logstash-beta doesn't include any deprecation notices
T239868: Logstash in Beta Cluster stopped ingesting messages from MediaWiki
T238707: Migrate from deployment-logstash2 (jessie) to deployment-logstash03 (stretch)
T218729: Migrate deployment-prep away from Debian Jessie to Debian Stretch/Buster
T228088: BetaCluster: ExternalStoreException - Unable to store text to external storage
T233610: Internal error from MachineVision on action=info for files: argument to RandomWikidataIdHandler must be WikidataDepictsSetter, LabelResolver given
T232796: [betalabs] memcached listens solely on 127.0.0.1 (was: Cannot create a new user account)
Mentioned Here
T211984: Logstash in beta fails periodically
T242770: Logstash for MediaWiki is down in Beta Cluster
T238707: Migrate from deployment-logstash2 (jessie) to deployment-logstash03 (stretch)
T238253: [Regression pre-wmf.6] Error appears while trying to undo an edit on Beta cluster
T238256: [Regression pre-wmf.6 ]Error "[Xcxc@awQBGoAAGq1uCYAAAAM] Exception caught: Unable to store text to external storage" appears on Beta cluster while trying to edit a page after undoing an edit
T218729: Migrate deployment-prep away from Debian Jessie to Debian Stretch/Buster
T213899: Migrate at least 3 existing Logstash inputs and associated producers to the new Kafka-logging pipeline, and remove the associated non-Kafka Logstash inputs
rOPUP23e26e9ca399: logstash: add udp json logback localhost compatibility endpoint

Event Timeline

hashar created this task. Sep 17 2019, 5:26 PM
Restricted Application added a subscriber: Aklapper. Sep 17 2019, 5:26 PM
hashar added subscribers: herron, Krenair.

So the logstash input for udp tries to bind on 11514 but rsyslogd is already listening there.

# netstat -ulnp|grep :11514
udp        0      0 127.0.0.1:11514         0.0.0.0:*                           9929/rsyslogd       
udp        0      0 127.0.0.1:11514         0.0.0.0:*                           9929/rsyslogd       
udp6       0      0 ::1:11514               :::*                                9929/rsyslogd       
udp6       0      0 ::1:11514               :::*                                9929/rsyslogd
grep -R 11514 /etc/rsyslog.d/
/etc/rsyslog.d/50-udp-json-logback-compat.conf:input(type="imudp" port="11514" address="localhost" ruleset="ruleset_udp_json_logback_compat")
/etc/rsyslog.d/50-udp-localhost-compat.conf:input(type="imudp" port="11514" address="localhost" ruleset="udp_localhost_to_kafka")

And the same port is used by logstash:

/etc/logstash/conf.d/10-input-udp-logback.conf:    port => 11514
/etc/logstash/conf.d/10-input-udp-logback.conf:    id => "input/udp/11514"
/etc/logstash/conf.d/10-input-udp-logback.conf:    tags => ["input-udp-logback-11514", "deprecated-input"]
/etc/logstash/conf.d/10-input-tcp-json_lines.conf:    port => 11514
/etc/logstash/conf.d/10-input-tcp-json_lines.conf:        id => "input/tcp/11514"
/etc/logstash/conf.d/10-input-tcp-json_lines.conf:    tags => ["input-tcp-json_lines-11514", "deprecated-input"]

Are those meant to relay to Kafka? The config files lead me to 23e26e9ca39911305e8af033450f1964a3ade4f5 "logstash: add udp json logback localhost compatibility endpoint" and T213899.

So maybe @herron would know more as to why deployment-logstash03 (Horizon) is broken, or maybe @Krenair, who created it in May 2019.

Note that deployment-logstash2 is in a similar state.

The details of where we got to with the logstash03 instance are in the thread from T218729#5153739. I'm slightly surprised this host is logstash-beta.wmflabs.org?

In fact according to openstack-browser, logstash-beta.wmflabs.org still points at deployment-logstash2 and not deployment-logstash03 at all?

I had not checked which instance was used for logstash-beta.wmflabs.org. One sure thing: they are both broken in the same way, due to rsyslog and the logstash udp input conflicting on port 11514 :]

awight added a subscriber: awight. Sep 18 2019, 9:04 AM
hashar updated the task description. Sep 24 2019, 3:25 PM

Mentioned in SAL (#wikimedia-releng) [2019-09-24T15:27:36Z] <hashar> deployment-logstash2: removing class profile::elastalert , does not exist and breaks puppet # T233134

On deployment-logstash2 I have killed rsyslogd to let logstash listen on UDP port 11514.

rsyslog listens on that port due to config files:

/etc/rsyslog.d/50-udp-json-logback-compat.conf
/etc/rsyslog.d/50-udp-localhost-compat.conf

Which I guess are meant to replace the logstash inputs configured by:

/etc/logstash/conf.d/10-input-udp-logback.conf
/etc/logstash/conf.d/10-input-tcp-json_lines.conf

I have no idea how those files are provisioned; it sounds to me like one of the roles applied to the instance should be removed, or maybe the logstash one does not purge old config files.

That is all I get for now. Leaving the instance with logstash listening on 11514.
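For reference, the temporary workaround amounts to roughly the following (service names assumed to be the standard systemd units; puppet may well bring rsyslog back on its next run):

sudo systemctl stop rsyslog        # frees 127.0.0.1:11514
sudo systemctl restart logstash    # lets the udp input bind 0.0.0.0:11514
sudo ss -ulnp | grep 11514         # should now show the logstash (java) process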

hashar triaged this task as High priority. Sep 24 2019, 3:43 PM

My two cents: since deployment-logstash03 has been set up in T218729: Migrate deployment-prep away from Debian Jessie to Debian Stretch/Buster as the 02 stretch replacement, my suggestion would be to work on bringing up 03 and ditching 02 in the process.

Change 540189 had a related patch set uploaded (by Herron; owner: Herron):
[operations/puppet@production] elasticsearch: add toggle for rsyslog udp logback compat include

https://gerrit.wikimedia.org/r/540189

Change 540189 abandoned by Herron:
elasticsearch: add toggle for rsyslog udp logback compat include

Reason:
implemented in a preferred way with I278486fb96878a7a45a608d785625d7db74e9953

https://gerrit.wikimedia.org/r/540189

herron closed this task as Resolved. Oct 2 2019, 3:49 PM
herron claimed this task.

This occurred in prod as well after host reboots, and a fix has been deployed in puppet. The fix moves include ::profile::rsyslog::udp_json_logback_compat from profile::elasticsearch to profile::elasticsearch::cirrus, which prevents the rsyslog udp 11514 listener from being deployed to logstash collectors, where it conflicts with logstash. I think we're in good shape here now, setting this to resolved.
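A sketch of how one might confirm where that include now lives, by grepping a local operations/puppet checkout (the ~/puppet path is an assumption):

grep -rn 'udp_json_logback_compat' ~/puppet/modules/profile/manifests/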

hashar reopened this task as Open. Oct 2 2019, 5:33 PM

Reopening since:

re: mediawiki events, searching for type:mediawiki does yield results for me (e.g. https://logstash-beta.wmflabs.org/goto/e617fd053ba33c26815ed92f1e51ba07 ) is there a more specific query that does not yield the expected results?

re: deployment-logstash03 I believe its migration is tracked at T218729: Migrate deployment-prep away from Debian Jessie to Debian Stretch/Buster but AFAICS seems stalled

Anomie added a subscriber: Anomie. Nov 15 2019, 2:32 PM

Ran into what is probably this issue this morning while trying to investigate T238253 and T238256. I don't see the exceptions mentioned in those tasks either on deployment-fluorine02 or in logstash-beta.wmflabs.org.

Krenair added a comment (edited). Nov 20 2019, 1:01 AM

logstash-beta now points at deployment-logstash03; deployment-logstash2 is still accessible at https://old-logstash-beta.wmflabs.org/ for the time being (I expect people will want to copy some stuff over, not sure what to do about that yet). Is stuff better now?
I do see results at https://logstash-beta.wmflabs.org/app/kibana#/discover?_g=()&_a=(columns:!(_source),index:AW6GShqJw8E2oQN1xWTl,interval:auto,query:(query_string:(query:'type:mediawiki')),sort:!('@timestamp',desc))

Krinkle added subscribers: Krinkle, colewhite.

Looks like there are still, or again, no messages under type:mediawiki. No specific filter, just type:mediawiki overall. I tried e.g. enabling WikimediaDebug with verbose logging; that also didn't make anything show up there:

https://logstash-beta.wmflabs.org/app/kibana#/dashboard/mediawiki-errors

seeing rsyslog complaining about "omkafka: kafka delivery FAIL" on deployment-prep hosts.
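One way to spot these on a host (assuming rsyslog's own error messages end up in the default Debian /var/log/syslog):

sudo grep -i omkafka /var/log/syslog | tail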

herron added a comment. Dec 5 2019, 2:57 PM

seeing rsyslog complaining about "omkafka: kafka delivery FAIL" on deployment-prep hosts.

That was me. Will have another look at this today

herron added a comment. Dec 5 2019, 5:33 PM

Looking more closely, the problem was due to a "Broker: Leader not available" issue in the deployment-prep kafka logging cluster. After starting deployment-logstash2 back up (the instance had been stopped), logs are flowing again. Longer term we'll likely need another logstash stretch instance, and to migrate the broker id over from deployment-logstash2 to the new instance.
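For reference, leader availability can be checked on a broker with the stock Kafka topic tool (the local zookeeper address is an assumption):

kafka-topics.sh --describe --zookeeper localhost:2181   # a leader of -1 / "none" means the partition has no leader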

What exactly is needed? Do we need to make a second logstash stretch host? Do we need to move something else over from the old instance? How does the Kafka cluster interact with this beyond firing logs at it?

Do we just need to point some more things at the new instance we made (see also T238707)? Do we just need to remove deployment-logstash2 from the kafka_clusters / logging-eqiad hiera data set project-wide in horizon?

Yes, we will need a second logstash stretch instance, and to migrate the Kafka broker ID from deployment-logstash2 to the new host.

Yes, we will need a second logstash stretch instance, and to migrate the Kafka broker ID from deployment-logstash2 to the new host.

Well, we have deployment-logstash03; do we need a deployment-logstash04 as well? Why?
How does one migrate a Kafka broker ID?

fgiunchedi moved this task from Inbox to Backlog on the observability board. Jul 20 2020, 1:13 PM

Mentioned in SAL (#wikimedia-releng) [2020-09-15T17:49:34Z] <Krinkle> Logstash in beta cluster appears broken. No messages from Scap or MediaWiki in several days. ref T233134

Aklapper removed a subscriber: Anomie. Fri, Oct 16, 5:01 PM