Page MenuHomePhabricator

logstash-beta.wmflabs.org does not receive any mediawiki events
Open, HighPublic

Description

https://logstash-beta.wmflabs.org/ is mostly empty.

Looking for events from hosts deployment-logstash2 or deployment-logstash03 , the logstash process there started failing around 2019-08-16 13:25

The first error I found:

[2019-08-16T13:26:00,667][ERROR][logstash.plugins.registry] Problems loading a plugin with


 {:type=>"filter", :name=>"truncate", :path=>"logstash/filters/truncate", :error_message=>"NameError", :error_class=>NameError, :error_backtrace=>["/usr/share/logstash/logstash-core/lib/logstash/plugins/registry.rb:226:in `namespace_lookup'",
"/usr/share/logstash/logstash-core/lib/logstash/plugins/registry.rb:162:in `legacy_lookup'",
"/usr/share/logstash/logstash-core/lib/logstash/plugins/registry.rb:138:in `lookup'",
"/usr/share/logstash/logstash-core/lib/logstash/plugins/registry.rb:180:in `lookup_pipeline_plugin'",
"/usr/share/logstash/logstash-core/lib/logstash/plugin.rb:140:in `lookup'",
"/usr/share/logstash/logstash-core/lib/logstash/pipeline.rb:103:in `plugin'",
"(eval):157:in `initialize'",
"org/jruby/RubyKernel.java:1079:in `eval'",
"/usr/share/logstash/logstash-core/lib/logstash/pipeline.rb:75:in `initialize'",
"/usr/share/logstash/logstash-core/lib/logstash/pipeline.rb:165:in `initialize'",
"/usr/share/logstash/logstash-core/lib/logstash/agent.rb:296:in `create_pipeline'",
"/usr/share/logstash/logstash-core/lib/logstash/agent.rb:95:in `register_pipeline'",
"/usr/share/logstash/logstash-core/lib/logstash/runner.rb:313:in `execute'",
"/usr/share/logstash/vendor/bundle/jruby/1.9/gems/clamp-0.6.5/lib/clamp/command.rb:67:in `run'",
"/usr/share/logstash/logstash-core/lib/logstash/runner.rb:204:in `run'",
"/usr/share/logstash/vendor/bundle/jruby/1.9/gems/clamp-0.6.5/lib/clamp/command.rb:132:in `run'",
"/usr/share/logstash/lib/bootstrap/environment.rb:71:in `(root)'"]}

For recent errors:

[2019-09-17T17:24:33,329][INFO ][logstash.inputs.udp      ] Starting UDP listener {:address=>"0.0.0.0:11514"}
[2019-09-17T17:24:28,325][ERROR][logstash.inputs.udp      ] UDP listener died {
:exception=>#<Errno::EADDRINUSE: Address already in use - bind - Address already in use>,
:backtrace=>[
"org/jruby/ext/socket/RubyUDPSocket.java:161:in `bind'",
"/usr/share/logstash/vendor/bundle/jruby/1.9/gems/logstash-input-udp-3.3.4/lib/logstash/inputs/udp.rb:116:in `udp_listener'",
"/usr/share/logstash/vendor/bundle/jruby/1.9/gems/logstash-input-udp-3.3.4/lib/logstash/inputs/udp.rb:68:in `run'",
"/usr/share/logstash/logstash-core/lib/logstash/pipeline.rb:470:in `inputworker'",
"/usr/share/logstash/logstash-core/lib/logstash/pipeline.rb:463:in `start_input'"]
}

rsyslog is already listening on 0.0.0.0:11514

deployment-logstash2 is the instance serving https://logstash-beta.wmflabs.org/

https://horizon.wikimedia.org/project/instances/15f1bf2e-30e0-487a-ab2a-f6fcc3e4c170/

Puppet classes:

role::logstash::apifeatureusage
role::logstash::collector prometheus_nodes: hiera('prometheus_nodes', [])
role::logstash::puppetreports
role::labs::lvm::srv
role::kafka::logging
role::kibana auth_type: None; vhost: None; auth_realm: undef; auth_file: undef; ldap_authurl: undef; ldap_binddn: undef; require_ssl: true; ldap_groups: []; serveradmin: None
profile:elastalert

Hiera:

elasticsearch::graylog_hosts: none
elasticsearch::version: 5
profile::elasticsearch::common_settings:
  auto_create_index: +logstash-*,-*
  cluster_hosts:
  - deployment-logstash2.deployment-prep.eqiad.wmflabs
  cluster_name: labs-logstash-eqiad
  expected_nodes: 1
  heap_memory: 5G
  minimum_master_nodes: 1
  recover_after_nodes: 1
  recover_after_time: 1m
  short_cluster_name: labs-logstash-eqiad
  unicast_hosts:
  - deployment-logstash2.deployment-prep.eqiad.wmflabs
profile::kafka::broker::kafka_cluster_name: logging-eqiad
profile::kafka::broker::ssl_enabled: true
profile::kafka::broker::ssl_password: qwerty
profile::logstash::collector::jmx_exporter_port: 7801
puppetmaster: deployment-puppetmaster03.deployment-prep.eqiad.wmflabs

Details

Related Gerrit Patches:

Event Timeline

hashar created this task.Sep 17 2019, 5:26 PM
Restricted Application added a subscriber: Aklapper. · View Herald TranscriptSep 17 2019, 5:26 PM
hashar added subscribers: herron, Krenair.

So the logstash input for udp tries to bind on 11514 but rsyslogd is already listening there.

# netstat -ulnp|grep :11514
udp        0      0 127.0.0.1:11514         0.0.0.0:*                           9929/rsyslogd       
udp        0      0 127.0.0.1:11514         0.0.0.0:*                           9929/rsyslogd       
udp6       0      0 ::1:11514               :::*                                9929/rsyslogd       
udp6       0      0 ::1:11514               :::*                                9929/rsyslogd
grep -R 11514 /etc/rsyslog.d/
/etc/rsyslog.d/50-udp-json-logback-compat.conf:input(type="imudp" port="11514" address="localhost" ruleset="ruleset_udp_json_logback_compat")
/etc/rsyslog.d/50-udp-localhost-compat.conf:input(type="imudp" port="11514" address="localhost" ruleset="udp_localhost_to_kafka")

And the same port is used by logstash:

/etc/logstash/conf.d/10-input-udp-logback.conf:    port => 11514
/etc/logstash/conf.d/10-input-udp-logback.conf:    id => "input/udp/11514"
/etc/logstash/conf.d/10-input-udp-logback.conf:    tags => ["input-udp-logback-11514", "deprecated-input"]
/etc/logstash/conf.d/10-input-tcp-json_lines.conf:    port => 11514
/etc/logstash/conf.d/10-input-tcp-json_lines.conf:        id => "input/tcp/11514"
/etc/logstash/conf.d/10-input-tcp-json_lines.conf:    tags => ["input-tcp-json_lines-11514", "deprecated-input"]

Which are meant to relay to kafka? The config files lead me to 23e26e9ca39911305e8af033450f1964a3ade4f5 "logstash: add udp json logback localhost compatibility endpoint" and T213899.

So maybe @herron would know more as to why deployment-logstash03 (Horizon) is broken or maybe @Krenair who created it in May 2019.

Note that deployment-logstash2 is in a similar shape.

The details of where we got to with this logstash03 instance are in the thread from T218729#5153739, I'm slightly surprised this host is logstash-beta.wmflabs.org?

In fact according to openstack-browser, logstash-beta.wmflabs.org still points at deployment-logstash2 and not deployment-logstash03 at all?

I had not checked which instances was used for logstash-beta.wmflabs.org. One sure thing they are both broken in the same way due to rsyslog / logstash udp input conflicting on port 11514 :]

awight added a subscriber: awight.Sep 18 2019, 9:04 AM
hashar updated the task description. (Show Details)Sep 24 2019, 3:25 PM

Mentioned in SAL (#wikimedia-releng) [2019-09-24T15:27:36Z] <hashar> deployment-logstash2: removing class profile::elastalert , does not exist and breaks puppet # T233134

On deployment-logstash2 I have killed rsyslogd to let logstash listens on UDP port 11514.

rsyslog listens on that port due to config files:

/etc/rsyslog.d/50-udp-json-logback-compat.conf
/etc/rsyslog.d/50-udp-localhost-compat.conf

Which I guess should replace the logstash input which is configured by:

/etc/logstash/conf.d/10-input-udp-logback.conf
/etc/logstash/conf.d/10-input-tcp-json_lines.conf

I have no idea how those files are provisioned, sounds to me one of the role applied to the instance should be removed, or maybe the logstash one does not purge old config files.

That is all I get for now. Leaving the instance with logstash listening on 11514.

hashar triaged this task as High priority.Sep 24 2019, 3:43 PM

My two cents: since deployment-logstash03 has been setup in T218729: Migrate deployment-prep away from Debian Jessie to Debian Stretch/Buster as the 02 stretch replacement, my suggestion would be to work to bring up 03 and ditch 02 in the process

Change 540189 had a related patch set uploaded (by Herron; owner: Herron):
[operations/puppet@production] elasticsearch: add toggle for rsyslog udp logback compat include

https://gerrit.wikimedia.org/r/540189

Change 540189 abandoned by Herron:
elasticsearch: add toggle for rsyslog udp logback compat include

Reason:
implemented in a preferred way with I278486fb96878a7a45a608d785625d7db74e9953

https://gerrit.wikimedia.org/r/540189

herron closed this task as Resolved.Oct 2 2019, 3:49 PM
herron claimed this task.

This occurred in prod as well after host reboots, and a fix has been deployed in puppet. The fix moves include ::profile::rsyslog::udp_json_logback_compat from profile::elasticsearch to profile::elasticsearch::cirrus which prevents the rsyslog udp 11514 listener from being deployed to logstash collectors where it conflicts with logstash. I think we're in good shape here now, setting this to resolved.

hashar reopened this task as Open.Oct 2 2019, 5:33 PM

Reopening since:

re: mediawiki events, searching for type:mediawiki does yield results for me (e.g. https://logstash-beta.wmflabs.org/goto/e617fd053ba33c26815ed92f1e51ba07 ) is there a more specific query that does not yield the expected results?

re: deployment-logstash03 I believe its migration is tracked at T218729: Migrate deployment-prep away from Debian Jessie to Debian Stretch/Buster but AFAICS seems stalled

Anomie added a subscriber: Anomie.Fri, Nov 15, 2:32 PM

Ran into what is probably this this morning while trying to investigate T238253 and T238256. I don't see the exceptions mentioned in those tasks on either deployment-fluorine02 or in logstash-beta.wmflabs.org.

Krenair added a comment.EditedWed, Nov 20, 1:01 AM

logstash-beta now points at deployment-logstash03, deployment-logstash2 is still accessible at https://old-logstash-beta.wmflabs.org/ for the time being (I expect people will want to copy some stuff over, not sure what to do about that yet). Is stuff better now?
I do see results at https://logstash-beta.wmflabs.org/app/kibana#/discover?_g=()&_a=(columns:!(_source),index:AW6GShqJw8E2oQN1xWTl,interval:auto,query:(query_string:(query:'type:mediawiki')),sort:!('@timestamp',desc))

Krinkle added subscribers: Krinkle, colewhite.

Looks like there still, or again, no messages under type:mediawiki. No specific filter, just type:mediawiki overall. I tried e.g. enabling WikimediaDebug with verbose logging, that also didn't make anything show up there:

https://logstash-beta.wmflabs.org/app/kibana#/dashboard/mediawiki-errors

seeing rsyslog complaining about "omkafka: kafka delivery FAIL" on deployment-prep hosts.

herron added a comment.Thu, Dec 5, 2:57 PM

seeing rsyslog complaining about "omkafka: kafka delivery FAIL" on deployment-prep hosts.

That was me. Will have another look at this today

herron added a comment.Thu, Dec 5, 5:33 PM

Looking more closely the problem was due to a Broker: Leader not available issue in the deployment-prep kafka logging cluster. After starting deployment-logstash2 back up (the instance had been stopped) logs are flowing again. Longer term we'll likely need another logstash stretch instance and to migrate over the broker id from deployment-logstash2 to the new instance.

What exactly is needed? Do we need to make a second logstash stretch host?
Do we need to move something else over from the old instance? How does the
Kafka cluster interact with this beyond firing logs at it?