No entries at all in beta-logs.wmcloud.org since 2023-11-06 Z 12:15:39
Closed, Resolved (Public)

Description

Noticed whilst debugging T350777. Usual traffic is ~130k entries per three hours.

Event Timeline

In the past, restarting Logstash (T274593: Logstash beta is not getting any events) got things working again.

kostajh triaged this task as High priority. Nov 16 2023, 1:16 PM

I don't know if this should be considered a train-blocking task (tagging Release-Engineering-Team for triage), but without visibility into MediaWiki errors in the beta cluster we are more likely to find out about issues only at the group0 deployment, if not later.

I logged into logging-logstash-02.logging.eqiad1.wikimedia.cloud and ran systemctl restart logstash.service. Hopefully that has fixed this.

@jbond discovered that Logstash is in a crash loop, see

Nov 16 13:39:03 logging-logstash-02 systemd[1]: Started logstash.
Nov 16 13:39:03 logging-logstash-02 logstash[2669694]: Using LS_JAVA_HOME defined java: /usr/lib/jvm/java-11-openjdk-amd64.
Nov 16 13:39:03 logging-logstash-02 logstash[2669694]: WARNING: Using LS_JAVA_HOME while Logstash distribution comes with a bundled JDK.
Nov 16 13:39:12 logging-logstash-02 logstash[2669694]: Sending Logstash logs to /var/log/logstash which is now configured via log4j2.properties
Nov 16 13:39:18 logging-logstash-02 logstash[2669694]: warning: thread "Converge PipelineAction::Create<main>" terminated with exception (report_on_exception is true):
Nov 16 13:39:18 logging-logstash-02 logstash[2669694]: LogStash::Error: Don't know how to handle `Java::JavaLang::IllegalStateException` for `PipelineAction::Create<main>`
Nov 16 13:39:18 logging-logstash-02 logstash[2669694]: create at org/logstash/execution/ConvergeResultExt.java:135
Nov 16 13:39:18 logging-logstash-02 logstash[2669694]: add at org/logstash/execution/ConvergeResultExt.java:60
Nov 16 13:39:18 logging-logstash-02 logstash[2669694]: converge_state at /usr/share/logstash/logstash-core/lib/logstash/agent.rb:396
Nov 16 13:39:18 logging-logstash-02 systemd[1]: logstash.service: Main process exited, code=exited, status=1/FAILURE
Nov 16 13:39:18 logging-logstash-02 systemd[1]: logstash.service: Failed with result 'exit-code'.
Nov 16 13:39:18 logging-logstash-02 systemd[1]: logstash.service: Consumed 51.043s CPU time.
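
For anyone wanting to spot this kind of crash loop from journal output without reading it by eye, here is a rough sketch. It only counts the systemd "Main process exited ... FAILURE" lines shown above. The function names and the `threshold` of 3 are made up for illustration and are not part of any existing tooling.

```python
import re
from collections import Counter

# Matches systemd's report that a unit's main process exited with a failure,
# in the format seen in the journal excerpt above.
FAILURE_RE = re.compile(
    r"systemd\[\d+\]: (?P<unit>\S+\.service): Main process exited, "
    r".*status=(?P<status>\d+)/FAILURE"
)


def count_failures(journal_lines):
    """Count failed main-process exits per unit in an iterable of journal lines."""
    counts = Counter()
    for line in journal_lines:
        match = FAILURE_RE.search(line)
        if match:
            counts[match.group("unit")] += 1
    return counts


def looks_like_crash_loop(journal_lines, unit, threshold=3):
    """Hypothetical heuristic: `threshold` or more failures suggests a crash loop."""
    return count_failures(journal_lines)[unit] >= threshold
```

The lines could come from something like `journalctl -u logstash.service` piped into a script that calls `looks_like_crash_loop(sys.stdin, "logstash.service")`.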

Change 974635 had a related patch set uploaded (by Cwhite; author: Cwhite):

[operations/puppet@production] logstash: beta-logs to use current w3creportingapi template

https://gerrit.wikimedia.org/r/974635

Change 974635 merged by Cwhite:

[operations/puppet@production] logstash: beta-logs to use current w3creportingapi template

https://gerrit.wikimedia.org/r/974635

colewhite claimed this task.
colewhite added subscribers: herron, colewhite.

Logstash was crashlooping because it was attempting to load a template that did not exist on the host anymore. Now that it is using the right template, logs are flowing again.
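
A check along these lines might catch a missing template before Logstash tries to start. This is only a sketch, not how the problem was actually found: it assumes the template is referenced through the `template => "..."` option of the elasticsearch output, and `missing_templates` is a hypothetical helper name.

```python
import re
from pathlib import Path

# Matches lines such as:  template => "/some/path.json"
# Does not match template_name or template_overwrite, nor commented-out lines.
TEMPLATE_RE = re.compile(
    r"""^\s*template\s*=>\s*["']([^"']+)["']""", re.MULTILINE
)


def missing_templates(config_text):
    """Return template paths named in the config text that are not files on disk."""
    return [
        path
        for path in TEMPLATE_RE.findall(config_text)
        if not Path(path).is_file()
    ]
```

To use it, read each pipeline config file on the host (the location depends on how Puppet lays them out) and pass its text to `missing_templates`; a non-empty result points at a template Logstash would fail to load.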

Thank you!

@colewhite thank you! Would you mind adding notes about how you found and fixed the issue, either here or in https://wikitech.wikimedia.org/wiki/Logs#Beta_cluster?

I am still not seeing MediaWiki application events in beta-logs.wmcloud.org. Maybe that is worth tracking separately; but for now I am re-opening this one. I'm also tagging Quality-and-Test-Engineering-Team, seems relevant to them.

Definitely a different problem.

Restarted rsyslog on deployment-mediawiki11 and I saw some logs come through. Going back through syslog, it seems the kafka output plugin crashed and rsyslog never recovered.

It makes me suspicious about the state of rsyslog on the rest of the deployment hosts. Went ahead and restarted rsyslog on all of deployment prep to see if that would restore things.

Optimistically resolving, but please let us know if you find anything we missed!

@colewhite thanks, looks good at the moment