No entries at all in beta-logs.wmcloud.org since 2023-11-06 Z 12:15:39
Closed, Resolved, Public

Authored By

	Jdforrester-WMF
	Nov 8 2023, 2:55 PM

Description

Noticed whilst debugging T350777. Usual traffic is ~130k entries per three hours.

Details

	Subject	Repo	Branch	Lines +/-
	logstash: beta-logs to use current w3creportingapi template	operations/puppet	production	+1 -1

Related Objects

Mentioned In: T351193: RESTbase broken on beta cluster
T351131: Creating a Wikistory on English betacluster results in a 502 error
Mentioned Here: P53517 (An Untitled Masterwork)
T274593: Logstash beta is not getting any events
T350777: 1.42.0-wmf.4: Structured Data on Wikimedia Commons not longer available

Event Timeline

Jdforrester-WMF created this task. Nov 8 2023, 2:55 PM

Restricted Application added a subscriber: Aklapper. Nov 8 2023, 2:55 PM

Nikerabbit mentioned this in T351131: Creating a Wikistory on English betacluster results in a 502 error. Nov 14 2023, 10:27 AM

TheresNoTime mentioned this in T351193: RESTbase broken on beta cluster. Nov 14 2023, 11:34 AM

kostajh subscribed. Nov 14 2023, 4:36 PM

In the past, restarting Logstash (T274593: Logstash beta is not getting any events) got things working again.

I don't know if this should be considered a train-blocking task (tagging Release-Engineering-Team for triage), but without visibility into MediaWiki errors in the beta cluster we are more likely to find out about issues only at the group0 deployment, if not later.

I logged into logging-logstash-02.logging.eqiad1.wikimedia.cloud and ran systemctl restart logstash.service; hopefully that has fixed this.

@jbond discovered that Logstash is in a crash loop, see

P53517 (An Untitled Masterwork)

Nov 16 13:39:03 logging-logstash-02 systemd[1]: Started logstash.
Nov 16 13:39:03 logging-logstash-02 logstash[2669694]: Using LS_JAVA_HOME defined java: /usr/lib/jvm/java-11-openjdk-amd64.
Nov 16 13:39:03 logging-logstash-02 logstash[2669694]: WARNING: Using LS_JAVA_HOME while Logstash distribution comes with a bundled JDK.
Nov 16 13:39:12 logging-logstash-02 logstash[2669694]: Sending Logstash logs to /var/log/logstash which is now configured via log4j2.properties
Nov 16 13:39:18 logging-logstash-02 logstash[2669694]: warning: thread "Converge PipelineAction::Create<main>" terminated with exception (report_on_exception is true):
Nov 16 13:39:18 logging-logstash-02 logstash[2669694]: LogStash::Error: Don't know how to handle `Java::JavaLang::IllegalStateException` for `PipelineAction::Create<main>`
Nov 16 13:39:18 logging-logstash-02 logstash[2669694]: create at org/logstash/execution/ConvergeResultExt.java:135
Nov 16 13:39:18 logging-logstash-02 logstash[2669694]: add at org/logstash/execution/ConvergeResultExt.java:60
Nov 16 13:39:18 logging-logstash-02 logstash[2669694]: converge_state at /usr/share/logstash/logstash-core/lib/logstash/agent.rb:396
Nov 16 13:39:18 logging-logstash-02 systemd[1]: logstash.service: Main process exited, code=exited, status=1/FAILURE
Nov 16 13:39:18 logging-logstash-02 systemd[1]: logstash.service: Failed with result 'exit-code'.
Nov 16 13:39:18 logging-logstash-02 systemd[1]: logstash.service: Consumed 51.043s CPU time.
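
A restart that appears to succeed can hide a crash loop like the one in the paste above, because the service starts and then exits a few seconds later. The following is a minimal sketch, not part of the original investigation, of how such a loop could be spotted from journal text. It assumes the systemd message format shown in the paste; the function names and the default threshold of 3 are hypothetical.

```python
import re
from collections import Counter

# Matches systemd's "Main process exited" line in the format seen in the paste.
EXIT_RE = re.compile(
    r"systemd\[\d+\]: (?P<unit>\S+\.service): Main process exited, "
    r"code=exited, status=(?P<status>\d+)/(?P<reason>\w+)"
)


def count_failed_exits(journal_lines):
    """Count main-process exits with a non-zero status, per systemd unit."""
    failures = Counter()
    for line in journal_lines:
        match = EXIT_RE.search(line)
        if match and match.group("status") != "0":
            failures[match.group("unit")] += 1
    return failures


def looks_like_crash_loop(journal_lines, unit, threshold=3):
    """True if `unit` failed at least `threshold` times (threshold is an assumption)."""
    return count_failed_exits(journal_lines)[unit] >= threshold
```

The input could be, for example, the lines printed by journalctl -u logstash.service for a recent time window.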

Change 974635 had a related patch set uploaded (by Cwhite; author: Cwhite):

[operations/puppet@production] logstash: beta-logs to use current w3creportingapi template

https://gerrit.wikimedia.org/r/974635

gerritbot added a project: Patch-For-Review. Nov 16 2023, 3:23 PM

Change 974635 merged by Cwhite:

[operations/puppet@production] logstash: beta-logs to use current w3creportingapi template

https://gerrit.wikimedia.org/r/974635

Maintenance_bot removed a project: Patch-For-Review. Nov 16 2023, 3:30 PM

Logstash was crashlooping because it was attempting to load a template that did not exist on the host anymore. Now that it is using the right template, logs are flowing again.
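
The thread does not say how the stale template reference was found. As a hedged illustration only, the sketch below scans Logstash configuration text for template paths and reports those that no longer exist on disk. It assumes the template is referenced through a `template => "..."` setting, as the Elasticsearch output plugin accepts; the function name and the injectable `exists` parameter are hypothetical.

```python
import re
from pathlib import Path

# Matches settings of the form: template => "/some/path.json"
# The word boundary keeps manage_template and template_name from matching.
TEMPLATE_RE = re.compile(r'''\btemplate\s*=>\s*["']([^"']+)["']''')


def missing_templates(config_text, exists=lambda p: Path(p).is_file()):
    """Return template paths referenced in config_text that do not exist.

    `exists` is injectable so the check can be exercised without touching disk.
    """
    return [path for path in TEMPLATE_RE.findall(config_text) if not exists(path)]
```

Running this over each pipeline configuration file before restarting the service would surface a missing template up front rather than through a failed start.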

In T350786#9337868, @colewhite wrote:

Logstash was crashlooping because it was attempting to load a template that did not exist on the host anymore. Now that it is using the right template, logs are flowing again.

Thank you!

@colewhite thank you! Would you mind adding notes about how you found and fixed the issue, either here or in https://wikitech.wikimedia.org/wiki/Logs#Beta_cluster?

I am still not seeing MediaWiki application events in beta-logs.wmcloud.org. Maybe that is worth tracking separately; but for now I am re-opening this one. I'm also tagging Quality-and-Test-Engineering-Team, seems relevant to them.

Definitely a different problem.

Restarted rsyslog on deployment-mediawiki11 and I saw some logs come through. Going back through syslog, it seems the kafka output plugin crashed and rsyslog never recovered.

It makes me suspicious about the state of rsyslog on the rest of the deployment hosts. Went ahead and restarted rsyslog on all of deployment prep to see if that would restore things.

Optimistically resolving, but please let us know if you find anything we missed!
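
For anyone repeating the "going back through syslog" step on another deployment host, here is a rough sketch of one way to pull out rsyslog's own Kafka-related messages. It assumes the traditional syslog line layout (timestamp, host, then "rsyslogd:" or "rsyslogd[pid]:") and that the relevant messages contain the word "kafka" (rsyslog's Kafka output module is named omkafka). The exact message text is not given in this thread, so the filter is deliberately loose, and the function name is hypothetical.

```python
import re

# Traditional syslog layout: "<timestamp> <host> rsyslogd[<pid>]: <message>"
RSYSLOGD_RE = re.compile(r"\srsyslogd(?:\[\d+\])?:\s(?P<message>.*)")


def rsyslog_kafka_messages(syslog_lines):
    """Return rsyslogd's own messages that mention Kafka, in file order."""
    hits = []
    for line in syslog_lines:
        match = RSYSLOGD_RE.search(line)
        if match and "kafka" in match.group("message").lower():
            hits.append(match.group("message"))
    return hits
```

An empty result does not prove the output is healthy; it only means no matching message was logged in the lines examined.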

In T350786#9341449, @colewhite wrote:

Definitely a different problem.

Restarted rsyslog on deployment-mediawiki11 and I saw some logs come through. Going back through syslog, it seems the kafka output plugin crashed and rsyslog never recovered.

It makes me suspicious about the state of rsyslog on the rest of the deployment hosts. Went ahead and restarted rsyslog on all of deployment prep to see if that would restore things.

Optimistically resolving, but please let us know if you find anything we missed!

@colewhite thanks, looks good at the moment

lmata moved this task from Inbox to Done on the SRE Observability (FY2023/2024-Q2) board. Jan 26 2024, 1:08 AM

	F41514048: image.png
	Nov 17 2023, 5:00 PM
