Description
Noticed whilst debugging T350777. Usual traffic is ~130k per three hours.
Details
Subject | Repo | Branch | Lines +/-
---|---|---|---
logstash: beta-logs to use current w3creportingapi template | operations/puppet | production | +1 -1
Related Objects
- Mentioned In
  - T351193: RESTbase broken on beta cluster
  - T351131: Creating a Wikistory on English betacluster results in a 502 error
- Mentioned Here
  - P53517 (An Untitled Masterwork)
  - T274593: Logstash beta is not getting any events
  - T350777: 1.42.0-wmf.4: Structured Data on Wikimedia Commons not longer available
Event Timeline
In the past, restarting Logstash (T274593: Logstash beta is not getting any events) got things working again.
I don't know whether this should be considered a train-blocking task (tagging Release-Engineering-Team for triage), but not having visibility into MediaWiki errors in the beta cluster increases the likelihood that we only find out about issues during group0 deployment, if not later.
I logged into logging-logstash-02.logging.eqiad1.wikimedia.cloud and ran systemctl restart logstash.service. Hopefully that has fixed this.
Change 974635 had a related patch set uploaded (by Cwhite; author: Cwhite):
[operations/puppet@production] logstash: beta-logs to use current w3creportingapi template
Change 974635 merged by Cwhite:
[operations/puppet@production] logstash: beta-logs to use current w3creportingapi template
Logstash was crashlooping because it was attempting to load a template that did not exist on the host anymore. Now that it is using the right template, logs are flowing again.
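For anyone who hits a similar crashloop, a check along these lines might have caught the missing template earlier. This is only a rough sketch, not how the issue was actually found: it assumes the template is referenced through a `template => "..."` setting (as in Logstash's elasticsearch output), and the function names and the sample path are made up for illustration.

```python
import os
import re
from typing import Callable, List

# Matches a `template => "/some/path.json"` setting. Settings such as
# `template_name` or `manage_template` are deliberately not matched.
TEMPLATE_RE = re.compile(r'^\s*template\s*=>\s*["\']([^"\']+)["\']', re.MULTILINE)


def referenced_templates(config_text: str) -> List[str]:
    """Return every template path referenced in the given Logstash config text."""
    return TEMPLATE_RE.findall(config_text)


def missing_templates(
    config_text: str, exists: Callable[[str], bool] = os.path.exists
) -> List[str]:
    """Return the referenced template paths that do not exist on this host."""
    return [path for path in referenced_templates(config_text) if not exists(path)]


if __name__ == "__main__":
    # Hypothetical config fragment; the path is a placeholder, not the real one.
    sample = '''
    output {
      elasticsearch {
        template      => "/etc/logstash/templates/example.json"
        template_name => "example"
      }
    }
    '''
    for path in missing_templates(sample):
        print(f"missing template: {path}")
```

Running something like this against the rendered config before restarting Logstash would flag a template file that has been removed from the host while the config still points at it.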
@colewhite thank you! Would you mind adding notes about how you found and fixed the issue, either here or in https://wikitech.wikimedia.org/wiki/Logs#Beta_cluster?
I am still not seeing MediaWiki application events in beta-logs.wmcloud.org. Maybe that is worth tracking separately, but for now I am re-opening this one. I'm also tagging Quality-and-Test-Engineering-Team, as this seems relevant to them.
Definitely a different problem.
Restarted rsyslog on deployment-mediawiki11 and I saw some logs come through. Going back through syslog, it seems the kafka output plugin crashed and rsyslog never recovered.
It makes me suspicious about the state of rsyslog on the rest of the deployment hosts. Went ahead and restarted rsyslog on all of deployment prep to see if that would restore things.
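The blanket restart could be scripted roughly like this. It is a sketch only: it assumes plain ssh access with sudo on each host, the unit name rsyslog.service is an assumption, and the host list has to be supplied by the caller (only deployment-mediawiki11 is named in this task). The actual restart may well have been done with different tooling.

```python
import subprocess
from typing import Callable, Dict, Iterable, List


def restart_command(host: str, unit: str = "rsyslog.service") -> List[str]:
    """Build the argv for restarting a systemd unit on a remote host over ssh."""
    return ["ssh", host, "sudo", "systemctl", "restart", unit]


def restart_everywhere(
    hosts: Iterable[str], run: Callable = subprocess.run
) -> Dict[str, bool]:
    """Restart rsyslog on each host; map host name to whether the restart succeeded."""
    results: Dict[str, bool] = {}
    for host in hosts:
        # check=False so that one failing host does not abort the whole loop.
        completed = run(restart_command(host), check=False)
        results[host] = completed.returncode == 0
    return results


if __name__ == "__main__":
    # Hypothetical invocation; extend the list with the other deployment-prep hosts.
    for host, ok in restart_everywhere(["deployment-mediawiki11"]).items():
        print(f"{host}: {'restarted' if ok else 'FAILED'}")
```

Collecting a per-host result rather than stopping at the first error makes it easier to see which hosts still need a manual look afterwards.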
Optimistically resolving, but please let us know if you find anything we missed!