Page MenuHomePhabricator

Flow internal error on frwiki not in logstash
Open, Needs TriagePublic

Description

We're debugging an internal error of Flow/StructuredDiscussions on frwiki (T371582), and we noticed that the internal error that we're seeing in the UI apparently can not be found in logstash:

[f35cef65-cef5-40c5-ba91-1672a4263061] 2024-08-01 10:22:20: Fatal exception of type "Flow\Exception\InvalidDataException"

image.png (350×998 px, 34 KB)

This is bad because we might miss errors from Flow that are not being highlighted to us by the Community or that we notice on-wiki ourselves.

Event Timeline

Restricted Application added a subscriber: Aklapper. · View Herald Transcript

@DMburugu FYI, this is a fairly problematic maintenance task that we should likely look into fairly soon. If some errors are missing from Logstash, then we might be already missing other (possibly important) errors, and not knowing about it. I'm not sure if Observability can help with tackling this – I'm not quite sure on where to start investigating from, as I'm not overly familiar with the logging infrastructure we employ.

@Urbanecm_WMF I looked for the exception ID on mwlog and wasn't able to find it there either, were you or maybe I'm looking in the wrong file?

mwlog1002:~$ zgrep -F f35cef65-cef5-40c5-ba91-1672a4263061 /srv/mw-log/archive/exception.log-20240731.gz /srv/mw-log/archive/exception.log-20240801.gz 
mwlog1002:~$

@Urbanecm_WMF I looked for the exception ID on mwlog and wasn't able to find it there either, were you or maybe I'm looking in the wrong file?

No, I wasn't able to find that exception on mwlog either. I resorted to temporarily changing MediaWiki configuration to include a stacktrace in its response (which enabled me to investigate the problem). I copied the trace I received that way to T371582#10034787, if you're interested. But, even for that other exception id, I was unable to locate it in our logs.

@Urbanecm_WMF I looked for the exception ID on mwlog and wasn't able to find it there either, were you or maybe I'm looking in the wrong file?

No, I wasn't able to find that exception on mwlog either. I resorted to temporarily changing MediaWiki configuration to include a stacktrace in its response (which enabled me to investigate the problem). I copied the trace I received that way to T371582#10034787, if you're interested. But, even for that other exception id, I was unable to locate it in our logs.

Ok thank you, it seems to me that mw didn't log the exception as such at all, and thus we can't find it either in mwlog nor logstash.

I wanted to confirm and make sure this is not a problem with the logging pipeline itself, rather with mw not logging the exception as expected, does that make sense ?

@Urbanecm_WMF Can we tag editing on this so that they are aware as well? We can facilitate a fix and it would be great for @VPuffetMichel to also be in the loop since editing is leading the efforts around flows sunset.

Ok thank you, it seems to me that mw didn't log the exception as such at all, and thus we can't find it either in mwlog nor logstash.

Unless there's some kind of step between mwlog/logstash and MediaWIki that could fail, this sounds plausible.

I wanted to confirm and make sure this is not a problem with the logging pipeline itself, rather with mw not logging the exception as expected, does that make sense ?

I'm not sure where the problem lies, to be honest. Something between MediaWiki and mwlog/logstash (inclusive) fails, but I'm not sure what that is. Adding MediaWiki-Debug-Logger as the relevant MediaWiki tag as well.

@Urbanecm_WMF Can we tag editing on this so that they are aware as well? We can facilitate a fix and it would be great for @VPuffetMichel to also be in the loop since editing is leading the efforts around flows sunset.

I don't see why not :). Done.

Good to know. We are about to start on a spike to Investigate Flow automatic migration approaches with T371738. I also tagged this task as subtask for the Epic T332022: Undeploying Flow.
Let us know what progress you make.

Michael moved this task from Inbox to Current Maintenance Focus on the Growth-Team board.

This seems to be the only place in non-maintenance code where this exception is caught:

includes/Actions/FlowAction.php
		} catch ( InvalidDataException $e ) {
			// FIXME: This isn't a real solution to the problem.
			// Pretend that we aren't generating 500 errors here by swallowing the
			// error, removing some log spam and avoiding pings to SRE and others
			// looking at the production logs.
			// The actual fix would be to prevent users from getting in a stuck
			// position with their user talk pages.
			MWExceptionRenderer::output( $e, MWExceptionRenderer::AS_PRETTY );
		}

Change #1101100 had a related patch set uploaded (by Urbanecm; author: Urbanecm):

[mediawiki/extensions/Flow@master] Revert "Catch InvalidDataException errors to avoid log spam"

https://gerrit.wikimedia.org/r/1101100

Thanks @Michael. I think the best course of action is to revert that change, as it is making it incredibly difficult to even attempt some rootcausing. Uploaded a revert of the original change.

Ideally, this should pretty much fix itself by the Flow deprecation.

@kostajh @Catrope As the engineers who made the original chance, CCing you, in case you have any concerns with the revert.

Change #1101100 merged by jenkins-bot:

[mediawiki/extensions/Flow@master] Revert "Catch InvalidDataException errors to avoid log spam"

https://gerrit.wikimedia.org/r/1101100