
Generate exception ID unique to the error (but not to the request)
Closed, Declined (Public)

Description

MediaWiki exceptions are presented to the user with a webrequest ID; without access to Logstash that's not useful. Instead, generate an ID from the error message and stack trace (or maybe just the stack trace) to provide an alternative identifier that they can search Phabricator for.

Event Timeline

The use case for the ID is that, when an error is reported (via Phabricator, mailing lists, IRC, etc.), we can find it on Logstash or mwlog1001 with the full exception message, stack trace, and other private details from the request. Since 5360a3497f31, the same ID also finds other error messages or warnings from the same web request.

The end user does not even get to see the full error message, nor any stack trace or other private details. They only get "An error happened", the randomly generated ID, and sometimes the exception class name. Reporting the ID allows us to find the exact instance of that error.

After 5360a3497f31, it also enabled finding other error messages and warnings emitted from the same request, which can help understand the root cause or related impact.

Are you suggesting we go back to having a unique ID for each error/exception trigger within a request, or are you suggesting a shared ID for each distinct error message (which would lead to the same ID when triggered a second time or by another user)?

Keeping the unique error instance ID and the request ID separate seems useful. That would, for example, make it easy to find out whether the reported error is missing from the logs while other errors from the same request are present: we'd find matches for the request ID, but not for the error ID.

However, given the use case of searching Phabricator, I guess you meant the latter. Creating a deterministic ID for related errors seems hard to get right, but could be worth trying. The same logical error can be triggered from multiple different stacks (if non-deterministic), and logically different errors could have the same error message (if triggered by multiple code paths making a similar mistake). We do have normalised_message, which helps deduplicate messages without considering their parameters. That could be a starting point, maybe with a limited depth of the stack added to the hash?
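To make the idea concrete, here is a minimal sketch of hashing a normalised message together with a limited number of stack frames. It is written in Python for illustration only and is not MediaWiki code: the function name `error_fingerprint`, the `(file, function)` frame shape, the default depth of 3, and the 12-character truncation are all assumptions made for this example.

```python
import hashlib


def error_fingerprint(normalised_message, frames, depth=3):
    """Return a short, deterministic ID for a logical error.

    normalised_message: the message template without interpolated
        parameters, e.g. "Could not load revision {revid}".
    frames: list of (file, function) tuples, innermost frame first.
        Line numbers are left out on purpose so that unrelated edits
        to a file do not change the ID (an assumption of this sketch).
    depth: how many innermost frames contribute to the hash.
    """
    parts = [normalised_message]
    parts += [f"{file}:{function}" for file, function in frames[:depth]]
    digest = hashlib.sha256("\n".join(parts).encode("utf-8")).hexdigest()
    # Truncated so the ID is short enough to paste into a bug report.
    return digest[:12]


# Hypothetical frames: identical near the error, different at the entry point.
via_index = [("Foo.php", "Foo::bar"), ("Baz.php", "Baz::run"), ("index.php", "main")]
via_api = [("Foo.php", "Foo::bar"), ("Baz.php", "Baz::run"), ("api.php", "main")]

template = "Could not load revision {revid}"
same = error_fingerprint(template, via_index, depth=2) == error_fingerprint(template, via_api, depth=2)
differ = error_fingerprint(template, via_index, depth=3) != error_fingerprint(template, via_api, depth=3)
```

With a depth of 2, both entry points produce the same ID; with a depth of 3, they diverge. That is the trade-off described above: a shallow depth merges more call paths into one ID, while a deeper one splits the same logical error across several IDs.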

Change 654955 had a related patch set uploaded (by Krinkle; owner: Krinkle):
[mediawiki/core@master] exception: Remove "exception_id" key in favour of reqId

https://gerrit.wikimedia.org/r/654955

However, given the use case of searching Phabricator, I guess you meant the latter. Creating a deterministic ID for related errors seems hard to get right, but could be worth trying. The same logical error can be triggered from multiple different stacks (if non-deterministic), and logically different errors could have the same error message (if triggered by multiple code paths making a similar mistake). We do have normalised_message, which helps deduplicate messages without considering their parameters. That could be a starting point, maybe with a limited depth of the stack added to the hash?

Yeah, that's what I had in mind. That doesn't replace the request ID, but it would make error reporting more convenient: currently the user has to report the error and someone has to look it up in Logstash before deduplication is possible; a simple search for the error ID could replace that. Granted, we don't seem to have that many duplicates in practice...

I agree that a completely unique error ID would be useful too; it would make locating the Logstash record a little more convenient. (There are probably also some very fringe situations where the request ID is not that helpful, like in jobs or maintenance scripts.)

Change 654955 merged by jenkins-bot:
[mediawiki/core@master] exception: Remove "exception_id" key in favour of reqId

https://gerrit.wikimedia.org/r/654955

We don't have an ID that's determined by the error message / trace, so the task should either be declined or left open.

I meant to decline indeed, sorry about that. Regarding your comment elsewhere, entries in Logstash do have permalinkable unique IDs for each message, for cases where, e.g., a request has too many messages in it to be easily linkable. A somewhat stable ID to represent the logical/normalised error seems useful as well; we could file a new task for that and/or re-open this one. Right now we have normalised_message for that, which works fairly well in Logstash to find which issues are common and to see how they are distributed (e.g. across wikis, servers, dates, and times). Phatality also has its own concept of a message checksum, though in practice it rarely works.