How to detect errors on VE
Closed, Resolved, Public

Description

I'll be open with you: I am very disappointed that it took about 4 months from the first report, and 5 reports in all, until somebody cared about the HTTP 404 error in the VisualEditor (T230272). 2000 edits per day is not a small number, and if @KewGardens613, I, and my triaging on T224384 (the first report) had been taken seriously, a lot could have been avoided. I thought about this for a day. I see four options to improve the situation (at least for errors in VE and SE):

  • Take reports seriously (will not work, never has)
  • Use AI to detect suspicious logs on critical segments of MediaWiki (must be collected, sorted, and reviewed)
  • Provide a button to send an anonymous crash report (must be collected, sorted, and reviewed, similar to the AI option)
  • Provide more information than just "HTTP 404" as the error (a line of a script, a function that returned an error, or something else that can be used in phab)

What can be done to avoid a repeat? We need further measures, as the current ones are not sufficient.

Event Timeline

We do take reports seriously. Let me try to offer some explanation as to why nothing happened with this one for months: it's very difficult to solve issues if we have no idea what is causing them, and no way to reproduce them so that someone can observe what exactly happens. The HTTP 404 error was also happening quite rarely, until September 3 (T230272#5503065), which also coincided with the new reports on Phabricator. As soon as someone found a fairly consistent way to reproduce the problem on September 17, I started looking into it (T230272#5501716).

If we don't know what the problem is, we don't have any reasonable guesses, and not very many users are affected, it's just not time-effective to investigate "ghost issues" when we could spend the same effort fixing other problems, where every hour of work means an actual improvement to our application.

This problem is also particularly difficult because it occurs at the intersection of three very different applications (VisualEditor, RESTBase, and MediaWiki API) and there isn't a single person who would be an expert in all three of them.

In reply to your suggestions:

Use AI to detect suspicious logs on critical segments of MediaWiki (must be collected, sorted, and reviewed)

For MediaWiki itself, there's a pretty effective error logging system, although much simpler (AFAIK it just groups errors by the error message). New problems are usually noticed when a new version of MediaWiki is deployed, and filed under Wikimedia-production-error.
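To illustrate what "grouping errors by the error message" amounts to, here is a minimal sketch. It is not the actual production logging code: the entry layout, the field names, and the sample values are all assumptions made up for the example.

```python
from collections import Counter


def group_by_message(log_entries):
    """Count log entries per exact error message, most frequent first."""
    counts = Counter(entry["message"] for entry in log_entries)
    return counts.most_common()


# Hypothetical sample entries; field names and values are illustrative only.
sample = [
    {"message": "HTTP 404", "server": "server-a"},
    {"message": "Database timeout", "server": "server-b"},
    {"message": "HTTP 404", "server": "server-c"},
]

for message, count in group_by_message(sample):
    print(f"{count:>3}  {message}")
```

With grouping this simple, a rare error only stands out once its count becomes large compared with everything else in the list.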

In this case, either our error is not being logged (API errors are usually not a cause for alarm – you will also get them e.g. if your edit is blocked by SpamBlacklist), or the number of errors wasn't big enough to be noticeable. I don't know, sorry.

Provide a button to send an anonymous crash report (must be collected, sorted, and reviewed, similar to the AI option)

Let me just note that in this case nothing is crashing: various components are successfully detecting and reporting an error, so I don't think this would help. But I think folks are working on client-side crash logging: T226986 (I don't know much about the project, sorry again).

My past experience with this (I worked on crash logging for UploadWizard in 2016) is that there is a very high level of noise in such reports, mostly caused by browser extensions injecting broken scripts into our pages, and the errors are also hard to group reasonably because they arrive translated to the user's language. If you want to read more: T137660#2379881

Provide more information than just "HTTP 404" as the error (a line of a script, a function that returned an error, or something else that can be used in phab)

In this case there is no more information to return. RESTBase returns the error 404 when we ask it to convert HTML to wikitext, and that's all we know. Maybe we could have a better error message, but there isn't any hidden debugging information.

Many thanks for the detailed answer, I had not expected that at all. Yes, I can already imagine that with 20 billion requests a day there would be several hundred thousand errors, so filtering them is difficult. It kept nagging at me, so I thought about it further. So, another idea: wouldn't it be realistic to extend selected APIs (e.g. everything that has to do with editing, where errors are critical for the user) in such a way that they output an error ID under which they store the error information centrally for a certain time (e.g. 30 days)?

In the case of the RESTBase API, RESTBase could simply send a token along with the "404". The token can then be included in a report here on phab. The developers could then look at the record belonging to the token to see which log data was recorded. For the RESTBase error, the token would accompany a response like this:

{"errors":[{"code":"apierror-visualeditor-docserver-http","html":"HTTP 404","module":"visualeditoredit"}],"docref":"See https://en.wikipedia.org/w/api.php for API usage. Subscribe to the mediawiki-api-announce mailing list at <https://lists.wikimedia.org/mailman/listinfo/mediawiki-api-announce> for notice of API deprecations and breaking changes.","servedby":"mw1287"}

(and maybe the input needed to reproduce it)

Such a token could be $unixtime-$random-hex-string. To avoid the unlikely case of two identical random strings: $unixtime-$server-$random-hex-string.
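A minimal sketch of this idea, assuming nothing about how RESTBase or MediaWiki would actually implement it: `make_error_token` and `ErrorStore` are hypothetical names, and the in-memory dictionary stands in for whatever central storage would really be used.

```python
import secrets
import time


def make_error_token(server, now=None, random_bytes=8):
    """Build a token of the form <unixtime>-<server>-<random hex string>."""
    unixtime = int(time.time() if now is None else now)
    return f"{unixtime}-{server}-{secrets.token_hex(random_bytes)}"


class ErrorStore:
    """In-memory stand-in for a central store with a retention period."""

    def __init__(self, retention_seconds=30 * 24 * 3600):  # 30 days
        self.retention_seconds = retention_seconds
        self._records = {}

    def save(self, token, details, now=None):
        stored_at = time.time() if now is None else now
        self._records[token] = (stored_at, details)

    def lookup(self, token, now=None):
        """Return the stored details, or None if unknown or expired."""
        current = time.time() if now is None else now
        entry = self._records.get(token)
        if entry is None:
            return None
        stored_at, details = entry
        if current - stored_at > self.retention_seconds:
            del self._records[token]  # drop expired records lazily
            return None
        return details


store = ErrorStore()
reported_at = 1570000000  # fixed timestamp so the example is repeatable
token = make_error_token("mw1287", now=reported_at)
details = {"code": "apierror-visualeditor-docserver-http", "html": "HTTP 404"}
store.save(token, details, now=reported_at)
print(token)
print(store.lookup(token, now=reported_at + 24 * 3600))
```

A user would paste the printed token into the task, and a developer would call `lookup` with it while the record is still within the retention period.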

If this were implemented, you might still have no known way to reproduce the error, but you would know where it failed and, with the input, maybe also why.

I'm sorry that I didn't reply earlier!

We have something similar to what you're thinking of: they're called "request IDs" and look like this: "XamB5wpAME4AAJJNymgAAABN" (I don't know how exactly they are generated). Every HTTP request has a request ID, and any errors or debugging data are logged with that ID, but they are only shown to users in case of PHP exceptions, and used by developers to look up details of the exception using https://wikitech.wikimedia.org/wiki/Logstash#Web_interface (for an example, see T235589).

I think we don't show them more widely because usually there is just nothing logged there. For example, I tried to look up "XamB5wpAME4AAJJNymgAAABN" and got no results (I got the ID by using mw.config.get('wgRequestId') in the browser console when viewing a page, that's another place where they are shown).

We don't want to log everything (for example, it would be bad if we accidentally logged someone's password or something; also, I have no idea how much data that system can handle), and logging for specific things is usually only added when needed (like I ended up doing for the HTTP 404 issue, e.g. https://gerrit.wikimedia.org/r/c/mediawiki/extensions/VisualEditor/+/540233). So even if we displayed the request ID with all error messages, it might not actually be useful for learning anything about obscure errors.
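To make the request-ID idea concrete, here is a small sketch using Python's standard `logging` module. It is not how MediaWiki or Logstash work internally: `InMemoryIndex` and `logger_for_request` are hypothetical names, and the in-memory index stands in for a searchable log backend.

```python
import logging
from collections import defaultdict


class InMemoryIndex(logging.Handler):
    """Collects log messages keyed by request ID (stand-in for a log backend)."""

    def __init__(self):
        super().__init__()
        self.by_request = defaultdict(list)

    def emit(self, record):
        request_id = getattr(record, "request_id", None)
        self.by_request[request_id].append(record.getMessage())


def logger_for_request(base_logger, request_id):
    """Wrap a logger so every record it emits carries the request ID."""
    return logging.LoggerAdapter(base_logger, {"request_id": request_id})


index = InMemoryIndex()
base = logging.getLogger("sketch.requests")
base.setLevel(logging.INFO)
base.addHandler(index)
base.propagate = False  # keep the example's output out of the root logger

log = logger_for_request(base, "XamB5wpAME4AAJJNymgAAABN")
log.warning("RESTBase returned HTTP 404")

# A request that logged something can be looked up by its ID...
print(index.by_request["XamB5wpAME4AAJJNymgAAABN"])
# ...while an ID for which nothing was logged yields no results.
print(index.by_request.get("some-other-id", []))
```

The second lookup shows the limitation described above: an ID is only useful if the code path in question actually logged something under it.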

Not sure if this is a satisfying explanation, but I hope it helps… :)

Provide more information than just "HTTP 404" as the error (a line of a script, a function that returned an error, or something else that can be used in phab)

In this case there is no more information to return. RESTBase returns the error 404 when we ask it to convert HTML to wikitext, and that's all we know. Maybe we could have a better error message, but there isn't any hidden debugging information.

I thought that maybe we can do at least a little to help with this. I submitted a small patch to change the error message to "Error contacting the Parsoid/RESTBase server (HTTP 404)": https://gerrit.wikimedia.org/r/c/mediawiki/extensions/VisualEditor/+/544152 – no new debugging information, but just a better message so that it's easier for us to understand bug reports, and easier for users to find related tasks in Phabricator if they get the error.

Der_Keks claimed this task.

I think we don't show them more widely because usually there is just nothing logged there.

Yeah, that's the never-ending challenge of deciding what should be logged and what shouldn't. And of course it takes resources to log this more-or-less-defined "everything".

I submitted a small patch to change the error message to "Error contacting the Parsoid/RESTBase server (HTTP 404)":

That's a nice alternative. If it throws such an error, you can more quickly and precisely do something like https://gerrit.wikimedia.org/r/#/c/mediawiki/extensions/VisualEditor/+/540233/ to discover the problem.

So thank you for discussing this. While grass grows over the case, it seems to be a good compromise between resource usage and the comprehensibility of what MW is doing, as logging can be enabled purposefully. I hope such error messages are implemented everywhere; otherwise we cannot locate the responsible scripts and therefore cannot enable targeted logging.

I'll mark this as resolved; for me, all in all, it's satisfying (remember the grass).