
[SPIKE] Investigate Check/Suggestion errors that prevent them from being shown
Closed, Resolved · Public · 2 Estimated Story Points

Description

T420249 implemented the logging necessary for us to log cases when edit Checks and Suggestions could not be shown.

This task involves the work of investigating what could explain why we are seeing cases where Paste and TextMatch Checks are failing to generate:

image.png (1×2 px, 296 KB)

Requirements

  • Document what (if any) patterns you see among cases where Paste and TextMatch Checks are failing to generate
    • Documented in comments
  • Document hypotheses that could explain cases where Paste and TextMatch Checks are failing to generate
    • Documented in comments
  • File any follow-on ticket(s) that might be needed to stem these errors

Links

https://thanos.wikimedia.org/graph?g0.expr=mediawiki_editcheck_errors_total&g0.tab=0&g0.stacked=0&g0.range_input=2d&g0.max_source_resolution=0s&g0.deduplicate=1&g0.partial_response=1&g0.store_matches=%5B%5D&g0.engine=prometheus&g0.analyze=0&g0.tenant=


Thank you to @DLynch for spotting the need for this task.

Event Timeline

It's possible we'll need more logging for this -- we're currently just logging a count of "sessions in which this check experienced an error". So we know that a given type of check is erroring, but absolutely no details about why. So if investigating the code for the check in the knowledge that an error is occurring doesn't turn anything up, we might want to go and e.g. start pouring the actual errors into the normal site JS error logging.
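To make the "log the actual errors" idea concrete, here is a minimal sketch, not the EditCheck implementation. `runCheckSafely`, `CheckErrorRecord`, and `ErrorSink` are hypothetical names introduced for illustration; the sink is the point where the site's JS error logging could be plugged in, and that wiring is not shown.

```typescript
// Hypothetical record shape; none of these names come from EditCheck itself.
interface CheckErrorRecord {
  check: string;    // name of the check that failed, e.g. "textMatch"
  message: string;  // the error message, which the current counter discards
  stack?: string;   // stack trace, when the runtime provides one
}

// Assumption: whatever receives the record (error logging, console, tests).
type ErrorSink = (record: CheckErrorRecord) => void;

// Runs a check's generator and reports the thrown error with its details,
// rather than only counting that an error happened. Returns an empty list
// on failure so the remaining checks can still run.
function runCheckSafely<T>(
  check: string,
  generate: () => T[],
  sink: ErrorSink
): T[] {
  try {
    return generate();
  } catch (e) {
    const err = e instanceof Error ? e : new Error(String(e));
    sink({ check, message: err.message, stack: err.stack });
    return [];
  }
}
```

The existing per-session counter could stay as it is; the sink would only add the message and stack that are currently missing.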

ppelberg set the point value for this task to 2. (Mar 30 2026, 5:43 PM)

Per planning meeting, we're going to allocate no more than 1 day to start.

ppelberg lowered the priority of this task from High to Medium. (Mar 31 2026, 9:32 PM)

Ok, I did some digging. Here's a summary of my observations so far. I'll continue to dig today/tomorrow.

More than 1000 per day:

Tone

  • Highest error rate across all checks. Upwards of 5000 per day since April 2. I suspect it's because of rate-limiting.

Disambiguation

  • Second highest error rate across all checks. Upwards of 1400 per day since April 2. I haven't been able to catch this in the wild yet.

Around 200 per day:

TextMatch

  • I've seen errors intermittently, and can't reproduce them twice on the same article. But when I do observe them, it's because of an error in importing a match item config that's defined elsewhere. I saw this on enwiki with the MediaWiki:Editcheck-config-textmatch-british-english.json config and on ruwiki with MediaWiki:Editcheck-config-LLM.json.
    • I don't know yet why it's occasionally failing to import them, but I suspect the errors we're seeing are mostly because of this.

RequiredTemplateParam

Other:

ExternalLink and Paste

  • These have had a few (~20) errors per day, but I haven't seen them in the wild and haven't been able to characterize them yet.

One note: it'd be helpful if we could add wiki as a label on the metric.

Ok, I've filed T423021 for TextMatch and T423022 for RequiredTemplateParams.

The few times that I did encounter errors with Disambiguation, they seemed to be caused by a network timeout or some other network error in the API call. I don't think it's concerning that the API call occasionally rejects, but maybe we could add some re-queue logic to ApiResponseCache for the case of an 'http' code. That way it wouldn't just return the same rejected promise every time. Not sure though - it'd be helpful to get input from another engineer on this.
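One way the re-queue idea could look, as a rough sketch under stated assumptions: `RetryingPromiseCache` is a hypothetical stand-in, not the real ApiResponseCache, and `fetcher` and `isRetryable` are illustrative parameters. The cache drops a rejected promise so that the next lookup issues a fresh request instead of replaying the old rejection.

```typescript
// Hypothetical stand-in for a promise cache; not the real ApiResponseCache.
class RetryingPromiseCache<T> {
  private cache = new Map<string, Promise<T>>();

  constructor(
    // Assumption: performs the actual request for a key.
    private fetcher: (key: string) => Promise<T>,
    // Assumption: decides which rejections are transient (for example an
    // 'http' code). Defaults to treating every rejection as retryable.
    private isRetryable: (reason: unknown) => boolean = () => true
  ) {}

  get(key: string): Promise<T> {
    const cached = this.cache.get(key);
    if (cached) {
      return cached;
    }
    const promise = this.fetcher(key);
    this.cache.set(key, promise);
    // Evict on a retryable failure so a later get() retries the request.
    // The identity check avoids deleting a newer entry for the same key.
    promise.catch((reason: unknown) => {
      if (this.isRetryable(reason) && this.cache.get(key) === promise) {
        this.cache.delete(key);
      }
    });
    return promise;
  }
}
```

The current caller still sees the rejection; only subsequent lookups get a new attempt. This sketch does not retry automatically or back off, so a check that calls `get()` only once per session would still need its own retry.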