
Log Check generation errors to the server
Closed, Resolved | Public | 2 Estimated Story Points

Description

When there's an error while generating checks it results in local logging to the console, but no persistent logging. This means we don't have any visibility into what checks are experiencing errors, outside of when a developer is paying attention to a specific pageload.

This would be useful in general for knowing about edge cases that cause unexpected issues, and would have resulted in us detecting issues like T418173 much more quickly.

Because of the architecture of editcheck, where we're constantly regenerating the list of checks, suggestions would be likely to produce a lot of repeat errors if we logged each error individually. As such, it would probably be best to track "edit sessions in which errors occurred" rather than absolute error counts.

I thus propose that we track a counter, which is incremented once in each session where errors of a check type are seen. Naming would be stats.mediawiki_editcheck_errors_total with a payload of {kind: '[check-name]'}. (Or something equivalent -- I need to double-check what we can easily summarize together, since I presume we'd want this on our dashboard.)
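As a rough illustration of the "once per session, per check type" idea, here is a minimal sketch. It is not the actual EditCheckFactory code: `createSessionErrorCounter`, `recordError`, and `reset` are hypothetical names, and the `track` function is injected so the example runs standalone. In MediaWiki it would presumably be something like `mw.track`, but that wiring is an assumption here.

```javascript
// Hypothetical helper, for illustration only (not the VisualEditor implementation).
// `track` is an injected function that forwards a counter increment somewhere.
function createSessionErrorCounter( track ) {
	// Check names that have already been counted during this edit session.
	const seenKinds = new Set();

	return {
		// Record an error for a check. Emits at most one increment per check name
		// and returns whether an increment was actually sent.
		recordError( checkName ) {
			if ( seenKinds.has( checkName ) ) {
				return false;
			}
			seenKinds.add( checkName );
			track( 'stats.mediawiki_editcheck_errors_total', 1, { kind: checkName } );
			return true;
		},
		// Call when a new edit session starts so errors can be counted again.
		reset() {
			seenKinds.clear();
		}
	};
}

// Usage: a stand-in track function that collects calls instead of sending them.
const sent = [];
const counter = createSessionErrorCounter( ( topic, value, labels ) => {
	sent.push( { topic, value, labels } );
} );

// Repeated errors from the same check only count once.
for ( const name of [ 'tone', 'tone', 'reference', 'tone' ] ) {
	counter.recordError( name );
}
console.log( sent.length ); // 2
```

Keeping the "already seen" set in memory means the count resets naturally when the page (and so the session) goes away, which matches tracking sessions with errors rather than absolute error counts.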

Event Timeline

ACTION: @DLynch to populate task description with requirements

ppelberg set the point value for this task to 2 (Mar 18 2026, 5:40 PM).

Change #1255054 had a related patch set uploaded (by DLynch; author: DLynch):

[mediawiki/extensions/VisualEditor@master] EditCheckFactory: count check-generation errors once per session

https://gerrit.wikimedia.org/r/1255054

Change #1255054 merged by jenkins-bot:

[mediawiki/extensions/VisualEditor@master] EditCheckFactory: count check-generation errors once per session

https://gerrit.wikimedia.org/r/1255054

DLynch moved this task from Code Review to QA on the Editing-team (Editing-18Mar-27Mar-2026) board.

This one may have to involve waiting for the train and then checking for network requests and/or checking Prometheus.

I'm stepping in since this is a high priority ticket and Rummana is out. Apologies if I'm missing something simple.

Out of curiosity: Why was this moved to High Priority? Aside from it being ready to test currently and seeming like a fairly small surface area to test, I'm not seeing anything that makes this more urgent.

I created a PatchDemo environment for change 1255054 to investigate this.

I triggered tone check on a visual editor sentence. The initial call was to edit-check:predict with the following response:

{"message":"","batchId":"f303f1c8-339b-4484-822e-81f0c223967e","predictions":[{"check_type":"tone","details":{},"language":"en","model_name":"edit-check","model_version":"v1","page_title":"New York City","prediction":true,"probability":0.86,"status_code":200}]}

This triggered the Revise Tone check.

When it was resolved, there was a separate network call to edit-check:predict with this response:

{"message":"","batchId":"c73142ff-94d1-4307-859a-ef9fa7f5e327","predictions":[{"check_type":"tone","details":{},"language":"en","model_name":"edit-check","model_version":"v1","page_title":"New York City","prediction":false,"probability":0.574,"status_code":200}]}
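For anyone comparing the two responses, a small sketch of how they could be read is below. It only looks at the `check_type`, `status_code`, and `prediction` fields visible in the payloads above. `isToneFlagged` is a hypothetical helper, and how EditCheck itself interprets the response (for example, whether it applies its own `probability` threshold) is not something this task states.

```javascript
// The two response bodies quoted above, parsed as JSON.
const firstResponse = JSON.parse( '{"message":"","batchId":"f303f1c8-339b-4484-822e-81f0c223967e","predictions":[{"check_type":"tone","details":{},"language":"en","model_name":"edit-check","model_version":"v1","page_title":"New York City","prediction":true,"probability":0.86,"status_code":200}]}' );
const secondResponse = JSON.parse( '{"message":"","batchId":"c73142ff-94d1-4307-859a-ef9fa7f5e327","predictions":[{"check_type":"tone","details":{},"language":"en","model_name":"edit-check","model_version":"v1","page_title":"New York City","prediction":false,"probability":0.574,"status_code":200}]}' );

// Hypothetical helper: true if any successful tone prediction came back positive.
function isToneFlagged( response ) {
	return response.predictions.some(
		( p ) => p.check_type === 'tone' && p.status_code === 200 && p.prediction === true
	);
}

console.log( isToneFlagged( firstResponse ) ); // true
console.log( isToneFlagged( secondResponse ) ); // false
```

Both responses are successful predictions (status 200), so neither would be expected to produce a check-generation error.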

On full save and publish there was a call to events_for_edit (for reference:

curl 'https://09a5e120dc.catalyst.wmcloud.org/w/rest.php/campaignevents/v0/participant/self/events_for_edit' \
  -H 'accept: */*' \
  -H 'accept-language: en-US,en;q=0.9' \
  -b 'VEE=visualeditor; wiki_09a5e120dc__main_session=6r2frh1ghuvsqfa9v1q4ns0o25vlfuv1; wiki_09a5e120dc__mainUserID=8; wiki_09a5e120dc__mainUserName=~2026-1; wiki_09a5e120dc__mainToken=b94ae27279467ec1653b9ac847ed7527; UseDC=master; cpPosIndex=2%401774556759%23f947da77b15042931f2be8923b0781a1' \
  -H 'priority: u=1, i' \
  -H 'referer: https://09a5e120dc.catalyst.wmcloud.org/w/index.php?title=New_York_City&ecenable=2' \
  -H 'sec-ch-ua: "Chromium";v="146", "Not-A.Brand";v="24", "Google Chrome";v="146"' \
  -H 'sec-ch-ua-mobile: ?0' \
  -H 'sec-ch-ua-platform: "macOS"' \
  -H 'sec-fetch-dest: empty' \
  -H 'sec-fetch-mode: cors' \
  -H 'sec-fetch-site: same-origin' \
  -H 'user-agent: Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/146.0.0.0 Safari/537.36' \
  -H 'x-requested-with: XMLHttpRequest')

with the following response:

[]

I see other API.php calls, but none that match the payload/response here. My next step is to check on enWiki.

> I'm stepping in since this is a high priority ticket and Rummana is out. Apologies if I'm missing something simple.
>
> Out of curiosity: Why was this moved to High Priority? Aside from it being ready to test currently and seeming like a fairly small surface area to test, I'm not seeing anything that makes this more urgent.

I'm glad you asked, @SLong-WMF...

I marked this as high priority because validating this work is a blocker to the start of the Suggestion Mode controlled experiment (T404600), which the Editing Team is working to deploy in the sprint that begins on Monday.

> I see other API.php calls, but none that match the payload/response here. My next step is to check on enWiki.

The trick is that you'd need to actually have an edit check experience an error to see anything related to this, and you can't guarantee that on any particular session (without, say, deliberately creating a check that just throws errors -- which we do in some of our unit tests). It's why I said that just validating that the events are actually flowing into Prometheus might be the simplest approach to this one.
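To make the "check that just throws errors" approach concrete, here is a self-contained sketch of what such a test could look like. Everything in it is a hypothetical stand-in (`generateChecks`, `alwaysThrows`, `healthy`, `countOncePerSession`); it does not reproduce the real unit tests or the EditCheckFactory API.

```javascript
// Hypothetical stand-in for check generation: run every check, and report
// failures to onError instead of letting one broken check stop the others.
function generateChecks( checks, onError ) {
	const results = [];
	for ( const check of checks ) {
		try {
			results.push( ...check.run() );
		} catch ( err ) {
			onError( check.name, err );
		}
	}
	return results;
}

// A check that fails on purpose, plus one that behaves normally.
const alwaysThrows = {
	name: 'alwaysThrows',
	run() {
		throw new Error( 'deliberate failure' );
	}
};
const healthy = {
	name: 'healthy',
	run() {
		return [ 'action from healthy check' ];
	}
};

// Count each failing check name once, however often it fails.
const countedKinds = new Set();
const increments = [];
function countOncePerSession( kind ) {
	if ( !countedKinds.has( kind ) ) {
		countedKinds.add( kind );
		increments.push( kind );
	}
}

// Simulate the check list being regenerated several times in one session.
let lastResults = [];
for ( let i = 0; i < 5; i++ ) {
	lastResults = generateChecks( [ alwaysThrows, healthy ], countOncePerSession );
}

console.log( increments ); // [ 'alwaysThrows' ]
console.log( lastResults ); // [ 'action from healthy check' ]
```

Without a deliberately failing check like this, a manual session only produces an event if a real check happens to error, which is why checking the aggregated metrics is the more practical route.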

Looks like we're getting said events:

[Screenshot: CleanShot 2026-03-27 at 16.06.41@2x.png]

Resolving this task since David verified it.