Add metrics to track async content (Wikifunctions) SLO
Closed, ResolvedPublic

Description

We should add instrumentation code so that we can track, e.g., the number of additional refreshlinks jobs caused by asynchronous content.

Event Timeline

Change #1137061 had a related patch set uploaded (by C. Scott Ananian; author: C. Scott Ananian):

[mediawiki/core@master] Track parses and jobs with asynchronous content

https://gerrit.wikimedia.org/r/1137061

Change #1137061 merged by jenkins-bot:

[mediawiki/core@master] Track parses and jobs with asynchronous content

https://gerrit.wikimedia.org/r/1137061

DSantamaria changed the task status from Open to In Progress. Apr 29 2025, 10:34 AM

To pick up from the meeting, I think you said this would allow us to measure both these SLIs:

Content availability: The minimum percentage of (post-parser-cache) page impressions that trigger Parsoid's parse with at least one Wikifunctions call on them are served fully, without placeholders.
MediaWiki system load: Extra update MediaWiki jobs due to Wikifunctions content triggering from asynchronous content not being ready will represent less than this threshold of total jobs run, including during an outage and recovery from an outage.

Is that so? Is there a metric name in grafana where we can see the data?

"Extra update MediaWiki jobs due to Wikifunctions content" will be the mediawiki_refreshlinks_parsercache_operations_total metric with status=cache_miss and has_async_content=true. This is the total number of refreshlinks jobs with async content. Jobs labelled async_not_ready=true will need to be repeated once the async content is ready, so they are "extra" update jobs. In addition, there will be a few extra update jobs with async_not_ready=false when entries fall out of the parser cache, due to the way we currently handle updating async content. So the number of "extra" jobs lies between a lower bound of the number of jobs with async_not_ready=true and an upper bound of the number of jobs with has_async_content=true. We can refine this metric further if/when the upper bound gets close to our SLI limit.
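To make the bounds concrete, here is a minimal Python sketch of the arithmetic. It assumes the counter has already been fetched as a list of (labels, value) pairs; the function name, the sample layout, the `other` status value and the numbers are all hypothetical, and only the label names come from the description above.

```python
from typing import Iterable, Mapping, Tuple

# One counter sample: (label set, counter value). This layout is an
# assumption for illustration, not the format of any particular client.
Sample = Tuple[Mapping[str, str], float]


def extra_job_bounds(samples: Iterable[Sample]) -> Tuple[float, float]:
    """Return (lower, upper) bounds on "extra" refreshlinks jobs.

    lower: cache-miss jobs with async content that was not ready
           (these must be repeated once the content is ready).
    upper: all cache-miss jobs with async content.
    """
    lower = 0.0
    upper = 0.0
    for labels, value in samples:
        if labels.get("status") != "cache_miss":
            continue
        if labels.get("has_async_content") != "true":
            continue
        upper += value
        if labels.get("async_not_ready") == "true":
            lower += value
    return lower, upper


# Made-up numbers, purely to show the calculation.
example_samples = [
    ({"status": "cache_miss", "has_async_content": "true", "async_not_ready": "true"}, 30.0),
    ({"status": "cache_miss", "has_async_content": "true", "async_not_ready": "false"}, 10.0),
    ({"status": "cache_miss", "has_async_content": "false", "async_not_ready": "false"}, 960.0),
    ({"status": "other", "has_async_content": "true", "async_not_ready": "true"}, 5.0),
]

lower, upper = extra_job_bounds(example_samples)
print(f"extra jobs: between {lower:.0f} and {upper:.0f}")
```

Comparing either bound against the SLI threshold would additionally need the total number of jobs run, which this sketch does not cover.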

Unfortunately, this metric is fixed at 0 right now because we don't use Parsoid to trigger refreshlinks jobs, only legacy parses. Since Wikifunctions (WF) is Parsoid-only at the moment, WF can't affect page metadata, and therefore no extra refreshlinks jobs are currently being created for WF. This will change in the future, though (T393716: [EPIC] RefreshLinksJob should use Parsoid-generated metadata).

The other metric, "number of post-parser-cache page impressions that trigger Parsoid with at least one WF call, that are served without placeholders", can be seen at https://grafana-rw.wikimedia.org/d/cemkpcajzkdfkb/asychronous-content-monitoring and currently hovers around 60%. The Grafana metric is mediawiki_Parsoid_parse_total with has_async_content=true; the percentage comes from comparing the counts with the async_not_ready label set to true versus false.
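As a rough illustration of how that percentage falls out of the labels, the sketch below computes the share of Parsoid parses with async content that were served without placeholders. The function name, the (labels, value) sample layout and the numbers are hypothetical; only the label names are taken from the comment above.

```python
from typing import Iterable, Mapping, Optional, Tuple

# One counter sample: (label set, counter value). Hypothetical layout.
Sample = Tuple[Mapping[str, str], float]


def ready_fraction(samples: Iterable[Sample]) -> Optional[float]:
    """Fraction of parses with async content where the content was ready.

    Only samples with has_async_content=true are counted. Returns None
    when there were no such parses, to avoid dividing by zero.
    """
    ready = 0.0
    total = 0.0
    for labels, value in samples:
        if labels.get("has_async_content") != "true":
            continue
        total += value
        if labels.get("async_not_ready") == "false":
            ready += value
    if total == 0:
        return None
    return ready / total


# Made-up numbers, purely to show the calculation.
parse_samples = [
    ({"has_async_content": "true", "async_not_ready": "false"}, 60.0),
    ({"has_async_content": "true", "async_not_ready": "true"}, 40.0),
    ({"has_async_content": "false", "async_not_ready": "false"}, 900.0),
]

fraction = ready_fraction(parse_samples)
if fraction is not None:
    print(f"served without placeholders: {fraction:.0%}")
```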

> "Extra update MediaWiki jobs due to Wikifunctions content" will be the mediawiki_refreshlinks_parsercache_operations_total metric with status=cache_miss and has_async_content=true. This is the total number of refreshlinks jobs with async content. Jobs labelled async_not_ready=true will need to be repeated once the async content is ready, so they are "extra" update jobs. In addition, there will be a few extra update jobs with async_not_ready=false when entries fall out of the parser cache, due to the way we currently handle updating async content. So the number of "extra" jobs lies between a lower bound of the number of jobs with async_not_ready=true and an upper bound of the number of jobs with has_async_content=true. We can refine this metric further if/when the upper bound gets close to our SLI limit.

> Unfortunately, this metric is fixed at 0 right now because we don't use Parsoid to trigger refreshlinks jobs, only legacy parses. Since Wikifunctions (WF) is Parsoid-only at the moment, WF can't affect page metadata, and therefore no extra refreshlinks jobs are currently being created for WF. This will change in the future, though (T393716: [EPIC] RefreshLinksJob should use Parsoid-generated metadata).

Ack.

> The other metric, "number of post-parser-cache page impressions that trigger Parsoid with at least one WF call, that are served without placeholders", can be seen at https://grafana-rw.wikimedia.org/d/cemkpcajzkdfkb/asychronous-content-monitoring and currently hovers around 60%. The Grafana metric is mediawiki_Parsoid_parse_total with has_async_content=true; the percentage comes from comparing the counts with the async_not_ready label set to true versus false.

Excellent, thank you! [Edit] Actually, that counts all Parsoid renders, not just page impressions that trigger Parsoid renders, so it includes e.g. VE edits and edit stashes, I think? Should we exclude those?

Re-opening pending above discussion.

DSantamaria changed the task status from Open to In Progress. May 29 2025, 5:16 AM

Resolved in the meeting: we care about page-view values only.