Page MenuHomePhabricator

Fix Parsoid metrics
Closed, ResolvedPublic

Description

Parsoid/JS (and hence Parsoid/PHP) had metrics instrumentation in both the API layer as well as the transformation layers (that used to be called by the API).

Now that Parsoid/PHP is integrated into MediaWiki a bit more closely, a number of internal clients (like VE) no longer call the API layer and instead make direct calls to the Parsoid handling code in core.

At this time, this primarily impacts the html2wt transformations. Yesterday, VE switched away from using RESTBase (which in turn called the Parsoid API) to calling Parsoid directly. Since then, Parsoid's html2wt dashboard stopped recording all these html2wt calls because the instrumentation for some of the metrics lived in the REST API layer.

We should use this opportunity to think through what we want to instrument, what additional pieces of information we want to log, and fix the instrumentation to add new metrics to the Parsoid's library code.

Event Timeline

Arlolra triaged this task as High priority.
Arlolra moved this task from Backlog to In Progress on the Content-Transform-Team-WIP board.

At first blush, it's unclear why switching VE away from RESTBase would only affect the html2wt metrics since the DirectParsoidClient calls Parsoid directly in both directions.

In the serialization direction, looking at DirectParsoidClient::transformHTML, it constructs a newHtmlInputTransformHelper, just like in ParsoidHandler::html2wt. The metrics are emitted in HtmlToContentTransform::htmlToText which gets indirectly called by HtmlInputTransformHelper::getContent in both paths. So, the data should be there.

However, in the parse direction, the metrics are only emitted in ParsoidHandler::wt2html, so calls to DirectParsoidClient::getPageHtml / transformWikitext aren't recording any data. They should be moved somewhere inside HtmlOutputRendererHelper or whatever it calls.

The ParsoidCachePrewarmJob calls parsoidOutputAccess->getParserOutput directly, so it doesn't seem responsible for the continued wt2html metrics.

That leaves RESTBase. It's still being populated for the other endpoints that depend on it, which continues to make call to the ParsoidHandler, accounting for the data we see.

So, the data should be there.

But where did it go.

Previous to the switch, all serialization was done on the Parsoid cluster. Now, each app server is doing that work and all those servers aren't instantiating statsd with the same prefix as they used to (MediaWiki vs MediaWiki.Parsoid).

MediaWiki.Parsoid.html2wt.total.sample_rate

Screenshot 2023-08-30 at 3.36.29 PM.png (1×1 px, 320 KB)

MediaWiki.html2wt.total.sample_rate

Screenshot 2023-08-30 at 3.35.23 PM.png (1×1 px, 279 KB)

Both, on the same graph

Screenshot 2023-08-30 at 3.44.59 PM.png (1×1 px, 485 KB)

I tried adding equivalent queries with both prefixes to the graphs, for continuity, but because not all the data goes to zero for all graphs, it was kind of a mess. Instead, I duplicated the dashboard with a -direct postfix,
https://grafana.wikimedia.org/d/000000046/parsoid-timing-html2wt?orgId=1&refresh=30s&from=now-6M&to=now
https://grafana.wikimedia.org/d/xxRCkIzIk/parsoid-timing-html2wt-direct?orgId=1&refresh=30s&from=now-6M&to=now

Something is amiss with selser and needs investigating next.

Change 953741 had a related patch set uploaded (by Arlolra; author: Arlolra):

[mediawiki/core@master] Remove unnecessary setMetrics call

https://gerrit.wikimedia.org/r/953741

Interesting, that patch fails because of the test added in,
https://github.com/wikimedia/mediawiki/commit/79cc21beaf34c82f368ef88125bcdd2567a8f389

that says,

ParsoidHandler should pass the metrics object from the SiteConfig to HtmlInputTransformHelper, instead of using the global metrics instance. Otherwise, the metricsPrefix defined in the parsoid settings is ignored.

Change 953743 had a related patch set uploaded (by Arlolra; author: Arlolra):

[mediawiki/extensions/VisualEditor@master] Use metrics from SiteConfig to restore the Parsoid prefix

https://gerrit.wikimedia.org/r/953743

Previous to the switch, all serialization was done on the Parsoid cluster. Now, each app server is doing that work and all those servers aren't instantiating statsd with the same prefix as they used to (MediaWiki vs MediaWiki.Parsoid).

So the prefix isn't from the cluster

Change 953741 abandoned by Arlolra:

[mediawiki/core@master] Remove unnecessary setMetrics call

Reason:

In favour of I99c15876424a454309116fa1269d0a1a456e0d46

https://gerrit.wikimedia.org/r/953741

Change 953743 merged by jenkins-bot:

[mediawiki/extensions/VisualEditor@master] Use metrics from SiteConfig to restore the Parsoid prefix

https://gerrit.wikimedia.org/r/953743

Change 954048 had a related patch set uploaded (by Arlolra; author: Arlolra):

[mediawiki/extensions/VisualEditor@wmf/1.41.0-wmf.23] Use metrics from SiteConfig to restore the Parsoid prefix

https://gerrit.wikimedia.org/r/954048

Change 954049 had a related patch set uploaded (by Arlolra; author: Arlolra):

[mediawiki/extensions/VisualEditor@wmf/1.41.0-wmf.24] Use metrics from SiteConfig to restore the Parsoid prefix

https://gerrit.wikimedia.org/r/954049

Change 954048 abandoned by Arlolra:

[mediawiki/extensions/VisualEditor@wmf/1.41.0-wmf.23] Use metrics from SiteConfig to restore the Parsoid prefix

Reason:

group2 wikis are going to 24 now so I99c15876424a454309116fa1269d0a1a456e0d46 will be enough

https://gerrit.wikimedia.org/r/954048

Something is amiss with selser and needs investigating next.

I see, those metrics are internal to the Parsoid library,
https://github.com/wikimedia/mediawiki-services-parsoid/blob/master/docs/metrics.md#html2wt-direction

so always got the prefix from the siteconfig metrics,
https://github.com/wikimedia/mediawiki-services-parsoid/blob/master/src/Wikitext/ContentModelHandler.php#L211

Change 954049 merged by jenkins-bot:

[mediawiki/extensions/VisualEditor@wmf/1.41.0-wmf.24] Use metrics from SiteConfig to restore the Parsoid prefix

https://gerrit.wikimedia.org/r/954049

Mentioned in SAL (#wikimedia-operations) [2023-08-31T20:51:14Z] <jhuneidi@deploy1002> Started scap: Backport for [[gerrit:954049|Use metrics from SiteConfig to restore the Parsoid prefix (T339365)]]

Mentioned in SAL (#wikimedia-operations) [2023-08-31T20:52:38Z] <jhuneidi@deploy1002> arlolra and jhuneidi: Backport for [[gerrit:954049|Use metrics from SiteConfig to restore the Parsoid prefix (T339365)]] synced to the testservers mwdebug1002.eqiad.wmnet, mwdebug2002.codfw.wmnet, mwdebug2001.codfw.wmnet, mwdebug1001.eqiad.wmnet, and mw-debug kubernetes deployment (accessible via k8s-experimental XWD option)

Mentioned in SAL (#wikimedia-operations) [2023-08-31T21:01:18Z] <jhuneidi@deploy1002> Finished scap: Backport for [[gerrit:954049|Use metrics from SiteConfig to restore the Parsoid prefix (T339365)]] (duration: 10m 03s)

However, in the parse direction, the metrics are only emitted in ParsoidHandler::wt2html, so calls to DirectParsoidClient::getPageHtml / transformWikitext aren't recording any data. They should be moved somewhere inside HtmlOutputRendererHelper or whatever it calls.

...

That leaves RESTBase. It's still being populated for the other endpoints that depend on it, which continues to make call to the ParsoidHandler, accounting for the data we see.

It's only a matter of time before that breaks.

We should use this opportunity to think through what we want to instrument, what additional pieces of information we want to log, and fix the instrumentation to add new metrics to the Parsoid's library code.

I agree that, for the mostpart, these metrics should be moved internal to the Parsoid library entrypoint.

Change 954324 had a related patch set uploaded (by Arlolra; author: Arlolra):

[mediawiki/services/parsoid@master] [WIP] Instrument entrypoint based on ParsoidHandler metrics

https://gerrit.wikimedia.org/r/954324

Change 954324 merged by jenkins-bot:

[mediawiki/services/parsoid@master] Instrument wikitext2html entrypoint based on ParsoidHandler metrics

https://gerrit.wikimedia.org/r/954324

Change 955843 had a related patch set uploaded (by Arlolra; author: Arlolra):

[mediawiki/services/parsoid@master] Instrument dom2wikitext entrypoint based on HtmlToContentTransform metrics

https://gerrit.wikimedia.org/r/955843

Change 956459 had a related patch set uploaded (by Subramanya Sastry; author: Subramanya Sastry):

[mediawiki/vendor@master] Bump wikimedia/parsoid to 0.18.0-v23

https://gerrit.wikimedia.org/r/956459

Change 956459 merged by jenkins-bot:

[mediawiki/vendor@master] Bump wikimedia/parsoid to 0.18.0-v23

https://gerrit.wikimedia.org/r/956459

Change 955843 merged by jenkins-bot:

[mediawiki/services/parsoid@master] Instrument dom2wikitext entrypoint based on HtmlToContentTransform metrics

https://gerrit.wikimedia.org/r/955843

Change 958428 had a related patch set uploaded (by Isabelle Hurbain-Palatin; author: Isabelle Hurbain-Palatin):

[mediawiki/vendor@master] Bump parsoid to 0.18.0-a24

https://gerrit.wikimedia.org/r/958428

Change 958428 merged by jenkins-bot:

[mediawiki/vendor@master] Bump parsoid to 0.18.0-a24

https://gerrit.wikimedia.org/r/958428

What's left to do here is,

  • Create new grafana dashboards with the new .entry prefixed metrics and compare the dashboards for accuracy
  • Remove the old metrics emitting code from the rest endpoints and delete the associated grafana dashboards

Change 963790 had a related patch set uploaded (by Arlolra; author: Arlolra):

[mediawiki/core@master] [WIP] Remove Parsoid wt2html / html2wt metrics from Rest handlers

https://gerrit.wikimedia.org/r/963790

Change 963826 had a related patch set uploaded (by Arlolra; author: Arlolra):

[mediawiki/services/parsoid@master] Fix calling Timing::start without $metrics

https://gerrit.wikimedia.org/r/963826

Create new grafana dashboards with the new .entry prefixed metrics and compare the dashboards for accuracy

Notes the differences in wt2html

old - https://grafana.wikimedia.org/d/000000048/parsoid-timing-wt2html?orgId=1&refresh=30s&from=now-7d&to=now
vs
new - https://grafana.wikimedia.org/d/TCWoeymIk/parsoid-timing-wt2html-entry?orgId=1&refresh=1m&from=now-7d&to=now

API request rate - The req rate in the new is slightly higher because it includes the parse requests that are coming from elsewhere than ParsoidHandler::wt2html, as described in T339365#9137004. However, the peaks of the old are higher because they include parsercache hits, and the new only records when the request calls Parsoid's entrypoint.

Total time / Parse time - The wt2html.total metric isn't emitted from the entrypoint because, apart from the parse, it just included the init time which isn't being captured. The graph therefore gets a name change. A comparison of the graphs is blocked on deploying the patch in T339365#9229478

Time per output KB - Needs T339365#9229478 to compare

oldid: output size (in KB) - The old output size looks like it's almost double that of the new, but can be accounted for. In the pagebundle case, the old code took the strlen of the entire json serialized pagebundle, whereas the new is just the html strlen. So, the bloat from the json encoding and the added external data-parsoid information, and then whatever metadata is also included in the pagebundle. But note also the input size differences below would be correlated in the output size, since this isn't normalized.

oldid: input size (in KB) - The shape of these graphs is similar and the code in both cases is doing strlen( $pageConfig->getPageMainContent() ) but the new input sizes are smaller, which is odd because these are only the oldid cases. A few ideas on what could be going on,

  • As in T339365#9137004 mentioned above, the old only captures the requests going through rest api so it could be something particular to the different sets of pages.
  • The old input and output size metrics fire regardless of whether the output is coming from the parsercache, whereas the new only fire if we're going a parse. Could be difference in page sets there.

Linting Time / Language Converter Time - These already came from inside the library

Parse time (Mean) / Parse time (99%tile) - Needs T339365#9229478 to compare

Init time (Mean) / Init time (99%tile) - Init time wasn't part of time spent in the entrypoint

Change 963826 merged by jenkins-bot:

[mediawiki/services/parsoid@master] Fix calling Timing::start without $metrics

https://gerrit.wikimedia.org/r/963826

Create new grafana dashboards with the new .entry prefixed metrics and compare the dashboards for accuracy

Notes the differences in html2wt

old - https://grafana.wikimedia.org/d/000000046/parsoid-timing-html2wt?orgId=1&refresh=30s&from=now-7d&to=now
vs
new - https://grafana.wikimedia.org/d/jmaNtVMIz/parsoid-timing-html2wt-entry?orgId=1&refresh=30s&from=now-7d&to=now

html2wt rate - The rate is slightly higher in the new. In T339365#9132910, the parsoid prefix was restored so it seems like they should be the same. However, looking at the dashboard that was created for the non-prefixed metrics, it seems like there was non-prefixed metrics before and after the change that resulted in filing this ticket. That likely account for the difference.

input html size - Is roughly the same, working on a slightly different set.

Total time - Needs T339365#9229478 to compare

Processing time per KB of HTML - Needs T339365#9229478 to compare

html2wt components (p75) - These are the same because they already came from internal to library. That is, except for init, which has been dropped because it was capturing time outside the entrypoint.

The rest of the graphs follow the same reasoning as in html2wt components

Change 964566 had a related patch set uploaded (by Jgiannelos; author: Jgiannelos):

[mediawiki/vendor@master] Bump wikimedia/parsoid to v0.18.0-a28

https://gerrit.wikimedia.org/r/964566

Change 964566 merged by jenkins-bot:

[mediawiki/vendor@master] Bump wikimedia/parsoid to v0.18.0-a28

https://gerrit.wikimedia.org/r/964566

Change 963790 merged by jenkins-bot:

[mediawiki/core@master] Remove Parsoid wt2html / html2wt metrics from Rest handlers

https://gerrit.wikimedia.org/r/963790

Now that it has been a week since the fix in T339365#9229478 has been deployed, I took a look at the timed metrics in T339365#9229984 and T339365#9232593 and they seemed comparable to the old ones. As was described, some of the differences could be because of the differences in sets of pages processed, either due to the incompleteness of the old metrics or the inclusion of cache hits in them, or init time or bloat from json serialization. Nothing stands out as worth investigation beyond what was already done above.

We should use this opportunity to think through what we want to instrument, what additional pieces of information we want to log, and fix the instrumentation to add new metrics to the Parsoid's library code.

So, we can conclude that the latter part of that is done. As for rethinking what we're instrumenting, I'm going to leave that as the need arises.