Page MenuHomePhabricator

Avoid uncached action=discussiontoolspageinfo API calls on page load
Closed, ResolvedPublic

Description

Follow-up to T325477:

Does the response have any private data in it? I think if ApiDiscussionToolsPageInfo::execute() called $this->getMain()->setCacheMode( 'public' ) and you set the query string parameters maxage=86400 and smaxage=86400, then the response would have an Expires header and would be cached in the CDN. You have the oldid in there already for invalidation.

Event Timeline

I recall highlighting earlier this year (at T299583#7637939) that DT currently makes 3 uncached API requests directly on-load for every pageview. Naturally that was meant to be limited to talk pages, but even at the time it already leaked into main space of several wikis due to how the "extra signatures" signal is used.

@Krinkle wrote:

It appears the DT javascript layload is making […] three different API queries, each one private/uncacheable.

I'm not sure whether it's by design that this is needed during the critical path. If not, it should be defered until interaction. Otherwise, if it is critical for the initial binding of click handlers, then it seems like this information may be worth including into the page metadata directly from the server-side.

It's only making those requests on discussion pages; however, on Meta-Wiki and MediaWiki.org, the main namespace is treated like a discussion page for annoying historical reasons (this was previously discussed in T291630). This is not the case on most wikis.

The API requests are not strictly needed, but pre-loading them allows the reply tool to load faster […]. (There's actually a TODO comment in the code to reconsider this). If you think this is bad, please file a bug. It doesn't seem too bad to me.

Replying to this task has been on my todo list since January... my reply would been essentially repeated the suggestion I already wrote in the above quoted comment: Defer the API fetches to when the user interacts with DT, or preload it through zero-RTT page metadata (e.g. using the new mechansism Bartosz initiated at T41813.)

Either of these approaches (deferral, or preloading) naturally counterweight to way DT currently overshoots its enablement target (i.e. exposing Parsoid HTML and initialising DT code on article pages of some wikis), as these approaches don't incur noticable costs on pages without discussions (ref T249293 and T291630). I believe the topic of DT enablement vs "extra sigs" has been disucssed a lot and I haven't read the discussions recently, nor do I know the latest state and decisions/trade-offs. On the face of it, my instinct would be to ignore this signal and rely on other existing per-page signals instead (like NEWSECTIONLINK) and if there are additional edge cases to introduce a new magic word for those. Anyway, that's kind of tangential to this issue, given we also care about actual talk page performance, not just avoiding applying of suboptimal perf/load from talk pages onto article pages.

Making API requests directly on-pageview is unprecedented to my knowledge, even if it were CDN cached. It wasn't a priority in January with the limited roll-out and the assumption it would be limited to talk pages (with leaks only on low traffic wikis like Meta-Wiki and mediawiki.org). The load of these requests, if limited to talk pages, is manageable under normal traffic conditions. Though, to avoid the site going down when a talk page URL goes viral, you'd still want to CDN-cache those API requests, at least for some number of minutes, at least for anons.

Change 869163 had a related patch set uploaded (by Krinkle; author: Tim Starling):

[mediawiki/extensions/DiscussionTools@master] Clean up ApiDiscussionToolsPageInfo hack

https://gerrit.wikimedia.org/r/869163

I recall highlighting earlier this year (at T299583#7637939) that DT currently makes 3 uncached API requests directly on-load for every pageview. Naturally that was meant to be limited to talk pages, but even at the time it already leaked into main space of several wikis due to how the "extra signatures" signal is used.

(…)

Replying to this task has been on my todo list since January... my reply would been essentially repeated the suggestion I already wrote in the above quoted comment: Defer the API fetches to when the user interacts with DT, or preload it through zero-RTT page metadata (e.g. using the new mechansism Bartosz initiated at T41813.)

We did that because some of that data is generated from Parsoid output, which makes it relatively slow, and we didn't want the user to have to wait so long after clicking "reply" to start typing. When cached Parsoid output is available, it's around 200-300 ms (this is time spent mostly in our processing, which isn't currently cached); when it isn't available, it can be several seconds. For the same reason we didn't want to make this a part of the initial page load.

(Back in 2019 when the project started, we were going to switch to Parsoid for normal page views any day, which would make it a lot easier to do this as part of the initial page load, and make it cached in parser cache or something. We're still going to switch to Parsoid any day now today ;) )

Maybe there's some middle ground between "load immediately" and "load at last second" though. I just had the idea that we could start loading when the relevant links are hovered, before they're clicked, which should give us a few hundred ms (depending on how quick the user is with their clicking). Similar tricks can be done for touch and keyboard interactions. I think we should try this and see how well it works.

Either of these approaches (deferral, or preloading) naturally counterweight to way DT currently overshoots its enablement target (i.e. exposing Parsoid HTML and initialising DT code on article pages of some wikis), as these approaches don't incur noticable costs on pages without discussions (ref T249293 and T291630). I believe the topic of DT enablement vs "extra sigs" has been disucssed a lot and I haven't read the discussions recently, nor do I know the latest state and decisions/trade-offs. On the face of it, my instinct would be to ignore this signal and rely on other existing per-page signals instead (like NEWSECTIONLINK) and if there are additional edge cases to introduce a new magic word for those. Anyway, that's kind of tangential to this issue, given we also care about actual talk page performance, not just avoiding applying of suboptimal perf/load from talk pages onto article pages.

There are pages without NEWSECTIONLINK which nevertheless host discussions, e.g.:

We don't really want to review lots of pages on lots of wikis and add magic words to them (or make the communities do it).

Making API requests directly on-pageview is unprecedented to my knowledge, even if it were CDN cached. It wasn't a priority in January with the limited roll-out and the assumption it would be limited to talk pages (with leaks only on low traffic wikis like Meta-Wiki and mediawiki.org). The load of these requests, if limited to talk pages, is manageable under normal traffic conditions. Though, to avoid the site going down when a talk page URL goes viral, you'd still want to CDN-cache those API requests, at least for some number of minutes, at least for anons.

I'm not suggesting it's a good idea, but it's not unprecedented. For example the "Multimedia" box on https://de.wikipedia.org/w/index.php?fulltext=1&search=test&ns0=1 (below the fold) is generated from an API request on page load, and that's code from 2017: https://gerrit.wikimedia.org/r/c/mediawiki/core/+/332991/14/resources/src/mediawiki.special/mediawiki.special.search.commonsInterwikiWidget.js (although I only learned about it recently). This happens on every search on most projects (wgCirrusSearchCrossProjectShowMultimedia = true).

Change 869163 merged by jenkins-bot:

[mediawiki/extensions/DiscussionTools@master] Clean up ApiDiscussionToolsPageInfo hack

https://gerrit.wikimedia.org/r/869163

As for actually adding caching:

Does the response have any private data in it? I think if ApiDiscussionToolsPageInfo::execute() called $this->getMain()->setCacheMode( 'public' ) and you set the query string parameters maxage=86400 and smaxage=86400, then the response would have an Expires header and would be cached in the CDN. You have the oldid in there already for invalidation.

There's no private data, but invalidation depends on pages transcluded in the relevant page, not just on its oldid (and potentially also on changes to our server-side code, but that's less important).

I think we would have to make this cached request, then try to detect if the result is stale (e.g. it's missing data about the comment the user is trying to reply to), and retry it without caching. This would be possible, but right now I like Krinkle's idea to just delay the request more, and I'm not sure if it's worth the effort to do both, so I suggest that we try only doing that for now (but I'm open to changing my mind). I'll rephrase the task.

matmarex renamed this task from Add caching for our action=discussiontoolspageinfo calls to Avoid uncached action=discussiontoolspageinfo API calls on page load.Jan 5 2023, 11:43 PM

Change 876030 had a related patch set uploaded (by Bartosz Dziewoński; author: Bartosz Dziewoński):

[mediawiki/extensions/DiscussionTools@master] Delay API requests for preloading metadata until user interaction

https://gerrit.wikimedia.org/r/876030

[…] Making API requests directly on-pageview is unprecedented to my knowledge, even if it were CDN cached. […]

I'm not suggesting it's a good idea, but it's not unprecedented. For example the "Multimedia" box on Special:Search (below the fold) is generated from an API request on page load, […]. This happens on every search on most projects (wgCirrusSearchCrossProjectShowMultimedia = true).

I was referring specifically to pageviews, not other/less-common web responses.

[…] Defer the API fetches […], or preload it through zero-RTT page metadata (e.g. using the new mechansism Bartosz initiated at T41813.)

We did that because some of that data is generated from Parsoid output, […].

(Back in 2019 when the project started, we were going to switch to Parsoid for normal page views any day, which would make it a lot easier to do this as part of the initial page load, and make it cached in parser cache or something. We're still going to switch to Parsoid any day now today ;) )

[…]

I'm confused about this point. To my knowledge, DT doesn't use just Parsoid in its API requests for JS purposes. It specifically uses Parsoid for the page render as well. This is why, after all, we split the ParserCache two years ago during the DT Beta phase (and incidentally, is why when DT false-positively activates itself server-side on a non-discussion page, it exposes Parsoid-related compat issues on live articles on wikis that use "extra sig" namespaces, due the wiki or Parsoid not being ready for that particular engagement yet).

What does DT need to extract from the Parsoid output that isn't available server-side when it renders the discussion page? Even if it's a ParserCache miss, it'll have parsed it in order to render it, right? I'm guessing we're talking about different kinds of Parsoid output, so I'm missing something here.

I'm not suggesting it's a good idea, but it's not unprecedented. For example the "Multimedia" box on Special:Search (below the fold) is generated from an API request on page load, […]. This happens on every search on most projects (wgCirrusSearchCrossProjectShowMultimedia = true).

I was referring specifically to pageviews, not other/less-common web responses.

Special:Search probably gets more page views than all talk pages combined ;)

I'm confused about this point. To my knowledge, DT doesn't use just Parsoid in its API requests for JS purposes. It specifically uses Parsoid for the page render as well. This is why, after all, we split the ParserCache two years ago during the DT Beta phase (and incidentally, is why when DT false-positively activates itself server-side on a non-discussion page, it exposes Parsoid-related compat issues on live articles on wikis that use "extra sig" namespaces, due the wiki or Parsoid not being ready for that particular engagement yet).

It doesn't use Parsoid for the page render. There has been some talk about doing that as the first steps towards Parsoid read views everywhere, but it hasn't been done. We used to split the parser cache (we no longer do) and caused some issues on articles (now fixed) only because of our own HTML transformations and additions. (We use a lot of Parsoid code for doing those, since Parsoid basically invented many HTML-processing utilities that PHP lacked, but we don't use Parsoid per se.)

Change 876030 merged by jenkins-bot:

[mediawiki/extensions/DiscussionTools@master] Delay API requests for preloading metadata until user interaction

https://gerrit.wikimedia.org/r/876030

@Krinkle @tstarling Please comment if you think some additional changes are worthwhile. Otherwise I will consider this done.

In lieu of QA, we should check what effect this change has on the load time for the tools, after it's deployed.

The data should be available in the EditAttemptStep schema, but I couldn't find a nice dashboard anywhere, so we might need to write some SQL queries (or make a dashboard).

In lieu of QA, we should check what effect this change has on the load time for the tools, after it's deployed.

The data should be available in the EditAttemptStep schema, but I couldn't find a nice dashboard anywhere, so we might need to write some SQL queries (or make a dashboard).

It's been live for a couple of days, so let's have a look.

Query is here: https://superset.wikimedia.org/superset/sqllab/?savedQueryId=634

select 
  date(from_iso8601_timestamp(dt)) as date,
  avg(event.loaded_timing) as time_ms
from editattemptstep
where event.integration='discussiontools' 
and event.loaded_timing IS NOT NULL
and event.loaded_timing < 10000 -- exclude extreme outliers - errors in the data?
group by year, month, day
order by year, month, day

image.png (2,048×1,152 px, 30 KB)

So we definitely have a regression on 2022-01-12, when the change was deployed to group2, with the average load time for the reply and new topic tools going from 0.4s to 0.6s.

So we definitely have a regression on 2022-01-12, when the change was deployed to group2, with the average load time for the reply and new topic tools going from 0.4s to 0.6s.

@matmarex is the avg. load time increase for the Reply and New Topic Tools something that we expected to result from this change? Is it unexpected? If so, is there additional work we ought to consider doing to address this apparent regression in load performance?

It is expected and it's up to you whether to prioritize working on this.