parameter mathpurge=true should purge cache in restbase
Closed, ResolvedPublic

Description

If the HTTP GET parameter mathpurge is set, the Math extension should add the header
Cache-Control: no-cache to the render request.

Event Timeline

We discussed this a bit on IRC. It looks like the main use case is updating renders after mathoid code fixes.

To support this use case, I'm proposing to

  1. Version the content-type with something along the lines of https://github.com/wikimedia/restbase/pull/590 in RESTBase. This triggers a re-render whenever the content-type is outdated.
  2. Shorten the Varnish TTL to 2 weeks, in line with other HTML content.

This combination will show updated renders after at most two weeks. If a means to immediately purge the math on specific pages is needed, then we could consider respecting no-cache requests in Varnish. Currently they are unconditionally ignored, and I'm not entirely sure about the rationale. @BBlack, could you advise on whether you consider respecting no-cache headers from clients an option?
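
To make the versioning idea in point 1 concrete, here is a minimal sketch of the check. It is not the code from the linked PR: the profile-style content-type, the CURRENT_VERSION value and both function names are assumptions made for illustration only.

```python
import re

# Hypothetical "current" render version; the real value would live in RESTBase config.
CURRENT_VERSION = "1.1.0"


def parse_profile_version(content_type):
    """Return the (major, minor, patch) found in a profile="…/x.y.z" parameter, or None."""
    match = re.search(r'profile="[^"]*/(\d+)\.(\d+)\.(\d+)"', content_type)
    return tuple(int(part) for part in match.groups()) if match else None


def needs_rerender(stored_content_type, current_version=CURRENT_VERSION):
    """A stored render is outdated if it has no version or an older version."""
    stored = parse_profile_version(stored_content_type)
    current = tuple(int(part) for part in current_version.split("."))
    return stored is None or stored < current


old = 'image/svg+xml; profile="https://www.mediawiki.org/wiki/Specs/SVG/1.0.0"'
print(needs_rerender(old))
```

Bumping the version after a Mathoid fix would then cause stale renders to be regenerated lazily, the next time they are requested.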

I'm not aware of the broader context of what mathpurge is about. That being said:

Why would a client send mathpurge=true, and how is math any different than any of the rest of our content, which is also subject to code changes cleaning up rendering of otherwise-identical content? I don't think we want to go down the path of "when mathoid code is updated, purge all math renders".

The regular MediaWiki code already has a ?action=purge that's used to force a one-off purge of an article, which could be used in isolated cases where buggy output was corrected by a code update. Are you asking for something similar for a mathoid output? Can we not use the same thing?

As for TTLs, we're already capping per-layer Varnish TTLs at 2 weeks, and planning to drop that to 7 days. There are longer-term plans in the works to cap at 7 days across all layers and then keep pushing our standard TTLs even lower except in exceptional cases of backend/network failure. Most issues of this nature can basically be handwaved away once the standard TTLs get low enough.

I don't at all see how any of this interacts with whether we respect a client's CC:no-cache...

mathpurge=true is a workaround, since extensions cannot tell whether the action parameter was set to purge.
In the old PNG rendering mode, this caused all LaTeX formulae to be re-rendered from the TeX code, even though that might take several minutes for pages with a lot of math.
For Math, but also for other extensions such as Graph or Cite, it would be nice to have a way to get the most recent rendering for a particular page.
The idea was that the Math extension would send a request with the CC:no-cache header to RESTBase to trigger a re-rendering from source without using any cache.
I think the problem is not specific to math at all, and there should be a way to render individual pages without using caches at all.

mathpurge=true is a workaround, since extensions cannot tell whether the action parameter was set to purge.

How is the math render included in the page? Is it part of the built HTML output, or is it a sub-request? It seems like if it's the former, we should be able to hook up ?action=purge. If it's the latter, can we not create ?action=purge for the subrequest for administrative use to purge updated math outputs as one-offs?

In the old PNG rendering mode, this caused all LaTeX formulae to be re-rendered from the TeX code, even though that might take several minutes for pages with a lot of math.
For Math, but also for other extensions such as Graph or Cite, it would be nice to have a way to get the most recent rendering for a particular page.
The idea was that the Math extension would send a request with the CC:no-cache header to RESTBase to trigger a re-rendering from source without using any cache.

I'm lost in the layers here. "The extension would send a request" means...? Is the browser fetching the rendered math object independently of the main page?

I think the problem is not specific to math at all

Agreed.

, and there should be a way to render individual pages without using caches at all.

Client-side CC=no-cache and a request to purge caches accomplish different things. I have yet to see anything in this ticket where CC=no-cache sounds like an appropriate solution (not that we honor that at all, as gwicke said).

I think we are trying to put 2 (complementary, but different) problems in the same basket here. Let's clarify that.

Math render purges

On the Math(oid) side, we currently have the problem of not being able to re-render formulae when Mathoid's code changes in a way that affects them. Since the Math extension handles the initial render (on the server side), when it sees mathpurge=true all it has to do is add Cache-Control: no-cache when sending the request to RESTBase (this step bypasses Varnish entirely). On the RESTBase side, we need to issue a resource_change event when a new render is stored, which will be picked up by change propagation and turned into a purge.
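
As a rough illustration of the extension-side step, the header decision could look like the sketch below. The real extension is PHP; this Python function, its name, and the strict comparison against "true" are assumptions for illustration only.

```python
from urllib.parse import parse_qs, urlparse


def render_request_headers(page_url):
    """Build the headers for the internal render request made on behalf of page_url.

    Hypothetical helper: adds Cache-Control: no-cache only when the page
    was requested with mathpurge=true.
    """
    query = parse_qs(urlparse(page_url).query)
    headers = {}
    if query.get("mathpurge", [""])[0] == "true":
        # Ask RESTBase to re-render instead of returning the stored render.
        headers["Cache-Control"] = "no-cache"
    return headers


print(render_request_headers("https://en.wikipedia.org/wiki/Foo?mathpurge=true"))
```

Because this request goes straight from MediaWiki to RESTBase inside the production network, whether Varnish honors no-cache does not matter for this step.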

Extension purges

As noted by @Physikerwelt, an extension cannot know that a purge request has been issued, so it cannot act accordingly (in this instance, tell Varnish or some other entity to purge the content). I don't have a solution to propose for this one, but it seems sensible that it should be handled on the MW side, i.e. MW could let extensions know it's a purge. They would in turn regenerate the content and, if needed, issue requests that would end up purging the cache.

In either case, unless I'm misunderstanding what is being talked about here, I don't think Varnish should be involved in the process at all (in the sense that it should be up to MW/services to issue the purge requests).

mathpurge=true is a workaround since extensions can not access the information if the action parameter was set to purge.

I don't understand the rest of the discussion here, but can't you use the ArticlePurge hook for that?

The task that math should support regular action=purge is certainly an orthogonal issue. Coincidentally, this was some input on the generified issue Retrieve action get parameter from within an extension. Thus, we can assume for this discussion that the Math extension was requested to purge the contents, regardless of how it got that information.

What is not clear to me is what would happen within RESTBase if it received the CC:no-cache header
a) on the check request
b) get MathML request
?
Or would it even have to send a get SVG and get PNG request in addition?

Moreover, I was confused to receive different results on
https://en.wikipedia.org/api/rest_v1/?doc#!/Math/post_media_math_check_type vs
https://www.mediawiki.org/api/rest_v1/?doc#!/Math/post_media_math_check_type

According to my idea of the infrastructure (picture below), a request sent from within RESTBase should not use a cache, and both web interfaces should access the same entry in the Cassandra database.

mediawikiOverview.png (706×729 px, 61 KB)

What is not clear to me is what would happen within RESTBase if it received the CC:no-cache header
a) on the check request

It would ask Mathoid to re-check the formula.

b) get MathML request

It would request Mathoid to re-render all renders (MathML, SVG, PNG) using the /complete endpoint.

Or would it even have to send a get SVG and get PNG request in addition?

Nope, issuing a no-cache request for any of the renders prompts the regeneration of all of them.
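
A minimal in-memory sketch of that behaviour, with a stand-in callable in place of Mathoid's /complete endpoint. The class, method and format names are hypothetical and do not mirror RESTBase's actual code.

```python
class RenderStore:
    """Toy store: one entry per formula hash, holding all render formats together."""

    def __init__(self, render_all):
        # render_all: callable(formula_hash) -> {"mml": ..., "svg": ..., "png": ...}
        # It stands in for a single call to Mathoid's /complete endpoint.
        self._render_all = render_all
        self._cache = {}
        self.render_calls = 0

    def get(self, formula_hash, fmt, no_cache=False):
        """Return one format; a no_cache request regenerates every format at once."""
        if no_cache or formula_hash not in self._cache:
            self._cache[formula_hash] = self._render_all(formula_hash)
            self.render_calls += 1
        return self._cache[formula_hash][fmt]


def fake_mathoid(formula_hash):
    # Placeholder renders; a real call would return MathML, SVG and PNG data.
    return {fmt: f"{fmt}:{formula_hash}" for fmt in ("mml", "svg", "png")}


store = RenderStore(fake_mathoid)
store.get("abc", "mml", no_cache=True)  # regenerates mml, svg and png together
store.get("abc", "svg")                 # served from the refreshed entry
print(store.render_calls)
```

Keeping all formats under one entry is what makes a single no-cache request sufficient: there is no way for the SVG to lag behind the MathML.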

Moreover, I was confused to receive different results on
https://en.wikipedia.org/api/rest_v1/?doc#!/Math/post_media_math_check_type vs
https://www.mediawiki.org/api/rest_v1/?doc#!/Math/post_media_math_check_type

That shouldn't happen. Are you saying that you supplied the same formula but received different results? Could you provide the formula?

You can try hash 6ad2fef5f3cad34f629a11c7f3d6cbf9010d8d2b
From the IRC log

(05:48:17 PM) physikerwelt1: gwicke, mobrovac am I right that the math is only stored once, independent of the domain? I was wondering why I get different results for curl https://en.wikipedia.org/api/rest_v1/media/math/render/svg/6ad2fef5f3cad34f629a11c7f3d6cbf9010d8d2b vs curl https://www.mediawiki.org/api/rest_v1/media/math/render/svg/6ad2fef5f3cad34f629a11c7f3d6cbf9010d8d2b
(06:39:58 PM) gwicke: physikerwelt1: as far as I can tell all backend requests in https://github.com/wikimedia/restbase/blob/master/v1/mathoid.yaml use wikimedia.org for the domain, so they *should* be shared
(06:41:16 PM) gwicke: also adding a random query string corrects the render: https://www.mediawiki.org/api/rest_v1/media/math/render/svg/6ad2fef5f3cad34f629a11c7f3d6cbf9010d8d2b?
(06:41:27 PM) gwicke: to me this looks like caching

Just tried to bypass Varnish, and I'm getting the same result. So, yeah, it's a caching issue, which we probably have to think about separately. Thank you @Physikerwelt for pointing out the issue!

@mobrovac does that mean that the picture of the architecture is outdated/wrong, and that there is a caching level between the MediaWiki instance and RESTBase, or is the cache built into RESTBase itself?

Neither. When you request resources from the outside, everything passes through Varnish (the caching layer). However, inside the production network, Varnish is completely bypassed. So MW requests the renders directly from RB, while browser clients actually ask Varnish, which forwards the request to RB if it can't find the object in its cache.

On the RESTBase side, we need to issue a resource_change event when a new render is stored, which will be picked up by change propagation and turned into a purge.

This would only work if we constructed resource_change events for all 800+ domains exposing the same underlying render. We probably don't want to do this for a large number of SVGs after bumping the mathoid format version unless there is a strong need to get updates to show up more quickly than 2 weeks or so.

@BBlack, @ema would it be possible for Varnish to replace the domain in the request on the incoming side for /api/rest_v1/media/math/* to wikimedia.org ? Basically, what we want is for a request with the URI https://{domain}/api/rest_v1/media/math/{+rest} to be treated by Varnish as https://wikimedia.org/api/rest_v1/media/math/{+rest}. RESTBase already does this internally (no distinction between domains for math requests), so it would greatly improve caching for Varnish to do it too. Moreover, we could then reliably purge math objects from the cache as each object would be identified by a single URI.
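
To spell out the mapping being requested: this is not VCL, just a Python sketch of the normalization, and the function and constant names are made up for illustration.

```python
from urllib.parse import urlsplit, urlunsplit

MATH_PREFIX = "/api/rest_v1/media/math/"
CANONICAL_HOST = "wikimedia.org"


def canonical_cache_url(url):
    """Map any per-wiki math URL onto the single wikimedia.org URL; leave other URLs alone."""
    parts = urlsplit(url)
    if parts.path.startswith(MATH_PREFIX):
        parts = parts._replace(netloc=CANONICAL_HOST)
    return urlunsplit(parts)


h = "6ad2fef5f3cad34f629a11c7f3d6cbf9010d8d2b"
a = canonical_cache_url(f"https://en.wikipedia.org/api/rest_v1/media/math/render/svg/{h}")
b = canonical_cache_url(f"https://www.mediawiki.org/api/rest_v1/media/math/render/svg/{h}")
print(a == b)
```

With every domain collapsing onto one cache key, a single purge for the wikimedia.org URL would cover all of them.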

PR #628 adds purging of renders for the global wikimedia.org domain.

The PR has been merged and deployed, so purges are now being issued for wikimedia.org. We still need the Varnish rewrites in order to make it work.

I still don't really have an answer for my earlier question, so I'll re-phrase: when a user views a page containing Math stuff, is the math render included via a separate browser fetch from RB, or is it part of the primary rendered page content and the fetch of the Math parts via RB happens during page render and gets cached in MW's parser cache?

On other topics:

  • We don't plan to honor CC:no-cache in Varnish at this time, and I don't see a compelling argument for doing so.
  • It definitely shouldn't be the case that shared Math objects which several articles include get purged when the including articles are purged because of unrelated content updates. What I meant was more about whether you could do a purge action on the Math URL itself...
  • If Mathoid URLs are project/language-independent, why are they hosted under project/language-specific URLs? That's pointless denormalization and seems like a URL schema issue. If it's meant to be used directly and publicly, it should be at e.g. https://mathoid.wikimedia.org/api/rest_v1/media/math/{+rest} and not accessed redundantly via other domains. Then we don't need Yet Another VCL Hack to work around the application layer. The uniqueness of URLs and the fact that the Hostname portion of the URL matters are fundamental concepts here, but I think this gets confused by the fact that internal/official RB URLs put domainnames into the URL path component to service things over the singular service hostname restbase.svc.eqiad.wmnet for several wikis. Still, under either the public or internal schema, math could have its own distinct domainname, such that one unique global resource has one public URL.

I still don't really have an answer for my earlier question, so I'll re-phrase: when a user views a page containing Math stuff, is the math render included via a separate browser fetch from RB, or is it part of the primary rendered page content and the fetch of the Math parts via RB happens during page render and gets cached in MW's parser cache?

Both, for better or worse. Each <math> tag embeds the MathML directly in the page (at page parse time), but also provides the link to the fall-back SVG image, which is to be fetched by the browser as a separate request to RB. Since the majority of browsers don't actually support MathML, in practice the fall-back SVG is retrieved by the client's browser.

On other topics:

  • We don't plan to honor CC:no-cache in Varnish at this time, and I don't see a compelling argument for doing so.
  • It definitely shouldn't be the case that shared Math objects which several articles include get purged when the including articles are purged because of unrelated content updates. What I meant was more about whether you could do a purge action on the Math URL itself...

On the one hand, you are right to say that the Math objects shouldn't be purged because the article is purged (as they might not have changed). However, that kind of depends on the user's intent. mathpurge=true can come to the rescue, but that narrows down the number of people that use it.

  • If Mathoid URLs are project/language-independent, why are they hosted under project/language-specific URLs? That's pointless denormalization and seems like a URL schema issue. If it's meant to be used directly and publicly, it should be at e.g. https://mathoid.wikimedia.org/api/rest_v1/media/math/{+rest} and not accessed redundantly via other domains. Then we don't need Yet Another VCL Hack to work around the application layer. The uniqueness of URLs and the fact that the Hostname portion of the URL matters are fundamental concepts here, but I think this gets confused by the fact that internal/official RB URLs put domainnames into the URL path component to service things over the singular service hostname restbase.svc.eqiad.wmnet for several wikis. Still, under either the public or internal schema, math could have its own distinct domainname, such that one unique global resource has one public URL.

The canonical URL for Math objects is restbase.svc.${::site}.wmnet/wikimedia.org/v1/media/math/*. All of the others are internal RB redirects, which we have gone with to decrease client latency since by using the same domain, they do not need to go through the TLS handshake when fetching them. While this matters very little on pages with a Math object or two, heavily-loaded pages would suffer greatly if we changed the links to point to wikimedia.org.

Change 292362 had a related patch set uploaded (by Mobrovac):
Send a no-cache content request on mathpurge=true

https://gerrit.wikimedia.org/r/292362

mobrovac triaged this task as High priority.
  • If Mathoid URLs are project/language-independent, why are they hosted under project/language-specific URLs? That's pointless denormalization and seems like a URL schema issue. If it's meant to be used directly and publicly, it should be at e.g. https://mathoid.wikimedia.org/api/rest_v1/media/math/{+rest} and not accessed redundantly via other domains. Then we don't need Yet Another VCL Hack to work around the application layer. The uniqueness of URLs and the fact that the Hostname portion of the URL matters are fundamental concepts here, but I think this gets confused by the fact that internal/official RB URLs put domainnames into the URL path component to service things over the singular service hostname restbase.svc.eqiad.wmnet for several wikis. Still, under either the public or internal schema, math could have its own distinct domainname, such that one unique global resource has one public URL.

The canonical URL for Math objects is restbase.svc.${::site}.wmnet/wikimedia.org/v1/media/math/*. All of the others are internal RB redirects, which we have gone with to decrease client latency since by using the same domain, they do not need to go through the TLS handshake when fetching them. While this matters very little on pages with a Math object or two, heavily-loaded pages would suffer greatly if we changed the links to point to wikimedia.org.

The majority of real-world browser clients are going to use HTTP/2, which coalesces that connection between the canonical wiki domain and wikimedia.org, as they share a cert and an IP address. Only older browsers are going to make a separate connection to load a wikimedia.org resource (or anything else mapped to the text cluster and in our set of canonical/project domainnames).

Change 292362 merged by jenkins-bot:
Send a no-cache content request on mathpurge=true

https://gerrit.wikimedia.org/r/292362

Change 292937 had a related patch set uploaded (by Mobrovac):
Math: Set wgMathFullRestbaseURL to point to wikimedia.org in production

https://gerrit.wikimedia.org/r/292937

Change 292937 merged by jenkins-bot:
Math: Set wgMathFullRestbaseURL to point to wikimedia.org in production

https://gerrit.wikimedia.org/r/292937

mobrovac removed a project: Patch-For-Review.
mobrovac removed a subscriber: gerritbot.

The Math extension will start sending no-cache requests to RESTBase on mathpurge=true when 1.28-wmf.5 goes out. Also, it has started instructing clients to request Math objects using the canonical URL, and RB issues purge requests for the same, canonical URL. Resolving.