Request cached copy of machine generated Related Articles
Closed, Resolved; Public; 1 Estimated Story Points

Description

As an engineering manager, I do not want to unnecessarily access origin servers for Related Articles, but instead want edge cached responses, so that the origin server capacity can be used for other purposes. As a reader, I want Related Articles to load quickly, so that I can more easily find other neat stuff at a glance.

This is the bridge solution that is a precursor to T125983: RESTbase cached morelike endpoint.

The api.php smaxage parameter should be set on the Related Articles morelike requests with a value of 24 hours.

Event Timeline

dr0ptp4kt raised the priority of this task to Needs Triage.
dr0ptp4kt updated the task description.
dr0ptp4kt subscribed.

Heads up, @BBlack @elukey. Basically, once a reader scrolls far enough down a page on the mobile web, this additional API call is made. We want to ensure edge cached responses to avoid hitting the origin, both for the sake of the origin's resource usage and for the sake of the response time.

Change 307361 had a related patch set uploaded (by Bmansurov):
Cache morelike requrests

https://gerrit.wikimedia.org/r/307361

Change 307361 merged by jenkins-bot:
Cache morelike requrests

https://gerrit.wikimedia.org/r/307361

8afb02b6 meets the AC of this task. What @dr0ptp4kt is really aiming for, however, isn't met. We need to better understand the (HTTP) caching-related behaviour of the pageterms query module.


Sending the API request that the Beta Cluster currently makes to the production cluster instead results in the following Cache-Control response header:

Cache-Control: private, must-revalidate, max-age=0

Read: only the UA should cache the resource, the UA should re-request the resource from the origin once the cached resource is stale, and the resource is immediately stale.

[Adding the maxage=86400 query parameter to the request](https://en.m.wikipedia.org/w/api.php?action=query&format=json&formatversion=2&prop=pageimages%7Cpageterms&piprop=thumbnail&pithumbsize=160&wbptterms=description&pilimit=3&generator=search&gsrsearch=morelike%3ABlack-backed_jackal&gsrnamespace=0&gsrlimit=3&gsrqiprofile=classic_noboostlinks&maxage=86400&smaxage=86400) yields:

Cache-Control: private, must-revalidate, max-age=86400

Read: only the UA should cache the resource, the UA should re-request the resource from the origin once the cached resource is stale, and the resource is stale after a day.

To be clear, in both cases, Varnish isn't caching the response.
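The readings above can be checked mechanically. What follows is a minimal sketch, assuming a simplified interpretation of Cache-Control in which a shared cache refuses anything marked private or no-store and otherwise prefers s-maxage over max-age. The function names are hypothetical, and real Varnish behaviour also depends on its VCL configuration, which this sketch ignores.

```python
def parse_cache_control(header):
    """Split a Cache-Control header into a dict of lowercase directives."""
    directives = {}
    for part in header.split(","):
        part = part.strip()
        if not part:
            continue
        name, _, value = part.partition("=")
        # Directives without a value (e.g. "private") map to None.
        directives[name.strip().lower()] = value.strip().strip('"') or None
    return directives


def shared_cache_ttl(header):
    """Return the seconds a shared cache may keep the response, or None.

    Simplification (an assumption of this sketch): "private" and
    "no-store" forbid shared caching outright; otherwise s-maxage
    takes precedence over max-age.
    """
    directives = parse_cache_control(header)
    if "private" in directives or "no-store" in directives:
        return None
    for name in ("s-maxage", "max-age"):
        if directives.get(name) is not None:
            return int(directives[name])
    return None
```

Under these assumptions, both headers quoted above yield None (no edge caching), regardless of the max-age value.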

Why is the cache mode of the response private? I'm not sure of the exact cause but I can isolate it to the introduction of the prop=pageterms query parameter. Consider the following:

The API request without any prop query parameter yields:

Cache-Control: s-maxage=86400, max-age=86400, public

Read: caching proxies and the UA should cache the resource and re-request the resource from the origin once the cached resource is stale, and the resource is stale after a day.

[The API request with the prop=pageimages query parameter (and parameters required by the pageimages query module)](https://en.m.wikipedia.org/w/api.php?action=query&format=json&smaxage=86400&maxage=86400&generator=search&formatversion=2&gsrsearch=morelike%3ABlack-backed_jackal&gsrnamespace=0&gsrlimit=3&gsrqiprofile=classic_noboostlinks&prop=pageimages&piprop=thumbnail&pithumbsize=160&pilimit=3) yields:

Cache-Control: s-maxage=86400, max-age=86400, public

[The API request with the prop=pageterms query parameter (and parameters required by the pageterms query module)](https://en.m.wikipedia.org/w/api.php?action=query&format=json&formatversion=2&prop=pageterms&wbptterms=description&generator=search&gsrsearch=morelike%3ABlack-backed_jackal&gsrnamespace=0&gsrlimit=3&gsrqiprofile=classic_noboostlinks&maxage=86400&smaxage=86400) yields:

Cache-Control: private, must-revalidate, max-age=86400

In the first two cases, Varnish is caching the response – and the next request takes ~20 ms to complete!
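To repeat the isolation experiment above for other titles, the three request variants can be generated from one base parameter set. This is only a sketch: the parameter values are copied from the URLs in this comment, while `morelike_url` and `MODULE_PARAMS` are hypothetical names introduced for illustration.

```python
from urllib.parse import urlencode

API = "https://en.m.wikipedia.org/w/api.php"

# Extra parameters each prop module needs, as used in the URLs above.
MODULE_PARAMS = {
    "pageimages": {"piprop": "thumbnail", "pithumbsize": 160, "pilimit": 3},
    "pageterms": {"wbptterms": "description"},
}


def morelike_url(title, props=(), max_age=86400):
    """Build a morelike query URL, optionally with prop modules enabled."""
    params = {
        "action": "query",
        "format": "json",
        "formatversion": 2,
        "generator": "search",
        "gsrsearch": "morelike:" + title,
        "gsrnamespace": 0,
        "gsrlimit": 3,
        "gsrqiprofile": "classic_noboostlinks",
        "maxage": max_age,
        "smaxage": max_age,
    }
    if props:
        params["prop"] = "|".join(props)
        for prop in props:
            params.update(MODULE_PARAMS[prop])
    return API + "?" + urlencode(params)


# The three variants compared above.
variants = [
    morelike_url("Black-backed_jackal"),
    morelike_url("Black-backed_jackal", props=["pageimages"]),
    morelike_url("Black-backed_jackal", props=["pageterms"]),
]
```

Requesting each variant and comparing the Cache-Control response headers should show which module flips the response to private.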

phuedx removed phuedx as the assignee of this task. (Edited Aug 30 2016, 11:50 AM)
phuedx subscribed.

The move to -1 (Needs More Work) is provisional.

@dr0ptp4kt, @ovasileva: How do you feel about increasing the scope of this task?

Edit

… to, potentially, patching the pageterms API query module, which is part of Wikibase.

@phuedx (sorry for the ignorance): how would the current state change with the increased scope? Or rather, would the change allow for a cached response in more cases (1, 2)?

If it's necessary to patch a module, I'd prefer that be done.

That said, @EBernhardson @Anomie @Tgr @BBlack do you have any insight why adding smaxage=86400 to the following URL doesn't result in edge side caching, and whether there's a different incantation that would achieve the desired edge side caching?

https://en.m.wikipedia.org/w/api.php?action=query&format=json&formatversion=2&prop=pageimages|pageterms&piprop=thumbnail&pithumbsize=160&wbptterms=description&pilimit=3&generator=search&gsrsearch=morelike:Siege_of_Sidney_Street&gsrnamespace=0&gsrlimit=3

We don't want to introduce extra unnecessary origin side load, hence the desire to cache at the edge (in addition to any other high-speed memory caching done origin side; we're still trying to save PHP processing above and beyond high-speed memory caching).

The answer wasn't immediately obvious to me from a scan of the following searches:

https://github.com/wikimedia/operations-puppet/search?utf8=%E2%9C%93&q=s-maxage&type=Code

https://github.com/wikimedia/mediawiki/search?utf8=%E2%9C%93&q=smaxage+extension%3Aphp&type=Code

https://github.com/wikimedia/mediawiki-extensions-Wikibase/search?utf8=%E2%9C%93&q=s-maxage

Note that at the moment we're not yet pursuing requests against the RESTbase related endpoint, which has edge side caching by configuration. That's for another day.

> That said, @EBernhardson @Anomie @Tgr @BBlack do you have any insight why adding smaxage=86400 to the following URL doesn't result in edge side caching, and whether there's a different incantation that would achieve the desired edge side caching?
>
> https://en.m.wikipedia.org/w/api.php?action=query&format=json&formatversion=2&prop=pageimages|pageterms&piprop=thumbnail&pithumbsize=160&wbptterms=description&pilimit=3&generator=search&gsrsearch=morelike:Siege_of_Sidney_Street&gsrnamespace=0&gsrlimit=3

The reason is that Wikibase\Client\Api\PageTerms doesn't override ApiQueryBase::getCacheMode(), so it uses the default 'private'. You'll likely want to create a subtask for the Wikibase people to look at whether that class can safely override the method to return 'public' or 'anon-public-user-private'.
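The effect described here can be modelled in a few lines. The following is a simplified Python model, not MediaWiki's actual code: it assumes the query action combines the cache modes of all participating modules by taking the most restrictive one, which matches the headers observed earlier in this task. `merge_cache_modes` and `RANK` are hypothetical names.

```python
# Cache modes ordered from most to least cacheable.
RANK = {"public": 0, "anon-public-user-private": 1, "private": 2}


def merge_cache_modes(module_modes, default="private"):
    """Return the most restrictive cache mode among the modules.

    A module that declares no mode (None) falls back to `default`,
    mirroring a module that doesn't override getCacheMode().
    """
    effective = "public"
    for mode in module_modes:
        mode = mode or default
        if RANK[mode] > RANK[effective]:
            effective = mode
    return effective


# search generator + pageimages: both public, so the response is public.
print(merge_cache_modes(["public", "public"]))
# search generator + pageterms without an override: private wins.
print(merge_cache_modes(["public", None]))
```

In this model a single module left at the default is enough to make the whole response private, which is why overriding the method in the one module would be sufficient.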

Looks like anomie has you covered wrt edge side caching. I would also encourage collecting stats about the hit/miss rate of this edge side caching, if possible. We know that hit rates at the origin are currently[1] in the 65-80% range; I'm not sure how that will translate to the edge side.

[1] https://grafana.wikimedia.org/dashboard/db/elasticsearch-percentiles?panelId=55&fullscreen

Thanks, @Anomie and @EBernhardson . @phuedx would you please arrange the subtask to inquire with the Wikibase people?

As for the technical implementation of the updated caching support, @ovasileva, is there any problem with @phuedx doing it if the Wikibase people are good with it? Irrespective of who does the technical implementation, the unavailability of the caching is a technical blocker for rollout.

> The reason is that Wikibase\Client\Api\PageTerms doesn't override ApiQueryBase::getCacheMode(), so it uses the default 'private'. You'll likely want to create a subtask for the Wikibase people to look at whether that class can safely override the method to return 'public' or 'anon-public-user-private'.

Thanks for taking the time to look @Anomie. This is what I suspected but I didn't get around to confirming it.

@dr0ptp4kt, @phuedx - no problem with me of course, let's make sure the wikibase people are okay with it. @phuedx - do you know how large this would be in scope (relatively)? I'm assuming it will go into the next sprint.

@ovasileva: I created the task earlier and @hoo was good enough to pick it up almost immediately.

@phuedx - yup, just saw right after I commented, thank you!

With @hoo's change merged, I can verify that this works for anonymous users. For logged-in users, however, it doesn't and I still see the following response header:

Cache-Control: private, must-revalidate, max-age=0

Quoting from the API documentation:

Currently the API uses a logged-in user's language setting by default, so responses to logged-in users are always private. This can be avoided by adding the uselang=content API parameter (T97096).

I can verify that adding uselang=content to the query parameters makes this work for all users.
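As an illustration of that change, an existing request URL can be rewritten to carry the caching parameters discussed in this task. This is a sketch only: `with_shared_caching` is a hypothetical helper, and it is not taken from the RelatedArticles patch.

```python
from urllib.parse import parse_qsl, urlencode, urlsplit, urlunsplit


def with_shared_caching(url, max_age=86400):
    """Return `url` with uselang=content and the (s)maxage parameters set."""
    parts = urlsplit(url)
    params = dict(parse_qsl(parts.query, keep_blank_values=True))
    params.update({
        "uselang": "content",      # keeps logged-in responses from being private
        "maxage": str(max_age),    # browser cache lifetime, in seconds
        "smaxage": str(max_age),   # shared (edge) cache lifetime, in seconds
    })
    return urlunsplit(parts._replace(query=urlencode(params)))
```

Existing parameters are preserved, and any earlier maxage, smaxage, or uselang values are overwritten.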

Change 308143 had a related patch set uploaded (by Phuedx):
Cache logged-in "morelike" requests too

https://gerrit.wikimedia.org/r/308143

Change 308143 merged by jenkins-bot:
Cache "morelike" requests for all users

https://gerrit.wikimedia.org/r/308143