Page MenuHomePhabricator

REST API: ETag missing from some responses from page HTML endpoint
Closed, ResolvedPublicBUG REPORT

Description

Steps to replicate the issue:

  • Retrieve HTML from a REST API endpoint, e.g: https://en.wikipedia.org/w/rest.php/v1/page/Main_Page/html?redirect=no

What happens?:
Most responses from WMF sites contain an ETag, some do not.

Notes:

It was initially thought that the behaviour depends on whether or not a query parameter is present, but that does not seem to be the case. The first step to resolving the issue will be to understand what causes the ETag to be missing sometimes. It is suspected that the ETag is lost along the way through layers of frontend caches.

Note that frentend caching should currently not apply to responses from the page HTML endpoint: These endpoints do not (yet) see heavy loads, and requests are spread evently among all pages, so the cache hit rate would be low. This is likely to change once we see more traffic on thise endpoints, and can observe access patterns that allows us to determin effective caching parameters.

Details

Related Changes in Gerrit:

Event Timeline

@Anoop RESTBase API responds ETag properly when redirect=false is set. It works well. But rest.php is a part of mediawiki core, not RESTBase. This should be MediaWiki-REST-API .

This has become a much bigger issue because RESTbase is now redirecting to the REST API.

But it's only affecting some requests, e.g.

$ curl -I 'https://en.wikipedia.org/w/rest.php/v1/page/Main_Page/html?redirect=false' | grep etag
etag: W/"1267837350/700f3eff-db65-11ef-a156-934bb9987377/view/html"

is fine.

But here's a request that's failing:

$ curl -i 'https://en.wikipedia.org/w/rest.php/v1/page/Wikipedia%3ARequests_for_adminship%2Ftheleekycauldron_2/html?redirect=false'
HTTP/2 200 
date: Sat, 25 Jan 2025 22:31:01 GMT
server: mw-api-ext.eqiad.main-b576cd585-v9j2n
x-content-type-options: nosniff
cache-control: no-cache
content-language: en
content-revision-id: 1170967427
access-control-allow-origin: *
last-modified: Fri, 24 Jan 2025 18:49:47 GMT
content-type: text/html; charset=utf-8; profile="https://www.mediawiki.org/wiki/Specs/HTML/2.8.0"
vary: x-restbase-compat, Accept-Encoding
age: 2
x-cache: cp1112 hit, cp1112 miss
x-cache-status: hit-local
server-timing: cache;desc="hit-local", host;desc="cp1112"
strict-transport-security: max-age=106384710; includeSubDomains; preload
report-to: { "group": "wm_nel", "max_age": 604800, "endpoints": [{ "url": "https://intake-logging.wikimedia.org/v1/events?stream=w3c.reportingapi.network_error&schema_uri=/w3c/reportingapi/network_error/1.0.0" }] }
nel: { "report_to": "wm_nel", "max_age": 604800, "failure_fraction": 0.05, "success_fraction": 0.0}
set-cookie: WMF-Last-Access=25-Jan-2025;Path=/;HttpOnly;secure;Expires=Wed, 26 Feb 2025 12:00:00 GMT
set-cookie: WMF-Last-Access-Global=25-Jan-2025;Path=/;Domain=.wikipedia.org;HttpOnly;secure;Expires=Wed, 26 Feb 2025 12:00:00 GMT
x-client-ip: 151.202.4.10
set-cookie: GeoIP=US:NY:Brooklyn:40.73:-73.94:v4; Path=/; secure; Domain=.wikipedia.org
set-cookie: NetworkProbeLimit=0.001;Path=/;Secure;SameSite=Lax;Max-Age=3600
accept-ranges: bytes

It's really odd, I don't see the pattern. It seems like to Etag is lost along the way for some requests.

Please see additional comments related to this topic here: https://phabricator.wikimedia.org/T374683

daniel renamed this task from ETag is not generated by REST API when query parameters are given to REST API: ETag missing from some responses from page HTML endpoint.Jan 29 2025, 6:20 AM
daniel updated the task description. (Show Details)

I can't reproduce this issue locally, the etag is always present for me. We also have a Mocha test checking the etag.

I do get etags for *most* pages, with or without query params. I wonder if the issue is with the frontend cache. But I don't really see the pattern:

No etag:
x-cache: cp3071 hit, cp3071 miss
x-cache-status: hit-local

With etag:
x-cache: cp3071 hit, cp3071 pass
x-cache-status: hit-local

With etag:
x-cache: cp3071 miss, cp3071 miss
x-cache-status: miss

With etag:
x-cache: cp3071 miss, cp3071 hit/1
x-cache-status: hit-front

No etag:
x-cache: cp3072 hit, cp3072 pass
x-cache-status: hit-local

No etag:
x-cache: cp3072 hit, cp3072 miss
x-cache-status: hit-local

Whether or not a query parameter is used does not seem to be the deciding factor, though it may change the outcome for a given page.

I note that the response contains cache-control: no-cache, which (despite the name) allows caching, while the raw response from the backend should have Cache-Control: private, max-age=0, s-maxage=0, according to local tests.

Pinging @akosiaris and the Traffic team in the hope that they can shed some light.

I 'll take a closer look tomorrow but this is indeed possibly older cached content. See for example

$ curl -s -I 'https://en.wikipedia.org/w/rest.php/v1/page/Wikipedia%3ARequests_for_adminship%2Ftheleekycauldron_2/html?ts=42'  |grep -i etag
etag: W/"1170967427/60903dfd-de62-11ef-b1ec-8dfa4061cb8f/view/html"

I note that the response contains cache-control: no-cache, which (despite the name) allows caching, while the raw response from the backend should have Cache-Control: private, max-age=0, s-maxage=0, according to local tests.

I don't think that's true. Per

bast3007:~$ curl -s -4 -o /dev/null -v --connect-to en.wikipedia.org:443:mw-api-int.discovery.wmnet:4446 'https://en.wikipedia.org/w/rest.php/v1/page/Wikipedia%3ARequests_for_adminship%2Ftheleekycauldron_2/html' |& grep cache-control
< cache-control: max-age=5

For what is worth, the above command is being run in the wikimedia infra and instructs curl to not go via the CDN but rather straight to mw-api-int

I 'll take a closer look tomorrow but this is indeed possibly older cached content. See for example

Do we have an idea why older responses would be missing the etag in the first place?

To expand on that, if we think it's older cached content, shouldn't there be a corresponding change that has fixed it for new entries? And given there's a max time limit on varnish content, said change would've needed to be deployed within that time window.

I 'll take a closer look tomorrow but this is indeed possibly older cached content.

@akosiaris Any news on that?

Old cached content seems unlikely - The behavior of this endpoints hasn't changed in months - years, probably. Also, the responses shouldn't even be cacheable, the response from MediaWiki should have s-maxage=0.

But it's curious that the issue pops up at random, it comes and goes even for the same URL, with no discernable patter.
Is it possible that there is a misconfiguration somewhere, causing the responses to be cached without an etag if the request is routed through certain boxes, but not when it goes through others?

( Just to record, in my tests, repeating querying one URL made its ETag disappear (:

Also, as a work around for this issue, mwbot-rs currently accepts E-Tag's being missing, since 09526ac99771 ("Fix MW REST API compatibility"). After resolving this issue, the related changes should be reverted.

Hi, sorry, I ended up having no time to look into this last week, will do my best to look into it this week.

( Just to record, in my tests, repeating querying one URL made its ETag disappear (:

Amazing! Do you happen to have the x-cache and x-cache-status headers before and after the "vanishing"?

What's the difference between hit-local and hit-front?

test1:
etag: W/"1267837350/179516a0-e2a2-11ef-8c61-3717988862a1/view/html"
vary: x-restbase-compat, Accept-Encoding
age: 2
x-cache: cp5023 miss, cp5023 miss
x-cache-status: miss

etag: W/"1267837350/179516a0-e2a2-11ef-8c61-3717988862a1/view/html"
vary: x-restbase-compat, Accept-Encoding
age: 4
x-cache: cp5023 miss, cp5023 hit/1
x-cache-status: hit-front

etag: W/"1267837350/179516a0-e2a2-11ef-8c61-3717988862a1/view/html"
vary: x-restbase-compat, Accept-Encoding
age: 8
x-cache: cp5023 miss, cp5023 hit/2
x-cache-status: hit-front

x-cache: cp5023 hit, cp5023 miss
x-cache-status: hit-local

test2:
etag: W/"1267837350/179516a0-e2a2-11ef-8c61-3717988862a1/view/html"
vary: x-restbase-compat, Accept-Encoding
age: 0
x-cache: cp5023 miss, cp5023 miss
x-cache-status: miss

etag: W/"1267837350/179516a0-e2a2-11ef-8c61-3717988862a1/view/html"
vary: x-restbase-compat, Accept-Encoding
age: 13
x-cache: cp5023 miss, cp5023 hit/1
x-cache-status: hit-front

age: 0
x-cache: cp5023 hit, cp5023 miss
x-cache-status: hit-local

...so responses with x-cache-status: hit-local are missing the etag?

not completely. after sending the request for more times, the ETag is continuously missing but hit-local seldomly become hit-front (also without etag), however, it will become hit-local again shortly (basically in next request).

I 've sent 200 requests in quick succession to the internal MW endpoint and I can confirm that it is setting the etag, consistently 100% of the time

for i in `seq 1 200`; do curl -s -4 -o /dev/null -v --connect-to en.wikipedia.org:443:mw-api-int.discovery.wmnet:4446 'https://en.wikipedia.org/w/rest.php/v1/page/Wikipedia%3ARequests_for_adminship%2Ftheleekycauldron_2/html' ; done |& grep -Ei etag | sort | uniq -c

  200 < etag: "1170967427/6862e135-df21-11ef-93b3-7502db1f6045/view/html"

I 've already did the same thing for Cache-Control and the results definitely says max-age=5 100% of the time

for i in `seq 1 200`; do curl -s -4 -o /dev/null -v --connect-to en.wikipedia.org:443:mw-api-int.discovery.wmnet:4446 'https://en.wikipedia.org/w/rest.php/v1/page/Wikipedia%3ARequests_for_adminship%2Ftheleekycauldron_2/html' ; done |& grep -Ei cache-control | sort | uniq -c
    200 < cache-control: max-age=5

I 've also decided to bypass envoy (and the internal TLS) and talk straight to MediaWiki's Apache. Same thing

for i in `seq 1 200` ; do curl -s -4 -o /dev/null -v --connect-to en.wikipedia.org:80:10.67.144.154:8080 'http://en.wikipedia.org/w/rest.php/v1/page/Wikipedia%3ARequests_for_adminship%2Ftheleekycauldron_2/html' ; done |& grep -Ei 'etag|cache-control' | sort | uniq -c
    200 < Cache-Control: max-age=5
    200 < ETag: "1170967427/6862e135-df21-11ef-93b3-7502db1f6045/view/html"

I think with this we can definitely rule out that it's MediaWiki or Envoy to blame here and indeed look at the CDN layer.

As a general point, the hit-local vs hit-front thing is set by Varnish. The corresponding code is in wikimedia-frontend.vcl.

But I witness the problem before Varnish, in fact, at Apache Traffic Server, which is the part of the double layered CDN that talks straight to MediaWiki. Judging from the following

cp3066:~$ for i in `seq 1 200` ; do sleep 0.3 ; curl -s -4 -o /dev/null -v --connect-to en.wikipedia.org:80:127.0.0.1:3128 'http://en.wikipedia.org/w/rest.php/v1/page/Wikipedia%3ARequests_for_adminship%2Ftheleekycauldron_2/html' ; done |& grep -Ei 'etag|cache-control' | sort | uniq -c
     18 < cache-control: max-age=5
    182 < cache-control: no-cache
     18 < etag: "1170967427/6862e135-df21-11ef-93b3-7502db1f6045/view/html"

which instructs curl to talk straight to ATS on localhost on a CDN server.

Re-running the above shortly after (a couple of minutes) produces the following

cp3066:~$ for i in `seq 1 200` ; do sleep 0.3 ; curl -s -4 -o /dev/null -v --connect-to en.wikipedia.org:80:127.0.0.1:3128 'http://en.wikipedia.org/w/rest.php/v1/page/Wikipedia%3ARequests_for_adminship%2Ftheleekycauldron_2/html' ; done |& grep -Ei 'etag|cache-control' | sort | uniq -c
    200 < cache-control: no-cache

Judging from the above, there appears to be a race condition here, where just 18 of the requests had an etag and the original cache-control headers. Afterwards, all subsequent requests don't have the etag header, nor the original cache-control header.

The 300ms sleep, is for some reason critical in reproducing this. Without it, the first 200 requests work out fine and the problem becomes visible in subsequent runs of the curl snippet above. The 300ms is arbitrary, it just felt like it would prove my theory and apparently it does. I am assuming higher numbers will reproduce this with a smaller number of requests being successful.

Adding @Vgutierrez for his input.

interesting issue, this is what I found so far after issuing the following cmd:

vgutierrez@cp7001:~$ for i in `seq 1 200`; do sleep 0.3 ; curl -s -o /dev/null -v --connect-to en.wikipedia.org:80:127.0.0.1:3128 'http://en.wikipedia.org/w/rest.php/v1/page/Wikipedia%3ARequests_for_adminship%2Ftheleekycauldron_2/html'; done |& fgrep -i etag | sort |uniq -c
     17 < etag: "1170967427/6862e135-df21-11ef-93b3-7502db1f6045/view/html"

those 17 requests getting an etag resulted on TCP_MISS, TCP_HIT or TCP_MEM_HIT during the first 5 seconds (remember that max-age=5). After those 5 seconds, the subsequent 183 requests triggered a TCP_REFRESH_HIT on ATS and those responses are missing an etag.

Sending a conditional request to mw-api-ext-ro I get the following response headers:

vgutierrez@cp7002:~$ curl -s -o /dev/null -H 'If-Modified-Since: Thu, 30 Jan 2025 15:46:45 GMT' -v --connect-to en.wikipedia.org:443:mw-api-ext-ro.discovery.wmnet:4447 'https://en.wikipedia.org/w/rest.php/v1/page/Wikipedia%3ARequests_for_adminship%2Ftheleekycauldron_2/html'
[...]
> GET /w/rest.php/v1/page/Wikipedia%3ARequests_for_adminship%2Ftheleekycauldron_2/html HTTP/1.1
> Host: en.wikipedia.org
> User-Agent: curl/7.74.0
> Accept: */*
> If-Modified-Since: Thu, 30 Jan 2025 15:46:45 GMT
> 
[...]
< HTTP/1.1 304 Not Modified
< date: Wed, 05 Feb 2025 15:34:45 GMT
< server: mw-api-ext.eqiad.main-86897d9bb9-tvlrq
< x-powered-by: PHP/7.4.33
< x-content-type-options: nosniff
< cache-control: no-cache
< vary: x-restbase-compat
< access-control-allow-origin: *
< traceparent: 00-fdd8ec16bc143eb185935750f1597b09-ebd549e33355e240-00
< x-request-id: 9b98f2c2-abcf-41ed-b5b0-d39be69c2b46
< backend-timing: D=87647 t=1738769685094715
< x-envoy-upstream-service-time: 87
< 
* Connection #0 to host mw-api-ext-ro.discovery.wmnet left intact

RFC 9110 (https://datatracker.ietf.org/doc/html/rfc9110#name-304-not-modified) says:

The server generating a 304 response MUST generate any of the following header fields that would have been sent in a 200 (OK) response to the same request:

Content-Location, Date, ETag, and Vary
Cache-Control and Expires (see [CACHING])
Since the goal of a 304 response is to minimize information transfer when the recipient already has one or more cached representations, a sender SHOULD NOT generate representation metadata other than the above listed fields unless said metadata exists for the purpose of guiding cache updates (e.g., Last-Modified might be useful if the response does not have an ETag field).

my understanding is that the etag header must be sent by mw-api-ext-ro in the 304 response to comply with RFC 9110

@daniel, looks like Valentin found the culprit, aka MediaWiki not sending the ETag header in 304 Not Modified responses. Any thoughts?

@daniel, looks like Valentin found the culprit, aka MediaWiki not sending the ETag header in 304 Not Modified responses. Any thoughts?

Working on a patch now! Thank you for the investigation.

Change #1117567 had a related patch set uploaded (by Daniel Kinzler; author: Daniel Kinzler):

[mediawiki/core@master] REST: include ETag in 304 responses

https://gerrit.wikimedia.org/r/1117567

Change #1117567 merged by jenkins-bot:

[mediawiki/core@master] REST: include ETag in 304 responses

https://gerrit.wikimedia.org/r/1117567

Tim merged the fix yesterday, so the problem should go away with this weeks train deployment.

@akosiaris @Vgutierrez Thank you again for digging in and identifying the root cause, this really had me stumped.

Thank all of you for looking into this (: