Page MenuHomePhabricator

ATS-BE Lua mitigations for cacheable responses w/ Set-Cookie seemingly not working
Closed, ResolvedPublic

Event Timeline

Restricted Application added a subscriber: Aklapper. · View Herald Transcript

Change 632192 had a related patch set uploaded (by Ema; owner: Ema):
[operations/puppet@production] vcl: include response status in cacheable cookie logging

https://gerrit.wikimedia.org/r/632192

Change 632192 merged by Ema:
[operations/puppet@production] vcl: include response status in cacheable cookie logging

https://gerrit.wikimedia.org/r/632192

I've broadened the search to the past 2 months, and there are a total of 10 matching log entries, all of which are from Parsoid. All are related to edit counts:

2020-09-28 15:21:55 Cacheable object with Set-Cookie found! bereq.http.Host: de.wikipedia.org bereq.url: /w/rest.php/v1/page/Clemens_Arvay/history/counts/edits
2020-09-27 14:15:21 Cacheable object with Set-Cookie found! bereq.http.Host: tr.wikipedia.org bereq.url: /w/rest.php/v1/page/T%C3%BCrkiye'de_LGBT_haklar%C4%B1/history/counts/edits
2020-09-25 09:57:47 Cacheable object with Set-Cookie found! bereq.http.Host: en.wikipedia.org bereq.url: /w/rest.php/v1/page/Geronimo_Stilton/history/counts/edits
2020-09-17 16:18:57 Cacheable object with Set-Cookie found! bereq.http.Host: ja.wikipedia.org bereq.url: /w/rest.php/v1/page/%E3%82%B8%E3%83%A3%E3%82%BA/history/counts/edits
2020-09-01 10:58:15 Cacheable object with Set-Cookie found! bereq.http.Host: de.wikipedia.org bereq.url: /w/rest.php/v1/page/Brodecks_Bericht_(Comic)/history/counts/edits
2020-08-31 10:59:58 Cacheable object with Set-Cookie found! bereq.http.Host: ja.wikipedia.org bereq.url: /w/rest.php/v1/page/%E6%88%8C%E7%A5%9E%E3%81%93%E3%82%8D%E3%81%AD/history/counts/edits
2020-08-31 08:08:21 Cacheable object with Set-Cookie found! bereq.http.Host: en.wikipedia.org bereq.url: /w/rest.php/v1/page/Kent,_Washington/history/counts/edits
2020-08-30 18:09:45 Cacheable object with Set-Cookie found! bereq.http.Host: fr.wikipedia.org bereq.url: /w/rest.php/v1/page/Assembl%C3%A9e_de_Mac%C3%A9doine_du_Nord/history/counts/edits
2020-08-30 14:35:56 Cacheable object with Set-Cookie found! bereq.http.Host: ja.wikipedia.org bereq.url: /w/rest.php/v1/page/Tokyo_Rendez-Vous/history/counts/edits
2020-08-29 18:30:08 Cacheable object with Set-Cookie found! bereq.http.Host: de.wikipedia.org bereq.url: /w/rest.php/v1/page/Blast_(Comic)/history/counts/edits

It still isn't clear to me how the cookies are making it through ats-be, but this pattern seems like an interesting lead. I've added the response status code to the VCL logging code too, as @Tgr hinted that it could be useful to specifically look for error responses.

ema triaged this task as Medium priority.Oct 5 2020, 9:31 AM
ema moved this task from Backlog to Bug Reports on the Traffic board.

That endpoint are from the history API, not Parsoid. Also, I think it is only used by mobile apps (but we should double-check that via the webrequest data) so while definitely worth investigating, I don't think it was what caused the latest incident.

Nevertheless, would be interesting to know why is it the one endpoint that shows up in these logs. It is one of the endpoints that set public caching, but not the most frequently called one (according to HeaderCallback logs, see T256395#6298796).

That endpoint are from the history API, not Parsoid.

Ah, thanks for the clarification!

I don't think it was what caused the latest incident.

Yeah, at least based on "cacheable cookies" logs on both ATS and Varnish there is no indication that T264370 is related to the CDN. @CDanis only noticed the logstash entries mentioned here while looking for clues, I don't think he meant to suggest that there was a link.

Nevertheless, would be interesting to know why is it the one endpoint that shows up in these logs

Given that there is no matching log from ATS backends, I'm starting to suspect some complex interaction between Varnish and ATS on conditional requests issued by Varnish.

<full_speculation_mode>
Something along the lines of:

  1. the history API responds with Cache-Control: max-age=60 and no session cookie, thus no logging of "cacheable cookie" at the ATS level. The object gets cached by both ATS and Varnish
  2. Upon revalidation, Varnish sends a conditional request to ATS backend, which either has no cached object anymore or also has to revalidate from the origin
  3. The origin responds with a session cookie and no Cache-Control
  4. Varnish does the logging (on conditional requests, beresp isn't simply the current backend response as any sane person would think, it also includes all headers originally cached)

</full_speculation_mode>

I did verify some time ago that basic IMS works sanely, but perhaps there's some additional tricky detail to take into account. Adding beresp.was_304 to "cacheable cookie logs" would probably help clarify things a little bit.

@CDanis only noticed the logstash entries mentioned here while looking for clues, I don't think he meant to suggest that there was a link.

FTR, yes, precisely. I noticed it and it seemed inexplicable, so filed a task. But I didn't think it was directly related to the latest occurrence.

Change 632499 had a related patch set uploaded (by Ema; owner: Ema):
[operations/puppet@production] vcl: include beresp.was_304 in cacheable cookie logging

https://gerrit.wikimedia.org/r/632499

Change 632499 merged by Ema:
[operations/puppet@production] vcl: include beresp.was_304 in cacheable cookie logging

https://gerrit.wikimedia.org/r/632499

Change 643256 had a related patch set uploaded (by Ema; owner: Ema):
[operations/puppet@production] varnish: add 36-set-cookie-logging-false-positives.vtc

https://gerrit.wikimedia.org/r/643256

Change 643256 merged by Ema:
[operations/puppet@production] varnish: add 36-set-cookie-logging-false-positives.vtc

https://gerrit.wikimedia.org/r/643256

ema claimed this task.

<full_speculation_mode>

  1. the history API responds with Cache-Control: max-age=60 and no session cookie, thus no logging of "cacheable cookie" at the ATS level. The object gets cached by both ATS and Varnish
  2. Upon revalidation, Varnish sends a conditional request to ATS backend, which either has no cached object anymore or also has to revalidate from the origin
  3. The origin responds with a session cookie and no Cache-Control
  4. Varnish does the logging (on conditional requests, beresp isn't simply the current backend response as any sane person would think, it also includes all headers originally cached)

Since I added beresp.was_304 to the logs on October 6, we had 3 requests being logged:

All of the above have beresp.was_304: true, confirming my wild guess.

I could reproduce the false positive with https://gerrit.wikimedia.org/r/c/operations/puppet/+/643256:

  • The first request is a cache miss, the object returned by the origin is cacheable with max-age=2
  • We wait 3 seconds to make the object expire, then send a new request. This results in a cache hit under grace, triggering a bgfetch to refresh the object
  • The origin server responds to the conditional request issued by varnish claiming that the object has not changed (HTTP 304), but it also adds a session cookie
  • The Set-Cookie header is received by varnish, we create a hit-for-pass object and expire the previously cached object. Set-Cookie is not returned to the client
  • The following entry is added to syslog: Nov 24 13:50:26 vtcpcc varnishd[9377]: Cacheable object with Set-Cookie found! beresp.status: 200 beresp.was_304: true bereq.http.Host: en.wikipedia.org bereq.url: /1 Cache-Control: max-age=2 Set-Cookie: enwikiSession=42 Server: ReqId: X-Cache-Int:
  • A further request for the same object results in a pass due to the HFP object created above

The logs are thus not an indication of any issues with the ats-be Lua mitigations. They do show however that occasionally Mediawiki is returning session cookies while claiming that an object has not changed. Such events do not cause the cookies to be cached, on the contrary they invalidate the object previously cached.

Thanks for the thorough investigation!

So, should we try to prevent MediaWiki from emitting cookies in 304 responses, or it doesn't really matter? IIUC it will result in misleading logs but is otherwise harmless. Can the logging code be fixed to avoid false positives?

What's the web standards perspective - is returning cookies in a 304 response wrong? The HTTP spec does not seem to imply so. (It does say that sender SHOULD NOT generate representation metadata other than the above listed fields unless said metadata exists for the purpose of guiding cache updates, but while "representation metadata" doesn't seem to be defined anywhere, I don't think it would include Set-Cookie headers, which are a state management mechanism orthogonal to how resources are represented. I think we even use 304 + Set-Cookie intentionally in some cross-domain login mechanisms.

The spec on cache updates says that caches should use other header fields [than Warning] provided in the 304 (Not Modified) response to replace all instances of the corresponding header fields in the stored response so emitting a 304 with cookies will result in those cookies being cached (not on Varnish due to the custom protections, but maybe on some intermediary cache?). But then so will anything else with cookies, so it does not seem like an especially risky thing to do.

So, should we try to prevent MediaWiki from emitting cookies in 304 responses, or it doesn't really matter? IIUC it will result in misleading logs but is otherwise harmless.

It seems to me that setting session cookies on a 304 response is at best useless: varnish does not return them to the user anyways as the VTC above indicates, and they might be hinting at some logic issue in Mediawiki? Returning a cached object should not create a new session (I'm assuming that this is what Set-Cookie: ss0-dewikiSession=blabla does, but correct me if I'm wrong).

What's the web standards perspective - is returning cookies in a 304 response wrong?

My understanding of rfc7232#section-4.1 is that 304s should not include any headers except for Cache-Control, Content-Location, Date, ETag, Expires, and Vary:

The server generating a 304 response MUST generate any of the
following header fields that would have been sent in a 200 (OK)
response to the same request: Cache-Control, Content-Location, Date,
ETag, Expires, and Vary.

Since the goal of a 304 response is to minimize information transfer
when the recipient already has one or more cached representations, a
sender SHOULD NOT generate representation metadata other than the
above listed fields unless said metadata exists for the purpose of
guiding cache updates (e.g., Last-Modified might be useful if the
response does not have an ETag field).

Also note that Cache-Control must be included, and surely there's something fishy if we previously sent Cache-Control: public (this is a conditional request from Varnish, so there's a corresponding object in cache), and we're saying that the object hasn't changed but Cache-Control is now private -- which we should be saying if a session cookie is sent.

So yeah, while I do believe that there's no direct functional issue caused by MW emitting cookies on 304s, it does smell strange. :-)

I think especially if you start considering how Vary: Cookie works in all the above (both for MW on the related 200 and 304 outputs, and in the caches and our VCL), it's quite murky to me whether all of this works sanely in this case. For a given URI, I think we can assume (or at least hope) that Vary: Cookie would be consistently either emitted or not-emitted with all outputs for a given URI (even 304s). But whether we're tracing a V:C or non-V:C case probably changes how all the above plays out with the bgfetch as well due to vary-slotting, if the original was supposedly cacheable and the followup response has a Set-Cookie (which is hopefully uncacheable).

(and to throw another dimension into the matrix of possibilities above - also whether the client is sending a session cookie to Vary on in either or both requests)