Page MenuHomePhabricator

Cookie value sent in HTTP requests changes too frequently
Open, MediumPublic

Description

How to reproduce

  • verify that, you are logged out from wikipedia website;
  • browse a wiki article page;
  • look at HTTP request header "cookie" sent by browser for the article page;
  • take notice of *TickCount=number-of-minute value;
  • browse another wiki page and wait for its complete retrieval;
  • browse the first wiki article again and notice that if "cookie" value has not changed in request headers, then article page should have been requested conditionally because there should be also an "If-modified-since" in request headers;
  • wait 1 minute;
  • browse another wiki article and wait a few (3-4) seconds;
  • browse the first wiki article again and notice that in request message, cookie value has changed and that now there is no "if-modified-since" header (this means that web server has certainly resent the whole page).

Actual results
In last above step, browser has surely invalidated its cached copy of the wiki page and has not sent "if-modified-since" header in request message for the wiki page, thus web server has resent the whole wiki page (up to 200kB .. 300kB of compressed HTML code which can result in 1MB .. 1.5MB of HTML code, etc.).

Expected results
Cookie value should not change every minute in order to let browsers to cache wiki article pages for at least a few hours or until user session does not change (which surely happens if browser is closed and reopened).

Ideas about how to avoid to change cookie value of *TickCount every minute

  1. move that cookie (*TickCount=value) to a new custom HTTP header, i.e. "XApp-cookie: *TickCount=value", then in web server or in web application retrieve that value and readd it to the list of cookies so that it can be processed as usual; this would work for both logged out and logged in users;
  2. use a longer update time, i.e. 5 minutes instead of 1 minute; this would not be ideal but it could be a significant improvement in cache period of time;
  3. avoid to send and to use that cookie *TickCount (elapsed session time) in HTTP requests for wiki articles; maybe that value could be sent via a separated XMLHttpRequest HTTP message sent every 1 .. 2 minutes (maybe when wiki page is still loading); that request message would use the same HTTP/2 TCP connection used for retrieving wiki page and related objects, so no further overhead due a new TCP connection should happen;
  4. compute duration of user session on server side after each client request for a wiki page thus giving up to find out wholes in session time; i.e. as the initial start time of user session is given by a web application then the duration of a user session could be computed by a web application too by subtracting current time from initial session time every time a request for a wiki article page is sent.

Sent on behalf of Ade56facc, split from T295556

Event Timeline

Krinkle triaged this task as Medium priority.Jan 10 2022, 8:04 PM

I've re-confirmed the bug and its nature of being a missed browser cache oppertunity specific to local repeat views of the same article. It is also specific to only the HTML resource. It does not affect CDN caching (we still get a hit there), and it does not affect browser caching of non-HTML resources such as images and static assets from load.php. This is because those do not (and must not) carry a Vary: Cookie response header like page views do, and thus the browser isn't preventing itself from even trying to get an HTTP 304 response.

I've also confirmed that the server-side is working correctly. That is, both our CDN layer and our backend appservers still respond correctly with HTTP 304 Not Modified to requests that carry a qualifying if-modified-since header.

For example:

curl 'https://en.wikipedia.org/wiki/Theropoda' --compressed -I

HTTP/2 200 
x-cache-status: hit-front
age: 40567
server: mw1452.eqiad.wmnet
last-modified: Wed, 23 Feb 2022 01:55:19 GMT
...

curl -I 'https://en.wikipedia.org/wiki/Theropoda' -H 'if-modified-since:  Wed, 23 Feb 2022 01:55:19 GMT'

HTTP/2 304
x-cache-status: hit-front
...

And behind the CDN, directly via mwdebug and with the usual HTTP cookies (not the session tick JS cookies)

curl -I 'https://en.wikipedia.org/wiki/Theropoda' -H 'if-modified-since:  Wed, 23 Feb 2022 01:55:19 GMT' -H 'X-Wikimedia-Debug: mwdebug1002.eqiad.wmnet' -H 'Cookie: WMF-Last-Access-Global=23-Feb-2022; WMF-Last-Access=23-Feb-2022; GeoIP=GB:ENG:---'

HTTP/2 304
...

This took me a while to figure out since Chrome's devtools appear to have become rather useless. Its GUI states a response to be HTTP 200 even if it was actually HTTP 304. It's only from the response size that I can what truly happened. Anyway, hence I switched to cURL above, which does display this correctly.

Screenshot 2022-02-23 at 14.02.55.png (245×847 px, 39 KB)
Screenshot 2022-02-23 at 14.03.30.png (320×2 px, 62 KB)

In addition, I've locally patched modules/ext.wikimediaEvents/sessionTick.js on mwdebug1002 and set enabled = false to further confirm end-to-end that with that change in place, Firefox and Chrome are able to make the right requests again, and get the correct response that instructs them to use the local browser cache.

I'll explore whether this can be switched to use localStorage instead, which seems like a more appropiate technology for this purpose.

I also notice that we currently have multiple event instrumention sessionId's being computed which is expensive, redundant, and probably leading to developer confusion since one of them is for browsing sessions unbound and the other is reset by WikimediaEvents ticks, though stored via EventLogging code:

cookie:

  # from mediawiki-core / mw.user.sessionId()
  enwikimwuser-sessionId=xxxxxxx;

  # from EventLogging / mw.eventLog, reset by WikimediaEvents/sessionTick.js
  enwikiel-sessionId=yyyyyyy;
Krinkle added subscribers: jlinehan, Krinkle.

@jlinehan Could you review the above and assess the possibility of storing the volatile tick time and tick count outside the cookie?

kzimmerman added subscribers: EChetty, kzimmerman.

@jlinehan is currently on leave. Tagging @EChetty and the Metrics Platform team.

@EChetty can you check with the team on how Session Length might be impacted?

mpopov added a subscriber: mpopov.

@jlinehan Could you review the above and assess the possibility of storing the volatile tick time and tick count outside the cookie?

The sessionTick instrument is the only code that accesses both cookies. When the instrument sets the cookies, it does so with the default domain, which is '' everywhere (apart from on wikitech/labstestwikitech?). The cookies are effectively per-origin storage.

I found T248987#6290436, which explains why cookies were chosen originally (emphasis my own):

we can use local storage as a synchro point as it is shared across tabs and windows ... there is also the BroadcastChannel object but it does not have wide support.

I avoided using BroadcastChannel or any of the other stuff that is not widely supported.

It is possible to solve this problem using localStorage rather than cookies, with the key benefit being to use the storage event in order to do some primitive IPC, but we decided on cookies over localStorage due to compatibility, consistency with our past decisions on localStorage, and the fact that cookies allow us to easily implement the desired session TTL. I also don't know that we are using the storage event elsewhere, how solid support is, or whether it actually reduces the complexity of the instrument.

The cookies that the instrument sets are session cookies, which should be cleared when the browsing session ends. In practice, however, that isn't the case, e.g. Chrome, Firefox, and Safari (at least) all allow you to resume a previous browsing session complete with session cookies. Remembering that the instrument ticks once every minute, a user could end their browsing session and resume it at any point in that time window and, from the point of view of the instrument, be continuing their session. Switching to localStorage would make the above scenario more likely since it persists between browsing sessions but at least it would be consistent.

Jason's point about compatibility might be a good one. A quick query of the mediawiki_client_session_tick Hive table yielded ~1500 unique browser family-major version pairs that are submitting events. For those browsers that have good coverage on Can I use…, it's relatively trivial to assess support for localStorage. Assessing support for the rest will require involved investigation, e.g. I'm confident that the Naver Whale browser supports localStorage because it is based on Chromium and its first release was in 2017. My recommendation would be to update the instrument to include a flag that corresponds to no support for localStorage, wait for two weeks, determine what percentage of events have the flag set, and then decide how to proceed. If that's acceptable, I'm more than happy to submit the required patches.

Change 831842 had a related patch set uploaded (by Phuedx; author: Phuedx):

[mediawiki/extensions/WikimediaEvents@master] sessionTick: Log localStorage support

https://gerrit.wikimedia.org/r/831842

The cookies that the instrument sets are session cookies, which should be cleared when the browsing session ends. In practice, however, that isn't the case, e.g. Chrome, Firefox, and Safari (at least) all allow you to resume a previous browsing session complete with session cookies. Remembering that the instrument ticks once every minute, a user could end their browsing session and resume it at any point in that time window and, from the point of view of the instrument, be continuing their session. Switching to localStorage would make the above scenario more likely since it persists between browsing sessions but at least it would be consistent.

The instrument deems a session to be idle it has been 30 minutes since the last tick so when I said

and resume it at any point in that time window and, from the point of view of the instrument, be continuing their session

I meant that 30 minute time window.

cjming removed cjming as the assignee of this task.Oct 6 2022, 6:41 PM
cjming added a subscriber: cjming.

Change 831842 merged by jenkins-bot:

[mediawiki/extensions/WikimediaEvents@master] sessionTick: Log localStorage support

https://gerrit.wikimedia.org/r/831842

That patch ☝️ will be fully deployed next Thursday (2022/10/13. After that I'll spot check the data coming in and, if it looks good, wait for a couple of weeks to collect a decent sample.

That patch ☝️ will be fully deployed next Thursday (2022/10/13. After that I'll spot check the data coming in and, if it looks good, wait for a couple of weeks to collect a decent sample.

The patch was fully deployed on Thursday at ~18:00 UTC. Since then there have been 16 validation errors, which appear to have been caused by a misbehaving client. As for the data being captured:

nn_localstorage_unsupported
400698169732476
select
  sum(1) `n`,
  sum(if(test['supportsLocalStorage'] = 0, 1, 0)) `n_localStorage_unsupported`
from
  event.mediawiki_client_session_tick
where
  year = 2022
  and month = 10
  and day >= 14
;

We'd lose ~0.18% of all session ticks if we changed the instrument to use localStorage.

@phuedx As per the code review comments, I suspect that those are browsers with localStorage supported (as per Startup module feature-test passing) but either with storage full or in a private browsing mode where localStorage pretends to have a quota of 0B, either way writes throw.

Some browsers no-op in private mode, some effectively do tab/sessionStorage instead, others (Mobile Safari) throw in the same way as if there no quota. I haven't checked what Mobile Safari does to cookies in that case. It might do something similar to cookies as well. Thus depending on what Safari does and how our code responds, it might end up behaving the same in those 0.18% cases on localStorage.

Note that in browsers where this test passed, it may fail at another time if the storage is full. Assuming we use the mw.storage abstraction layer, this failure will be hidden both for cases where storage is full and cases like Safari Private browsing where it effectively always full. It may be worth making sure that the data does something reasonable if setItem fails. E.g. fallback to counting the session for that tab only with a local variable, or stop counting if/when a write fails. That may be worth handling regardless, in which case the cases that throw today for private browsing can fall in the same bucket.