Page MenuHomePhabricator

Cookie value sent in HTTP requests changes too frequently
Closed, ResolvedPublic

Description

How to reproduce

  • verify that, you are logged out from wikipedia website;
  • browse a wiki article page;
  • look at HTTP request header "cookie" sent by browser for the article page;
  • take notice of *TickCount=number-of-minute value;
  • browse another wiki page and wait for its complete retrieval;
  • browse the first wiki article again and notice that if "cookie" value has not changed in request headers, then article page should have been requested conditionally because there should be also an "If-modified-since" in request headers;
  • wait 1 minute;
  • browse another wiki article and wait a few (3-4) seconds;
  • browse the first wiki article again and notice that in request message, cookie value has changed and that now there is no "if-modified-since" header (this means that web server has certainly resent the whole page).

Actual results
In last above step, browser has surely invalidated its cached copy of the wiki page and has not sent "if-modified-since" header in request message for the wiki page, thus web server has resent the whole wiki page (up to 200kB .. 300kB of compressed HTML code which can result in 1MB .. 1.5MB of HTML code, etc.).

Expected results
Cookie value should not change every minute in order to let browsers to cache wiki article pages for at least a few hours or until user session does not change (which surely happens if browser is closed and reopened).

Ideas about how to avoid to change cookie value of *TickCount every minute

  1. move that cookie (*TickCount=value) to a new custom HTTP header, i.e. "XApp-cookie: *TickCount=value", then in web server or in web application retrieve that value and readd it to the list of cookies so that it can be processed as usual; this would work for both logged out and logged in users;
  2. use a longer update time, i.e. 5 minutes instead of 1 minute; this would not be ideal but it could be a significant improvement in cache period of time;
  3. avoid to send and to use that cookie *TickCount (elapsed session time) in HTTP requests for wiki articles; maybe that value could be sent via a separated XMLHttpRequest HTTP message sent every 1 .. 2 minutes (maybe when wiki page is still loading); that request message would use the same HTTP/2 TCP connection used for retrieving wiki page and related objects, so no further overhead due a new TCP connection should happen;
  4. compute duration of user session on server side after each client request for a wiki page thus giving up to find out wholes in session time; i.e. as the initial start time of user session is given by a web application then the duration of a user session could be computed by a web application too by subtracting current time from initial session time every time a request for a wiki article page is sent.

Sent on behalf of Ade56facc, split from T295556

Event Timeline

Krinkle triaged this task as Medium priority.Jan 10 2022, 8:04 PM

I've re-confirmed the bug and its nature of being a missed browser cache oppertunity specific to local repeat views of the same article. It is also specific to only the HTML resource. It does not affect CDN caching (we still get a hit there), and it does not affect browser caching of non-HTML resources such as images and static assets from load.php. This is because those do not (and must not) carry a Vary: Cookie response header like page views do, and thus the browser isn't preventing itself from even trying to get an HTTP 304 response.

I've also confirmed that the server-side is working correctly. That is, both our CDN layer and our backend appservers still respond correctly with HTTP 304 Not Modified to requests that carry a qualifying if-modified-since header.

For example:

curl 'https://en.wikipedia.org/wiki/Theropoda' --compressed -I

HTTP/2 200 
x-cache-status: hit-front
age: 40567
server: mw1452.eqiad.wmnet
last-modified: Wed, 23 Feb 2022 01:55:19 GMT
...

curl -I 'https://en.wikipedia.org/wiki/Theropoda' -H 'if-modified-since:  Wed, 23 Feb 2022 01:55:19 GMT'

HTTP/2 304
x-cache-status: hit-front
...

And behind the CDN, directly via mwdebug and with the usual HTTP cookies (not the session tick JS cookies)

curl -I 'https://en.wikipedia.org/wiki/Theropoda' -H 'if-modified-since:  Wed, 23 Feb 2022 01:55:19 GMT' -H 'X-Wikimedia-Debug: mwdebug1002.eqiad.wmnet' -H 'Cookie: WMF-Last-Access-Global=23-Feb-2022; WMF-Last-Access=23-Feb-2022; GeoIP=GB:ENG:---'

HTTP/2 304
...

This took me a while to figure out since Chrome's devtools appear to have become rather useless. Its GUI states a response to be HTTP 200 even if it was actually HTTP 304. It's only from the response size that I can what truly happened. Anyway, hence I switched to cURL above, which does display this correctly.

Screenshot 2022-02-23 at 14.02.55.png (245×847 px, 39 KB)
Screenshot 2022-02-23 at 14.03.30.png (320×2 px, 62 KB)

In addition, I've locally patched modules/ext.wikimediaEvents/sessionTick.js on mwdebug1002 and set enabled = false to further confirm end-to-end that with that change in place, Firefox and Chrome are able to make the right requests again, and get the correct response that instructs them to use the local browser cache.

I'll explore whether this can be switched to use localStorage instead, which seems like a more appropiate technology for this purpose.

I also notice that we currently have multiple event instrumention sessionId's being computed which is expensive, redundant, and probably leading to developer confusion since one of them is for browsing sessions unbound and the other is reset by WikimediaEvents ticks, though stored via EventLogging code:

cookie:

  # from mediawiki-core / mw.user.sessionId()
  enwikimwuser-sessionId=xxxxxxx;

  # from EventLogging / mw.eventLog, reset by WikimediaEvents/sessionTick.js
  enwikiel-sessionId=yyyyyyy;
Krinkle added subscribers: jlinehan, Krinkle.

@jlinehan Could you review the above and assess the possibility of storing the volatile tick time and tick count outside the cookie?

kzimmerman added subscribers: EChetty, kzimmerman.

@jlinehan is currently on leave. Tagging @EChetty and the Metrics Platform team.

@EChetty can you check with the team on how Session Length might be impacted?

mpopov subscribed.

@jlinehan Could you review the above and assess the possibility of storing the volatile tick time and tick count outside the cookie?

The sessionTick instrument is the only code that accesses both cookies. When the instrument sets the cookies, it does so with the default domain, which is '' everywhere (apart from on wikitech/labstestwikitech?). The cookies are effectively per-origin storage.

I found T248987#6290436, which explains why cookies were chosen originally (emphasis my own):

we can use local storage as a synchro point as it is shared across tabs and windows ... there is also the BroadcastChannel object but it does not have wide support.

I avoided using BroadcastChannel or any of the other stuff that is not widely supported.

It is possible to solve this problem using localStorage rather than cookies, with the key benefit being to use the storage event in order to do some primitive IPC, but we decided on cookies over localStorage due to compatibility, consistency with our past decisions on localStorage, and the fact that cookies allow us to easily implement the desired session TTL. I also don't know that we are using the storage event elsewhere, how solid support is, or whether it actually reduces the complexity of the instrument.

The cookies that the instrument sets are session cookies, which should be cleared when the browsing session ends. In practice, however, that isn't the case, e.g. Chrome, Firefox, and Safari (at least) all allow you to resume a previous browsing session complete with session cookies. Remembering that the instrument ticks once every minute, a user could end their browsing session and resume it at any point in that time window and, from the point of view of the instrument, be continuing their session. Switching to localStorage would make the above scenario more likely since it persists between browsing sessions but at least it would be consistent.

Jason's point about compatibility might be a good one. A quick query of the mediawiki_client_session_tick Hive table yielded ~1500 unique browser family-major version pairs that are submitting events. For those browsers that have good coverage on Can I use…, it's relatively trivial to assess support for localStorage. Assessing support for the rest will require involved investigation, e.g. I'm confident that the Naver Whale browser supports localStorage because it is based on Chromium and its first release was in 2017. My recommendation would be to update the instrument to include a flag that corresponds to no support for localStorage, wait for two weeks, determine what percentage of events have the flag set, and then decide how to proceed. If that's acceptable, I'm more than happy to submit the required patches.

Change 831842 had a related patch set uploaded (by Phuedx; author: Phuedx):

[mediawiki/extensions/WikimediaEvents@master] sessionTick: Log localStorage support

https://gerrit.wikimedia.org/r/831842

The cookies that the instrument sets are session cookies, which should be cleared when the browsing session ends. In practice, however, that isn't the case, e.g. Chrome, Firefox, and Safari (at least) all allow you to resume a previous browsing session complete with session cookies. Remembering that the instrument ticks once every minute, a user could end their browsing session and resume it at any point in that time window and, from the point of view of the instrument, be continuing their session. Switching to localStorage would make the above scenario more likely since it persists between browsing sessions but at least it would be consistent.

The instrument deems a session to be idle it has been 30 minutes since the last tick so when I said

and resume it at any point in that time window and, from the point of view of the instrument, be continuing their session

I meant that 30 minute time window.

cjming removed cjming as the assignee of this task.Oct 6 2022, 6:41 PM
cjming subscribed.

Change 831842 merged by jenkins-bot:

[mediawiki/extensions/WikimediaEvents@master] sessionTick: Log localStorage support

https://gerrit.wikimedia.org/r/831842

That patch ☝️ will be fully deployed next Thursday (2022/10/13. After that I'll spot check the data coming in and, if it looks good, wait for a couple of weeks to collect a decent sample.

That patch ☝️ will be fully deployed next Thursday (2022/10/13. After that I'll spot check the data coming in and, if it looks good, wait for a couple of weeks to collect a decent sample.

The patch was fully deployed on Thursday at ~18:00 UTC. Since then there have been 16 validation errors, which appear to have been caused by a misbehaving client. As for the data being captured:

nn_localstorage_unsupported
400698169732476
select
  sum(1) `n`,
  sum(if(test['supportsLocalStorage'] = 0, 1, 0)) `n_localStorage_unsupported`
from
  event.mediawiki_client_session_tick
where
  year = 2022
  and month = 10
  and day >= 14
;

We'd lose ~0.18% of all session ticks if we changed the instrument to use localStorage.

@phuedx As per the code review comments, I suspect that those are browsers with localStorage supported (as per Startup module feature-test passing) but either with storage full or in a private browsing mode where localStorage pretends to have a quota of 0B, either way writes throw.

Some browsers no-op in private mode, some effectively do tab/sessionStorage instead, others (Mobile Safari) throw in the same way as if there no quota. I haven't checked what Mobile Safari does to cookies in that case. It might do something similar to cookies as well. Thus depending on what Safari does and how our code responds, it might end up behaving the same in those 0.18% cases on localStorage.

Note that in browsers where this test passed, it may fail at another time if the storage is full. Assuming we use the mw.storage abstraction layer, this failure will be hidden both for cases where storage is full and cases like Safari Private browsing where it effectively always full. It may be worth making sure that the data does something reasonable if setItem fails. E.g. fallback to counting the session for that tab only with a local variable, or stop counting if/when a write fails. That may be worth handling regardless, in which case the cases that throw today for private browsing can fall in the same bucket.

Change 944986 had a related patch set uploaded (by Krinkle; author: Krinkle):

[mediawiki/extensions/WikimediaEvents@master] sessionTick: Switch to localStorage to avoid breaking browser cache

https://gerrit.wikimedia.org/r/944986

@mpopov: I've reviewed https://gerrit.wikimedia.org/r/944986 and its parent patch and they're both very close to being ready to be merged. If it is merged before next Tuesday (2023/08/08), then it will be fully deployed by EOD Thursday (2023/08/10). Is there any reason, from Product Analytics' perspective, that this shouldn't be merged? Also, are there any data quality metrics for the instrument that we can monitor during and after the deployment?

@Krinkle: Thank you for creating the patch to fix the problem!

@phuedx: Thank you for reviewing the patch and pinging me! It looks like the only change is the optimization and nothing is changing about the frequency of the ticks, which sounds good to me.

We don't have any quality metrics for this, just this dashboard that we can monitor to see how the numbers change (if at all) after the deployment.

As far as I know nobody actually looks at that data so it should be okay to just deploy to all wikis (rather than in stages). All the stakeholders who requested this dataset have left the Foundation and the limitations of the dataset (no breakdown by mobile vs desktop or by logged-in vs logged-out) limits its usefulness. It's OK as a simple proof of concept but in practice has been rather useless. There was a hope that once the instrument is migrated to the Metrics Platform it will be easy to request standard dimensions to enable more granular, richer insights into session lengths and answer (currently unanswerable) questions like

  • How do session lengths differ between editors and readers?
  • Differences in skins? Do Vector 2022 users stay longer?
  • Differences in platform? How long do our mobile users visit?
  • Are there topics (as determined by collecting article ID & article's Wikidata item ID and obtaining the article's topic that way) that keep people on our sites longer?

At some point there was also a plan for feature engineers to be able to flag to the session tick instrument that the user was interacting with a feature (or a variant of a feature in an A/B test) so that we can see how engagement with different features (or variants of features) impacts the users' visits. I don't know that we'll ever get there, but enabling a variety of dimensions via the Metrics Platform would be a major upgrade to the usefulness of the instrument/data.

Pinging @mforns for additional visibility as another representative of the technical owners of the session length dataset (that is computed from session ticks). (I think Sam & Marcel are on the same team now?)

@mpopov During the deployment of the above patch, the then-active session windows will effectively be reset a bit earlier than normal. Normally a session is reset at the 30-minute mark, so there will be brief artificial increase in session count I expect.

Would that be alright, or would you recommend that we add some additional code to migrate existing sessions and then remove that code next week? The added code would increase risk slightly in that it's hard to know for sure it won't have unintended side-effects. The existing logic is well known and tested.

During the deployment of the above patch, the then-active session windows will effectively be reset a bit earlier than normal. Normally a session is reset at the 30-minute mark, so there will be brief artificial increase in session count I expect.

@Krinkle: Thank you for the detailed explanation of possible impact! That sounds good – feel free to proceed with the existing logic.

I would just like to know when the deployment happens so we can document it on Wikitech and in the Data Catalog and annotate the dashboard in Superset, especially if there is an obvious/visible anomaly in the data on that day.

Change 944986 merged by jenkins-bot:

[mediawiki/extensions/WikimediaEvents@master] sessionTick: Switch to localStorage to avoid breaking browser cache

https://gerrit.wikimedia.org/r/944986

@mpopov: @Krinkle's patch has been merged. It will be deployed as part of this week's Deployment Train. It should be fully deployed by 20:00 UTC on 2023/08/17.

When the patch was deployed:

  1. There weren't any new validation errors for events on the mediawiki.client.session_tick stream (see https://logstash.wikimedia.org/goto/f3dc2e5ecdbe8728675e3ae306cdb13b); and
  2. There was a slight dip in the event submission rate but it quickly recovered. The dip is visible here and the time of the dip matches the time of the deployment (see T343724#9100114 and/or https://wikitech.wikimedia.org/wiki/Server_Admin_Log#2023-08-17)

Agree with @mpopov that this dataset is limited so far.
We definitely could improve it by adding more dimensions to it.
I can't recall any impediment to (once additional data is collected) splitting by dimensions other than wiki or project family.
The only thing maybe is that this dataset gets sparse in the long tail quickly, so probably we would not be able to break it down by more than 1 dimension at a time.
That said, I think it would be a pity to switch it off, I still think it gives some value, and it is an example of how to collect data in a privacy-aware way.

Regarding the quality of the data after the deployment of the optimization,
I've looked at the session length dashboard (https://superset.wikimedia.org/superset/dashboard/232/)
and could not see any change in the timeline of the metric. It is pretty stable, so I think any issue would show up visibly.
The metric is daily, so we only have 4 data points after the optimization, but it looks good so far!