
Image requests sending neither "Last-Modified" nor "ETag" HTTP headers.
Open, Medium, Public

Description

How to reproduce
I cannot give exact steps, because that server seems to answer requests for some static images only occasionally; maybe this issue has already been fixed, because I no longer see it on the page mentioned below. How I found it is described in the steps further below.

Expected results
The web server should have replied with a response including the HTTP headers "Last-Modified" and "ETag", possibly together with a header such as "Cache-Control: private, max-age=10, must-revalidate" to force frequent revalidation, in case the image is one of those that change frequently while keeping the same file name.

At the moment an "ATS/x.x.x" web server answers requests for that image and its HTTP headers are fine.
Almost certainly there is a load balancer that routes HTTP requests to different web servers depending on traffic load, so I do not know whether the problem has actually been solved or is simply still there, ready to happen again.

Note: this is a very small missed optimization, but since it has been noticed and should be easy to fix, it is reported here.

Event Timeline

Aklapper changed the task status from Open to Stalled. Nov 11 2021, 2:49 PM

Hi @Shushugah, thanks for taking the time to report this! I'm not sure I can follow in all aspects, plus it's not clear to me which parts are assumptions and which parts are based on specific investigation (and which steps were performed for investigation). Could you please check https://www.mediawiki.org/wiki/How_to_report_a_bug and provide, in separate sections:

  • a clear and complete list of exact steps to reproduce the situation, step by step, so that nobody needs to guess or interpret how you performed each step,
  • what happens after performing these steps to reproduce (and how you measured and/or found out when it comes to performance),
  • what you expected to happen instead,
  • a full link to a web address where the issue can be seen (a specific test case so others can try to reproduce),
  • the web browser(s) and web browser version(s) that you tested (as some web browsers have more aggressive client caching than others).

Ideally, a good description should allow any other person to follow these steps (without having to interpret steps) and see the same results. Problems that others can reproduce can get investigated faster. Thanks again.

Agreed! I pasted the text from User:Ade56facc and encouraged them to make an account/continue here.

Pppery updated the task description. (Show Details)

I've copied over some changes made at VPT and converted the task description to use proper Phabricator markup.

NOTE: this ticket is only about optimizing the web caching policies of the Wikipedia website; it is not about a real bug.

A better title for this ticket would have been "Improving web caching of wiki articles"; anyway, while quickly investigating the matter (for the reasons explained in the first description), I found a few issues that I am now going to describe fully here.

If you prefer to open separate tickets for each one of them, then just let me know here (with a comment).

HW / SW used
OS: Windows 10 64 bit, 6 GB RAM, etc.
Browsers: Chrome 95 + Firefox 94.0.1 (the only applications running during the tests)

Aim and description of the investigation
I tried to find out why many wiki articles / pages (especially big / long pages), style sheets, a few images, etc. often appeared not to be cached in the browser cache, even when they had been viewed only a few seconds earlier (while I was not logged in).

To find out when and where this happened, I decided to look at the HTTP headers exchanged between one or more browsers and the Wikipedia web servers, in order to understand how, and possibly why, it happened.

To do that I used only the tools built into the browsers; after a while I used only Chrome's, because they turned out to be more reliable than those available in Firefox.

Setup of Chrome to inspect HTTP headers

  1. open Chrome and clear all its "browsing data" (to be done only once, before the tests):
    • maximize the Chrome window (top right, second icon from the right);
    • click the rightmost menu button (the 3 vertical dots just under the window's close button), with tooltip "Customize and control Google Chrome";
    • a menu opens; click the "Settings" item;
    • in the Settings tab, click "Privacy and security" on the left and then "Clear browsing data" (in the center of the window);
    • check at least the checkboxes "Browsing history", "Cookies and other site data" and "Cached images and files", then click the "Clear data" button;
    • close the Settings tab;
  2. Open the "Inspector" view and go to the Wikipedia website
    • click on the page opened by Chrome at startup, press the right mouse button and select the menu item "Inspect";
    • a frame opens on the right side of the window; enlarge it (HINT: drag the vertical line that separates the inspector view from the web page view so that the inspector uses at least 40%..50% of the screen width);
    • click on the "Network" view of the Inspector;
    • go to the Wikipedia website in the usual way;
    • all URLs requested by the browser from the web servers should now be shown;
    • click on one of them to see, in the sub-view on the right, the HTTP headers of the request and response messages.

Note that there is an oddity in what is shown: Chrome does not always display an HTTP status of 304 (Not Modified), but mostly 200 OK. To find out whether a request actually received a 304 response from the web server you must look at the HTTP request headers sent by the browser: if there is an "If-Modified-Since" and/or "If-None-Match" header and the response has no body data, then that response is a 304 and not a 200 OK with full data.
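The same conditional-request behavior can also be cross-checked outside the browser. Below is a minimal sketch using Python's requests library (the article URL is just an example and not part of the original report); a 304 on the second request means the server supports revalidation for that resource:

import requests

URL = "https://en.wikipedia.org/wiki/F._W._de_Klerk"  # example article

# First request: note the validators the server sends, if any.
r1 = requests.get(URL)
print("Last-Modified:", r1.headers.get("Last-Modified"))
print("ETag:", r1.headers.get("ETag"))

# Second, conditional request: a 304 status means the cached copy is still
# valid and no body is resent; a 200 means the full page was transferred again.
conditional = {}
if r1.headers.get("Last-Modified"):
    conditional["If-Modified-Since"] = r1.headers["Last-Modified"]
if r1.headers.get("ETag"):
    conditional["If-None-Match"] = r1.headers["ETag"]
r2 = requests.get(URL, headers=conditional)
print("Conditional request status:", r2.status_code)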

In the next comments I'll describe the small issues I found.

Issue

1. Server "Thumbor*", used for static images, sending neither "Last-Modified" nor "ETag" HTTP headers.

NOTE: this is a very small missed optimization, but since it has been noticed and should be easy to fix, it is reported here.

How to reproduce
I cannot give exact steps, because that server seems to answer requests for some static images only occasionally; maybe this issue has already been fixed, because I no longer see it on the page mentioned below. Here is how I found it:

  • I viewed the main page of Wikipedia while not logged in;
  • I opened the article "F. W. de Klerk";
  • I waited until the retrieval of all its objects (images, etc.) had completed;
  • after 3..4 seconds I reloaded the page by pressing F5 (this forces the browser to re-request the page and all its objects using HTTP conditional requests whenever applicable; you should see "If-Modified-Since" and "If-None-Match" in the request headers);
  • the web server replied with a couple of HTTP 304 (Not Modified) statuses, while almost all other objects were reloaded from the browser cache (time near 0 ms and "disk cache" or "memory cache" in the Size column);
  • only one image was reloaded completely (resent by the Thumbor* server): "https://upload.wikimedia.org/wikipedia/commons/thumb/6/64/Secretary_Clinton_Meets_With_Former_South_African_President_F.W._de_Klerk.jpg/220px-Secretary_Clinton_Meets_With_Former_South_African_President_F.W._de_Klerk.jpg";
  • I inspected the HTTP headers and saw that the server response contained no "Last-Modified" or "ETag" headers (required to be able to revalidate a cached web object); I retried the operation and the result was the same.

Expected results
The web server (whatever its name) should have replied with a response including the HTTP headers "Last-Modified" and "ETag", possibly together with a header such as "Cache-Control: private, max-age=10, must-revalidate" to force frequent revalidation, in case the image is one of those that change frequently while keeping the same file name.

At the moment an "ATS/x.x.x" web server answers requests for that image and its HTTP headers are fine.
Almost certainly there is a load balancer that routes HTTP requests to different web servers depending on traffic load, so I do not know whether the problem has actually been solved or is simply still there, ready to happen again.
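To make this easy to re-check later, here is a minimal sketch (using Python's requests library; the URL is the thumbnail from the steps above) that simply reports whether the response carries the two validators and which server produced it:

import requests

THUMB = ("https://upload.wikimedia.org/wikipedia/commons/thumb/6/64/"
         "Secretary_Clinton_Meets_With_Former_South_African_President_F.W._de_Klerk.jpg/"
         "220px-Secretary_Clinton_Meets_With_Former_South_African_President_F.W._de_Klerk.jpg")

r = requests.get(THUMB)
print("Server:", r.headers.get("Server"))                # e.g. Thumbor/... or ATS/...
print("Last-Modified:", r.headers.get("Last-Modified"))
print("ETag:", r.headers.get("ETag"))

# Without at least one of the two validators the browser cannot revalidate the
# image and has to download it again in full on every forced reload.
if not r.headers.get("Last-Modified") and not r.headers.get("ETag"):
    print("No validators in response: image is not revalidatable")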

Issue

2. Cookie value sent in HTTP requests changes every minute and this prevents caching of wiki articles for more than 1 .. 60 seconds.

NOTE: as already reported in the first description, this issue mainly concerns users who are not logged in, because for logged-in users it has historically been considered safer to avoid web caching entirely (perhaps because of heavy editing, etc.).

Explanations
In the HTTP response messages sent by the web servers for article pages there is a "Vary" header, which indicates that the page may be served from cache only as long as the values of the named request headers (encoding, cookie, authorization) do not change; these conditions apply in addition to the "Last-Modified" date-time check.

So, if a cookie value changes, the copies of all Wikipedia article pages kept in the browser cache are effectively invalidated.

This can happen 1..60 seconds after they have been viewed, depending on when the above-mentioned cookie value changes.
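As a rough illustration of why "Vary: Cookie" has this effect, here is a simplified model of what a browser cache does (not its actual implementation; the cookie name below is only a placeholder for the *TickCount cookie discussed here):

# The cache key for a stored response effectively includes the values of the
# request headers listed in "Vary"; if any of them changes, the stored copy
# no longer matches and the page has to be fetched again.
def cache_key(url, request_headers, vary):
    return (url,) + tuple(request_headers.get(name, "") for name in vary)

vary = ["accept-encoding", "cookie", "authorization"]
first = cache_key("https://en.wikipedia.org/wiki/Example",
                  {"accept-encoding": "gzip", "cookie": "someTickCount=3"}, vary)
later = cache_key("https://en.wikipedia.org/wiki/Example",
                  {"accept-encoding": "gzip", "cookie": "someTickCount=4"}, vary)
print(first == later)  # False: the cached entry can no longer be reused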

The effects of this behavior are:

  • browsers have to re-retrieve, far too frequently, wiki article pages that have just been viewed, even though those pages have not changed;
  • web traffic to the website, web applications, etc. is unnecessarily increased, which contributes to slowing down the website's responses under heavy traffic;
  • navigation is slowed down for users on slow Internet links, especially when visiting long / big articles.

How to reproduce

  • verify that, as a wiki user, you are logged out of the Wikipedia website;
  • browse a wiki article page;
  • look at the "Cookie" HTTP request header sent by the browser for the article page;
  • take note of the *TickCount=number-of-minutes value;
  • browse another wiki page and wait for its complete retrieval;
  • browse the first wiki article again and notice that if the "Cookie" value has not changed in the request headers, the article page should have been requested conditionally, because there should also be an "If-Modified-Since" header in the request;
  • wait 1 minute;
  • browse another wiki article and wait a few (3..4) seconds;
  • browse the first wiki article again and notice that the cookie value in the request message has changed and that there is now no "If-Modified-Since" header (which means the web server has certainly resent the whole page).

Actual results
In the last step above, the browser has invalidated its cached copy of the wiki page and has not sent an "If-Modified-Since" header in the request, so the web server has resent the whole wiki page (up to 200 kB..300 kB of compressed HTML, which can expand to 1 MB..1.5 MB of HTML, etc.).

Expected results
The cookie value should not change every minute, in order to let browsers cache wiki article pages for at least a few hours or until the user session changes (which certainly happens if the browser is closed and reopened).

Effects of longer client-side cache periods for wiki article pages
Assuming the cookie value did not change every minute, what could happen?

  1. modern browsers have two kinds of cache: a memory cache (for small objects) and a disk cache (for medium-sized / big objects); keeping wiki pages in the browser cache for more than 1 minute might fill the cache with objects (pages) that are rarely reused (depending on the user's navigation pattern, etc.);
  2. the above consideration should not be a problem at all for PCs and other devices (smartphones, etc.), as modern browsers use sophisticated algorithms and every possible trick, including multiple processes, threads, etc., to avoid almost all latencies in their tasks (including storing objects in, and purging objects from, the disk cache).

How the *TickCount=value cookie is currently updated

  • the cookie value starts from 1 when the user browses a Wikipedia page for the first time;
  • its value is incremented by one every minute as long as at least one Wikipedia page (tab / window) is kept open;
  • if all Wikipedia pages have been closed but the browser is kept open, the increment stops;
  • if the user reopens a Wikipedia page, the old frozen value is sent and then updated to the difference between the current time and the time of the initial request of a wiki page;
  • if the browser is closed, the value is lost, and the next time the user browses a Wikipedia page it restarts from 1.

In practice that value is the time (in minutes) elapsed since the user session started; its only distinctive feature is that when a Wikipedia page is browsed again some time after all Wikipedia tabs / windows have been closed, the last known value is resent (before being updated), making it possible to tell that the user has not had a Wikipedia page open for 1..n minutes.

Is this important enough to justify changing the cookie value every minute, considering that a Wikipedia page can be kept open in a browser tab for hours without being used or looked at by the user (because it is just one of the dozens of tabs open in the browser, etc.)?

Whether the answer is yes, no or maybe, see the ideas below.

Ideas about how to avoid changing the *TickCount cookie value every minute

  1. move that cookie (*TickCount=value) to a new custom HTTP header, e.g. "XApp-cookie: *TickCount=value", then have the web server or the web application read that value and add it back to the list of cookies so that it can be processed as usual; this would work for both logged-out and logged-in users;
  2. use a longer update interval, e.g. 5 minutes instead of 1 minute; this would not be ideal, but it could significantly lengthen the cacheable period of time;
  3. avoid sending and using the *TickCount cookie (elapsed session time) in HTTP requests for wiki articles; that value could perhaps be sent via a separate XMLHttpRequest message every 1..2 minutes (maybe while the wiki page is still loading); that request would use the same HTTP/2 TCP connection used for retrieving the wiki page and its related objects, so no further overhead from a new TCP connection should occur;
  4. compute the duration of the user session on the server side after each client request for a wiki page, giving up the ability to detect gaps in session time; i.e. since the initial start time of the user session is set by a web application, the session duration could also be computed by a web application, by subtracting the initial session time from the current time every time a request for a wiki article page is received (see the sketch after this list).
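As an illustration of idea 4, here is a minimal sketch (with hypothetical names, not MediaWiki's actual code) of how the elapsed session time could be derived on the server side from a stored session start time alone, with no per-minute cookie updates:

import time

def session_minutes(session_start_epoch):
    # Elapsed whole minutes since the session started; the start time would be
    # recorded once (e.g. in a cookie or a server-side session) when the session begins.
    return int((time.time() - session_start_epoch) // 60)

# Example: a session that started 185 seconds ago has been running for 3 whole minutes.
print(session_minutes(time.time() - 185))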

Those are two separate bugs then. One issue in each. Improving description as requested.

Snaevar renamed this task from Improving article caching to Image requests sending neither "Last-Modified" nor "ETag" HTTP headers. Nov 12 2021, 7:00 PM
Snaevar updated the task description. (Show Details)

OK, I have again seen responses from the Thumbor server without the headers named in the bug title.

I reloaded the web page a few times using the F5 key in Chrome (which keeps all data in the browser cache and redoes all HTTP requests), and every time a couple of thumbnail images were reloaded completely (no cached copies, no 304 responses, etc.).

You almost surely already have all the details about the issue (maybe the current Thumbor behavior is even intentional, to make some tests easier, who knows); anyway I report the HTTP/2 headers here, just to be sure I have not dreamed them, so that the bug report is now complete.

Response headers
accept-ranges: bytes
access-control-allow-origin: *
access-control-expose-headers: Age, Date, Content-Length, Content-Range, X-Content-Duration, X-Cache
age: 38515
content-disposition: inline;filename*=UTF-8''Palace_Theater_%285720196103%29.jpg.webp
content-length: 11414
content-type: image/webp
date: Tue, 16 Nov 2021 12:01:04 GMT
nel: { "report_to": "wm_nel", "max_age": 86400, "failure_fraction": 0.05, "success_fraction": 0.0}
permissions-policy: interest-cohort=()
report-to: { "group": "wm_nel", "max_age": 86400, "endpoints": [{ "url": "https://intake-logging.wikimedia.org/v1/events?stream=w3c.reportingapi.network_error&schema_uri=/w3c/reportingapi/network_error/1.0.0" }] }
server: Thumbor/6.3.2
server-timing: cache;desc="hit-front", host;desc="cp3051"
strict-transport-security: max-age=106384710; includeSubDomains; preload
timing-allow-origin: *
x-cache: cp3055 miss, cp3051 hit/36030
x-cache-status: hit-front
x-client-ip: 87.15.234.3
xkey: File:Palace_Theater_(5720196103).jpg

Request headers
:authority: upload.wikimedia.org
:method: GET
:path: /wikipedia/commons/thumb/f/fb/Palace_Theater_%285720196103%29.jpg/171px-Palace_Theater_%285720196103%29.jpg
:scheme: https
accept: image/avif,image/webp,image/apng,image/svg+xml,image/*,*/*;q=0.8
accept-encoding: gzip, deflate, br
accept-language: en-US,en;q=0.9
referer: https://en.wikipedia.org/
sec-ch-ua: " Not A;Brand";v="99", "Chromium";v="96", "Google Chrome";v="96"
sec-ch-ua-mobile: ?0
sec-ch-ua-platform: "Windows"
sec-fetch-dest: image
sec-fetch-mode: no-cors
sec-fetch-site: cross-site
user-agent: Mozilla/5.0 (Windows NT 10.0; WOW64)

The title of the ticket is wrong; it should be: Image responses without "Last-Modified" and "ETag" HTTP headers.

Learning only now about this bug thanks to @Krinkle. From the MWoffliner perspective this has a serious impact on the way it works. The reason: without the ETag, we have no way to compare it against our media cache, so each media file will be downloaded again from the Wikimedia backend (instead of from the MWoffliner S3 cache). For us the performance impact is probably not that big (we download everything from the Wikimedia backend instead of balancing things between the S3 cache and Wikimedia), but for the Wikimedia backend this is not the same story.
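For context, this is roughly the kind of check that an offline scraper such as MWoffliner relies on; the function below is a hypothetical sketch, not MWoffliner's actual code, and only illustrates why a missing ETag forces a re-download:

import requests

def media_changed(url, cached_etag):
    # Returns True when the cached copy cannot be trusted and must be re-downloaded.
    r = requests.head(url)
    backend_etag = r.headers.get("ETag")
    if backend_etag is None:
        # No validator from the backend (the problem reported here): the scraper
        # cannot tell whether its copy is still current, so it has to re-download.
        return True
    return backend_etag != cached_etag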

Krinkle changed the task status from Stalled to Open. May 26 2022, 8:43 PM
Krinkle triaged this task as Medium priority.
Krinkle moved this task from Limbo to Perf recommendation on the Performance-Team (Radar) board.
Krinkle updated the task description. (Show Details)

I'm not sure since when, but based on us having <14 days of ats-be storage, and based on there still being ETag headers on cached responses, I am guessing this is a pretty recent regression. […]

I think that is just the difference between the file being generated and served by Thumbor (which doesn't seem to send ETag or Last-Modified headers) and being served by Swift from its object storage (which sends the non-RFC-compliant ETag, and Last-Modified); note the Server header:

curl -I https://upload.wikimedia.org/wikipedia/commons/thumb/6/69/Flag_of_South_Carolina.svg/23px-Flag_of_South_Carolina.svg.png
HTTP/2 200 
etag: 77423ce1b2c2a709de0db56d618d942c
server: ATS/8.0.8
last-modified: Wed, 20 Apr 2022 23:01:29 GMT
age: 6743
x-cache: cp3065 hit, cp3061 hit/190
[snip]

Then, after purging the file page on Commons:

curl -I https://upload.wikimedia.org/wikipedia/commons/thumb/6/69/Flag_of_South_Carolina.svg/23px-Flag_of_South_Carolina.svg.png
HTTP/2 200 
server: Thumbor/6.3.2
age: 3
x-cache: cp3065 hit, cp3061 hit/1
[snip]

I'm moving this subthread here as it is about E-Tag presence and not about etag value formatting.

What appear to happen is that when a thumbnail is first generated on a Varnish/ATS/Swift miss, e.g. after upload, re-upload, or purge, the request goes to Thumbor which responds without an E-Tag header. However, this is not self-correcting on the next request, because the Thumbor response is CDN-cacheable. So even on repeats, it remains without E-Tag. It's not it expires e.g. a day or week later, and that it gets backfilled to the CDN from Swift that time that the E-Tag appears.