Copied over from enwp Village Pump (Technical)
In [[ https://en.wikipedia.org/wiki/Wikipedia:Village_pump_(miscellaneous)#Very_big_pages | Wikipedia:Village pump (miscellaneous)#Very_big_pages ]] someone raised the problem of long delays (up to 5..7 seconds) when visiting big wiki pages, especially the first time a page is retrieved after nobody has visited it for some time (days or weeks?).
In practice the topic being discussed was **how to reduce the size of big / long wiki pages**, because the original poster thought that most of the wait was due to the client browser //struggling// to retrieve and visually render such big pages (HTML between 1 MB and 1.6 MB); since he saw this on a PC with a reasonably fast CPU and enough RAM, he was worried about what could happen on smartphones and other mobile devices with far fewer hardware resources.
Someone else replied that he did not notice such long delays (at most 1..2 seconds), so I contributed my own thoughts to the discussion and found that the **wiki page retrieval mechanism** between web servers and browsers could perhaps be improved.
Here I do not want to discuss the problem of long delays on first retrieval of big pages, because that is a very technical matter internal to Wikipedia; I have already written about it in the wiki section mentioned above.
Regarding the impact of slow responses in (web) user interfaces, see also the wiki article about **[[https://en.wikipedia.org/wiki/Responsiveness | responsiveness]]**. I have noticed that **web caching of wiki pages** could be **improved noticeably** in order to //cut a lot of unnecessary web traffic// that probably also burdens the web applications and the DB (database).
**Caching assumptions**
If a user is **logged in**, it looks like **wiki pages are never cached** by the browser (which is right, at least while editing a page). If the user is **logged out** (not logged in), pages are temporarily **cached**, and this makes a noticeable difference: if a page has been retrieved recently (within 1 minute) and has not been modified, it is rendered in less than 0.2..0.3 seconds on a medium-speed (high clock, single CPU) PC.
Right now the cacheability of a wiki page depends on:
* type of content encoding format (compressed or not);
* value of cookies;
* authorization;
because of the "Vary" header, which is correct.
Cookies also contain a session identifier that is discarded when the client browser is closed, so caching of wiki pages will always be session-bound (it lasts only while the browser is kept open, and only for users //not logged in//).
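To make the "Vary" mechanism above concrete, here is a minimal sketch (hypothetical code, not anything MediaWiki actually runs) of how a Vary-aware cache derives its lookup key — every header named in "Vary" becomes part of the key, so any change to the cookie value makes the cached copy unusable:

```python
# Sketch: an HTTP cache key for a response carrying
# "Vary: Accept-Encoding, Cookie". The named request headers are part
# of the key, so a different cookie value means a cache miss.

def cache_key(url: str, request_headers: dict, vary: list) -> tuple:
    # One key component per header listed in Vary (missing header -> "").
    return (url,) + tuple(request_headers.get(h.lower(), "") for h in vary)

vary = ["accept-encoding", "cookie"]
k1 = cache_key("/wiki/Foo", {"accept-encoding": "gzip", "cookie": "s=1"}, vary)
k2 = cache_key("/wiki/Foo", {"accept-encoding": "gzip", "cookie": "s=2"}, vary)
assert k1 != k2  # a changed cookie value defeats the cached copy
```

This is why the cookie behavior described below matters so much for cache hit rates.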
**Current behavior / problems detected**
Current browser **caching of wiki pages** is **far from perfect** because of the following technical issues:
* Wikipedia **web servers** appear to be of different kinds: some send HTTP responses with a "Last-Modified" header (but no "ETag"), others with an "ETag" (but no "Last-Modified") for the same wiki pages. Depending on web traffic / load, etc., a browser can get one response from one web server and the next from another with a **different type of cache validator**; in that case the entire wiki page has to be resent to the browser even if it has not changed, making the whole operation 10-20 times longer;
* a few Wikipedia **web servers used for static images** (i.e. upload.wikimedia.org /common/thumb/*) send neither "Last-Modified" nor "ETag" headers, so static images served by those web servers are always fully reloaded whenever a page using them is displayed (as those images are rarely used in wiki pages this is not a big deal, but it is very odd and, I guess, really not necessary);
* in some cases the "Date" header is not updated by web servers; I am not sure about this one, it may depend on the behavior of some web cache;
* **caching** of wiki pages also depends on **cookies**, and the cookies sent by the browser include a counter value (tick) that changes at least every minute. So 1..60 seconds after receiving a wiki page, its cookie changes and the browser has to invalidate its cached copy; if the same page is requested again after the cookie has changed, the entire page has to be retrieved instead of just asking the web server whether it has changed. This may greatly increase web traffic and the load on the web servers, web applications, DB, etc. behind them.
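The first problem above (mismatched validators between server kinds) can be sketched as follows; this is illustrative pseudologic, not actual Wikipedia server code. A browser that cached a response carrying only an ETag cannot revalidate against a server that only understands Last-Modified, so the full body gets resent even though nothing changed:

```python
# Sketch of the validator mismatch: revalidation succeeds (304) only
# when the browser's stored validator and the server's offered validator
# are of the same kind and equal; otherwise the full page (200) is resent.

def revalidate(stored: dict, server_etag, server_lm) -> int:
    # Browser sends If-None-Match if it stored an ETag,
    # If-Modified-Since if it stored a date.
    if stored.get("etag") and server_etag:
        return 304 if stored["etag"] == server_etag else 200
    if stored.get("last_modified") and server_lm:
        return 304 if stored["last_modified"] == server_lm else 200
    return 200  # no validator in common: full response again

# Same unchanged page, but answered by two differently configured servers:
cached = {"etag": '"rev-12345"'}
assert revalidate(cached, '"rev-12345"', None) == 304      # cheap 304
assert revalidate(cached, None, "Mon, 01 Jan 2024") == 200  # full resend
```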
In web server responses, the **"Cache-Control"** and **"Expires" headers already look properly set**, so apart from the alternating / random usage of the "Last-Modified" and "ETag" headers, the HTTP cache settings are already in good shape.
**Goal: improve website response times for wiki pages**
The aim should be to make browsers ask the web server whether a wiki page has changed every time it is about to be displayed (so statistics about the number of pages viewed by users should not be affected by this improvement); asking is of course much faster than retrieving the whole page on every visit, especially for big wiki pages (there are wiki pages that do not change for hours, days or even months).
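A back-of-the-envelope sketch of the saving (the numbers are illustrative assumptions, not measurements): 100 repeat visits to an unchanged 1.5 MB page, comparing full reloads with conditional requests answered by "304 Not Modified":

```python
# Illustrative traffic comparison: full reload vs. "304 Not Modified"
# revalidation for repeat visits to an unchanged big wiki page.

PAGE_BYTES = 1_500_000   # assumed size of a big wiki page's HTML
RESP_304_BYTES = 300     # assumed size of a 304: status line + few headers
VISITS = 100

full = VISITS * PAGE_BYTES          # every visit downloads the whole page
revalidated = VISITS * RESP_304_BYTES  # every visit only revalidates
print(f"full reloads: {full:,} bytes; 304 responses: {revalidated:,} bytes")
```

Under these assumptions the conditional requests move three orders of magnitude fewer bytes, which is the whole point of the proposals below.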
**Proposals to reach goal of good cacheability of wiki pages**
These are a few proposals to improve the cacheability of wiki pages.
//Web / application server(s) side//
1. Each wiki page should have a **unique identifier** for each one of its revisions made by editors:
* this identifier surely already exists, but all Wikipedia web servers should send it as the "ETag" value for each wiki page;
* I suspect that a "Last-Modified" value alone would suffice, because it is hard to imagine a wiki page being changed more than once per second without conflicts (someone who knows how these things work should confirm this).
2. Web servers should send both "Last-Modified" and "ETag" headers for each wiki page, as recommended by the latest HTTP RFCs. If both are sent, modern browsers use only "ETag", but they can at least show the "Last-Modified" value somewhere in the page information window; besides, an old browser that does not support the "ETag" header would still work using "Last-Modified" (which would suffice in 99.99% of cases).
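Proposals 1 and 2 together can be sketched as a tiny hypothetical request handler (the revision-based ETag scheme and function names are my own illustration, not MediaWiki's): derive both validators from the page revision, and honor either conditional header, preferring "If-None-Match" as HTTP recommends when both are present.

```python
# Sketch: send both validators derived from the revision, and answer
# conditional requests with 304 when the page is unchanged.

from email.utils import formatdate

def build_headers(revision_id: int, revision_ts: float) -> dict:
    return {
        "ETag": f'"rev-{revision_id}"',                      # unique per revision
        "Last-Modified": formatdate(revision_ts, usegmt=True),  # HTTP date
    }

def handle(request_headers: dict, revision_id: int, revision_ts: float):
    resp = build_headers(revision_id, revision_ts)
    inm = request_headers.get("If-None-Match")
    ims = request_headers.get("If-Modified-Since")
    if inm is not None:                  # ETag wins when both are present
        return (304 if inm == resp["ETag"] else 200), resp
    if ims is not None:                  # fallback for older browsers
        return (304 if ims == resp["Last-Modified"] else 200), resp
    return 200, resp                     # first visit: full page

status, hdrs = handle({"If-None-Match": '"rev-7"'}, 7, 0.0)
assert status == 304 and "Last-Modified" in hdrs
```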
//Client browser side//
3. The value of **cookies should not change every minute** (at least for users //not logged in//). The problem is finding out whether that tick value is required to handle the user session; if it is, it should be investigated whether it is required for all kinds of users (including those not logged in), and how to preserve its value without storing it in cookies.
* a first idea would be to remove that value from the cookies and store it in a new custom HTTP header, sent by the client browser, dedicated to cookie values that change frequently and do not affect the cacheability of a wiki page, e.g.:
`xapp-cookie: *TickCount=3`
: in this case the above HTTP header should travel with the client browser's request all the way to a web server; its value should then be re-added to the "Cookie" values by custom code in the web server(s) (or even in some web app, if all request HTTP headers are passed to the app handling the response), and the resulting cookie values could be used as usual;
* a second idea would be **not to use that tick value in cookies for some kinds of users** (the majority of visitors). If, in the near future, users who are **not logged in** can no longer edit wiki pages, then a tick value in their session might no longer be a requirement and could be removed from the cookies sent by their browsers, avoiding the current cache problem of cookies changing every minute. For users who are **logged in** nothing would change, because caching of wiki pages is already disabled for them (slow but safe mode), so their cookie values could keep changing even once per minute;
* other idea(s) to be specified.
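The first idea above can be sketched as a pair of hypothetical helpers (the `xapp-cookie` header name and `TickCount` field come from the example in the text; everything else is my own illustration): the browser-side split moves the volatile tick out of "Cookie", and the server-side merge puts it back before the application sees it, so the cache key stays stable.

```python
# Sketch: split volatile cookie fields into a custom header on the way
# out, and merge them back into the Cookie value server-side.

VOLATILE = {"TickCount"}   # cookie fields that change often

def split_cookies(cookie: str) -> tuple:
    stable, volatile = [], []
    for pair in cookie.split("; "):
        name = pair.split("=", 1)[0].lstrip("*")
        (volatile if name in VOLATILE else stable).append(pair)
    return "; ".join(stable), "; ".join(volatile)

def merge_on_server(headers: dict) -> str:
    # Re-append the custom header's value to the Cookie the app will see.
    extra = headers.get("xapp-cookie", "")
    return "; ".join(filter(None, [headers.get("Cookie", ""), extra]))

cookie, xapp = split_cookies("session=abc; *TickCount=3")
assert cookie == "session=abc" and xapp == "*TickCount=3"
assert merge_on_server({"Cookie": cookie, "xapp-cookie": xapp}) == \
       "session=abc; *TickCount=3"
```

With this split, the "Cookie" request header (the one the "Vary" rules key on) no longer changes every minute, so cached pages stay valid for the whole session.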
**Conclusions**
Applying the 3 modifications suggested above could make caching of wiki pages feasible for an entire user session (until the browser is closed), thus greatly decreasing the size of HTTP responses for wiki pages and the load on the server side.
Something about **improving HTTP caching** could also be done for **scripts and stylesheets**: they are not always cached by browsers as they should be, and their retrieval is sometimes a bit slow. Especially when logged in, you can see that a wiki page is first visually rendered with the browser's default fonts and then, after 0.3..0.6 seconds, with the proper fonts, because the download of some stylesheet completes only after the page has started to be shown in the browser.