[Task] figure out the ratio of page views by logged-in vs. logged-out users
Closed, Resolved (Public)

Description

We need to figure out the ratio of logged-in vs. logged-out page views on Wikidata in order to decide whether we can bypass the web cache for certain language-dependent features.

Event Timeline

Addshore moved this task from ToDo to Doing on the WMDE-Analytics-Engineering board.
Addshore set Security to None.
SELECT
    cache_status,
    count(*) as count
FROM
    wmf.webrequest
WHERE
    year = 2016
    AND month = 01
    AND day = 31
    AND uri_host = "www.wikidata.org"
    AND http_method = "GET"
GROUP BY cache_status
ORDER BY count
LIMIT 999999;

31st January 2016

cache_status    count
hit     1001337
pass    1175072
-       1627482
miss    32827343
Time taken: 310.862 seconds, Fetched: 4 row(s)
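
To make the proportions in the table above easier to read, here is a minimal Python sketch that turns the four counts into shares of the total. The counts are copied from the query output; the `counts` and `shares` names are only illustrative and are not part of any Wikimedia tooling.

```python
# Counts copied from the query output above
# (www.wikidata.org, GET requests, one day).
counts = {"hit": 1001337, "pass": 1175072, "-": 1627482, "miss": 32827343}


def shares(counts):
    """Return each cache_status value's fraction of all requests."""
    total = sum(counts.values())
    return {status: n / total for status, n in counts.items()}


# Print the statuses from smallest to largest share.
for status, share in sorted(shares(counts).items(), key=lambda kv: kv[1]):
    print(f"{status:5s} {share:7.2%}")
```

Roughly nine out of ten requests are recorded as a miss, which already hints that this field alone will not separate the two groups of users.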

vcl_hit = if the requested document was found in the cache
vcl_miss = if the requested document was not found in the cache
vcl_pass = the request is passed on to the backend (nocache)

Not sure what the "-" cache_status means at this stage.

@Addshore: So, vcl_pass would cover anything that bypasses caches? That includes (most) API requests, (most) special pages (including Special:Search), as well as logged in users viewing regular pages.

...can you limit the query to the item namespace? That should provide us with the numbers we need.

So some more numbers for the same period:

NS 0 & USER ONLY

cache_status    count
pass    6445
hit     14140
miss    234885
Time taken: 528.378 seconds, Fetched: 3 row(s)

NS 0 ALL USERS & SPIDERS

cache_status    count
pass    7405
hit     304905
miss    1996823
Time taken: 556.433 seconds, Fetched: 3 row(s)

We decided in IRC that this still doesn't help us determine logged-in vs. logged-out users.
This is because the first time a logged-in user views an item page it is a MISS, and after that it is a PASS.
Thus it's all just a bit too mixed together.

Trying something else now and we should have some rough numbers tomorrow and better numbers at the end of the week.

Change 267869 had a related patch set uploaded (by Addshore):
Add loggedIn X-Analytics Header

https://gerrit.wikimedia.org/r/267869

So with a small hack we have some data since midnight.
It can be found at https://docs.google.com/spreadsheets/d/1hXwFiNFhaMqIF4p3myiVlsQpbiyDvyX_GGHLgPc_I9o/edit?usp=sharing

Also see the patch above for a possible better solution here!

FYI - "cache_status" is not an accurate reflection of anything. I'm not sure why we really even log it for analytics. The problem is that it only reflects some varnish state about the first of up to 3 layers of caching, and even then it does so poorly.

@BBlack - so you think cache_status is not even close to accurate? Do we have other accurate measurements of it so we could compare to what extent it's misleading? I'm happy to remove it from the data if it's really bad.

FYI - "cache_status" is not an accurate reflection of anything. I'm not sure why we really even log it for analytics. The problem is that it only reflects some varnish state about the first of up to 3 layers of caching, and even then it does so poorly.

Yep, we established that.

So for logged-in users, the first load of a page seems to be a cache miss, and after that, loads of the same page will be passes.
Anons will get either a miss or a hit.
Which means passes are only logged-in users, hits are only anons, but misses are a mixture of the two.
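
Taking that reasoning at face value, the cache_status counts can only bound the logged-in share: passes give a lower bound, and passes plus misses give an upper bound. The following Python sketch applies this to the two NS 0 tables posted earlier in the thread. The function name `logged_in_bounds` is hypothetical, and the whole calculation assumes cache_status is trustworthy, which is disputed elsewhere in this task.

```python
def logged_in_bounds(counts):
    """Bound the logged-in share of requests from cache_status counts.

    Assumes (per the reasoning above) that every pass is a logged-in
    view, every hit is an anonymous view, and misses are a mixture.
    """
    total = sum(counts.values())
    lower = counts["pass"] / total                     # all misses anonymous
    upper = (counts["pass"] + counts["miss"]) / total  # all misses logged in
    return lower, upper


# Counts copied from the "NS 0" tables earlier in the thread.
ns0_users = {"pass": 6445, "hit": 14140, "miss": 234885}
ns0_all = {"pass": 7405, "hit": 304905, "miss": 1996823}

for label, counts in (("users only", ns0_users), ("users and spiders", ns0_all)):
    lower, upper = logged_in_bounds(counts)
    print(f"{label}: between {lower:.2%} and {upper:.2%} logged in")
```

The gap between the two bounds is so wide that the counts cannot answer the question, which is why a separate logged-in marker was needed.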

So the numbers for logged in users in the spreadsheet linked are generated from:
https://www.wikidata.org/w/index.php?title=MediaWiki:Group-user.js&oldid=298792111
Which is then counted in the web request logs (yes we know this is terrible)

The difference between the logged-in page views and the hits, misses and passes can be seen in the spreadsheet.

What's the status of this now? Do we have the data we want?

Running the query to pull the data out for the past 7 days now!

@Addshore nice, thanks!

So, the conclusion is: we have less than 1% logged in page views. Wow, that's a lot lower than I expected!

Closing this as resolved.

Yep, this includes all types of user agent!

If we think we are done here I'll remove the super secret tracker!

Change 267869 merged by jenkins-bot:
Add loggedIn X-Analytics Header

https://gerrit.wikimedia.org/r/267869

@BBlack - so you think cache_status is not even close to accurate? Do we have other accurate measurements of it so we could compare to what extent it's misleading? I'm happy to remove it from the data if it's really bad.

On this topic, it is pretty misleading, and we do have other stats we look at more-manually to compare. We don't have a good singular, simple replacement for cache_status to include in analytics yet, though. What we do have (that we've looked at manually in some cases lately) is the X-Cache response header. That header has evolved a bit in how it's generated over the past couple of months so that it's less-misleading than it was before, but it still requires various regex operations to bin responses according to what exactly one is trying to measure. But for an example, this pseudo-code would be an accurate way to put all requests into 3 distinct non-overlapping bins based on X-Cache regex:

if (X-Cache ~ / hit/) {
    print "This is a real cache object hit";
}
else if (X-Cache ~ / int/) {
    print "This response was generated internally by varnish (e.g. 301 redirect for HTTPS, desktop->mobile redirect on UA detect, some kinds of error response, etc)";
}
else {
    print "This is a cache miss or a cache pass (pass would be due to uncacheable content, which is more-often true for loggedin users than others, but exists in both cases in notable numbers)";
}
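
For anyone who wants to try the binning above on real header values, here is a small runnable Python sketch of the same three-way split. The function name `bin_x_cache`, the bin labels, and the sample header strings are all hypothetical; the actual X-Cache format may differ and, as noted, has been changing.

```python
import re

# Same patterns as the pseudo-code: a space followed by "hit" or "int".
HIT = re.compile(r" hit")
INT = re.compile(r" int")


def bin_x_cache(x_cache):
    """Place one X-Cache header value into one of three disjoint bins."""
    if HIT.search(x_cache):
        return "hit"            # a real cache object hit
    if INT.search(x_cache):
        return "int"            # response generated internally by Varnish
    return "miss_or_pass"       # everything else: cache miss or cache pass


# Made-up header values, for illustration only.
samples = [
    "cp1066 miss, cp3040 hit/7",
    "cp3040 int",
    "cp1066 pass, cp3040 miss",
]
for value in samples:
    print(f"{value!r} -> {bin_x_cache(value)}")
```

Because the checks run in order, a value matching both patterns is counted as a hit, which keeps the bins non-overlapping as in the pseudo-code.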

However, I think X-Cache's raw data is still open to further modification. Ideally we'll build on top of this and start emitting some standard, simple header that can be one of N simple strings and reflects overall cache status bins (and hopefully with better detail as to miss-vs-pass and the nature of the pass to some degree).