
Do readers use categories, or just editors?
Closed, Resolved (Public)

Description

The idea behind categories on Wikipedia articles is that readers and editors will be able to use category pages to find encyclopedia articles that interest them. Every time we talk about a category-related project, I wonder: Do normal, non-editing readers actually use those to find Wikipedia articles?

My current feeling is that they (mostly) don't. Categories are not visible on mobile, but I've never seen any complaints about them being inaccessible there; if they were widely used, someone would have complained about the loss of functionality.

It should be possible to use page views for the content categories (excluding hidden/maintenance categories), and to compare expected vs. actual page views for logged-in and logged-out users, to come up with an approximate answer to my question.

Event Timeline

Here is a quick, partial answer for enwiki:

There is some truth to the hypothesis that category pages are more popular among logged-in users than article pages, in that the percentage of logged-in views is several times higher for the former. On the other hand, the vast majority of views to category pages still comes from anons.

Concretely, category pages have about 3% logged-in views compared to 0.8% for mainspace pages, but that is still far lower than, say, user talk pages (28% logged-in views). I generated the percentages for all namespaces on enwiki below while I was at it. See https://en.wikipedia.org/wiki/Wikipedia:Namespace for the numerical namespace IDs (category pages have ID 14).

Another thing to keep in mind is that category pages still receive vastly fewer pageviews than articles (less than 1/100th, in the data below).
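As a rough check of the two comparisons above ("several times higher" and "less than 1/100th"), here is a small Python sketch. It assumes the rows for namespaces 0 and 14 in the table below read as 0.83% of 1,684,955,354 views and 2.9% of 10,562,814 views; the variable names are only illustrative.

```python
# Figures taken from the namespace 0 (articles) and namespace 14 (categories) rows.
# Assumption: the rows split as percentage / view count in the way shown here.
ARTICLE_LOGGED_IN_PCT = 0.83
CATEGORY_LOGGED_IN_PCT = 2.9
ARTICLE_VIEWS = 1_684_955_354
CATEGORY_VIEWS = 10_562_814

# How many times higher the logged-in share is on category pages than on articles.
share_ratio = CATEGORY_LOGGED_IN_PCT / ARTICLE_LOGGED_IN_PCT

# Category pageviews as a fraction of article pageviews.
view_ratio = CATEGORY_VIEWS / ARTICLE_VIEWS

# Share of category pageviews that come from logged-out (anonymous) users.
anon_category_pct = 100 - CATEGORY_LOGGED_IN_PCT

print(f"logged-in share ratio (category / article): {share_ratio:.1f}x")
print(f"category views as a fraction of article views: {view_ratio:.4f}")
print(f"anonymous share of category views: {anon_category_pct:.1f}%")
```

Under that reading, the logged-in share on category pages is between three and four times the article share, while well over 95% of category views are still anonymous.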

There are various directions one could explore from here:

  1. For large categories that are paginated, it's possible to determine how often the subsequent pages are accessed, as a check on how much users are interested in the full content of such a large category, as opposed to just randomly clicking the category links at the bottom of articles. (E.g. the "next page" link on https://en.wikipedia.org/wiki/Category:2018_singles currently leads to https://en.wikipedia.org/w/index.php?title=Category:2018_singles&pagefrom=Disillusioned#mw-pages ).
  2. It is possible in principle to exclude maintenance categories, as you suggested in the task, but that would require much more work than the quick query here.
  3. Another way to answer the task's question of whether users use categories to find Wikipedia articles is to look at the number of article pageviews with a category page as referrer.
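Directions 1 and 3 both come down to classifying URLs: is a requested URL a continuation page of a category, and is a referrer a category page? A minimal Python sketch of that classification follows. The function names are hypothetical, and treating `pagefrom`/`pageuntil` as the paging parameters is an assumption based on the example link in item 1; it is not how the webrequest data itself was queried.

```python
from urllib.parse import parse_qs, unquote, urlsplit

CATEGORY_PREFIX = "Category:"
# Assumption: continuation pages of a category carry one of these query parameters.
PAGING_PARAMS = ("pagefrom", "pageuntil")


def category_title(url):
    """Return the category title if the URL points at a category page, else None."""
    parts = urlsplit(url)
    if parts.path.startswith("/wiki/"):
        # Pretty URL form: /wiki/Category:2018_singles
        title = unquote(parts.path[len("/wiki/"):])
    else:
        # index.php form: /w/index.php?title=Category:2018_singles&...
        title = parse_qs(parts.query).get("title", [""])[0]
    return title if title.startswith(CATEGORY_PREFIX) else None


def is_continuation_page(url):
    """True if the URL is a category page reached via a next/previous page link."""
    query = parse_qs(urlsplit(url).query)
    return category_title(url) is not None and any(p in query for p in PAGING_PARAMS)


first_page = "https://en.wikipedia.org/wiki/Category:2018_singles"
next_page = (
    "https://en.wikipedia.org/w/index.php"
    "?title=Category:2018_singles&pagefrom=Disillusioned#mw-pages"
)

print(category_title(first_page))
print(is_continuation_page(first_page))
print(is_continuation_page(next_page))
```

For direction 3, the same `category_title` check could be applied to the referrer of each article pageview; for direction 1, `is_continuation_page` separates first-page views from deeper ones.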

As mentioned earlier, I unfortunately don't have a lot of extra bandwidth right now and thus won't be able to tackle many of these anytime soon, although I might be able to run a query for 3. later this month, after completing a similar request for portals.

namespace_id  loggedin_percentage  all_views
NULL          0.01                 7857663
-1            6.23                 23937764
0             0.83                 1684955354
1             8.36                 2694181
2             17.17                1973589
3             28.49                882943
4             12.94                3500701
5             18.06                251068
6             1.78                 9854318
7             4.55                 14563
8             13.15                16071
9             20.12                3818
10            9.28                 1154212
11            1.31                 711601
12            4.47                 75682
13            23.02                4987
14            2.9                  10562814
15            3.41                 61503
100           2.74                 1137614
101           6.89                 6820
108           1.67                 32139
109           6.32                 1455
118           61.81                12365
119           69.92                3132
710           4.85                 1526
711           43.75                16
828           19.98                24956
829           40.29                968
2300          40.0                 5
2301          0.0                  2

(Data for November 24-30, 2018, known bots excluded. NULL values come from pageviews that didn't record the namespace, which IIRC encompasses the mobile apps. Otherwise the above data doesn't distinguish between mobile and desktop.)

Data via

SELECT namespace_id, -- cf. https://en.wikipedia.org/wiki/Wikipedia:Namespace
  ROUND(100 * SUM(IF(x_analytics_map['loggedIn'] IS NOT NULL, 1, 0)) / SUM(1), 2) AS loggedin_percentage,
  SUM(1) AS all_views
FROM wmf.webrequest
WHERE year = 2018 AND month = 11 AND day >= 24
  AND is_pageview
  AND pageview_info['project'] = 'en.wikipedia'
  AND agent_type = 'user'
GROUP BY namespace_id
ORDER BY namespace_id LIMIT 10000;

Cool! I'll close this for now; might reopen it in case I get to look at 3. above later.

Noting for whoever may come across this: I suspect the above findings differ to some degree from the present day. Since late 2020, the pageviews pipeline more reliably filters out bot traffic (ref). These bots crawl the wikis clicking on every link, so it seems highly probable that the all_views figure above was partly automated traffic.

I tried running it myself on stat1007 with the following, and other variations of this query, and it always errored out with FAILED: Execution Error, return code -101 from org.apache.hadoop.hive.ql.exec.mr.MapRedTask. Java heap space. I tried giving it more RAM, etc., but no luck yet.

SELECT ROUND( 100 * SUM(IF(x_analytics_map['loggedIn'] IS NOT NULL,1,0)) / SUM(1), 2) AS loggedin_percentage,
  SUM(1) AS all_views
FROM wmf.webrequest
WHERE year = 2023 AND month = 6 AND day >= 15
  AND is_pageview
  AND pageview_info['project'] = 'en.wikipedia'
  AND agent_type = 'user'
  AND namespace_id = 14
LIMIT 10000;