End-of-year pageview statistics for 2023
Closed, Resolved, Public

Description

As in past years, I'm requesting assistance in preparing our annual most-popular English Wikipedia articles of the year blog post. For context, you can see 2022's and 2021's posts.

We'd like to publish the list on 1 December, vs. mid-December in past years. To accomplish that, we'd like to pull the English Wikipedia's most-viewed articles on the following timeline: by 14 October, again on 14 November, and finally on 28ish November (subject to y'all's feedback). We'd additionally request that you include views from all redirects (to catch article moves etc.), plus pull the mobile view percentage and null referrer percentage to help us weed out false positives.

If needed, we can winnow this request to two pulls—one soon and one in mid-November. After the new year, we'll update the post with data from Topviews so we have an accurate archival record without troubling y'all again.
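
For illustration only, the pull requested above (views rolled up across redirects, plus mobile and null-referrer percentages) could be sketched roughly as follows. This is not the query used for this task; the row layout, the `redirects` mapping, and the field values (`"mobile web"`, `"none"`, and so on) are assumptions made for the example.

```python
from collections import defaultdict


def summarize_views(rows, redirects):
    """Roll up views per article and compute mobile / no-referrer shares.

    rows: iterable of (title, access_method, referer_class, views) tuples.
          This layout is an assumption for the sketch.
    redirects: dict mapping a redirect title to its target title
          (hypothetical input; how it is built is out of scope here).
    """
    totals = defaultdict(lambda: {"views": 0, "mobile": 0, "no_referrer": 0})
    for title, access_method, referer_class, views in rows:
        # Credit views on a redirect to the article it points at.
        target = redirects.get(title, title)
        bucket = totals[target]
        bucket["views"] += views
        if access_method.startswith("mobile"):
            bucket["mobile"] += views
        if referer_class == "none":
            bucket["no_referrer"] += views

    summary = []
    for title, bucket in totals.items():
        if bucket["views"] == 0:
            continue  # avoid dividing by zero for empty articles
        summary.append({
            "title": title,
            "views": bucket["views"],
            "mobile_pct": round(100 * bucket["mobile"] / bucket["views"], 1),
            "no_referrer_pct": round(
                100 * bucket["no_referrer"] / bucket["views"], 1
            ),
        })
    # Most-viewed articles first.
    return sorted(summary, key=lambda r: r["views"], reverse=True)
```

Resolving redirects before summing matters because an article that was moved during the year would otherwise have its views split across two titles.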


Last year's task (for reference): T323527: End-of-year pageview statistics for 2022
Code/queries from T295943: [REQUEST] End-of-year pageview statistics for 2021: https://github.com/wikimedia-research/comms-popular-pages/blob/main/T295943-2021-page-views.ipynb

Details

Due Date
Oct 27 2023, 4:00 AM

Event Timeline

mpopov edited subscribers, added: mpopov; removed: SNowick_WMF.

Research & Decision Science management will triage this task (https://app.asana.com/0/1205562963911611/1205718561430901/f) and assign a team to it on Monday, October 16th.

Triaged to Movement Insights. Based on a conversation with @mpopov, the assignee will consult @nettrom_WMF for any context on previous iterations of this data.

OSefu-WMF triaged this task as Medium priority. Oct 18 2023, 4:52 PM
OSefu-WMF set Due Date to Oct 27 2023, 4:00 AM.

Hi,

The data was pulled using the 2023-09 snapshot on October 24th 2023:

Data

No rows have been removed for the moment, but they can be removed on request, either for this iteration or in the next data pull on Nov 14.

@Hghani Can you produce an additional version with these articles removed to match last year's submission?

@OSefu-WMF I've added 2 additional sheets to the spreadsheet. One is a longer list pulled today, October 26 2023, to match the length of last year's submission; the other contains the rows omitted from that longer list. The original sheet also remains, with the original top 100 unfiltered entries pulled on October 24 2023.

Data

@Hghani you are fantastic. Thanks so much for this data, and the null referrer data so that we can spot known false positives like people trying to navigate to YouTube and accidentally searching on Wikipedia instead. I spotted a few other false positives that can be removed:

  • Index (statistics) is a new false positive, but it has nearly 100% no referrers
  • XXXTentacion has been a false positive for years and years and years now, and I only wish I knew why
  • Sex is also a long-time false positive (with its 96% mobile views)
  • I assume The Idol (TV series) is as well with ~82% no referrers
  • there's a couple more farther down, but we'll likely only take the top 25 anyway

Finally, I wanted to ask for your professional opinion on the articles with, say, 30%+ no referral data. For example, the first-place article has over 45% even while most of the articles on the sheet are at 20% or below. For other examples, the Indian Premier League has 95% mobile views and ~39% no referrers (all of the Premier League articles' stats are interesting, really) and Alia Bhatt has 95%/68%.

Should we be concerned with that or does that vary for other reasons?
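
As a rough sketch of the screening described above, the snippet below flags articles whose no-referrer or mobile share is unusually high so that a person can review them. The 30% no-referrer cutoff echoes the "30%+" figure in the question; the mobile cutoff, the function name, and the record layout are hypothetical. A flag is only a prompt for manual review, not proof of a false positive.

```python
def flag_for_review(articles, no_referrer_cutoff=30.0, mobile_cutoff=95.0):
    """Return (title, reasons) pairs for articles worth a manual look.

    articles: iterable of dicts with "title", "mobile_pct" and
              "no_referrer_pct" keys (assumed layout for this sketch).
    Cutoffs are percentages; the defaults are illustrative, not agreed values.
    """
    flagged = []
    for article in articles:
        reasons = []
        if article["no_referrer_pct"] >= no_referrer_cutoff:
            reasons.append("high no-referrer share")
        if article["mobile_pct"] >= mobile_cutoff:
            reasons.append("high mobile share")
        if reasons:
            flagged.append((article["title"], reasons))
    return flagged


# Hypothetical rows, not taken from the spreadsheet.
sample = [
    {"title": "Article A", "mobile_pct": 60.0, "no_referrer_pct": 15.0},
    {"title": "Article B", "mobile_pct": 96.0, "no_referrer_pct": 20.0},
    {"title": "Article C", "mobile_pct": 50.0, "no_referrer_pct": 82.0},
]
review_list = flag_for_review(sample)
```

Keeping the cutoffs as parameters makes it easy to rerun the screen with a stricter or looser bar on each data pull.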

@EdErhart-WMF thanks for the interesting question!
Here are my thoughts. A few factors might be contributing to this:

  • Mobile Social Media & Link Sharing: Sharing Wikipedia articles via mobile on social media platforms may not consistently pass on referral data.
  • Browser Behaviors & Autocomplete: If someone previously accessed the Wikipedia page for, let's say, the Indian Premier League or ChatGPT, and later initiates a related search, their browser might autocomplete or suggest the Wikipedia page based on their past browsing history. This would make sense in the case of Premier League and ChatGPT (where users probably navigated to the wiki page at some point to get more information).
  • Geographical Influence: Browsing habits differ across geographic regions, which might lead to variation in the no-referrer percentages depending on the geographic and cultural context. Specifically, it might explain the differences among celebrity pages.

Given the popularity of these topics and the potential influencing factors outlined, I am inclined to say that these numbers don't pose an issue per se since they are not approaching extreme values.

The ones you listed should be removed from the original data set.

@EdErhart-WMF

I've pulled the data again using the latest snapshot for November 15 2023. I prefiltered the ones you listed previously, and included an unfiltered list in the third sheet. If you'd like any changes let me know!

It can be accessed below:

Data November 15 2023

Hey folks, I've been thinking about other ways to show off Wikipedia's popularity beyond the most-popular list. Is it possible to obtain the data below? I don't want to ask for too much work from y'all, so if anything here is too onerous to accomplish, let me know.

  • The minute of highest traffic this year overall (or another small unit of time -- hour?)
  • The minute of highest traffic to a specific article (or another small unit of time)
  • The most-viewed articles of the year in specific countries (larger ones, I know that privacy concerns rise greatly with small countries)
  • Largest percentage increase in readership in a single day
  • Most-edited articles of the year
  • Total number of English Wikipedia pageviews this year (never mind, I figured this one out via stats.wikimedia.org!)

Hey @EdErhart-WMF - Let's explore this in a separate task from the original pageviews request. I added your questions to our R&DS request process so we can investigate the possibility of pulling this data with input from the larger team.

Update - Given the workload of the team we will defer these explorations for future versions of the blogpost. The request is tracked via the R&DS triage process.

@EdErhart-WMF

Hi,

I've pulled the data for November 28 2023 using the latest snapshot. The data is in the same format with a sheet showing the filtered items (same list as the last pull) and an unfiltered version of the list. Let me know if any changes are required.

Data is here.

Thanks @Hghani! So appreciative of your help with all of this.

@Hghani / @OSefu-WMF: Hi, the Due Date set for this open task passed a while ago.
Could you please either update or reset the Due Date (by clicking Edit Task), or set the status of this task to resolved in case this task is done? Thanks!