
[REQUEST] En Wiki pageviews by topic. Rough cut.
Closed, Resolved · Public · May 3 2019

Description

What's requested:
Pageviews per topic on English Wikipedia for March, using the ORES draft topic model

Why it's requested:
We currently have no idea which topic areas drive readership, beyond glancing at the top pages per day, which can be very misleading. A baseline understanding of this is crucial to understanding core use cases.

When it's requested:
By 5/3/19

Other helpful information:
Toby is a key stakeholder for this :)

Details

Due Date
May 3 2019, 7:00 AM

Event Timeline

In this analysis, I use the ORES draft topic model to get the topics of articles viewed on English Wikipedia in March 2019. The topics from this model come from the WikiProject Directory and its mid-level categories. The WikiProject Directory provides a convenient intermediary ontology of WikiProjects that starts with four broad topics: Culture; Geography; History & Society; and Science, Technology and Mathematics (STEM). From there, the directory drills down into mid-level categories and sub-topics and eventually specific WikiProjects. For example, WikiProject Birds sits underneath the path STEM/Science/Animals. In this analysis, we call the four top-level categories of the WikiProject Directory "broad topics", and the mid-level categories "topics".
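For reference, here is a minimal sketch of how one might fetch a drafttopic prediction from the ORES v3 scores API; the revision ID and helper name below are illustrative, not part of the analysis code:

```python
# Minimal sketch: fetch a drafttopic prediction from the ORES v3 API.
# The revision ID is illustrative; substitute a real enwiki revision ID.
import requests

ORES_URL = "https://ores.wikimedia.org/v3/scores/enwiki"

def drafttopic_prediction(rev_id):
    """Return (predicted topics, per-topic probabilities) for one revision."""
    resp = requests.get(ORES_URL, params={"models": "drafttopic", "revids": rev_id})
    resp.raise_for_status()
    score = resp.json()["enwiki"]["scores"][str(rev_id)]["drafttopic"]["score"]
    return score["prediction"], score["probability"]

topics, probs = drafttopic_prediction(886000000)  # hypothetical revision ID
print(topics)  # e.g. a list of mid-level category labels
```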

[Image: WikiProject Directory mid-level category abstraction (CC BY-SA 4.0, EpochFail, File:WikiProject Directory mid-level category abstraction.svg)]

In March 2019, 7,162,726 unique pages* (namespace 0) were viewed 7,495,953,809 times on English Wikipedia. The distribution has a very long tail -- most pages got very few views and only a very small number of pages got many views. To keep the computation tractable, I only include the 1 million pages that got the most views in this analysis, which account for 92.01% of total page views of English Wikipedia in March 2019.
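For illustration, the coverage of a top-N cut can be computed along these lines; this is a sketch assuming a hypothetical pandas DataFrame `views` with columns `page_id` and `pageviews`:

```python
# Sketch: share of total pageviews covered by the top N most-viewed pages.
# Assumes a hypothetical DataFrame `views` with columns page_id, pageviews.
import pandas as pd

def top_n_share(views: pd.DataFrame, n: int = 1_000_000) -> float:
    top = views.nlargest(n, "pageviews")
    return top["pageviews"].sum() / views["pageviews"].sum()

# top_n_share(views) evaluates to ~0.9201 for the March 2019 data above.
```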

First, I check the top 50 articles read in March 2019 on English Wikipedia (to save space I didn't include the list of articles here: please use the link on the left). From the list, we can see the topic assignment is not perfect -- for example, the Main Page was assigned the topic "Culture.Language and literature", and the article "Disappearance of Madeleine McCann" was assigned the topic "Geography.Europe". In general, my feeling is that these topic assignments are not completely wrong, but they could be better.

Next, I sum the number of page views by topic (a sketch of this aggregation appears after the table). The table below shows the page views by broad topic for the top 1M pages viewed in March 2019 on English Wikipedia, along with each topic's proportion of the total page views of the top 1M pages. The Main Page is excluded from this table, so the proportions do not sum to 100%. Please use this link to check the breakdown by topics (mid-level categories).

| Broad topic | Pageviews | Proportion |
|---|---|---|
| Culture | 3,546,740,874 | 0.514216 |
| Geography | 1,208,713,890 | 0.175243 |
| STEM | 837,841,745 | 0.121473 |
| History_And_Society | 810,796,131 | 0.117551 |
| Assistance | 9,196,955 | 0.001333 |
| Unknown | 4,200,563 | 0.000609 |
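The aggregation behind this table is essentially a group-by-and-sum; here is a sketch, assuming a hypothetical DataFrame `top_pages` with columns `broad_topic` and `pageviews` (Main Page already excluded):

```python
# Sketch: sum pageviews by broad topic and compute each topic's share.
by_topic = (
    top_pages.groupby("broad_topic")["pageviews"]
    .sum()
    .sort_values(ascending=False)
    .to_frame("pageviews")
)
by_topic["proportion"] = by_topic["pageviews"] / top_pages["pageviews"].sum()
print(by_topic)
```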

*Caveat: 7,162,726 is the number of unique page IDs, which means that a redirect with its own page ID is counted as a separate page. Further analysis is needed to combine redirects with their target pages.
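One way to do that combination, sketched with a hypothetical mapping `redirect_target` from redirect page IDs to target page IDs:

```python
# Sketch: fold redirect pageviews into their target pages.
from collections import Counter

def fold_redirects(views_by_page: dict, redirect_target: dict) -> Counter:
    folded = Counter()
    for page_id, views in views_by_page.items():
        # Redirects are replaced by their targets; other pages map to themselves.
        folded[redirect_target.get(page_id, page_id)] += views
    return folded
```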


@JKatzWMF and @Tnegrin, please let me know if you have any questions. And I'm happy to walk you through the results next week. :)

@chelsyx I don't know what wonderful deed I did to deserve the insights derived here. This is gold. Thank you. It actually upends some previous 3rd party data I had been using about the relative traffic on "geography" pages.

Like any good data, this creates more interesting questions for me. One of them: where are the biggest mismatches between content and traffic? In other words, Culture might have 51% of the traffic, but does it have 51% of the pages (unlikely)? On the other hand, we might have 4M pages on STEM (all those genes), even though they represent 12% of traffic. This is important because it would tell us whether the number of pages is a good indication of traffic. We often hear things like "this would affect 30% of all pages...", but in the end, that might be 1% of traffic. No action items here aside from thinking about how we might answer that.

I have no phab token skills, but this is awesome Chelsy! Thanks for the follow up.

You mentioned the long tail of articles -- this is fascinating -- excluding the main page (more on this later) the top read article is less than 0.1% of page views. This is not a power law distribution. Could you graph this long tail -- I think it's critical to understanding the relationship between content and readership.

Also, does the topic model work for other languages?

@JKatzWMF Perhaps we should spend some more time on the main page...

@kzimmerman This should absolutely be in the next metrics presentation.

Hi Runa, Angie -- some interesting work on the relationship between content and readership from Chelsy. I think this supports content creation for NR readership.

> You mentioned the long tail of articles -- this is fascinating -- excluding the main page (more on this later) the top read article is less than 0.1% of page views. This is not a power law distribution. Could you graph this long tail -- I think it's critical to understanding the relationship between content and readership.

@Tnegrin There was something I missed in T221891#5157097 (fixed with a caveat now): 7,162,726 is the number of unique page IDs, so a redirect with a different page ID is counted as another page. At the end of March, there were 5,835,781 unique articles on enwiki in total, but including redirects there were 14,519,329 pages (unique page IDs). Fortunately, this miscounting doesn't have a big impact on our analysis, because there are only 958 redirects among the top 1M pages, and they represent only 0.0609% of page views.

Regarding the distribution of page views, it is a power law -- with 1M pages in total, the top article's 0.1% of page views is actually a very large proportion. The following graph should show the distribution better. The first bar from the left represents the Main Page (7% of the top 1M pages' pageviews); every other bar is the aggregated page views of 10k articles: the second bar represents the top 10k most-read articles (23.9% of the top 1M pages' pageviews), the third bar represents articles ranked 10k-20k (8.4%), the fourth bar articles ranked 20k-30k (5.8%), and so on.

[Image: distribution of page views across the top 1M articles, binned into 10k-article buckets]
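For reference, the binning behind a graph like this can be reproduced along these lines; this sketch uses synthetic data in place of the real per-article counts and does not break the Main Page out as its own first bar:

```python
# Sketch: aggregate per-article pageviews into 10k-article buckets and plot
# each bucket's share of the total. Synthetic data stands in for real counts.
import numpy as np
import matplotlib.pyplot as plt

rng = np.random.default_rng(0)
views = np.sort(rng.pareto(1.5, size=1_000_000))[::-1] * 1_000  # descending

BUCKET = 10_000
shares = views.reshape(-1, BUCKET).sum(axis=1) / views.sum()

plt.bar(np.arange(len(shares)), shares)
plt.xlabel("Article rank bucket (10k articles each)")
plt.ylabel("Share of top-1M pageviews")
plt.show()
```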

> Also, does the topic model work for other languages?

I need to ask Aaron about that.

kzimmerman added subscribers: nettrom_WMF, Charlotte.

Circling back with comments from a separate email chain: @Charlotte noted the importance of triangulating this data with other data.

From @nettrom_WMF:

I was not particularly surprised to see Culture and Geography show up as #1 and #2. There's one methodological reason for that, and one data-informed reason.

Methodology: The ORES model is trained on WikiProjects, and geography-based WikiProjects will tag articles related to their area even when we wouldn't regard the article itself as a "geographic" topic (e.g. a country or place of interest). For example, if you expand the list of WikiProjects on the talk page of the Barack Obama article, you'll find that it's tagged by several geography-related WikiProjects: Chicago, Illinois, Hawaii, African diaspora, and Kenya. Chelsy posted the list of top 30 articles predicted to be in "Geography.Countries" and I suspect this is related to that.

Data: there's a related analysis in the 2015 research paper I did on misalignment between popularity and quality in Wikipedia, where we looked at the 4,135 most popular articles in a month that were not Featured Articles and what WikiProjects they were connected to. In other words, we also used WikiProjects as indicators of "general topic area". There, we found "Countries" to be strongly present in that dataset, much more than you'd expect if articles were drawn at random (Table 8 in the PDF). A lot of the other topics were pop culture.

And to underscore the intent of this work: it's an exploratory analysis that can be valuable for generating product hypotheses and identifying areas of future analysis or testing. If there are specific questions that emerge from this data that would inform product work or planning, please do submit a ticket so we can prioritize it among other needs (details on Product-Analytics).

Closing this specific task as done - thank you, @chelsyx !

Regarding the draft topic model in other languages, here's the reply from Aaron:

We don't have a model for other languages, but I think that's a simple matter of engineering. No substantial research projects would be necessary to apply it to other languages.
In the short term, I think you could experiment with applying the model to machine translated text. I can help show you how to do that with ORES.
In the long term, if there is demand for the draft topic model in other languages and a real product use-case, we can get that work prioritized.

diego added a subscriber: diego.

Hi all,
I think this use-case highlights the need for a canonical (standardized) cross-lingual topic model that we could all use as the reference for all the projects within the WMF.

The "ORES draft topic model" is the best approach that we have right now, and it's an excellent starting point, however, in had some problems (ex. WikiProjects are not always general/topic-wise), and also it can't be directly extrapolated to all languages.
Imho, the document-embeddings solution proposed by @Halfak needs to be implemented ASAP, because it would help deal with some of these problems, mainly because it would allow us to measure semantic distance among documents.
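As a toy illustration of what embeddings buy us, semantic distance between two documents reduces to a vector comparison; the vectors below are made up for the example:

```python
# Sketch: cosine similarity between two (toy) document embedding vectors.
import numpy as np

def cosine_similarity(a: np.ndarray, b: np.ndarray) -> float:
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

doc_a = np.array([0.1, 0.8, 0.3])  # toy embedding for document A
doc_b = np.array([0.2, 0.7, 0.4])  # toy embedding for document B
print(cosine_similarity(doc_a, doc_b))  # close to 1.0 => semantically similar
```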

However, in parallel we need to work on what I'm calling the "canonical cross-lingual topic model", which I think should be based mainly on Wikidata information, which is by definition language-independent. I have worked extensively with topic models in the past and have several proposals for how to do this, and I'll be happy to discuss them. For sure, we won't find the perfect solution, but we should start working on this and stop reinventing the (topic) wheel every time we need to deal with topics across languages.

+1 to everything Diego said. After quality (e.g. ORES wp10) and demand (e.g. pageviews), topic is the most important hidden dimension to productivity/coverage questions.

Regarding topic modeling, @diego, I think we'll need to be scrappy this year. Right now the Machine-Learning-Team has no contractor budget and not enough engineering resources to devote to this. But I think we can still make a lot of progress if we frame this as a research project with a light engineering component.

With that said, I'm hoping to devote some time to this in Q2 (Oct, Nov, Dec). Specifically, I want to work out a sustainable pipeline for generating embeddings and possibly also for mapping embeddings across languages.