
Creation of canonical pageview dumps for users to download
Open, High, Public

Description

Creation of canonical pageview dumps for users to download

[The following added by @fdans]

Deliverable
The idea with this project is to replace Pagecounts-EZ with a dump that:

  • Spans the same time range (2011 to the present).
    • But we should probably include @CristianCantoro's data from 2008 to 2011 (T188041). This alone would add huge value to the dataset and more than justify this project.
  • Contains hourly pageview data for all Wikimedia sites:
    • Do we separate app traffic? Right now EZ includes it in mobile.
    • Do we keep reporting only user traffic? Or do we add bots?
  • Uses correct, standard wiki identifiers (e.g. de.wikisource) as opposed to WebstatsCollector wiki codes (de.s, es.z, etc.); a mapping sketch follows this list.
  • Doesn't have its data skewed by one hour (see https://wikitech.wikimedia.org/wiki/Analytics/Data_Lake/Traffic/Pagecounts-ez#One_hour_skewing_issue).
  • Is generated as part of the Pageview Hourly coordinator in Hadoop, as opposed to a script running as a cron job.
  • Probably has no DIY compression. I'm sure bzip2's compression makes shortening highly repeated values like en.wikipedia unnecessary, but we can test.
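
To make the wiki-code point above concrete, here is a rough Python sketch of the kind of translation involved. The suffix table is partial and purely illustrative (the real job should read the canonical project mapping rather than a hand-written dict):

# A sketch of the code-to-identifier mapping, assuming the usual Pagecounts-EZ
# suffix conventions. Partial and illustrative only.
SUFFIX_TO_PROJECT = {
    "z": "wikipedia",   # assumption: ".z" marks plain Wikipedia traffic in the EZ codes
    "b": "wikibooks",
    "d": "wiktionary",
    "n": "wikinews",
    "q": "wikiquote",
    "s": "wikisource",
    "v": "wikiversity",
    "voy": "wikivoyage",
}

def canonical_wiki(code):
    """Map a WebstatsCollector-style code like 'de.s' to 'de.wikisource'."""
    lang, _, suffix = code.partition(".")
    project = SUFFIX_TO_PROJECT.get(suffix)
    return "%s.%s" % (lang, project) if project else code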

Format
The only fundamental issue with the current row format is the wiki codes. Secondarily, there's the fact that we nowadays distinguish traffic between the desktop site, the mobile site, and the apps. So the format could look like:

en.wikipedia mobile Michelle_Obama 2629 A113B112C101D129E118F92G68H88I54J58K39L87M73N80O184P143Q140R138S147T133U137V128W142X125

I think we can totally do away with the hourly data encoding because it's confusing, it doesn't save space, and storing numbers separated by spaces is probably better for compression than one big block of alphanumerical values (UPDATE: not true). So most likely the row would look like this (UPDATE: nope, it will look like the one above):

en.wikipedia mobile Michelle_Obama 2629 113 112 101 129 118 92 68 88 54 58 39 87 73 80 184 143 140 138 147 133 137 128 142 125

Update: I did a quick test compressing each of those two lines in its own file, and the compressed version with full names and no number encoding is 35 bytes lighter.
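
For reference, a minimal Python sketch (not the actual refinery job) of how a consumer could decode a row in the format above, assuming the A-X hourly encoding carried over from Pagecounts-EZ (letters A-X stand for hours 0-23, each followed by that hour's count; hours with zero views are omitted):

import re

HOUR_RE = re.compile(r"([A-X])(\d+)")

def parse_row(line):
    # Fields: project, access site, page title, daily total, hour-encoded counts.
    project, access, title, total, encoded = line.split(" ")
    hours = [0] * 24
    for letter, count in HOUR_RE.findall(encoded):
        hours[ord(letter) - ord("A")] = int(count)
    assert sum(hours) == int(total)  # the hourly counts add up to the daily total
    return project, access, title, int(total), hours

parse_row("en.wikipedia mobile Michelle_Obama 2629 "
          "A113B112C101D129E118F92G68H88I54J58K39L87M73N80"
          "O184P143Q140R138S147T133U137V128W142X125")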

Backfilling
There are three parts to backfill on this dataset:

  1. As mentioned above, the 2008-2011 data obtained from pagecounts-raw by @CristianCantoro.
  2. The part that was generated via WebstatsCollector from 2011 to 2016, which doesn't include mobile pageviews.
  3. The present part, generated from the Pageview Hourly legacy dumps.

Part 1 will probably have to go through the same corrections as part 2 (correcting wiki names, un-skewing the data). Part 3 will be backfilled from Pageview Hourly in order to fix the local chapters-mobile Wikipedia conflict. This problem doesn't affect parts 1 and 2 because no mobile pageview data was available back then.

Access/docs

The dumps site should be redone in a way that classifies our available datasets according to their distinct features, as opposed to the current form which is more like "here's a list of all our mildly different pageview datasets!".

Event Timeline

Some more things to consider after team discussion:

  • For this new dump to replace the Pageviews dump we would have to provide not only the access method, but also the agent type dimension (user, bot or automated), by which we would probably partition the files (e.g. pageviews-2020-05-01-user.bz2).
  • This would also mean replacing a dump that is released hourly with one that still has hourly granularity but is only released daily. @Milimetric and I were talking about keeping hourly releases, but only on a sliding window basis (e.g. only making the last 24-48 hours available as hourly releases).
Milimetric moved this task from Incoming to Smart Tools for Better Data on the Analytics board.

A note on DIY compression vs adding all values: Erik's format very clearly saves space. I wasn't taking into account the long tail of articles with a couple of pageviews per day, which with my proposed format would contain 24 space-separated zeros (47 characters in total), as opposed to only 5-10 characters with Erik's method. So I think the ideal solution is to use explicit wiki names and access site values, but keep the old hourly format.
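
A tiny worked example of that size difference, sketched in Python:

# The hourly field for a long-tail article with a single view in hour 1 (letter B),
# written both ways; the explicit form is the 47-character case mentioned above.
encoded = "B1"
explicit = " ".join("1" if h == 1 else "0" for h in range(24))
print(len(encoded), len(explicit))  # 2 vs 47 characters before compression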

Change 595152 had a related patch set uploaded (by Fdans; owner: Fdans):
[analytics/refinery@master] [wip]Add pageview daily dump oozie job to replace Pagecounts-EZ

https://gerrit.wikimedia.org/r/595152

Thanks @fdans for leading this work -- page view data and how to handle all the different potential sources is definitely one of the most-FAQ of frequently asked questions :)

  • To me, the number one most important thing here is to include page ID. I'm happy to work with you to make that happen. This would make it much much much simpler to handle page moves / redirects when doing analyses, which are a major headache when working with current events (e.g., the Coronavirus pandemic article has had over 10 names in the few months it's been around) or large swaths of time. See: https://mako.cc/academic/hill_shaw-consider_the_redirect.pdf. This wouldn't change the number of rows in the dataset but make aggregation to the article level (as opposed to just which particular title was viewed) much much much easier.
  • +1 to one line per day and partitioning files by agent_type. Many researchers aren't interested in the hourly granularity, they just want a sense of reader demand for an article over a time period and that means being able to quickly/easily process and aggregate counts for a period of a few weeks or months while filtering out bots.
  • I'd love to consider what it means to include wikidata IDs in this dump as well. This would greatly simplify multilingual analyses (how many pageviews to a given concept across all languages), which people largely avoid in part because it's difficult to work with the Wikidata dumps and we don't make Wikidata IDs a core feature in really any of our Wikipedia datasets yet. I understand that this is challenging from a historical perspective, but I want to raise it as I think it's something we should be trying to include in any data dump going forward.

@Isaac thank you for the insight!

After a bunch of discussion among the team, we see how the addition of the page ID would add a lot of value to the dump without treading too much into feature creep territory. This would be an easy change to the current patch since this is a field that is already available in the pageview_hourly dataset.

There is one caveat: since only pageview_hourly has the field, but not the old data, by the time the dump is released there will be page ID data for the 2016-present period, but not for earlier. Page IDs for dates before that should be easy to obtain by joining the pagecount data with mediawiki_history, but we've agreed that that part of the work falls outside the scope of this task, so for the time being the page ID column for years 2008 to 2016 will be nulled out, with the possibility of re-backfilling those years with page IDs at a future date. I hope that's an acceptable compromise.

My 2 cents:

  • Redirects are very useful and something to be taken into account when working on Wikipedia - the paper linked by Isaac is the prime reference in that regard - however, handling them is quite tricky. I have some experience in that, having built snapshots of the graph of wikilinks for several Wikipedias over several years ([shameless plug] see the paper, the code is on GitHub). I have also worked with redirects and the pageviews data (code on GitHub). In short, it would be useful to have redirects resolved, but it is a project on its own, IMHO.
  • Page IDs are very useful, but beware that all kinds of quirky things can happen to them over time with page moves, deletions and re-creations, etc. My fear is that selecting a page by id would not be exactly equivalent to selecting it by title, and since these data come from web requests, using the page title would be the "right" way to do it. Note that this is kind of a subproblem of the redirect problem in some sense. I would like to put together some concrete examples of what I mean, but I really do not have the time to do that at the moment.

My fear is that selecting a page by id would not be exactly equivalent to selecting it by title, and since these data come from web requests, using the page title would be the "right" way to do it.

Or to put it another way, associating each page title with a page id is not as straightforward as it may seem if you need to do it over time. You can do that with respect to a given moment in time, i.e. you can say that now the page titled "Anarchism" has pageid 12 (see API query), or you can do that with respect to a given dump.
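
For illustration, a quick Python sketch of that kind of API query (not necessarily the exact one linked above); it only resolves the title-to-ID association as of right now:

import requests

# Resolve the *current* page ID for a title on English Wikipedia. The
# title-to-ID association can differ at other points in time or in older dumps.
resp = requests.get(
    "https://en.wikipedia.org/w/api.php",
    params={"action": "query", "titles": "Anarchism", "format": "json"},
).json()
page = next(iter(resp["query"]["pages"].values()))
print(page["pageid"], page["title"])  # currently 12 Anarchism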

we see how the addition of the page ID would add a lot of value to the dump without treading too much into feature creep territory

Thanks! I completely understand that including Wikidata IDs expands the scope of this task significantly. Frankly, the more long-term solution would be releasing a regular public dump of the item_page_link table so that people can easily append Wikidata IDs to any dataset.

There is one caveat: since only pageview_hourly has the field, but not the old data, by the time the dump is released there will be page ID data for the 2016-present period, but not for earlier. Page IDs for dates before that should be easy to obtain by joining the pagecount data with mediawiki_history, but we've agreed that that part of the work falls outside the scope of this task, so for the time being the page ID column for years 2008 to 2016 will be nulled out, with the possibility of re-backfilling those years with page IDs at a future date. I hope that's an acceptable compromise.

Yeah, I'll take that. Doing the full backfill would be an excellent onboarding project for the next analytics-y person to be hired :)

Or to put it another way, associating each page title with a page id is not as straightforward as it may seem if you need to do it over time.

@CristianCantoro yeah, this will matter a lot for the pre-2016 data. As @fdans said, for the 2016-present data, it's trivial because the page ID has always been present in the data, and it is the page ID of the article that was viewed (post-redirect), not the page ID of the title that was requested. For the vast majority of articles, this will enable simple analyses over time because the page ID remains constant through page moves and instances where the article is deleted and then restored. I don't have the statistics, but page merges and times when the article is deleted and then recreated under a new page ID (instead of restored) should be a very small fraction of the data, and frankly there is no simple way to handle that. So for researchers: if you want to know the pageviews associated with a specific title, you aggregate by page_title; if you want the pageviews associated with a specific article (grouping pageviews from all redirects), you aggregate by page_id. Both will be simple to do with the dataset.
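
A small pandas sketch of those two aggregation modes, with hypothetical rows and made-up counts (the redirect title carries the page ID of the article it resolves to, as described above):

import pandas as pd

# Hypothetical rows: the redirect title and the canonical title share the
# page_id of the article that was actually served. Titles, IDs and counts
# are made up for illustration.
df = pd.DataFrame({
    "page_title": ["Barack_Obama", "Barack_Hussein_Obama", "Barack_Obama"],
    "page_id":    [534366, 534366, 534366],
    "views":      [500, 10, 300],
})

by_title   = df.groupby("page_title")["views"].sum()  # pageviews per individual title
by_article = df.groupby("page_id")["views"].sum()     # pageviews per article, redirects folded in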

For the more long-term issue of determining what page ID was associated with a particular page title pre-2016, it's a lot harder as you note and probably will require constructing the history of where redirects point.

@fdans a clarification question: will the page IDs also be included in the pageviews API?

@Isaac you can open a task asking for it! But that's definitely not within the scope of this task.

Change 595152 merged by Fdans:
[analytics/refinery@master] Add pageview daily dump oozie job to replace Pagecounts-EZ

https://gerrit.wikimedia.org/r/595152

Makes sense @fdans -- I think page IDs + API is being tracked at T159046, so I'll hold off on creating a new task.

Change 629409 had a related patch set uploaded (by Fdans; owner: Fdans):
[operations/puppet@production] dumps::web::fetches::stat_dumps: add rsync job for pageview complete

https://gerrit.wikimedia.org/r/629409

Change 629409 merged by Elukey:
[operations/puppet@production] dumps::web::fetches::stat_dumps: add rsync job for pageview complete

https://gerrit.wikimedia.org/r/629409

Change 634650 had a related patch set uploaded (by Fdans; owner: Fdans):
[operations/puppet@production] dumps::web::html Add landing page/readme for pageview-complete dumps

https://gerrit.wikimedia.org/r/634650

Change 634650 merged by Elukey:
[operations/puppet@production] dumps::web::html Add landing page/readme for pageview-complete dumps

https://gerrit.wikimedia.org/r/634650

Change 640146 had a related patch set uploaded (by Fdans; owner: Fdans):
[analytics/refinery@master] Add historical_raw job to load data from pagecounts_raw

https://gerrit.wikimedia.org/r/640146

Change 640146 merged by Fdans:
[analytics/refinery@master] Add historical_raw job to load data from pagecounts_raw

https://gerrit.wikimedia.org/r/640146

Aklapper added a subscriber: fdans.

(Resetting inactive assignee account)

I did not know this was left behind in the migration to Data-Engineering. I recently talked to Olja and mentioned that I want to use this task to show Hannah around the data engineering space. So I'm going to add this and its relatives back to our boards.

Side note: this project was 99% finished. When finished, we can let go of *years* of technical debt and unnecessary complexity. So the value proposition is very good.

Adding T290060 as a subtask of this one as should have been done originally.