
Vet data in intermediate aggregate {wren} [8 pts]
Closed, Resolved · Public

Description

Documentation for the data: https://wikitech.wikimedia.org/wiki/Analytics/Data/Pageview_hourly

We should make sure that there's nothing fishy or funny going on with the data as we aggregate it into the hourly pageviews table. We can vet it approximately by sampling the raw data, and we can spot-check it for a few hours by running select count(*) from wmf.webrequest and comparing against the aggregate (a sketch of such a check follows the description).

  • Outcome of the task: Report including top articles on various wikis, project counts, with comparison with existing legacy data.
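A minimal sketch of the hourly spot-check in HiveQL. The partition columns (year/month/day/hour), the is_pageview flag on wmf.webrequest, and the view_count column on wmf.pageview_hourly are assumed to match the documented schemas; the two totals should agree if the aggregation dropped nothing:

```sql
-- Compare raw pageview-tagged requests with the aggregated hourly count
-- for a single hour (2015-06-24T00:00). Schema/partition names assumed.
SELECT source, views FROM (
  SELECT 'raw' AS source, COUNT(*) AS views
  FROM wmf.webrequest
  WHERE year = 2015 AND month = 6 AND day = 24 AND hour = 0
    AND is_pageview
  UNION ALL
  SELECT 'aggregated' AS source, SUM(view_count) AS views
  FROM wmf.pageview_hourly
  WHERE year = 2015 AND month = 6 AND day = 24 AND hour = 0
) t;
```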

Event Timeline

Milimetric raised the priority of this task from to Needs Triage.
Milimetric updated the task description.
Milimetric added a project: Analytics-Kanban.
Milimetric added a subscriber: Milimetric.
Restricted Application added a subscriber: Aklapper. · Jun 11 2015, 7:47 PM
kevinator renamed this task from Vet data in intermediate aggregate to Vet data in intermediate aggregate {wren}. · Jun 12 2015, 5:42 AM
kevinator set Security to None.
kevinator renamed this task from Vet data in intermediate aggregate {wren} to Vet data in intermediate aggregate {wren} [? pts]. · Jun 12 2015, 4:08 PM


kevinator renamed this task from Vet data in intermediate aggregate {wren} [? pts] to Vet data in intermediate aggregate {wren} [8 pts]. · Jun 15 2015, 5:19 PM
kevinator updated the task description.
kevinator triaged this task as Normal priority. · Jun 15 2015, 5:29 PM

Analysis done on one hour of data: 2015-06-24T00:00:00, using the newly generated projectview files and the legacy projectcounts files.
It should be noted that the new projectview files don't contain the .mw suffix (mobile on every project for a specific language), so we don't consider the .mw lines in the old files for this analysis.
Here are some findings (REMEMBER: only on one hour of data):

  • Total count for new projectview is 90% of the old one. This was expected given that no edit/login actions are counted. The main differences between old and new are fewer desktop views, and sometimes more mobile and zero view counts due to better classification in the new pageview definition (a per-project comparison sketch follows this list).
  • Big wikis:
    • Wikipedia en, es, ja, pt have small differences (-5% desktop and +5% mobile roughly)
    • Wikipedia de, ru, fr, zh, it lose a lot of views in both desktop and mobile (-30% desktop and -10% mobile roughly)
  • New pageviews cover 92 new projects (among which 90 are zero variants) and drop 252 projects (small ones, representing 0.01% of total views)
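
The per-project comparison could be run along these lines. This is a sketch assuming the projectview aggregate is queryable as wmf.projectview_hourly and that the legacy projectcounts file for the hour has been loaded into a hypothetical scratch table tmp.legacy_projectcounts(project, views) with the .mw lines already filtered out:

```sql
-- Compare new projectview totals with legacy projectcounts, per project.
-- tmp.legacy_projectcounts is an assumed scratch table, not a production one.
SELECT COALESCE(n.project, o.project)       AS project,
       o.views                              AS old_views,
       n.views                              AS new_views,
       ROUND(100.0 * n.views / o.views, 1)  AS pct_of_old
FROM (SELECT project, SUM(view_count) AS views
      FROM wmf.projectview_hourly
      WHERE year = 2015 AND month = 6 AND day = 24 AND hour = 0
      GROUP BY project) n
FULL OUTER JOIN tmp.legacy_projectcounts o
  ON n.project = o.project
ORDER BY old_views DESC;
```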

To explore deeper:

  • Why are there different change patterns between (de, ru, fr, zh) and (en, es, ja, pt)?

Datasets compared: pageview_hourly and pagecounts-all-sites.
Same hour of data as in the previous comment: 2015-06-24T00:00:00 (for pagecounts-all-sites, use the 2015-06-24T01:00:00 file because of the end-hour naming convention in pagecounts-all-sites).

Expected findings:

  • pagecounts-all-sites doesn't decode URL-encoded page names while the pageview_hourly table does. Therefore pagecounts will have multiple lines for the same page because of different encodings. We expect a smaller number of distinct pages in the new format (see the count sketch below).
  • pagecounts-all-sites doesn't extract page_title from the URL query string while pageview_hourly does. We expect more special pages in the new format.
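
The first expectation can be checked with a distinct-title count on each side. A sketch, assuming a hypothetical scratch table tmp.legacy_pagecounts(project, page_title, views) holding the legacy per-page file for the hour:

```sql
-- Count distinct page_titles in the old and new datasets for one hour.
-- The new count is expected to be lower (the old data double-counts titles
-- that appear under several URL encodings).
SELECT source, distinct_titles FROM (
  SELECT 'old' AS source, COUNT(DISTINCT page_title) AS distinct_titles
  FROM tmp.legacy_pagecounts
  UNION ALL
  SELECT 'new' AS source, COUNT(DISTINCT page_title) AS distinct_titles
  FROM wmf.pageview_hourly
  WHERE year = 2015 AND month = 6 AND day = 24 AND hour = 0
) t;
```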

Findings:

  • Total view count in the new format is 90% of the old one, the same ratio as at project level.
  • About projects at page_title level, we have the same findings as in the previous project-level analysis (many more zeros in the new format, no .mw in the new format, about 250 projects missing from the new format, but only small ones, so a very small impact on pageview numbers). While expected, it is good to have this confirmation.
  • About the number of distinct pages, we find what we expected: the new format's distinct page_title count is about 67% of the old one. This is due to duplicated counts in the old format because of the lack of URL decoding, and also to the fact that no actions (edits in particular) are counted in the new format.
  • When URL-decoding page_titles in the old dataset to better analyse page_titles present in both the new and old datasets, we find 1% decoding errors (special characters).
  • When joining the old and new datasets on project and page_title (URL-decoded; see the join sketch after this list), we find that most projects except mobile and zero have almost all page_titles from the new set in the joined set. That means the new set is a subset of the old one in terms of page_titles, which is expected. On mobile, since page_title is extracted from the query string, there are more page_titles in the new dataset than in the joined one. Also, some non-zero/mobile projects do not follow this pattern (it, tr, cs, sr for instance) and will need dedicated investigation.
  • The review of the top 100 page_titles (by new view_count) for big wikis (en, ja, de, es, fr, ru, it) on the new-old joined dataset shows that big differences in view_count occur only on special pages (the multi-encoding issue also creates artificially large numbers, but those can be disregarded).
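
A sketch of the join, using Hive's reflect() UDF to URL-decode the legacy titles (the ~1% of titles that fail decoding would need to be filtered or handled separately); tmp.legacy_pagecounts is the same assumed scratch table as above:

```sql
-- Join old and new datasets on project + URL-decoded page_title and list
-- the top titles by new view_count, for manual review of big differences.
SELECT n.project,
       n.page_title,
       o.views AS old_views,
       n.views AS new_views
FROM (SELECT project, page_title, SUM(view_count) AS views
      FROM wmf.pageview_hourly
      WHERE year = 2015 AND month = 6 AND day = 24 AND hour = 0
      GROUP BY project, page_title) n
JOIN (SELECT project,
             reflect('java.net.URLDecoder', 'decode', page_title, 'UTF-8') AS page_title,
             views
      FROM tmp.legacy_pagecounts) o
  ON n.project = o.project AND n.page_title = o.page_title
ORDER BY new_views DESC
LIMIT 100;
```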


The findings of this analysis seem reasonably in line with expectations, so I consider the data safe to be used / published (with k-anonymisation for small-size buckets).
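
For the publication step, a minimal sketch of the suppression idea behind the k-anonymisation, assuming a simple per-bucket threshold (k = 100 here is purely illustrative, not the production value):

```sql
-- Keep only buckets that aggregate at least k views before publishing,
-- so small buckets that could identify individual readers are dropped.
SELECT project, page_title, SUM(view_count) AS views
FROM wmf.pageview_hourly
WHERE year = 2015 AND month = 6 AND day = 24 AND hour = 0
GROUP BY project, page_title
HAVING SUM(view_count) >= 100;
```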

@Ironholds, I'm not as familiar as you are with all of this, but the results seem to make sense to me. Take a look if you're interested.

@JAllemandou, looks good to me, you can move to Done when you're ready. I think overall this improves the quality of the data a lot. Extracting the page title, removing false duplicates, all really cool and useful.

Also the mobile data seems cleaner this way. Really great.

kevinator closed this task as Resolved. · Jul 22 2015, 5:46 PM
kevinator updated the task description.
kevinator added a subscriber: kevinator.