
Feed Wikistats traffic reports with aggregated hive data {lama} [21 pts]
Closed, ResolvedPublic

Description

This task was simplified per a conversation with @Tnegrin. We're going to publish a new dataset, in WSC format, to feed into the Monthly Pageview Reports. The Traffic Breakdown Reports will be done as part of the wikistats 2.0 project.

Archiving the rest of the description because it's useful:

First step in replacing/updating Wikistats traffic reports is feeding these with aggregated hive data directly. This will take away confusion over multiple contradictory counts and take away the need for duplicate maintenance in perl and hive.

There are two independent process flows. For each I made a proposal diagram (png). These diagrams are pretty dense, trying to reconcile overview and detail, for use by developers, but I can make a light version if required.

Monthly Pageviews Reports ver 0.1 Oct 1, 2015

Monthly Pageviews Reports ver 0.6 Nov 19, 2015

Traffic Breakdown Reports ver 0.2 Oct 1, 2015

Event Timeline

ezachte added a comment. Edited Oct 7 2015, 9:14 PM

I figured we can produce all breakdowns by geography (middle column of TBD diagram) with two datasets, one for views, one for edits. 8 fields only in each:

date, hour (or 15 min), bot Y/N, main/mobile/zero, project, language, country code, [view|edit] count

That would be another quick win, for reports like:
http://stats.wikimedia.org/wikimedia/squids/SquidReportPageViewsPerCountryOverview2014Q3.htm and
http://stats.wikimedia.org/wikimedia/squids/SquidReportPageViewsPerCountryBreakdown.htm and
http://stats.wikimedia.org/wikimedia/squids/SquidReportPageViewsPerLanguageBreakdown.htm

For all other traffic breakdown reports, like by browser type, OS version, etc., I totally support doing away with the old code and just borrowing some ideas (like the overview for Google-related traffic).

It's up to @Tnegrin if he wants to get geographic breakdowns, @ezachte. I'm leaning towards handling that along with the other breakdowns separately.

The quick survey shows most support for continuation of the geographic reports (reports 21-24), more than other breakdowns: https://www.mediawiki.org/wiki/Analytics/Wikistats/TrafficReports/Future_per_report_B2

There is also the issue of time to market.

After discussion with @Tnegrin and @JKatzWMF, I wanted to briefly chime in just to make sure that we will be using consistent definitions (consistent with the new-definition pageview data we are already publishing, e.g. in the quarterly report scorecard, the Vital Signs dashboard and the weekly reading metrics report). I'm like 90% sure that's the plan already, but to spell out the assumption concretely for the monthly "all Wikimedia projects" numbers:

PageViewsPerMonthAllTotalled.csv and https://reportcard.wmflabs.org/graphs/pageviews will (apart from 30-day normalization) contain the same numbers as generated by this query:

hive (default)> SELECT year, month, SUM(view_count) AS total_views
              > FROM wmf.projectview_hourly
              > WHERE year = 2015 AND agent_type = 'user'
              > GROUP BY year, month
              > ORDER BY year, month
              > LIMIT 1000;

year  month  total_views
2015  4      18427001405
2015  5      18526977173
2015  6      16322476300
2015  7      15540018348
2015  8      15765852881
2015  9      15899101083
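
As an aside, the "30-day normalization" mentioned above presumably just scales each month's total to a common 30-day length so months of different lengths are comparable; a minimal sketch of that reading (function name hypothetical; the actual reportcard normalization may differ):

```python
import calendar

def normalize_to_30_days(total_views, year, month):
    """Scale a monthly total to a common 30-day month length.
    Hypothetical helper; the reportcard's normalization may differ."""
    days = calendar.monthrange(year, month)[1]  # actual days in that month
    return total_views * 30 / days

# July 2015 (31 days) from the query above scales down slightly:
july_normalized = normalize_to_30_days(15540018348, 2015, 7)
```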

PS: I understand that it's not planned to backfill previous months with the new def data, but in the case that there is a desire to include numbers for before April 2015, the (future) outcome of T108925 should be taken into consideration.

ezachte added a comment. Edited Oct 8 2015, 10:15 AM

@Tbayer absolutely, being consistent is important.

The only inherent complication I see is if wmf.projectview_hourly doesn't cater for unrecoverable data mishaps. In the beginning Wikistats used to simply add up the numbers, and I assume wmf.projectview does just that now. But over the years some really bad things happened and some mishaps were unrecoverable.

So I felt compelled to add some complexity: if we have a bad-data hour (or days) and we can't fix that at the source (e.g. the data weren't collected) Wikistats will come up with a best estimate rather than a number known to be way off, by extrapolating from the good hours in that month. This goes against simplicity (another important value), but reliability wins out.
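
A minimal sketch of the extrapolation described here (names hypothetical; the real perl logic in Wikistats is more involved):

```python
def estimate_monthly_total(hourly_counts, bad_hours):
    """Estimate a monthly total when some hours are known-bad:
    sum the good hours and scale up to the full period, rather than
    reporting a number known to be way off. (Sketch only.)"""
    good = {h: c for h, c in hourly_counts.items() if h not in bad_hours}
    if not good:
        raise ValueError("no good hours to extrapolate from")
    return sum(good.values()) * len(hourly_counts) / len(good)

# Example: 24 hours of ~100 views each, with hour 5 lost entirely.
counts = {h: 100 for h in range(24)}
counts[5] = 0
print(estimate_monthly_total(counts, bad_hours={5}))  # → 2400.0
```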

Sometimes (rarely these days) hourly files are missing or almost empty. Wikistats detects that. Some data are wrong in a major way, but Wikistats has to be told via a black-list. The issue could be an error in our analytics infrastructure, but could as well be a major mishap at ops or dev. For instance years ago for two weeks during the fundraiser an erroneous GET request for the banner page mimicked a valid pageview and doubled our numbers, and it took a while before we found out. Often I was able to repair major mishaps (sometimes baffling ops that it could be done). But in this case it was too hard or we were too late (I don't remember). So I invalidated two full weeks of data and came up with a much saner number, the best we could deliver. Such a big event could be explained in annotations, but with 15 years of history our charts would be cluttered by those.

ezachte added a comment. Edited Oct 8 2015, 10:17 AM

So how to proceed?

I know 'perfect is the enemy of good', and I wouldn't have spent the effort if our monthly totals had differed from reality by a mere one percent every other month or so. But one full day of outage means 3% lower counts in a month.

So if the current hive table doesn't correct for mishaps I suggest we upgrade the code (or use the Wikistats data as authoritative; those scripts have been time-weathered). [*]

Now assuming others also feel it is worthy to cater for unrecoverable mishaps, it still may take some time to upgrade the current hive script (assuming this is even doable in hive). For the time being I could dumb-down Wikistats if consistency is paramount.

My inclination though would be to live with the inconsistency until the cause is remedied. And let the numbers speak for themselves. If the difference between wmf.projectview_hourly and Wikistats turns out to be negligible, that would confirm our processes are reliable. If the difference is larger that should be a wake-up call (as said, the issue could originate anywhere within WMF dev/ops engineering).

  [*] I'm talking about making the best of the data we have. I'm ignoring the huge systemic errors we had for an eternity (no mobile traffic, lots of bots included). Those have been fixed with Hadoop.

Erik, I'm in favor of keeping the processing that accounts for bad data. It's one of the reasons I didn't want to replace the breakdown reports, because it would mean having to translate all that work into Hadoop world, and I know it would take a while. We also have new ways of finding missing data (pretty accurate, based on sequence number analysis) that we can use instead of automatic detection, so it'll get a bit more complicated before it gets simple again.

Consistency, in this case, will be at the level of definitions and data, not processing, in my opinion. Tillman, let us know if you disagree. I'll also ask the rest of the team. I think there's also room for compromise here, where we can track both the raw stats and normalized stats.

Dan, using sequence numbers to detect anomalies makes total sense to me. In fact I also used that to repair multi-month 20%-30% UDP message loss, by measuring per server per hour how much the average gap between sequence numbers went above the expected average gap (which of course is 1000 for the 1:1000 sampled log). That will work for capture errors. It's not a cure-all: it won't help for the case I mentioned where massive amounts of bogus 'page views' came our way for two weeks. Neither is my half-automated blacklisting of bad hours a cure-all.
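
The gap-based estimate described above can be sketched as follows (function name hypothetical): if sampled entries arrive on average avg_gap apart instead of the expected 1000, roughly 1 - 1000/avg_gap of the messages were lost.

```python
def estimated_loss_fraction(sequence_numbers, sampling_rate=1000):
    """Estimate UDP message loss from gaps between logged sequence
    numbers. For a 1:1000 sampled log the expected average gap is 1000;
    a larger average gap means messages never arrived. (Sketch only;
    the real repair was done per server per hour.)"""
    seqs = sorted(sequence_numbers)
    gaps = [b - a for a, b in zip(seqs, seqs[1:])]
    avg_gap = sum(gaps) / len(gaps)
    return max(0.0, 1 - sampling_rate / avg_gap)

# Entries arriving on average 1250 apart instead of 1000 imply
# roughly 20% message loss:
loss = estimated_loss_fraction([0, 1250, 2500, 3750, 5000])
```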

I can also see how accepting discrepancies between similar processes is much simpler. In other words we could strive towards consistency where possible, but not beyond that. I expect everyone will understand that different processing with different safe-guards can result in slightly different numbers. And if the difference happens to be major some month we will and should be alerted and will follow-up.

I support keeping things simple for this project and simply replacing the source and definition of the page view logs as Dan proposes. Our definitions will improve over time as will our ability to identify and correct errors.

-Toby

Erik, I'm in favor of keeping the processing that accounts for bad data. It's one of the reasons I didn't want to replace the breakdown reports, because it would mean having to translate all that work into Hadoop world, and I know it would take a while. We also have new ways of finding missing data (pretty accurate, based on sequence number analysis) that we can use instead of automatic detection, so it'll get a bit more complicated before it gets simple again.

Consistency, in this case, will be at the level of definitions and data, not processing, in my opinion. Tillman, let us know if you disagree. I'll also ask the rest of the team. I think there's also room for compromise here, where we can track both the raw stats and normalized stats.

Yes, so what I said above was indeed about using consistent definitions (see first sentence), although I mean that to include e.g. the choice to report the sum of all views included in projectview_hourly, removing only spiders/bots (i.e. WHERE agent_type = 'user') - it's not quite clear to me if such choices count as "processing" or not.

I can see the reason for making one-time adjustments in the case of such isolated incidents as Erik describes (recalling e.g. the data loss incidents two months ago), although in those cases we should still strive to update the other places where we communicate the same numbers, for consistency. (BTW I would expect the quarterly report, at least, to switch to Wikistats as a source anyway once this update has been completed.)

What I would really like us to avoid is trying to "fix" shortcomings in the main content pageview definition downstream. Say if we find that Hindi Wiktionary's pageviews have been artificially inflated by a bot for the last half year, we should change the main pageview definition to exclude that bot, not exclude the Hindi Wiktionary from the Wikistats reports. I seem to recall quite a few cases like that from past years, which I guess may have been justified at the time because of our limited ability to react to such findings in more systematic ways back then. But going forward this should be avoided - let's keep things as simple and consistent as possible.

@Tbayer not sure why you mention Wikistats in this context. Or am I getting you wrong?

I don't recall even one occasion where Wikistats ruled out some wiki because of peak traffic. If you look at http://stats.wikimedia.org/wiktionary/EN/TablesPageViewsMonthly.htm I see sudden peaks in several columns. In Sep/Oct 2011 a botnet did generate 5% of our overall traffic by spamming wk:pt. That still stands out in the table.

@ezachte: It looks like we are on the same page here (no pun intended ;), so no need to dig further into this, but perhaps it's worth stressing that Hindi Wiktionary was an intentionally made-up example ("say if we find..."), and by "cases like that" I wasn't referring to pageview data on Wikistats (but possibly to other parts of Wikistats, and examinations of usage data elsewhere). Again, if everyone agrees that this kind of thing should be avoided and shortcomings of the pageview definition should be fixed centrally in the definition itself, the issue is moot, beyond recording that consensus here for the future.

Milimetric updated the task description. Oct 13 2015, 5:20 PM

Change 246149 had a related patch set uploaded (by Milimetric):
[WIP] Archive hourly pageviews by article in wsc format

https://gerrit.wikimedia.org/r/246149

While waiting for new input for Monthly Pageview Reports (which is coming along, thanks @Milimetric !), I looked into Traffic Breakdown Reports, subset Geo Reports.

I decided I can actually do that myself, so I built a hive script yesterday to collect daily geo breakdowns of page view data with 15 min precision. The job is backfilling from Sep 1 onwards (about 13 minutes per day).
Consider this an unofficial solution until the data have been vetted and the scripts have been productionized.

stat1002:/a/wikistats_git/squids/csv/yyyy-mm/yyyy-mm-dd/public/views_geo.csv.bz2 contains
continent,country_name,country_code,wiki,access_method,time_bin,count
Africa,Angola,AO,ab.wikipedia,mobile web,2015-09-02:04-15,2
Africa,Angola,AO,ab.wikipedia,mobile web,2015-09-02:17-45,2
Africa,Angola,AO,ab.wikipedia,mobile web,2015-09-02:21-45,2
Africa,Angola,AO,ace.wikipedia,mobile web,2015-09-02:15-30,1
Africa,Angola,AO,af.wikipedia,desktop,2015-09-02:04-30,1
Africa,Angola,AO,af.wikipedia,desktop,2015-09-02:09-30,1

Swapping this file in for the squid-logs-based csv file is not a difficult task.
I can aggregate into hourly/daily data for publication on dumps server (to be decided separately).
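
Aggregating the 15-minute bins into daily counts is a simple group-by on the date prefix of time_bin; a sketch over rows shaped like the sample above (helper name hypothetical):

```python
from collections import defaultdict
import csv, io

SAMPLE = """Africa,Angola,AO,ab.wikipedia,mobile web,2015-09-02:04-15,2
Africa,Angola,AO,ab.wikipedia,mobile web,2015-09-02:17-45,2
Africa,Angola,AO,ab.wikipedia,mobile web,2015-09-02:21-45,2
Africa,Angola,AO,ace.wikipedia,mobile web,2015-09-02:15-30,1
"""

def daily_totals(rows):
    """Sum 15-minute bins like '2015-09-02:04-15' up to whole days,
    keyed by (wiki, access_method, date). Hypothetical sketch."""
    totals = defaultdict(int)
    for continent, country, code, wiki, access, time_bin, count in rows:
        date = time_bin.split(':')[0]  # drop the hh-mm part
        totals[(wiki, access, date)] += int(count)
    return dict(totals)

totals = daily_totals(csv.reader(io.StringIO(SAMPLE)))
# ab.wikipedia mobile web on 2015-09-02 sums to 2+2+2 = 6
```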

Doing the same for page edits is another matter, to be considered separately.
Then later (optional but recommended) I can isolate the perl code for these geo reports into a separate module.

Change 247323 had a related patch set uploaded (by Milimetric):
Parametrize the input filenames format

https://gerrit.wikimedia.org/r/247323

Change 247458 had a related patch set uploaded (by Milimetric):
Aggregate from projectviews-*, not projectcounts-*

https://gerrit.wikimedia.org/r/247458

Change 247323 merged by jenkins-bot:
Parametrize the input filenames format

https://gerrit.wikimedia.org/r/247323

This got merged (Jenkins automatically does that when you CR +2, so you need to CR -1 if you want to prevent it).

However, it's ok, this is the first thing that needed to be done in the deployment plan. I outlined the plan in this commit message: https://gerrit.wikimedia.org/r/#/c/246149/

Change 246149 merged by Joal:
Archive hourly pageviews in legacy format

https://gerrit.wikimedia.org/r/246149

The deployment of this is stalling because of two production issues that we had to jump on this week. Sorry for the delay, everyone involved, our operational role is an unmitigated risk.

@ezachte, this has been done. The new dataset will be available at dumps.wikimedia.org/other/pageviews/ once the rsync completes. (And at /data/xmldatadumps/public/other/pageviews on the file system)

Let us know if you have any problems.

Nuria moved this task from Done to Ready to Deploy on the Analytics-Kanban board. Oct 28 2015, 4:07 PM

Change 247458 merged by Ottomata:
Aggregate from projectviews-*, not projectcounts-*

https://gerrit.wikimedia.org/r/247458

Thanks Dan! I'll do some sanity checks, and report back.

You're very welcome, Erik. There was one small last bug, the pageviews-* files got synced without being zipped, so we're going to fix that tomorrow. The file names will just get a .gz. The projectviews-* files will all stay the same though.

Let us know, most importantly, if you need more pageviews-* data backfilled, right now it's just generating going forward from October 26th.

Dan, here is a comparison of data for one hour in webstatscollector 1/2/3
Most counts are similar, or understandably different. A few differences I'm not sure what to make of. Any ideas?

kevinator renamed this task from Feed Wikistats traffic reports with aggregated hive data {lama} [8 pts] to Feed Wikistats traffic reports with aggregated hive data {lama} [21 pts]. Oct 29 2015, 4:09 PM

The .mw is missing from the "wsc 3.0" data. Hm, I missed the extra INSERT statement in Hive which produces the "wsc 2.0" data: I didn't realize it was doing two INSERTs. I can add that in, and we'd have to backfill again. Let me know, but I'm not a huge fan, because if someone didn't know, they would add up all the numbers and get something wrong.

About the Zero numbers not showing up in wsc 2.0 but showing up in wsc 3.0, I see that but I have no explanation at all :( I have always been super confused about how zero data goes through our pipeline, but the wsc 3.0 way seems to make more sense. Wsc 2.0 just looks at the hostname to see if it's zero. Wsc 3.0 relies on the request to be tagged "zero" in the refine process, which uses X-Analytics. That seems more authoritative to me, so I'd trust the Wsc 3.0 numbers and check with the Zero team if you still feel uneasy.

I agree with your explanations for the other discrepancies, it's very nice the way you summarized the data and differences.

Milimetric moved this task from Done to In Progress on the Analytics-Kanban board. Oct 29 2015, 9:40 PM
Milimetric moved this task from In Progress to In Code Review on the Analytics-Kanban board.

I wonder if we redacted the W0 numbers in 2.0. I seem to remember some concerns about sharing those publicly.

Oh yeah... But they're not fully redacted anyway... hm...

Is there any sensitivity we need to be aware of when publishing reports for small countries from the unsampled logs? For projects with little to no activity a set of localized pageviews can disclose the location of an editor.

+1 hm on redacted numbers.

It seems the end result has fallen through the cracks. For webstatscollector 3.0, does that policy still apply?

Is there any sensitivity we need to be aware of when publishing reports for small countries from the unsampled logs? For projects with little to no activity a set of localized pageviews can disclose the location of an editor.

The pageviews aren't localized as part of this dataset, this is just Page Title, View Count. Do you mean the localization that wikistats does, used in combination with this? I'm not seeing the connection there either.

+1 hm on redacted numbers.

It seems the end result has fallen through the cracks. For webstatscollector 3.0, does that policy still apply?

I'm not sure, I'll try to get @kevinator to look into that, but we think it probably still applies. It would be fairly easy to just remove .zero completely from this data, shall we do that?

I also vote for doing away with .mw, it's redundant, and confusing indeed.

For percentage of bots I changed 'OK-ish' to 'spot-on': a grep on "bot|spider|crawl|http" also catches 19% bots for this particular hour.
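
That grep translates directly to a regex over user-agent strings (the pattern is the one quoted above; the production pageview definition's spider detection is more elaborate):

```python
import re

# The same pattern as the grep above; the real pageview definition's
# spider detection is more elaborate than this.
BOT_RE = re.compile(r"bot|spider|crawl|http", re.IGNORECASE)

def looks_like_bot(user_agent):
    return bool(BOT_RE.search(user_agent))

print(looks_like_bot("Googlebot/2.1 (+http://www.google.com/bot.html)"))  # → True
print(looks_like_bot("Mozilla/5.0 (Windows NT 10.0) AppleWebKit/537.36"))  # → False
```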

As for zero traffic being derived from X-Analytics, that makes sense.
But why would we maintain a different url at all then for zero traffic? Is that a relic of the past now?

As for totals for zero traffic: the query from webrequest is 10x the query from pageview_hourly (am I doing something wrong here?). The WC 3.0 count is close to the smaller number, but if the webrequest number is better, we'd want to acknowledge that somehow?

Lastly, the WC 3.0 number for en.wikipedia desktop is 4.8% higher than the direct hive query: 6206k vs 5922k.
And the WC 3.0 number for en.wikipedia mobile_web is 3.1% lower than the direct hive query: 4976k vs 4950k.
So ballpark OK, but worth another look at how this can be explained?

Is there any sensitivity we need to be aware of when publishing reports for small countries from the unsampled logs? For projects with little to no activity a set of localized pageviews can disclose the location of an editor.

The pageviews aren't localized as part of this dataset, this is just Page Title, View Count. Do you mean the localization that wikistats does, used in combination with this? I'm not seeing the connection there either.

The overall phab task also deals with breakdown per traffic type and region. It's just not what we are vetting right now.

The 10x larger numbers in webrequest vs. pageview_hourly are probably due to is_pageview being false for 90% of the hits. That makes sense on the regular site where there are a lot of things like JS, CSS, etc. coming down with each pageview. It's a bit surprising on wpzero. You can add the is_pageview filter on webrequest to validate this theory.

The discrepancy between the pageview_hourly and WC 3.0 numbers is indeed very weird. That just shouldn't be, or if anything WC 3.0 should be lower. So I'll think that through.

I did have a question: you can switch to this new dataset once we figure out whether or not to remove Wikipedia Zero traffic, correct? I mean, we don't have to make sure we understand all these differences first? Or do you think that's more prudent, so you don't have to re-run computations later unnecessarily?

The 10x larger numbers in webrequest vs. pageview_hourly are probably due to is_pageview being false for 90% of the hits. That makes sense on the regular site where there are a lot of things like JS, CSS, etc. coming down with each pageview. It's a bit surprising on wpzero. You can add the is_pageview filter on webrequest to validate this theory.

Dan, you're right: broken down by 'is_pageview', both webrequest and pageview_hourly give the same number for "agent_type='user' and is_zero".

I think if the projectviews files need to be regenerated I can quite easily regenerate the yearly tar files and the resulting csv files. I do plan to keep old and new data streams separate and just switch during aggregation from one set of files to another, on some nice first-of-the-month cutover date.

Also, you asked earlier about backfilling. Nuria just reported that most of the huge overcount on smaller projects is due to bogus traffic (strangely, at times we have 50+ Special:HideBanners requests for every real page request on smaller projects):
https://phabricator.wikimedia.org/T116609
It would be nice if we could repair these stats for the older history.

Ok. I think backfilling should be just a config change and a bunch of CPU / IO for the cluster. But I'll have to check with @JAllemandou if I'm missing something.

So @kevinator: could you check with the Wikipedia Zero folks on what we should do about zero numbers in this new dataset? Once we know that we can update the code, regenerate, and finish this.

@Milimetric, I spoke to @DFoy and there aren't any issues writing Wikipedia Zero pageviews numbers per project and page as long as there is no data on the country the requests come from.

@ezachte, before we backfill pageviews-* data back to May, I just want to double check, if that's useful to your process. It seems you're using mostly projectviews-*, and that's already backfilled as far back as we have data.

@ezachte: I think I got to the bottom of the 4.8% difference between WC 3.0 and wmf.pageview_hourly.

Basically the WC 3.0 files are named for the "end" of the period [1], for legacy compatibility according to that code comment. So, 2015-10-28 00:00:00 is data from 2015-10-27 23:00. If you query that in Hive, and exclude "zero_carrier is null" from the result, and group by access_method and agent_type, it looks close enough to make me stop looking. It's not a perfect match, which is a little annoying, but let me know if you're not ok with this explanation.

[1] https://github.com/wikimedia/analytics-refinery/blob/master/oozie/pageview/hourly/coordinator.xml#L130
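
In practice that means shifting the filename timestamp back one hour to find the Hive partition it covers; a sketch (assuming the end-of-period naming described above):

```python
from datetime import datetime, timedelta

def data_hour_for_filename(file_timestamp):
    """Map a WC 3.0 file timestamp to the (year, month, day, hour) it
    actually covers, assuming files are named for the *end* of the
    period as described above. Sketch; verify against the oozie config."""
    named = datetime.strptime(file_timestamp, "%Y-%m-%d %H:%M:%S")
    start = named - timedelta(hours=1)
    return start.year, start.month, start.day, start.hour

print(data_hour_for_filename("2015-10-28 00:00:00"))  # → (2015, 10, 27, 23)
```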

ezachte added a comment. Edited Nov 2 2015, 3:02 PM

@Milimetric, projectviews are indeed all I need for this process
(someday when I upgrade daily&monthly aggregates, backfilling pageviews could be helpful) [1]

Status:

  • I updated the script to collect all projectviews files into yearly tar (primary reason to add this step is that in earlier years massive repair was needed, so these tars are not always 1:1 from https://dumps.wikimedia.org/other/pagecounts-raw/)
  • I updated the script to collect counts from these tars. This script can switch at an arbitrary date from the wc 1.0 tar to the new wc 3.0 tar, and change which syntax (codes) to accept
  • To do:

    Complete testing the above

    Cleaning house: change file names and locations, among other things so that these csv files are together as one set, not mingled with 1000+ other dump-based files in one folder; also add headers and comments

    Adapt several scripts so these can find the csv files at new location and name (and still at old location for A/B tests)

    Then rsync daily zip to https://dumps.wikimedia.org/other/pagecounts-ez/wikistats/ for everyone interested

Q: do you happen to have access to dataset1001?
I'm trying to add a folder stat1002:/mnt/data/xmldatadumps/public/other/pagecounts-ez/projectviews like stat1002:/mnt/data/xmldatadumps/public/other/pagecounts-ez/projectcounts, but rsync doesn't let me

[1] https://dumps.wikimedia.org/other/pagecounts-ez/merged/

@ezachte, to rsync from stat1002 or stat1003 to dataset1001 into your pagecounts-ez directory, you SHOULD be able to do:

rsync -rv /path/to/source/ dataset1001.wikimedia.org::pagecounts-ez/path/to/dest/

However, it looks like the analytics VLAN firewall is blocking. Will open a ticket to fix.

ezachte added a comment. Edited Nov 4 2015, 5:53 PM

Status: updates have been tested, see stat1002:/a/dammit.lt/projectviews/projectviews_csv.zip

Getting the syntax right for webstatscollector 3.0 was painful. Someday we need to update to wc 3.1 with orthogonal syntax

  1. in general .mw means total for all projects combined (relic of the past)
  2. in wc 1.0 '.m' is only for special projects
  3. wc 2.0 introduces '.m.' which means mobile e.g. 'en.m.q' is 'mobile wikiquote' and 'en.m' (think 'en.m.p' for wikipedia, but '.p' suffix is implicit) means 'mobile wikipedia'
  4. in wc 2.0 there is en.m en.m.b en.m.n en.m.q etc and en.mw, the latter being the overall total of all these
  5. in wc 3.0 .mw is gone, being redundant
  6. Now what does 'species.m.m' mean? Hint: '.m.' is here for 'special project', not for 'mobile'. Aargh!
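
My reading of rules 1-6 as a decoder, for wc 2.0/3.0 codes only (it ignores the wc 1.0 meaning of '.m' from rule 2, and the mapping is my interpretation, not authoritative):

```python
# Suffix table per rule 3 and friends; partial, for illustration only.
SUFFIXES = {'b': 'wikibooks', 'n': 'wikinews', 'q': 'wikiquote',
            'd': 'wiktionary', 's': 'wikisource', 'v': 'wikiversity'}

def parse_wc_code(code):
    """Decode a wc 2.0/3.0 site code into (language, mobile, project).
    Sketch of rules 1-6 above; not authoritative."""
    lang, *rest = code.split('.')
    if rest == ['mw']:           # rule 4: grand total; gone in wc 3.0 (rule 5)
        return lang, None, 'all projects'
    if rest == ['m', 'm']:       # rule 6: here '.m.' marks a special project
        return lang, True, 'special'
    mobile = bool(rest) and rest[0] == 'm'
    if mobile:
        rest = rest[1:]
    if not rest:                 # rule 3: the '.p' wikipedia suffix is implicit
        return lang, mobile, 'wikipedia'
    return lang, mobile, SUFFIXES.get(rest[0], 'unknown')

print(parse_wc_code('en.m.q'))       # → ('en', True, 'wikiquote')
print(parse_wc_code('species.m.m'))  # → ('species', True, 'special')
```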

To do:

  • Adapt several scripts so these can find the csv files at new location and name (and still at old location for A/B tests)

:( sorry for the format problems, Erik; I understand you fought through it but you should've just pushed back on us, if we could've made it easier. I copied the code from the wc 2.0 format, so I don't see where the differences come from :( Sorry again.

Hey Dan, no worries. I should have been more clear. This has nothing to do with your upgrade to webstatscollector 3.0. It's a result of a conscious decision by Christian and me to keep webstatscollector 2.0 fully backward compatible. We chose to keep the upgrade to wc 2.0 transparent for users, who could switch to the new files but ignore the new codes. This allowed us to do the upgrade fast.

We also thought at that time this would need to be fixed, but only after thorough consultation of wikitech-l, and an early announcement. That just hasn't happened yet. I'm reminding us of this issue.

https://wikitech.wikimedia.org/wiki/Analytics/Data/Pagecounts-all-sites

My scripts were processing webstatscollector 1.0 output so far; that's why I encountered it only now.

This comment was removed by ezachte.
ezachte added a comment. Edited Nov 6 2015, 12:02 AM

So I checked with a hive query on pageview_hourly.

Close but no cigar.

To do:

-Need to check at least wikinews and foundation further.
-Grey earlier months on smaller projects that are corrupted by HideBanners bug
-Disable YoY top row, as long as previous year is with bots and current year isn't
-Update diagram, new file names
-Update comments in reports: bots no longer included
-Mark report clearly as using new page definitions
-Add/upd docs on dataset1001, a.o. in new folder https://dumps.wikimedia.org/other/pagecounts-ez/projectviews/

@ezachte: I think I got to the bottom of the 4.8% difference between WC 3.0 and wmf.pageview_hourly.
Basically the WC 3.0 files are named for the "end" of the period [1] for legacy compatibility according to that code comment. So, 2015-10-28 00:00:00 is data from 2015-10-27 23:00.

That makes total sense to me. Thanks, Dan

@ezachte, I checked yesterday a little bit, by looking at vital signs data [1] which is what this chart [2] uses. I saw similar numbers but that makes sense with your comparison too. So it's really weird that only wikinews and foundation are different. Do those go through different processing in wikistats?

[1] https://metrics.wmflabs.org/static/public/datafiles/Pageviews/enwiki.csv (also generated from pageview hourly, the Pageviews directory has each project as <<database name>>.csv)
[2] https://vital-signs.wmflabs.org/#projects=enwiki/metrics=Pageviews

Fixed foundation stats, which uses codes a bit differently:
www.f is foundation desktop, m.f is foundation mobile, zero.f is foundation zero.

The other 'anomaly' was a reading error. I copied English counts for a number of reports to sanity_check_pageviews_hourly_vs_wikistats.xls but hadn't noticed that for wikinews the German column comes before the English one.

New to do:

@Nemo_bis

Will the "Views/hr" column in the index for each project (https://stats.wikimedia.org/wiktionary/EN/ and friends) be converted too?

Yes, and the summary charts http://stats.wikimedia.org/wikiversity/EN/ReportCardTopWikis.htm

It's interesting to see that some languages are unaffected by the new calculations, for instance Vietnamese Wikiquote is stable at some 460 k/month. https://stats.wikimedia.org/wikiquote/EN/TablesPageViewsMonthly.htm Other results confirm past suspicions about crawlers, for instance French Wikiquote and Serbian Wikinews.

Other results confirm past suspicions about crawlers, for instance French Wikiquote and Serbian Wikinews.

Also please remember that we are filtering bot traffic using the user agent only, we certainly have more crawlers and automated traffic that is passing this filter, but, again, little by little.

Dan,
I'm still working on loose ends for Monthly Page View Reports.
This task was also about Traffic Breakdown Reports, which we just started working on. Is that another phab task now?

Done since prev status report

  • Mark report clearly as using new page definitions (green text block)
  • Disabled YoY top row, until new pagedef is in use for at least a year
  • Fixed anomalous counts for foundation
  • Extra validation of input in project[counts|views] files (for special projects the suffixes .m, .m.m and .wm are valid; no suffix is invalid, but did occur)
  • Added counts for wikidata and legacy special projects (a.o. strategy) [1]
  • Use new file naming scheme for all files drawn from webstatscollector (yet to document)
  • Use estimate for meta between Aug 2012 and Apr 2015 when Special:RecordImpression caused a 1000-fold overcount (thus wrecking totals for all special projects combined) [1] (will up estimate on next run to 7M)
  • Quite a few minor tweaks in the introduction section of the Page Views reports to make it easier to understand (e.g. what links to where)

Doing/ToDo

  • Trying to assess amount of HideBanners traffic for 2014/15 from sampled squid logs
  • Update diagram, to show new file names
  • Add/upd docs on dataset1001, a.o. in new folder https://dumps.wikimedia.org/other/pagecounts-ez/projectviews/
  • Take a look at what causes traffic for commons to drop steeply in recent months (all using new definition) [1]
  • Possibly clean up WikiCountsSummarizeProjectCounts.pl and make it independent from other Wikistats projects, and move it to dammit.lt/perl where all pageview and projectview counting occurs (not trivial, but will make maintenance easier)

[1] https://stats.wikimedia.org/wikispecial/EN/TablesPageViewsMonthly.htm

@ezachte:
I created https://phabricator.wikimedia.org/T118323 to keep track of the "pageviews per country" report. We can use it or not (whatever is more convenient for you)

Wow, Erik, the commons numbers do indeed drop a lot on September 15th: https://vital-signs.wmflabs.org/#projects=commonswiki/metrics=Pageviews

I wonder what happened...

I'm happy to use whichever task you like, I'll delete the other one if we keep working here.

@ezachte, @Milimetric,
I have the reason for the commons drop: this deploy, including better spider detection using regexp over pageviews.
In hive:

SELECT agent_type, year, month, day, SUM(view_count)
FROM wmf.projectview_hourly
WHERE project = 'commons.wikimedia'
    AND year = 2015 AND month = 9 and day > 10 AND day < 20
GROUP BY agent_type, year, month, day
ORDER BY agent_type, year, month, day
LIMIT 100000;

@JAllemandou, Wow, great find! I guess this affects mostly wikis where a large percentage of page views is from wikipedians editing pages. Looking at http://stats.wikimedia.org/wikispecial/EN/TablesPageViewsMonthly.htm a similar effect seems to occur at meta. But somehow not at wikidata.

kevinator closed this task as Resolved. Nov 19 2015, 5:03 PM
ezachte updated the task description. Nov 19 2015, 8:40 PM
ezachte reopened this task as Open.
ezachte added a comment. Edited Nov 19 2015, 9:03 PM

Done

Updated diagram, among other things to show new file names + added missing report

Added docs on dataset1001

Updated summary chart for page views to show old and new page views separately
Mobile didn't change much (no bots ever) so no distinction for old and new counts
Mobile counts will be shown separately only if 10% of total or more (to keep these small charts tidy)

Created smaller version of edit history chart, and added this to summary report

Added links to multi-year edit and view charts + link to summary report to Monthly Pageview Report
http://stats.wikimedia.org/EN/TablesPageViewsMonthlyCombined.htm

Ongoing

Upgrading daily and monthly aggregation jobs to generate files in https://dumps.wikimedia.org/other/pagecounts-ez/merged/ from WC 3.0 data stream
(marked as 'to do' in diagram above)

This needs more scrutiny, as the new daily aggregate is 4 times smaller than the old one.
This is unexpected, as WC 3.0 contains a finer breakdown than WC 1.0 (with extra codes for mobile and zero).

ezachte updated the task description. (Show Details)Nov 20 2015, 10:27 PM

I migrated daily/monthly aggregates from WC 1 to WC 3. This concludes the migration effort for the Monthly Page Views stream.

Here is a comparison of per wiki view counts from one day, using the old and new definition.
For large wikis the ratio of new vs old is reasonably close to 1, in percentages 90%-100%. For small wikis the ratio is much lower.

And that made me realize my assumption yesterday, that the new file should be larger due to its finer breakdown into different codes, was wrong. It does make sense to me now that the new daily aggregate has fewer lines. In WC 1.0 nearly all article titles were listed, even when in many cases this was only from crawler traffic. In WC 3.0 all bot access is filtered, so many titles are now missing from http://dumps.wikimedia.org/other/pageviews/yyyy/yyyy-mm/pageviews-yymmdd-hhnnss.gz

This long-tail effect hits less popular Wikipedia languages and non-Wikipedia sister projects alike: bots grab all content at a similar frequency, so their share grows when human page views are low.
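
The arithmetic behind this: old (WC 1.0) counts included bots, new (WC 3.0) counts don't, so the new/old ratio is human / (human + bot), which shrinks as human traffic shrinks. A toy illustration (numbers invented):

```python
def new_to_old_ratio(human_views, bot_views):
    """Old counts (WC 1.0) included bots; new counts (WC 3.0) exclude
    them, so the ratio is human / (human + bot). Toy numbers only."""
    return human_views / (human_views + bot_views)

# Bots grab all content at a similar frequency regardless of popularity,
# so assume a fixed 1M bot views against very different human volumes:
big_wiki = new_to_old_ratio(human_views=100_000_000, bot_views=1_000_000)
small_wiki = new_to_old_ratio(human_views=500_000, bot_views=1_000_000)
print(round(big_wiki, 2), round(small_wiki, 2))  # → 0.99 0.33
```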

For large wikis the ratio of new vs old is reasonably close to 1, in percentages 90%-100%. For small wikis the ratio is much lower.

Very interesting ranking. We learn that en.wiki and zh.wiki are overcrawled compared to the typical "top 10" wiki, while some wikis among those with the least crawler traffic are really small, as if crawlers didn't care about them (ig, tl, arz, ay, min, so, wuu, gu, azb, dsb, kn, mn, ky, bo, nv, ksh, om, st, ...).

Three new charts for per-project totals; to do: 'Total articles'

Preview: http://stats.wikimedia.org/EN/draft/SummaryZZ.htm

Added 3 more charts for per project totals, e.g. http://stats.wikimedia.org/EN/draft/SummaryZZ.htm (preview location)

'Total articles',
'New articles'
'Active Wikis (3+ active editors)' (3+ is arbitrary, open for discussion)

Last upgrade I can think of for monthly pageview reports: add page view stats for Wikivoyage

Nuria closed this task as Resolved. Dec 14 2015, 4:53 PM
Nuria reopened this task as Open.