Regional codes and number of speakers per language.
Apr 24 2016
As for counting bytes vs chars vs words, here are some considerations.
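One such consideration, as a minimal Python sketch with a made-up sample string: the three counts diverge as soon as the text contains multi-byte characters, so it matters which one a report uses.

```python
# Toy example: the same string measured three ways.
text = "naïve café"

n_bytes = len(text.encode("utf-8"))   # 12: ï and é take two bytes each in UTF-8
n_chars = len(text)                   # 10 Unicode code points
n_words = len(text.split())           # 2 whitespace-delimited tokens

print(n_bytes, n_chars, n_words)
```

For ASCII-only text all three byte/char counts coincide, which is why the distinction is easy to miss until non-Latin wikis are involved.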
Apr 21 2016
The daily/monthly aggregates already use the newest data feed (aka Dan's webstatscollector 3.0, hadoop based)
We should abandon the hourly file feeds, based on webstatscollector 1.0/2.0. Not the aggregates.
Apr 19 2016
I looked at the backups on stat1001. I need to tidy things up: some backups occur too often and contain a lot of garbage. Apologies for the overhead this incurred.
Apr 13 2016
I can't login right now to check.
The vast majority of that 2TB will be backups, which I thin out every half year or so.
All html files in htdocs should be copies from generated files on stat1002.
Mar 26 2016
Changing default to percentages will make WoW changes really small (except when a new browser version is released and people do mass update).
Mar 25 2016
Denied. Expect an announcement on Wikistats in the coming weeks, as soon as the migrated traffic reports (breakdown of browser and OS traffic data) are published.
Mar 7 2016
@Not much. I did some consistency checks, but nothing conclusive yet. My approach is to compare Wikistats counts with ad hoc aggregated webstatscollector 3.0 counts. If those match, it's out of my hands and the mismatch should be found in the hive scripts. If those don't match, hopefully it will become apparent what constitutes the difference. BTW I may be mostly offline for the rest of the week (moving).
Mar 5 2016
The two metrics are incompatible from 2007 onwards
Mar 2 2016
I was going to update those docs, then I forgot. My bad.
@Nuria it depends on what @Biangjang meant: I thought separate counts for each article. That might work in theory, but not in practice for the largest wikis, no? If a combined total for all articles, then yes of course.
Feb 29 2016
Bianjiang, probably not. Dumps 2.0 is about database dumps, not traffic log dumps.
Daily and monthly aggregates are at https://dumps.wikimedia.org/other/pagecounts-ez/merged/
Feb 25 2016
Yes, that's why I asked:
Feb 24 2016
2) I updated links in both places, thanks for noticing.
@Zdzislaw. Hardcoded extra namespaces for some wikisource projects is older code. The newer way is to follow the API which lists all content namespaces per wiki. Every day I harvest these settings for all wikis.
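The API-based approach can be sketched as follows. This assumes the JSON shape of a MediaWiki `action=query&meta=siteinfo&siprop=namespaces&formatversion=2` response, where content namespaces carry `"content": true`; the sample response below is a trimmed, hypothetical one.

```python
# Sketch: extract content namespace ids from a MediaWiki siteinfo response.
def content_namespaces(siteinfo: dict) -> list:
    """Return sorted ids of namespaces flagged as content namespaces."""
    namespaces = siteinfo["query"]["namespaces"].values()
    return sorted(ns["id"] for ns in namespaces if ns.get("content"))

# Trimmed sample: a wikisource-like wiki with an extra content namespace (Page).
sample = {
    "query": {
        "namespaces": {
            "0":   {"id": 0,   "name": "",     "content": True},
            "1":   {"id": 1,   "name": "Talk"},
            "104": {"id": 104, "name": "Page", "content": True},
        }
    }
}

print(content_namespaces(sample))  # [0, 104]
```

Harvesting this per wiki once a day, as described above, avoids hardcoding the extra wikisource namespaces.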
Feb 18 2016
1) The code is at https://github.com/wikimedia/analytics-wikistats/blob/master/dumps/perl/WikiCountsInput.pm, line 1810 etc., sub CollectArticleCounts
WikiStats is poorly documented (but that fact itself is pretty well documented, as I point it out every year or so).
Jan 28 2016
Path names for this step were wrong after the major update for https://phabricator.wikimedia.org/T114379. Fixed.
Oops. This I fixed some two weeks ago, but I hadn't marked it as resolved yet. Doing that now.
Jan 22 2016
If we have more human resources than functional requirements, I'd like to propose this idea: what about making this new UI language independent, internationalization done via Translatewiki?
Jan 21 2016
Jan 18 2016
Another tool that was hugely popular among press people many years ago was Wikistics, built on top of stats.grok.se.
It focuses entirely on the most accessed pages, in a very simple format.
Here is a screen copy of WikiViewStats.
closed https://gerrit.wikimedia.org/r/#/c/92056/ -> 4/5 open
@Nemo_bis says this may have to wait till Feb
in the meantime I can look further into '[Full dump analysis] Reduce edits_only and reverts_only intricacy'
Jan 17 2016
There was lingering test code. Files are up to date now.
Jan 15 2016
Since early Dec 2015 there is a new chart on 'Active wikis' which will help us to assess a good cut-off point.
I reached out by mail to @Nemo_bis with comments on each open patch.
Jan 13 2016
Well, everything did get synced in the end. My assumption about the required folder rights was wrong. Still not sure why
hdfs dfs -put -f /a/wikistats_git/mediacounts/daily/2016/mediacounts.top1000.2016-01-02.v00.csv.zip hdfs:///wmf/data/archive/mediacounts/daily/2016
didn't produce an immediate update.
Jan 12 2016
@Ottomata now both 2015 and 2016 have drwxr-xr-x instead of drwxrwxr-x, so I can't update 2016.
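For reference, the difference between those two mode strings is just the group write bit, which is what the sync needs. A small Python sketch with the octal equivalents (0o755 vs 0o775):

```python
import stat

# drwxr-xr-x = 0o755: group may read/traverse but not write.
# drwxrwxr-x = 0o775: group may also write, so group members can add files.
mode_now  = 0o755
mode_need = 0o775

print(bool(mode_now & stat.S_IWGRP))   # False: no group write
print(bool(mode_need & stat.S_IWGRP))  # True: group write allowed
```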
Jan 11 2016
Jan 8 2016
@Hydrix, sorry, the fix was incomplete: the rsync still fails on folder access rights. I can't fix that myself (and today is all-staff) but it should be done no later than Monday, I assume.
Jan 6 2016
Connection of stat1002 with /mnt/hdfs/wmf/data/archive/projectview/geo/hourly/ was lost
@Ottomata fixed this: "Hadoop namenode was inactive"
Dec 17 2015
Revisiting that page, I think my comment about it not being terribly essential was mostly about the 2nd, 3rd and 4th tables on that page, which focus on the most requested non-existing pages and files.
I don't think the new page view api can zoom in on those missing pages/files in particular. But again, there seems to be little demand for it.
Dec 11 2015
The empty row is YoY, which will return once the new pageview definition has been in place for 13 months. (Removing it entirely would be better, but it's really a small matter.)
Dec 10 2015
New location is http://stats.wikimedia.org/mail-lists/index.html
Dec 9 2015
@Milimetric FYI the dumps use http://dumps.wikimedia.org/other/pageviews/
What makes them still useful is that they contain page views for all articles (with 5 or more views per month): monthly totals, while retaining hourly precision.
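The "monthly totals, while retaining hourly precision" trick relies on a compact hour-letter encoding. A decoder sketch, assuming the scheme documented for the pagecounts-ez files (hours 0-23 written as letters A-X, each followed by its count); treat the exact format details as an assumption here:

```python
import re

def decode_hours(encoded: str) -> dict:
    """Decode an hour-letter string, e.g. 'A5B3X1' -> {0: 5, 1: 3, 23: 1}.

    Assumed scheme: letter A..X names the hour (A=00:00 .. X=23:00),
    followed by the view count for that hour; absent hours had no views.
    """
    return {ord(hour) - ord("A"): int(count)
            for hour, count in re.findall(r"([A-X])(\d+)", encoded)}

print(decode_hours("A5B3X1"))  # {0: 5, 1: 3, 23: 1}
```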
Actually this was done already some two weeks ago as a subtask of https://phabricator.wikimedia.org/T114379
I just learned the scripts had stalled for half a year from this thread.
Dec 1 2015
The new data in http://dumps.wikimedia.org/other/pageviews/ already exclude spider requests, so contain user data only.
The way I'm reading @Milimetric's comment is that the amount of spider requests could be added as a separate metric if called for.
I would be (somewhat) interested to see the overall share of spider traffic per project (not per wiki), but no big deal at all.
We could do that with an internal hive job using sampled data.
Nov 27 2015
Added 3 more charts for per project totals, e.g. http://stats.wikimedia.org/EN/draft/SummaryZZ.htm (preview location)
Nov 24 2015
Three new charts for per project totals, to do: 'Total articles'
Nov 20 2015
I migrated daily/monthly aggregates from WC 1 to WC 3. This concludes migration effort for Monthly Page Views stream.
Nov 19 2015
Nov 17 2015
Also Magnus and I both pleaded for monthly stats earlier, each for different use cases
Nov 13 2015
@JAllemandou, Wow, great find! I guess this affects mostly wikis where a large percentage of page views is from wikipedians editing pages. Looking at http://stats.wikimedia.org/wikispecial/EN/TablesPageViewsMonthly.htm a similar effect seems to occur at meta. But somehow not at wikidata.
Nov 12 2015
I'm still working on https://phabricator.wikimedia.org/T114379 (see status report there)
Can we postpone this till everything is in place?
I'm still working on loose ends for Monthly Page View Reports.
This task was also about Traffic Breakdown Reports, which we just started to work on. Is that a separate phab task now?
Nov 11 2015
Will the "Views/hr" column in the index for each project (https://stats.wikimedia.org/wiktionary/EN/ and friends) be converted too?
Nov 10 2015
Fixed foundation stats, which uses codes a bit differently:
www.f is foundation desktop, m.f is foundation mobile, zero.f is foundation zero.
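As a lookup table, the code-to-site mapping above is simply (a trivial sketch; the surrounding parsing is omitted):

```python
# Mapping of the foundationwiki project codes described above.
FOUNDATION_SITES = {
    "www.f":  "foundation desktop",
    "m.f":    "foundation mobile",
    "zero.f": "foundation zero",
}

def site_label(code: str) -> str:
    """Return the human-readable label for a foundationwiki code."""
    return FOUNDATION_SITES.get(code, "unknown")

print(site_label("m.f"))  # foundation mobile
```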
BTW I propose we don't update the comScore trends (and say so on the report card). Their sudden drop seems inconsistent with our internal numbers. comScore was asked to comment and agreed to investigate, but that didn't bring us any further. We won't receive any updates from them anyway.
Well, I am mostly responsible for this. When the comScore unique visitors and page views counts dropped so suddenly that it raised serious doubt over the numbers, and our internal page views revealed massive corruption of our own data, I stalled updates (informing @Tbayer). Our internal page view numbers are mostly fixed now, and better than ever (no more bot traffic). I hope to finalize this cut-over in the coming days. We will then present updates to the report card, with better numbers since May 2015, and some older, totally insane and hard-to-fix numbers blanked out (PV for smaller projects).
Nov 6 2015
So I checked with hive query on pageviews_hourly
Nov 5 2015
Nov 4 2015
Encore: my scripts were processing webstatscollector 1.0 output so far. That's why I encountered it only now.
Hey Dan, no worries. I should have been more clear. This has nothing to do with your upgrade to webstatscollector 3.0. It's the result of a conscious decision by Christian and me to keep webstatscollector 2.0 fully backward compatible. We chose to keep the upgrade to wc 2.0 transparent for users, who could switch to the new files but ignore the new codes. This allowed us to do the upgrade fast.
By all means let's talk. I moved the meeting to Monday (I'm away Fri-Sun).
Status: updates have been tested, see stat1002:/a/dammit.lt/projectviews/projectviews_csv.zip
Once I get https://phabricator.wikimedia.org/T114379 done (hopefully tomorrow) I hope to get the geo reports back online using the new hive feed.
Nov 3 2015
As suggested in June, I think that lists of users who were counted in one dump but not another might be useful for debugging.
Nov 2 2015
@AndyRussG thanks for chiming in. Now I understand what this is about.
@Milimetric, projectviews are indeed all I need for this process
(someday when I upgrade daily&monthly aggregates, backfilling pageviews could be helpful) 
What's the best way to detect which language codes have new stats? (other than screen scraping https://meta.wikimedia.org/wiki/Research:VisualEditor)
Oct 30 2015
The 10x larger numbers in webrequest vs. pageview_hourly are probably due to is_pageview being false for 90% of the hits. That makes sense on the regular site where there are a lot of things like JS, CSS, etc. coming down with each pageview. It's a bit surprising on wpzero. You can add the is_pageview filter on webrequest to validate this theory.
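The suggested validation can be sketched with toy stand-in rows (the real check would run against the webrequest table, filtering on its `is_pageview` flag; the rows below are hypothetical):

```python
# Toy stand-in for webrequest rows: each request carries a pageview flag,
# since most hits (JS, CSS, images) accompany a pageview but are not one.
requests = [
    {"uri": "/wiki/Main_Page",              "is_pageview": True},
    {"uri": "/w/load.php?modules=site",     "is_pageview": False},  # JS/CSS
    {"uri": "/static/images/poweredby.png", "is_pageview": False},  # asset
]

pageviews = [r for r in requests if r["is_pageview"]]
ratio = len(requests) / len(pageviews)
print(f"{ratio:.0f}x more raw requests than pageviews")
```

In this toy sample the ratio is 3x; on the real site it can be roughly 10x, which is the discrepancy discussed above.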
Thanks Nuria, so we're zooming in on what happened. I'm still wondering though, how can we have 56 Special:HideBanners requests for every real page request? Doesn't that seem odd? Would we have to ask ops to explain?
Is there any sensitivity we need to be aware of when publishing reports for small countries from the unsampled logs? For projects with little to no activity a set of localized pageviews can disclose the location of an editor.
The pageviews aren't localized as part of this dataset; this is just Page Title, View Count. Do you mean the localization that Wikistats does, used in combination with this? I'm not seeing the connection there either.
I also vote for doing away with .mw, it's redundant, and confusing indeed.
+1 hm on redacted numbers.
Oct 29 2015
Dan, here is a comparison of data for one hour in webstatscollector 1/2/3
Most counts are similar, or understandably different. There are a few differences I'm not sure what to make of. Any ideas?
Oct 28 2015
Thanks Dan! I'll do some sanity checks, and report back.
Oct 27 2015
Oct 26 2015
This text in the description is misplaced: "From squid logs I get a totally different number yet again (see below)", as both the chart above and the text below it refer to the same data feed. (Read 'below' as 'upcoming comments'.)
@Nuria Right, I know actually, so yes 128 (x 1000) for sampled logs (255 - 127 CentralAutoLogin) comes somewhat close to the hive number from pageviews_hourly for July 10: 48k spider + 35k user = 82k. The 128k from squid logs is the upper limit, as that factors in mime type only.
@Ottomata squid logs is 1:1000 sampled at stat1002:/a/squid/archive/sampled>
filtering 1:1000 sampled squid logs for wikinews html requests
webstatscollector 2.0 output:
Oct 25 2015
Yes, I mean July. Aug 2011 was a botnet, with 5% of overall page views coming from fewer than 100 IP addresses, requesting Random Page (a very nifty way to keep us busy).
Oct 19 2015
Some notes on what seems an unanswerable question.
Oct 15 2015
While waiting for new input for Monthly Pageview Reports (which is coming along, thanks @Milimetric !), I looked into Traffic Breakdown Reports, subset Geo Reports.
Oct 9 2015
@Tbayer not sure why you mention Wikistats in this context. Or am I getting you wrong?
Oct 8 2015
Dan, using sequence numbers to detect anomalies makes total sense to me. In fact I also used that to repair multiple months of 20%-30% UDP message loss, by measuring per server per hour how much the average gap between sequence numbers rose above the expected average gap (which of course is 1000 for the 1:1000 sampled log). That will work for capture errors. It's not a cure-all though: it won't help for the case I mentioned where massive amounts of bogus 'page views' came our way for two weeks. Neither is my half-automated blacklisting of bad hours a cure-all.
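The gap-based loss estimate can be sketched like this (hypothetical sequence numbers for one server-hour; with 1:1000 sampling, consecutive captured sequence numbers should be about 1000 apart, so a larger average gap indicates loss):

```python
def estimated_loss(seq_numbers, sample_rate=1000):
    """Estimate the fraction of UDP messages lost, from sampled sequence numbers.

    With 1:N sampling the expected gap between consecutive captured
    sequence numbers is N; an average gap G > N implies loss of 1 - N/G.
    """
    gaps = [b - a for a, b in zip(seq_numbers, seq_numbers[1:])]
    avg_gap = sum(gaps) / len(gaps)
    return max(0.0, 1 - sample_rate / avg_gap)

# Toy hour: average gap of 1250 instead of 1000 -> roughly 20% loss.
sampled = [0, 1250, 2500, 3750, 5000]
print(f"{estimated_loss(sampled):.0%}")  # 20%
```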
So how to proceed?
@Tbayer absolutely, being consistent is important.
Oct 7 2015
The quick survey shows the most support for continuation of the geographic reports (reports 21-24), more than for other breakdowns: https://www.mediawiki.org/wiki/Analytics/Wikistats/TrafficReports/Future_per_report_B2
I figured we can produce all breakdowns by geography (middle column of the TBD diagram) with two datasets, one for views, one for edits, with only 8 fields in each: