
Restore WikiStats features disabled for mere performance reasons
Open, NormalPublic


In the example URL, "No detailed statistics for anonymous users are available for this wiki (performance reasons)". This and several other reports are perfectly ok and useful, but disabled only because the machine they've been run on for several years wasn't powerful enough.

I'm sure the WMF can now easily provide Erik with a spare server no longer used elsewhere with a faster CPU, or perform whatever micro-improvements are needed to give us back these crucial stats, even before Kraken and all the other analytics beasts are freed.

Version: unspecified
Severity: major



Event Timeline

bzimport raised the priority of this task from to Normal.Nov 22 2014, 12:49 AM
bzimport added a project: Analytics-Wikistats.
bzimport set Reference to bz42318.
bzimport added a subscriber: Unknown Object (MLST).
Nemo_bis created this task.Nov 21 2012, 9:24 AM

erikzachte wrote:

The main bottleneck was memory, as the list of anonymous IPs was huge and Perl hashes are pretty memory-intensive. Even though stat1 has much more memory than bayes, the list of IPs has of course grown over time as well.

So what I need to do is collect anon edits in a flat file and sort/aggregate them after dump parsing is complete. Then this could stay an integral part of the current job.
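
A minimal sketch of the flat-file approach described above, written in Python rather than the Perl that wikistats uses. It is not the wikistats implementation: the function names, the run size and the one-IP-per-line file layout are all assumptions made for illustration. The IPs are spilled to disk in small sorted runs, then merged and counted in a single streaming pass, so memory use is bounded by the run size instead of by the number of distinct IPs.

```python
import heapq
import itertools
import os
import tempfile


def write_sorted_runs(ips, run_size, tmp_dir):
    """Spill a stream of IP strings to disk as sorted runs (hypothetical helper).

    At most run_size lines are held in memory at any time.
    Returns the list of run file paths.
    """
    paths = []
    stream = iter(ips)
    while True:
        # Sort the lines with their newline attached so that the on-disk
        # order matches the comparison used later by heapq.merge.
        chunk = sorted(ip + "\n" for ip in itertools.islice(stream, run_size))
        if not chunk:
            break
        fd, path = tempfile.mkstemp(dir=tmp_dir, suffix=".run", text=True)
        with os.fdopen(fd, "w") as fh:
            fh.writelines(chunk)
        paths.append(path)
    return paths


def count_edits_per_ip(paths):
    """Merge the sorted runs and yield (ip, edit_count) in sorted order."""
    files = [open(p) for p in paths]
    try:
        merged = heapq.merge(*files)  # lazy k-way merge of sorted line streams
        for line, group in itertools.groupby(merged):
            yield line.rstrip("\n"), sum(1 for _ in group)
    finally:
        for fh in files:
            fh.close()
```

In this sketch the dump parser would only append IPs to the stream, and the aggregation would run once parsing is complete.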

I added a task in Asana:, but low priority

[mass-moving wikistats reports from Wikimedia→Statistics to Analytics→Wikistats to have stats issues under one Bugzilla product (see bug 42088) - sorry for the bugspam!]

Sj added a comment.Dec 14 2012, 10:52 PM

This would be awesome. I was looking for one of the anon stats last month.

Today's post by Erik made me miss them a lot. :(
This bug is seriously impairing our ability to understand what's happening on our projects and what new pieces of research mean.

erikzachte wrote:

Revert stats can be deduced from stub dumps, as these now contain checksums. It just hasn't happened yet.
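
One way checksum-based revert detection could look, as a hedged Python sketch: a revision counts as an "identity revert" when its checksum equals that of an earlier revision of the same page. This definition, the function name and the input shape (revision id and checksum pairs already extracted from the dump, in chronological order) are assumptions for illustration, not what wikistats does.

```python
def find_identity_reverts(revisions):
    """Yield (reverting_rev_id, restored_rev_id) pairs for one page.

    revisions: iterable of (rev_id, checksum) tuples in chronological order.
    A revision is treated as a revert when its checksum was already seen
    earlier in the page history. Consecutive identical checksums (null
    edits) are not reported as reverts.
    """
    first_seen = {}       # checksum -> id of the first revision with that text
    previous = None       # checksum of the immediately preceding revision
    for rev_id, checksum in revisions:
        if checksum in first_seen and checksum != previous:
            yield rev_id, first_seen[checksum]
        first_seen.setdefault(checksum, rev_id)
        previous = checksum
```

Only one dictionary per page is kept in memory, so this kind of pass would fit a streaming read of the stub dump.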

In a wider perspective:

I'm hoping the stub dumps can be extended with the few pieces of metadata that are still missing, which would fill in the blanks. Not exactly trivial, but we could then forget about full dumps for wikistats.

Things we miss:

Does the article contain an internal link? (Disregarding links in templates, for pragmatic reasons.) As it stands, wikistats produces different article counts depending on whether it processes the stub dump or the full archive dump.

Word count (wikistats first strips headers, HTML and the like, and tries to be (too) smart about (some) non-Western languages, using a conversion factor to deduce word counts from glyph counts).

External links, image counts (I guess we could skip both, less requested than metrics above)
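
To make the first two items concrete, here is a rough Python sketch of what such per-revision metadata could be computed from. It is not the wikistats logic: the regular expressions, the character ranges treated as glyphs, and the `GLYPHS_PER_WORD` conversion factor are hypothetical placeholders, and "links in templates" is approximated by dropping template calls from the wikitext.

```python
import re

# Hypothetical conversion factor; the real value used by wikistats is not
# given in this thread.
GLYPHS_PER_WORD = 2.0

_INNERMOST_TEMPLATE = re.compile(r"\{\{[^{}]*\}\}")
_INTERNAL_LINK = re.compile(r"\[\[[^\[\]]+\]\]")
_HEADING = re.compile(r"^=+.*?=+\s*$", re.MULTILINE)
_TAG = re.compile(r"<[^>]+>")
# Kana and common CJK ideographs only; an assumption, not a full script list.
_GLYPH = re.compile(r"[\u3040-\u30ff\u4e00-\u9fff]")


def has_internal_link(wikitext):
    """True if a [[...]] link remains after removing template calls."""
    previous = None
    while previous != wikitext:
        previous = wikitext
        # Removing innermost templates repeatedly also handles nesting.
        wikitext = _INNERMOST_TEMPLATE.sub("", wikitext)
    return _INTERNAL_LINK.search(wikitext) is not None


def estimate_word_count(wikitext, glyphs_per_word=GLYPHS_PER_WORD):
    """Count whitespace-separated words, converting glyphs to words."""
    text = _HEADING.sub(" ", wikitext)   # strip section headers
    text = _TAG.sub(" ", text)           # strip HTML-like tags
    glyphs = len(_GLYPH.findall(text))
    words = len(_GLYPH.sub(" ", text).split())
    return words + round(glyphs / glyphs_per_word)
```

Note that this simple link test also counts file and category links, which a real article count would probably need to exclude.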

Thanks for the comment, Erik. I understand that this is hard, but stat1 has something like 8 times the CPUs and 10 times the RAM bayes had, and is mostly idle. While we wait for a permanent solution, having the stats updated every 2 or 3 years would still be very nice and a great improvement.

erikzachte wrote:

Nemo, again it comes down to backlog in coding. I can't run the full dump and partial dump concurrently: they would overwrite each other's files. For the largest dumps, one month is not even enough to run the full dump. I'll make a list of open items for dump scripts soon, so we can prioritize.

I see. Prioritizing is good: I'm trying to suggest things that don't add to the backlog; I know that just asking MOAR is stupid.
If making them progress at the same time is not possible without more coding, and another server is not available, then I say that delaying the normal updates for a month or two is an acceptable cost to pay in order to fill the last 2 or 3 years of blanks for the full stats.

I'm getting sick of this bug... Erik, if I run full counts on full dumps on my own, would I then have CSVs that you can use to fill the blanks on the main wikistats, at least for things like character count etc. (the ten empty columns in the main "Monthly counts & Quarterly rankings" tables)?
I don't think I'll be able to do it soon, but if there is a prospective concrete usage I may.

Trying to understand how this is set in the code (as part of bug 62566)...
It's currently a bit confusing:
If I understand correctly, almost all of it is controlled by the -e / edits_only flag, with some interaction with -u / reverts_only, plus some bits which have no configuration flag (yet) but are simply commented out in the code.

This bug is also addressed in bug 60826

More comments in bug 62566

Sj added a comment.Mar 24 2016, 11:25 PM

Current thoughts on this? I still would like to refer to these stats a few times a year. Nemo, did you figure out if this is something that could be run effectively on a mirror? I could get my local library to do this.

Also, I don't think @ezachte you meant 62566, that doesn't seem related.

Restricted Application added a subscriber: Aklapper. Mar 24 2016, 11:25 PM

Denied. Expect an announcement on Wikistats in the coming weeks, as soon as the migrated traffic reports (breakdown of browser and OS traffic data) are published.

Sj added a comment.Mar 25 2016, 11:23 AM

Thanks Erik.

Nemo, did you figure out if this is something that could be run effectively on a mirror?

Yes, it's not rocket science although a bit tedious. I have a functioning checkout of wikistats in and I documented the steps at T64566. I've only produced statistics for (and not in the last couple of years for lack of dumps).

I think running such things in a USA server, or in Labs, is better due to the extreme slowness of download from (T45647).

Restricted Application added a project: Analytics. Oct 25 2018, 7:09 PM
fdans added a subscriber: fdans.Oct 29 2018, 4:00 PM

@Nemo_bis hey, is there a list of metrics you have that we could maybe develop with the current infrastructure?

fdans moved this task from Incoming to Radar on the Analytics board.Oct 29 2018, 4:02 PM

@Nemo_bis hey, is there a list of metrics you have that we could maybe develop with the current infrastructure?

Some of them should be easy enough, like:

  • all-time most active unregistered editors,
  • all-time most edited articles.

I don't know about the monthly stats, such as:

  • number of articles above 200, 512 or 2048 bytes;
  • total number of bytes and words in the content namespaces;
  • total number of internal links, external links, image links.

The plots like (monthly contributors, active editors, very active editors, articles etc.) I suppose are already "for free", but I've never tried getting the charts for 10 years or more.