
Update board metrics deck with new content quantity and edit volume metrics
Closed, Resolved · Public


See this deck [Wikimedia Foundation internal].

Event Timeline

nshahquinn-wmf created this task.
nshahquinn-wmf moved this task from Triage to Next Up on the Product-Analytics board.

This looks like it will be harder than I thought. We want to start tracking some content metrics (namely the number of articles, number of media files, number of Wikidata entities, and number of Wikidata claims).

This is pretty easy, but we also want past versions of these numbers so we can look at trends. That is harder, because the MediaWiki databases don't store past page state in an easily accessible way. It should be mostly possible using the mediawiki_history tables, but it'll take more work.
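The mediawiki_history approach boils down to replaying page creation and deletion events and keeping a running total per month. A minimal sketch of that idea in Python (the event format and the function name are my own for illustration; the real data also tracks restores, namespace moves, and redirect status, all of which affect the official article count):

```python
from collections import Counter
from itertools import accumulate


def monthly_article_counts(events, months):
    """Return cumulative page counts per month from (month, delta) events.

    `events` is an iterable of (month, delta) pairs, where delta is +1 for
    a page creation and -1 for a deletion; `months` is the sorted list of
    months to report on. This is a simplified sketch of replaying history
    events to reconstruct past totals.
    """
    deltas = Counter()
    for month, delta in events:
        deltas[month] += delta
    # Running total: each month's count is the sum of all deltas so far.
    return dict(zip(months, accumulate(deltas[m] for m in months)))
```

For example, two creations in January and one deletion in February would yield `{"2018-01": 2, "2018-02": 1}`.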

@ezachte, @Erik_Zachte (I don't know which is the real one!): I'm working on producing these new metrics for our Board reporting, and I'm trying to get historical monthly data on the article counts for all our projects. This seems really tricky to calculate myself, but I think I can get it from Wikistats instead :)

Where can I find that data in machine-readable format? So far I've found two dumps folders that seem like they might be right: other/pagecounts-ez/wikistats and other/wikistats_1, but I'm not sure which files have what I need.

@ezachte, I've found the location of the data I want within the Wikistats dumpfiles, but I'm still not sure which folder is the canonical one.

other/wikistats_1/ has data through March 2018, but the data on Wikidata articles is broken (it shows the number as exactly flat over the past year). other/pagecounts-ez/wikistats has more plausible data on Wikidata articles, but it only goes through February 2018.

Which is the right one, and is there a place I can find the April data, which is already available on the website?

This is mostly done! I've heavily overhauled the calculation pipeline and created a notebook for calculating the monthly metrics table.

Remaining tasks:

  • Figure out where and when we can get updated data from Wikistats 1 (@Milimetric, do you have any advice here?)
  • Create a notebook to produce metric graphs (eliminating the need for sparklines)
  • Update on-wiki documentation to point at my Git repo rather than the spreadsheet of doom.


Aggregating data over all wikis and all projects has been done for the (now defunct) Report Card.
I just adapted that script to the new server environment (on stat1005) and ran it for the first time in a year.
Please see the attached CSV file.

It only shows the largest wikis for each metric (plus the overall totals, which you asked for), and only the last 24 months.
If you need more wikis or months I can tweak the script, but that's not entirely trivial.
Is this for a one-time verification check?

I missed your earlier comment on other/wikistats_1/; I will look into that right now.


The zip&publish step had not run yet. New zips are now online.

StatisticsMonthly.csv has data for Wikidata up until April 2018, as expected.
If you find an anomaly elsewhere, please let me know.

Oh, by the way, the path other/pagecounts-ez/wikistats is obsolete and should be replaced by a redirect.
During the recent server migration we fixed the path, as it was counterintuitive (these CSV files have nothing to do with pagecounts/pageviews).

Thanks @ezachte! That helps a lot. I'm using the article count data as part of our monthly metrics, so I want it on an ongoing basis. Happily, I've figured out how to use the zipped CSVs, so I don't need anything new 😁
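For anyone else consuming these, pulling a CSV out of one of the zips is straightforward with the Python standard library. A minimal sketch (the member name is illustrative; I haven't reproduced the actual Wikistats column layout, which is positional and undocumented here):

```python
import csv
import io
import zipfile


def read_zipped_csv(zip_source, member):
    """Yield rows from a CSV file stored inside a zip archive.

    `zip_source` may be a path or a file-like object; `member` is the
    name of the CSV file inside the archive.
    """
    with zipfile.ZipFile(zip_source) as zf:
        with zf.open(member) as f:
            # Wrap the binary stream so csv.reader gets text.
            yield from csv.reader(io.TextIOWrapper(f, encoding="utf-8"))
```

Iterating over `read_zipped_csv("StatisticsMonthly.zip", "StatisticsMonthly.csv")` would then yield one list of field strings per row, without extracting the archive to disk.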

The only question I have is what the normal time to publication of the zip files is. It seems to be about a day after the underlying dumps complete (which I know I can track on your personal site): for example, the final April dump to finish was Wikidata's, on 28 May, and the zip files showed up on 29 May (today).

@Neil_P._Quinn_WMF: for Wikistats, a special cycle to generate stub dumps starts at the beginning of each new month. This takes a few days, with wp:en, Commons, and Wikidata being the slowest.
Wikistats does indeed process a dump soon after completion, but each dump only once per month.

In the report you mention, the section 'Dump jobs per start date' reports on the latest dump available for every wiki, even if that wiki was already processed this month (relevant for reruns after a very rare bug fix in the dump scripts).

I mostly check this variation of the report, which now runs from a Wikimedia server (but doesn't report on dump age).
The subscripted number is 'days ago'; the color is 'last month processed'. See the 'Legend' section.

Vvjjkkii renamed this task from Update board metrics deck with new content quantity and edit volume metrics to t2caaaaaaa.Jul 1 2018, 1:10 AM
Vvjjkkii reopened this task as Open.
Vvjjkkii removed nshahquinn-wmf as the assignee of this task.
Vvjjkkii removed Due Date.
Vvjjkkii updated the task description. (Show Details)
Tbayer renamed this task from t2caaaaaaa to Update board metrics deck with new content quantity and edit volume metrics.Jul 2 2018, 10:23 PM
Tbayer removed a subscriber: CommunityTechBot.