
Provide total active editors for December 2014
Closed, Resolved · Public

Description

For the first edition of the Foundation's new quarterly report (2014/15 Q2, due to be published on February 15), we are still going to use the old active editor definition (per discussion with Toby), i.e. we need the total active editor numbers for October, November and December 2014.

Currently December is still missing at [1] and [2], and I don't recall what the monthly update schedule is. Would it be possible to provide this by next week, say February 12? It would be good to leave some extra room in case T87738 unearths any existing data quality issues.

[1] http://reportcard.wmflabs.org/graphs/active_editors
[2] https://stats.wikimedia.org/EN/TablesWikimediaAllProjects.htm

Event Timeline

Tbayer raised the priority of this task from to Needs Triage.
Tbayer updated the task description.

Hi Tilman --

Unfortunately we are having trouble with the dumps; they are running 6
weeks behind per Erik Z. We're working with ops to get an ETA.

-Toby

Thanks, Toby!

We also need the December number for these two:
http://reportcard.wmflabs.org/graphs/edits
http://reportcard.wmflabs.org/graphs/new_editors

I'll assume for now that they depend on the same dump and therefore will be provided in the same update once the dump is available, so I'm not filing a separate request.

Big thanks to @Joe for getting the dumps back on track. Per @ezachte, this unfortunately won't be enough to hit our February 15 publication deadline for the quarterly report, since all dumps need to be fully processed before even the necessary revision metadata is available (a separate longstanding issue).

@Tnegrin, @ezachte, @DarTar, @kevinator - what are our options here? We have a working AE definition in WikiMetrics; can we get October-December numbers through SQL for the first report? As with anything, it's a cost/benefit question -- getting the December numbers quickly is not worth the whole analytics team's time, but if we can do it with a couple of days of effort by the Feb 15 deadline, it would make our first quarterly report significantly more useful.

I can't comment on SQL approach.

As for dump-based data: I'm not totally pessimistic about how long it will take to generate the missing dumps. Most dumps for December are still missing, but most *large* dumps exist. The largest one still missing ranks #15 in size: Vietnamese.

At http://stats.wikimedia.org/WikiCountsJobProgressCurrent.html, if you scroll down to Job Statistics, only the ones in orange still need the December update. There are a lot of those, but many are really small. To speed up that dump queue, maybe an extra server could be plugged in; Ariel sometimes added servers when there was a bottleneck.

Caveat on the previous comment: this assumes the new cycle picks up the oldest dumps in a particular queue first, which is how it is supposed to work. I'm not sure why nlwiki is currently processing, then. http://dumps.wikimedia.org/backup-index.html

Reproducing the wikistats definition in SQL is challenging and we know that AE data generated from the databases will produce discrepancies with our official numbers. We also never implemented an equivalent of TAE in Vital Signs, only per-project rolling active. My main concern is that we won't have a publicly citable source we can point people to if we generate ad hoc data and cite it in the report.

I'd like to understand if @ezachte's proposal of fast-tracking dump generation for specific, smaller projects is viable. @ArielGlenn ?

@ezachte @DarTar @ArielGlenn any update on whether we can realistically get this done by end of this week, one way or another?

@Eloquence: as per https://phabricator.wikimedia.org/T85970, we were wildly optimistic about the dump processes catching up in limited time after the restart (and no extra servers were allocated).

I parsed all dump status reports to find trends and (hopefully soon) will be able to detect anomalies early on. It turns out wikidata is taking 15-20 days now. It may still be in the same queue as the small wikis, thus holding up updates for the majority of (small) wikis.

Switching to plan B:
I will run a partial TAE report for those wikis which have December stats available, which includes the 15 largest wikis.
Then from that derive a month-over-month (MoM) change and apply it to the full TAE for November.
I need to make some script patches so that missing data for large wikis does not prevent running TAE processing for recent months.
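The extrapolation in plan B can be sketched as follows. This is only an illustration of the arithmetic, not Erik's actual script; the function name and all numbers are made up for the example:

```python
def estimate_total(prev_total, prev_partial, curr_partial):
    """Estimate the current month's total active editors (TAE) by
    scaling last month's full total with the month-over-month change
    observed on the subset of wikis that already have current data.

    prev_total   -- full TAE for the previous month (all wikis)
    prev_partial -- previous-month TAE summed over the available wikis
    curr_partial -- current-month TAE summed over the same wikis
    """
    mom = curr_partial / prev_partial  # month-over-month ratio on available wikis
    return prev_total * mom

# Hypothetical figures: November full TAE = 80,000; on the wikis that
# already have December dumps (incl. the 15 largest), Nov = 60,000
# and Dec = 57,000, i.e. a 5% MoM decline.
print(round(estimate_total(80_000, 60_000, 57_000)))  # -> 76000
```

The estimate inherits whatever bias the available wikis have: if small wikis trend differently from the top 15, the scaled number will be off accordingly, which is presumably why the published figure was marked "est.".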

So here is an approximation based on available data for December (including the top 15 Wikipedias).

Thanks again, ErikZ! I used this (marked as "est.") in the published scorecard.