Track number of Wikidata edits by namespace
Closed, Resolved (Public)

Description

This will be useful not only for Schemas, but also for other namespaces. Can probably be done as a daily cron job: count the number of revisions in the namespaces 0 (item), 120 (property), 146 (lexeme), tbd (schema), and the corresponding talk namespaces over the previous 24 hours and write that to Graphite.
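A minimal sketch of the counting step described above, assuming the namespace ID of each revision from the last 24 hours is already at hand. The metric prefix `daily.wikidata.edits.namespace` and the function name are hypothetical, and the talk namespace IDs assume the usual MediaWiki convention of subject ID plus one. The schema namespace is left out because its ID was still to be decided. Output lines follow Graphite's plaintext protocol (`path value timestamp`).

```python
from collections import Counter

# Subject namespace IDs are taken from the task description; the talk IDs
# assume the MediaWiki convention "talk = subject + 1".
NAMESPACES = {
    0: "item",
    1: "item_talk",
    120: "property",
    121: "property_talk",
    146: "lexeme",
    147: "lexeme_talk",
}

# Hypothetical prefix, not necessarily what the real script writes.
METRIC_PREFIX = "daily.wikidata.edits.namespace"


def graphite_lines(revision_namespaces, timestamp):
    """Turn an iterable of namespace IDs into Graphite plaintext lines.

    Namespaces that are not tracked are ignored; tracked namespaces with no
    edits are reported as 0 so every series gets a data point each day.
    """
    counts = Counter(ns for ns in revision_namespaces if ns in NAMESPACES)
    return [
        f"{METRIC_PREFIX}.{name} {counts.get(ns, 0)} {timestamp}"
        for ns, name in NAMESPACES.items()
    ]
```

Each returned line could then be sent to Graphite as is, one per metric.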

It should be added as a new panel to the Wikidata Edits dashboard (which currently distinguishes between user/bot/anonymous/etc., but not between item/property/etc.).

Event Timeline

That script uses the API with action=query&list=recentchanges, which doesn’t directly give us the namespace. We could try to parse it from the title; get the page IDs and ask for their namespaces in a separate query; or perhaps switch to SQL.

Is there a reason why that script doesn’t use SQL in the first place, by the way?
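For comparison, the SQL route could look roughly like the sketch below. It assumes the MediaWiki `recentchanges` table with its `rc_namespace` and `rc_timestamp` columns (timestamps stored as `YYYYMMDDHHMMSS` strings). An in-memory SQLite table stands in for the real database so the example can run on its own, and the function names are hypothetical. A real query would probably also need to exclude non-edit rows such as log entries, which is not shown here.

```python
import sqlite3

# Count rows per namespace in a half-open time window [start, end).
QUERY = """
SELECT rc_namespace, COUNT(*) AS edits
FROM recentchanges
WHERE rc_timestamp >= ? AND rc_timestamp < ?
GROUP BY rc_namespace
ORDER BY rc_namespace
"""


def count_by_namespace(conn, start, end):
    """Return {namespace_id: edit_count} for the given timestamp window."""
    return dict(conn.execute(QUERY, (start, end)).fetchall())


def demo_connection():
    """Build a tiny stand-in table with made-up rows (illustration only)."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE recentchanges (rc_namespace INTEGER, rc_timestamp TEXT)"
    )
    conn.executemany(
        "INSERT INTO recentchanges VALUES (?, ?)",
        [
            (0, "20190331235959"),    # before the window
            (0, "20190401000500"),
            (0, "20190401235959"),
            (120, "20190401120000"),
            (146, "20190402000000"),  # exactly at the end, so excluded
        ],
    )
    return conn
```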

Change 500752 had a related patch set uploaded (by Lucas Werkmeister (WMDE); owner: Lucas Werkmeister (WMDE)):
[analytics/wmde/scripts@master] WIP: count number of Wikidata edits by namespace

https://gerrit.wikimedia.org/r/500752

Reading about this: would delayed data be interesting? This information is accessible in Hadoop :)

Change 500752 merged by jenkins-bot:
[analytics/wmde/scripts@master] Count number of Wikidata edits by namespace

https://gerrit.wikimedia.org/r/500752

Change 502169 had a related patch set uploaded (by Hoo man; owner: Lucas Werkmeister (WMDE)):
[analytics/wmde/scripts@production] Count number of Wikidata edits by namespace

https://gerrit.wikimedia.org/r/502169

Well, we currently run the cron job at 3AM each day and have it check the recent changes from midnight yesterday to midnight today (all UTC), so it’s already delayed by three hours. Would using Hadoop be advantageous to us? I’m not sure if any of the existing scripts use it.
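The "midnight yesterday to midnight today" window can be sketched as follows. The function names are hypothetical, and the sketch assumes the caller passes a timezone-aware `datetime`. The second helper formats a boundary in the 14-digit MediaWiki timestamp style.

```python
from datetime import datetime, time, timedelta, timezone


def previous_day_window(now):
    """Return (start, end) of the last complete UTC day before `now`.

    `now` is expected to be timezone-aware; it is converted to UTC first,
    so a run at 03:00 UTC yields yesterday 00:00 to today 00:00.
    """
    today = now.astimezone(timezone.utc).date()
    today_midnight = datetime.combine(today, time.min, tzinfo=timezone.utc)
    return today_midnight - timedelta(days=1), today_midnight


def mw_timestamp(dt):
    """Format a datetime as a YYYYMMDDHHMMSS string."""
    return dt.strftime("%Y%m%d%H%M%S")
```

Treating the window as half-open (start inclusive, end exclusive) keeps an edit made exactly at midnight from being counted on two consecutive days.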

Some queries for Wikidata are computed using Hadoop (see https://github.com/wikimedia/analytics-refinery/tree/master/oozie/wikidata). If SQL over recent changes works for you, that's great :)

I don’t think it’s deployed yet (see https://gerrit.wikimedia.org/r/502169 above). And then we’ll need to add the new metrics to some Grafana board.

Should this go back to some “Doing” column until the deployment is done? Though I guess Shape Expressions Sprint 5 would no longer be the appropriate project.

Change 502169 merged by jenkins-bot:
[analytics/wmde/scripts@production] Count number of Wikidata edits by namespace

https://gerrit.wikimedia.org/r/502169

The Grafana board exists now, but there are still things to be done:

  • How do I assign better labels to the series?
  • Is a log-10 Y-axis useful here?
  • Is 12 weeks a useful time range?
  • How do I assign better labels to the series?

Take a look at the alias functions.
The one you probably want is aliasSub, which is basically find and replace.
If you still have the full metric name, you probably also want to show only the last node, I imagine.
On the Wikidata edits dashboard the oauth panel has an example of these things.
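To illustrate what such a find-and-replace alias does to a series label, here is a small Python sketch. It only mimics the effect with `re.sub`; it is not Graphite code, the metric name is hypothetical, and the helper names are made up for this example.

```python
import re


def alias_sub(name, search, replace):
    """Mimic a regex find-and-replace on a series name."""
    return re.sub(search, replace, name)


def last_node(name):
    """Keep only the last dot-separated node of a metric name."""
    return name.rsplit(".", 1)[-1]


# Hypothetical full metric name as it might appear in a panel legend.
full_name = "daily.wikidata.edits.namespace.property_talk"

# Both approaches reduce the label to the final node of the metric path.
short_by_regex = alias_sub(full_name, r"^.*\.(\w+)$", r"\1")
short_by_node = last_node(full_name)
```

In a Grafana panel the same idea would be expressed through the alias functions in the query editor rather than in Python.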

This dashboard panel should probably use the time range selected for the dashboard as a whole.
This will allow users of the dashboard to easily look at different ranges of data for the panel without needing someone to actually edit the dashboard.

I'm aware this is currently only daily data, but it probably still makes sense to have it all line up.
Thoughts, @Lydia_Pintscher?

Done, and I also switched the default time range of the dashboard to 7 days.