Page MenuHomePhabricator

track number of editors from other Wikimedia projects who also edit on Wikidata over time
Closed, ResolvedPublic

Description

We'd like to know how many editors who edit on Wikidata also edit on one of the other Wikimedia projects and which ones those are. This will help us get a better understanding where the editors are coming from inside Wikimedia and where we still have a lot of work to do for better integration. Someone counts as editing on a project if they've made at least one edit in the given month on Wikidata and that project. The numbers should be broken down by language (for Wikipedia) and project.

Preliminary work for Wikipedia: https://docs.google.com/spreadsheets/d/1RpTgMpxyFzl9dFyCM7lQDFzy-B8eiA_pKOjQXW1_7VQ/edit?ts=5acc8a4f#gid=1729864649

Review:

Dashboard:
https://grafana.wikimedia.org/dashboard/db/wikidata-co-editors

Event Timeline

There are a very large number of changes, so older changes are hidden. Show Older Changes

Also, the event_timestamp doesn't fit, right?

We want to know these 2 things:

  1. How many co editors we had in Q2 2017 (broken by wikipedia)

P7278

  1. How many co editors we had in Q2 2018 (broken by wikipedia)

P7282

  1. How many co editors we had since 01.01.2017 until Q2 including

P7283

Jonas moved this task from Peer Review to Ready to estimate on the Wikidata-Campsite board.
Jonas removed a project: User-Ladsgroup.

@Ladsgroup what were the queries you used to generate the data in the above comment?

T193641#4304748 has a couple of issues:

  • You looking at data from 2017 (not sure if that was intended or not), the example query does indeed have 2017 in it.
    • I now realize that this is probably fine in the queries you ran in the above comment
  • The snapshots don't match, one is using 2018-05 and the other 2018-2, these should probably match, Analytics and @JAllemandou would be able to clear up this. I'm also not sure where we can get the ID of the latest snapshot either.

11:54 AM <joal> Joseph Allemandou addshore: By convention, it's YYYY-MM, but there is no automated place to get it except asking hive for partitions (or look in folder availability)
11:54 AM <addshore> ack
11:54 AM so generally can one assume that each month there will be a new one?
11:54 AM <joal> Joseph Allemandou addshore: That's what we do yes
11:54 AM <addshore> thanks!
11:55 AM <joal> Joseph Allemandou But it arrives in thew middle of the month addshore, about the 10th

So 2018-02 is created in march and will contain data for all of feb and possibly some of march

@Ladsgroup what were the queries you used to generate the data in the above comment?

To get answer for the first question:

WITH
wikipedias AS (
  SELECT
    DISTINCT dbname
  FROM wmf_raw.mediawiki_project_namespace_map
  WHERE snapshot = '2018-05'
    AND hostname LIKE '%wikipedia.org'
),

wikidata_editors AS (
  SELECT
    DISTINCT event_user_text
  FROM wmf.mediawiki_history
  WHERE snapshot = '2018-05'
    AND event_entity = 'revision'
    AND event_type = 'create'
    AND wiki_db = 'wikidatawiki'
    AND NOT event_user_is_anonymous
    AND NOT ARRAY_CONTAINS(event_user_groups, 'bot')
    AND event_timestamp RLIKE '^2017-0[45].*'
)

SELECT
  mwh.wiki_db,
  COUNT(DISTINCT mwh.event_user_text) as wikidata_coeditors
FROM wmf.mediawiki_history mwh
  JOIN wikipedias w ON (mwh.wiki_db = w.dbname)
  JOIN wikidata_editors wde ON (mwh.event_user_text = wde.event_user_text)
WHERE snapshot = '2018-05'
  AND event_entity = 'revision'
  AND event_type = 'create'
  AND NOT mwh.event_user_is_anonymous
  AND NOT ARRAY_CONTAINS(mwh.event_user_groups, 'bot')
  AND mwh.event_timestamp RLIKE '^2017-0[45].*'
GROUP BY
  mwh.wiki_db
ORDER BY wikidata_coeditors DESC
LIMIT 1000;

To get answer to the second question

WITH
wikipedias AS (
  SELECT
    DISTINCT dbname
  FROM wmf_raw.mediawiki_project_namespace_map
  WHERE snapshot = '2018-05'
    AND hostname LIKE '%wikipedia.org'
),

wikidata_editors AS (
  SELECT
    DISTINCT event_user_text
  FROM wmf.mediawiki_history
  WHERE snapshot = '2018-05'
    AND event_entity = 'revision'
    AND event_type = 'create'
    AND wiki_db = 'wikidatawiki'
    AND NOT event_user_is_anonymous
    AND NOT ARRAY_CONTAINS(event_user_groups, 'bot')
    AND event_timestamp RLIKE '^2018-0[45].*'
)

SELECT
  mwh.wiki_db,
  COUNT(DISTINCT mwh.event_user_text) as wikidata_coeditors
FROM wmf.mediawiki_history mwh
  JOIN wikipedias w ON (mwh.wiki_db = w.dbname)
  JOIN wikidata_editors wde ON (mwh.event_user_text = wde.event_user_text)
WHERE snapshot = '2018-05'
  AND event_entity = 'revision'
  AND event_type = 'create'
  AND NOT mwh.event_user_is_anonymous
  AND NOT ARRAY_CONTAINS(mwh.event_user_groups, 'bot')
  AND mwh.event_timestamp RLIKE '^2018-0[45].*'
GROUP BY
  mwh.wiki_db
ORDER BY wikidata_coeditors DESC
LIMIT 1000;

And to answer the third question:

WITH
wikipedias AS (
  SELECT
    DISTINCT dbname
  FROM wmf_raw.mediawiki_project_namespace_map
  WHERE snapshot = '2018-05'
    AND hostname LIKE '%wikipedia.org'
),

wikidata_editors AS (
  SELECT
    DISTINCT event_user_text
  FROM wmf.mediawiki_history
  WHERE snapshot = '2018-05'
    AND event_entity = 'revision'
    AND event_type = 'create'
    AND wiki_db = 'wikidatawiki'
    AND NOT event_user_is_anonymous
    AND NOT ARRAY_CONTAINS(event_user_groups, 'bot')
    AND event_timestamp RLIKE '^201[78].*'
)

SELECT
  mwh.wiki_db,
  COUNT(DISTINCT mwh.event_user_text) as wikidata_coeditors
FROM wmf.mediawiki_history mwh
  JOIN wikipedias w ON (mwh.wiki_db = w.dbname)
  JOIN wikidata_editors wde ON (mwh.event_user_text = wde.event_user_text)
WHERE snapshot = '2018-05'
  AND event_entity = 'revision'
  AND event_type = 'create'
  AND NOT mwh.event_user_is_anonymous
  AND NOT ARRAY_CONTAINS(mwh.event_user_groups, 'bot')
  AND mwh.event_timestamp RLIKE '^201[78].*'
GROUP BY
  mwh.wiki_db
ORDER BY wikidata_coeditors DESC
LIMIT 1000;

T193641#4304748 has a couple of issues:

  • You looking at data from 2017 (not sure if that was intended or not), the example query does indeed have 2017 in it.

That was for comparison.

    • I now realize that this is probably fine in the queries you ran in the above comment
  • The snapshots don't match, one is using 2018-05 and the other 2018-2, these should probably match, Analytics and @JAllemandou would be able to clear up this. I'm also not sure where we can get the ID of the latest snapshot either.

I fixed it in the next run but at the end it doesn't matter much, there hasn't been lots of wikipedias created in Q2 of 2018 and they are really small to contribute much to the whole stats.

Change 443069 had a related patch set uploaded (by Jonas Kress (WMDE); owner: Jonas Kress (WMDE)):
[analytics/refinery/source@master] Track number of editors from other Wikimedia projects who also edit on Wikidata over time

https://gerrit.wikimedia.org/r/443069

Vvjjkkii renamed this task from track number of editors from other Wikimedia projects who also edit on Wikidata over time to esdaaaaaaa.Jul 1 2018, 1:14 AM
Vvjjkkii removed Jonas as the assignee of this task.
Vvjjkkii triaged this task as High priority.
Vvjjkkii updated the task description. (Show Details)
Vvjjkkii removed subscribers: gerritbot, Aklapper.
AfroThundr3007730 renamed this task from esdaaaaaaa to track number of editors from other Wikimedia projects who also edit on Wikidata over time.Jul 1 2018, 6:18 AM
AfroThundr3007730 assigned this task to Jonas.
AfroThundr3007730 raised the priority of this task from High to Needs Triage.
AfroThundr3007730 updated the task description. (Show Details)
AfroThundr3007730 added subscribers: GerritBot, Aklapper.

Change 443370 had a related patch set uploaded (by Jonas Kress (WMDE); owner: Jonas Kress (WMDE)):
[operations/puppet@production] Add monthly storage schema for graphite

https://gerrit.wikimedia.org/r/443370

Change 443409 had a related patch set uploaded (by Jonas Kress (WMDE); owner: Jonas Kress (WMDE)):
[analytics/refinery@master] [WIP] Introduce oozie job to schedule generating metrics for Wikidata co-editors

https://gerrit.wikimedia.org/r/443409

Change 443069 merged by jenkins-bot:
[analytics/refinery/source@master] Track number of editors from Wikipedia who also edit on Wikidata over time

https://gerrit.wikimedia.org/r/443069

Change 443370 abandoned by Jonas Kress (WMDE):
Add monthly storage schema for graphite

https://gerrit.wikimedia.org/r/443370

Change 443409 merged by Joal:
[analytics/refinery@master] Introduce oozie job to schedule generating metrics for Wikidata co-editors

https://gerrit.wikimedia.org/r/443409

We should be able to backfill the data for this since wikidata came to be.
Is this something we want to do @Lydia_Pintscher ?

Though thinking about it it might get a bit weird for the time before we had unified logins for all wikis?

We said we also want this for the other projects. So far it's only Wikipedia. Is there anything preventing us from doing the other ones as well?

Should work as well.
Do you have a list of projects you are also interested in?
What kind of charts would you like to have?

Addshore triaged this task as Medium priority.Sep 19 2018, 7:02 AM

It looks like the dashboard only shows data until June currently.
Is the reason we are missing more recent months because the data just isn't there yet or is there something wrong with the job?

Mh, it seems like the job is not running @JAllemandou ?

1:25 PM <addshore> joal: any idea about https://phabricator.wikimedia.org/T193641#4614082 ?
1:30 PM <joal> addshore: job seems successfully running - something else must be going on
1:43 PM <joal> addshore: just triple checked the query on Spark - Looks correct - The issue is somewhere else (maybe oozie config)
1:43 PM <joal> addshore: just triple checked the query on Spark - Looks correct - The issue is somewhere else (maybe oozie config)

Jobs have been successful for the past months. However rerunning the jobs manually made the data-points appear. This is very bizarre.
Let's keep this open and monitor next month.

Will close for now and mark as down.
If I notice the data missing the future I'll re open

WMDE-leszek subscribed.

So it looks https://grafana.wikimedia.org/dashboard/db/wikidata-co-editors is not showing any data after August, 1st, which is not good, I believe.

@JAllemandou any idea what could be going wrong here?

Thanks for raising the issue. This is very bizarre.
The job for october was showing successful in our side. I reran it, and data showed up :(
I have the feeling this is not the first time this happens, something must be wrong somewhere.
I'am also going to run backfilling info.

[Resetting assignee as the assignee user account is inactive.]

I don't remember when this is set to run @JAllemandou , but I see no data for Nov yet?

Same exact problem as last month: job has run, but no data is present :(
More investigations needed, probably early next year.

@JAllemandou: as at WMDE we are in the need for the December data (Dec, 1st in particular), could you maybe please do the magic and re-run the job and backfill the data again? Thanks in advance.

Hi @WMDE-leszek - core data has not been computed et (usually done around the 9th of the following month).
I'll be sure to have an eye on data showing up for month 12 and rerun the job if needed.

Change 482664 had a related patch set uploaded (by Joal; owner: Joal):
[analytics/refinery@master] Update wikidata-coeditor job data dependency

https://gerrit.wikimedia.org/r/482664

Change 482664 merged by Mforns:
[analytics/refinery@master] Update wikidata-coeditor job data dependency

https://gerrit.wikimedia.org/r/482664

Change 482692 had a related patch set uploaded (by Joal; owner: Joal):
[analytics/refinery@master] Correct wikidata-coeditors data dependencies (2nd)

https://gerrit.wikimedia.org/r/482692

Change 482692 merged by Mforns:
[analytics/refinery@master] Correct wikidata-coeditors data dependencies (2nd)

https://gerrit.wikimedia.org/r/482692

Bug found and corrected (patches above).
Data is available now and the rerun problem should be solved.

I confirm the fix :) Closing this task.