Page MenuHomePhabricator

Investigate how to identify our code contributors to on-wiki code (gadgets, templates, modules) on a WMF site
Closed, ResolvedPublic

Description

We have Git repository statistics on https://wikimedia.biterg.io/ but currently have no idea about our on-wiki developers.

Potentially related links:

When I once asked anomie about getting usernames of most active Lua template editors across WMF sites, his reply was

The best you might do would be to look at the top editors in the Module namespace on different wikis, although that might as well catch people who make lots of edits to some test module because they have no idea what they're doing or (on smaller wikis) people who copy-paste modules from other wikis.

Event Timeline

Aklapper created this task.

Well, quick and dirty theoretical and untested idea, for poor people without production shell access:

  • Assuming the MediaWiki: namespace is always 8 (if it is not, got to query first via API:Siteinfo)
  • Get a list of all MediaWiki:Gadget-* pages on a site: curl -s "https://xx.wikipedia.org/w/api.php?action=query&list=allpages&apfrom=Gadget&apto=Gadgeu&aplimit=500&apnamespace=8&apprefix=Gadget-" | jq -r '.. |."title"? | select(. != null)' > textfile1
  • Only get the actual JS pages: grep '.js' textfile1 > textfile2
  • Cannot query revisions for several pages (titles) in one request. So for each page, list the authors of its revisions since 2015 (see API:Revisions)
  • Install jq and iterate over each line in textfile2 via curl -s "https://xx.wikipedia.org/w/api.php?action=query&prop=revisions&titles=MediaWiki:Gadget-SomeGadgetNameFromTextfile2.js&rvprop=user&formatversion=2&rvlimit=max&rvend=20150101000000" | jq -r '.. |."user"? | select(. != null)' >> textfile3 to get the author of each edit on that gadget code since 2015.
  • Run sort textfile3 | uniq -c
  • Profit.

If you want to run this across several WMF sites, use the sitematrix to iterate.

Aklapper renamed this task from Investigate how to identify our code contributors to on-wiki code (gadgets, templates, modules) across WMF sites to Investigate how to identify our code contributors to on-wiki code (gadgets, templates, modules) on a WMF site.Mar 22 2018, 12:47 PM
Aklapper closed this task as Resolved.
Aklapper claimed this task.

Works for me:

1#/bin/sh
2# List the most active gadget editors on a Wikimedia site
3# This code is licensed under CC0 1.0 Universal: https://creativecommons.org/publicdomain/zero/1.0/legalcode
4
5WEBSITE=cs.wikipedia.org
6DEADLINE=20150101000000 # how far to go back; must be YYYYMMDDHHMMSS
7
8echo "Scanning author names for gadgets on $WEBSITE since $DEADLINE"
9# Get a list of all MediaWiki:Gadget-* pages on a site:
10curl -s "https://$WEBSITE/w/api.php?action=query&list=allpages&apfrom=Gadget&apto=Gadgeu&aplimit=500&apnamespace=8&apprefix=Gadget-&format=json" | jq -r '.. |."title"? | select(. != null)' > textfile1
11sleep 5
12# TODO: Must check for 'continue' value and implement support, as the API only offers 500 results due to pagination
13# Only get the actual JS pages, no translations etc:
14grep -o '.*\.js$' textfile1 > textfile2
15sleep 5
16# Cannot query revisions for several pages (titles) in one request. So for each page, list the authors of its revisions since 2015 (see API:Revisions)
17# Install jq and iterate over each line in textfile2 to get the author of each edit on that gadget code since 2015. Drop "-r" if you want each name encapsulated in apostrophes.
18while read gadget; do
19 echo "Processing $gadget"
20 curl -s "https://$WEBSITE/w/api.php?action=query&prop=revisions&titles=$gadget&rvprop=user&formatversion=2&rvlimit=max&rvend=$DEADLINE&format=json" | jq -r '.. |."user"? | select(. != null)' >> textfile3
21 sleep 5
22done <textfile2
23sort textfile3 | uniq -c | sort -bnr > output.txt
24rm textfile1 textfile2 textfile3
25more output.txt
26echo "DONE! See output.txt"

This is great, thanks!
Questions: Above you write "use the sitematrix to iterate" - what does that mean in practice? For example, if I want to get results from all the French language projects, is there something simple we can add to the script to make it automatically run across the 8 French projects, so that output.txt would have 8 separate lists? (or combined, whichever is easier) And/or something we can add, to make it run across all the Wiktionaries?

Hi @Quiddity, thank you for your interest. Our new limited freemium model offer allows you to request features for a price you cannot beat! I wonder whether to accept some command line argument (like ./foo.sh "fr."), pulling the sitematrix and extracting the hostnames, and then match hostnames on this $1 argument.
Quick and dirty, you could wrap the code in a for WEBSITE in fr.wikipedia.org fr.wiktionary.org fr.wikisource.org; do ... loop.

Note to myself: AFAIK Quarry does not yet allow running cross-site queries.
https://quarry.wmflabs.org/query/17556 by halfak offers a similar query for the Module namespace, on a single wiki though:

USE enwiki_p;
SELECT rc_user_text, COUNT(*) 
FROM recentchanges_userindex 
WHERE 
    rc_namespace = 828 AND  -- Module namespace
    rc_type IN (0, 1)       -- Change is an edit of some sort
GROUP BY rc_user_text 
ORDER BY COUNT(*) DESC 
LIMIT 100;

Note to myself: AFAIK Quarry does not yet allow running cross-site queries.

That is covered in T95582: it would be useful to run the same Quarry query conveniently in several database