
Technical contributors emerging communities metric definition, thick data
Closed, ResolvedPublic

Assigned To
Authored By
Nuria
Apr 15 2020, 2:51 PM

Description

We have explored two definitions of emerging communities: one based on the number of technical contributors, and another based on the supporting (automated) work around article creation:

  • communities with fewer than 5 technical contributors
  • the percentage of edits to a given wiki that come from tools/bots written for and running in the cloud environment. For emerging communities this number would be between 0 and 5% (for healthy wikis it is 10%). On some wikis all of this work is done manually, so the share of 'automated' edits is 0%.

Neither of these two 'definitions' involves large amounts of data (thus the 'thick data' label).
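
As a rough illustration, the two working definitions can be written as simple predicates. This is a minimal sketch, not an agreed metric: the `WikiStats` fields and the function names are hypothetical, and only the thresholds (fewer than 5 technical contributors, at most 5% of edits from tools/bots) come from the description above.

```python
from dataclasses import dataclass

# Thresholds taken from the task description; everything else is illustrative.
MAX_TECH_CONTRIBUTORS = 5        # "less than 5 technical contributors"
MAX_AUTOMATED_EDIT_SHARE = 0.05  # automated edits "between 0 and 5%"


@dataclass
class WikiStats:
    """Hypothetical per-wiki summary for one month."""
    wiki_db: str
    technical_contributors: int
    total_edits: int
    tool_or_bot_edits: int


def automated_edit_share(stats):
    """Fraction of edits made by tools/bots; 0.0 when there are no edits."""
    if stats.total_edits == 0:
        return 0.0
    return stats.tool_or_bot_edits / stats.total_edits


def is_emerging_by_contributors(stats):
    """Definition 1: fewer than 5 technical contributors."""
    return stats.technical_contributors < MAX_TECH_CONTRIBUTORS


def is_emerging_by_automation(stats):
    """Definition 2: automated edits make up at most 5% of all edits."""
    return automated_edit_share(stats) <= MAX_AUTOMATED_EDIT_SHARE
```

The two predicates are kept separate on purpose, since a wiki may qualify under one definition and not the other.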

Event Timeline

ping @srishakatux so she is subscribed to this ticket

nshahquinn-wmf added a subscriber: nshahquinn-wmf.

It looks like Product-Analytics is tagged just for information. If this is a request for us to do something, please move it back to the "triage" column on our workboard.

Given that the emerging communities are of a very different nature, it is very likely that we need to think of a categorization scheme. Some categories to think about (just examples): "wikis that started a while back whose community has not grown", "wikis that share a technical community" (example: the many Indian-language Wikipedias that benefit from a pool of common editors/developers), "wikis that are new (less than a year old)".

cc @Bmueller

communities with less than 5 technical contributors

If technical contributor means bot editor, then we can count bot editors per country or per wiki. Our monthly active editors metric is defined as "the number of registered users who made at least 5 content edits across all projects in the given month" (source). To decide whether we should adopt the threshold of 5 edits per month, I explored two ways: with the threshold of 5 edits per month per country, and without a threshold. The data is posted in the table. The SQL definitions are posted below.

# Query templates; fill in the month with query.format(SNAPSHOT='YYYY-MM').
# Every query counts distinct (wiki_db, editor) pairs per country and
# skips rows whose country_code is the '--' placeholder.

# Editors with 1 or more edits during the month
bot_editor_query = '''
SELECT country_code,
       COUNT(DISTINCT CONCAT(wiki_db, user_fingerprint_or_id)) AS bot_editors
FROM wmf.editors_daily
WHERE month = '{SNAPSHOT}'
  AND size(user_is_bot_by) != 0
  AND country_code != '--'
GROUP BY country_code
'''
all_editors_query = '''
SELECT country_code,
       COUNT(DISTINCT CONCAT(wiki_db, user_fingerprint_or_id)) AS editors
FROM wmf.editors_daily
WHERE month = '{SNAPSHOT}'
  AND country_code != '--'
GROUP BY country_code
'''

# Editors with 5 or more edits during the month (summed per wiki and country)
bot_editor_5edits_query = '''
SELECT tmp.country_code,
       COUNT(DISTINCT CONCAT(tmp.wiki_db, tmp.user_fingerprint_or_id)) AS bot_editors
FROM (
    SELECT wiki_db, country_code, user_fingerprint_or_id, SUM(edit_count) AS edits
    FROM wmf.editors_daily
    WHERE month = '{SNAPSHOT}'
      AND size(user_is_bot_by) != 0
      AND country_code != '--'
    GROUP BY wiki_db, country_code, user_fingerprint_or_id
) AS tmp
WHERE tmp.edits >= 5
GROUP BY tmp.country_code
'''
all_editors_5edits_query = '''
SELECT tmp.country_code,
       COUNT(DISTINCT CONCAT(tmp.wiki_db, tmp.user_fingerprint_or_id)) AS editors
FROM (
    SELECT wiki_db, country_code, user_fingerprint_or_id, SUM(edit_count) AS edits
    FROM wmf.editors_daily
    WHERE month = '{SNAPSHOT}'
      AND country_code != '--'
    GROUP BY wiki_db, country_code, user_fingerprint_or_id
) AS tmp
WHERE tmp.edits >= 5
GROUP BY tmp.country_code
'''
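
For readers who want to reproduce the comparison, here is a minimal sketch of how the templates above could be filled in and how the bot and all-editor counts could be combined into a bot editor share per country. The helper names and the `(country_code, count)` row shape are assumptions for illustration; how the query is actually executed (Hive, Spark, etc.) is left out.

```python
def render_query(template, snapshot):
    """Fill the {SNAPSHOT} placeholder of a query template, e.g. '2020-04'."""
    return template.format(SNAPSHOT=snapshot)


def bot_editor_share(bot_rows, all_rows):
    """Combine two result sets into bot_editors / editors per country.

    Both arguments are iterables of (country_code, count) tuples, the assumed
    shape of the query results. Countries with no bot editors get a share of 0.
    """
    bots = dict(bot_rows)
    shares = {}
    for country, editors in all_rows:
        if editors > 0:
            shares[country] = bots.get(country, 0) / editors
    return shares


# Shortened stand-in for one of the templates above, only to show the call.
example_template = "SELECT country_code FROM wmf.editors_daily WHERE month = '{SNAPSHOT}'"
example_sql = render_query(example_template, "2020-04")
```

Running the same two steps for the 5-edit variants gives the second set of numbers to compare against.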

if you look at tools and bots written you can also look at the number of edits coming (in percentage) from tools/bots in cloud environment to that one wiki, this number would be between 0 and 5% (for healthy wikis is 10%) . In some cases for some wikis all this work is done manually and thus the number of 'automated' edits is 0%

We have a calculation of bot edits % in column O of this table, which measures a similar aspect, just calculated in a different way.

Milimetric moved this task from Incoming to Data Analysis on the Analytics board.

@Nuria, @Bmueller
As we discussed, I have calculated bot editors % per wiki project and added it to the table "Editors_by_wikis_2020-04".

We talked about grouping by project rather than by country, so that we can count bots located in labs that are making edits.

We also talked about knowing the % of bots that are located OUTSIDE the labs environment.

Basic question is "does this community have tooling?"

Some thoughts/hypotheses: the ratio of bots to active editors will be high in emerging communities but low in established communities.

It is worth thinking about edits by bots versus edits by humans, data for which is presented here: https://wmcs-edits.wmflabs.org/#wmcs-edits

Let's filter the data a bit further to see patterns that emerge:

  • let's remove projects with no bot edits/editors
  • let's remove projects with fewer than 5 active editors per month
  • let's remove projects with fewer than 15 edits per month (5 per editor)
  • let's filter out established wikis that have large numbers of active editors, so as to leave only emerging wikis
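
The four filters above can be sketched as a single pass over per-project rows. This is only an illustration: the dictionary keys are hypothetical column names, "no bot edits/editors" is interpreted as having neither, and the cutoff for "large numbers of active editors" is left as a required argument because no value is fixed at this point in the discussion.

```python
def filter_emerging_candidates(projects, max_active_editors,
                               min_active_editors=5, min_edits=15):
    """Keep projects that pass all four filters from the list above.

    Each project is a dict with the (assumed) keys 'wiki_db', 'bot_editors',
    'bot_edits', 'active_editors' and 'edits', all for one month.
    """
    kept = []
    for project in projects:
        has_bots = project["bot_edits"] > 0 or project["bot_editors"] > 0
        if not has_bots:
            continue  # no bot edits/editors
        if project["active_editors"] < min_active_editors:
            continue  # fewer than 5 active editors per month
        if project["edits"] < min_edits:
            continue  # fewer than 15 edits per month
        if project["active_editors"] > max_active_editors:
            continue  # established wiki with many active editors
        kept.append(project)
    return kept
```

Keeping the cutoffs as parameters makes it easy to rerun the filter with different thresholds and see how the remaining set of wikis changes.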

@Nuria, @Bmueller

Added edits to the table "Editors_by_wikis_2020-04" and filtered out projects with 1) no bot edits/editors, 2) fewer than 5 active editors per month, 3) fewer than 15 edits per month, or 4) more than 100,000 monthly active editors.
https://docs.google.com/spreadsheets/d/1GzyDzCuOAjEU6sF3Gs0fiPhGXvZxGYJ4_VVWt07hMpY/edit#gid=731709624

Further exploration to see whether the number of bots is a good proxy for identifying emerging communities:

  • let's just consider 'wikipedia' projects
  • let's remove any project with fewer than 5 editors or more than 40,000 (to include arwiki)
  • let's intercompare the data using some bucketing strategy; for example, we know that cawiki and rowiki have healthy tech communities

It is useful to note that the bot_editor_ratio in arwiki is not dissimilar from the one in healthier wikis

let's just consider 'wikipedia' projects
let's remove any project with less than 5 editors or more than 40,000 (to include arwiki)

Added the wiki group to the table, and filtered out projects which

  1. are non-Wikipedia projects,
  2. have no bot edits/editors,
  3. have fewer than 5 active editors per month, or
  4. have more than 80,000 monthly edits. In order to include arwiki, as Nuria requested, I did NOT filter at 40,000: arwiki has 64,375 edits, which is above that threshold.

table link:
https://docs.google.com/spreadsheets/d/1GzyDzCuOAjEU6sF3Gs0fiPhGXvZxGYJ4_VVWt07hMpY/edit#gid=731709624

Notes from the meeting: let's try to understand in detail what the bots are doing, so that we can classify the types of contributions. Are bots editing mostly in the main content namespace or in non-content namespaces? Let's include Commons and Wikidata and explore a few wikis (some large, some medium, some small).

It would also be useful to explore tags for edits happening in Cloud VPS; @Bmueller to provide a list of tags that are of interest. Or should we look at all tags for all edits happening on cloud namespace?

@Nuria, @Bmueller

Please find the bot edits by namespace in the table. It includes editing data for all projects from 2020.01.01 to 2020.05.31.

table link
https://docs.google.com/spreadsheets/d/1GzyDzCuOAjEU6sF3Gs0fiPhGXvZxGYJ4_VVWt07hMpY/edit#gid=1655475021

Looking at data from the last 5 months: the bulk of edits happens in the content namespace for both small and larger communities, so there is not a big distinction there.

Maybe it is worth thinking about the number of articles in a wiki and the number of bots: above some number of articles/edits, bots are needed for a wiki to be healthy. At what point will a wiki community need to start thinking about bots?

We can compare the number of articles and edits of a healthy wiki historically against the number of bots and edits by bots. We would need to get that history for a large wiki, like ruwiki, rowiki, or svwiki, and compare it to the history of pages and edits on smaller wikis.

Dimensions: # of articles, # of edits, # of bot edits (all for the content namespace)

@Nuria, @Bmueller
I created a dashboard so that you can explore the pattern with other wikis. https://superset.wikimedia.org/r/263
My observations include:

  1. Content edits fluctuated more in small projects than in large projects.
  2. Bot editing is the major cause of the peaks in the historical trend in small projects.
  3. Big projects tend to have a lower ratio of bot edits to total edits, and small projects tend to have a higher one (though not always).

Here are the trends for the 3 wikis listed: ruwiki, rowiki, svwiki.
ruwiki

image.png (832×1 px, 110 KB)

rowiki

image.png (814×1 px, 130 KB)

svwiki

image.png (824×1 px, 170 KB)

For # of articles, aka content pages, I only found data since 2019-12.

wiki    2019-10  2019-11  2019-12  2020-01  2020-02  2020-03  2020-04  2020-05
ruwiki  NaN      NaN      1587238  1594411  1600954  1609713  1620370  1631346
rowiki  NaN      NaN      403123   403961   404705   405624   408401   409374
svwiki  NaN      NaN      3745394  3738664  3738237  3735851  3731172  3731412

From the sync with @jwang today:

  • To gain insights into how bots contributed to the content as a project grew, it would be great to have data with a longer history. @Nuria, what is the earliest data we have on a) the number of articles per wiki and b) editors (bots/humans) per month? (Which table?)
  • Potential workaround for @jwang to explore: data on the creation of bot IDs (as early as possible), plus looking up the documentation on a selected set of wikis to learn about article size at the point in time when the bots were created
  • Once we know whether there is a visible correlation in the growth history of wikis, the next step would be to look into what types of bots communities develop first

@jwang please add if I missed anything! Thanks for the session today!

To gain insights how bots contributed to the content while a project grew, it would be great to have data with longer history. @Nuria what is the earliest data we have on a.) numbers of article per wiki + b.) editors (bot/humans) per month? (which table?)
Potential workaround to explore by @jwang: Data on creation of bot IDs (as early as possible) and looking up the documentation on a selected set of wikis to learn about article size at the point in time when bots got created

I have a few new methods to explore: 1) the table wmf.mediawiki_history is a good source for bot edits/edits; I just need to simplify it so that the Superset dashboard can handle it. 2) Try counting articles by creation day to get the # of articles.
No questions for @Nuria about tables for now, unless the new methods don't work.
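
The second method, counting articles by creation day, amounts to a running total over creation dates. A minimal sketch, assuming the creation dates are already available as 'YYYY-MM-DD' strings (the function name and input shape are hypothetical, and deletions or redirects are not handled):

```python
from collections import Counter
from itertools import accumulate


def cumulative_articles_by_month(creation_dates):
    """Return {'YYYY-MM': articles created up to and including that month}.

    Months in which no article was created do not appear in the result.
    """
    created_per_month = Counter(date[:7] for date in creation_dates)
    months = sorted(created_per_month)
    running_totals = accumulate(created_per_month[month] for month in months)
    return dict(zip(months, running_totals))
```

Because deleted pages are ignored, the result approximates the article count rather than matching the official content page numbers exactly.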

@Nuria , @Bmueller

Here are the historical trends of ruwiki, rowiki and svwiki. You can explore other wikis on the dashboard at https://superset.wikimedia.org/r/263.

ruwiki

image.png (828×1 px, 130 KB)

rowiki

image.png (808×1 px, 142 KB)

svwiki

image.png (816×1 px, 118 KB)

I think we now need some wikis that are on the other side of the spectrum in terms of edits/articles and bots to compare findings. It is remarkable how similar these graphs are for svwiki and rowiki: the numbers of edits by humans and bots do not differ that much, and spikes in both are closely related to the creation of pages. In ruwiki, creation of content is much more 'organic'.

In the meeting with @Bmueller last week, she was interested in the editor trends by wiki and how they correlate with the growth of bot editing. I have added the editor trend and the bot editor trend to the dashboard https://superset.wikimedia.org/r/263.

Note: The definition of the editor metric here is slightly different from the monthly active editors metric in the data dictionary.
In the data dictionary, monthly active editors is defined as "the number of registered users who made at least 5 content edits across all projects in the given month". Here, to simplify the query in Superset, I just count editors without the 5-content-edits threshold. So the editor metric in the dashboard is the number of registered users who made content edits in the given month on the given project.
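
To make the difference between the two editor metrics concrete, here is a small sketch over one month of per-user, per-wiki content edit counts. The row shape `(user_id, wiki_db, content_edits)` and both function names are assumptions for illustration; the rows are assumed to contain registered users only.

```python
from collections import defaultdict


def monthly_active_editors(rows, min_edits=5):
    """Data dictionary style: users with at least 5 content edits across all projects."""
    edits_per_user = defaultdict(int)
    for user_id, _wiki_db, content_edits in rows:
        edits_per_user[user_id] += content_edits
    return sum(1 for total in edits_per_user.values() if total >= min_edits)


def dashboard_editors(rows, wiki_db):
    """Dashboard style: users with any content edit on the given project, no threshold."""
    return len({user_id for user_id, wiki, content_edits in rows
                if wiki == wiki_db and content_edits > 0})
```

The dashboard number will generally be higher for a given project, since it also counts users with fewer than 5 edits.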

On ruwiki, when bot editing became active (> 1k) in Sept 2004, the number of editors was 268.

image.png (828×1 px, 130 KB)

image.png (814×2 px, 108 KB)

On rowiki, when bot editing became active (> 1k) in July 2005, the number of editors was 134.

image.png (808×1 px, 142 KB)

image.png (802×2 px, 117 KB)

On svwiki, when bot editing became active (> 1k) in Jun 2005, the number of editors was 624.
image.png (816×1 px, 118 KB)

image.png (804×2 px, 124 KB)
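
The "bot editing became active" readings above can be sketched as a lookup of the first month in which bot edits exceed 1,000, paired with the editor count for that month. The function name and the `{'YYYY-MM': count}` input shape are hypothetical; the 1k threshold is the one used in this comment.

```python
def first_active_month(bot_edits_by_month, editors_by_month, threshold=1000):
    """Return (month, editors) for the first month with bot edits above threshold.

    Both inputs map 'YYYY-MM' strings to counts. Returns None if bot editing
    never exceeds the threshold; editors is None if that month has no count.
    """
    for month in sorted(bot_edits_by_month):
        if bot_edits_by_month[month] > threshold:
            return month, editors_by_month.get(month)
    return None
```

'YYYY-MM' strings sort chronologically, which is why a plain `sorted` is enough here.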

We were thinking of plotting several wikis on the same graph with relative times, so we can see

editors/articles (content namespace) and bots

Now, that brings some concerns about data normalization so that we can inter-compare numbers: could we normalize editors and bots by population (or even by population with internet access)?
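
One possible way to prepare the data for such a plot is to re-index each wiki's monthly series by the number of months since its bot editing became active, and to express counts per million people. This is a sketch under stated assumptions: the function names are hypothetical, and the population figures would have to come from an external source that has not been chosen here.

```python
def month_number(month):
    """Convert 'YYYY-MM' to a running month count so months can be subtracted."""
    year, month_of_year = month.split("-")
    return int(year) * 12 + int(month_of_year)


def align_to_activation(series, activation_month):
    """Re-key {'YYYY-MM': value} as {months since activation: value}.

    Months before the activation month get negative offsets.
    """
    origin = month_number(activation_month)
    return {month_number(month) - origin: value for month, value in series.items()}


def per_million(value, population):
    """Normalize a count by population, expressed per million people."""
    return value * 1_000_000 / population
```

With every wiki re-keyed this way, offset 0 is the activation month for all of them, so the curves can share one x-axis.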

Summarized the findings and recommendations in a memo so that the tech team can start to take a look. In parallel, I am working on a report (draft) and will publish it on Meta-Wiki (link).

Hi @jwang, I read through and did some format and copy edits, hopefully small tweaks. Let me know if something is inappropriate and I can revert it. Very nice research 👍👏

I wanted to let you know that most of the tables with legends and figures on https://meta.wikimedia.org/wiki/Research:Emerging_Technical_Communities#Explorations are cut off in smaller windows. They are already kind of unreadable at around 1000px wide, which is quite a large window.

Examples:

Screenshot 2020-09-07 at 12.49.01.png (950×1 px, 297 KB)

Screenshot 2020-09-07 at 12.49.12.png (1×1 px, 246 KB)

Screenshot 2020-09-07 at 12.49.21.png (1×1 px, 305 KB)

I didn't want to touch anything else in case the positioning has big significance and you are against changing the format. Up to you.

You can make them single column like I did here putting the images top to bottom and centered and removing the floats on the wiki tables.

I know it can be annoying to not be able to format the content exactly but it is important to keep in mind different form factors when styling content for the web with HTML and CSS.

Let me know if you need a hand with anything.

@Jhernandez thank you very much for your review and edits, and thank you for bringing up the formatting issue on smaller screens. I will adjust the format to make it friendlier to all screen sizes. Thanks!

@Jhernandez I made some formatting changes on the page. Let me know if it looks better on smaller screens.