Implement inequality metrics for WikiStats
Open, LowPublic

Description

Profile Information

Name: Abel Serrano Juste
IRC nickname on Freenode: Quasipodo / Akronix
Web Profile: https://www.akronix.es/
Resume (optional):


Location (country or state): Spain
Typical working hours (include your timezone): 17:00-21:00 (UTC+02:00) Mon - Fri

Synopsis

  • Short summary describing your project and how it will benefit Wikimedia projects

Inequality metrics measure how work is distributed among the members of a community. Knowing whether the work is heavily concentrated in a few people can point us to further issues, such as a lack of accessibility for newcomers, a lack of neutrality and diversity (potentially conflicting with Wikipedia's Neutral Point of View and democratic principles), or ineffective communication and collaboration within that community, among others. It could also mean that a few people are strongly committed to that community and are keeping it alive, and that the community needs some attention to boost it. Either way, inequality indicators are a good starting point for better understanding collaborative communities, and wikis in particular, as previous research has shown [1][2].

My project would consist of implementing a subset of validated inequality metrics in Wikistats v2. The request for those metrics, and the discussion about them, has already started in: https://phabricator.wikimedia.org/T195033.

The whole project involves going through the following technical steps:

  1. Use sqoop to pull in the data needed from the database replicas to hdfs
  2. Write an oozie job to compute the metrics
  3. Load the output into Cassandra
  4. Write an API endpoint to serve the metric over the dimensions we decide on
  5. Configure the wikistats frontend to connect to the API
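
As a rough illustration of what the computation in step 2 might look like, here is a minimal Python sketch that derives a Gini coefficient per wiki and month from a list of edits. It is only a sketch under stated assumptions: the function names, the `(wiki, month, user)` row shape and the sample data are hypothetical, only users with at least one edit in the period are counted, and the real job would run on the cluster rather than in plain Python.

```python
from collections import defaultdict


def gini(values):
    """Gini coefficient of non-negative values (0 = perfectly equal)."""
    xs = sorted(values)
    n = len(xs)
    total = sum(xs)
    if n == 0 or total == 0:
        return 0.0
    # Rank-weighted form on ascending values:
    # G = 2 * sum(i * x_i) / (n * total) - (n + 1) / n
    weighted = sum(rank * x for rank, x in enumerate(xs, start=1))
    return 2.0 * weighted / (n * total) - (n + 1.0) / n


def gini_per_wiki_month(edits):
    """edits: iterable of (wiki, month, user) tuples, one tuple per edit.

    Hypothetical row shape; users with zero edits in a month are not counted.
    """
    counts = defaultdict(lambda: defaultdict(int))
    for wiki, month, user in edits:
        counts[(wiki, month)][user] += 1
    return {key: gini(per_user.values()) for key, per_user in counts.items()}


# Made-up sample: one dominant editor on one wiki, two equal editors on another.
sample_edits = (
    [("eswiki", "2020-05", "alice")] * 8
    + [("eswiki", "2020-05", "bob")]
    + [("eswiki", "2020-05", "carol")]
    + [("cawiki", "2020-05", "dave")] * 3
    + [("cawiki", "2020-05", "erin")] * 3
)
result = gini_per_wiki_month(sample_edits)
```

With this sample, the wiki where one editor makes 8 of 10 edits gets a clearly higher coefficient than the wiki where two editors contribute equally, which comes out as 0.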

Deliverables

Days/Dates | Milestone/Deadline/Subtask Accomplished
May 4 - Jun 1 | Community bonding period: spend time interacting with the Analytics team at Wikimedia. Understand common practices and norms.
Jun 1 - Jun 14 | Learn about the technologies & infrastructure being used within the Analytics team. Read and understand documentation: [3] [4]
Jun 15 - Jun 28 | Investigate metrics, discuss options, inspect available data. Discuss metric ideas and define related requirements.
Jun 15 - Jun 28 | Draft some metrics. Validate them.
Jun 29 - Jul 3 | Phase 1 Evaluations
Jul 6 - Jul 17 | Get data with sqoop. Write metrics with oozie. Load output into Cassandra.
Jul 20 - Jul 24 | Write tests for previous metrics code.
Jul 27 - Jul 31 | Phase 2 Evaluations
Aug 3 - Aug 14 | Write API endpoints to serve metrics. Write tests.
Aug 3 - Aug 14 | Work on the Wikistats frontend to connect to the API.
Aug 14 - Aug 21 | Test all the above.
Aug 14 - Aug 21 | Final deploy.
Aug 24 - Aug 31 | Last refinements: document whatever is not documented, clean up code, etc.
Aug 31 - Sept 7 | Final Evaluation

This is possibly a pessimistic plan, though. I think I'll be able to shave some days off it.

Participation

  • Technical communication will happen in the corresponding Phabricator tasks or subtasks.
  • Available on IRC, at least, during my working hours ( 17:00-21:00 UTC+02:00, Mon - Fri ).
  • Communication through e-mail and through the analytics team mailing list for more thoughtful discussions.
  • Weekly updates will be written on my Meta-Wiki user page to give a better overview of the current status of the project.
  • Bi-weekly reports will be written on a blog as a communication channel for both the Wikimedia community and the outside world.
  • Source code should be uploaded and published to its corresponding Gerrit instance following the git flow of the team.

About Me

I graduated in 2015 with a Bachelor's in Computer Science from the Universidad Complutense de Madrid (UCM). During my time at the university, I founded a student association to support free & open-source software called LibreLabUCM. I also spent one academic year abroad in Cyprus under the Erasmus program. After university, I worked for a while in tutoring, web development, IT support and system administration.

From March 2017 to Spring 2019 I held a position as an assistant researcher at the university, supporting research on online collaborative communities, in particular wikis. During this time I published some research papers, did some data processing, analysis and visualization with wiki data, and attended different related events, like WMF Hackathons 2018 and 2019 or OpenSym 2018.

I'm currently finishing my Master's in Data Science and I should be done by the beginning of June. The university I'm enrolled in is an online university, which has allowed me to move around during my studies. Lately, I have experienced a bit of the life of a digital nomad.

I heard about Google Summer of Code while I was studying at the university, and I see it as a very good opportunity to get started with real open-source projects, to develop computer skills (like remote working and good coding practices) and to gain valuable knowledge and experience.

This project gives me the opportunity to do something with real impact that is useful for society at large. I also want to learn in practice how a big organization works, both technically and organizationally. Finally, I value the Wikimedia Foundation's knowledge-for-everyone mission, and I admire and applaud the fact that it is built entirely upon volunteers and private donations.

Finally, I should highlight that this summer I'd like to live in some sort of intentional community, which entails some collective work plus other tasks that would take up part of my time. Nevertheless, I still believe that my proposed dedication would be enough to get the job done well and, of course, I will look for a place with a good internet connection and commit to the minimum working hours that I presented in this proposal.

Past Experience

During my time at the university I have mostly been downloading, processing and analyzing MediaWiki data. I also led the development of WikiChron, a web app to visualize metrics and networks of wiki communities from their historical data. As a result of those two years working as an assistant researcher at the university, I published four articles and presented my work both at scientific conferences [5] and at other events (like the Wikimedia Hackathon 2018).

I have also made some small contributions to other open source projects from time to time (you can see some of my contributions in my GitHub profile). My most notable involvement right now is in the Trustroots travelers' social network.

Regarding code directly owned by Wikimedia itself, I can only recall this very small fix I made to the docs of mediawiki-utilities: https://github.com/mediawiki-utilities/python-mwtypes/pull/2.

Lastly, I want to add that I have been an editor of Wikipedia since 2013 and I am an active editor of other wikis.

Any Other Info

I have already coded some inequality metrics in Python and packaged them in a library available via pip: https://github.com/Grasia/inequality_coefficients/
Also, I have written some scripts to download, process and filter wiki data: https://github.com/Grasia/wiki-scripts.

References:

[1]: F. Ortega, J. M. Gonzalez-Barahona, and G. Robles. 2008. On the Inequality of Contributions to Wikipedia. In Proceedings of the 41st Annual Hawaii International Conference on System Sciences (HICSS 2008). 304–304.
[2]: Abel Serrano, Javier Arroyo, and Samer Hassan. 2018. Participation Inequality in Wikis: A Temporal Analysis Using WikiChron. In Proceedings of the 14th International Symposium on Open Collaboration (OpenSym '18). ACM, New York, NY, USA, Article 12, 7 pages. DOI: 10.1145/3233391.3233536.
[3]: Documentation for the Analytics technology set
[4]: Wikistats 2 documentation
[5]: See the publications section on my website to refer to them: https://www.akronix.es/publications.html

Event Timeline

Quasipodo updated the task description.

Ok, proposal looks solid. I like the detail and am excited we get to work with someone of your background. One higher level thought does come to mind, and I'd love @MGerlach's opinion as well. The data that I know is available to compute this type of metric is the user's gender, voluntarily filled in by each user. I have lost track of recent work on understanding that data, and I think @Isaac has a much fresher perspective, but here are my assumptions phrased as questions:

  • sparse: not a lot of people fill out this preference?
  • relatively constant: doesn't change very much from year to year?
  • potentially biased towards folks who feel comfortable self-identifying on a platform like ours?

If the answer to these is more "yes" than "no", then maybe this work could compute a yearly number instead of a monthly metric. We could publish this in a separate section of Wikistats on a yearly basis, with other metrics that we don't think change much. And if we get better data and the number starts moving more quickly, we can add it as a metric. To be clear, I think this would still be very valuable to surface and we've been thinking of such a yearly report for other overall totals.

If the answer is more "no" than "yes", let's make sure to take into account how the data is better now, as we compute the metric.

Thanks, and looking forward to this!

Thanks @Milimetric for your input.

The Gini coefficient (and other inequality metrics) has been used by us (see WikiChron) and in many other studies to measure the concentration of work production (edits) within a particular wiki or community. This can be seen as a proxy, for instance, of how horizontal a community is. Nonetheless, online communities (and also CBPP) commonly follow the 90-9-1 rule (see the article in Wikipedia), so inequality is, in all cases, expected.
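
To make the 90-9-1 comparison a bit more concrete, one simple check is to split users by rank and see what share of the edits each group accounts for. The snippet below is only an illustrative sketch: the function name `participation_shares`, the rounding of the group boundaries and the sample numbers are my own assumptions, not something taken from WikiChron or Wikistats.

```python
def participation_shares(edit_counts):
    """Share of all edits made by the top 1%, the next 9% and the
    remaining 90% of users, ranked by number of edits.

    edit_counts holds one number per user; 0 stands for a user who
    never edited (a "lurker").
    """
    xs = sorted(edit_counts, reverse=True)
    n = len(xs)
    total = sum(xs)
    if n == 0 or total == 0:
        return (0.0, 0.0, 0.0)
    top1 = max(1, round(n * 0.01))      # at least one user in the top group
    top10 = max(top1, round(n * 0.10))  # boundary between the 9% and the 90%
    return (
        sum(xs[:top1]) / total,
        sum(xs[top1:top10]) / total,
        sum(xs[top10:]) / total,
    )


# Made-up community of 100 users: 1 very active, 9 occasional, 90 lurkers.
shares = participation_shares([730] + [30] * 9 + [0] * 90)
```

In this made-up community the single most active user accounts for 73% of the edits and the next nine users for the remaining 27%.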

I'd like to refer here to my talk at the Wikimedia Hackathon 2018, where I explain in more detail how we used these metrics to analyze productivity and concentration of work in Fandom (Wikia) wikis, as well as to the corresponding article published in OpenSym.

I hadn't thought before of using it as a way to measure gender gaps. My opinion is that further discussion is needed on how to use inequality metrics for this (would you compare edits made by men vs. women, the number of users of each, the number of top editors, ...?). To the best of my knowledge, no previous work has done this for wikis. Also, I share the concern that gender labelling in the MediaWiki ecosystem is not very widespread or reliable, although I am aware of some gender research which has been successfully done for Wikipedia.

Also, I'd like to ask @awight for his input on this.

Finally, I want to highlight that I could only commit part-time, given the current situation and my possible future situation in summer.

About the time commitment, around 30+ hours per week are required as per the program guidelines: https://developers.google.com/open-source/gsoc/faq#how_much_time_does_gsoc_participation_take. @Quasipodo Will it be possible for you to commit that many hours?

I like the proposal and think tracking inequality metrics over time as described above can be very informative about health of a community.
Regarding the discussion on gender, my understanding is that the current proposal only looks at inequality of editing across all its users regardless of gender. I do think that looking at gender-inequality specifically would be super interesting. If possible, it would be great to include but I suspect this would be a project in itself. This discussion came up previously in the Analytics/Research office hours, so I am putting some pointers to relevant work here:

[1] Lam, S. (tony) K., Uduwage, A., Dong, Z., Sen, S., Musicant, D. R., Terveen, L., & Riedl, J. (2011). WP:clubhouse? an exploration of Wikipedia’s gender imbalance. Proceedings of the 7th International Symposium on Wikis and Open Collaboration, 1–10. https://doi.org/10.1145/2038558.2038560

The data that I know is available to compute this type of metric is the user's gender, voluntarily filled in by each user. I have lost track of recent work on understanding that data, and I think @Isaac has a much fresher perspective, but here are my assumptions phrased as questions:

  • sparse: not a lot of people fill out this preference?
  • relatively constant: doesn't change very much from year to year?
  • potentially biased towards folks who feel comfortable self-identifying on a platform like ours?

@Milimetric yeah see some basic analyses I did here: https://meta.wikimedia.org/wiki/Research:Surveys_on_the_gender_of_editors/Report#User_Preferences

Quick takeaways would be:

  • Yes, it is sparse, especially among new editors: ~2% of new users have filled it out vs. 25% or greater of long-time editors
    • You can account for some of this by joining in edit counts and balancing the results so they aren't dominated by long-term editors
  • I didn't look at how quickly it changes but I did look at how it varies depending on how you define active editors.
  • We didn't see any evidence that men or women were more likely to set the user preferences gender field but we only looked at Arabic, English, Norwegian Wikipedia, so this behavior might occur in other contexts.
  • The main issue was that most editing communities are quite small combined with low response rates for setting gender, so plausibly if you have an editathon where they encourage new users to add their gender properties, this could significantly move the numbers for that language community. So I just don't consider the data robust unless new users are nudged to set the property when they create their accounts (but then I have a bunch of suggestions for what else would have to happen too).
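
One possible reading of the "balancing" idea in the takeaways above is a simple reweighting by activity level: compute the gender shares inside each edit-count bucket, then weight each bucket by how many users it contains rather than by how many of them filled in the field. The sketch below is only that, a sketch under assumptions: the bucket labels, the function name `balanced_gender_shares` and the sample numbers are hypothetical, and it is not the analysis from the linked report.

```python
from collections import Counter, defaultdict


def balanced_gender_shares(users):
    """users: iterable of (activity_bucket, gender) pairs, where gender
    is None when the preference was not set.

    Gender shares are computed within each bucket from the users who
    declared a gender, then weighted by the bucket's total size, so that
    buckets with a high response rate do not dominate the result.
    """
    bucket_sizes = Counter()
    declared = defaultdict(Counter)
    for bucket, gender in users:
        bucket_sizes[bucket] += 1
        if gender is not None:
            declared[bucket][gender] += 1

    # Only buckets with at least one declared gender can contribute.
    usable = {b: c for b, c in declared.items() if sum(c.values()) > 0}
    covered = sum(bucket_sizes[b] for b in usable)
    shares = Counter()
    for bucket, by_gender in usable.items():
        weight = bucket_sizes[bucket] / covered
        n_declared = sum(by_gender.values())
        for gender, count in by_gender.items():
            shares[gender] += weight * count / n_declared
    return dict(shares)


# Made-up data: 100 new users (2 declared), 20 veterans (5 declared).
sample = (
    [("new", None)] * 98 + [("new", "female"), ("new", "male")]
    + [("veteran", None)] * 15 + [("veteran", "female")] + [("veteran", "male")] * 4
)
balanced = balanced_gender_shares(sample)
```

In this made-up sample the raw share of women among users who set the field is 2 out of 7, while the balanced estimate is 0.45, because the two new users who answered stand in for the whole group of 100. That also shows the fragility described above: with so few responses, a handful of newly set preferences can move the number a lot.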

Measuring the gender gap using inequality metrics sounds interesting to me as well. I can see it as a follow-up task once some inequality metrics have been implemented for edits in general.

Other interesting variants can also be explored, like comparing the Gini of male editors vs. the Gini of female editors on different wikis (to see if there's more equality within one gender than the other, or if they behave similarly). I don't know of any previous research doing this kind of analysis, inequality of work by gender in collaborative online communities, and it could be of broad interest. All of this rests on the premise that we have a statistically representative "population" of editors with labeled gender.

We might be going too far off-topic now, and maybe another Phabricator task / subtask should be added. Just for the record, I'd like to add that there was a specific report in 2018 about gender equity in Wikipedia based on surveys, and there are some wiki pages specifically dedicated to this topic, with a good compilation of related references, both in Wikipedia and on Meta-Wiki.

@Quasipodo perhaps you already know, but I came across the wikidata knowledge imbalance dashboard which calculates the gini-coefficient for different wikidata items and does a good job of putting the obtained numbers into context and explaining what they mean. Maybe a useful reference and inspiration for how to present metrics.

@MGerlach Wow! very visual and interactive way to explore the gini coefficient. I really like it!

Hi everyone! After checking with Google about the time requirements for the program, I learned that we would not be able to accept this proposal as part of GSoC as it does not meet the requirement of +30 hrs time commitment per week. Of course, @Quasipodo can work on the project outside GSoC.

Good to know. In that case, I'm also happy to keep an eye on this task and help as I can, and we can be less worried about deadlines.

(Maybe, should we move all the non-GSoC discussion to the other phabricator task?)

@Quasipodo perhaps you already know, but I came across the wikidata knowledge imbalance dashboard which calculates the gini-coefficient for different wikidata items and does a good job of putting the obtained numbers into context and explaining what they mean. Maybe a useful reference and inspiration for how to present metrics.

Anyway, and just for the record, I think something similar to what they do in that webapp could be done for articles within categories in Wikipedia, or to compare different Wikipedia languages, etc. The range of options is quite wide and I guess it mostly depends on what the foundation is more interested in inspecting, visualizing and analysing.

Pavithraes renamed this task from Proposal (GSoC 2020): Implement inequality metrics for WikiStats to Implement inequality metrics for WikiStats. May 5 2020, 7:38 PM

@Quasipodo Thanks for your application. The GSoC projects have been announced, and as per T248964#6063593 we couldn't select this project. I've updated the task title and removed the GSoC tag. :)

@Quasipodo sorry to hear that the project was not selected.
I am still happy to keep an eye on this task and help out if I can, should you decide to pick up any of this anyway.

Thank you @Milimetric and @MGerlach, I appreciate that. Let's see how everything evolves, I'm currently deeply involved in finishing up my master's thesis.

Removing task assignee due to inactivity as this open task has been assigned for more than two years. See the email sent to the task assignee on August 22nd, 2022.
Please assign this task to yourself again if you still realistically [plan to] work on this task - it would be welcome!
If this task has been resolved in the meantime, or should not be worked on ("declined"), please update its task status via "Add Action… 🡒 Change Status".
Also see https://www.mediawiki.org/wiki/Bug_management/Assignee_cleanup for tips how to best manage your individual work in Phabricator. Thanks!