Create a user script for showing statistics on Wikipedia articles about the gender of those linked in the article
Closed, Resolved, Public

Assigned To
Authored By
Isaac
Aug 11 2021, 8:34 PM

Description

Overview

While discussions of the gender gap on Wikipedia often focus on the number of articles about people of different gender identities, another core consideration regarding bias is the distribution of links on Wikipedia -- i.e. it's not just important for content to be available about women, it should also be discoverable. Research has demonstrated that disparities do exist: Wagner et al. showed "...that there exists a bias in the generation of links by Wikipedia editors, favoring articles about men." Adding appropriate links within Wikipedia to articles about women and individuals with non-binary identities then is an important part of addressing some of these systemic biases on Wikipedia (example).

One potential approach to making these link gaps more visible is to better surface data about the gendered nature of links on Wikipedia. To this end, I have already created a simple API that takes any article on Wikipedia and provides data on the gender distribution of the links in that article -- e.g., https://article-gender-data.wmcloud.org/api/v1/details?lang=en&title=Computer_programming. The API works for any Wikipedia article in any language but lacks a nice interface for showing the data.

Goal

The goal of this project would be to build a simple user script that visualizes the data for Wikipedia articles as a user browses. There would be a few potential components to this:

  • A button for turning on / off the script. The latency of the API isn't particularly high but given that gathering the data requires several API calls, it'd be nice to only make the request when necessary.
  • A simple summary of the distribution of links -- e.g., see mock-up below that uses the same colors used by the Humaniki dashboard.
  • Link-specific shading so that it's also possible to see if there are differences in the prominence of links to individuals of different gender identities -- e.g., see mock-up below.
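
A rough sketch of the on/off idea from the first bullet: the data is requested only the first time the script is switched on and reused afterwards. The names `createToggle` and `loadData` are hypothetical and not taken from any existing script; `loadData` stands in for whatever function performs the API request.

```javascript
// Hypothetical helper: wraps a loader so the API is only hit on first enable.
function createToggle(loadData) {
    let enabled = false;
    let loaded = false;
    let cached = null; // whatever loadData returned (e.g. a Promise of links)
    return {
        toggle() {
            enabled = !enabled;
            if (enabled && !loaded) {
                cached = loadData();
                loaded = true;
            }
            return enabled;
        },
        isEnabled() { return enabled; },
        data() { return cached; },
    };
}

// Usage with a stand-in loader; a real script would call fetch() here.
let calls = 0;
const toggle = createToggle(() => { calls += 1; return Promise.resolve([]); });
toggle.toggle(); // on: triggers the single request
toggle.toggle(); // off
toggle.toggle(); // on again: reuses the cached result
```

Caching the returned Promise rather than the resolved data means a second click while the request is still in flight does not start another request.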

Skills Needed

I am primarily looking for someone who wants to build the user script. I have already built the back-end API and can be responsible for making any necessary adjustments to it. Ideally you would have some experience building user scripts or with JavaScript, but folks who want to learn these things are welcome to give it a shot.

Communication

Probably easiest via phabricator but you can also find me on IRC at #wikimedia-research as isaacj. I'm based on the east coast of the United States (UTC-4) so depending on the time of day may be faster/slower to respond. If there's a better medium for communication, feel free to ask.

Simple Mock-ups

Summary (grey/orange/blue bar on top of article; potentially could also label the colors or have labels pop up when hovering or exclude data on links to non-people):

Screen Shot 2021-08-11 at 3.39.17 PM.png (720×2 px, 255 KB)

Link highlighting (orange/blue highlights on links to people):

Screen Shot 2021-08-11 at 3.30.25 PM.png (1×2 px, 1 MB)

Event Timeline

@Isaac good project. The script should be a gadget. Right?

Thanks @Mh-3110. I said user script because that's generally where these sorts of things start. Eventually, if folks find it useful, we could lobby to make it into a gadget. My understanding though is that making it a user script is the first step.

I'm new to this whole thing so I might have gotten something wrong, but I quickly hacked this together https://en.wikipedia.org/wiki/User:TayIorRobinson/wikigender.

@TayIorRobinson wow that was quick! I'm going to take a look and I know @Frexpe is also going to experiment with it. In my quick skimming of the code, I wanted to get one discussion started around the simplifyGender function, which currently combines trans and cis identities. I know questions around the correct terminology generally have no one right answer. I generally refer to these guidelines when considering what terminology I use, and to the design of the Humaniki tool for practical approaches to handling the complexity of gender identity in Wikidata. My questions then are:

  • While simplicity of design is nice, is there a way to adhere to Wikidata's range of identities rather than introducing our own taxonomy overtop?
  • If not, is there a strong reason not to use Humaniki's approach of showing male, female, and all other values as other genders with a way to still see the actual gender identities -- e.g., a tooltip if you hover over the information?

I wanted to keep with Humaniki's 3-gender-label system, and this was the primary reason for simplifying it down to the 3. I'd love to show a tooltip, but I couldn't find a clean and quick way to do that, so that is definitely something that can be worked on.

As for the reasoning behind combining trans and cis identities: this was done because trans people are seen as the identity they identify as, and some might be concerned if they get excluded from that with a separate "trans woman"/"trans man" category. I'm sure this has already been talked about on a larger scale and, of course, I'd be happy to switch to a model that has wide consensus. That said, I do see the value of being able to quickly see who is trans and who is cis. Possibly show trans women as a subset of women, rather than as a separate category, with a slightly darker/lighter colour?

I'd love to show a tooltip, but I couldn't find a clean and quick way to do that, so that is definitely something that can be worked on.

Yeah, I think you can use the title attribute in HTML but I'm not an expert so might be missing something there.

As for the reasoning behind combining trans and cis identities: this was done because trans people are seen as the identity they identify as, and some might be concerned if they get excluded from that with a separate "trans woman"/"trans man" category. I'm sure this has already been talked about on a larger scale and, of course, I'd be happy to switch to a model that has wide consensus. That said, I do see the value of being able to quickly see who is trans and who is cis. Possibly show trans women as a subset of women, rather than as a separate category, with a slightly darker/lighter colour?

Yeah, it definitely gets complicated, because I'm not certain whether, for example, male on Wikidata always indicates cis male. I think your suggestion about using similar colors makes sense. It will be editable because I'll leave the API backend returning the raw Wikidata values, so we can always change at a later date if someone indicates a good reason to use a different approach.
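
To make the Humaniki-style option from this discussion concrete, here is a minimal sketch that groups links into male, female, and other genders while keeping the raw labels available for a tooltip (for instance via the HTML title attribute). It is only one of the options discussed, not what the script necessarily does. The function names and the sample labels are assumptions; the only thing taken from the thread is that each link carries a title and a gender value.

```javascript
// Assumption: each link looks like { title, gender }, where gender is the raw
// Wikidata label as a lowercase string (e.g. "male", "female", "non-binary").
function groupByCategory(links) {
    const groups = { "male": {}, "female": {}, "other genders": {} };
    for (const link of links) {
        const category = (link.gender === "male" || link.gender === "female")
            ? link.gender
            : "other genders";
        // Keep per-label counts so the underlying identities are never lost.
        groups[category][link.gender] = (groups[category][link.gender] || 0) + 1;
    }
    return groups;
}

// Builds plain text suitable for a title attribute (shown on hover).
function tooltipText(labelCounts) {
    return Object.entries(labelCounts)
        .map(([label, count]) => `${label}: ${count}`)
        .join(", ");
}

// Made-up sample data, for illustration only.
const groups = groupByCategory([
    { title: "a", gender: "female" },
    { title: "b", gender: "non-binary" },
    { title: "c", gender: "transgender female" },
    { title: "d", gender: "non-binary" },
]);
const otherTooltip = tooltipText(groups["other genders"]);
```

Because the grouping happens on the client and the raw labels are preserved, switching to a different grouping later (such as showing trans women as a shaded subset of women) would only require changing `groupByCategory`.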

I updated it to have a table appear on hover to show a key and the individual counts. I'll admit there is styling work to be done though, as the table isn't exactly easy to read at a first glance.

image.png (157×1 px, 45 KB)

Hey, this looks great!! Thanks! Like you said, still some minor style things to work out and I think that not all of the links are being highlighted so I'm going to look into that to see if I can figure out what's going on in the next few days, but really excited to see this come to life! The Wikimania showcase for the hackathon is on Tuesday. If you want to present this, let me know and let's go ahead and fill out the form to show it. Details: https://wikimania.wikimedia.org/wiki/2021:Hackathon/Showcase

@TayIorRobinson ok, some more detailed thoughts on the code. Thanks again -- it was really easy to follow what you did and make suggestions. Regarding the highlighting of links, I figured out what was going on. Links often appear multiple times on the page but right now the code only captures the final time a link appears. I think the fix would be to make the values in the dictionary a list of links instead of just a single link. I haven't debugged it, but my attempt at the updated code is below:

function applyLinkColours(links) {
    let pageLinks = document.querySelector(".mw-parser-output").querySelectorAll("a[href^=\"/wiki/\"]");
    // Create an object of all the links on the page.
    let linksOnPage = {};
    for (let link of pageLinks) {
        let linkText = decodeURIComponent(link.href.split("/")[4].toLowerCase().replace(/_/g, " "));
        // Build array of all instances of a link appearing on page
        if (!(linkText in linksOnPage)) {
            linksOnPage[linkText] = [];
        }
        linksOnPage[linkText].push(link);
    }
    // Colour the links.
    for (let link of links) {
        let linkOnPage = linksOnPage[link.title];
        if (!linkOnPage) continue;
        for (let l of linkOnPage) { 
            l.style.backgroundColor = getGenderColour(link.gender) + "88";
        }
    }
}
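
One thing that might also be worth checking is how the lookup key is derived from each href: splitting on "/" would cut off titles that contain a slash, and a link to a section carries a "#..." suffix. Below is a small sketch of an alternative, assuming (as the code above does) that the API titles are lowercase with spaces instead of underscores. `titleKeyFromHref` is a hypothetical name and this has not been tested against the real script.

```javascript
// Turns a link target into a lookup key (lowercase, spaces for underscores).
// The base URL is only needed so relative hrefs like "/wiki/Foo" can be parsed.
function titleKeyFromHref(href, base = "https://en.wikipedia.org") {
    const url = new URL(href, base);
    const prefix = "/wiki/";
    if (!url.pathname.startsWith(prefix)) return null;
    // pathname excludes "#section" and "?query" and keeps any "/" in the title.
    const rawTitle = url.pathname.slice(prefix.length);
    try {
        return decodeURIComponent(rawTitle).replace(/_/g, " ").toLowerCase();
    } catch (e) {
        return null; // malformed percent-encoding
    }
}

const keys = [
    titleKeyFromHref("/wiki/Pose_(TV_series)#Cast"),
    titleKeyFromHref("https://en.wikipedia.org/wiki/AC/DC"),
    titleKeyFromHref("/wiki/Beyonc%C3%A9"),
];
```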

Some thoughts on table styling:

  • Should the "Other" column have the same style="margin-left: 2em" as the other column headers?
  • Could we keep to Humaniki's terminology and go with Other Genders as the full column header?
  • Could we switch the details rows to be added when the length is > 0 for each category (code)? That way, one can always see which gender identities comprised the other genders category and a page with only links to transgender men or women isn't just reduced to male or female.
  • Is there a way to add empty rows so each column has the same number of rows? For example, when I look at the English Wikipedia article for Pose, the detail rows get misaligned because there is no data for the other genders column so the male and female columns shift over to fill that place (screenshot below). I assume the easiest way is to add a minRows parameter or something like that to makeRows?

Screen Shot 2021-08-14 at 10.11.47 AM.png (300×1 px, 92 KB)
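
A possible shape for the minRows idea in the last bullet, written as a standalone helper rather than a change to makeRows, since I'm only guessing at how makeRows is structured. `padRows` is a hypothetical name and the column contents are made-up sample data.

```javascript
// Hypothetical helper: pads a column's detail rows with empty cells so every
// column ends up with at least minRows rows.
function padRows(rows, minRows) {
    const padded = rows.slice(); // copy so the input is left untouched
    while (padded.length < minRows) padded.push("");
    return padded;
}

// Made-up sample data, for illustration only.
const columns = {
    "Other Genders": [],
    "Female": ["female 3 links", "transgender female 2 links"],
    "Male": ["male 4 links"],
};

// Use the longest column as the target height for all columns.
const minRows = Math.max(...Object.values(columns).map((rows) => rows.length));
const aligned = Object.fromEntries(
    Object.entries(columns).map(([name, rows]) => [name, padRows(rows, minRows)])
);
```

With every column padded to the same height, an empty "Other Genders" column still renders its (blank) cells, so the female and male detail rows stay under their own headers.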

Hi @Isaac, I just made some changes, and it should now fix the issues you mentioned (also, I just removed the margin-left: 2em as it wasn't valid to apply anyway and had no effect)

image.png (151×1 px, 38 KB)

Thanks -- looks great! I was also thinking that including all the non-people links in the stats didn't make a whole lot of sense, so I added another parameter to the API call, all, that needs to be there if the non-gendered links are to be included in the API response. Let me know if you don't like that and I can either switch the API back or you can just add all to the API request as below:

All data: https://article-gender-data.wmcloud.org/api/v1/details?lang=en&title=Pose_(TV_series)&all
Just gender subset: https://article-gender-data.wmcloud.org/api/v1/details?lang=en&title=Pose_(TV_series)
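
For reference, a small sketch of building either URL in the script. `buildDetailsUrl` is a hypothetical helper, and the base URL is simply the endpoint shown in this thread, so it may need updating if the API moves. The query string is assembled by hand because the all flag appears here without a value, and I have not checked whether the API would also accept `all=`.

```javascript
// Endpoint as shown in this thread; treat it as an assumption that may change.
const API_BASE = "https://article-gender-data.wmcloud.org/api/v1/details";

// includeAll = true appends the bare "all" flag so that non-gendered links
// are included in the response as well.
function buildDetailsUrl(lang, title, includeAll = false) {
    const query = `lang=${encodeURIComponent(lang)}&title=${encodeURIComponent(title)}`;
    return `${API_BASE}?${query}${includeAll ? "&all" : ""}`;
}

const genderOnlyUrl = buildDetailsUrl("en", "Pose_(TV_series)");
const allDataUrl = buildDetailsUrl("en", "Pose_(TV_series)", true);
```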

I also loaded the script into my global javascript to make sure it works on other languages: https://meta.wikimedia.org/wiki/User:Isaac_(WMF)/global.js
In general it does! Pretty interesting to see how different languages end up with different distributions for the same article, especially when it comes to what links make it into the lede paragraph. Obviously would benefit from greater localization -- e.g., not always using English labels -- and the table gets switched around a little bit in Arabic or other right-to-left languages, but that feels likely beyond the scope of this initial working prototype :)

To me, this feels pretty complete. Let me know if you had additional thoughts / improvements you wanted to make though. I think the only thing remaining then would be whether you want to present it at the showcase on Tuesday: https://wikimania.wikimedia.org/wiki/2021:Hackathon/Showcase

It'd be nice to show it off at the showcase, @Isaac, however I'm not comfortable speaking in front of an audience, so I'm afraid I won't be able to do that.

No problem -- I'm happy to do the presentation then. Submitted a request and I'll let you know what I hear. Let me know if there's anything you'd like me to mention but otherwise I'll give a quick overview of the tool giving you credit for the user script and Humaniki as our design inspiration.

Also, as I was recording a video showing the script, it occurred to me that given that this script is aimed at boosting visibility of women and other gender identities on Wikipedia, we probably should explicitly show 0s in our main categories when there are no links so the gap is clearer. Like if the data was 10 links, 80% male and 20% female, then the table would be something like:

| Other Genders 0 links 0% | Female 2 links 20% | Male 8 links 80% |
|                          | female 2 links 20% | male 8 links 80% |

I think it's just a matter of removing the .length > 0 conditions for each gender group here and again here.
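
As a sketch of what the header row would look like with explicit zeros, using the 10-link example above: the three category names follow Humaniki's terminology, while `summarize` and the shape of its input are assumptions for illustration rather than the script's actual code.

```javascript
const MAIN_CATEGORIES = ["Other Genders", "Female", "Male"];

// counts: e.g. { Male: 8, Female: 2 }; categories that are missing from the
// input are reported as 0 rather than being dropped.
function summarize(counts) {
    const total = MAIN_CATEGORIES.reduce((sum, c) => sum + (counts[c] || 0), 0);
    return MAIN_CATEGORIES.map((category) => {
        const links = counts[category] || 0;
        const percent = total === 0 ? 0 : Math.round((100 * links) / total);
        return `${category} ${links} links ${percent}%`;
    });
}

const summary = summarize({ Male: 8, Female: 2 });
// summary: ["Other Genders 0 links 0%", "Female 2 links 20%", "Male 8 links 80%"]
```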

@Isaac: Thanks for participating in the Hackathon! We hope you had a great time.

  • If this task was being worked on and resolved at the Hackathon: Please change the task status to resolved via the Add Action...Change Status dropdown.
  • If this task is still valid and should stay open: Please add another active project tag to this task, so others can find this task (as likely nobody in the future will look back at the Hackathon workboard when trying to find something they are interested in).
  • In case there is nothing else to do for this task, or nobody plans to work on this task anymore: Please set the task status to declined.

Thank you,
your Hackathon venue housekeeping service

Thanks @Aklapper -- research project tag added while wrapping up.

TODOs:

  • @Isaac to update documentation on this phab task (reflect current status; indicate limitations)
  • @TayIorRobinson to tweak script to include all high-level gender categories even when no links (@Isaac made this tweak for the Showcase but did it in a very hacky way so there's probably a better approach). At that point, this script will feel like it's in a stable place as other changes (localization to other languages) would require more substantive work.
  • @Isaac to upload overview video to Commons of the tool

@TayIorRobinson so you aren't on the hook for any additional updates and because this code is really a fantastic template for additional projects, I wanted to see if you had any preferences around attribution.

Right now I link to your original code at the top of my new code and also in the documentation. You can see an example here where I was very quickly able to create a user script in which links are highlighted based on how much they are clicked on: https://en.wikipedia.org/wiki/User:Isaac_(WMF)/clickstream_viz.js

Isaac claimed this task.

I get 502 on the API. Is it still maintained? Is the source code for the API available somewhere?

I get 502 on the API.

@So9q thanks for letting me know -- I had turned off the instance while shuffling around some things and forgot to restart it. I have restarted it now so hopefully it's working again. I just noticed my UI for the stats table is broken because the language links are now at the top of the article for me but I likely won't have time to fix that anytime soon (happy to have you fix the code if you also have the issue).

Is it still maintained?

The short answer is no -- it's easy for me to keep it running and occasionally update it but there's no formal plan in place to preserve it. So always feel free to ask if something is off but it might take me a while to make the changes. I'll try to make it easier for other folks to maintain it though so you don't have to depend on me. Essentially, I think it should be easy to run this on Toolforge where I can add maintainers more easily so I'll try to make that switch and update the scripts with the new API endpoint. Let me know if you'd like me to add you as a maintainer there when I get around to it (I'll try this week while it's top of mind).

Is the source code for the API available somewhere?

Yep, relevant links:

  • API repo: https://github.com/wikimedia/research-api-endpoint-template/tree/article-gender-distribution (see README for details)
  • The thing that's missing from there is how to update the data used on the backend to assign gender. It's essentially just a script for processing the Wikidata dumps (nothing particularly complicated). It's written to run on the Wikimedia Foundation servers though, where it can be processed quite quickly. I'll also take a TODO to link to that script and write a basic proof-of-concept that'll run on the public dumps for folks who don't have access to the Wikimedia Foundation infrastructure.

And just following up to say that I moved the tool over to Toolforge (was easier than I anticipated -- let me know if you or anyone else would like to be added as a maintainer).