
Investigation: Check Wikidata data on gendered category names
Closed, Declined · Public · Spike

Description

If we want to utilize Wikidata for gendered category re-labeling, we should gather some more information on how this could be done.

  • see how "complete" the data on categories on WD is
  • how can we make sure that we do not flood WD with requests?
  • could the system be utilized for cases with more than two genders?
  • could WD be used to generally support a category alias system?

Preliminary notes

There are a few Wikidata relationships which encode gender, but nothing which is obviously a good fit for our purposes.

No links between gendered items

When Wikidata includes both genders of a linked concept, such as actors and actresses, there is no direct link relating the two items.

Gendered label support

The most promising properties so far are "female form of label" and "male form of label". They are text properties rather than links to full Wikidata items. For example, Physician has the female form "Ärztin" (de) but no explicit property for the male form. The masculine "Arzt" would have to be extracted from the German label or from the dewiki sitelink. In our case, the free-text property is probably a good fit since we don't want to create additional items or pages for each gendered form.
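A minimal sketch of reading these text properties from an entity's JSON, assuming the usual Wikibase JSON layout in which a monolingual text value carries "text" and "language" keys. The function names and the abbreviated sample entity are hypothetical. P2521 is "female form of label" (as used in the query on this task), and P3321 is assumed here to be "male form of label". When there is no explicit male form, the sketch falls back to the plain label, as in the Physician example.

```python
FEMALE_FORM = "P2521"  # "female form of label"
MALE_FORM = "P3321"    # assumed ID for "male form of label"


def monolingual_values(entity, prop, lang):
    """Return every text of a monolingual-text property in one language."""
    texts = []
    for statement in entity.get("claims", {}).get(prop, []):
        snak = statement.get("mainsnak", {})
        if snak.get("snaktype") != "value":
            continue  # skip "novalue" / "somevalue" snaks
        value = snak["datavalue"]["value"]
        if value.get("language") == lang:
            texts.append(value["text"])
    return texts


def gendered_forms(entity, lang):
    """Collect female and male forms, using the label as the male fallback."""
    female = monolingual_values(entity, FEMALE_FORM, lang)
    male = monolingual_values(entity, MALE_FORM, lang)
    if not male:
        label = entity.get("labels", {}).get(lang, {}).get("value")
        male = [label] if label else []
    return {"female": female, "male": male}


# Hypothetical, heavily abbreviated entity modelled on the Physician example.
sample_entity = {
    "labels": {"de": {"language": "de", "value": "Arzt"}},
    "claims": {
        "P2521": [
            {
                "mainsnak": {
                    "snaktype": "value",
                    "property": "P2521",
                    "datavalue": {
                        "value": {"text": "Ärztin", "language": "de"},
                        "type": "monolingualtext",
                    },
                }
            }
        ]
    },
}

print(gendered_forms(sample_entity, "de"))
```

Treating the label as the male form is itself an assumption that only holds for languages where the default label is masculine.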

Grammatical gender

https://www.wikidata.org/wiki/Property:P5185

Lexemes can have grammatical gender, for example https://www.wikidata.org/wiki/Lexeme:L21064 "Arzt" is masculine.

Lexemes may be linked to a Wikidata concept through the property "item for this sense", https://www.wikidata.org/wiki/Property:P5137 . Concepts might include "topic's main category" https://www.wikidata.org/wiki/Property:P910 .

Permanently duplicated items

https://www.wikidata.org/wiki/Property:P2959

This is an interesting property, described as:

this item duplicates another item and the two can't be merged, as one Wikimedia project includes two pages, e. g. in different scripts or languages (applies to some wiki, e.g.: cdowiki, gomwiki). Use "duplicate item" for other wikis.

For example, this set seems to have gendered categories for North Frisian:

This property doesn't provide any information about how gender pertains to each duplicate item.

Any solution where wikidata items are linked

One example is "permanently duplicated item". These solutions all present the challenge of an assumed mapping between each Wikidata item and a wiki page on various sites. This makes sense if we maintain two gendered categories for a language, but that's not our intention, so it carries the risk of prompting people to actually create these additional pages.

Categories and subcategories are redundant with the concept

By this, I mean that the category, its diffusing subcategories, and its non-diffusing subcategories like "20th-century woman scientists" will each have to be mapped between genders. Mapping the profession concept itself feels almost useful, but maintaining the redundant labels for all minor categories is a wiki maintenance nightmare. I have no ideas about a robust way to extrapolate the gendered profession labels over to categories.

Here's an example subcategory, Category:20th-century Indian women scientists. There are "category contains" values with qualifiers saying, "human", "female", and "scientist". Maybe this is enough that we could safely run a simple string substitution using rules derived from the "scientist" concept and its "female form of label"?

Event Timeline

awight subscribed.

TODO:

  • Query how many Wikimedia category items include "human" and an occupation for "category contains".
  • Query how many occupation items include "female form of label" values, for German.

I dumped a lot into the task description, so I want to highlight what I believe is the most promising property so far. Gendered labels can be added to a profession's wikidata item, and then that profession can be added as a constraint on subjects of a category. That can be used to produce a regex appropriate for converting the profession name to its gendered equivalent, whenever it appears in the category label. Exceptions can be encoded directly into the wikidata item for the category, but we should try to keep this to a minimum.
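A rough sketch of the regex idea, assuming we already have the male and female forms of a profession and, optionally, per-category exceptions. The function name `relabel`, the `exceptions` mapping, and the sample category labels are all hypothetical. The pattern only replaces the profession name when it stands as a whole word, so compounds and already-feminine forms are left untouched.

```python
import re


def relabel(category_label, male_form, female_form, exceptions=None):
    """Rewrite a category label to its female form.

    `exceptions` maps a full category label to a hand-written replacement,
    standing in for overrides stored on the category's own item.
    """
    exceptions = exceptions or {}
    if category_label in exceptions:
        return exceptions[category_label]
    # Lookarounds ensure the form is not embedded in a longer word.
    pattern = re.compile(r"(?<!\w)" + re.escape(male_form) + r"(?!\w)")
    # A function replacement avoids backslash handling in the female form.
    return pattern.sub(lambda _match: female_form, category_label)


print(relabel("Arzt (20. Jahrhundert)", "Arzt", "Ärztin"))
```

This does not handle plurals, inflected forms, or compounds; those would either need further rules or fall under the exceptions kept on the category item.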

Currently, only 52 categories have a German "female form of label" entry in Wikidata:

SELECT
  (COUNT(?item) AS ?count)
WHERE
{
  ?item wdt:P2521 ?value .       # "female form of label"
  ?item wdt:P31 wd:Q4167836 .    # instance of: Wikimedia category
  FILTER(LANG(?value) = "de")    # German values only
}

I don't have anything great to say about this, yet:

how can we make sure that we do not flood WD with requests?

It seems that the Wikidata item corresponding to a page is easily available, and this is what powers parser functions like {{#property}}. However, our task is to pull Wikidata items and properties for the many categories which might be associated with an article. These do require another Wikidata query.

We can batch using the MediaWiki Wikidata API, e.g. wbgetentities, and retrieve 50 category titles at a time.
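The batching could look roughly like this: a sketch that splits an article's category titles into groups of 50 and builds one wbgetentities request URL per group, without sending anything. The helper name `batch_request_urls` and the example titles are hypothetical; `sites`, `titles`, `props` and `format` are existing wbgetentities parameters, and "dewiki" is only an assumed site ID.

```python
from urllib.parse import urlencode

API_URL = "https://www.wikidata.org/w/api.php"
BATCH_SIZE = 50  # titles per request, as noted above


def batch_request_urls(site, titles, batch_size=BATCH_SIZE):
    """Build one wbgetentities URL per batch of page titles."""
    urls = []
    for start in range(0, len(titles), batch_size):
        batch = titles[start:start + batch_size]
        params = {
            "action": "wbgetentities",
            "sites": site,
            "titles": "|".join(batch),  # the API separates values with "|"
            "props": "claims",          # only statements are needed here
            "format": "json",
        }
        urls.append(API_URL + "?" + urlencode(params))
    return urls


# Hypothetical titles: 120 categories become 3 requests (50 + 50 + 20).
example_titles = [f"Kategorie:Beispiel {i}" for i in range(120)]
example_urls = batch_request_urls("dewiki", example_titles)
print(len(example_urls))
```

Caching the results per category would reduce the request volume further, since many articles share the same categories.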

Wikidata can be used for any third+ genders or for general aliasing, simply by selecting properties other than "female form of label".

Aklapper subscribed.

Adding WMDE-TechWish to this open task as there are no active project tags on this task since the archival of WMDE-TechWish, hence nobody could find this task on some workboard.

Permanently duplicated item means they are identical in meaning, so that property shouldn't be used if there's a gender difference. Aren't those North Frisian categories different dialects though, rather than different genders?

Restricted Application changed the subtype of this task from "Task" to "Spike". · Nov 2 2020, 10:06 AM
Tobi_WMDE_SW subscribed.

This task was part of an investigation for a project that was discontinued.