Build a taxonomy for "impactful topics"
Open, Needs Triage, Public

Description

We are looking at improving the taxonomy of topics we use to classify articles so it better reflects topics that are relevant to the community.

  • At the moment, the taxonomy includes an expanded version of the second level of the WikiProject directory taxonomy.
  • We would like to stick to WikiProjects as a reference unit because
    • They are largely adopted by the community as a way to organize labour
    • They provide high-quality label data to train topic classifiers
  • The idea is that we can expand the existing topic taxonomy to include more WikiProjects whose topics are considered relevant/impactful by the community.
  • The above implies setting up community consultations to define what those topical categories are.
  • Once (a part of) this set of topical categories is finalized, we can retrain our topic models so that they can classify articles according to a more impactful set of topics.

Event Timeline

@Astinson hi! We had originally created this task to start a conversation about creating the "taxonomy of topics". I am happy to modify it based on what we discussed yesterday and assign it to you!

@Miriam Sounds good! Yeah, we have two layers of work that I think need to happen: first, examining the topic areas for which we have signal from the Grants space and seeing if we can build for those (which @Rmaung has the most recent data on), and then thinking about how we can gain the most insight from the communities we do have, to improve the overall data model so that it reflects on- and off-wiki organizing beyond enwiki. I am going to work with Isaac next week to figure out how complex some of the methodologies might be, and propose a timeline or process.

Summary of some data analysis I did to evaluate the current topic taxonomy and gather some thoughts about potential changes (Google Doc with more data/notes):

  • We'll have to do some cleaning up of the WikiProject->Topic mapping as WikiProject names etc. have shifted since it was created in 2020. For example, WikiProject Climate Change used to be a task force of WikiProject Environment (I think) and then became its own project so is not currently tracked in the data. This seems pretty doable though as a one-time manual pass and allocation of larger WikiProjects to specific topics.
  • Big changes that I think we are pretty certain about:
    • Shifting of geographic topics to a country-based model. This will allow for more granularity than the current regions and will incorporate data from Wikidata, so it builds on that community work.
    • Shifting of model-based outputs for people (biography/women topics) to a Wikidata-based output (deterministic, based on instance-of:human and gender properties). This will lose some of the hazier, women-related topics that the model-based women topic could surface, but it will be clearer (less likely to provide problematic predictions), and we will see about addressing some of this change with the new topics.
  • A number of small changes to the arts/science topics -- e.g., perhaps merge a few categories that get low usage and have low coverage.
  • The larger discussion will be around how to handle some of the existing history/society topics and what topics are possible for folks engaged in sustainability and human rights work.
  • Expanding the data pipeline to incorporate WikiProjects from other language editions wouldn't have a large effect at the moment (most major WikiProjects with coverage of non-English articles are for geographic/biographical topics, and only a few are in areas where we probably do need more diverse data, like history/society topics). But this might be useful for certain topics if we have low data volume/diversity from English and we know there are relevant WikiProjects in other language editions supported by PageAssessments.
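
To make the Wikidata-based people output above concrete, here is a minimal sketch of what a deterministic rule could look like. It is only an illustration under assumptions: the function name, the topic labels ("biography", "women"), and the simplified input shape are hypothetical and not the actual pipeline. The Wikidata identifiers are the standard ones (P31 = instance of, Q5 = human, P21 = sex or gender, Q6581072 = female); which gender values should map to the women topic is an open choice, not something decided here.

```python
# Hypothetical sketch: topic labels and the input shape are assumptions,
# not the production pipeline.
INSTANCE_OF = "P31"    # Wikidata property "instance of"
SEX_OR_GENDER = "P21"  # Wikidata property "sex or gender"
HUMAN = "Q5"           # Wikidata item "human"
# Assumed set of values that map to the women topic; extend as needed.
WOMEN_VALUES = {"Q6581072"}  # "female"


def people_topics(claims):
    """Return people-related topics for one Wikidata item.

    `claims` maps a property ID to a list of item IDs, for example
    {"P31": ["Q5"], "P21": ["Q6581072"]}. This is a simplified shape,
    not the raw Wikidata JSON.
    """
    topics = []
    # Only items that are an instance of "human" get people topics.
    if HUMAN in claims.get(INSTANCE_OF, []):
        topics.append("biography")
        # The women topic is assigned only from the explicit gender claim.
        if WOMEN_VALUES & set(claims.get(SEX_OR_GENDER, [])):
            topics.append("women")
    return topics
```

Because the rule reads only explicit claims, an article about a women-related subject that is not itself a person gets neither topic, which is the loss of "hazier" coverage noted above.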
Miriam renamed this task from Brainstorm taxonomy for "impactful topics" to Build a taxonomy for "impactful topics". Thu, May 9, 2:15 PM
Miriam updated the task description.

As a brief follow-up note to @Isaac above: I am currently reviewing the data collected by Isaac and comparing it with the reported use of WikiProjects and other topical collaborations in community reporting areas such as Diff and This Month in GLAM, to better understand which topical networks would be most prepared for the conversation about "history/society" and other topics like climate/biodiversity/sustainability identified by Isaac. I currently have a sketched timeframe for targeted data modeling discussions about the rebuild in Q2 of FY24-25.

Hi @Astinson thanks!! Double checking something! Is this work going to be folded under WE1.1.3 (The hypothesis text seems pretty aligned with what we are trying to achieve)? If not, should we also take that work into account when designing the topic taxonomy? @MMulaudzi-WMF CC

@Miriam yep, exactly: the Q2 discussions should be part of 1.1.3, and some Q1 awareness building and WikiProject identification are covered in 1.1.2. The idea is that these consultations/outreach processes feed into each other in a way that keeps us from doing too many parallel outreach moments for different related things (from the perspective of organizers/editors).

Wonderful, thank you @Astinson! Is it OK if I assign this task to you and Isaac for now?