Support for topic infrastructure work
Open, Needs Triage, Public

Description

I've (@Isaac) been doing a variety of tasks in support of our topic infrastructure (with an eye towards our recommender systems and supporting campaigns) that will be captured here.

Topic infrastructure is a catch-all to capture the ecosystem of filters we might provide for subsetting content, especially within our recommender systems. This currently includes the ORES topics but will hopefully be expanded to include countries, Wikidata-based people tags, WikiProjects, quality scores (and potentially related features like image/reference counts), and perhaps others based on needs.

Some expected tasks:

  • Formalize vision sketched out in https://docs.google.com/document/d/1qp-NPbP1pT7S2_VC9wCMIndpDEzMSPj1c62eOcTxbjc/edit?usp=sharing
  • Support hypothesis generation during Annual Planning
  • Build buy-in from the various teams that would use these features (Campaigns, Language, Growth, Apps, Community Growth)
  • Coordinate with teams that this would depend on (ML Platform, Search)
  • Guide technical changes to standardize existing recommender systems so they make better use of the topic infrastructure hosted in the Search index.

Event Timeline

Weekly updates:

  • Asked for input from Search on adding in the different topic tags we're considering (countries, quality, wikiprojects): https://etherpad.wikimedia.org/p/recsys-search-tags-future
  • Talked with Inuka Team about challenges/opportunities in this space as they consider potential projects to take on
  • Part of discussions with EH at Wikimedia Uruguay and others around their new templates for WikiProjects, which automatically find tasks to surface to editors: https://es.wikipedia.org/wiki/Wikiproyecto:Cambio_clim%C3%A1tico. This is an exciting replication of the infrastructure that Growth has worked on for Newcomer Homepage, but built by community members within the WikiProject context. It's further motivation for adding WikiProject tags to Search: without them, it's much harder to use our structured task filters (add-a-link; add-an-image) because there's no single query that filters by both WikiProject and task availability.
  • Began exploring feasibility of geography model on LiftWing. Ascertained that there could be key-value store support in the future that might be useful (if we use links to infer countries, we'll need to quickly look up the countries associated with each linked article). In the meantime, it should be easy to grab an item's Wikidata JSON and check the country-related properties as we do with the culture metrics.
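
To make the Wikidata-JSON check concrete, here is a minimal sketch of reading country-related claims from an item's entity JSON. The property list (P17 country, P27 country of citizenship, P495 country of origin) is an assumption for illustration and may not match what the culture metrics actually use; the function name is hypothetical.

```python
from typing import Set

# Assumed property list for illustration only: country, country of
# citizenship, country of origin. The real list may be longer or different.
COUNTRY_PROPERTIES = ("P17", "P27", "P495")


def countries_from_entity(entity: dict) -> Set[str]:
    """Return the QIDs that an item's country-related claims point to."""
    countries = set()
    claims = entity.get("claims", {})
    for prop in COUNTRY_PROPERTIES:
        for claim in claims.get(prop, []):
            snak = claim.get("mainsnak", {})
            if snak.get("snaktype") != "value":
                continue  # skip "novalue" / "somevalue" statements
            qid = snak.get("datavalue", {}).get("value", {}).get("id")
            if qid:
                countries.add(qid)
    return countries
```

The input is expected to be a single entity in the standard Wikidata JSON layout (a `claims` mapping from property ID to a list of statements), so no key-value store is needed for this step.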

Weekly updates:

  • I put forth a draft hypothesis for next year related to a country-level article prediction model: If we build a country-level inference model for Wikipedia articles, we will be able to filter lists of articles to those about a specific region with >70% precision and >50% recall. I had a conversation with Fabian about this too and it'd be easy to pull in the cultural/geographic code that currently exists for inferring countries based on Wikidata properties. To take it a step further and cover articles without Wikidata items or with incomplete items or for geographic aspects that are not really covered in Wikidata -- e.g., geographic extent of flora/fauna -- I'd want to do some inference based on the country topics of the links in an article. Doing this online would be challenging (likely high latency as you'd need to evaluate many articles at once). There are ways to build a cache of predictions for articles and use that for evaluating the links but then you run into challenges with cache invalidation etc. Because the intent is to load the model predictions into the Search index as weighted tags, however, we can actually probably use the Search APIs to gather the country predictions for an article's links (analogous example for articletopic for en:Japanese_iris) and infer from there. This is nice because the Search index will always have up-to-date information and so we won't need to store this source of truth in multiple places.
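
As a rough illustration of how the hypothesis thresholds could be checked, the sketch below computes precision and recall over predicted versus labeled country sets. Micro-averaging across articles is my assumption here, not something the hypothesis specifies, and the function names are hypothetical.

```python
from typing import Dict, Set, Tuple


def micro_precision_recall(
    predicted: Dict[str, Set[str]], actual: Dict[str, Set[str]]
) -> Tuple[float, float]:
    """Micro-averaged precision/recall over (article, country) pairs."""
    tp = fp = fn = 0
    for title, truth in actual.items():
        guess = predicted.get(title, set())
        tp += len(guess & truth)   # countries predicted and labeled
        fp += len(guess - truth)   # countries predicted but not labeled
        fn += len(truth - guess)   # countries labeled but not predicted
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return precision, recall


def meets_hypothesis(precision: float, recall: float,
                     min_precision: float = 0.70,
                     min_recall: float = 0.50) -> bool:
    """Thresholds from the draft hypothesis: >70% precision, >50% recall."""
    return precision > min_precision and recall > min_recall
```

A per-country breakdown (one precision/recall pair per country) would be closer to the "filter a list by region" use case, at the cost of needing enough labeled articles per country.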

Weekly updates:

  • ML Platform and Search Platform indicated that my plans were fine for the article-country hypothesis and they can support deployment. In particular, EB on Search indicated that the broader expansion of tags on the Search index for recommender systems shouldn't pose any issues.
  • Put together basic API for using just the Wikidata properties: https://wiki-topic.toolforge.org/countries
  • Good meeting put together by Miriam in which we charted out that Community Growth could do some outreach to get feedback on the current topic taxonomy and we'd work to make updates based on that but then try to freeze the taxonomy.

As AP currently stands, I'm quite happy as we're seeing progress on almost all fronts and good coordination thus far:

  • Pre-defined topics:
    • I'll be working on extending the geography portion of topic infrastructure
    • Community Growth will work on determining what changes to make to the existing topic infrastructure (which I'll implement in following quarters)
    • Content Translation is working on incorporating topic filters so they should be able to connect in with our existing topics
  • Tasks:
    • Inuka will be focusing on experimenting with the task aspect of recommender systems as it relates to campaigns
    • Martin is also considering the orphaned-article task and Growth is looking at expanding exposure of structured tasks
    • iOS is investigating an alt-text task
  • WikiProjects:
    • Campaigns will begin exploration of surfacing WikiProjects within Event Lists to see how they can connect in with Campaigns
    • Community Growth is intending to start studying what makes for successful WikiProjects
  • List-building:
    • Community Growth is going to test the list-building tool with a few communities

Weekly updates:

  • Added basic form of link-based inference to the country-article prototype -- e.g., https://wiki-region.wmcloud.org/regions?lang=en&title=Japanese%20iris. That was a final feasibility check for me and I'm going to pause on development for that for now until Q1 begins. The next steps for when I pick back up that work:
    • Evaluation:
      • Offline: probably a large stratified sample by geo + language edition to test link-based logic -- i.e. whether it can reproduce what's already in Wikidata. I think I should be able to easily re-write the API logic to use the cluster instead so it's fast to test/iterate.
      • Human: a small sample of articles with Wikidata properties to just verify that those are indeed accurate and complete when present but I think it's fair to assume ~100% precision/recall for those. Focus then would be on articles lacking Wikidata-based country properties. For those, just have folks go through the corresponding Wikipedia article and tag with any relevant countries. Might need to stratify by continent to make sure even-ish coverage but I want to keep the sample size manageable.
    • Guardrails (how to handle links):
      • Motivating challenge here is something like the biodiversity articles -- e.g., https://en.wikipedia.org/wiki/Limonium_strictissimum. This plant is native to Italy/France, which is mentioned in the article, so ideally those two countries would be predicted. There are actually more links, however, to US/UK because many of the orgs linked to in the Taxonbar at the end of the article who track information about plant species are based in those countries.
      • Why it's not trivial to fix: we use the pagelinks API to get info on links because it can easily be run as a generator so with a single API call we can get all the links and their corresponding Wikidata IDs (for looking up countries associated with each). So we can't, e.g., exclude links based on how they're presented in the page. We could in theory maintain a list of links to ignore based on how many articles they're present in -- the list is probably not super long and would be effective at filtering out these sorts of links but it's also an additional layer of complexity.
      • My current approach is two-fold:
        • I do apply a tf-idf transformation (code) to the link proportions for each country so, e.g., a few links to Ecuador will be treated as a stronger signal than a few links to the US. This helps a bit with the US/UK problem (also dampens France quite a bit).
        • I require a minimum count (3) and minimum proportion (0.25) of links in order to elevate a linked country to a prediction (code). This was aimed pretty directly at the taxonbar issue but I'm sure it could be fine-tuned. The challenge is balancing a requirement of enough support to be "real" without making the bar too high for stub articles to exceed. The minimum proportion part also makes it hard for articles that are relevant to many countries to ever reach the threshold, which I don't love but also might be acceptable behavior. For example, the WWII article is certainly relevant to many countries but isn't necessarily a useful result if you're filtering by country to find content to edit.
      • Another possible guardrail is restricting where we apply the link-based logic. One approach could be to run the code only for those articles lacking coordinates / any Wikidata-geo-property. This would reduce the possibility of false positives and probably let us better fine-tune the model to articles in topics lacking Wikidata properties about countries. It might confuse things for the end-user but would also improve latency and maybe nudge editors to improve Wikidata if they find issues with the predictions.
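
The two-fold approach above (tf-idf weighting plus minimum count and proportion) can be sketched roughly as follows. This is not the prototype's actual code: the idf formula, the choice to apply the 0.25 threshold to the tf-idf-weighted share, and the document-frequency inputs are all assumptions for illustration. Only the thresholds 3 and 0.25 come from the update above.

```python
import math
from collections import Counter
from typing import Dict, Iterable

MIN_LINK_COUNT = 3      # minimum number of links to a country (from the update)
MIN_PROPORTION = 0.25   # minimum share of link weight (from the update)


def predict_countries(
    link_countries: Iterable[str],
    country_doc_freq: Dict[str, int],
    total_docs: int,
) -> Dict[str, float]:
    """Return {country: weighted share} for countries that pass both guardrails.

    link_countries: one entry per outgoing link that maps to a country.
    country_doc_freq: hypothetical count of articles associated with each
        country; must contain every country in link_countries.
    """
    counts = Counter(link_countries)
    # Assumed tf-idf form: raw link count times log inverse document frequency,
    # so links to rarely linked countries carry more weight.
    weighted = {
        country: n * math.log(total_docs / country_doc_freq[country])
        for country, n in counts.items()
    }
    total_weight = sum(weighted.values())
    if total_weight == 0:
        return {}
    predictions = {}
    for country, n in counts.items():
        share = weighted[country] / total_weight
        if n >= MIN_LINK_COUNT and share >= MIN_PROPORTION:
            predictions[country] = share
    return predictions
```

With made-up frequencies where the US is far more commonly linked than Italy or France, four US links from a Taxonbar can fall below the proportion threshold while three links each to Italy and France pass, which is the behavior wanted for the Limonium strictissimum case. The same thresholds will also suppress every country for articles whose links are spread thinly across many countries.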

Weekly updates (adding early while it's fresh):

  • In discussion with Alex, I put together some data and thoughts on the next iteration of the topic classification model, focusing on which steps Community Growth should lead in the community consultations on improving it (google doc). Summary:
    • We'll have to do some cleaning up of the WikiProject->Topic mapping as WikiProject names etc. have shifted since it was created in 2020. This seems pretty doable though.
    • Big changes we both agree on are the shifting of geographic topics to a country-based model and shifting of model-based outputs for biography/gender to a Wikidata-based output (deterministic).
    • A number of small changes to the arts/science topics -- e.g., perhaps merge a few categories that get low usage and have low coverage.
    • The larger discussion will be around how to handle some of the existing history/society topics and what topics are possible for folks engaged in sustainability and human rights work.
    • Expanding the data pipeline to incorporate WikiProjects from other language editions wouldn't have a large effect at the moment (most major wikiprojects with coverage of non-English articles are for geographic/biographical topics and only a few are in areas where we probably do need more diverse data like history/society topics).

Thanks so much for the updates, @Isaac !! Should we decline T343241 as duplicate? Or maybe add it here as subtask and change the scope?
@Rmaung CC as we were talking about this yesterday!

@Miriam I don't mind either way but I'll be bold. This is my quarterly goal task so it touches on the topic classification evolution but also other related aspects, and I mainly see it as a personal tracking task that I intend to close out at the end of this quarter. I think the best thing would be to make this a subtask of T343241 (as in I'm playing a supporting role for the taxonomy work) and I'll shift my updates over there when they're about the topic taxonomy.

Not too much movement in this space this week on my end though some additional tasks that I signed up for:

  • Helping with upcoming Community Growth contract hiring
  • Review for T363022

Did have a nice discussion with AS about possibilities for centralized, dynamic list storage -- i.e. the glue that holds together many of these pieces. PageAssessments and the WikiProject structure of English Wikipedia are one approach (tagging every article on its talk page with the relevant WikiProjects via templates that talk with PageAssessments and help it keep its tables up-to-date). Mentally sketched out another possible interaction style where there's a special type of page on wiki that has support for building tables similar to WikiProject Women in Red's classic topic lists (example). The page would be a table with one row per article in a given WikiProject's worklist (or event worklist). You could add standard columns like pageviews or Wikidata properties or tasks available that would be handled by the software. Maybe even columns like quality or importance that could be updated with new assessments. Rows (new articles) could be added manually or via tools like list-building or PagePile or others. PageAssessments could still be used to track all the articles in that page and assign them to whatever WikiProject or Event owns it (so it'd be easy to access the worklists in other settings).
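
Purely as a thinking aid, the worklist-page idea could be modeled as a small data structure like the one below. Every class, field, and method name here is hypothetical; nothing like this exists in MediaWiki or PageAssessments today.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Optional


@dataclass
class WorklistRow:
    """One article in a worklist; software-managed columns are optional."""
    title: str
    pageviews: Optional[int] = None
    wikidata_properties: Dict[str, str] = field(default_factory=dict)
    tasks_available: List[str] = field(default_factory=list)
    quality: Optional[str] = None      # could be refreshed by new assessments
    importance: Optional[str] = None


@dataclass
class Worklist:
    """A table owned by a WikiProject or event, one row per article."""
    owner: str
    rows: Dict[str, WorklistRow] = field(default_factory=dict)

    def add_article(self, title: str) -> WorklistRow:
        # Rows may come from manual edits or from tools; adding an
        # existing title returns the existing row rather than a duplicate.
        return self.rows.setdefault(title, WorklistRow(title=title))

    def with_task(self, task: str) -> List[str]:
        """Titles in this worklist that have a given structured task available."""
        return sorted(
            title for title, row in self.rows.items()
            if task in row.tasks_available
        )
```

The `with_task` lookup is the kind of combined "worklist plus task availability" filter that is hard to express as a single query today.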

Weekly updates:

  • Put together a task (T366273) and meta page for my Q1 hypothesis of the article-country model. Hypotheses will be officially shared in the next few weeks and the request was put out for Meta pages with additional information for interested community members.
  • Provided feedback on Campaigns' worklist -> editor invitation candidates scoring approach (T363022)
  • Shared some thoughts on topic infrastructure with Inuka team