
Support for topic infrastructure work
Open, Needs Triage, Public

Description

I've (@Isaac) been doing a variety of tasks in support of our topic infrastructure (with an eye towards our recommender systems and supporting campaigns) that will be captured here.

Topic infrastructure is a catch-all term for the ecosystem of filters we might provide for subsetting content, especially within our recommender systems. This currently includes the ORES topics but will hopefully be expanded to include countries, Wikidata-based people tags, WikiProjects, quality scores (and potentially related features like image/reference counts), and perhaps others as needs arise.

Some expected tasks:

  • Formalize vision sketched out in https://docs.google.com/document/d/1qp-NPbP1pT7S2_VC9wCMIndpDEzMSPj1c62eOcTxbjc/edit?usp=sharing
  • Support hypothesis generation during Annual Planning
  • Build buy-in from the various teams that would use these features (Campaigns, Language, Growth, Apps, Community Growth)
  • Coordinate with teams that this would depend on (ML Platform, Search)
  • Guide technical changes that standardize existing recommender systems so they make better use of the topic infrastructure hosted in the Search index.

Event Timeline

Weekly updates:

  • Asked for input from Search on adding the different topic tags we're considering (countries, quality, WikiProjects): https://etherpad.wikimedia.org/p/recsys-search-tags-future
  • Talked with Inuka Team about challenges/opportunities in this space as they consider potential projects to take on
  • Took part in discussions with EH at Wikimedia Uruguay and others about their new templates for WikiProjects, which automatically find tasks to surface to editors: https://es.wikipedia.org/wiki/Wikiproyecto:Cambio_clim%C3%A1tico. This is an exciting replication of the infrastructure that Growth built for the Newcomer Homepage, but done by community members within the WikiProject context. It's further motivation for adding WikiProject tags to Search: without them, it's much harder to use our structured task filters (add-a-link; add-an-image) because no single query can filter by both WikiProject and task availability.
  • Began exploring the feasibility of a geography model on LiftWing. Ascertained that there could be key-value store support in the future that might be useful (if we use links to infer countries, we'll need to quickly look up the countries associated with each article link). In the meantime, it should be easy to grab an item's Wikidata JSON and check the country-related properties, as we do with the culture metrics.
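
A minimal sketch of the "grab an item's Wikidata JSON and check the country-related properties" step, in Python. It assumes the entity JSON has already been fetched and follows the standard Wikidata layout (`claims` -> property ID -> statements -> `mainsnak`). The property set here (P17 country, P27 country of citizenship, P495 country of origin) is only illustrative and is not necessarily the set used by the culture metrics; the function name `country_qids` is hypothetical.

```python
# Assumption: an illustrative set of country-related Wikidata properties.
# The culture metrics code may well use a different or larger set.
COUNTRY_PROPERTIES = {
    "P17": "country",
    "P27": "country of citizenship",
    "P495": "country of origin",
}


def country_qids(entity, properties=COUNTRY_PROPERTIES):
    """Return the set of QIDs that a Wikidata entity points to via country-related properties.

    `entity` is a dict in the standard Wikidata entity JSON layout.
    """
    found = set()
    claims = entity.get("claims", {})
    for pid in properties:
        for statement in claims.get(pid, []):
            # Skip statements that editors have marked as deprecated.
            if statement.get("rank") == "deprecated":
                continue
            snak = statement.get("mainsnak", {})
            # "somevalue" / "novalue" snaks carry no datavalue to read.
            if snak.get("snaktype") != "value":
                continue
            value = snak.get("datavalue", {}).get("value", {})
            if isinstance(value, dict) and "id" in value:
                found.add(value["id"])
    return found
```

Calling `country_qids(entity)` on an item whose P17 statement points to Q17 (Japan) would return a set containing `"Q17"`; mapping QIDs to country names or regions is left out of the sketch.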

Weekly updates:

  • I put forth a draft hypothesis for next year related to a country-level article prediction model: If we build a country-level inference model for Wikipedia articles, we will be able to filter lists of articles to those about a specific region with >70% precision and >50% recall. I also talked with Fabian about this, and it would be easy to pull in the existing cultural/geographic code that infers countries from Wikidata properties. To take it a step further and cover articles that have no Wikidata item or an incomplete one, or geographic aspects that Wikidata does not really cover (e.g., the geographic extent of flora/fauna), I'd want to do some inference based on the country topics of the links in an article. Doing this online would be challenging: latency would likely be high because many linked articles need to be evaluated at once. A cache of article predictions could be used to evaluate the links, but that brings its own challenges, such as cache invalidation. Because the intent is to load the model predictions into the Search index as weighted tags, however, we can probably use the Search APIs to gather the country predictions for an article's links (analogous example for articletopic for en:Japanese_iris) and infer from there. This is nice because the Search index will always have up-to-date information, so we won't need to store this source of truth in multiple places.
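
A rough sketch of the link-based inference and of how the precision/recall targets in the hypothesis could be checked, in Python. It assumes the country predictions for an article's links have already been gathered (e.g., from the Search index) into a mapping of link title to per-country scores in [0, 1]. The function names, the simple averaging scheme, and the `min_share` threshold are all hypothetical choices for illustration, not a settled design.

```python
from collections import defaultdict


def infer_countries_from_links(link_countries, min_share=0.25):
    """Aggregate link-level country scores into article-level scores.

    `link_countries` maps each linked article title to {country: score}.
    A country is kept if its average score across all links is at least
    `min_share` (an arbitrary illustrative threshold).
    """
    if not link_countries:
        return {}
    totals = defaultdict(float)
    for scores in link_countries.values():
        for country, score in scores.items():
            totals[country] += score
    n_links = len(link_countries)
    return {
        country: total / n_links
        for country, total in totals.items()
        if total / n_links >= min_share
    }


def precision_recall(predicted, actual):
    """Precision and recall of a predicted set of articles against a labeled set."""
    true_positives = len(predicted & actual)
    precision = true_positives / len(predicted) if predicted else 0.0
    recall = true_positives / len(actual) if actual else 0.0
    return precision, recall


def meets_hypothesis(predicted, actual, min_precision=0.7, min_recall=0.5):
    """Check the >70% precision and >50% recall targets from the draft hypothesis."""
    precision, recall = precision_recall(predicted, actual)
    return precision > min_precision and recall > min_recall
```

Here `predicted` would be the set of articles the model tags with a given region and `actual` a hand-labeled set for that region. Averaging over links is only one option; weighting links by position or frequency in the article might work better and would need evaluation.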