Support for topic infrastructure work
Open, High, Public

Description

I've (@Isaac) been doing a variety of tasks in support of our topic infrastructure (with an eye towards our recommender systems and supporting campaigns) that will be captured here.

Topic infrastructure is a catch-all to capture the ecosystem of filters we might provide for subsetting content, especially within our recommender systems. This currently includes the ORES topics but will hopefully be expanded to include countries, Wikidata-based people tags, WikiProjects, quality scores (and potentially related features like image/reference counts), and perhaps others based on needs. It also touches on building our understanding of how this can relate to WikiProjects and the associated list-building tooling.

October - December 2024:

  • Work with Community Growth on V2 of topic taxonomy: T343241
  • Support productionization of article-country model: T371897
  • Incorporate any feedback from list-building work being led by Community Growth
  • Support for streamlining existing recommendation systems so they are better suited to incorporate topics -- e.g., T367873 for Content Translation and T340854 for Suggested Edits

July - September 2024:

  • Implementing article-country model: T369120
  • Improving the topic taxonomy: T343241
  • Support for list-building -- e.g., T368713

March - June 2024:

  • Formalize vision sketched out in https://docs.google.com/document/d/1qp-NPbP1pT7S2_VC9wCMIndpDEzMSPj1c62eOcTxbjc/edit?usp=sharing
  • Support hypothesis generation during Annual Planning
  • Build buy-in from the various teams that would use these features (Campaigns, Language, Growth, Apps, Community Growth)
  • Coordinate with teams that this would depend on (ML Platform, Search)
  • Guide technical changes that standardize existing recommender systems so they make better use of the topic infrastructure hosted in the Search index.

Event Timeline

Weekly updates:

  • Asked for input from Search on adding in the different topic tags we're considering (countries, quality, wikiprojects): https://etherpad.wikimedia.org/p/recsys-search-tags-future
  • Talked with Inuka Team about challenges/opportunities in this space as they consider potential projects to take on
  • Part of discussions with EH at Wikimedia Uruguay and others around their new templates for WikiProjects, which automatically find tasks to surface to editors: https://es.wikipedia.org/wiki/Wikiproyecto:Cambio_clim%C3%A1tico. This is an exciting replication of the infrastructure that Growth has worked on for Newcomer Homepage but by community members within the WikiProject context. It's further motivation for adding WikiProject tags to Search as well because without that, it's much harder to use our structured task filters (add-a-link; add-an-image) because there's no single query that filters by Wikiproject and task availability.
  • Began exploring feasibility of geography model on LiftWing. Ascertained that there could be key-value store support in the future that might be useful (if we use links to infer countries, we'll need to quickly look up the associated countries with each article link). In the meantime, it should be easy to grab an item's Wikidata JSON and check the country-related properties, as we do with the culture metrics.
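
The Wikidata-based check in the last bullet could look roughly like the following. This is a minimal sketch, not the code used for the culture metrics: the property list (P17 country, P27 country of citizenship, P495 country of origin) and the function name are assumptions chosen for illustration, and the entity dict is assumed to follow the standard Wikidata entity JSON layout.

```python
# Hypothetical property list; the real pipeline may check more or different properties.
COUNTRY_PROPERTIES = ("P17", "P27", "P495")


def country_qids(entity, properties=COUNTRY_PROPERTIES):
    """Return the sorted QIDs found as values of country-related claims."""
    found = set()
    claims = entity.get("claims", {})
    for prop in properties:
        for statement in claims.get(prop, []):
            snak = statement.get("mainsnak", {})
            # "novalue"/"somevalue" snaks carry no datavalue, so skip them.
            if snak.get("snaktype") != "value":
                continue
            value = snak.get("datavalue", {}).get("value", {})
            if isinstance(value, dict) and "id" in value:
                found.add(value["id"])
    return sorted(found)


# Tiny hand-written entity in the Wikidata JSON shape (illustrative only).
example_entity = {
    "claims": {
        "P17": [
            {"mainsnak": {"snaktype": "value",
                          "datavalue": {"value": {"id": "Q17"},
                                        "type": "wikibase-entityid"}}}
        ],
        "P495": [{"mainsnak": {"snaktype": "novalue"}}],
    }
}
print(country_qids(example_entity))
```

Working on the already-fetched JSON keeps the lookup free of extra network calls, which matters if the same logic later runs inside a model server.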

Weekly updates:

  • I put forth a draft hypothesis for next year related to a country-level article prediction model: If we build a country-level inference model for Wikipedia articles, we will be able to filter lists of articles to those about a specific region with >70% precision and >50% recall. I had a conversation with Fabian about this too and it'd be easy to pull in the cultural/geographic code that currently exists for inferring countries based on Wikidata properties. To take it a step further and cover articles without Wikidata items or with incomplete items or for geographic aspects that are not really covered in Wikidata -- e.g., geographic extent of flora/fauna -- I'd want to do some inference based on the country topics of the links in an article. Doing this online would be challenging (likely high latency as you'd need to evaluate many articles at once). There are ways to build a cache of predictions for articles and use that for evaluating the links but then you run into challenges with cache invalidation etc. Because the intent is to load the model predictions into the Search index as weighted tags, however, we can actually probably use the Search APIs to gather the country predictions for an article's links (analogous example for articletopic for en:Japanese_iris) and infer from there. This is nice because the Search index will always have up-to-date information and so we won't need to store this source of truth in multiple places.
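
For reference, the precision and recall targets in the hypothesis could be checked as below. This is a sketch that assumes micro-averaging over (article, country) pairs; the hypothesis text does not specify the averaging scheme, and the function name and sample data are made up for illustration.

```python
def micro_precision_recall(predicted, actual):
    """predicted/actual: dict mapping article title -> set of country labels."""
    true_pos = pred_total = actual_total = 0
    for title in set(predicted) | set(actual):
        pred = predicted.get(title, set())
        gold = actual.get(title, set())
        true_pos += len(pred & gold)
        pred_total += len(pred)
        actual_total += len(gold)
    precision = true_pos / pred_total if pred_total else 0.0
    recall = true_pos / actual_total if actual_total else 0.0
    return precision, recall


# Made-up predictions and labels for three placeholder articles.
predicted = {"A": {"Japan"}, "B": {"France", "Italy"}, "C": set()}
actual = {"A": {"Japan"}, "B": {"Italy"}, "C": {"Peru"}}
precision, recall = micro_precision_recall(predicted, actual)

# Thresholds from the hypothesis: >70% precision and >50% recall.
meets_target = precision > 0.70 and recall > 0.50
print(round(precision, 3), round(recall, 3), meets_target)
```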

Weekly updates:

  • ML Platform and Search Platform indicated that my plans were fine for the article-country hypothesis and they can support deployment. In particular, EB on Search indicated that the broader expansion of tags on Search index for recommender systems shouldn't pose any issues.
  • Put together basic API for using just the Wikidata properties: https://wiki-topic.toolforge.org/countries
  • Good meeting put together by Miriam in which we charted out that Community Growth could do some outreach to get feedback on the current topic taxonomy and we'd work to make updates based on that but then try to freeze the taxonomy.

As AP currently stands, I'm quite happy as we're seeing progress on almost all fronts and good coordination thus far:

  • Pre-defined topics:
    • I'll be working on extending geography portion of topic infrastructure
    • Community Growth will work on determining what changes to make to the existing topic infrastructure (which I'll implement in following quarters)
    • Content Translation is working on incorporating topic filters so they should be able to connect in with our existing topics
  • Tasks:
    • Inuka will be focusing on experimenting with the task aspect of recommender systems as it relates to campaigns
    • Martin is also considering the orphaned-article task and Growth is looking at expanding exposure of structured tasks
    • iOS is investigating an alt-text task
  • WikiProjects:
    • Campaigns will begin exploration of surfacing WikiProjects within Event Lists to see how they can connect in with Campaigns
    • Community Growth is intending to start studying what makes for successful WikiProjects
  • List-building:
    • Community Growth is going to test the list-building tool with a few communities

Weekly updates:

  • Added basic form of link-based inference to the country-article prototype -- e.g., https://wiki-region.wmcloud.org/regions?lang=en&title=Japanese%20iris. That was a final feasibility check for me and I'm going to pause on development for that for now until Q1 begins. The next steps for when I pick back up that work:
    • Evaluation:
      • Offline: probably a large stratified sample by geo + language edition to test link-based logic -- i.e. whether it can reproduce what's already in Wikidata. I think I should be able to easily re-write the API logic to use the cluster instead so it's fast to test/iterate.
      • Human: a small sample of articles with Wikidata properties to just verify that those are indeed accurate and complete when present but I think it's fair to assume ~100% precision/recall for those. Focus then would be on articles lacking Wikidata-based country properties. For those, just have folks go through the corresponding Wikipedia article and tag with any relevant countries. Might need to stratify by continent to make sure even-ish coverage but I want to keep the sample size manageable.
    • Guardrails (how to handle links):
      • Motivating challenge here is something like the biodiversity articles -- e.g., https://en.wikipedia.org/wiki/Limonium_strictissimum. This plant is native to Italy/France, which is mentioned in the article, so ideally those two countries would be predicted. There are actually more links, however, to US/UK because many of the orgs linked to in the Taxonbar at the end of the article who track information about plant species are based in those countries.
      • Why it's not trivial to fix: we use the pagelinks API to get info on links because it can easily be run as a generator so with a single API call we can get all the links and their corresponding Wikidata IDs (for looking up countries associated with each). So we can't e.g., exclude links based on how they're presented in the page. We could in theory maintain a list of links to ignore based on how many articles they're present in -- the list is probably not super long and would be effective at filtering out these sorts of links but it's also an additional layer of complexity.
      • My current approach is two-fold:
        • I do apply a tf-idf transformation (code) to the link proportions for each country so e.g., a few links to Ecuador will be treated as a stronger signal than a few links to the US. This helps a bit with the US/UK problem (also dampens France quite a bit).
        • I require a minimum count (3) and minimum proportion (0.25) of links in order to elevate a linked country to a prediction (code). This was aimed pretty directly at the taxonbar issue but I'm sure it could be fine-tuned. The challenge is balancing a requirement of enough support to be "real" without making the bar too high for stub articles to exceed. The minimum proportion part also makes it hard for articles that are relevant to many countries to ever reach the threshold, which I don't love but also might be acceptable behavior. For example, the WWII article is certainly relevant to many countries but isn't necessarily a useful result if you're filtering by country to find content to edit.
      • Another possible guardrail is restricting where we apply the link-based logic. One approach could be only running the code for those articles lacking coordinates / any Wikidata-geo-property? This would reduce the possibility of false positives and probably let us better fine-tune the model to articles in topics lacking Wikidata properties about countries. Might confuse things for the end-user but also better latency and maybe nudges editors to improve Wikidata if they find issues with the predictions.
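
The two-fold guardrail described above (tf-idf weighting plus minimum count and proportion) can be sketched as follows. This is not the prototype's code: the idf formula, the document frequencies, and the choice to compute the proportion on idf-weighted counts are all assumptions made for illustration. Only the thresholds (3 links, 0.25 proportion) come from the text.

```python
import math
from collections import Counter

MIN_LINK_COUNT = 3      # minimum number of links to a country (from the text)
MIN_PROPORTION = 0.25   # minimum share of the weighted link signal (from the text)


def idf_weights(doc_freq, n_articles):
    """Inverse document frequency: rarely linked countries get larger weights."""
    return {c: math.log(n_articles / df) for c, df in doc_freq.items()}


def predict_countries(link_countries, idf):
    """link_countries: one country label per outgoing link that maps to a country."""
    counts = Counter(link_countries)
    weighted = {c: n * idf.get(c, 0.0) for c, n in counts.items()}
    total = sum(weighted.values())
    if total == 0:
        return []
    return sorted(
        c for c, n in counts.items()
        if n >= MIN_LINK_COUNT and weighted[c] / total >= MIN_PROPORTION
    )


# Made-up document frequencies: the US and UK are linked from many articles.
idf = idf_weights({"US": 500, "UK": 300, "France": 100, "Italy": 50}, n_articles=1000)

# A taxonbar-heavy plant article: more raw links to US/UK than to Italy/France.
links = ["US"] * 4 + ["UK"] * 3 + ["Italy"] * 3 + ["France"] * 2
print(predict_countries(links, idf))
```

With these made-up numbers the US and UK pass the count threshold but fall below the proportion threshold after weighting, while France misses the minimum count. That is the same tension noted above between requiring enough support and not setting the bar too high for short articles.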

Weekly updates (adding early while it's fresh):

  • I put together some data and thoughts around the next iteration of the topic classification model in discussion with Alex as far as what steps Community Growth should be leading to do the community consultations on making improvements to it (google doc). Summary:
    • We'll have to do some cleaning up of the WikiProject->Topic mapping as WikiProject names etc. have shifted since it was created in 2020. This seems pretty doable though.
    • Big changes we both agree on are the shifting of geographic topics to a country-based model and shifting of model-based outputs for biography/gender to a Wikidata-based output (deterministic).
    • A number of small changes to the arts/science topics -- e.g., perhaps merge a few categories that get low usage and have low coverage.
    • The larger discussion will be around how to handle some of the existing history/society topics and what topics are possible for folks engaged in sustainability and human rights work.
    • Expanding the data pipeline to incorporate WikiProjects from other language editions wouldn't have a large effect at the moment (most major wikiprojects with coverage of non-English articles are for geographic/biographical topics and only a few are in areas where we probably do need more diverse data like history/society topics).

Thanks so much for the updates, @Isaac !! Should we decline T343241 as duplicate? Or maybe add it here as subtask and change the scope?
@Rmaung CC as we were talking about this yesterday!

@Miriam I don't mind either way but I'll be bold. This is my quarterly goal task so it touches on the topic classification evolution but also other related aspects and I mainly see it as a personal tracking task that I intend to close out at the end of this quarter. I think the best thing would be to make this a subtask of T343241 (as in I'm playing a supporting role for the taxonomy work) and I'll shift my updates over there when they're about the topic taxonomy.

Not too much movement in this space this week on my end though some additional tasks that I signed up for:

  • Helping with upcoming Community Growth contract hiring
  • Review for T363022

Did have a nice discussion with AS about possibilities for centralized, dynamic list storage -- i.e. the glue that holds together many of these pieces. PageAssessments and the wikiproject structure of English Wikipedia is one approach (tagging every article on their talk page with the relevant wikiprojects via templates that talk with PageAssessments and help it keep its tables up-to-date). Mentally sketched out another possible interaction style where there's a special type of page on wiki that has support for building tables similar to WikiProject Women in Red's classic topic lists (example). The page would be a table with one row per article in a given WikiProject's worklist (or event worklist). You could add standard columns like pageviews or Wikidata properties or tasks available that would be handled by the software. Maybe even columns like quality or importance that could be updated with new assessments. Rows (new articles) could be added manually or via tools like list-building or PagePile or others. PageAssessments could still be used to track all the articles in that page and assign them to whatever WikiProject or Event owns it (so it'd be easy to access the worklists in other settings).

Weekly updates:

  • Put together a task (T366273) and meta page for my Q1 hypothesis of the article-country model. Hypotheses will be officially shared in the next few weeks and the request was put out for Meta pages with additional information for interested community members.
  • Provided feedback on Campaigns' worklist -> editor invitation candidates scoring approach (T363022)
  • Shared some thoughts on topic infrastructure with Inuka team

Weekly updates:

  • None (at ICWSM during the week)

Weekly updates:

  • Made sure that I had everything for my hypothesis prepared as it'll be published shortly on Meta (I do)
  • Attended first topic collaboration group, which raised a bunch of interesting questions about the different types of topical collaborations (wikiprojects, campaigns, events, etc.). I'll be presenting at the July rendition about the topic classification work and where ML models can support this work.

Weekly updates:

  • Supported interviews for contract role that will in part help with evaluating the list-building tool and related functionality
  • Updated Meta page for country hypothesis to include status updates and shared on Annual Plan: https://meta.wikimedia.org/wiki/Research:Language-Agnostic_Topic_Classification/Countries
  • Quality model is now hosted on LiftWing staging which is a big step towards having quality scores available in our infrastructure to use as an additional filter around list-building etc.

Weekly updates:

  • Left some feedback for Language as they begin their explorations for topics on Content Translation
  • Working on some clean-up steps for the quality model so it doesn't get stuck in staging on LiftWing
  • Otherwise waiting for Q1 to kick off to begin planning for eval of the article-country model

Weekly updates:

  • Presented on article-country model at Topical Collaboration Learning Circle. Pau reminded me of the importance of having clear criteria for the countries to present to community members, which exists in this README and is based on the ISO 3166-1 codes.
  • Deprecated GapFinder officially and provided code review on a major overhaul of the recommendation API on LiftWing by Santhosh to simplify the code-base, improve latency, and improve result quality (gerrit). He also has a separate patch for adding in the topic support, which was simple to add on top of his refactoring (gerrit).

Weekly updates:

  • Did some brainstorming and left some feedback about how we might actually make article translation lists accessible from tooling like Content Translation (T368713#9990651)
  • Categories that have Wikidata items with the "category combines topics" property (P971) are proving pretty useful for the article-country model (notebook). This is particularly exciting because categories historically are how the Wikimedia communities organize content and annotate similarities between articles but the network itself is so messy that it's very difficult to make use of categories within tooling (background). This modeling on Wikidata, however, provides a nice structured way of using categories that, because it depends on explicit Wikidata properties (as opposed to the category hierarchy), seems less prone to unintended consequences and is an approach that might be useful in other topic-related initiatives.

Weekly updates:

  • Continued to monitor and be very happy with Language's work to switch all of their recommendation functionality to the new LiftWing API (they moved section translation over this week)
  • Connected with Seddon briefly about Android's recommendation API usage (context: T338430) and my willingness to provide guidance when they port that code over as I've done for Language + Content Translation
  • Prepared data dump of all country predictions (+ topics courtesy of Muniza's new Airflow job) for all Wikipedia articles and made that public so other folks can use it if desired (data; README)
  • Built simple API for mapping WikiProjects to LiftWing topics for Community Wishlist exploration (details: T370951#10015459). This might help with making wikiprojects more discoverable to new editors.

Weekly updates:

  • Article country model: put together draft model card as pre-requisite towards starting conversations about hosting this model on LiftWing. Also uploaded notebooks that cover various aspects of the data pipeline.
  • Discussion with Evelin/Ilana/Alex/Felipe about WikiProjects (context of Spanish Wikipedia Climate Change WikiProject). One of the issues they're dealing with is how to store lists of references (e.g., new climate change reports) and the articles/topics they might be relevant to, and how to do this in a way that lets future editors easily discover these resources and use them for editing. I put forth a few ideas. We're not working on building this yet but it's interesting to ponder and some of it can build on the article-oriented topic infrastructure that we're working directly on:
    • Add the sources to Wikidata and set their main-subject property. Then have a tool that periodically caches the sources in a way that they'd be discoverable from articles related to the main-subject property. Some non-trivial challenges there but probably could make something work.
    • Build separate tool that uses AI to analyze the report and recommend relevant articles. Or use list-building tool or something similar for organizers to quickly build a list of relevant articles. Then store that somewhere in a structured way that lets it be surfaced to editors who are editing one of the listed articles.
    • Incorporate the sources into at least a few relevant articles. Then build generic tooling for recommending sources to editors that could also surface these sources. Some options for that:
      • Use current formula for e.g., images but apply to sources: for any given article, collect sources (this often means external URLs) in other language editions and surface the most prevalent that aren't used yet in that language edition. The drawback here is that presumably many of these sources are in other languages too so they might not be useful.
      • Use list-building tool (or even just the basic morelike functionality) to find similar articles to the one in question and aggregate sources from these.
      • Same as above but instead of aggregating sources (and therefore decontextualizing them), you instead filter the list to higher-quality articles and present them as potential examples that the editor can follow for improving their current article.
      • You aggregate top-used sources for the WikiProject and surface them to editors (example). This has the challenge that WikiProject is often too high-level of a topic area for a specific source.
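
The cross-language option above (surface the most prevalent sources that the target edition does not use yet) could be sketched like this. It is an illustration only: the function name, the input shape, and the example.org URLs are assumptions, and gathering each edition's external URLs is left out.

```python
from collections import Counter


def suggest_sources(sources_by_lang, target_lang, top_n=5):
    """sources_by_lang: dict of language code -> set of source URLs cited in
    that edition's version of the article."""
    already_used = sources_by_lang.get(target_lang, set())
    counts = Counter()
    for lang, urls in sources_by_lang.items():
        if lang == target_lang:
            continue
        # Count each URL once per edition and skip those already cited.
        counts.update(url for url in set(urls) if url not in already_used)
    # Sort by prevalence, then URL, so ties are resolved deterministically.
    ranked = sorted(counts.items(), key=lambda item: (-item[1], item[0]))
    return [url for url, _ in ranked[:top_n]]


# Placeholder data for one article across four language editions.
sources = {
    "en": {"https://example.org/a", "https://example.org/b"},
    "es": {"https://example.org/a"},
    "fr": {"https://example.org/b", "https://example.org/c"},
    "de": {"https://example.org/b", "https://example.org/c"},
}
print(suggest_sources(sources, target_lang="es"))
```

As noted above, a ranking like this says nothing about the language of the source itself, so a real tool would likely need an extra filter for that.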

Weekly updates:

  • Provided feedback on translation lists prototype and feeling pretty good about where that work by the Language team is headed (T371515)
  • Put in requests for support from ML Platform, Research Engineering, and Search Platform around article-country model to start those conversations early. I'll have to make sure that they're aware that we would like to make updates again later in the year based on feedback about the topic taxonomy so we should make sure that we don't duplicate work unnecessarily.

Weekly updates:

  • Small-scale evaluation of 50 random articles for article-country model as described in T369120#10069239 shows strong early performance. Doing a stratified sample will help us understand if there are any more problematic pockets though.
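
A stratified sample of the kind mentioned here could be drawn as in the sketch below. The strata (language edition plus region), the per-stratum size, and the record layout are assumptions for illustration and are not taken from the evaluation task.

```python
import random
from collections import defaultdict


def stratified_sample(articles, strata_key, per_stratum, seed=0):
    """Draw up to per_stratum articles from every stratum."""
    rng = random.Random(seed)  # fixed seed so the sample is reproducible
    groups = defaultdict(list)
    for article in articles:
        groups[strata_key(article)].append(article)
    sample = []
    for key in sorted(groups):
        members = groups[key]
        # Small strata contribute everything they have.
        sample.extend(rng.sample(members, min(per_stratum, len(members))))
    return sample


# Placeholder records: 2 languages x 2 regions x 5 articles, plus one rare stratum.
articles = [
    {"title": f"{lang}-{region}-{i}", "lang": lang, "region": region}
    for lang in ("en", "es")
    for region in ("Africa", "Asia")
    for i in range(5)
]
articles.append({"title": "en-Oceania-0", "lang": "en", "region": "Oceania"})

sample = stratified_sample(articles, lambda a: (a["lang"], a["region"]), per_stratum=2)
print(len(sample))
```

Capping each stratum rather than sampling proportionally keeps small groups visible, which is the point when looking for problematic pockets.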

Weekly updates:

  • Some advance planning with AS around topic taxonomy workshops with the community that we have planned for Q2. I'm going to put together some basic prototype models in September -- e.g., a model that has a "Human Rights" category -- to help us think through the feasibility of various approaches ahead of talking with community members. This is good timing because I'm also having discussions with the Content Translation folks right now about the taxonomy and how to show it in the UI (T113257).

Weekly updates:

  • Article country model is confirmed to be exceeding its precision/recall requirements: T369120#10145803
  • I shared out with LPL team about the different models (topic; country; quality) that are available or could be available soon as filters for Content Translation. We're going to push on the country model being productionized and will continue to think about whether the quality model could be of any use to them.
  • Had a good chat with Chanelle about the list-building work to give her some of the background on how that model came to be. Excited to follow her work and brainstorm different ways that that tool (or the models behind it) could be useful. For instance, originally we had thought of it as mainly a tool for generating article lists for campaign organizers. What I've been finding and came out from our discussion is that there are actually a bunch of use-cases that just use the lists as an intermediate result. For example, take an input article and find potentially relevant editors (for event invitations) or sources (for helping editors actually edit about a topical space). In both of these cases, the tool doesn't have to be perfect but the patterns around editors or sources etc. in the related articles still are clear enough to be valuable.

Would like to briefly mention that this would be great for making Suggested Edits more engaging and interesting by having it suggest tasks for topics the user is interested in (and likely somewhat knowledgeable about), which also turns Suggested Edits into something relevant for more experienced users rather than only new users. I don't see how the task about streamlining existing recommendation systems to incorporate topics would include that so likely a new issue would need to be created, for which this is a reminder (+ pls subscribe me to it). Some more input on this (better personalized Suggested Edits etc) here.

This is probably already known but for setting topics what could be very useful are: categories, WikiProjects set on the talk page, analysis of wikilinks (e.g. those in the lead and the WikiNav most clicked ones), which templates link to the page, which lists link to the page, and AI topic generation (similar to summarization) using the lead text.

@Prototyperspective thanks for chiming in. I asked the Growth team and they were not aware of any tasks oriented at incorporating past edits as a means of personalization for the Newcomer Tasks or Suggested Edits experience. If you're interested, I'd say create a task that captures what you're thinking and tag GrowthExperiments-NewcomerTasks. A few bigger thoughts as I think about the design of these recommender systems:

I'm thinking mainly about three recommender systems that WMF supports, which I'd personally characterize as:

  • Suggested Edits: tasks on the iOS/Android apps that are oriented towards surfacing quick tasks -- i.e. constrained to mobile.
  • Newcomer Tasks: tasks via the Newcomer Homepage that are oriented towards aiding new editors in discovering simple edits to make to learn/improve as an editor -- i.e. mobile or desktop.
  • Content Translation: high-impact translation tasks that can be on mobile but are still largely desktop-oriented.

All three of these systems can be seen as a combination of a set of tasks, a mechanism for discovering articles of topical interest, and then a nice UI to wrap up the experience. There are of course many other volunteer-developed recommender systems in use across the Wikimedia projects.

When it comes to tasks, we have essentially two types:

When it comes to topical filters, there are essentially four approaches used:

  • Personalized based on past edit history (as you suggested): Content Translation already does this. Suggested Edits / Newcomer Tasks do not currently do this but could presumably borrow the logic used by Content Translation if it was a priority. That logic is essentially to grab a few recently-edited articles by the editor and find articles that are similar using the morelike functionality of Search (i.e. keyword overlap).
  • Popular: Content Translation can also do this as a fallback where highly-viewed articles are more likely to be recommended.
  • Search: find exactly the article you want to edit. Content Translation enables this and Suggested Edits does this too. I don't think Newcomer Tasks has this functionality.
  • Topical filters: current approach by Newcomer Tasks, which is especially appropriate for new editors who don't have edit history and may have trouble coming up with good searches. Provide a standard list of topics to select from. We currently have a model for doing this that is trained based on WikiProject data and uses article links to make predictions (two of your suggestions). We did look at categories but have largely found it difficult to use them in coherent ways. Because this topic taxonomy will never be perfect for all the different needs, I've also been working on tools like list-building that might help organizers to more quickly build campaigns/wikiprojects and then enable ways for these lists to be used in our recommender systems as additional filters (e.g., ongoing project with Content Translation: T368718).
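
The personalized approach in the first bullet (seed a search with a few recently edited articles and use morelike) can be sketched as a single Search API request. This is an illustration, not Content Translation's implementation: the helper name and parameter choices are assumptions, and fetching the editor's recent edits is left out.

```python
from urllib.parse import urlencode


def morelike_url(lang, seed_titles, limit=10):
    """Build a MediaWiki search API URL that uses the morelike: keyword."""
    params = {
        "action": "query",
        "list": "search",
        # Several seed pages can be combined with "|".
        "srsearch": "morelike:" + "|".join(seed_titles),
        "srnamespace": 0,  # main (article) namespace only
        "srlimit": limit,
        "format": "json",
    }
    return f"https://{lang}.wikipedia.org/w/api.php?" + urlencode(params)


# Seed titles stand in for an editor's recently edited articles.
url = morelike_url("en", ["Japanese iris", "Limonium strictissimum"], limit=5)
print(url)
```

Because the request is just a search query, other filters supported by Search could in principle be appended to the same `srsearch` string.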

One pathway for your ask could be adding this functionality into the specific Suggested Edits or Newcomer Tasks system as a whole, but another pathway could be making each of these pieces a bit more modular so that you could e.g., re-use the tasks logic but have your own topical filtering and UI. For instance, Growth has moved their configuration that matches various templates to their relevant unstructured tasks into Community Configuration so that it's easier to manage and for other tools to build upon. This set of tasks also builds on longstanding community tools like SuggestBot, which does in fact do the personalized recommendations based on previous editing history. So while the interface is different and it doesn't have the structured tasks, it gives you some of the desired functionality and might be worth checking out if you haven't already. There's also a request to standardize the URL for structured tasks used by Growth (T362579) so it's easier to incorporate into other tooling / lists.

weekly update:

  • See T343241 for details on some explorations for the next generation of topic classification models
  • No input from me needed yet but ML Platform began looking at productionizing the article-country model (yay!)

weekly update:

  • continued work on V2 of the model (T343241#10222700)
  • Pau left some great feedback on unexpected outputs from the current model in T376987 that I explored and left comments about where the issues are coming from and what we can do to address -- mostly this helps to motivate the work we're doing on V2 and points to some things to consider as we hopefully incorporate it into the Search Index at some point in the future
  • Kevin continuing to work on putting the article-country model on LiftWing (thank you!)

Thanks a lot for this heads-up. I didn't know newcomer tasks were this different from suggested edits and some links on the NT page link to suggested edit pages so I'll look into what the differences are.

In the meantime I also found the list-building tool and it's very useful and the closest to what I was proposing. In its repo, I created a new issue for some ways to enable using it to find subjects and/or tasks of interest. I think the best way for what I was proposing could be extending that tool and then making it available through Suggested Edits Desktop / a second version of Newcomer Tasks for more experienced editors.

I don't know how many would use this and think it could be that most editors are overflowing with things to do rather than being out of ideas & motivation. Nevertheless, there are probably many who would greatly benefit from this, e.g. to stay engaged when for some reason briefly running out of things to do or when looking for easier or novel/uncommon tasks. One can always go back to such tasks and tackle some and this kind of thing itself is kind of fun because it's more interactive, has more feedback, etc.

I created this Community Wishlist proposal and suggest that further discussion happen on its talk page. I intend to revise and improve it and at some point create a Phabricator issue for it if nobody else has created one by then. The proposal is under development and could be better; if you have ideas or suggested changes for it, please let me know. I quoted you on the three task types. I think there are downsides to tasks getting editors who would not otherwise be involved with an article to be involved with it, but the benefits outweigh them, especially if this is designed well. Btw, I don't think it would be good to get more of the scarce contributors to spend time translating things manually when machine translation for major languages is about to reach high quality (although it's arguably better than writing articles from scratch that already exist with high quality in a WP like ENWP). Isn't selecting interests also a feature of suggested edits? I don't think the Topic Classification currently works well, e.g. for "European Green Deal" it does not suggest anything environment-related. As noted, I think WikiProject tags should be used more and whatever the problem with categories was can probably be fixed, such as by using them only with various exceptions and/or with various criteria for reducing their weighting, or by not using specific category branches that are not well-suited for similarity (e.g. aren't about topic). In relation to that I was thinking of a way to group categories somehow.

@Prototyperspective thanks for sharing and writing these up for Community Wishlist. I know that I have a few other questions/requests you've shared with me elsewhere. I just wanted to acknowledge and let you know that I'll work to get you a reply soon. One quick thought:

Isn't selecting interests also a feature of suggested edits? I don't think the Topic Classification currently works well, e.g. for "European Green Deal" it does not suggest anything environment-related.

We're working on a new generation of topic models -- part of this is hopefully improving the environmental aspect of the taxonomy. You can see in this UI how the example you shared fares. It registers as low-confidence for Sustainability as well as Humans+Environment: https://wiki-topic.toolforge.org/topic-prototype?lang=en&title=European_Green_Deal. The low-confidence piece is something I'm trying to improve by getting the model better training data. You can see the WikiProjects that are used to generate the training data for each topic in this file if you're curious: https://github.com/geohci/wikitax/blob/master/datasets/wikiproject_taxonomy.isaac_prototype_20240925.yaml#L499. When it comes to the actual recommender system, we aim to use the model outputs combined with some additional features pulled from categories and Wikidata (mostly around people, species, and geographic aspects) but there is some complementary work that is thinking about how to better elevate WikiProjects and those community tagging efforts.

Weekly update:

  • recorded overview video for sustainability/biodiversity group about the new topic models and what we're trying to do in this space
  • re-trained model without STEM* category (too broad) and with some fixes to make Entertainment a bit more straightforward. I didn't notice any sizeable changes in the model which is nice because it suggests that it's pretty stable even when re-training with a slightly adjusted taxonomy and with new data (October instead of August '24).
  • I also added logic to the API to use SPARQL to identify pathways from occupation values on Wikidata to a set of high-level occupations that we can confidently map to topics. This will help a bit with coverage for people.
  • identified a bug that I had in the article-country API (just an uncaught exception, not incorrect results) that Kevin let me know he would fix
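
The occupation-pathway step in the update above could look roughly like the following. This is a minimal sketch under stated assumptions, not the API's actual code: it assumes the pathway follows Wikidata's "subclass of" property (P279), and the function names, the in-memory `PARENTS` graph, the occupation names, the `Q_...` placeholders, and the topic labels are all hypothetical illustrations.

```python
from collections import deque


def build_pathway_query(occupation_qid, target_qids):
    """Build a SPARQL query asking which high-level occupations are reachable.

    Assumption: the path is followed via P279 (subclass of), zero or more hops.
    """
    values = " ".join(f"wd:{qid}" for qid in target_qids)
    return (
        "SELECT DISTINCT ?target WHERE { "
        f"VALUES ?target {{ {values} }} "
        f"wd:{occupation_qid} wdt:P279* ?target . "
        "}"
    )


def find_high_level(occupation, parents, high_level):
    """Offline equivalent of the P279* walk: breadth-first search up a parent graph."""
    seen = {occupation}
    queue = deque([occupation])
    found = set()
    while queue:
        node = queue.popleft()
        if node in high_level:
            found.add(node)
        for parent in parents.get(node, ()):
            if parent not in seen:
                seen.add(parent)
                queue.append(parent)
    return found


# Hypothetical data: occupation -> direct parent occupations.
PARENTS = {
    "botanist": ["biologist"],
    "biologist": ["scientist"],
    "scientist": ["researcher"],
}
# Hypothetical mapping from high-level occupations to topic labels.
HIGH_LEVEL_TO_TOPIC = {"scientist": "STEM", "artist": "Visual arts"}

matches = find_high_level("botanist", PARENTS, set(HIGH_LEVEL_TO_TOPIC))
topics = sorted(HIGH_LEVEL_TO_TOPIC[m] for m in matches)
```

Restricting the targets to a small, hand-picked set of high-level occupations keeps the mapping to topics reviewable, at the cost of leaving occupations with no path to that set untagged.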

Weekly update:

  • Had the first focus group with the sustainability/biodiversity group and it was very useful! One of the core takeaways was that the group doesn't just need to be able to, e.g., distinguish between different taxa, but more importantly wants to be able to explore the intersections between species and human impacts. This might mean subsetting by whether a species is endangered/invasive/extinct or has medicinal/cultural uses, etc. These are currently hard to do, but we could potentially achieve them through intersections in the topic taxonomy -- e.g., Plants+Medicine. We seem to be doing a good job of getting the core topic (Plants), but I'm going to look into using additional categories/Wikidata items to capture these secondary topics (Medicine, Food, etc.). For example, an IUCN status of Endangered is something we could gather easily from Wikidata (example), or perhaps the "has use" property could also be mined (example).
  • Folks are also interested in further subdividing animal/plant and adding in a bacteria-related kingdom distinction. That involves an expansion of the topic taxonomy, which is a bit harder, but is something that I'll look into as well.
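
The intersection idea from the focus group (e.g., Plants+Medicine) could be sketched as follows. This is only an illustration under stated assumptions: it takes an already-simplified claims dictionary (property ID to list of value IDs) rather than raw Wikidata JSON, uses P141 (IUCN conservation status) and P366 (has use) as the two properties, and the `Q_...` value placeholders, the topic labels, and the function names are hypothetical.

```python
IUCN_STATUS = "P141"  # Wikidata property: IUCN conservation status
HAS_USE = "P366"      # Wikidata property: has use

# Hypothetical mapping from (property, value) to a secondary topic.
# The Q_... strings are placeholders, not real Wikidata item IDs.
SECONDARY_TOPICS = {
    (IUCN_STATUS, "Q_ENDANGERED"): "Endangered",
    (HAS_USE, "Q_MEDICINE"): "Medicine",
    (HAS_USE, "Q_FOOD"): "Food",
}


def secondary_topics(claims):
    """Return secondary topics found in a simplified {property: [values]} dict."""
    topics = []
    for prop in (IUCN_STATUS, HAS_USE):
        for value in claims.get(prop, []):
            topic = SECONDARY_TOPICS.get((prop, value))
            if topic and topic not in topics:
                topics.append(topic)
    return topics


def intersections(core_topic, claims):
    """Combine the model's core topic with each secondary topic."""
    return [f"{core_topic}+{topic}" for topic in secondary_topics(claims)]


example_claims = {HAS_USE: ["Q_MEDICINE", "Q_FOOD"], IUCN_STATUS: ["Q_ENDANGERED"]}
labels = intersections("Plants", example_claims)
```

Keeping the core topic from the model and layering secondary topics from structured data avoids expanding the taxonomy itself for every species/impact combination.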

Weekly update:

  • "Age" as a facet of the taxonomy was discussed and is something we can bring to the GLAM folks for feedback. We have existing code for extracting dates from Wikidata properties -- the tricky thing is how to surface that data as a topic filter given that years are largely continuous as opposed to discrete categories.
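
One possible way to turn continuous years into a discrete filter is to bucket them into periods. The sketch below is only one option, not a decided design: the period boundaries, labels, and function name are hypothetical, and choosing meaningful boundaries is exactly the open question to bring for feedback.

```python
from bisect import bisect_right

# Hypothetical period boundaries (the first year of each period after the first).
PERIOD_STARTS = [500, 1500, 1800, 1900, 2000]
PERIOD_LABELS = [
    "before 500",
    "500-1499",
    "1500-1799",
    "1800-1899",
    "1900-1999",
    "2000 onward",
]


def year_to_period(year):
    """Map a year to a discrete period label usable as a topic-style filter."""
    # bisect_right counts how many period starts are <= year,
    # which is the index of the period the year falls in.
    return PERIOD_LABELS[bisect_right(PERIOD_STARTS, year)]


period = year_to_period(1503)
```

Fixed buckets are simple to index and filter on, but they lose ordering within a bucket and force arbitrary cut-offs, so a range filter on the raw year may be worth considering alongside them.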
Miriam triaged this task as High priority. Wed, Nov 20, 1:51 PM

Weekly update:

  • Had our second focus group with the Art folks -- they raised some valuable points about being careful about putting too much structure around those topics, given the long, not always great, and constantly changing nature of how we classify art. They also expressed interest in the "age" facet, so that's certainly something we should dig into more.
  • Article-country model is getting very close to its first launch on LiftWing and then we'll be able to work on an event stream and incorporation into Search