
[Q3 FY 24-25 Applied Science] Knowledge Gaps Research
Closed, ResolvedPublic

Description

This is a parent task to capture the Q3 work by Applied Sciences (Research) related to Knowledge Gaps. It will capture prioritization decisions and major weekly updates related to tasks in this bucket from January - March 2025. More fine-grained updates and coordination will occur in the subtasks as appropriate.

Confirmed Projects

Project | Responsible | Prioritization | Details (if applicable) | Final status
Language Gaps metrics | @CMyrick-WMF | Essential Work | T383925; T388201 | Extended a week to finalize status report
Topic infrastructure | @Isaac | Essential Work | T361637 | Pending discussions in Q4 about implementation hypothesis
Reading List Data | @MGerlach | Essential Work | T378420 | Resolved
Add-a-link closeout | @MGerlach | Essential Work | T361926 | Resolved
Epistemic Injustice Paper (UMN) | @Isaac | Essential Work | N/A | Complete; waiting on notification (8 April)
Small Language Projects | @CMyrick-WMF (support) | Essential Work | T381903 | Will revisit if needed in Q4
Editor Metric Consultation | @TAndic (support) | OKR (WE 1.5.2) | T369096 | Wrapping up in April
Identify Web Scraping | @MGerlach (support) | OKR (WE 4.3.11) | T384855 | Will revisit if needed in Q4

Event Timeline

Isaac triaged this task as High priority. (Jan 13 2025, 11:02 PM)
Isaac updated the task description.

Updates:

  • @CMyrick-WMF presented the language gap metric proposals to the team and scoped out the work for the next steps for this quarter under T383925, which will focus on the Vital Article component.
  • ML Platform has picked up the model stream for the article-country work so I coordinated with Purity (LPL) and notified Search that we will soon be ready to pick up that leg of the work: T301671#10468557
  • As part of the reading list exploration, @MGerlach identified vector search evaluation as a useful application and is scoping out a baseline there (T383865).
  • I provided feedback to UMN team on their draft for DIS. Overall looks good and provides a nice framework for thinking about how to mentor newcomers to become editors who can help address knowledge gaps on the projects -- viewing it not just as a technical challenge but also the ability to help Wikipedia policy evolve too.

Updates:

  • @CMyrick-WMF prepared a presentation on Language Metrics (and Knowledge Gaps more broadly) for the Knowledge Equity offsite
  • Coordination work ongoing between Search, ML Platform, DPE, LPL, and me to get the article-country predictions into the Search index where Content Translation can consume them. There's now a hypothesis for this though owned by LPL (WE2.5.1 Country-level inference model for translation suggestions) so we're getting much closer to clarity on roles and next steps.
  • Epistemic Injustice paper was submitted to DIS. Notifications go out 8 April.
  • @MGerlach continued exploration of vector search with reading lists as the evaluation. There's now a text-embedding baseline to complement morelike (keyword-based baseline). The text-embedding aspect was greatly facilitated by a previous pipeline developed by Research Engineering (code) so many thanks to Muniza for that foresight!
Isaac added a subscriber: TAndic.

Weekly update:

  • @CMyrick-WMF presented the Language Metrics work to the Knowledge Equity offsite
  • We added some support for Movement Insights related to editor metrics that they are developing (see task description)
  • The final topic consultation was delayed until Feb 14th

Weekly update:

  • Prioritized new task: Martin's consultation support for WE 4.3.11 hypothesis
  • Martin also closed out his investigation of reading lists as a source of data for vector search. Full details can be seen at T378420#10531770 but some highlights:
    • morelike works quite well for this use-case compared to off-the-shelf embedding models. This aligns with similar findings for predicting WikiProject lists and recommending related articles to read (and anecdotally good performance for Content Translation recommendations). As an aside, I find morelike to be quite powerful and easy to use for any use-case involving Wikipedia articles within a specific language. The cases where it likely suffers are: cross-lingual searches (not really possible, though you could hack something together via sitelinks); Help/Policy/Technical documentation, where articles are often about many different topics; situations where the user likely doesn't know the right terminology (e.g., new developers or editors); and perhaps languages that are more difficult to tokenize into consistent units. It also doesn't directly solve the issue of finding a specific passage in response to a query, but it can still potentially be a good way to generate candidates that can then be embedded/re-ranked.
    • These reading lists are a nice and unique source of data for evaluating (or potentially fine-tuning) various embedding-based approaches though so plenty of potential follow-up work that could be done there.
  • Caroline and I discussed follow-up work for Language Gaps metrics so we can better map out what the timeline for that work should look like. I met with Omari and got some useful guidance on how to prioritize these metrics with his team -- mainly having a clear Product fit and strategy that they can be linked to so we know they'll inform our broader work.
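For readers unfamiliar with morelike, here is a minimal sketch of how such a query can be issued through the MediaWiki Action API using the CirrusSearch `morelike:` keyword. The helper name `morelike_search_url` and its defaults are illustrative assumptions, not code from the investigation; the function only builds the request URL and does not fetch anything.

```python
from urllib.parse import urlencode


def morelike_search_url(title, lang="en", limit=10):
    """Build an Action API URL for a CirrusSearch morelike query.

    `morelike_search_url` is a hypothetical helper for illustration only.
    """
    params = {
        "action": "query",
        "list": "search",
        # CirrusSearch keyword: find pages whose text is similar to `title`.
        "srsearch": f"morelike:{title}",
        # Restrict results to the main (article) namespace.
        "srnamespace": 0,
        "srlimit": limit,
        "format": "json",
        "formatversion": 2,
    }
    return f"https://{lang}.wikipedia.org/w/api.php?{urlencode(params)}"


if __name__ == "__main__":
    print(morelike_search_url("Ada Lovelace"))
```

Because the query is scoped to a single wiki, the same call against a different `lang` runs against that language's own index, which is consistent with the cross-lingual limitation noted above.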

Weekly update:

  • Topics: Article-country stream (T382295) is live! I'm meeting next week with Search to discuss concerns they have with a component of the model that uses the Search index as essentially a key-value store (T385970). Being blocked on that component would decrease the coverage of the model, so it is important to figure out, but it's not a blocker to Product being able to use the model outputs in Content Translation or other systems. The final topic consultation was delayed another week (21 February).
  • Growth has been curious about improved add-a-link retraining and the language-agnostic model. @MGerlach has been providing feedback and we'll see if there is any momentum here that we can use to update how we handle add-a-link training/predictions in production.

Weekly update:

  • Feedback provided by @MGerlach in the project planning doc for approaches in WE 4.3.11
  • @MGerlach validated the language-agnostic add-a-link training pipeline for several languages and updated the documentation accordingly (T361926). Conversations will be ongoing about how to incorporate some of these learnings into our production pipeline for add-a-link but the current work for that project is complete.
  • Topic infrastructure: I participated in our third topic model feedback session, with a focus this time on Ethnic Identity and Indigenous Knowledge. My larger takeaway was that high-level, language-agnostic topics like what we use in the topic model might not be appropriate for this context given how important the local nuance and framing is and how often that does not align with current, dominant narratives that we can capture in the data. The one space that does allow for capturing some of this nuance on the Wikimedia projects is via categories. They aren't perfect but they are flexible and were brought up by several of the participants as a space that they think is more appropriate. So one potential course of action: rather than trying to force this particular area of content into a very high-level taxonomy, we should be working on exposing categories as an alternative discovery mechanism through our tooling. This is partially in place because Search (where we host the topics) also supports category-based filters. The main challenge would be how to make that usable without someone having to know an exact category name.
  • Article-country: We identified with Search and ML Platform that our use of the Search cirrusdoc API wasn't a great long-term solution and so we switched to using a static database dependency. I created a fresh dump for that (P73436#294761) and filed a request to REng to create an Airflow DAG for regenerating this (T387041).
  • Language Gap Metrics: @CMyrick-WMF compared pagepile vs. Wikidata as a source for 1000-articles (notebook) and found Wikidata to be slightly behind. We discussed and the pagepile is probably more desirable as a stable source of articles that we can version when updating (as opposed to the underlying data slowly shifting month-to-month). Analysis updated as well (code).
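The pagepile vs. Wikidata comparison above amounts to a set comparison between two article lists. The sketch below is not the notebook's code: the function name and the sample item IDs are hypothetical, and it assumes both sources have already been reduced to lists of Wikidata item IDs.

```python
def compare_article_sources(pagepile_ids, wikidata_ids):
    """Report which items are shared between two article lists and which are unique to each."""
    pagepile, wikidata = set(pagepile_ids), set(wikidata_ids)
    return {
        "shared": sorted(pagepile & wikidata),
        "only_in_pagepile": sorted(pagepile - wikidata),
        "only_in_wikidata": sorted(wikidata - pagepile),
    }


if __name__ == "__main__":
    # Hypothetical sample IDs, for illustration only.
    report = compare_article_sources(["Q1", "Q2", "Q3"], ["Q2", "Q3", "Q4"])
    for name, ids in report.items():
        print(name, len(ids), ids)
```

Running the same comparison against a versioned pagepile snapshot would keep the baseline fixed between updates, which is the stability argument made above.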

Weekly update:

  • NOTE: now that Martin has closed out much of his work in this space (reading lists + add-a-link), the updates will be focused mainly on topic infrastructure and language gap metrics. Occasional consultation-related updates are expected for the other projects, but the Research inputs have been lightweight (as expected) thus far so I haven't been raising them here. March should see more updates around language gap metrics too as work related to DSS/Admins slows down for Caroline.
  • I participated in our final topic model focus group, which was focused on the topic of Gender. There was general support for breaking out biographies as their own topic that is derived from Wikidata. We will just need to propose a concrete approach for how we'd map gender values on Wikidata to high-level categories. There was then a lot of interest and discussion around what a gender studies topic might look like. So rather than the more constrained "Women's X" type topic, which perhaps mirrors a bit how some WikiProjects are set up, we would expand to the broader gender studies umbrella, which should capture a broader set of WikiProjects and be more inclusive. It still remains to be seen how effective the model would be at predicting this topic and how it should relate to the Human Rights topic (likely at least some overlap). The WikiProjects in this space mix in a lot of biographies and we'd have to think about whether to include those too (and risk the model just learning a general mapping between biographies and gender studies) or add some additional code to try to filter out biographies from these WikiProjects when training. There will also be some planning conversations soon to figure out who in Q4 might own a hypothesis related to bringing these changes to production (ideally a Product owner, though it'll have dependencies on Research, ML, and a tiny bit on Search).

Weekly update:

  • I met with Alex S. to discuss reporting out about topic focus groups. In the next week, he's going to work on cleaning up our thoughts about major takeaways and I'm going to work on documenting the changes between the current LiftWing model, the prototype we shared with participants, and new iteration based on their feedback. Most of this is incremental but there are a few areas (gender studies; extending Plant+Animal labels a little; age) that might take a bit more trial-and-error to figure out where we stand. That should put us in a place to share out with the groups and propose a concrete v2 model.
  • @CMyrick-WMF is collaborating with Movement Insights to incorporate language+incubator data into our canonical datasets! That work will be tracked under T388201 and is a parallel project to this quarter's additional development (T383925).

Weekly update:

  • I didn't make the progress that I wanted to yet on the topic prototype so that remains a high priority task for me. On the plus side, article-country is now fully ready for Content Translation to incorporate: the model outputs are available via a keyword in Search and the update for incorporating link-based predictions should go out early next week (so coverage will improve). Now it'll be on the LPL team to make the UI/backend adjustments to incorporate it into Content Translation.
  • @CMyrick-WMF began working on incorporating Glottolog data into our canonical data repo for support of Language-related metrics. That code will be tracked in https://gitlab.wikimedia.org/cmyrick/third-party-language-data/-/tree/main/glottolog.

Weekly update:

  • Some renewed discussion about whether Ethnologue data can be used as part of the Language metrics
  • I worked on implementing the feedback we got from our topic infrastructure groups and updated the prototype. Alongside more incremental improvements, we are testing out a Gender Studies topic (though performance has been low) and a Time topic (though we need some more clarity on technical feasibility in terms of surfacing the predictions via Search). The aim is to publish the summary next week so we can get any additional feedback and start thinking about actually updating the production model and the tools that use it. It's been a really fun and useful process overall of doing these focus-group discussions with community experts aimed at aspects of the model and working to incorporate that feedback.

I added a final status to each of the projects in the task description. A few notable things:

  • @CMyrick-WMF will put together a final status report for the Language Gaps metrics work by April 7th -- that will be added to T348246 which can then be resolved.
  • We wrapped up the focus group phase of the topic model V2 project. This led to a project report and updated prototype and description of taxonomy change.
  • @TAndic's Editor Metric Consultation work has been slightly extended into April by Movement Insights but should wrap up then.
  • We will reopen consultation support for Small Language Projects and Identify Web Scraping as needed in Q4 though nothing specific expected at this time.
  • We should receive initial paper notification for the Epistemic Injustice Paper on April 8th but no further work should be required there.