[Q1 FY 25-26 Applied Sciences Team] Knowledge Gaps Research
Closed, Resolved (Public)

Description

This is a parent task to capture the Q1 work by Applied Sciences (Research) related to Knowledge Gaps. It will capture prioritization decisions and major weekly updates related to tasks in this bucket from July - September 2025. More fine-grained updates and coordination will occur in the subtasks as appropriate. It follows the Q4 task (T391707).

Confirmed Projects

Project                          Responsible    Prioritization   Ticket
Language vision exploration      @CMyrick-WMF   Essential Work   N/A
Topic Infrastructure             @Isaac         TBD              T361637
Article creation sustainability  @CMyrick-WMF   OKR              T397374
Language data consultation       @CMyrick-WMF   Essential Work   T388201
Rescoping Simplification         @MGerlach      OKR              T399567
Community Insights Test          @TAndic        Essential Work   T400692

Details

Due Date
Sep 30 2025, 4:00 AM

Event Timeline

Isaac set Due Date to Sep 30 2025, 4:00 AM. (Jul 2 2025, 4:26 PM)

Weekly update:

Article creation sustainability

  • Created sample dataset (create_dataset.ipynb) consisting of English Wikipedia articles translated into Spanish Wikipedia articles, along with those articles' features (via mwparserfromhtml) at two timepoints: the day of article translation and the most recent article version.
  • Began visualizations comparing enwiki vs. eswiki article trajectories (r_viz.ipynb)
  • Began answering stats questions comparing enwiki vs. eswiki articles (r_stats.ipynb)

Weekly update:

Rescoping Simplification

  • Started data collection for articles from English Wikipedia
  • Thinking through other potential metrics to consider; consulting with Design Research on previous insights

Article creation sustainability

  • Continued calculating comparative stats between enwiki articles and eswiki articles (r_stats.ipynb)
  • Brainstormed options for randomly sampling from the cxpublishedtranslation API.

Weekly update:

Rescoping Simplification

  • @MGerlach identified 3 potential approaches for article prioritization, and explored on enwiki:
    • Topics: Some topics are generally more difficult to read/understand than others. Indeed, readability varies substantially across topics when calculating the Flesch-Kincaid grade level for all articles and averaging across the set of 64 topics from the article-topic model. This suggests prioritizing articles from topics that are more difficult to read.
    • Difficult to read / high pageviews / few edits: On an article level, we could identify articles for which we might anticipate a need for a simpler version: those that i) are difficult to read, ii) have many pageviews, and iii) have little or no editing activity (i.e. don't seem controversial).
    • Maintenance templates: Another option is to consider articles which are marked with relevant templates. Specifically, two templates that could indicate a need for simplification are {{Confusing}} and {{Technical}}.
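
The topic-based approach can be sketched roughly as follows. This is a minimal illustration, not the notebook code used in the analysis: it applies the standard Flesch-Kincaid grade-level formula to pre-computed word, sentence, and syllable counts and averages the result per topic. The input structure, the function names, and the example counts are assumptions made for illustration; the two topic labels are only examples.

```python
from collections import defaultdict
from statistics import mean


def flesch_kincaid_grade(words: int, sentences: int, syllables: int) -> float:
    """Standard Flesch-Kincaid grade level computed from raw counts."""
    if words <= 0 or sentences <= 0:
        raise ValueError("words and sentences must be positive")
    return 0.39 * (words / sentences) + 11.8 * (syllables / words) - 15.59


def mean_grade_by_topic(articles):
    """Average grade level per topic; an article may carry several topics."""
    grades = defaultdict(list)
    for article in articles:
        grade = flesch_kincaid_grade(
            article["words"], article["sentences"], article["syllables"]
        )
        for topic in article["topics"]:
            grades[topic].append(grade)
    return {topic: mean(values) for topic, values in grades.items()}


# Hypothetical counts, for illustration only (not real article data).
sample = [
    {"topics": ["STEM.Physics"], "words": 100, "sentences": 4, "syllables": 180},
    {"topics": ["Culture.Sports"], "words": 100, "sentences": 8, "syllables": 140},
]

# Topics sorted from hardest to easiest to read.
ranked = sorted(mean_grade_by_topic(sample).items(), key=lambda kv: kv[1], reverse=True)
```

Topics at the top of `ranked` would be the first candidates for simplification under this approach.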

Article creation sustainability

  • @CMyrick-WMF finished up the groundwork for dataset creation
    • (The dataset will be a large, random selection of ~10,000 translated articles. To start with, the sample will be limited to articles translated from English Wikipedia; once the analysis is complete, it will expand to other languages. There will be no limit on the number of target languages, but because this is simple random sampling, some target languages will appear in the data more than others.)
    • Determined best way to do random sampling is to query the wmf_product.cx_translations table rather than using the API.
    • Finished writing random selection query
    • Finished writing query & loop to pull HTML for each article randomly selected
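
To illustrate why simple random sampling leaves some target languages over-represented, here is a small in-memory sketch. It is not the query run against wmf_product.cx_translations: the row fields (`translation_id`, `source`, `target`) and the helper name are hypothetical, and the rows are synthetic.

```python
import random
from collections import Counter


def sample_translations(rows, k, seed=0):
    """Simple random sample without replacement; every row is equally likely."""
    rng = random.Random(seed)  # fixed seed so the sample is reproducible
    return rng.sample(rows, min(k, len(rows)))


# Synthetic rows with hypothetical field names: three quarters target "es".
rows = [
    {"translation_id": i, "source": "en", "target": "es" if i % 4 else "vi"}
    for i in range(1000)
]

picked = sample_translations(rows, 100)

# Counts per target language roughly mirror each language's share of all rows.
target_counts = Counter(row["target"] for row in picked)
```

Because no stratification is applied, languages with more translations in the underlying data dominate the sample, which matches the caveat noted above.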

Weekly updates

Topic infrastructure

  • Met with the Growth team and Moderator Tools to understand how the new topic models might relate to their product plans in the next several months. We still have a few more teams to talk with before figuring out the next steps.

Language data consultation

  • Uploaded notebook for generating canonical dataset with Incubator projects (T393075) for code review
  • Received feedback on notebook for generating canonical dataset with language codes (T346855); decisions need to be made regarding a few languages that don't have matching ISO 639-3 codes.

Article creation sustainability

  • Rewrote sampling script to include translations from enwiki -> arwiki, fawiki, frwiki, hewiki, itwiki, ptwiki, trwiki, viwiki, and zhwiki, in addition to enwiki -> eswiki
  • Pulled feature counts from multiple timepoints; merged as a before-and-after dataset
  • Uploaded create_sample.ipynb to GitLab to show sampling method and dataset creation method
Isaac added a subscriber: TAndic.

Weekly updates:

Rescoping simplification

Article creation sustainability

  • Held a research data engineering consultation (i.e. met with Fabian) to discuss best practices for querying and comparing sets of HTML data

Weekly updates:

Rescoping simplification

  • Updated analysis to remove disambiguation/list pages
  • Generated list of 1000 example articles for each of the three approaches
  • Shared results with stakeholders
  • Closed task (initial goal of this analysis is completed) 🎉

Weekly updates:

Article creation sustainability

  • Updated sample_creation.ipynb to create an additional dataframe from which to calculate article source stats
  • Started sources.ipynb notebook to calculate article source stats

Weekly updates:

Topic Infrastructure

  • The topic model refresh is stalled at the moment as @Isaac looks for product stakeholders. While many teams would benefit, the motivation comes from community feedback and not a Product need, which is making prioritization difficult.

Community Insights Test

  • Began work on the CI test again now that the Newcomers Survey is ready for deployment

Weekly updates:

Topic Infrastructure

@Isaac fixed a gap in the determination of relevant timespan for articles based on Wikidata properties for the V2 topic model prototype. Previously, many entities for living people or e.g., towns that were incorporated but didn't have any clear end date would have a starting year but no real ending year so they'd be constrained to just their birth/incorporation date. The model now sets the end year to today if there is a "date of birth" but not "date of death" or a "start time" but no "end time". Might require further tweaks but the new implementation at least should be able to support more flexibility in how this is determined. Explore: https://wiki-topic.toolforge.org/topic-prototype
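
The open-ended timespan rule described above can be sketched as follows. This is a minimal illustration under stated assumptions, not the prototype's actual code: the function name and signature are hypothetical, the inputs are assumed to be already extracted from the "date of birth"/"start time" and "date of death"/"end time" properties, and returning no timespan when there is no start year is a choice made here for the sketch.

```python
from datetime import date
from typing import Optional, Tuple


def relevant_timespan(
    start_year: Optional[int],
    end_year: Optional[int],
    today: Optional[date] = None,
) -> Optional[Tuple[int, int]]:
    """Return (start, end) years, treating a missing end as still ongoing."""
    if start_year is None:
        # Assumption for this sketch: no start year means no timespan at all.
        return None
    if end_year is None:
        # Living person or still-existing place: extend the span to the current year.
        end_year = (today or date.today()).year
    return (start_year, end_year)
```

For example, an entity with a start year of 1980 and no end year gets a span that runs through the current year instead of being pinned to 1980 alone.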

Article creation sustainability

  • Published r_stats.ipynb with high-level initial findings. Included are summary stats and plots comparing the before-and-after states (i.e., state at day of translation and state at most recent query) for the source article on enwiki and the translated article
  • Focused on reference wrangling: completed sources.ipynb, which now creates a df with columns holding the full lists of sources as matrices that can be compared; added reference-related summary stats to r_stats.ipynb

Weekly update

Article creation sustainability

Weekly updates:

Community Insights Test

Decision made: given some known eventlogging barriers with QuickSurvey, our Community Insights enwiki pilot will aim for a smaller test sample next quarter, with the goal of a global distribution of the Community Insights survey at the end of Q3/early Q4 (in alignment with previous data collection years).

Topic Infrastructure

Isaac has agreed to be a tertiary mentor for an Outreachy project related to identifying micro-tasks associated with article worklists: T405754: Outreachy 31: Micro-task Generator for Organizers on Wikipedia (application). This will help in mapping out the realm of potential tasks and how easy it is to aggregate the relevant task sources.

Article creation sustainability

Using two new samples and a new analysis, Caroline compared articles created using cx vs. articles created "from scratch", looking specifically at the features of those articles in their very first versions (i.e. their first revision, at creation). This addresses the question: how do translated articles compare to "from scratch" articles in terms of their very first revision? It will shed light on what cx makes easier or harder than creating articles from scratch.

  • New random samples (see create_sample.ipynb): ~1000 cx'ed articles (across arwiki, eswiki, frwiki, trwiki, and uzwiki); ~1000 "from scratch" articles (same wikis).
  • New comparative analysis (see r_article_creation_comparison.ipynb), of which preliminary findings show:
    • the sample of cx'ed articles has more citations, headings, and sources, and greater page length, at time of creation
    • the sample of "from scratch" articles has more categories, interwiki links, and wikilinks, and more often has an infobox, at time of creation.

Preliminary findings might suggest that features provided by the cx tool make it easier for editors to add sources, cite sources, create more sections, and add more text during their very first edit than if they were writing the article from scratch. Larger samples, additional analyses, and statistical testing are needed before drawing conclusions or making recommendations.

Weekly update

Article creation sustainability: Finished initial analysis and visualizations (see r_article_creation_comparison.ipynb). The visualized data show the preliminary findings described in the previous post, broken down across multiple bins (Wikipedia language edition, article edit bins, article pageview bins, editing experience of the article creator, and article topic). These show that the preliminary findings described above mostly remain constant when controlling for these conditions.

Weekly update

Article creation sustainability: Finished analyses and visualizations of article features at the timepoint of 24 hours after translation/creation (see r_article_hrs24_comparison.ipynb). Findings show that between initial cx/creation edit and 24 hours later, the Cx'ed articles see bigger increases in infobox presence, interwiki links, wikilinks, images, and categories while "from-scratch" articles see bigger increases in page length. Closed out task.

Miriam subscribed.

Closing this task as this was the tracker for Q1