develop a content diff index design plan
Closed, Resolved (Public)

Description

This task is to track research exploration on how AI-assisted tools can enhance content discovery, moderation, and newcomer support on Wikipedia. The initial focus is on building a content diff index to make Wikipedia’s content history more accessible for non-article-centric questions.

We will validate the index by targeting a specific, high-value query pattern and creating an early prototype. This will help us test technical feasibility, gather early feedback, and inform design decisions for future iterations.

Sprint 1: Content Diff Index

  • Implement a first version of the content diff index for a chosen query pattern (e.g., tracking who added specific terms or sources).
  • Validate the output and performance on a realistic subset of the dataset to confirm feasibility and identify any infrastructure constraints.
  • Output: a database that supports efficient lookups for the chosen query pattern.

Sprint 2: Prototype UI for PM Feedback

  • Build a simple UI prototype (possibly just a JSON API) that demonstrates how the content diff index can be used to answer practical user questions.
  • Share the prototype with product managers to gather concrete feedback on usefulness, desired features, and next steps.
  • Output: documented next steps informed by that feedback.

Details

Due Date
Sep 5 2025, 12:00 AM

Event Timeline

leila renamed this task from [Q1 FY 25-26 Research Engineering] AI Assisted Tools Research to AI Assisted Tools Research. Jul 3 2025, 6:13 PM
fkaelin renamed this task from AI Assisted Tools Research to [Q1 FY 25-26 Research Engineering] AI Assisted Tools Research. Jul 3 2025, 6:14 PM
fkaelin updated the task description.
leila renamed this task from [Q1 FY 25-26 Research Engineering] AI Assisted Tools Research to develop a content diff index design plan. Jul 3 2025, 6:20 PM
leila triaged this task as High priority.
leila updated the task description.
leila set Due Date to Aug 8 2025, 12:00 AM.

Weekly updates

  • pipeline to aggregate an editor-centric view of the content diff dataset. For every (wiki_db, editor) pair, where the editor is a user name, IP, or temp account name, we aggregate information about each token the editor acted on. The information includes the action (added/removed), the number of times that action happened, and the affected revision_ids and page titles.
  • the mwtokenizer library is used to tokenize the diff
  • a dataset covering 1 year of data for the wikis "simplewiki", "arwiki", "dewiki", and "enwiki" is available on HDFS at `/user/fab/content_diff_index/`
  • evaluation of database options for prototype
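
For illustration, the editor-centric aggregation described above could look roughly like the sketch below. This is not the actual pipeline code: the record field names (`wiki_db`, `editor`, `revision_id`, `page_title`, `token`, `action`) and the function `aggregate_editor_tokens` are assumptions made for the example, and the tokens are assumed to have already been produced by the tokenizer.

```python
from collections import defaultdict


def aggregate_editor_tokens(records):
    """Group token-level diff records by (wiki_db, editor), then token, then action.

    Each record is a dict with hypothetical keys: wiki_db, editor,
    revision_id, page_title, token, action ("added" or "removed").
    """
    index = defaultdict(lambda: defaultdict(dict))
    for r in records:
        key = (r["wiki_db"], r["editor"])
        # One entry per (token, action): a count plus the affected revisions/pages.
        entry = index[key][r["token"]].setdefault(
            r["action"],
            {"count": 0, "revision_ids": set(), "page_titles": set()},
        )
        entry["count"] += 1
        entry["revision_ids"].add(r["revision_id"])
        entry["page_titles"].add(r["page_title"])
    return index


# Made-up sample records, for illustration only.
records = [
    {"wiki_db": "simplewiki", "editor": "ExampleUser", "revision_id": 1,
     "page_title": "Cat", "token": "feline", "action": "added"},
    {"wiki_db": "simplewiki", "editor": "ExampleUser", "revision_id": 2,
     "page_title": "Dog", "token": "feline", "action": "added"},
    {"wiki_db": "simplewiki", "editor": "ExampleUser", "revision_id": 3,
     "page_title": "Cat", "token": "feline", "action": "removed"},
    {"wiki_db": "simplewiki", "editor": "192.0.2.1", "revision_id": 4,
     "page_title": "Cat", "token": "feline", "action": "added"},
]

index = aggregate_editor_tokens(records)
```

Keying on (wiki_db, editor) first keeps all activity of one editor together, which matches the "who added this term" style of lookup; a real implementation would run as a distributed job rather than in-memory Python.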
Miriam changed Due Date from Aug 8 2025, 12:00 AM to Aug 26 2025, 12:00 AM. Aug 14 2025, 2:28 PM

Changed the deadline because of delays: we needed to switch the pipeline to use edit types instead of raw content diffs. The plan is to send the prototype out for feedback this week and summarize learnings in a couple of weeks.

Weekly updates:

  • a prototype for querying editor history is deployed on Cloud VPS for 1 month of data (August 2024)
  • a design plan draft that describes the approach
  • ongoing discussion on how, and to whom in product management, to present this work
Miriam changed Due Date from Aug 26 2025, 12:00 AM to Aug 29 2025, 12:00 AM. Aug 26 2025, 12:37 PM
Miriam changed Due Date from Aug 29 2025, 12:00 AM to Sep 5 2025, 12:00 AM. Aug 28 2025, 2:00 PM

Extended the deadline because we need to speak with more PMs. @fkaelin will report back the learnings from this round of meetings and the next steps.

  • Summary of the outputs:
    • A data pipeline that generates an editor history dataset, based on the content diff dataset and using the edit types library.
    • A prototype UI to explore possible query patterns (for 1 month of data)
    • Doc with background, approach, example queries
  • Discussion with product management
    • Moderator tools WE1.3 (Sam Walton). Editor history/discovery is a need, but there are limited tools. It relates to content investigations (activity around certain terms/topics), spam links (also see T221397), and sockpuppet detection. Moderator hub: provides patrollers with a central location that suggests things they could do, which requires data sources that can help inform those recommendations. Editor history could be such a data source; we need to bridge the gap between the existing data and product needs.
    • Anti abuse WE4.3 (Kosta/Madalina). The work on "suggested investigations" focuses on signals that are not visible through other means: patterns or sequences of events that are interesting to check-users. The question is whether these can be consolidated into a risk signal to be displayed; examples include a shared email at signup and suspicious hCaptcha activity. Editor history is a good candidate for such signals. Example queries discussed: (1) the top X editors that added the most external links in the last Y days, per wiki; (2) for the top X external links, the list of IPs that added these links.
  • Next steps
    • Define one or two signals for suggested investigations teams, current candidates
    • Propose a hypothesis for the next phase of this work, in discussion with product teams
    • PMs will share the prototype for feedback with community members
    • Decide if edit types should use the HTML or wikitext (cc @Isaac), see T378617
    • Start process with DPE to have a productionized edit types dataset (T351225)
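
As a rough illustration of the two example queries discussed with the Anti abuse team, the sketch below runs them over a small in-memory list of link-addition events. It is not the prototype's implementation: the event schema (`wiki_db`, `editor`, `url`, `timestamp`), the function names, and the sample data are all assumptions made for this example.

```python
from collections import Counter
from datetime import datetime, timedelta
import ipaddress


def is_ip(editor):
    """Return True if the editor identifier parses as an IPv4/IPv6 address."""
    try:
        ipaddress.ip_address(editor)
        return True
    except ValueError:
        return False


def top_link_adders(events, wiki_db, now, days, x):
    """Query 1: top x editors by external links added in the last `days` days."""
    cutoff = now - timedelta(days=days)
    counts = Counter(
        e["editor"] for e in events
        if e["wiki_db"] == wiki_db and e["timestamp"] >= cutoff
    )
    return counts.most_common(x)


def ips_for_top_links(events, wiki_db, x):
    """Query 2: for the top x external links on a wiki, the IPs that added them."""
    per_wiki = [e for e in events if e["wiki_db"] == wiki_db]
    link_counts = Counter(e["url"] for e in per_wiki)
    return {
        url: sorted({e["editor"] for e in per_wiki
                     if e["url"] == url and is_ip(e["editor"])})
        for url, _ in link_counts.most_common(x)
    }


def ev(wiki_db, editor, url, day):
    """Build one hypothetical link-addition event."""
    return {"wiki_db": wiki_db, "editor": editor, "url": url, "timestamp": day}


# Made-up sample events, for illustration only.
events = [
    ev("enwiki", "192.0.2.1", "https://spam.example/a", datetime(2025, 8, 30)),
    ev("enwiki", "192.0.2.1", "https://spam.example/a", datetime(2025, 8, 29)),
    ev("enwiki", "192.0.2.2", "https://spam.example/a", datetime(2025, 8, 28)),
    ev("enwiki", "ExampleUser", "https://news.example/b", datetime(2025, 8, 27)),
    ev("enwiki", "ExampleUser", "https://news.example/b", datetime(2025, 8, 26)),
    ev("enwiki", "ExampleUser", "https://news.example/b", datetime(2025, 8, 25)),
    ev("enwiki", "ExampleUser", "https://news.example/b", datetime(2025, 7, 1)),
    ev("dewiki", "192.0.2.3", "https://spam.example/a", datetime(2025, 8, 30)),
]

now = datetime(2025, 8, 31)
top_editors = top_link_adders(events, "enwiki", now, days=7, x=2)
top_link_ips = ips_for_top_links(events, "enwiki", x=2)
```

Both queries reduce to a filter followed by a grouped count, so they should map onto whichever database is chosen for the prototype; X and Y are plain parameters (`x`, `days`) here.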