
Q1 FY2025-26 Goal: Apply the Tone Check model to published articles, to learn whether we can build a pool of high-quality structured tasks for new editors
Closed, Resolved · Public

Description

Hypothesis

If we apply the Tone Check model to published articles, we will learn whether we can identify the ≥10,000 tone issues with a probability score of 0.8 or higher that are needed to build a high-quality pool of suggestions to help guide editors in improving article tone.
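As a rough illustration, the success criterion boils down to counting high-confidence detections against the pool target. Only the threshold (0.8) and target (10,000) come from the hypothesis; the function and data shapes below are hypothetical:

```python
# Hypothetical sketch of the hypothesis' success criterion.
# `predictions` is an assumed list of (paragraph_id, probability) pairs.

SCORE_THRESHOLD = 0.8   # minimum probability for a usable suggestion
POOL_TARGET = 10_000    # suggestions needed for a high-quality pool

def count_pool_candidates(predictions):
    """Count detections confident enough to become structured tasks."""
    return sum(1 for _pid, score in predictions if score >= SCORE_THRESHOLD)

def pool_is_viable(predictions):
    """True if the run yields enough candidates to build the pool."""
    return count_pool_candidates(predictions) >= POOL_TARGET
```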

Scoping details

Use case:

This model will support a new Suggested Edit task that invites contributors—especially newcomers—to improve the neutrality of existing Wikipedia articles by identifying and rewriting biased, promotional, or peacock language. The intended audience includes users engaging with Suggested Edits via the Newcomer Homepage. The model’s outputs will be surfaced as highlighted sentences or paragraphs within articles, accompanied by calls to action encouraging users to revise them to align with Wikipedia's neutral point of view (NPOV) policy.

This task explores the broader hypothesis that Edit Checks and Suggested Edits can share underlying detection logic. If successful, this approach could improve efficiency, consistency, and scalability across structured editing workflows.

Related tasks:

Model purpose:

The model should analyze article content and detect instances of biased tone or peacock language at a sentence or paragraph level. These detections will inform Suggested Edits, guiding contributors to revise non-neutral phrasing.

Goal:

This project aims to improve article quality by encouraging neutral, policy-aligned contributions. Specific goals include:

  • Increasing the number of constructive Suggested Edits
  • Reducing the burden on moderators by proactively addressing biased language
  • Supporting newcomers in learning and applying Wikipedia’s NPOV guidelines

Key success metrics include:

  • Accuracy of model detections (precision/recall)
  • Revert rate and/or qualitative review of resulting edits
  • Completion rate of "neutral tone" Suggested Edits
Prior art:

This project builds on that work by adapting the UX of existing Suggested Edits and Edit Checks:

Prioritization details

Timing:

The Growth team hopes to start work on this project in July 2025. Depending on timelines, we could simply start by assisting with a model evaluation while finalizing design and architectural decisions in Q1 FY25/26.

This work complements, but does not block, other active projects.

KR impact:

FY25/26 WE1.1 KR:
Increasing newcomer constructive activation and retention:

Increase constructive edits [i] by X% for editors with less than 100 cumulative edits, as measured by experiments by the end of Q2.
i. "Constructive edits" = edits that are not reverted within 48 hours of being published

Other comments

Model requirements:

  • Detection should be precise enough (sentence or paragraph level) to support actionable user suggestions
  • Low false positive rate is essential to maintain user trust and minimize disruption
  • Ideally, the suggestion queue is built in a way that allows for Community Configuration (e.g., the ability for admins to define rules to exclude certain pages, sections, or words), which would improve usefulness and community adoption
  • The model should be efficient and scalable for use across many articles and languages
  • The model should ideally exclude suggestions that target direct quotes, as peacock language or non-neutral tone may be appropriate in these contexts (e.g., when quoting historical texts, public statements, or notable quotations).
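For the last requirement, a naive placeholder for quote exclusion might check whether a flagged span falls inside quotation marks. This is purely illustrative; real handling would need wikitext structure (e.g. {{quote}}, <blockquote>) rather than a character-level heuristic:

```python
import re

def inside_direct_quote(paragraph, span_start, span_end):
    """Rough heuristic: True if the flagged span [span_start, span_end)
    sits entirely inside a double-quoted region of the paragraph.
    A real implementation would parse wikitext quote markup instead."""
    for match in re.finditer(r'"[^"]*"', paragraph):
        if match.start() <= span_start and span_end <= match.end():
            return True
    return False
```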

Reporting format

Progress update on the hypothesis for the week, including if something has shipped:

Any updates on metrics related to this hypothesis (including baseline, target, or actuals, if applicable):

Any emerging blockers or risks:

Any unresolved dependencies:

New lessons from the hypothesis:

Changes to the hypothesis scope or timeline:

Related Objects

Event Timeline

KStoller-WMF renamed this task from AI/ML Model Request: **Peacock Language Suggested Edit** to AI/ML Model Request: Peacock Language Suggested Edit.
KStoller-WMF renamed this task from AI/ML Model Request: Peacock Language Suggested Edit to AI/ML Model Request: "Improve Tone" Suggested Edit.Jun 12 2025, 6:05 PM
KStoller-WMF updated the task description.

Just dropping a quick note to acknowledge the Research tag/dependency here and that we're excited about this work but will wait for guidance from Machine Learning around prioritization and needs.

Sucheta-Salgaonkar-WMF renamed this task from AI/ML Model Request: "Improve Tone" Suggested Edit to Goal: Apply the Tone Check model to published articles, to learn whether we can build a pool of high-quality structured tasks for new editors.Jul 24 2025, 10:13 PM
Sucheta-Salgaonkar-WMF renamed this task from Goal: Apply the Tone Check model to published articles, to learn whether we can build a pool of high-quality structured tasks for new editors to FY2025-26 Q1 Goal: Apply the Tone Check model to published articles, to learn whether we can build a pool of high-quality structured tasks for new editors.
Sucheta-Salgaonkar-WMF updated the task description.
Sucheta-Salgaonkar-WMF renamed this task from FY2025-26 Q1 Goal: Apply the Tone Check model to published articles, to learn whether we can build a pool of high-quality structured tasks for new editors to Q1 FY2025-26 Goal: Apply the Tone Check model to published articles, to learn whether we can build a pool of high-quality structured tasks for new editors.Jul 24 2025, 10:15 PM

@SSalgaonkar-WMF hi! Is this task in progress? What support do you need from @diego / Research?

@Miriam hi!! Yes, this is! I don't anticipate that we'll need support from Research, other than pinging @diego with occasional questions in the #structured-task-improve-tone channel. Is that okay?

Sure! I might remove the Research tag then if that's ok? If you identify a larger task for Research we can put it back and/or create a subtask.

@Miriam yes totally, sorry for not doing that sooner! I'll do a quick pass of our board to see if there are other places where y'all are unnecessarily tagged

We currently have two tracks in progress. The first is an analysis effort to determine if we can create a pool of high-quality structured tasks for new editors. The second track involves understanding the product requirements for the "beta experience" targeted for Q1 delivery, and making architectural decisions on the ML side.

Analysis work

Goal: Determine whether we can gather sufficient high-quality structured tasks with tone issues (target: 10K). The detailed process is outlined in this document. Here's a summary:

  • Step 1: Using a predefined list of article types, query data from two Wikipedias: English (representing large wikis) and Czech (representing small wikis). Calculate the total number of articles and paragraphs we would pass into the model.
  • Step 2: Estimate the number of structured tasks generated by sampling articles, processing their paragraphs through our model to identify high-probability positive predictions, and then evaluating the quality of these tasks through both metadata analysis and manual assessment.

This work is currently in the planning phase, and we are gathering feedback from the Growth/Research/ML teams. It will likely be done using a one-off Jupyter notebook with a Spark job for data collection, and manual evaluation using spreadsheets.
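The Step 2 extrapolation amounts to scaling a sample's positive rate up to the full corpus. A minimal sketch with illustrative names (the real computation would live in the Spark job):

```python
def estimate_pool_size(total_paragraphs, sample_positives, sample_size):
    """Extrapolate the number of high-probability tone issues from a sample.

    total_paragraphs: paragraphs across the predefined article types (Step 1)
    sample_positives: sampled paragraphs scoring above the probability cutoff
    sample_size: number of sampled paragraphs run through the model
    """
    if sample_size <= 0:
        raise ValueError("sample_size must be positive")
    positive_rate = sample_positives / sample_size
    return round(total_paragraphs * positive_rate)
```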

Product requirements for the user-facing launch

Note: Before implementing any architectural decisions, the ML team needs to understand the product requirements. We should remain open to various approaches that meet these requirements. Unless explicitly stated as product requirements, we should not specify any particular architecture.

Based on the discussion with Michael from the Growth team, here are key requirements:

  1. Articles identified with a tone issue must be queryable in Cirrus Search/WeightedTags
  2. Only articles that have remained unedited for at least 24 hours are considered valid
    • If an identified article gets edited, it becomes invalid and must be updated in Cirrus Search/WeightedTags
  3. Plaintext paragraphs with a tone issue and their model predictions need to be served in some way (Data Gateway is one option, but not the only possibility)

Architectural options

Idea 1: Offline + serving predictions via Data Gateway:
  1. Task generation: an Airflow DAG running a Spark job
    • Collect articles from a predefined list of article types
    • Only process articles that have remained unedited for at least 24 hours (requirement 2)
    • Parse content, split article into paragraphs, and run the tone check model.
    • Save high-probability positive predictions and metadata in Hadoop
  2. Populate Cassandra (AQS) for serving: an Airflow DAG to feed Cassandra with improve-tone suggestions (possibly similar to cassandra_dag.py)
    • Possible schema: wiki_db, page_id, page_title, revision_id, paragraphs_list, predictions_list
    • Need to investigate: "What steps are needed after adding the data to Cassandra so it can be served via Data Gateway?"
  3. Populate and update CirrusSearch weighted_tags (requirements 1, 2): a minimal stream application in LiftWing
    • Initially, produce mediawiki.cirrussearch.page_weighted_tags_change.rc0 events to populate weighted_tags for articles identified with tone issues (generated in step 1)
    • The application listens to the mediawiki.page_change.v1 event stream. When a page from our original pool gets a new revision, it produces mediawiki.cirrussearch.page_weighted_tags_change.rc0 events to unset weighted_tags for the article.
    • We don't need to remove entries from the database, as it serves as a pool. Over time, some articles become invalid after being edited, and they will no longer be surfaced in CirrusSearch. We'll track the number of valid articles, and if it drops below a certain threshold, we can re-run the generation pipeline to create a new pool.
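The 24-hour validity rule (requirement 2) used in step 1 can be sketched as a simple timestamp check, assuming we have each article's last-edit time:

```python
from datetime import datetime, timedelta, timezone

STABLE_AGE = timedelta(hours=24)  # requirement 2: unedited for >= 24 hours

def is_valid_for_pool(last_edit, now=None):
    """True if the article has remained unedited long enough to be pooled."""
    if now is None:
        now = datetime.now(timezone.utc)
    return now - last_edit >= STABLE_AGE
```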
Idea 2: Streaming + serving predictions via Lift Wing
  1. Task generation: a stream application in LiftWing that listens to the mediawiki.page_content_change.v1 events stream.
    • Only articles from the predefined list of article types will be processed
    • Parse content, split article into paragraphs, and run the tone check model
    • If the model identifies paragraphs with tone issues, save the page_id, revision_id, paragraph plaintext, prediction, and timestamp in ML Cassandra, but mark it as invalid initially
    • If the page entry already exists in ML Cassandra, update the revision_id and timestamp, and produce mediawiki.cirrussearch.page_weighted_tags_change.rc0 events to unset weighted tags for the article (requirement 2)
  2. Populate CirrusSearch weighted_tags (requirements 1, 2): a continuously running script that marks page entries in ML Cassandra as valid after they've remained unedited for more than 24 hours, then produces mediawiki.cirrussearch.page_weighted_tags_change.rc0 events to set weighted tags for the article
  3. Serving: an inference API for serving data in ML Cassandra
    • In a cold-start scenario, it's uncertain how long it will take to collect enough improve-tone suggestions (10,000) from page change events, and some articles we're interested in (e.g. those with low pageviews) may not get any new edits soon. One solution is to run a one-time backfill of selected articles identified in our initial analysis.

We discussed this during our ML:Research:Data Platform meeting last week.

Next steps:

  • Follow up with the Growth Team to see if adding filters in ElasticSearch could satisfy requirement 2, since this requirement significantly increases design complexity.

Could you briefly outline what filter you have in mind? I glanced at https://www.mediawiki.org/wiki/Help:CirrusSearch but didn't see a filter that would seem to be applicable here.


@Michael: Yes, @dcausse mentioned that ElasticSearch can filter on the last modification date of an article, and this filter can be added to the search query. This means that on the ML side, we don't need to delay publishing weighted tags for 24 hours. We could just consume the mediawiki.page_content_change.v1 events normally and, when we detect an article with tone check issues, produce mediawiki.cirrussearch.page_weighted_tags_change.rc0 events to set weighted tags for that article.

This would greatly simplify the streaming solution's logic. For the offline solution it doesn't change much, but we could potentially skip the step of verifying that an article hasn't been edited since the prediction was created when we produce mediawiki.cirrussearch.page_weighted_tags_change.rc0 events in step 3.

@dcausse: I would appreciate any input and corrections if I've misunderstood anything :)

@achou thanks for the ping!

Yes, you're correct (just a quick correction: the weighted_tag stream is now in v1). It should be fairly easy to add a new keyword on our side to allow filtering articles based on their last edit timestamp. At a glance it could look like this: lasteditdate:<now-24h to exclude pages edited in the last 24 hours (details TBD).
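Assuming the (still hypothetical) lasteditdate keyword lands roughly as sketched, the consumer side would just compose it into the search string. Both keyword spellings below are assumptions pending the actual task:

```python
def tone_task_search_query(max_age="now-24h", tag="tone"):
    """Compose an illustrative CirrusSearch query string combining a
    weighted-tag lookup with the proposed last-edit filter.
    Keyword names and syntax are assumptions, not a shipped feature."""
    return f"hasrecommendation:{tag} lasteditdate:<{max_age}"
```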

@Michael if this is easy enough on your side, I believe it would bring a lot more flexibility: you could adjust this threshold easily without a change in the data pipeline and, at the same time (IIUC), greatly reduce the complexity of this data pipeline. Please let us know if/when you will need this and I'll file a new task.

Relatedly for search we prefer in general the streaming approach for several reasons:

  • if tags are sent to mediawiki.cirrussearch.page_weighted_tags_change.v1 (with the "rev_based" = true flag) within 10 minutes of the edit time, we will join them with normal index updates, reducing the number of updates we perform to the search backend.
  • search indices reflect the current state of the MW database (~10 min lag), so processes sending tags based on batch datasets (which can reflect the MW state as it was hours or even weeks ago) are prone to many more inconsistencies.

Yes, that is something that would work well on our side! And it might even unlock some simplifications for the other newcomer tasks down the road!

However, and just to make sure we're still aligned on this, the logic to clear the weighted tags for pages that have been edited (and no longer have a tone-issue) is still required. Does that match your understanding as well?

[...] For the offline solution, not much, but we could potentially skip the step of verifying that an article hasn't been edited since the prediction was created when we produce mediawiki.cirrussearch.page_weighted_tags_change.rc0 events in step 3.

Not sure if I fully follow here. We don't want to add a weighted tag that corresponds to an old revision if the new revision does not also have a tone issue, right?

the logic to clear the weighted tags for pages that have been edited (and no longer have a tone-issue) is still required.

Q: If we make a slight alteration to Idea 2, perhaps not?

Idea 2b: Streaming + serving predictions via Lift Wing cassandra cache/materialized view

  1. Task generation: a stream application in LiftWing that listens to the mediawiki.page_content_change.v1 events stream.
    • Only articles from the predefined list of article types will be processed
    • Parse content, split article into paragraphs, and run the tone check model
    • If the model identifies paragraphs with tone issues:
      • save the page_id, revision_id, paragraph plaintext, prediction, and timestamp in ML Cassandra
      • Emit mediawiki.cirrussearch.page_weighted_tags_change.v1 to update tone check tag for article
    • If no tone issue:
      • If an entry previously exists in Cassandra, mark the entry as invalid (or just delete it?)
      • Emit mediawiki.cirrussearch.page_weighted_tags_change.v1 to clear tone check tag for article.

search indices reflect the current state of the MW database (~10mins)

There is a race condition here: search will take ~10 minutes before the weighted_tag is updated, but invalidating (or deleting?) the Cassandra record happens immediately. The product could avoid this by simply not showing the task if no record is in Cassandra.
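That workaround amounts to a read-side guard. A sketch, with a plain dict standing in for the ML Cassandra table:

```python
def task_for_display(page_id, pool):
    """Return a tone task only if a still-valid record exists in the store.

    `pool` is a stand-in dict for ML Cassandra, keyed by page_id; entries
    carry a `valid` flag as in Idea 2b. This hides the window where search
    still lists a page whose record was already invalidated or deleted.
    """
    entry = pool.get(page_id)
    if entry is None or not entry.get("valid", False):
        return None
    return entry
```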


I'll offer one more streaming idea that is much cleaner, but maybe a bit more complicated to operate.

Idea 2c: Streaming + serving predictions via Lift Wing cassandra cache/materialized view populated from page_tone_prediction_change stream

  1. Task generation: a stream application in LiftWing that listens to the mediawiki.page_content_change.v1 events stream.
    • Only articles from the predefined list of article types will be processed
    • Parse content, split article into paragraphs, and run the tone check model
    • Emit the mediawiki.page_tone_prediction_change.v1 event stream (similar to T328899 and T382295)
  • Separate processor(s?) consumes mediawiki.page_tone_prediction_change.v1
    • If the page has tone issues
      • save the page_id, revision_id, paragraph plaintext, prediction, and timestamp in ML Cassandra
      • Emit mediawiki.cirrussearch.page_weighted_tags_change.v1 to update tone check tag for article
    • If no tone issue:
      • If an entry previously exists in Cassandra, mark the entry as invalid (or just delete it?)
      • Emit mediawiki.cirrussearch.page_weighted_tags_change.v1 to clear tone check tag for article.

2c pros:

  • mediawiki.page_tone_prediction_change.v1 stream becomes source of updates for both cassandra and opensearch
  • backfilling is easier: batch generate mediawiki.page_tone_prediction_change.v1 events and emit them to mediawiki.page_tone_prediction_change.v1 (or another 'reconciliation' stream that uses the same schema and is also consumed by the stream processor).

2c cons:

  • extra stream processing service to maintain
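The 2c processor's dispatch logic is small. A sketch with illustrative event fields (not a real schema) and a dict standing in for ML Cassandra:

```python
def process_tone_prediction_event(event, pool):
    """Consume one (proposed) page_tone_prediction_change event and return
    the weighted-tags action to emit, so both Cassandra and search are
    driven by the same stream. Field names here are illustrative only.
    """
    page_id = event["page_id"]
    if event.get("paragraphs_with_issues"):
        pool[page_id] = {
            "revision_id": event["revision_id"],
            "paragraphs": event["paragraphs_with_issues"],
            "predictions": event["predictions"],
        }
        return "set"
    # No tone issue in the new revision: drop any stale entry, clear the tag.
    pool.pop(page_id, None)
    return "clear"
```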

the logic to clear the weighted tags for pages that have been edited (and no longer have a tone-issue) is still required.

Q: If we make a slight alteration to Idea 2, perhaps not?

Isn't that exactly "logic to clear the weighted tags"?^^ To me, this sounds like it should work well enough.

search indices reflect the current state of the MW database (~10mins)

There is a race condition here: search will take ~10 minutes before the weighted_tag is updated, but invalidating (or deleting?) the Cassandra record happens immediately. The product could avoid this by simply not showing the task if no record is in Cassandra.

Yes, that is something that should not be too much effort, and we have done similar things for the other structured tasks as well.

@Ottomata Thanks for the input! <3

First I want to clarify one thing: the current tone check service in Lift Wing is built for the edit-check project. The service receives paragraph text directly from VisualEditor without needing to fetch or process article content on our end. We simply feed the received text to the model for predictions.
However, for the improve-tone structured tasks, we need to handle this part ourselves: if we go for a streaming solution, we will need to either extend the current service or create a new service in Lift Wing that fetches and processes article content, parses it into paragraphs, and runs the paragraphs through the tone check model.

Back to the two streaming ideas, for me 2b has fewer unknowns and design decisions. It can be implemented more quickly than 2c, while 2c has cleaner component decoupling but requires more work and design decisions, for example: 1) We need to design the schema for mediawiki.page_tone_prediction_change.v1. Could it be based on the prediction_classification_change? 2) The separate processor is unlike any other inference services we have on Lift Wing (I think it would be more similar to the mediawiki-event-enrichment?), so should it live in our inference service repo? However, this separate stream processing service could potentially serve as a foundation for future structured tasks' update pipelines.

For this project, ideally we want to launch something beta in Q1. Currently we're looking at a predefined list of article types to generate at least 10K tasks offline (see Analysis work in this comment). A question I have is: How can we build this initial solution while ensuring we're designing a thoughtful and extendable architecture for the long term?


For the beta launch, one idea is to store generated tasks (results from T401968) as a one-off dataset and serve them via Data Gateway. We would use the finalized data model/schema decided in T401021, but wouldn't need to update the tasks. These tasks would serve only this prototype, not production. This approach would give us more time to plan and build the update pipeline, while enabling the Growth team to integrate it into their improve-tone POC sooner.

@Eevans Would this be viable for the Data Persistence team?
@Michael What do you think?
cc @SSalgaonkar-WMF

Looping in @KStoller-WMF and @Urbanecm_WMF for more perspectives.


While I think that this would be useful, I don't think this initial batch is absolutely necessary for us. We (Growth) just discussed a related question and agreed that using entirely mock data on patch-demo is technically good enough to finish our hypothesis for Q1. So if this would be a large amount of additional effort, and from today's meeting I gather that it might be, then it might not be worth it, and we should maybe put that effort toward getting the actual pipeline up and running.

Though I wonder if there could be a smaller step than going from nothing to full-scale production usage. Maybe a first version of that pipeline could target testwiki or test2wiki?


So —if I understand correctly— this would be the final solution in every respect, with the exception that updates wouldn't (yet) propagate? If this were useful in making your milestones, I think that would be OK.

Keep in mind that we do have a staging environment as well (complete with Data Gateway, and a staging Cassandra cluster). If Growth is able to run their POC from there, that could be a good choice as well.


That sounds promising! Is this documented anywhere? I've seen a URL with -staging on https://wikitech.wikimedia.org/wiki/Data_Gateway but not sure what I should do with that.
And does staging imply that it is accessible from the beta wikis aka deployment-prep/labs?


That sounds promising! Is this documented anywhere? I've seen a URL with -staging on https://wikitech.wikimedia.org/wiki/Data_Gateway but not sure what I should do with that.

There is https://wikitech.wikimedia.org/wiki/Cassandra/Staging, but that is mostly about obtaining/using developer access to the Cassandra cluster. Otherwise, there isn't much else to document. Data can be written to the staging storage cluster the same as you do to production. It's a smaller cluster, so it may make sense to only push a (stable) subset of what you would push to production. As you noted above, there is a staging version of the gateway backed by this cluster, against which you can test requests accordingly.

And does staging imply that it is accessible from the beta wikis aka deployment-prep/labs?

No, the staging cluster is in the "production" network, which is inaccessible from deployment-prep. :(

@achou, @KStoller-WMF: Quick question: what team will own the tone edit suggestion data pipeline after it is built? We should try to avoid a repeat of image suggestions data pipeline, which as of this year is no longer owned.

Last week I met with @Michael to discuss the goal for Q1.

  • For the Growth team, the plan is still to use mock data on patch-demo.
  • For the ML team, the goal is to analyze whether we have 10K quality suggestions (T401968).
  • A stretch goal: a PoC on testwiki. The Growth team runs the whole pipeline on testwiki with a small set of articles with tone issues (say, 10 articles). ("Pipeline" here refers to the UI/Growth experience workflow, not the data pipeline.) This would ensure that all the integration works.

To enable the PoC on testwiki, the ML team will need to:

  • Work with the Data Persistence and DPE teams to finalize the data model/schema, and place a small set of articles with tone issues in the staging Cassandra cluster, making them accessible via the staging Data Gateway.
  • Work with the Search team to add the new weighted tags for tone suggestions, and produce weighted tags for the small set of articles to the search indices, which Growth can then search with the keyword hasrecommendation:

Following the convention used for other structured tasks, the tag family for tone suggestions can be:

"recommendation.tone/exists|1"

Regarding the final solution for the data pipeline, the ML team plans to use an event-based solution, which is much preferable to a snapshot-based approach (per a meeting last week). We might want to use Flink, which we haven't worked with before. Therefore, this work needs to be planned properly, and the actual work could span an entire quarter.

@achou, @KStoller-WMF: Quick question: what team will own the tone edit suggestion data pipeline after it is built? We should try to avoid a repeat of image suggestions data pipeline, which as of this year is no longer owned.

ML team will own the data pipeline.

So —if I understand correctly— this would be the final solution in every respect, with the exception that updates wouldn't (yet) propagate? If this were useful in making your milestones, I think that would be OK.

The PoC work for testwiki described above doesn't cover the pipeline for data generation and updates. The data will be produced in an ad-hoc way (we only need a small set of around 10 articles), and it doesn't need to be updated.

cc @SSalgaonkar-WMF @KStoller-WMF

Great stuff thank you Aiko!

the ML team plans to use an event-based solution, which is much preferable to a snapshot-based approach (from a meeting last week). We might/want to use Flink, which we haven't worked with before. Therefore, this work needs to be planned properly and the actual work could span an entire quarter.

FWIW, as noted in T401021#11159027, event-based does not necessarily mean stream processing / Flink. You can source the updates from Hive event database tables, e.g. event.mediawiki_page_content_change.v1, and still use Airflow. This might actually be advantageous here, especially if you only want to generate new recommendations daily-ish. You can choose to generate a task for only the latest edit (or change, e.g. a page delete) per page in your time period, potentially reducing the number of updates you need to send to Cassandra.
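Collapsing a batch window to the latest change per page is a simple reduction. A sketch with illustrative field names (the real job would be Spark/SQL over the Hive event table):

```python
def latest_change_per_page(events):
    """Keep only the most recent change event per page from a batch window,
    so at most one task (and one Cassandra update) is generated per page.
    `events` is an assumed list of dicts with page_id and an ISO-8601
    revision timestamp (ISO strings compare correctly as plain strings)."""
    latest = {}
    for ev in events:
        current = latest.get(ev["page_id"])
        if current is None or ev["rev_dt"] > current["rev_dt"]:
            latest[ev["page_id"]] = ev
    return list(latest.values())
```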


It does have the potential to reduce the number of updates to Cassandra in a day, but a streaming approach would amortize those updates (rather than having them all at once). Which is better? That's hard to say; there are pages edited very frequently, but most are not.

Based on my understanding of what is being built, an event-based approach would produce a better user experience, which I think will be true of similar services (so we should work to get better at this pattern). I would just do what makes sense for the product here, and not worry about optimizing the number of overall updates (if that even ends up being an optimization).

an event-based approach would produce a better user experience, which I think will be true of similar services

I think Eric here means a 'realtime/stream event-based approach'. We can do batch event-based processing, as noted above.

EDIT: Oh, Eric edited his original comment to say exactly this! :)

And that definitely makes sense. I'm very pro event-based approaches, batch or realtime, and I am also pro realtime, but whichever is better for Aiko and the ML team here is fine with me.


As for updates to Cassandra, what I would really like to see one day is:

Data transfer (of this kind) is standardized using events with Kafka in the middle.

E.g.

page_change event
-> <transform events to tasks, stream or batch>
-> page_tone_task_change event stream
-> <standardized cassandra|data store ingestion process>
-> cassandra|data store.
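The transform step in the sketch above could look something like the following. This is a hypothetical Python sketch only: the event field names and the stubbed scoring function are illustrative, not a real schema or the actual Tone Check model:

```python
# Hypothetical <transform events to tasks> step: turn a page_change
# event into a page_tone_task_change event by scoring each sentence
# of the new revision and keeping only high-confidence findings.

def score_tone(sentence):
    # Stand-in for the Tone Check model; flags a few obvious
    # peacock words for demonstration purposes only.
    peacock = {"legendary", "world-famous", "best"}
    return 0.9 if any(w in sentence.lower() for w in peacock) else 0.1

def to_tone_task_event(page_change, threshold=0.8):
    findings = [
        {"sentence": s, "score": score_tone(s)}
        for s in page_change["sentences"]
    ]
    findings = [f for f in findings if f["score"] >= threshold]
    return {
        "page_id": page_change["page_id"],
        "rev_id": page_change["rev_id"],
        # An empty task list can signal downstream ingestion to
        # delete any previously stored task for this page.
        "tasks": findings,
    }

event = {
    "page_id": 42,
    "rev_id": 1001,
    "sentences": ["She is a legendary artist.", "She was born in 1970."],
}
task_event = to_tone_task_event(event)
```

Emitting the result as its own `page_tone_task_change` stream is what would let a standardized ingestion process handle the Cassandra side, as suggested above.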

This is similar to what Search does with their Search Update Pipeline and weighted tags stream (there are some differences in what I'm suggesting but we can discuss those later).

But, all of that is probably for the future. :)


I concur with @Eevans here. I'll add that on our side we found it harder to operate search weighted tags with batch updates than with streaming (comparing our experience between image recommendations and article topics, for instance):

  • we have to rate-limit big pushes; if we multiply the number of batch processes, it becomes very hard for us to align proper rate limits across all these batch producers
  • DAG operations can be error-prone. Say the batch scheduled two days ago failed (which will almost certainly happen in practice): can you safely re-run it if subsequent days have already run? For Search, no: you will certainly override newer tags with old data. This is why I find batch approaches to updating a realtime-updated dataset like search far from ideal.
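One way to defuse the stale-re-run problem described above is to make writes conditional on the source revision, so an old batch can never clobber newer data. This is a minimal Python sketch under assumed semantics (a plain dict stands in for Cassandra, and keying by `rev_id` is an illustrative choice, not the actual design from T401021):

```python
# Make batch re-runs safe: attach the source revision to each task
# record and refuse to overwrite newer data with older data, so a
# late re-run of an old day's batch becomes a no-op.

store = {}  # page_id -> {"rev_id": ..., "tasks": [...]}

def upsert_tasks(page_id, rev_id, tasks):
    """Last-writer-wins by revision; stale or duplicate writes are skipped."""
    current = store.get(page_id)
    if current is not None and current["rev_id"] >= rev_id:
        return False  # older (or duplicate) data; skip
    store[page_id] = {"rev_id": rev_id, "tasks": tasks}
    return True

# Day N's run writes tasks derived from rev 1002...
upsert_tasks(7, 1002, ["task-a"])
# ...then a re-run of day N-1 (rev 1001) arrives late and is ignored.
upsert_tasks(7, 1001, ["stale-task"])
```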

Weekly Report

Progress update on the hypothesis for the week, including if something has shipped:

  • Completed gathering data for article list analysis and adding additional samples. Summarized findings in T401968#11176559
  • Discussed the Q1 goal with the Growth team. Conclusions documented in T392283#11163883
  • Created architecture diagrams for data pipeline solutions for team discussion.

Any updates on metrics related to this hypothesis (including baseline, target, or actuals, if applicable):

  • N/A

Any emerging blockers or risks:

  • N/A

Any unresolved dependencies:

  • N/A

New lessons from the hypothesis:

  • Learned about different approaches to weighted-tag ingestion (from the recent ML<>Research<>Data Platform meeting):
    • A small/test dataset (for the Q1 PoC) can be pushed manually
    • For production, we can use either batch or realtime ingestion, with realtime being preferred. The Search Platform team can assist with the bootstrap/initial ingestion if needed.

Changes to the hypothesis scope or timeline:

  • N/A

Weekly Report

Progress update on the hypothesis for the week, including if something has shipped:

  • Working with Data Persistence team on the data model design in T401021

Any updates on metrics related to this hypothesis (including baseline, target, or actuals, if applicable):

  • N/A

Any emerging blockers or risks:

  • N/A

Any unresolved dependencies:

  • N/A

New lessons from the hypothesis:

  • If hasrecommendation could be adapted to allow filtering by score (e.g. hasrecommendation:tone>0.7), it would be valuable; it could also apply to Add a Link, which would greatly simplify the current workflow (discussion in T401021#11193688; ticket created: T405059)
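To make the proposal concrete, the suggested keyword shape could be parsed roughly as follows. Note the syntax itself (`hasrecommendation:tone>0.7`) is only a proposal from the linked discussion, so this Python sketch is hypothetical, not CirrusSearch's actual keyword handling:

```python
import re

# Illustrative parser for the *proposed* syntax, splitting a
# keyword into a task type and an optional minimum score.
PATTERN = re.compile(r"hasrecommendation:(\w+)(?:>([01](?:\.\d+)?))?")

def parse_keyword(text):
    m = PATTERN.fullmatch(text)
    if not m:
        return None
    task_type, score = m.group(1), m.group(2)
    return {"type": task_type, "min_score": float(score) if score else 0.0}

parse_keyword("hasrecommendation:tone>0.7")  # filter tone tasks by score
parse_keyword("hasrecommendation:link")      # no threshold: match all scores
```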

Changes to the hypothesis scope or timeline:

  • Updated timeline:
    • Finalize the data model design with Data Persistence [Sep 23]
    • Make decisions about task generation based on the analysis results: (1) which article types to use; (2) which preprocessing/postprocessing steps we need (e.g. excluding quotes) [Sep 25]
    • Prepare PoC data for testwiki. Coordinate with Data Persistence (DataGateway/Cassandra) and Search team (CirrusSearch). [will create a ticket for this] [Sep 30]
    • Make decisions about data pipeline for task generation/update (streaming vs. batch approach) [Oct 2]
    • Build the data pipeline for task generation/update [TBD, depends on what approach we decide]

Weekly Report

Progress update on the hypothesis for the week, including if something has shipped:

  • Gathering feedback on generated tasks for articles with tone issues for the analysis in T401968
    • This will help us identify potential issues in the data generation pipeline.
  • Continuing work on Cassandra data model design with the Data Persistence team in T401021

Any updates on metrics related to this hypothesis (including baseline, target, or actuals, if applicable):

  • N/A

Any emerging blockers or risks:

  • N/A

Any unresolved dependencies:

  • N/A

New lessons from the hypothesis:

  • N/A

Changes to the hypothesis scope or timeline:

  • Growth team won't do the testwiki PoC, only the patchdemo PoC with mock-data this quarter. Therefore, we no longer need to prepare a one-off dataset for testwiki.

Weekly Report

Progress update on the hypothesis for the week, including if something has shipped:

  • Completed analysis of article types to include/exclude for generating structured tasks. Summary in T401968#11242701.

Any updates on metrics related to this hypothesis (including baseline, target, or actuals, if applicable):

  • N/A

Any emerging blockers or risks:

  • N/A

Any unresolved dependencies:

  • N/A

New lessons from the hypothesis:

  • N/A

Changes to the hypothesis scope or timeline:

  • N/A

@Sucheta-Salgaonkar-WMF I'm wondering whether we should have a new hypothesis for building the data pipeline for revise tone structured tasks. IMO, the current hypothesis only covers analyzing whether we can generate enough high-quality structured tasks, and that analysis is now complete; it doesn't cover actually building a production data pipeline. The good news is that during our analysis and investigation, we've gained valuable knowledge about product requirements and our existing infrastructure, which has helped us answer several key questions about the data pipeline design. What are your thoughts on this?

@AikoChou Great question, and great thinking! I'll start a thread for this in the WE1.1 channel so we can propose this and get feedback from Kirsten, Lauren, and Peter.