Enable Peacock model to identify positive examples with finer granularity
Open, Needs Triage (Public)

Description

In T368274, we developed a model capable of detecting Peacock behavior at the article level.

This task involves the work of iterating upon this approach to detect peacock behavior at the level of paragraphs, and ideally, sentences.

In doing so, we'll gain the ability to offer people Edit Checks that are relevant to the specific pieces of content they added/affected during a given edit session.

Story

As someone adding new information to Wikipedia and/or changing the meaning of information that's already present, I want to be made aware when the specific additions and/or changes I'm making are at risk of introducing non-neutral language to Wikipedia so that I can increase the likelihood that these edits remain published and avoid making destructive changes that will impact readers and create more work for moderators.

Requirements

@ppelberg to discuss with @diego.

Event Timeline

The main challenge here is finding data to train and test this model. Currently, the data we have is labeled at the article level. I see two possible ways to work around this problem:

  • Using section-level templates (e.g. {{Peacock|section|date=November 2024}}): This would let us refine the granularity from article level to section level. It requires time to collect data and run new experiments.
  • Trying an unsupervised approach: The idea here would be to take articles tagged as peacock and try to find common patterns across them. This would basically be an extension of T371158#10202597, and would require research work as well as some form of community or expert validation (per language).

Ideally we should try both approaches because they are complementary. The first would allow a robust evaluation methodology, although the granularity would only be at section level. The second can give more insights/feedback for editors, but it is more difficult to evaluate and would require some type of user testing, implying more resources and time.
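
To make the first approach more concrete, here is a rough sketch of how section-level labels could be derived from raw wikitext. It is only an illustration under assumptions: the function name `label_sections`, the regexes, and the output format are hypothetical and not part of any existing pipeline, and it only recognizes the `{{Peacock|section|...}}` form shown above (redirects or alternate template names would need extra handling).

```python
import re

# Heading lines such as "== Career ==" or "=== Awards ===" (level 2 and deeper).
HEADING_RE = re.compile(r"^(={2,})[ \t]*(.*?)[ \t]*\1[ \t]*$", re.MULTILINE)

# Assumption: section-level tags look like {{Peacock|section|date=...}}.
PEACOCK_SECTION_RE = re.compile(
    r"\{\{\s*peacock\s*\|\s*section\b[^{}]*\}\}", re.IGNORECASE
)


def label_sections(wikitext):
    """Split wikitext on headings and flag sections carrying a section-level tag."""
    headings = list(HEADING_RE.finditer(wikitext))
    # Each section starts right after its heading; the lead starts at offset 0.
    starts = [(None, 0)] + [(m.group(2), m.end()) for m in headings]
    # Each section ends where the next heading begins, or at the end of the text.
    ends = [m.start() for m in headings] + [len(wikitext)]

    sections = []
    for (title, start), end in zip(starts, ends):
        body = wikitext[start:end]
        tagged = bool(PEACOCK_SECTION_RE.search(body))
        # Remove the tag itself so a model cannot learn from the template string.
        text = PEACOCK_SECTION_RE.sub("", body).strip()
        if text:
            sections.append(
                {"title": title or "(lead)", "text": text, "peacock": tagged}
            )
    return sections


sample = (
    "Intro text.\n\n"
    "== Career ==\n"
    "{{Peacock|section|date=November 2024}}\n"
    "He is a legendary and world-renowned artist.\n\n"
    "== Personal life ==\n"
    "He lives in Lima.\n"
)
for section in label_sections(sample):
    print(section["title"], section["peacock"])
```

Note that this treats every heading, including subsections, as its own section, and that untagged sections in a tagged article are only weak negatives: nobody has necessarily reviewed them.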

ppelberg added a subscriber: dchan.

@dchan: are we aligned in thinking this task can be resolved considering we've figured out a way for Tone Check to identify issues at the level of sentences?

Per today's team meeting, we've not yet evaluated the reliability of either a) the model's ability to output tone issues at a sentence level or b) VE's ability to present sentence-level model output in a reliable UX.

The above, combined with the UX choice to descope feedback of this sort from the MVP, is prompting us to remove this from

@ppelberg is this something you still want us to explore?