
Determine technical approach for Automoderator testing interface
Closed, ResolvedPublicSpike

Description

Internal documentation and overview

Design

TBD - Sketches are available in the documentation linked above.

Annotool

We may consider using Annotool in some capacity (links: Tool, GitLab).

This already provides functionality for loading a dataset and having users review edits.

Investigation

We want to determine the technical approach we will take for building this tool, answering questions such as:

  • Should we build a new tool or extend an existing one (e.g. Annotool)?
  • How will we store data such that it is accessible to our data analyst?
  • Are there concerns about any of the design features, which we should reconsider to simplify the solution?
  • How should we ingest revert risk scores into the interface?

Findings

Should we build a new tool or extend an existing one (e.g. Annotool)?

After spending about a week experimenting with Annotool, I believe that we can either extend it or fork it to meet our needs. I've already been able to add a view for filtering lists of revisions by probability:
https://gitlab.wikimedia.org/jsn/annotool/-/tree/Jsn.sherman/threshold-filter?ref_type=heads

In this case, I grabbed a slider with both a minimum and a maximum handle from the UI library that Annotool is already using.
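At its core, the filter view selects revisions whose score falls between the slider's two handles. A minimal sketch of that logic in Python (the field names "rev_id" and "score" are illustrative assumptions, not Annotool's actual schema):

```python
# Filter a list of scored revisions by a min/max probability range.
# Dict keys ("rev_id", "score") are hypothetical, not Annotool's schema.

def filter_by_threshold(revisions, min_score, max_score):
    """Return revisions whose revert-risk score lies in [min_score, max_score]."""
    return [r for r in revisions if min_score <= r["score"] <= max_score]

revisions = [
    {"rev_id": 101, "score": 0.12},
    {"rev_id": 102, "score": 0.55},
    {"rev_id": 103, "score": 0.91},
]

# Keep only the "marginal" middle of the score distribution.
marginal = filter_by_threshold(revisions, 0.40, 0.80)
```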

image.png (832×1 px, 94 KB)

How will we store data such that it is accessible to our data analyst?

Annotool supports CSV export. We could use that directly, or potentially integrate it with Google Sheets without much hassle. I suggest we don't use the internal DB for long-term storage or analysis.
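As a sketch of the analyst-side workflow, the exported CSV could be read with the standard library alone. The column names below are assumptions about what an export might contain, not Annotool's documented format:

```python
import csv
import io

# Stand-in for a CSV file handed to the analyst; columns are hypothetical.
exported = io.StringIO(
    "rev_id,score,review\n"
    "101,0.12,keep\n"
    "103,0.91,revert\n"
)

rows = list(csv.DictReader(exported))

# Fraction of reviewed revisions the reviewers judged should be reverted.
revert_rate = sum(r["review"] == "revert" for r in rows) / len(rows)
```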

Are there concerns about any of the design features, which we should reconsider to simplify the solution?

Not at this time. I do think it's worth having both a minimum and a maximum threshold, so that a "maybe"/"marginal" bucket of scored revisions can be tested.
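One way to read the min/max thresholds: scores above the maximum are confident enough to act on, scores below the minimum are left alone, and the band in between is the "maybe"/"marginal" bucket. A hypothetical sketch (the bucket names and cutoffs are assumptions for illustration):

```python
def bucket(score, min_threshold, max_threshold):
    """Assign a revert-risk score to one of three hypothetical buckets."""
    if score >= max_threshold:
        return "revert"
    if score >= min_threshold:
        return "maybe"
    return "leave"

# Example cutoffs; the real thresholds are exactly what the tool is meant to test.
labels = [bucket(s, 0.40, 0.80) for s in (0.10, 0.55, 0.95)]
```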

How should we ingest revert risk scores into the interface?

Annotool supports bulk CSV import in addition to accepting individual scores via its API.
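To illustrate the bulk path, a feeder script could write scored revisions into a CSV for import. The column layout below is an assumption; Annotool's actual import format would need to be checked against its documentation:

```python
import csv
import io

# (rev_id, revert-risk score) pairs produced by some upstream scoring script.
scored = [(101, 0.12), (102, 0.55), (103, 0.91)]

buf = io.StringIO()
writer = csv.writer(buf)
writer.writerow(["rev_id", "score"])  # hypothetical header row
writer.writerows(scored)
csv_text = buf.getvalue()
```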

Event Timeline

Restricted Application changed the subtype of this task from "Task" to "Spike". · Aug 30 2023, 12:03 PM
jsn.sherman triaged this task as High priority.
jsn.sherman moved this task from Ready to In Progress on the Moderator-Tools-Team (Kanban) board.

Answers so far:

Should we build a new tool or extend an existing one (e.g. Annotool)?

After spending about a week experimenting with Annotool, I believe that we can either extend it or fork it to meet our needs. I've already been able to add a view for filtering lists of revisions by probability:
https://gitlab.wikimedia.org/jsn/annotool/-/tree/Jsn.sherman/threshold-filter?ref_type=heads

In this case, I grabbed a slider with both a minimum and a maximum handle from the UI library that Annotool is already using.

image.png (832×1 px, 94 KB)

How will we store data such that it is accessible to our data analyst?

Annotool supports CSV export. We could use that directly, or potentially integrate it with Google Sheets without much hassle.

A little more info: internally, Annotool uses a MariaDB database for storage. I don't think we want to automatically log scores for all revisions on an ongoing basis, because then we would need to care quite a bit about capacity and performance. Annotool itself doesn't ask the model to score anything; it saves the scores that some other tool or script posts to it via its API. I suggest we clear out the DB when we're not actively collecting reviews.
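The "clear out the DB between review rounds" idea amounts to a one-statement cleanup job after each export. A sketch using SQLite as a stand-in for Annotool's MariaDB (the table and column names are hypothetical):

```python
import sqlite3

# SQLite stands in for MariaDB here; "reviews" is a hypothetical table name.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE reviews (rev_id INTEGER, score REAL)")
conn.executemany("INSERT INTO reviews VALUES (?, ?)", [(101, 0.12), (102, 0.55)])

# After exporting the round's reviews to CSV, wipe the table.
conn.execute("DELETE FROM reviews")
remaining = conn.execute("SELECT COUNT(*) FROM reviews").fetchone()[0]
```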

I'm still thinking about how we want to select and ingest revisions.

Are there concerns about any of the design features, which we should reconsider to simplify the solution?

Not at this time. I do think it's worth having both a minimum and a maximum threshold, so that a "maybe"/"marginal" bucket of scored revisions can be tested.

How should we select and ingest revision + revert risk scores into the interface?

The selection part is out of scope for this spike, as it is being tackled elsewhere. Annotool supports bulk CSV import in addition to accepting individual scores via its API.

jsn.sherman changed the task status from Open to In Progress.Sep 29 2023, 6:27 PM
jsn.sherman updated the task description. (Show Details)
jsn.sherman moved this task from Eng review to Done on the Moderator-Tools-Team (Kanban) board.

@jsn.sherman I am currently working on gathering a test dataset (T346916) for the pilot wikis (25,000 revisions each). The dataset, along with metadata, will include the revert risk score for each revision, so no API calls will be required. But we will still need to think about saving each reviewer's responses per revision, and about randomizing the dataset for each reviewer.

Thanks for the info:

The reviews can be exported to CSV, so that one is straightforward, IMO. How we implement randomization is basically an implementation detail, not a first-order element of the technical direction.
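For what it's worth, per-reviewer randomization can be as simple as a shuffle seeded on the reviewer's identifier, so each reviewer gets a stable but distinct ordering. A sketch (the seeding scheme is an assumption, not a decided design):

```python
import random

def reviewer_order(rev_ids, reviewer):
    """Return a deterministic, reviewer-specific shuffle of the dataset."""
    rng = random.Random(reviewer)  # seed the RNG on the reviewer's identifier
    shuffled = list(rev_ids)
    rng.shuffle(shuffled)
    return shuffled

rev_ids = list(range(10))
order_a = reviewer_order(rev_ids, "alice")
order_b = reviewer_order(rev_ids, "bob")
```

Seeding on the reviewer name means the ordering survives page reloads and session restarts without any extra state in the database.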

Thanks for this review. We're going to pilot the spreadsheet-based approach, but it's good to know we have clear direction if we decide we need a more technical implementation.