
Change language describing "Likely" filters to avoid mentioning "May" filters
Closed, Resolved, Public

Description

The current description of the ORES "Likely" filters is designed to get around the fact that they cover a wide range of precision/recall results—by just describing the recall relative to other filters:

With medium accuracy, finds more problem edits than the “Very Likely” filter but fewer than “May.”

But with the more flexible way we're setting levels now, three wikis (at present) don't have the "May" filters at all. So a new solution is needed that doesn't refer to them. Also, the existing formula is very wordy.

Technical changes

  • We will create two "new" "Likely" filters (one in Quality and one in Intent) that have the same names as the current ones and the same threshold assignments. The only change is that these new filters have different description language.
  • We'll assign the "new" filters to some wikis and the "old" ones to others. (Strictly as internal names, I'm calling the variations "Low" (for low-fit model) and "High" (for high-fit model).) Here are the filter assignments:
  • Quality filter assignments:
    • Low-fit: en, pt, cs, fa, nl, ru, tr, et, fi, ro, sq, fr
    • High-fit wikis: pl, wd, he.
  • Intent filter assignments:
    • Low-fit: en, pt, cs, fa, nl, ru, tr, et, fi, pl, ro, sq, fr
    • High-fit wikis: wd, he.
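
As a rough illustration, the assignments above can be captured in a small lookup table. This is only a sketch: the names `FILTER_ASSIGNMENTS` and `variant_for` are hypothetical, and nothing in this task says the assignments are stored this way. Only the wiki codes are taken from the lists above.

```python
# Hypothetical lookup table; wiki codes are copied from the lists in this task.
FILTER_ASSIGNMENTS = {
    "quality": {
        "low": {"en", "pt", "cs", "fa", "nl", "ru", "tr", "et", "fi", "ro", "sq", "fr"},
        "high": {"pl", "wd", "he"},
    },
    "intent": {
        "low": {"en", "pt", "cs", "fa", "nl", "ru", "tr", "et", "fi", "pl", "ro", "sq", "fr"},
        "high": {"wd", "he"},
    },
}


def variant_for(filter_group, wiki):
    """Return "low" or "high" for a wiki, or None if the wiki is not listed."""
    for variant, wikis in FILTER_ASSIGNMENTS[filter_group].items():
        if wiki in wikis:
            return variant
    return None
```

Laid out this way, pl stands out as the only wiki whose assignment differs between the two groups: high-fit for Quality, low-fit for Intent.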

Language changes

Note that in addition to changing the descriptions of the "Likely" filters, we will also amend the "Very likely" descriptions, changing "highly accurate" to "very highly accurate," to help distinguish these from the high-fit "Likely" filters.

Quality filters

[Description text for "Likely have problems"—Low]
With medium accuracy, finds an intermediate fraction of problem edits.

[Description text for "Likely have problems"—High]
With high accuracy, finds most problem edits.

[Description text for "Very likely have problems"-Both]
Very highly accurate at finding the most obviously flawed or damaging edits.

Intent filters

[Description text for "Likely bad faith"—Low]
With medium accuracy, finds an intermediate fraction of bad-faith edits.

[Description text for "Likely bad faith"—High]
With medium accuracy, finds most bad-faith edits. [yes, "medium" accuracy is correct here.]

[Description text for "Very likely bad faith"—Both]
Very highly accurate at finding the most obvious bad-faith edits.

Event Timeline

Likely have problems [Low]
Likely have problems [High]

After a first quick read, I thought "Low what?", "High what?". I'm afraid most people just do a quick scan.

Likely have problems [Low]
With medium accuracy, finds an intermediate percentage of problem edits.

The sub-line helps, but the [Low] one has a medium accuracy? The [High] one has a "high accuracy":

Likely have problems [High]
With high accuracy, finds most problem edits.

What about having a description that focuses on the spectrum of problematic edits each filter catches?

"Likely have problems [Large spectrum]"
"Likely have problems [Narrow spectrum]"

"Likely have problems [Global]"
"Likely have problems [Precise]"

Roan suggested one way to solve the problem: we can create "new" filters that have the same names as the old ones, but different descriptions. Then we assign the "new" filters to some wikis and the "old" ones to others. Strictly as internal names, I'm calling the variations "Low" (for low-fit model) and "High" (for high-fit model).

Even if these are separate filters internally, the final result for our users is that the "Likely have problems" filter will use a different description based on how precise its underlying ORES model is. This sounds good to me, and the proposed text for the descriptions works well. I also assume that the "[Low]" and "[High]" indicators are just clarifications for the ticket, and they won't be part of the filter titles or exposed to users in any way.

I also assume that the "[Low]" and "[High]" indicators are just clarifications for the ticket, and they won't be part of the filter titles or exposed to users in any way.

That's how I read it too, yes.

It'd be good to avoid referring to other filters, since filters may be dropped, merged, etc. For example, on plwiki the number of filters differs from enwiki; hopefully, the Polish translation of the descriptions does not mention a nonexistent filter.

Screen Shot 2017-05-17 at 12.17.20 PM.png (459×689 px, 95 KB)

BTW, if you want to check the actual stats against the new language, here are the precision/recall figures:

[Description text for "Likely have problems"—Low]
With medium accuracy, finds an intermediate fraction of problem edits.
[actual precision/recall stats: 64/26, 61/38, 61/82, 61/57, 46/39, 47/14, 63/46]

[Description text for "Likely have problems"—High]
With high accuracy, finds most problem edits.
[actual precision/recall stats: pl: 80/91, WD: 76/95, HE: 49/89]


[Description text for "Likely bad faith"—Low]
With medium accuracy, finds an intermediate fraction of bad-faith edits.
[ precision/recall stats: 62/23, 61/38, 63/66, 49/18, 62/62, 53/22, 52/12, 63/61, 62/38]

[Description text for "Likely bad faith"—High]
With medium accuracy, finds most bad-faith edits. [yes, "medium" accuracy is correct here.]
[ precision/recall stats: WD: 60/96, HE: 55/88 ]

Technical question here - it seems like we need to have 2 sub-messages for what we currently have as a single message (so, splitting the description for "Likely have problems" into the "low" case and the "high" case) and then make sure that specific wikis (regardless of the interface language used on them!) receive each specific one.

@Catrope how do we technically do this? Do we create 2 messages for translation, and then let the back-end decide which to ship based on a list of wikis? Do we need to create another sort of global-global variable (or config option?) to set apart the list of wikis per message?

Am I over-analyzing this, or do we need to create the infrastructure to do this? I was going to implement, but then got blocked on how to make sure each wiki gets the correct low/high representation.

My suggestion (and feel free to disagree with it or offer other suggestions) was to create two filters/levels, one called e.g. likely-high and the other likely-low. They would both have separate i18n messages, of course, but we would configure the wikis so that one of them is always disabled.

That works. Another somewhat similar option is to allow for messages override where the filter levels are configured (per wiki).

"OresFiltersThresholds": {
	"damaging": {
		"likelygood": { "min": 0, "max": "recall_at_precision(min_precision=0.995)" },
		"maybebad": false,
		"likelybad": { "min": "recall_at_precision(min_precision=0.6)", "max": 1, "messages": { "description": "...likelybad-high" } },
		"verylikelybad": { "min": "recall_at_precision(min_precision=0.9)", "max": 1 }
	},
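
A minimal sketch of how such a per-level override might be resolved, assuming the `messages.description` shape shown in the snippet above. The function name `resolve_description_key` and the example keys are hypothetical and are not the ORES extension's actual API.

```python
def resolve_description_key(level_config, default_key):
    """Pick the description message key for one filter level.

    level_config is the per-level value from the thresholds config:
    either False (level disabled) or a dict that may carry a
    "messages" override. default_key is used when no override is set.
    """
    if not isinstance(level_config, dict):
        return None  # level disabled, so no description is needed
    override = level_config.get("messages", {}).get("description")
    return override if override else default_key
```

With this shape, only wikis that need the alternative wording would carry a `messages` entry; every other wiki would fall back to the default key.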

Yet another option is to have both likelybad-low and likelybad-high messages defined as described in this ticket but let the code pick one based on the presence of the maybe-bad filter level. Yes, it's hardcoded, like similar rules around filters (subset, conflict).

Thinking about this again, all of these solutions are about equally messy, but your second one (have the code pick the message) requires the least ongoing maintenance in the config file, so I think we should prefer that.

I agree. Also very easy to implement.

To save translators' time, should we say that the current messages correspond to -low and new ones will be created with the suffix -high or should we remove the current ones and create both -low and -high?

EDIT: Looking at the text of the messages, both -low and -high have new wording and need to be translated. I don't know if there's any point in keeping the old messages (without any suffix).

Change 368463 had a related patch set uploaded (by Sbisson; owner: Sbisson):
[mediawiki/extensions/ORES@master] Messages for low and high accuracy likelybad filters

https://gerrit.wikimedia.org/r/368463

Change 368463 merged by jenkins-bot:
[mediawiki/extensions/ORES@master] Messages for low and high accuracy likelybad filters

https://gerrit.wikimedia.org/r/368463

Message 'ores-rcfilters-goodfaith-bad-desc-high' currently says "With medium accuracy...". Message name and analogous 'damaging' filter message suggest that it intends to say "With high accuracy..." instead.

In T164997#3487319, @Pikne wrote:

Message 'ores-rcfilters-goodfaith-bad-desc-high' currently says "With medium accuracy...". Message name and analogous 'damaging' filter message suggest that it intends to say "With high accuracy..." instead.

@Pikne, thanks for your comment. Can you say what wiki you're looking at?

It's from the above patch for which I was translating new messages on translatewiki.net.

From the task description:

[Description text for "Likely bad faith"—High]
With medium accuracy, finds most bad-faith edits. [yes, "medium" accuracy is correct here.]

So it looks like this asymmetry was a deliberate choice on @jmatazzoni's part.

Yes, the precision for the two high fit models is 58 and 60%. So medium.

3 wikis have been added since this task was written: ro, sq, fr. They are all low-fit models, and I'm adding them to the Description. But @SBisson, are you definitely automating this? Good idea!

@SBisson Out of the list

High-fit wikis: pl, wd, he

only hewiki exists in betalabs, but ORES-based filters are not enabled there. hewiki in production has such filters. Any reason why we do not have them in beta? I checked both types of users (just in case), with ores-enabled up_value equal to 0 and 1.

This is how it works in the code (for both damaging and goodfaith):

if maybebad is present -> likelybad uses the -low message
if maybebad is NOT present -> likelybad uses the -high message

There is no new config to deploy for individual wikis.
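
The rule above can be sketched as follows. This is an illustration only, not the extension's actual code: the function name `likelybad_message_suffix` is hypothetical, and it assumes that a level counts as "not present" when it is missing from the thresholds or set to false, as in the config snippet earlier in this task.

```python
def likelybad_message_suffix(model_thresholds):
    """Return "low" or "high" for the likelybad description message.

    model_thresholds is the per-model dict of filter levels, e.g. the
    value under "damaging" or "goodfaith". Assumption: a level that is
    missing or set to False is treated as not present.
    """
    level = model_thresholds.get("maybebad", False)
    has_maybebad = level is not False and level is not None
    return "low" if has_maybebad else "high"
```

Because the choice depends only on whether maybebad is configured, a wiki that later gains or drops that level would switch descriptions without any extra setting.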

Wording for Low-fit wikis was checked in betalabs. Waiting for the wmf.12 deployment to check High-fit wikis.

Checked in wmf.12 - all filters' descriptions have been updated.

QA Recommendation: Resolve