ORES Filters for Japanese Wikipedia are scheduled to be deployed Monday, June 17th. Please add filters to Special:RecentChanges.
Description
Event Timeline
@Catrope have a look at the stats for those models... Have you seen anything like that before?
If we exclude the thresholds with precision or recall at the edge, there's almost nothing left. We could maybe come up with 2 levels (likelygood + maybebad) but even then, I'm not sure how they can be configured so they have a little overlap.
vagrant@vagrant:/vagrant/mediawiki$ mwscript extensions/ORES/maintenance/ConfigureThresholds.php -m damaging -t jawiki
Configuring damaging on jawiki
MIN    MAX    CONFIG   PREC   REC    FILTER
0      0.97   P 0.15   0.989  1
0      0.97   P 0.45   0.989  1
0      0.97   P 0.6    0.989  1
0      0.97   P 0.75   0.989  1
0      0.97   P 0.9    0.989  1
0      0.97   P 0.98   0.989  1
0      0.662  P 0.99   0.99   0.997
0      0.173  P 0.995  0.995  0.844  likelygood (default)
0      0.218  R 0.9    0.994  0.902
0.433  1      P 0.15   0.155  0.181
0.893  1      P 0.45   0.502  0.007
       1      P 0.6                  likelybad (default)
       1      P 0.75
       1      P 0.9                  verylikelybad (default)
       1      P 0.98
       1      P 0.99
       1      P 0.995
0.042  1      R 0.9    0.025  0.906  maybebad (default)

vagrant@vagrant:/vagrant/mediawiki$ mwscript extensions/ORES/maintenance/ConfigureThresholds.php -m goodfaith -t jawiki
Configuring goodfaith on jawiki
MIN    MAX    CONFIG   PREC   REC    FILTER
0      0.994  P 0.15   0.996  1
0      0.994  P 0.45   0.996  1
0      0.994  P 0.6    0.996  1      likelybad (default)
0      0.994  P 0.75   0.996  1
0      0.994  P 0.9    0.996  1
0      0.994  P 0.98   0.996  1
0      0.994  P 0.99   0.996  1
0      0.994  P 0.995  0.996  1
0      0.987  R 0.9    0.998  0.906  maybebad (default)
0.995  1      P 0.15   1      0.157
0.995  1      P 0.45   1      0.157
0.995  1      P 0.6    1      0.157
0.995  1      P 0.75   1      0.157
0.995  1      P 0.9    1      0.157
0.995  1      P 0.98   1      0.157
0.995  1      P 0.99   1      0.157
0.995  1      P 0.995  1      0.157  likelygood (default)
0.902  1      R 0.9    0.008  0.9
I have, and in the past it's been tracked down to extremely skewed inputs from the labeling campaign. I think the first version of the arwiki goodfaith model was built from a data set that had <1% of the edits labeled as bad faith.
The damaging model is pretty bad, but borderline workable: we could set maybebad to P=0.15 instead of R=0.9 (which the rules say we should do here anyway: of R=0.9 and P=0.15, take the one with the narrower score range), and that wouldn't overlap with the default choice for likelygood (overlap between likelygood and maybebad is allowed, but not required). The goodfaith model is useless, and we shouldn't enable it. In the past, I have enabled only the damaging model and sent the goodfaith model back to the ORES team for a re-do.
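To make the rule for the damaging model concrete, here is a minimal Python sketch of the "take the narrower score range" choice and the overlap check, using the MIN/MAX values from the ConfigureThresholds output in this task. The `Threshold` type and the `narrower` and `overlaps` helpers are hypothetical names made up for illustration; they are not part of the ORES extension.

```python
from typing import NamedTuple


class Threshold(NamedTuple):
    """Hypothetical container for one row of ConfigureThresholds output."""
    config: str
    min_score: float
    max_score: float

    @property
    def width(self) -> float:
        # Size of the score range covered by this threshold.
        return self.max_score - self.min_score


def narrower(a: Threshold, b: Threshold) -> Threshold:
    """Return whichever candidate covers the narrower score range."""
    return a if a.width <= b.width else b


def overlaps(a: Threshold, b: Threshold) -> bool:
    """True if the two closed score ranges share at least one score."""
    return a.min_score <= b.max_score and b.min_score <= a.max_score


# Values copied from the jawiki damaging output in this task.
maybebad_p = Threshold("P 0.15", 0.433, 1.0)
maybebad_r = Threshold("R 0.9", 0.042, 1.0)
likelygood = Threshold("P 0.995", 0.0, 0.173)

maybebad = narrower(maybebad_p, maybebad_r)
print(maybebad.config)                   # P 0.15
print(overlaps(likelygood, maybebad))    # False
print(overlaps(likelygood, maybebad_r))  # True
```

With these numbers, P=0.15 gives the range 0.433 to 1, which is narrower than the 0.042 to 1 range for R=0.9, and it no longer overlaps the default likelygood range of 0 to 0.173.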
In this case, we should definitely send the goodfaith model back to the ORES team, for them to try and make a better one. I personally think that while we're doing that, we might as well send the damaging model back too. It's not useless, and we could squeeze a little bit of utility out of it, but if they're already going to try to fix the goodfaith model I think they should take a look at the damaging model as well.
Interesting! We struggled to get good performance out of the reverted models for jawiki too. It seems like either we're missing something really important in the feature extractor, or damage in jawiki is just very subtle. We'll need a local collaborator to make progress here. We have been blocked on that in the past.
See T230953: Why is jawiki's goodfaith model so bad? for our followup task.
Unassigning myself and moving back to incoming since this is not actionable by the Growth team for the moment.
I think we could call this "done" since the damaging filters -- while minimal -- are actually deployed. I think we'll create a followup task once we're ready with the jawiki fixes -- which could take a long time.
Hey folks. I just went to go do some work on this and it looks like the "damaging" filters were never deployed despite them being determined useful. This is a blocker for me getting feedback on them from real users. Could we get them deployed?
@Halfak -- thanks for bringing this back up, and I'm sorry it got stuck. We'll discuss again as a team after the holidays.