AbuseFilter: Function ccnorm shouldn't convert "I" and "L" to "1", "O" to "0" and "S" to "5"
Closed, Resolved | Public | 8 Story Points

Description

Currently the result of

ccnorm("ABCDEFGHIJKLMNOPQRSTUVWXYZ_abcdefghijklmnopqrstuwxyz")

is

ABCDEFGH1JK1MN0PQR5TUVWXYZ_ABCDEFGH1JK1MN0PQR5TUWXYZ

This makes the creation of filters on [[Special:AbuseFilter]] not intuitive, since if we want to catch all variations of a word like "testing" and try to use something like

words := "TESTING|VANDALIZING";
(ccnorm(added_lines) rlike words)
& !(ccnorm(removed_lines) rlike words)

it won't work. Instead of this natural approach, the text would need to be changed to

words :="TE5T1NG|VANDA11Z1NG";

You can confirm the problem on [[Special:AbuseFilter/tools]], by using the following:

words :="TESTING|VANDALIZING";
ccnorm("I'm testing here. I'm vandalizing the article!") rlike words

The regex above will not match, but it will match in the following:

words := "TE5T1NG|VANDA11Z1NG";
ccnorm("I'm testing here. I'm vandalizing the article!") rlike words

Could this be fixed?
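
The behaviour described above can be reproduced outside the wiki with a minimal Python sketch. `old_ccnorm` is a hypothetical stand-in, not the extension's code: it only uppercases the text and applies the four substitutions named in the task title, whereas the real function draws on AntiSpoof's much larger equivalence table.

```python
import re

# Hypothetical stand-in for the pre-fix ccnorm(): uppercase the text, then
# apply only the four substitutions named in this task's title.
OLD_SUBSTITUTIONS = str.maketrans({"I": "1", "L": "1", "O": "0", "S": "5"})


def old_ccnorm(text: str) -> str:
    return text.upper().translate(OLD_SUBSTITUTIONS)


sample = "I'm testing here. I'm vandalizing the article!"
normalized = old_ccnorm(sample)

# The natural pattern cannot match: the normalized text no longer contains
# the letters I, L, O or S.
natural_match = re.search(r"TESTING|VANDALIZING", normalized)

# Only the pattern rewritten with digits matches.
digit_match = re.search(r"TE5T1NG|VANDA11Z1NG", normalized)

print(normalized)
print(natural_match is not None, digit_match is not None)
```

Under these assumptions the natural pattern fails for the same reason as in the report: the letters it looks for have already been replaced by digits before the regex runs.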

Details

Reference
bz27987
Related Gerrit Patches:
mediawiki/extensions/AbuseFilter (master): Update tests for AntiSpoof fixes

Event Timeline

MusikAnimal added a comment (edited). Jun 14 2016, 7:12 PM

@kaldari but also let's query all private filters. You cannot do this against the replica dbs, which is actually my fault (T123895) =P (that was a different table, actually). It's probably safe to assume that most filters using ccnorm or norm are private

kaldari added a comment (edited). Jun 14 2016, 7:17 PM

I've reverted the AntiSpoof changes for now, and will SWAT deploy the revert. In other words, we're going to put fixing this bug on hold again.

Next step is to do some queries against the production database to find all the active filters that use ccnorm() or norm(). Previously, we were only looking for ccnorm().

@MusikAnimal: Could you explain what the difference between ccnorm and norm is? I'm not an AbuseFilter user, so I actually have no idea.

norm(arg) is the same as rmwhitespace(rmspecials(rmdoubles(ccnorm(arg))))
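
To make the composition concrete, here is a rough Python sketch. All four helpers are simplified stand-ins written for illustration (the character map in particular is a hypothetical three-entry subset of what AntiSpoof knows, using the post-fix direction of digits folded onto letters); only the nesting order is taken from the line above.

```python
import re

# Hypothetical, heavily reduced equivalence map. The real table used by
# ccnorm() is far larger.
LOOKALIKES = str.maketrans({"0": "O", "1": "I", "5": "S"})


def ccnorm(text: str) -> str:
    """Uppercase the text and fold look-alike characters together."""
    return text.upper().translate(LOOKALIKES)


def rmdoubles(text: str) -> str:
    """Collapse each run of a repeated character into a single character."""
    return re.sub(r"(.)\1+", r"\1", text)


def rmspecials(text: str) -> str:
    """Drop everything except word characters and whitespace (approximation)."""
    return re.sub(r"[^\w\s]", "", text)


def rmwhitespace(text: str) -> str:
    """Remove all whitespace."""
    return re.sub(r"\s+", "", text)


def norm(text: str) -> str:
    # Same nesting order as stated in the comment above.
    return rmwhitespace(rmspecials(rmdoubles(ccnorm(text))))


print(ccnorm("t e 5 t 1 n g!!"))
print(norm("t e 5 t 1 n g!!"))
```

In this sketch ccnorm() alone leaves spacing and punctuation in place, while norm() also strips them, which is why a filter using one may behave differently from a filter using the other.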

I've pinged folks about the global filter.

Johan added a comment. Jun 15 2016, 4:52 PM

The filter on Meta is global now, for whenever we want to do the change. Everyone who's been contacted on-wiki and said they'll help out has been notified of our postponing this update.

Now that the poop filter is global, it should be removed from enwiki, enwikibooks, enwikivoyage, hiwiki (twice?), and simplewiki.

Global abusefilters don't run on all wikis...

@Legoktm: That's news to me. I thought they all did (since they all have $wgAbuseFilterCentralDB = metawiki). Which ones are non-global and how do you tell?

InitialiseSettings.php
'wmgUseGlobalAbuseFilters' => [
	'default' => false,
	'small' => true,
	'medium' => true,
	'private' => false,
	'fishbowl' => false,
	'nonglobal' => false,

	// Effectively repeat nonglobal entry above because both labswiki and labtestwiki are also medium
	'labswiki' => false,
	'labtestwiki' => false,

	'enwikisource' => true, // T78496
	'frwiki' => true, // T120568
	'metawiki' => true,
	'testwiki' => true,
	'test2wiki' => true,
	'mediawikiwiki' => true,
	'specieswiki' => true,
	'incubatorwiki' => true,
	'wikidatawiki' => true,
],
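
As a reading aid, the lookup can be sketched in Python. This assumes the usual precedence for such settings (an entry for the wiki itself wins, then the first matching tag, then 'default'); the function name is made up for the example, and the tag lists passed in are illustrative inputs, apart from labswiki being "medium", which the comment in the array states.

```python
# Subset of the 'wmgUseGlobalAbuseFilters' array quoted above.
SETTING = {
    "default": False,
    "small": True,
    "medium": True,
    "private": False,
    "fishbowl": False,
    "nonglobal": False,
    "labswiki": False,
    "frwiki": True,
    "metawiki": True,
}


def uses_global_filters(wiki: str, tags: list[str]) -> bool:
    """Assumed lookup order: wiki-specific entry, then tags, then 'default'."""
    if wiki in SETTING:
        return SETTING[wiki]
    for tag in tags:
        if tag in SETTING:
            return SETTING[tag]
    return SETTING["default"]


# A wiki with no entry of its own and no matching tag falls back to 'default'.
print(uses_global_filters("enwiki", []))
# An explicit per-wiki entry overrides the tag value.
print(uses_global_filters("labswiki", ["medium"]))
# 'examplewiki' is a made-up name standing in for any wiki tagged "small".
print(uses_global_filters("examplewiki", ["small"]))
```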

OK, looks like all the known poop wikis are using the global abuse filters except for enwiki, so we should leave a local copy at enwiki.

Nemo_bis removed a subscriber: Nemo_bis. Jun 16 2016, 6:26 AM

Fuller list of filters that will need to be fixed at P3522. (The list is restricted access since some of the filters are private.)

I'll write my thoughts I posted on the paste here: Since this is a breaking change, I think the time between updating the filters and deploying should be minimal. Do you think we should update the filters ourselves? I worry not everyone is going to be very responsive. At the very least we should disable all affected filters prior to deployment.

I figure we could first update all the filters ourselves since that part is fairly easy and does not require prior knowledge of how each filter works. With this we'll also disable them, and in the description link to this phab task and the diff updating mw:Extension:AbuseFilter/Rules format (the latter will help with the language barrier). Then after deployment, we can notify and leave it to the individual filter managers to re-enable and monitor the filters at their leisure, ensuring they are working as expected. How does that sound?

Change 294358 restored by Kaldari:
Update tests for AntiSpoof fixes

Reason:
Take 2...

https://gerrit.wikimedia.org/r/294358

@MusikAnimal, @Johan: So the plan will be (in PST):

  • Tuesday afternoon: Test everything on Test Wikipedia
  • Tuesday evening: MusikAnimal and I will disable and update the affected filters on Wikidata, Commons, and Meta.
  • Wednesday noon-2pm: Train deployment
  • Wednesday afternoon: Once train deployment is complete, re-enable affected filters (except for filters that were previously disabled and global filters on Meta)
  • Wednesday evening: MusikAnimal, myself, and volunteers will disable and update affected filters on the Wikipedias.
  • Thursday noon-2pm: Train deployment to Wikipedias
  • Thursday afternoon: Once train deployment is complete, re-enable affected filters on Wikipedias and Meta global filters (except for filters that were previously disabled)

Does that sound like a good plan?

DannyH renamed this task from Function ccnorm shouldn't convert "I" and "L" to "1", "O" to "0" and "S" to "5" to Abuse filter: Function ccnorm shouldn't convert "I" and "L" to "1", "O" to "0" and "S" to "5". Aug 11 2016, 9:09 PM
kaldari renamed this task from Abuse filter: Function ccnorm shouldn't convert "I" and "L" to "1", "O" to "0" and "S" to "5" to AbuseFilter: Function ccnorm shouldn't convert "I" and "L" to "1", "O" to "0" and "S" to "5". Aug 11 2016, 9:09 PM
kaldari set the point value for this task to 8.

This sounds good except Tuesday evening I'll likely be on the road back home from Maine, and I probably shouldn't try to edit the filter regex on my phone :) Wednesday also happens to be NYC WikiWednesday, but that will be over by 6 PM your time, and I should be able to work from there if need be.

@kaldari I'll write here what I said on IRC: We will need staff or sysadmin rights to update filters on other wikis, as there isn't a global abuse filter editor, only a read-only. Could we attain these rights temporarily? I just think we should be responsible for updating the filters since we're breaking them :)

I just checked and I should be back home around 8 PM your time on Tuesday, if that's not too late, otherwise I can help Monday morning/night for the first round.

MarcoAurelio added a comment (edited). Aug 11 2016, 10:50 PM

@MusikAnimal FWIW Stewards can create global groups, so if you're only going to update those filters, I can create for you a global group with the abusefilter viewing and modifying privs required for this task, unless you qualify for any of those groups. Best regards.

I already have global sysadmin rights on my staff account, and we can get those temporarily added to MusikAnimal's staff account as well. That's probably easier than creating a new global group.

It's just a couple of clicks less, heh :)

@He7d3r: Are you on board for helping with Portuguese Wikipedia on Wednesday and Thursday? See T29987#2544834.

Samtar added a subscriber: Samtar. Aug 12 2016, 12:02 PM

OK, I've started contacting users who have affected filters. I'm telling them we hope to fix the filters ourselves but are, of course, grateful for their assistance, and am pointing them to this task and the plan here. If the filters are hidden, I'm emailing them the list of all their affected filters.

Works for me.

Guys, your list above is missing some filters. Specifically, as an example, enwiki filter 58 uses ccnorm, but is private, which I suspect is why Pathoschild's query above didn't catch it. Please make sure you fix all the filters including private ones.

Fuller list of filters that will need to be fixed at P3522. (The list is restricted access since some of the filters are private.)

Oops, I just saw this. Disregard the above. Still, posting only public ones initially was somewhat unhelpful.

Base added a subscriber: Base. Aug 12 2016, 5:41 PM

Got tired reading all the comments. Does this task mean that if one would like to e.g. catch 1488 with a filter they would need to write L488 now or whatever?

@Base: Why would you want to catch "1488"? This change will let you catch "SHIT", "5HIT", and "5H1T", by using just "SHIT". In other words, it will make norm() and ccnorm() behave how you would intuitively expect them to behave. You just use the actual word you want to catch and it will catch all the variations. You'll no longer need to use numbers and weird spellings.

Rich_Farmbrough added a comment (edited). Aug 12 2016, 6:04 PM

Great work everyone (especially Ryan!).
Simple is good.
I have changed hidden filter 12 on enwiki so that it works with either version of ccnorm. After the MediaWiki change, the code for the old version can be removed.
This approach means that there is no gap during which filters are inactive, and hence no narrow window for implementing the changes, nor any rush after the MediaWiki change.
It also means that if we have to revert the MediaWiki change within a short time, the rules should still work.

I will probably update some of the other rules too.
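
One way such a version-agnostic rule might look, sketched in Python rather than filter syntax. This is an illustration of the idea, not the actual content of filter 12 (which is hidden): both normalizers are hypothetical stand-ins limited to the character pairs discussed in this task, and the pattern simply accepts either form of each affected character.

```python
import re

# Hypothetical stand-ins covering only the characters named in this task.
OLD_MAP = str.maketrans({"I": "1", "L": "1", "O": "0", "S": "5"})
NEW_MAP = str.maketrans({"0": "O", "1": "I", "5": "S"})


def old_ccnorm(text: str) -> str:
    return text.upper().translate(OLD_MAP)


def new_ccnorm(text: str) -> str:
    return text.upper().translate(NEW_MAP)


# Each affected position accepts the letter (new output) or the digit
# (old output), so the same pattern works before and after deployment.
TRANSITIONAL = re.compile(r"TE[S5]T[I1]NG|VANDA[L1][I1]Z[I1]NG")


def hits(text: str) -> tuple[bool, bool]:
    """Return (matches under old ccnorm, matches under new ccnorm)."""
    return (
        bool(TRANSITIONAL.search(old_ccnorm(text))),
        bool(TRANSITIONAL.search(new_ccnorm(text))),
    )


print(hits("I'm testing here"))
print(hits("te5t1ng"))
print(hits("hello"))
```

Once the new code is deployed everywhere, the digit alternatives can be dropped, leaving the plain words.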

Akeron added a subscriber: Akeron. Aug 12 2016, 6:48 PM
Johan added a comment. Aug 12 2016, 6:59 PM

@He7d3r: I didn't notify anyone about Portuguese Wikipedia filters, since you seemed better suited to take care of that if necessary. Tell me if I should and I'll deal with it. (:

With that exception, everyone who has a filter mentioned in P3522 should now be notified, with the exception of @MarcoAurelio, @MusikAnimal, @He7d3r and @Billinghurst who seemed to be aware of the issue anyway.

eswiki AF 20 updated (I don't have access to P3522, but from Johan's messages it seemed to be the only one on this wiki)

Note the rollout plan listed above; the Wikipedias won't get the new code until Wednesday, so you may want to wait so that the filters are fully functional between now and then. If you aren't around to update them come the rollout, we'll do so for you.

@Base: If you wanted to catch "1488", you probably wouldn't use norm() or ccnorm(), you would just use something like:

added_lines rlike "\b1488"

norm() and ccnorm() are for catching spoofed text, and neither of those functions will work correctly for detecting numbers (spoofed or otherwise). The assumption (which is usually correct) is that numbers are used to spoof letters rather than vice versa. If you also need to detect I488, you would need to look for that specifically:

added_lines rlike "\b1488" | added_lines rlike "\bI488"

Hope that answers your question.

@MusikAnimal: I made it match with both behaviors

@Platonides: I just added you to the paste. There are actually 3 affected filters on eswiki: 9, 20, and 55. 20 is the only one that is public.

@Platonides There are a couple more, but I emailed the individual filter owners about the filters they've created/imported. (:

Chenzw added a subscriber: Chenzw. Aug 16 2016, 1:08 AM

@Johan @kaldari Sorry if you have already taken care of this, but won't this change also influence TitleBlacklist, i.e. wikis which have some content on [[MediaWiki:Titleblacklist]] with lines containing <antispoof>?

@matej_suchanek: I was already planning on fixing the global title blacklist on meta, but I forgot about the local ones. Thanks for the reminder. @Johan, can you add a note about that to the Tech News?

As an example, the following...

.*AD+M1+N.*                             <newaccountonly|antispoof>
.*5Y5[0Ø]P.*                            <newaccountonly|antispoof>
.*M[0Ø]DERAT[0Ø]R.*                     <newaccountonly|antispoof>
.*arbit(?:er|rator).*                   <newaccountonly>
.*CHECKU5ER.*                           <newaccountonly|antispoof>
.*[0Ø]VER51GHT.*                        <newaccountonly|antispoof>
.*5+T[E0]+(W|VV)+A+RD.*                 <newaccountonly|antispoof>

...would be changed to...

.*AD+MI+N.*                             <newaccountonly|antispoof>
.*SYS[OØ]P.*                            <newaccountonly|antispoof>
.*M[OØ]DERAT[OØ]R.*                     <newaccountonly|antispoof>
.*arbit(?:er|rator).*                   <newaccountonly>
.*CHECKUSER.*                           <newaccountonly|antispoof>
.*[OØ]VERSIGHT.*                        <newaccountonly|antispoof>
.*S+T[EO]+(W|VV)+A+RD.*                 <newaccountonly|antispoof>

Note for anyone fixing filters/blacklists:
0 -> O
5 -> S
1 -> either I or L (depending on context)
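
For anyone scripting a first pass over a blacklist, the mapping above could be applied roughly as follows. This is only a sketch: `migrate_line` is a made-up helper, it assumes the attribute block starts at the last "<" on the line, and it blindly translates every 0 and 5 in the pattern (so a quantifier such as {5} would be corrupted and must be checked by hand). Lines containing 1 are only flagged, since choosing between I and L needs a human.

```python
# Unambiguous replacements from the note above; "1" is deliberately left out.
DIGIT_TO_LETTER = str.maketrans({"0": "O", "5": "S"})


def migrate_line(line: str) -> tuple[str, bool]:
    """Return (migrated line, needs manual review for a remaining "1")."""
    # Assumption: the attribute block begins at the last "<" on the line.
    pattern, sep, attrs = line.rpartition("<")
    if not sep or "antispoof" not in attrs:
        # Entries without <antispoof> are not normalized, so leave them alone.
        return line, False
    migrated = pattern.translate(DIGIT_TO_LETTER)
    return migrated + sep + attrs, "1" in migrated


for entry in [
    ".*5Y5[0Ø]P.* <newaccountonly|antispoof>",
    ".*AD+M1+N.* <newaccountonly|antispoof>",
    ".*arbit(?:er|rator).* <newaccountonly>",
]:
    print(migrate_line(entry))
```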

The change will go live on Test Wikipedia and MediaWiki.org today, all non-Wikipedia sites tomorrow, and all Wikipedias on Thursday.

Change 294358 merged by jenkins-bot:
Update tests for AntiSpoof fixes

https://gerrit.wikimedia.org/r/294358

@Base: Actually, I just realized that you can detect variations of 1488 with ccnorm using the new code:

ccnorm(new_text) contains ccnorm("1488")

That would detect 1488, I488, and IA88. I'm not sure how you would do it with the old code, but I imagine it would be more complicated.
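
The trick is to normalize both the text and the search term, so the comparison happens entirely in normalized form. A small Python sketch of that idea, where the two-entry map is a hypothetical subset chosen to cover the three variants listed above and `contains_variant` is a made-up helper name:

```python
# Hypothetical reduced map: just the pairs needed for the variants above.
LOOKALIKES = str.maketrans({"1": "I", "4": "A"})


def ccnorm(text: str) -> str:
    return text.upper().translate(LOOKALIKES)


def contains_variant(haystack: str, needle: str) -> bool:
    """Mirror of `ccnorm(new_text) contains ccnorm(needle)`."""
    return ccnorm(needle) in ccnorm(haystack)


for candidate in ["foo 1488 bar", "foo I488 bar", "foo IA88 bar", "foo 1588 bar"]:
    print(candidate, contains_variant(candidate, "1488"))
```

Because the needle goes through the same function, the filter author never has to know which character of each look-alike pair the normalizer settles on.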

kaldari added a comment (edited). Aug 16 2016, 9:53 PM

Tested on Test Wikipedia and everything seems to be working correctly. I checked and updated the filters and title blacklists on mediawiki.org, test.wikipedia.org, and test2.wikipedia.org.

Almost everything has been updated on Meta; check P3522 for details.

@He7d3r: The Portuguese filters that will need to be updated are 7, 18, 70, 112, 134, and 135. These appear to be very complicated filters, so we haven't touched them (plus we don't know Portuguese). Will you be able to fix these? The switch is going to happen about 2 to 3 hours from now (~2pm PST). See T29987#2557600 for migration instructions.

All wikis are updated to the new code now. You should be able to remove any backwards-compat code now from the filters. I've updated the global title blacklist and checked local ones (haven't found any others using antispoof). MusikAnimal and I checked or fixed all the filters for the top 10 wikis (with the exception of Portuguese which He7d3r is working on).

I updated some of them on ptwiki (and noticed we can still improve a few of the conversions to avoid using [Iï], [õO] and [UÚ]):
https://pt.wikipedia.org/wiki/Especial:Filtro_de_abusos/history/7/diff/prev/2121
https://pt.wikipedia.org/wiki/Especial:Filtro_de_abusos/history/18/diff/prev/2122
https://pt.wikipedia.org/wiki/Especial:Filtro_de_abusos/history/70/diff/prev/2123
but I'll have to leave the others for tomorrow...

kaldari closed this task as Resolved. Aug 19 2016, 8:51 PM
kaldari moved this task from Needs Review/Feedback to Q1 2018-19 on the Community-Tech-Sprint board.