Page MenuHomePhabricator

Make rmspecials preserve whitespace
Open, LowPublic

Description

There have been a request that we make rmspecials preserve whitespace. I do agree with the request, because spaces are not "special" per se, and especially because we already have rmwhitespace to remove the whitespace from a string.

This is trivial, but we first need to figure out how rmspecials is being used in production filters.

Event Timeline

Daimona created this task.Sep 16 2020, 12:54 PM
Restricted Application added a subscriber: Aklapper. · View Herald TranscriptSep 16 2020, 12:54 PM
Daimona triaged this task as Low priority.Sep 16 2020, 12:57 PM
Daimona edited projects, added AbuseFilter (Overhaul-2020); removed AbuseFilter.

Keeping an eye on this task, will determine how to proceed once the subtask is resolved.

Urbanecm added a subscriber: Urbanecm.

@Daimona Subtask was resolved, see P12854 for results.

Huji added a comment.EditedOct 2 2020, 4:55 PM

The next step is to review everyone of these and make sure that the anticipate change in the rmspecials function will not impact the use cases. I will do that in the next few days.

In below, I a few keywords:

  • Would break: the output of rmspecials is compared to a hardcoded value that doesn't contain spaces; not removing spaces in rmspecials could break the filter.
  • Would benefit: the output of rpsecials is compared to a hardcoded value that contains space. In its current form, that would not work as desired; in the future state, it will!
  • No impact: both sides of comparison are going through rmspecials in a way that changing the functionality will be unlikely to change the logical output.

I also indicated if the filter is deleted (i.e. updating it would be less urgent, if necessary at all)

wikifilterstatus
acewiki6Copy of enwiki #96; no impact
azwiki25
bnwiki5
ckbwiki19
ckbwiki24
ckbwiki25
commonswiki10
commonswiki21
commonswiki37
commonswiki44
cswiki48
cswiki77
cswiki90
cswiki108
dawiki14
dewiki66
dewiki102
dewiki144
dewiki209
dewiki213
dewiki226
enwiki4Deleted filter; no impact
enwiki7Deleted filter; no impact
enwiki17Deleted filter; would benefit!
enwiki20Deleted filter; would break
enwiki21Deleted filter; would break
enwiki22Deleted filter; would break
enwiki23Deleted filter; would break
enwiki96Deleted filter; no impact
enwiki119Deleted filter; would break
enwiki137Deleted filter; would break
enwiki148No impact
enwiki149No impact
enwiki154Deleted filter; would break
enwiki166Deleted filter; would break
enwiki170Deleted filter; would break
enwiki188Deleted filter; no impact
enwiki294Would break
enwiki355Deleted filter; no impact
enwiki414Deleted filter; would break
enwiki595Deleted filter; would break
enwiki676Would break
enwiki764Deleted filter; would break
enwiki793No impact
enwiki923Would break
enwiki941Would break
enwiki996Would break
enwiki1019Would break
enwiki1071Would break
enwikibooks31
eswiki36
eswiki47
eswiki55
eswiktionary8
fawiki202Deleted filter; no impact
fiwiki82
frwiki122
frwiki125
glwiki11
hiwiki82
huwiki19
idwiki15
itwiki43
itwiki556
jawiki36
kkwiki23
kkwiki58
kowiki4
kowiki29
kowiki90
kowiki101
kowiki106
kowiki108
kowiki123
kowikisource6
metawiki174
nlwiki32
plwiktionary2
ptwiktionary2
ptwiktionary6
rowiki55
rowiki65
ruwiki62
ruwiki96
ruwiki104
ruwiki118
scowiki7
simplewiki23
simplewiki76
skwiki29
svwiki10
svwiki28
tawiki9
tawiki10
testwiki1
testwiki6
testwiki18
testwiki19
testwiki37
testwiki75
testwiki151
testwiki153
testwiki177
testwiki200
testwiki203
ukwiki48
usabilitywiki3
zhwiki108
zhwiki123
zhwiki140
zhwiki166
Huji added a comment.Oct 2 2020, 5:45 PM

@Daimona quick question: does ccnorm remove extra spaces? Do we have any other function that does?

@Huji if you can explain what replacements should be made / link to some example edits you made, I might be able to help (global sysop, local sysop on simplewiki and meta)

Huji is an Abuse filter maintainers, he can edit filters everywhere ;).

Huji added a comment.Oct 2 2020, 6:02 PM

Huji is an Abuse filter maintainers, he can edit filters everywhere ;).

Correct. Right this second, I am only screening these to estimate the effort needed. I will postpone editing to after the screening done.

@Daimona quick question: does ccnorm remove extra spaces?

Nope, it doesn't alter spaces in any way.

Do we have any other function that does?

I don't think we do.

Huji added a comment.Oct 5 2020, 12:44 AM

I went through all enwiki filters and some from other wikis. The bottom line is: several filters use a rules similar to rmspecials(somevar) irlike 'BLAH' which would previously match the string B L A H but wouldn't if we implement the change suggested in this task. In many cases, this was probably not specifically intended, but I don't want to mess with the status quo.

There are two options in front of us:

(1) Update all those use cases to be like rmwhitespace(rmspecials(somevar)) irlike 'BLAH'
(2) Keep everything as is, and make the space preservation a secondary parameter for rmspecials which is false by default.

I like the former better. After all, we have a function that explicitly removes spaces (rmwhitespaces) and space is not a special character so rmspecials should not treat it as such.

Thoughts?

Huji added a comment.Oct 10 2020, 12:38 AM

@Daimona or @DannyS712 do you have any advice on the above options?

So if I understand it correctly
rmspecials currently removes <special characters> and <whitespace>
rmwhitespace currently removes only <whitespace>
and goal is to make it so that
rmspecials only removes <special characters>

(where <special characters> and <whitespace> match specific character sets that I'm not listing individually)

Going with option 1:
rmspecials combined with rmwhitespace would, prior to the change, do the same as rmspecials does now
rmspecials combined with rmwhitespace would, after the change, do the same as rmspecials does now
I.e. updating filters like this would be a no-op before the code change, and after the code change would result in maintaining the current functionality?

If this is indeed a correct understand, I would prefer going with option 1 over option 2.
I suggest not trying to determine if the filters in question are meant to match with or without whitespace, but simply mass-replace rmspecials(foo) with rmspecials(rmwhitespace(foo)) (or the functions could be in the opposite order, but either way we should be consistent) - ensuring that we don't try to determine the intend of filters ensures that we don't miss anything or result in filters catching things they weren't meant to.
I would also suggest leaving a comment that references this task, as well as an issue of tech news where this is explained.

tl;dr
Option 1 sounds like the best way forward to me, assuming I'm understanding it correctly

Huji added a comment.Oct 12 2020, 2:22 PM

Great! I will start planning the transition then.

Belatedly: while option 1 might turn out to be a painful migration (mostly because of T213037), I do agree that it's the best option. Perhaps we might first notice the communities of the upcoming change, and give them some mean to exclude certain filters from the list. This might be done either here on phabricator (e.g. by editing a table in the task description), or on-wiki (e.g. writing something in the notes of the filter).

Huji added a comment.Oct 12 2020, 4:56 PM

@Daimona we should follow the steps mentioned in T191978. Steps 1 and 2 are done (we know filters must be updated, and we have queries requested and done). Step 3 is to ask the local communities to edit their filters. If a community member acknowledges that they will take this one, we're done. If that person forgets to do so, it's on the community, not us. What should do, however, is step 4 (clearly document what needs to be changed).

To that end, here is what I was planning to do:

  • Create a user notice about this.
  • Post a message on all appropriate village pump pages and ask local sysops (or abusefilter editors, wherever applicable) to volunteer to take on the changes
  • Maintain a list on this task of all wikis to which communication was made and whether or not they responded.
  • After some time (I suggest 2 weeks), use my own abusefilter-maintainer access to update all filters on wikis on which no one volunteered to do the changes
  • After every filter is accounted for (either as changed by a maintainer i.e. me, or as claimed by a local user), to submit and merge a patch that would do what this task desires.

Any thoughts on that itinerary?

Also, as for sending notices on village pumps, I know we have a mechanism for using MediaWiki Mass Message for this purpose, but cannot remember where it is documented. Any pointers on that? I am guessing we already have a list of technical village pumps but if not, we could also use https://www.wikidata.org/wiki/Q4582194 (and fallback to https://www.wikidata.org/wiki/Q16503 whenever a wiki has no page on Q4582194). Do you have a better idea?

@Daimona we should follow the steps mentioned in T191978. Steps 1 and 2 are done (we know filters must be updated, and we have queries requested and done). Step 3 is to ask the local communities to edit their filters. If a community member acknowledges that they will take this one, we're done. If that person forgets to do so, it's on the community, not us. What should do, however, is step 4 (clearly document what needs to be changed).

Yup, I was specifically referring to the third bullet:

  • Maintain a list on this task of all wikis to which communication was made and whether or not they responded.

I was proposing that local communities should maintain this list, by flagging a filter either as "updated" or "doesn't need updates". The remaining entries can then be fixed manually (I can request af-maintainer to help out with this).

Any thoughts on that itinerary?

Seems good.

Also, as for sending notices on village pumps, I know we have a mechanism for using MediaWiki Mass Message for this purpose, but cannot remember where it is documented. Any pointers on that? I am guessing we already have a list of technical village pumps but if not, we could also use https://www.wikidata.org/wiki/Q4582194 (and fallback to https://www.wikidata.org/wiki/Q16503 whenever a wiki has no page on Q4582194). Do you have a better idea?

Last time, I used those WD elements and posted messages manually, so I'm afraid I cannot really help with this point.

Ad posting messages, docs for the tool itself is at https://meta.wikimedia.org/wiki/MassMessage, a list of all technical village pumps is at https://meta.wikimedia.org/wiki/Distribution_list/Technical_Village_Pumps_distribution_list. Once you need a mass message to be sent, feel free to ping me, I'm happy to either send it on your behalf, or grant you temporary mass message sender permissions, so you can do it yourself :-). Let me know if you need any more details about mass message.

Huji added a comment.Oct 13 2020, 2:05 AM

@Urbanecm there are 72 distinct wikis in our list. Do you have to manually create the MassMessage recipient list? If so, I think it might make sense for you to start doing so right now, as I am working in parallel on the message itself.

I'm not sure which wikis we are talking about - I guess the ones yielded by the filter search query I ran some time ago? If not, please post the list of wikis here.

I have to create it semi manually, but it shouldn't be hard to do it.

Huji added a comment.EditedOct 14 2020, 12:51 AM

Correct. There are 72 distinct wikis in the list you put in P12854

Huji added a comment.Oct 14 2020, 1:20 AM

@Urbanecm: here is the message I suggest we use for MassMessage:

We are making a change to the [[mw:Extension:AbuseFilter|AbuseFilter extension]], which may impact the behavior of some existing filters. The <code>rmspecials()</code> function currently removes spaces in addition to special characters. We will change it such that it will only remove special characters. The existing <code>rmwhitespace()</code> can be used to remove spaces whenever applicable.

As reported on https://phabricator.wikimedia.org/P12854 we believe at least one filter on your wiki has been identified to use the <code>rmspecials()</code> function. Please consider updating these filters by wrapping <code>rmspecials()</code> inside <code>rmwhitespace()</code> like this: <code>rmwhitespace(rmspecials(....))</code>

We need you to update the relevant filters within 2 weeks of this notice. If one of the community members with proper access is volunteering to take this on, we ask them to please respond below and notify User:Huji in their response or in the edit summary. If we don't hear back from you within 2 weeks, [[:m:User:Huji|Huji]] will edit the relevant filters on your wiki per [[:m:Abuse filter maintainer|abuse filter maintainer]] policy. Thank you for your consideration!

Feel free to edit as necessary.

I thought about having a task (maybe even this task) and asking the communities to come here and check off the filters they volunteered to fix themselves, but I think Phab is just one additional step. I am volunteering to receive tens of notifications, but keep the process as simple and easy for our end users as possible.

Kizule added a subscriber: Kizule.

This is announced in the last "Tech news" (https://meta.wikimedia.org/wiki/Tech/News/2020/43), so I'm moving this task into the column "Already announced".

@Huji I've generated the list via https://wdmm.toolforge.org/, filtering via a bit of commandline to give only our 36 wikis (the reported 39 minus testwiki minus usabilitywiki, which can't be contacted in any way, as first one is devwiki (and can be left to break IMO) and second one is a closed wiki). Then, I've sent the message (as-of https://meta.wikimedia.org/w/index.php?title=User:Martin_Urbanec/sand&oldid=20579657) to all wikis listed at https://meta.wikimedia.org/w/index.php?oldid=20579648 via massmessage.

Let me know if there is anything else I should do.

Huji added a comment.Tue, Oct 27, 1:36 AM

I see that the mass messages were sent. So far, GeneralNotability has helped us tick off both metawiki and enwiki.

Sounds great! Should we add a checklist here or in a subtask, so everyone is aware of how is it going?