
Devise a process for finding and fixing filters that will be affected by changes in AbuseFilter
Open, MediumPublic

Description

Every so often, we need to make a change in AbuseFilter code that could potentially affect the meaning of existing filters. For instance:

  • In T191715 we learned that some system variable names were not properly reserved by AbuseFilter, and before we fix it we need to identify all filters on WMF wikis that use a local variable with the same name and change their code to use a different name for that local variable.
  • In T181024 we found out that arrays may be improperly cast to strings, and before fixing the code we want to fix the filters that use this incorrect approach.
  • A similar issue was also observed in T190639 about the way the string() function is used by filters to cast lists into strings. Again, we would like to fix the filters that use that function before updating the code.
  • In T187973 we want to find all filters that use deprecated variables and replace the variable names with their non-deprecated alternatives.

For each of these we would like to run a query against production databases (for all WMF wikis, possibly one wiki at a time) to identify filters that need to be modified, and then use a global sysop account to go into each and every one of these filters and apply a fix.
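As an illustration of what such an identification step might look like, here is a minimal Python sketch. The tuple layout (wiki, filter ID, pattern) and the sample rows are made up; in practice the rows would come from each wiki's abuse_filter table or the API.

```python
import re

def find_affected(filters, needle):
    """Return (wiki, filter_id) pairs whose pattern matches the regex `needle`.

    `filters` is an iterable of (wiki, filter_id, pattern) tuples, e.g. rows
    fetched per wiki; the shape is an assumption for this sketch.
    """
    rx = re.compile(needle)
    return [(wiki, fid) for wiki, fid, text in filters if rx.search(text)]

# Hypothetical sample rows, standing in for real query results.
rows = [
    ("enwiki", 20, "string(added_lines) contains 'spam'"),
    ("enwiki", 21, "user_editcount < 10"),
    ("metawiki", 3, "string(removed_lines) != ''"),
]
print(find_affected(rows, r"\bstring\s*\("))  # filters 20 and 3 use string()
```

The actual fixing step would still be manual, since (as discussed below) deciding how a filter should change is often a judgment call.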

To that end, we need to have a very clear process for the following:

  1. To decide when each of those steps is worth doing. The alternative is to not find and modify the affected filters ourselves, and let the local admins do it. Each approach has pros and cons, and we need to devise a guideline for deciding which approach to use when.
  2. To request that the queries be run (who asks, who runs them, does it need legal approval each time, is there a Phabricator tag to use, etc.).
  3. To request that the affected filters be edited (who does that, and do we ask one of the existing global sysops or temporarily promote one of the developers to global sysop each time?).
  4. To communicate and document the filter modifications (the last thing we want is for the local sysops to see a random person come in and change something in their filters and freak out).

Since Brian and Aeryn were (briefly) engaged with some of the tasks enumerated above, I have copied them here as well.

Event Timeline

Of note, @Krinkle has proposed in T187973#3994713 that we should use the tag #wikimedia-site-requests for the query task and he believes that "the task of a global sysop performing on-wiki changes is outside the scope for any Phabricator task" which leaves me unsure as to how to manage that step.

As a side note, I tried to search for existing examples of this having been done before, but I could only find T140791.

Good work! Of note, in T140791 @kaldari elects to post results in a private manner because some filters are private. I assume the results included more than just the filter ID?

Huji triaged this task as Medium priority.Apr 11 2018, 4:16 PM
Huji updated the task description. (Show Details)

Yeah, probably the results included the pattern as well. I think the needed queries, if they only return IDs, shouldn't be private.

I had some thoughts about the error-fixing part. What if we create a maintenance account (like "AbuseFilterManager"), with shared login and global sysop rights, in order to keep all contribs on the same account and maybe provide a userpage with some lines of explanation?

I was told you could also tag #DBA but I think #wikimedia-site-requests probably makes more sense. One issue here is that we currently have to do this manually, wiki by wiki (I think?). I'm not sure if there's a script to run a query across all wikis, but if there isn't, we should create one!

Created as T193894. I saw this a bit too late but added site-requests as well. Of course a script should be used for this, but I don't think it should be too hard to create one.

> Of note, @Krinkle has proposed in T187973#3994713 that we should use the tag #wikimedia-site-requests for the query task and he believes that "the task of a global sysop performing on-wiki changes is outside the scope for any Phabricator task" which leaves me unsure as to how to manage that step.

As for the latter part, I have made a request to become a global sysop. I think having a global sysop who is extensively familiar with AbuseFilter and is dedicated to the filter cleanup can tremendously boost the process of cross-wiki filter updates.

Create a maintenance script to edit the filters for us automatically. Based on normalizeThrottleParameters.php, a script to edit filters is doable.

I think this RFC is worth mentioning: https://meta.wikimedia.org/wiki/Requests_for_comment/Creating_abusefilter-manager_global_group. Note that there's no clear consensus on this issue.

A note in case someone is going to work on the update script: it should not process concrete syntax. That is, don't treat the pattern textually (e.g., using regex to search and replace). Most programming languages, including the filter language, have rich structure, and naive textual processing is very likely to be buggy for non-trivial tasks. Instead, the script should first parse the pattern into an abstract syntax tree, then perform a tree transformation, and finally print the tree back to a textual pattern. Note that the current parser is not sufficient because it doesn't keep track of whitespace and comments, and we need this information so that it is preserved when the tree is printed back to text.
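To illustrate the pitfall, here is a toy Python comparison (the pattern and variable names are hypothetical): a naive regex rename also rewrites occurrences inside string literals, while even a crude pass that skips quoted strings avoids that. A real fix would still need the full AST round-trip described above.

```python
import re

def naive_rename(pattern, old, new):
    # Buggy: also rewrites occurrences inside string literals (and comments).
    return re.sub(r"\b%s\b" % re.escape(old), new, pattern)

def token_rename(pattern, old, new):
    """Rename identifier `old` to `new`, copying quoted strings verbatim.

    A toy lexer for illustration only, not the real filter grammar.
    """
    out, i, n = [], 0, len(pattern)
    while i < n:
        ch = pattern[i]
        if ch in "'\"":
            # Copy the whole string literal unchanged, honoring backslashes.
            j = i + 1
            while j < n and pattern[j] != ch:
                j += 2 if pattern[j] == "\\" else 1
            out.append(pattern[i:j + 1])
            i = j + 1
        else:
            m = re.match(r"[A-Za-z_]\w*", pattern[i:])
            if m:
                word = m.group(0)
                out.append(new if word == old else word)
                i += len(word)
            else:
                out.append(ch)
                i += 1
    return "".join(out)

src = "page_title contains 'page_title'"
print(naive_rename(src, "page_title", "page_prefixedtitle"))  # corrupts the literal
print(token_rename(src, "page_title", "page_prefixedtitle"))  # leaves it intact
```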

> Create a maintenance script to edit the filters for us automatically. Based on normalizeThrottleParameters.php, a script to edit filters is doable.

It's definitely doable, the problem is telling it what to do. Sometimes it's hard even for a human to understand how a filter should be fixed.

> I think this RFC is worth mentioning: https://meta.wikimedia.org/wiki/Requests_for_comment/Creating_abusefilter-manager_global_group. Note that there's no clear consensus on this issue.

Yes, indeed.

> A note in case someone is going to work on the update script: it should not process concrete syntax. That is, don't treat the pattern textually (e.g., using regex to search and replace). Most programming languages, including the filter language, have rich structure, and naive textual processing is very likely to be buggy for non-trivial tasks. Instead, the script should first parse the pattern into an abstract syntax tree, then perform a tree transformation, and finally print the tree back to a textual pattern. Note that the current parser is not sufficient because it doesn't keep track of whitespace and comments, and we need this information so that it is preserved when the tree is printed back to text.

Agreed, that should be a must-have. But I wouldn't waste time in working on that, given what I said above.

Filters are stored in external storage, right? It would be very difficult to do that in a single query. But a maintenance script to find all filters that match a certain regex is certainly doable (to find them, as said above; auto-fixing is hard). If someone writes a maintenance script, then anyone with shell access can run it (e.g. make a request with Wikimedia-Site-requests). As for who can see the answer, I imagine it would be anyone who in principle would have the rights to see the answer on wiki (I guess that means stewards?) and people who have signed developer-related NDAs.
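The cross-wiki search half of such a script could be sketched like this in Python. The `fetch_filters` callable is a stand-in for whatever actually reads the filter rows (direct SQL, external storage lookup, or the API); its name and the stub data are made up for illustration.

```python
import re

def search_all_wikis(wikis, fetch_filters, regex):
    """Run a regex over every wiki's filter patterns; return (wiki, id) hits.

    `fetch_filters(wiki)` yields (filter_id, pattern) pairs; how it does so
    (SQL, API, external storage) is deliberately abstracted away here.
    """
    rx = re.compile(regex)
    hits = []
    for wiki in wikis:
        for fid, pattern in fetch_filters(wiki):
            if rx.search(pattern):
                hits.append((wiki, fid))
    return hits

# Stub data standing in for real per-wiki fetches.
DATA = {
    "enwiki": [(20, "rmspecials(page_title) != ''"), (21, "user_age < 60")],
    "testwiki": [(1, "rmspecials(summary) contains 'x'")],
}
print(search_all_wikis(DATA, DATA.get, r"rmspecials"))
```

Returning only (wiki, id) pairs, rather than the patterns themselves, would also sidestep most of the privacy concerns around private filters.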

I see three main ways to get the data about filters:

  • Use an abusefilter-helper account to query all filters, implemented as a simple OAuth-based tool running on Toolforge (this is relatively easy to build, and easy to use for everyone with enough permissions)
    • Pros: easy to write and use, can be written as a global search tool; easy to scale; low barrier to get permission to view all filters (as opposed to shell access)
    • Cons: new code needs to be written and maintained
  • Build SQL queries to be run against production, and get someone to run them (this can be in either the DBA or Wikimedia-Site-requests scope; deployers have access to the database and can run arbitrary queries, and DBAs manage the databases and obviously can also run the queries)
    • Pros: no new code needs to be written; a relatively high number of individuals have access
    • Cons: easy to make a mistake in the query; requires coordination; getting access for a new individual is hard
  • Create a maintenance script to search for filters (similar to how mwgrep searches JavaScript/CSS pages)
    • Pros: easy to use, hard to make mistakes
    • Cons: new code needs to be written and maintained; requires a deployer or another restricted person with shell access

I prefer the first solution, because it's easy to build and allows AbuseFilter devs to maintain their extension without need for coordination with other teams.
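For a sense of scale, the first option mostly amounts to calling the `list=abusefilters` API module per wiki and filtering locally. A hedged Python sketch (the sample response is fabricated; viewing the `pattern` of private filters requires rights such as abusefilter-view-private, which is why the tool would gate logins):

```python
def build_query(limit=500):
    """Parameters for the MediaWiki `list=abusefilters` API module."""
    return {
        "action": "query",
        "list": "abusefilters",
        "abfprop": "id|pattern",
        "abflimit": limit,
        "format": "json",
    }

def matching_ids(api_response, needle):
    # Filter the API response locally for patterns containing `needle`.
    filters = api_response["query"]["abusefilters"]
    return [f["id"] for f in filters if needle in f.get("pattern", "")]

# Fabricated sample response, shaped like the API module's JSON output.
sample = {"query": {"abusefilters": [
    {"id": 1, "pattern": "string(added_lines) irlike 'x'"},
    {"id": 2, "pattern": "user_age < 60"},
]}}
print(matching_ids(sample, "string("))
```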

For editing, now that we have global AbuseFilter managers, we can easily grant core AF devs that permission and let them self-serve.

Thoughts?

I also agree that an OAuth-based tool hosted on Toolforge would be great. I think the code wouldn't be too complicated to write or maintain.

So, now we have a shell script we can use:

urbanecm@deployment-deploy01:/srv/mediawiki-staging/php-master$ mwscript extensions/AbuseFilter/maintenance/searchFilters.php --wiki=metawiki --pattern='rmspecials'
wiki    filter
enwiki  20
enwiki  96
enwiki  119
enwiki  154
metawiki        3
testwiki        1
testwiki        18
testwiki        19
urbanecm@deployment-deploy01:/srv/mediawiki-staging/php-master$

Once it lands in production, I'm happy to run it as needed.

A summary of progress, with some thoughts from me:

  • We created a maintenance script that can be used to search for filters
    • This can easily be requested by any AF dev, preferably by creating a dedicated task for the query and tagging it with Wikimedia-Site-requests and Wikimedia-maintenance-script-run, and I'm sure I or another person with access would be happy to run it
    • Personal opinion on legal: this is not confidential or protected information in the NDA sense, and it can be accessed by sysops. I think pasting the output restricted to WMF-NDA should be enough. Using the data to fix the filters is obviously okay, just as logstash data can be used to fix an issue on wiki.
  • Preferred approach: a web tool via OAuth or gadget

To summarize steps to do:

[ ] Write an OAuth tool that implements the same functionality as the script, for easier use by a broader set of people (@Urbanecm)

Open questions:

  • How exact should the guideline be? What should it exactly document? When to make changes and when to defer to communities?
  • Where should the guideline live? Mw.o? Meta? Somewhere else?

@Huji @Daimona Did I miss something?

Hello everyone,

I wrote https://search-filters.toolforge.org/ as the OAuth-based tool I mentioned. It only allows global abusefilter helpers/maintainers in, because it also queries local filters. It also doesn't query live data, because fetching all filters took quite some time (it's one request per wiki :/). Right now, the cached database of filters was generated locally using my steward account and uploaded to Toolforge. As soon as https://meta.wikimedia.org/w/index.php?title=Steward_requests/Global_permissions&oldid=20531959#Abusefilter_helper_for_Abusefilter_global_search_service_account is approved, I'll switch it to a dedicated service account and run the caching process regularly.
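The caching step of such a tool could look roughly like this in Python, using a local SQLite database. The table layout, column names, and sample rows are assumptions for this sketch, not the tool's actual schema.

```python
import sqlite3

def build_cache(db_path, rows):
    """Store (wiki, id, pattern) rows in a local SQLite cache.

    Assumes the filters were already fetched per wiki (the slow part);
    searching the cache is then instant.
    """
    conn = sqlite3.connect(db_path)
    conn.execute(
        "CREATE TABLE IF NOT EXISTS filters (wiki TEXT, id INTEGER, pattern TEXT)"
    )
    conn.executemany("INSERT INTO filters VALUES (?, ?, ?)", rows)
    conn.commit()
    return conn

def search_cache(conn, needle):
    # Simple substring search via LIKE; a regex UDF could replace this.
    cur = conn.execute(
        "SELECT wiki, id FROM filters WHERE pattern LIKE ?",
        ("%" + needle + "%",),
    )
    return cur.fetchall()

conn = build_cache(":memory:", [
    ("enwiki", 20, "rmspecials(page_title) contains 'x'"),
    ("metawiki", 3, "user_editcount < 5"),
])
print(search_cache(conn, "rmspecials"))
```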

I guess we can call this resolved?

@Daimona @Huji ^^^

@DannyS712 also said he'll explore the user scripts way.

Yeah, since I can't access any pastes created (still waiting for T256367: WMF-NDA access for DannyS712) and don't have the global rights needed for the new tool, I'll probably write a script that allows querying each wiki so that per-wiki user rights are accounted for (e.g. global sysops can view private filters on all GS wikis, even if they lack the global rights)