Determine the usage of rmspecials on WMF wikis
Closed, Resolved | Public | Sep 24 2020

Description

We would like to know how many filters on each wiki use the rmspecials() function, so that we can decide how to modify the function so that it no longer treats white space as a special character.

This requires running queries on the production servers, hence the Wikimedia-Site-requests tag.

Details

Due Date
Sep 24 2020, 10:00 PM

Event Timeline

Huji updated the task description.

@Daimona did we ever document the process for how to engage DBAs to run queries for Wikimedia-abusefilter-global-maintainers tasks? Is "site-requests" the right tag? Or is it DBA?

No, T191978 is still open. Unsure about what tag should be used.

Based on information I scraped last year, rmspecials is used 137 times across 28 wikis. I can rerun the query today if you want. Note that I don't have access to the production servers; I only use regular API queries with the abusefilter helper right.

Right, right. T191978#5704225 is what I vaguely remembered. I'll make a task for writing that shell script.

Do you have access to private filters too?

Yes. Private filters are included.

Would you mind running an API-based query again? It'll take us a while to address T262052 and there is no reason to wait for it.

Additionally, would you mind sharing the program you use to run these queries? Perhaps we can create a nimble JS gadget instead of, or in addition to, the shell script described in T262052. That way, abusefilter helpers and abusefilter maintainers can run queries too, not just those with shell access.

I can run a query tonight (PT). What information do you need precisely? A list of sites and filter ids?

Sorry, the code is written in Racket. I don't think it's gonna be helpful to you. It also has a bunch of other stuff as a part of my research project that I want to keep private for now.

@Nullzero the reason I asked is that, to the best of my knowledge, AbuseFilter does not expose an API endpoint for its filter search functionality. So I am curious how you are searching the filters. If you provide some guidance, I should be able to write the JS gadget myself.

Oh, I simply downloaded everything. Here's an excerpt of a request I sent:

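;; `query` is a helper from the poster's own (unpublished) code.
;; `start-id` is the filter ID from which to resume enumeration.
(define resp (query `([action . "query"]
                      [list . "abusefilters"]
                      [abfstartid . ,start-id]
                      [abflimit . "500"]
                      [abfshow . "!deleted|enabled"]
                      [abfprop . "id|description|pattern|private"])))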

Then, you can either grep or parse+traverse the AST for a more advanced search.
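For readers who do not use Racket, here is a minimal Python sketch of the same "download everything, then grep" idea. It is not the poster's program. The `fetch` argument is a hypothetical caller-supplied function that sends the parameters to one wiki's `api.php` over an authenticated session and returns the decoded JSON; `iter_filters` and `filters_calling` are names made up for this illustration. The sketch assumes the Action API's usual `continue` block is used for paging.

```python
import re
from typing import Callable, Iterable, Iterator, List

# Same parameters as the Racket excerpt. abfstartid is left out so that
# enumeration begins at the first filter.
BASE_PARAMS = {
    "action": "query",
    "format": "json",
    "list": "abusefilters",
    "abflimit": "500",
    "abfshow": "!deleted|enabled",
    "abfprop": "id|description|pattern|private",
}


def iter_filters(fetch: Callable[[dict], dict]) -> Iterator[dict]:
    """Yield every filter of one wiki, following the API's `continue` block.

    `fetch` is hypothetical: it must perform the HTTP request and return
    the decoded JSON response as a dict.
    """
    params = dict(BASE_PARAMS)
    while True:
        data = fetch(params)
        yield from data.get("query", {}).get("abusefilters", [])
        if "continue" not in data:
            break
        # Merge the continuation values (e.g. abfstartid) into the next request.
        params = {**BASE_PARAMS, **data["continue"]}


def filters_calling(filters: Iterable[dict],
                    function_name: str = "rmspecials") -> List[int]:
    """Return the IDs of filters whose pattern appears to call `function_name`.

    This is the "grep" approach: a regex match rather than a real parse,
    so it can also match inside string literals or comments.
    """
    call = re.compile(r"\b" + re.escape(function_name) + r"\s*\(")
    return [f["id"] for f in filters if call.search(f.get("pattern") or "")]
```

Filters whose pattern is missing from the response (for example, private filters requested without sufficient rights) are treated as non-matching rather than raising an error.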

It would make sense to have a tool on Toolforge that runs this request regularly via cron (every day?) so that searches are more responsive. Note, however, that it needs some authentication, since private filter details should not be publicly viewable.

Got it. That should be doable. All I need to figure out is the best way to get a current list of all wikis' base URLs. The rest is a bunch of API calls.

That said, I'm not going to start working on it right away. I have a few reservations, which I will describe on T262052.

What about https://meta.wikimedia.org/w/api.php?action=sitematrix&format=json&smtype=language, @Huji?
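As a rough illustration of turning a sitematrix response into a list of API endpoints, here is a minimal Python sketch. The function name `iter_api_urls` is made up for this example, and the `/w/api.php` script path is an assumption that holds for WMF wikis but not necessarily elsewhere. It assumes the usual response shape: numbered language groups that each carry a `site` list, plus an optional `specials` list. Note that with `smtype=language` the `specials` group (Commons, Meta and similar wikis) should be absent, so a full survey would probably need to request that group as well.

```python
from typing import Iterator


def iter_api_urls(sitematrix_response: dict,
                  script_path: str = "/w") -> Iterator[str]:
    """Yield the api.php URL of every open, public wiki in a sitematrix reply.

    `script_path` is an assumption: "/w" is what WMF wikis use.
    """
    matrix = sitematrix_response.get("sitematrix", {})
    for key, group in matrix.items():
        if key == "count":
            continue  # Total number of wikis, not a group of sites.
        # Language groups are dicts holding a "site" list;
        # the "specials" group is a plain list of sites.
        sites = group.get("site", []) if isinstance(group, dict) else group
        for site in sites:
            # Closed and private wikis are flagged by the presence of a key.
            if "closed" in site or "private" in site:
                continue
            yield site["url"] + script_path + "/api.php"
```

Skipping closed and private wikis is a design choice for this sketch; drop that check if those wikis should be surveyed too.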

That said, I'm not going to start working on it right away. I have a few reservations, which I will describe on T262052.

That is great!

I started working on the script, and got the first part down: fetch sitematrix and loop through the wiki URLs. But querying the APIs of each wiki failed, because of CORS restrictions. @Nullzero how did you get around CORS?

@Huji Could you please share an example of your code (minimum failing example please) and the error message? I'll be happy to help!

The CORS problem, as I understand it, occurs because you are running the script from a website. I scraped the information with a script running directly on my computer, so I did not run into it.

As stated above, I was trying to make it a gadget. So yes, I was running it off of a site. First I tried to run it via Meta; that way the action=sitematrix part worked, but the calls to the individual wikis failed. Then I tried to run it off of my own computer (a file on my drive, opened in the browser); this time, even the call to Meta failed. I guess I can run it from the command line with node or something, but I'm trying to make a tool that is easy for people without shell access to use, and most people know how to run a script from the browser, but not necessarily from the command line.

See P12512. For now, I'd rather keep it private.

That would be caused by T210790: Allow cross-origin requests by default in the Action API. AFAICS, it is currently not possible to make a cross-origin request to the Action API. That's going to change, however; see the linked task :-). It should already be possible to make cross-origin requests to the REST API (T262425). It seems this can't be a gadget right now :/.

Now that T262052 has been fixed, I am going to assign this to Urbanecm who I think will be able to fulfill this task in ~2 weeks, when the maintenance script reaches WMF production servers.

Urbanecm changed the subtype of this task from "Task" to "Deadline". Sep 19 2020, 7:29 PM
Urbanecm set Due Date to Sep 24 2020, 10:00 PM.

@Huji Done, but I had to change the script locally to skip labtestwiki, because the wikiadmin user didn't have rights on that database. That wiki is for tests only, so we can ignore it for now.

Results:

Script started on Wed 30 Sep 2020 12:01:35 PM UTC
urbanecm@mwmaint2001:~$ mwscript extensions/AbuseFilter/maintenance/searchFilters.php --wiki=enwiki --pattern 'rmspecial'
wiki filter
acewiki 6
azwiki 25
bnwiki 5
ckbwiki 19
ckbwiki 24
ckbwiki 25
commonswiki 10
commonswiki 21
commonswiki 37
commonswiki 44
cswiki 48
cswiki 77
cswiki 90
cswiki 108
dawiki 14
dewiki 66
dewiki 102
dewiki 144
dewiki 209
dewiki 213
dewiki 226
enwiki 4
enwiki 7
enwiki 17
enwiki 20
enwiki 21
enwiki 22
enwiki 23
enwiki 96
enwiki 119
enwiki 137
enwiki 148
enwiki 149
enwiki 154
enwiki 166
enwiki 170
enwiki 188
enwiki 294
enwiki 355
enwiki 414
enwiki 595
enwiki 676
enwiki 764
enwiki 793
enwiki 923
enwiki 941
enwiki 996
enwiki 1019
enwiki 1071
enwikibooks 31
eswiki 36
eswiki 47
eswiki 55
eswiktionary 8
fawiki 202
fiwiki 82
frwiki 122
frwiki 125
glwiki 11
hiwiki 82
huwiki 19
idwiki 15
itwiki 43
itwiki 556
jawiki 36
kkwiki 23
kkwiki 58
kowiki 4
kowiki 29
kowiki 90
kowiki 101
kowiki 106
kowiki 108
kowiki 123
kowikisource 6
metawiki 174
nlwiki 32
plwiktionary 2
ptwiktionary 2
ptwiktionary 6
rowiki 55
rowiki 65
ruwiki 62
ruwiki 96
ruwiki 104
ruwiki 118
scowiki 7
simplewiki 23
simplewiki 76
skwiki 29
svwiki 10
svwiki 28
tawiki 9
tawiki 10
testwiki 1
testwiki 6
testwiki 18
testwiki 19
testwiki 37
testwiki 75
testwiki 151
testwiki 153
testwiki 177
testwiki 200
testwiki 203
ukwiki 48
usabilitywiki 3
zhwiki 108
zhwiki 123
zhwiki 140
zhwiki 166
urbanecm@mwmaint2001:~$ exit

Script done on Wed 30 Sep 2020 12:01:47 PM UTC
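
Since the task asks how many filters on each wiki use the function, the listing above can be tallied mechanically. The following is a minimal Python sketch, assuming the output has been saved as plain text with one "wiki filter-id" pair per line; the function name `count_filters_per_wiki` is made up for this illustration. Lines that do not look like a result row (the typescript banner, the shell prompt, the column header) are ignored.

```python
import re
from collections import Counter

# A result row is a database name followed by a numeric filter ID.
ROW = re.compile(r"^([a-z][a-z0-9_]*)\s+(\d+)$")


def count_filters_per_wiki(output: str) -> Counter:
    """Count how many matching filters each wiki has in the saved output."""
    counts: Counter = Counter()
    for line in output.splitlines():
        match = ROW.match(line.strip())
        if match:
            counts[match.group(1)] += 1
    return counts
```

For example, feeding it only the `acewiki` and `ckbwiki` rows from the results gives a count of 1 for acewiki and 3 for ckbwiki. `sum(counts.values())` then gives the total number of matching filters, and `len(counts)` the number of wikis.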