
Deploy StopForumSpam extension to production
Open, Medium, Public

Description

Primary task for deploying the StopForumSpam extension to Wikimedia production.

  • Add the new extension submodule to the git mediawiki/extensions repo (https://gerrit.wikimedia.org/r/101014)
  • Add extension to the make-wmf-branch release tool (https://gerrit.wikimedia.org/r/650167)
  • Add StopForumSpam extension with steward and maintainer info to https://www.mediawiki.org/wiki/Developers/Maintainers
  • Security review (likely unneeded in this case)
  • Move extension CI config to wikimedia-deployed section (done)
  • Deploy to beta cluster and evaluate (T181217)
  • Performance review - complete (T266904)
  • Set $wmgUseStopForumSpam to true (and other relevant config, e.g. this and T273211) for pilot production wikis[0] in InitialiseSettings.php
  • Initially set to report-only mode ($wgSFSReportOnly = true;) on pilot production wikis (see the config sketch below)
    • Internal discussion task: T309900
  • Verify no need to convert relevant SQL schema to abstract schema format (no database interactions at this time)
  • Update https://www.mediawiki.org/wiki/Extension:StopForumSpam ("Release status" etc)
  • Create a logstash dashboard (permalink here)
  • Create a basic README for SFS' git repo, even if it just points to the mw.org doc page (c868500)
  • Improve extension test coverage (T316963)
  • Analyze quantitative and qualitative impact of SFS in Report-Only mode on candidate wikis
  • Enable SFS in Enforce mode across candidate wikis within Wikimedia production
  • Enable SFS in Enforce mode across all wikis within Wikimedia production
  • Write a short (1-3 sentence), simple explanation for the Tech News newsletter for when editors need to know about this

[0] A group of initial pilot wikis will need to be determined, perhaps ptwiki (T261133) and others?
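For illustration, a minimal sketch of what the report-only pilot configuration might look like in wmf-config; the pilot-wiki list and exact settings here are assumptions, not a merged patch:

```php
// wmf-config/InitialiseSettings.php - per-wiki feature flag (pilot list illustrative)
'wmgUseStopForumSpam' => [
	'default' => false,
	'ptwiki' => true, // hypothetical pilot wiki, per T261133
],

// wmf-config/CommonSettings.php - load the extension in report-only mode
if ( $wmgUseStopForumSpam ) {
	wfLoadExtension( 'StopForumSpam' );
	$wgSFSReportOnly = true; // log would-be blocks without enforcing them
}
```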

Also, T255208 needs more work, including the likely addition of informal methods of evaluation (admin surveys, etc., perhaps using similar metrics to this ongoing experiment), as the evaluation of StopForumSpam would likely use many of the same metrics and evaluation tools.

Related Objects

Event Timeline


Update: Aside from the various technical items still remaining within the task description, most of which are fairly trivial, I think the following will be needed to successfully get ext:StopForumSpam into production:

  1. Set ext:StopForumSpam to enforce on the beta cluster. That's T304111 and I think the patch can be merged next week. This will give us better insights into how many false positives SFS is likely to generate (when people complain) and thus how feasible the extension will be to use within Wikimedia production.
  2. Assuming the previous item goes well, actually determine a group of pilot wikis for which to deploy ext:StopForumSpam. This will likely involve collaboration with the global stewards and local admins who will be involved in determining the efficacy of ext:StopForumSpam and managing false positives.
  3. Determine a set of simple criteria to evaluate the efficacy of ext:StopForumSpam when enabled for a certain period of time on the pilot wikis. This will likely be a very simple set of survey questions sent to global stewards and local admins, which can then be further analyzed.
  4. If the ext:StopForumSpam proves useful for the pilot wikis, determine a plan to enable on most/all Wikimedia projects.

Thanks @sbassett! Are you comfortable resolving 2 through 4 as written, or would it be helpful if I hosted you and interested stewards on a call to align and ensure a timely deployment? :)

sbassett updated the task description.

Hello @JanWMF - Sorry for the delayed response. I'm pretty comfortable with building out some processes for items 3 and 4. For item # 2, getting some kind of consensus opinion from the stewards (and possibly other admins and functionaries) would be best. I would be happy to attend a stewards call or any other meeting that resulted in the determination of some pilot group of projects, as I think that is a key blocker that I cannot resolve by myself. Thanks.

Thanks @sbassett :) @jrbs please make sure to add the topic to the next stewards call that works for Scott and, ideally, @Tks4Fish, @MarcoAurelio and @Urbanecm or, if not, set up a separate call for them and other interested stewards in a timely manner to help resolve #2. Thank you :)

Change 823789 had a related patch set uploaded (by SBassett; author: SBassett):

[operations/mediawiki-config@master] Enable StopForumSpam on initial candidate projects

https://gerrit.wikimedia.org/r/823789

Change 823790 had a related patch set uploaded (by SBassett; author: SBassett):

[operations/mediawiki-config@master] Enable StopForumSpam on initial candidate projects

https://gerrit.wikimedia.org/r/823790

Change 823790 merged by jenkins-bot:

[operations/mediawiki-config@master] Enable StopForumSpam on initial candidate projects (CommonSettings)

https://gerrit.wikimedia.org/r/823790

Mentioned in SAL (#wikimedia-operations) [2022-08-17T16:50:35Z] <sbassett@deploy1002> Synchronized wmf-config/InitialiseSettings.php: Enable StopForumSpam on candidate wikis (IS.php) - T273220 (duration: 03m 20s)

Mentioned in SAL (#wikimedia-operations) [2022-08-17T16:54:49Z] <sbassett@deploy1002> Synchronized wmf-config/CommonSettings.php: Enable StopForumSpam on candidate wikis (CS.php) - T273220 (duration: 03m 26s)

sbassett updated the task description.

Change 868500 had a related patch set uploaded (by SBassett; author: SBassett):

[mediawiki/extensions/StopForumSpam@master] Add basic README file for ext:StopForumSpam

https://gerrit.wikimedia.org/r/868500

Change 868500 merged by jenkins-bot:

[mediawiki/extensions/StopForumSpam@master] Add basic README file for ext:StopForumSpam

https://gerrit.wikimedia.org/r/868500

I wonder if we shouldn't add some sort of privacy warning for onOtherBlockLogLink in rESFS includes/Hooks.php for stopforumspam-is-blocked, to indicate that the link leads to stopforumspam, an external website which has a different Privacy Policy than ours. We could customise this via WikimediaMessages for Wikimedia if needed.

> I wonder if we shouldn't add some sort of privacy warning for onOtherBlockLogLink in rESFS includes/Hooks.php for stopforumspam-is-blocked, to indicate that the link leads to stopforumspam, an external website which has a different Privacy Policy than ours.

Probably a good idea, though I can't think of a good precedent message to copy here? Maybe Privacy Engineering could provide some feedback re: best practices.

Thanks for the ping @sbassett. We could borrow some ideas from the generic message currently displayed when logged-in users visit external links, and a privacy notice (T65598#6914486) which was provided by WMF-Legal. Privacy best practices encourage both brevity and clarity of notices. So, a more privacy-conscious message could be something along these lines:

> You are about to leave WIKI_SITE to visit EXTERNAL_SITE which may receive data from your device. Please check their respective privacy policies.
> Click here to continue on to EXTERNAL_SITE.

Irrespective of the wording chosen, I'd recommend having it vetted by WMF-Legal to make sure it is aligned with the other notices and policies on the platform.

> Irrespective of the wording chosen, I'd recommend having it vetted by WMF-Legal to make sure it is aligned with the other notices and policies on the platform.

Just the usual note that if you would like Legal to look at this (the Privacy team in this case), it is best to email them directly at privacy@wikimedia.org; they don't tend to monitor Phab.

>> Irrespective of the wording chosen, I'd recommend having it vetted by WMF-Legal to make sure it is aligned with the other notices and policies on the platform.

> Just the usual note that if you would like Legal to look at this (the Privacy team in this case), it is best to email them directly at privacy@wikimedia.org; they don't tend to monitor Phab.

Thank you - I sent them an email with CC to you, Scott and Samuel just in case.

Hi, the extension is making editing much slower on any wiki it's deployed to. E.g. in https://performance.wikimedia.org/excimer/profile/70c8903135703a38, one third of the time spent saving the edit goes to checking SFS, mostly checking the IP deny list. We also got this in IRC:

> the trace shows IPSet as being slow, I don't think it was designed for such a large IP list

There should be some easy ways to optimize this.

> Hi, the extension is making editing much slower on any wiki it's deployed to. E.g. in https://performance.wikimedia.org/excimer/profile/70c8903135703a38, one third of the time spent saving the edit goes to checking SFS, mostly checking the IP deny list. We also got this in IRC:

Is that bad enough that we should consider disabling it for now? I really have no idea how significant a performance hit that is for most real-world users.

> the trace shows IPSet as being slow, I don't think it was designed for such a large IP list
>
> There should be some easy ways to optimize this.

There is a benchmarking maint script. In retrospect, IPSet may have been the wrong choice here, since it seems to be more optimized for efficiently dealing with CIDR ranges as opposed to a giant list of individual IP addresses, which is how the SFS deny lists are packaged (and with some additional meta-data). It might make sense to refactor this to just use isset, ?? or similar, as those approaches are high-performing for array searches and could hopefully scale to 200k or so elements, as that is typically what the production-configured 90-day IPv4 + IPv6 deny list averages.
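As a rough illustration of the isset() idea (the file path and loading code below are hypothetical, not the extension's actual implementation):

```php
// Hypothetical sketch: load the SFS deny list once into a hash map keyed by IP,
// then test membership with isset(), which is O(1) per lookup instead of a scan
// over ~200k entries.
$denyList = [];
foreach ( file( '/tmp/listed_ip_90_ipv46_all.txt', FILE_IGNORE_NEW_LINES ) as $line ) {
	$denyList[ trim( $line ) ] = true;
}

$isListed = isset( $denyList[ $ip ] ); // constant-time check per edit
```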

>> Hi, the extension is making editing much slower on any wiki it's deployed to. E.g. in https://performance.wikimedia.org/excimer/profile/70c8903135703a38, one third of the time spent saving the edit goes to checking SFS, mostly checking the IP deny list. We also got this in IRC:
>
> Is that bad enough that we should consider disabling it for now? I really have no idea how significant a performance hit that is for most real-world users.

It's bad enough to block further deployment but not to disable it IMHO.

>> the trace shows IPSet as being slow, I don't think it was designed for such a large IP list
>>
>> There should be some easy ways to optimize this.
>
> There is a benchmarking maint script. In retrospect, IPSet may have been the wrong choice here, since it seems to be more optimized for efficiently dealing with CIDR ranges as opposed to a giant list of individual IP addresses, which is how the SFS deny lists are packaged (and with some additional meta-data). It might make sense to refactor this to just use isset, ?? or similar, as those approaches are high-performing for array searches and could hopefully scale to 200k or so elements, as that is typically what the production-configured 90-day IPv4 + IPv6 deny list averages.

If it needs to do exact matching in a large set of IPs, that's rather similar to what we did for T337431: Rework MediaWiki:SpamBlacklist: just make an array of IPs like $foo = ['1.2.3.4' => true, ...] and then do a key search in that array (we did array_intersect_key, but for you it might be different). See the benchmark in P48956. You could even do a CIDR check first (if you have a range) to let anything not in those ranges move forward (= the common case), and then do a check on the remaining. On top of that, have a list of IPs that would simply not be checked, e.g. WMCS, which a large chunk of edits come from; see $wgGlobalBlockingAllowedRanges in production.
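A sketch of that layered check, assuming IPSet (which handles CIDR ranges well) is kept only for a small allow-list while the large deny list becomes a keyed array; $allowedRanges and $denyList are assumed inputs:

```php
use Wikimedia\IPSet;

// Small, CIDR-aware allow-list, e.g. WMCS ranges a la $wgGlobalBlockingAllowedRanges.
$allowedSet = new IPSet( $allowedRanges );

function isDenied( string $ip, IPSet $allowedSet, array $denyList ): bool {
	if ( $allowedSet->match( $ip ) ) {
		return false; // common case: trusted ranges skip the deny-list check entirely
	}
	return $denyList[$ip] ?? false; // O(1) exact-match key search
}
```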

Another possible optimization is to first initiate an asynchronous job to look up the IP in the list when a user opens the edit page, and then cache the result for some time.
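If that route were taken, the caching half could lean on MediaWiki's WANObjectCache; sfsIpIsListed() below is a hypothetical stand-in for the actual deny-list check, and the pre-warming job triggered on edit-page open is not shown:

```php
use MediaWiki\MediaWikiServices;

$cache = MediaWikiServices::getInstance()->getMainWANObjectCache();
$listed = $cache->getWithSetCallback(
	$cache->makeKey( 'sfs-denylisted', $ip ),
	$cache::TTL_HOUR, // cache the verdict for an hour
	static function () use ( $ip ) {
		return sfsIpIsListed( $ip ); // expensive lookup only runs on cache miss
	}
);
```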

@sbassett: Per emails from Sep18 and Oct20 and https://www.mediawiki.org/wiki/Bug_management/Assignee_cleanup , I am resetting the assignee of this task because there has not been progress lately (please correct me if I am wrong!). Resetting the assignee avoids the impression that somebody is already working on this task. It also allows others to potentially work towards fixing this task. Please claim this task again when you plan to work on it (via Add Action...Assign / Claim in the dropdown menu) - it would be welcome. Thanks for your understanding!

BTW, you can use a Bloom filter to make the lookup extremely fast and, if there is a potential match, do the actual lookup. I can implement it if it can be reviewed and would make the project move forward (what's this blocked on?)
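A toy sketch of the idea, just to illustrate the shape of it (sizes and hash choice here are arbitrary assumptions): a bit array with k cheap hash probes, where a negative answer is definitive and only potential hits fall through to the exact lookup.

```php
// Toy Bloom filter: ~4M bits (512 KB) and 4 probes per item; tune for the real
// set size and acceptable false-positive rate.
class TinyBloom {
	private string $bits;

	public function __construct( private int $size = 1 << 22, private int $k = 4 ) {
		$this->bits = str_repeat( "\0", $this->size >> 3 );
	}

	private function probe( string $item, int $i ): int {
		return ( crc32( $i . $item ) & 0x7fffffff ) % $this->size;
	}

	public function add( string $item ): void {
		for ( $i = 0; $i < $this->k; $i++ ) {
			$p = $this->probe( $item, $i );
			$this->bits[$p >> 3] = chr( ord( $this->bits[$p >> 3] ) | ( 1 << ( $p & 7 ) ) );
		}
	}

	public function mightContain( string $item ): bool {
		for ( $i = 0; $i < $this->k; $i++ ) {
			$p = $this->probe( $item, $i );
			if ( !( ord( $this->bits[$p >> 3] ) & ( 1 << ( $p & 7 ) ) ) ) {
				return false; // definitely not listed: skip the expensive check
			}
		}
		return true; // maybe listed: do the exact lookup
	}
}
```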

We've only just recently dropped a Bloom Filter library from MW core...

> We've only just recently dropped a Bloom Filter library from MW core...

Any particular reason?

Regardless, re-implementing it is quite easy. At least for this specific case.

>> We've only just recently dropped a Bloom Filter library from MW core...
>
> Any particular reason?

T212460: Adopt static array files for local disk storage of values (epic) for CommonPasswords.

But it's probably needed again to meaningfully increase the dataset size, a la https://gerrit.wikimedia.org/r/c/mediawiki/libs/CommonPasswords/+/868769

> BTW, you can use a Bloom filter to make the lookup extremely fast and, if there is a potential match, do the actual lookup. I can implement it if it can be reviewed and would make the project move forward (what's this blocked on?)

Not sure we'd even need to? TorBlock just uses in_array to search over wanCache results. The SFS deny-lists aren't that enormous, particularly the listed_ip_90_ipv46_all file we have configured for Wikimedia production; it comes in under 6 MB, uncompressed. I think the actual culprit here is the somewhat needless use of IPSet. From what I can tell, it was introduced as part of the refactor work from years ago, and I'm not really sure why I made that choice.

Anyhow, the primary reason this work has stalled is some debate about how useful the SFS deny-lists actually are within Wikimedia production (T332003, T322263, conversations with @Urbanecm). With various proxy- and spam-blocking bots already in place, the tireless work of the stewards and admins, and the introduction of newer tools like Spur, I'm not entirely certain how valuable this particular extension is to Wikimedia production anymore.

This effort has become quite dusty, largely because I haven't really been able to work on it much. I'm wondering, though, if a better approach might be to propose integrating stopforumspam.org data within the new iPoid-Service. I'm not sure exactly how much overlap there is between SFS's and Spur's data sets; that would likely be critical in determining if this could be a useful path forward.

MediaWiki-extensions-IPReputation seems to be the exact project. It seems to use the risk data from Spur, https://docs.spur.us/data-types?id=risk-enums. Spur and stopforumspam do not track the same thing. According to a security researcher, there definitely are groups that do lower-risk stuff like spamming, but not the higher-risk stuff that Spur tracks. There are also groups that do both. LTA (long-term abuser/vandal) reports created by users on WMF projects show the same thing.
As for handing the task over to the team responsible for iPoid and IPReputation, they clearly already have their hands full.

ext:IPReputation (along with other services and tools) uses data collected by the underlying iPoid service, which currently only ingests data from Spur's API. I would know a bit about this, as I was one of the original architects of the project and a WMF liaison to Spur. Anyhow, I've chatted directly with a couple of the WMF folks who are now actively supporting ext:IPReputation, iPoid and related components, and it turns out they were already considering adding SFS as another data source to the next incarnation of the iPoid service. Again, the determining factor of that value proposition would be how much overlap there is between SFS and Spur data. When I last looked at this, there was indeed a healthy amount of overlap in their IP data, but I'm not sure if that's changed or if the non-overlapping SFS data would still provide significant value.