Allow Wikifunctions-generated references to have links to non-Wikimedia sites (but not in general Wikifunctions output)
Closed, Resolved, Public

Description

The current Security-blessed HTML sanitisation bans links to anywhere that's not a Wikimedia SUL wiki. This is great for e.g. article content, but not for references. We need to agree with Security a special set of rules for this kind of use, and apply it in a way that doesn't open up abuse vectors.

User impact & Example

Function Z32053 is an example of a reference-generating function intended to include external links. These links are currently displayed as raw HTML, so the reference is not clickable. The community has mentioned this as an issue, as this means contributors cannot create references to non-Wikimedia sites that work as expected.

Event Timeline

Jdforrester-WMF renamed this task from Allow references to have links to non-Wikimedia sites (but not in general content) to Allow Wikifunctions-generated references to have links to non-Wikimedia sites (but not in general Wikifunctions output).

I took the liberty to send a message to the security team.

sbassett subscribed.

Added Security-Team for us to triage at our weekly clinic.

There are still valid reasons to have external links in text, though they should pass through SpamBlacklist. They are usually discouraged in the article body (i.e. outside navboxes and the external links section), though.

Please file a different task, rather than derailing this one.

SpamBlacklist and AbuseFilter are among the solutions mentioned for this. We could use those when saving a Z89 with a reference inside it.

However, after looking into it, neither extension works automatically with custom content models like ours. We'd need to add custom integration code either way.

  • For SpamBlacklist, the cleanest approach is to extract URLs from Z89 content in our save pipeline and run them through SpamBlacklist's domain list ourselves. That way we get the benefit of the community-maintained blocklist without SpamBlacklist needing to understand our content model.
  • For AbuseFilter, the story is similar — we'd need custom code to extract and expose the URLs to its filtering layer. The upside is that AbuseFilter is rule-based, meaning community admins could update filters without code changes. Worth discussing whether that flexibility is worth the extra integration complexity.
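The save-time idea in the two bullets above can be sketched as follows. This is only an illustration in Python, not WikiLambda's PHP. It assumes ZObjects in canonical JSON form with the fragment's HTML stored under a `Z89K1` key, and a plain set of blocked domains, whereas real SpamBlacklist entries are regex fragments. The helper names `z89_urls` and `blocked_urls` are hypothetical.

```python
import re
from urllib.parse import urlsplit

# Naive href matcher; good enough for a sketch, not a real HTML parser.
HREF_RE = re.compile(r'href\s*=\s*"([^"]+)"', re.IGNORECASE)


def z89_urls(zobject):
    """Yield every href found in Z89 HTML fragments nested anywhere in a ZObject.

    Assumption: canonical form, i.e. {"Z1K1": "Z89", "Z89K1": "<html string>"}.
    """
    if isinstance(zobject, dict):
        if zobject.get("Z1K1") == "Z89" and isinstance(zobject.get("Z89K1"), str):
            yield from HREF_RE.findall(zobject["Z89K1"])
        for value in zobject.values():
            yield from z89_urls(value)
    elif isinstance(zobject, list):
        for item in zobject:
            yield from z89_urls(item)


def blocked_urls(zobject, blocked_domains):
    """Return the URLs whose host is a blocked domain or a subdomain of one."""
    hits = []
    for url in z89_urls(zobject):
        host = (urlsplit(url).hostname or "").lower()
        if any(host == d or host.endswith("." + d) for d in blocked_domains):
            hits.append(url)
    return hits


example = {
    "Z1K1": "Z7",
    "Z7K1": "Z801",
    "Z801K1": {
        "Z1K1": "Z89",
        "Z89K1": '<sup><a href="https://spam.example/x">s</a> '
                 '<a href="https://www.bbc.com/">b</a></sup>',
    },
}
print(blocked_urls(example, {"spam.example"}))
```

A save hook built this way would reject the edit when the returned list is non-empty.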

Of course, checking at save time is only one half of the solution. The sanitizer (which currently strips all external links) would still need to be updated to allow URLs through for Z89 reference content — otherwise nothing would render even if the links passed the blocklist checks. So both pieces need to land together.


Bots as an additional safety net

IABot and GreenC bot were also mentioned as a potential option. These bots run on the wikifunctions.org side and could potentially be extended to monitor and revert Z89 objects containing bad reference links. However, it's unclear whether their maintainers would be interested in covering this use case, so this is more of a complementary fallback than something we could rely on.

And lastly: do we want to remove the check on save/edit entirely?

SpamBlacklist ideally shouldn't be used, because it's being replaced with AbuseFilter's blocked external domains: https://www.mediawiki.org/wiki/Extension:SpamBlacklist#Special:BlockedExternalDomains

However, all the existing solutions except SpamBlacklist's whitelist page work on the concept of blocking URLs as opposed to allowing them, AFAIK.

Yup, I think we'll want to implement SpamBlacklist support just for the global deny list, as people will expect it to work; naturally I'll be delighted when we can pull it back out again!

Jdforrester-WMF changed the task status from Open to In Progress. Apr 21 2026, 3:34 PM

ScottB: Ok, I think we can go ahead with what we discussed above and on the bug then? Given the enormous number of potentially valid external links, and that there aren't currently any processes or APIs available for sanitizing such content aside from running external link content through the various *Blacklist pages and manual review, that's likely the best we can do at this time. On the bug, it appears that James is already planning to integrate with SpamBlacklist and potentially (the more recommended option, it seems) AF in the near future.

I'll move to ready.

Ideas from my end: just add the AbuseFilter/SpamBlacklist checks on abstract/ZObject save/edit:

[Four screenshots, captioned: SpamBlacklist error; AbuseFilter special page; AbuseFilter blocked domains special page; AbuseFilter blocked-domain error]

Note on the AbuseFilter special page: we should probably not use this, as it will receive the JSON-stringified wikitext as input, unless we stringify differently.

Perhaps we should change the messages.

Is this the direction we want to head, @Jdforrester-WMF?

Change #1277079 had a related patch set uploaded (by Daphne Smit; author: Daphne Smit):

[mediawiki/extensions/WikiLambda@master] Allow external reference URLs in Z89 fragments; block spam/blocked domains

https://gerrit.wikimedia.org/r/1277079

I am no expert on SpamBlacklist or AbuseFilter, so I will not comment on the use of these tools.

I understand that the proposal is to run these tools at the Abstract Wiki publish step. I'm understanding this from the screenshots, and your comments:

extract URLs from Z89 content in our save pipeline
...
Of course, checking at save time is only one half of the solution.

I'm not sure how we plan to do this. The filters need to be run against the Z89 generated as an output from the function calls, but on publish action, what we are saving is not the output generated but the function calls per se. This would require we synchronously execute every function call in the article in order to validate its output, which goes against our model.

But even supposing that a function call had been executed and its output had passed the filter, there could be remote changes (a change in a Wikifunctions function implementation) that make the output from that function call go from something good to something bad, so the publish step is neither viable nor trustworthy.

If we want to make sure that all content rendered in an Abstract Wikipedia article shown inside a language wiki has passed the filters, we should make this part of the script that renders the output and stores it in our persistent section store
(so, part of T422621: Build a maintenance script that pre-generates and stores whole Abstract article sections)

I just realized that checking on save makes no sense, because a user could concatenate multiple Z89s and still output a bad link.
So I need to move the check to render time, which means doing it in the sanitizer. That in turn means the feedback will, for now, probably be limited to a link that does not work. (Maybe UX can think of a better solution for the future.)

I'll refactor the code.
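To illustrate the concatenation problem, here is a minimal Python sketch (not the extension's PHP). The way the link is split across two fragments is a made-up example, and `urls_in` is a hypothetical stand-in for whatever per-fragment URL check would run on save.

```python
import re

# Stand-in for a save-time check that looks for complete href URLs.
HREF_RE = re.compile(r'href="(https?://[^"]+)"')


def urls_in(fragment: str) -> list[str]:
    """Return the complete http(s) href values found in one HTML fragment."""
    return HREF_RE.findall(fragment)


# Two fragments that each look harmless on their own...
part_one = '<sup><a href="https://spam'
part_two = '.example/buy">source</a></sup>'

print(urls_in(part_one))             # []
print(urls_in(part_two))             # []
# ...but whose concatenation, produced only at evaluation time, is a full link.
print(urls_in(part_one + part_two))  # ['https://spam.example/buy']
```

Only the final rendered output contains the URL, which is why the check has to sit where that output is sanitised.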

Yeah, agreed, I think the URL-checking can only reasonably happen at the HTML sanitiser step; if a given link fails the test (from either AbuseFilter or SpamBlacklist), it'll be treated in the same way a link in normal abstract HTML is treated: the <a> tag will be escaped so it renders as text. Adding an error message later about the link not being allowed is a nice improvement for the existing workflow too.

New Approach:

caller
  └─► WikifunctionsPFragmentRenderer::render($html)
           │
           ├─ 1. loadBlockedDomains()       ← AbuseFilter APCu cache (stateless, no user needed)
           ├─ 2. sanitiseHtmlFragment($html, $blockedDomains)
           │        └─► WikifunctionsPFragmentSanitiserTokenHandler
           │                  └─ for <a> in reference context:
           │                       • check href host against $blockedDomains
           │                       • blocked → tagAllowed = false (escaped as text)
           │                       • clean  → allow href through as before
           └─ return sanitised HTML
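
The flow in the diagram can be sketched roughly as follows, in Python rather than the extension's PHP. This is not the real WikifunctionsPFragmentSanitiserTokenHandler. It assumes the reference context is an element carrying the class `ext-wikilambda-reference`, passes every non-link tag through untouched, and escapes every link outside a reference, whereas the real sanitiser has its own tag allow-list and Wikimedia-link rules. All class and function names here are hypothetical.

```python
from html import escape
from html.parser import HTMLParser
from urllib.parse import urlsplit

REFERENCE_CLASS = "ext-wikilambda-reference"  # assumed context marker
VOID_TAGS = {"br", "hr", "img", "wbr"}        # tags that never get an end tag


class ReferenceLinkSanitiser(HTMLParser):
    def __init__(self, blocked_domains):
        super().__init__(convert_charrefs=False)
        self.blocked_domains = blocked_domains
        self.out = []
        self.ref_depth = 0    # > 0 while inside a reference element
        self.anchor_ok = []   # one allow/deny flag per open <a>

    def _blocked(self, href):
        host = (urlsplit(href).hostname or "").lower()
        # No host at all (relative or malformed URL) is treated as blocked.
        return not host or any(
            host == d or host.endswith("." + d) for d in self.blocked_domains
        )

    def handle_starttag(self, tag, attrs):
        raw = self.get_starttag_text()
        attr = dict(attrs)
        classes = (attr.get("class") or "").split()
        if (self.ref_depth or REFERENCE_CLASS in classes) and tag not in VOID_TAGS:
            self.ref_depth += 1
        if tag == "a":
            ok = self.ref_depth > 0 and not self._blocked(attr.get("href") or "")
            self.anchor_ok.append(ok)
            self.out.append(raw if ok else escape(raw))  # blocked: shown as text
        else:
            self.out.append(raw)

    def handle_startendtag(self, tag, attrs):
        self.out.append(self.get_starttag_text())  # self-closing: pass through

    def handle_endtag(self, tag):
        raw = f"</{tag}>"
        if tag == "a" and self.anchor_ok:
            raw = raw if self.anchor_ok.pop() else escape(raw)
        self.out.append(raw)
        if self.ref_depth and tag not in VOID_TAGS:
            self.ref_depth -= 1

    def handle_data(self, data):
        self.out.append(data)

    def handle_entityref(self, name):
        self.out.append(f"&{name};")

    def handle_charref(self, name):
        self.out.append(f"&#{name};")


def sanitise_fragment(html_text, blocked_domains):
    parser = ReferenceLinkSanitiser(blocked_domains)
    parser.feed(html_text)
    parser.close()
    return "".join(parser.out)


print(sanitise_fragment(
    '<sup class="ext-wikilambda-reference">'
    '<a href="https://www.bbc.com/news">BBC</a> '
    '<a href="https://spam.example/x">spam</a></sup>',
    {"spam.example"},
))
```

In this sketch the BBC link survives as a link, while the blocked one comes out as escaped text, matching the "tagAllowed = false" branch.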

@Jdforrester-WMF you mentioned SpamBlacklist, but I see the following issues with that at render time:

  • BaseBlacklist::filter() requires a User $user parameter

callers:

  • AbstractWikiRequest::generateSafeFragment() — background process, zero user context
  • WikifunctionsPFragmentHandler — Parsoid rendering, no interactive user
  • ApiWikifunctionsHTMLSanitiser — the only one that actually has a user

It probably requires a user because you can set rules such as blocking a user after too many abusive URLs; it's probably mostly used on save.

And also:
SpamBlacklist's filter() doesn't just match a regex list — it also checks whether the user has permissions to bypass the filter (bots, admins, etc.). At render time we don't want user-based bypasses; we want consistent enforcement regardless of who triggered the render.


Can we go with only blockedDomains from AbuseFilter, or not? Are there other solutions?
Is SpamBlacklist with an anonymous user worth exploring?

If we would use an anonymous user:
UserFactory::newAnonymous() has no sboverride right, so the full regex check runs, I think. Pass $preventLog = true to suppress the meaningless audit log. Title can be null (already nullable in the signature).

SpamBlacklist::filter($urls, null, $anonUser, preventLog: true)

The problem is that it receives $urls and so we would first need to extract the urls from the html:
extractReferenceUrls($html) ← pre-scan for href values inside ext-wikilambda-reference context

The diagram would become something like:

caller
  └─► WikifunctionsPFragmentRenderer::render($html)
           │
           ├─ 1. extractReferenceUrls($html)    ← lightweight pre-scan for href values
           │                                       inside ext-wikilambda-reference context
           ├─ 2. loadBlockedDomains($urls)       ← SpamBlacklist::filter($urls, null, $anonUser)
           │                                          + AbuseFilter::loadComputed()
           │                                          → merged blocked domain set
           ├─ 3. sanitiseHtmlFragment($html, $blockedDomains)
           │        └─► WikifunctionsPFragmentSanitiserTokenHandler
           │                  └─ for <a> in reference context:
           │                       • check href host against $blockedDomains
           │                       • blocked → tagAllowed = false (escaped as text)
           │                       • clean  → allow href through as before
           └─ return sanitised HTML
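
Step 1 of this diagram, the pre-scan, might look roughly like this. It is a Python sketch of the idea behind extractReferenceUrls, not the extension's code. It assumes the reference context is any element with the class `ext-wikilambda-reference`, and the names are hypothetical.

```python
from html.parser import HTMLParser

REFERENCE_CLASS = "ext-wikilambda-reference"  # assumed context marker
VOID_TAGS = {"br", "hr", "img", "wbr"}        # tags that never get an end tag


class ReferenceUrlCollector(HTMLParser):
    """Collect href values of <a> tags found inside a reference element."""

    def __init__(self):
        super().__init__()
        self.urls = []
        self.ref_depth = 0  # > 0 while inside a reference element

    def handle_starttag(self, tag, attrs):
        attr = dict(attrs)
        classes = (attr.get("class") or "").split()
        if self.ref_depth or REFERENCE_CLASS in classes:
            if tag not in VOID_TAGS:
                self.ref_depth += 1
            if tag == "a" and attr.get("href"):
                self.urls.append(attr["href"])

    def handle_startendtag(self, tag, attrs):
        pass  # self-closing tags neither open a context nor count as links here

    def handle_endtag(self, tag):
        if self.ref_depth and tag not in VOID_TAGS:
            self.ref_depth -= 1


def extract_reference_urls(html_text):
    collector = ReferenceUrlCollector()
    collector.feed(html_text)
    collector.close()
    return collector.urls


fragment = (
    '<a href="https://outside.example/">ignored</a>'
    '<sup class="ext-wikilambda-reference">'
    '<a href="https://www.bbc.com/news">BBC</a><br>'
    '<a href="https://spam.example/x">spam</a></sup>'
)
print(extract_reference_urls(fragment))
# ['https://www.bbc.com/news', 'https://spam.example/x']
```

The resulting list is what would be handed to the blocklist checks in step 2. The cost is a second parse of the same HTML, which is the trade-off weighed in the replies.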

Let me know what you guys think?

Yeah, just give it a fresh logged-out-user object. Suppressing the log is a good idea, as this isn't an edit action.

We can pass them one at a time in the for <a> in reference context loop where we're validating, though that's a bit slower; but is it slower than a parsing pass to extract the hrefs in the first place?

I don't know which one is slower, honestly. If SpamBlacklist needs to do its permission checks every time, that's also slower than doing them once. Either way, it may not make a huge difference. I think I like the diagram below better; it's simpler.

caller
  └─► WikifunctionsPFragmentRenderer::render($html)
           │
           ├─ 1. loadBlockedDomains()         ← AbuseFilter::loadComputed() only (APCu, no URLs needed)
           ├─ 2. sanitiseHtmlFragment($html, $blockedDomains, $spamChecker)
           │        └─► WikifunctionsPFragmentSanitiserTokenHandler
           │                  └─ for <a> in reference context:
           │                       • SpamBlacklist::filter([$href], null, $anonUser)  ← inline, per URL
           │                       • check href host against $blockedDomains (AbuseFilter)
           │                       • blocked by either → tagAllowed = false (escaped as text)
           │                       • clean → allow href through as before
           └─ return sanitised HTML
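
The per-link decision in this simpler diagram can be sketched as one function that takes the blocked-domain set and an injected spam checker. This is a Python illustration, not the merged PHP. The `spam_checker` callable is a hypothetical wrapper standing in for the SpamBlacklist::filter() call with an anonymous user, and the demo regex is made up.

```python
import re
from typing import Callable, Iterable
from urllib.parse import urlsplit

# Hypothetical wrapper type: returns True when the URL matches the spam list.
SpamChecker = Callable[[str], bool]


def link_is_allowed(
    href: str, blocked_domains: Iterable[str], spam_checker: SpamChecker
) -> bool:
    """Allow a reference link only if neither check objects to it."""
    try:
        host = (urlsplit(href).hostname or "").lower()
    except ValueError:
        return False  # malformed URL: escape it rather than guess
    if not host:
        return False  # relative or host-less URL
    if spam_checker(href):
        return False  # regex-style check on the whole URL
    # Domain-style check: exact host or any subdomain of a blocked domain.
    return not any(host == d or host.endswith("." + d) for d in blocked_domains)


# Demo stand-in for the spam list: a single made-up regex entry.
DEMO_SPAM_RE = re.compile(r"casino-deals\.example", re.IGNORECASE)


def demo_spam_checker(url: str) -> bool:
    return DEMO_SPAM_RE.search(url) is not None


blocked = {"blocked.example"}
for url in (
    "https://www.bbc.com/news",
    "https://sub.blocked.example/page",
    "https://casino-deals.example/win",
):
    print(url, link_is_allowed(url, blocked, demo_spam_checker))
```

Passing the checker in as a parameter, as the `$spamChecker` argument in the diagram suggests, keeps the token handler testable without either extension installed.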

WFM, let's do that for now.

Change #1286449 had a related patch set uploaded (by Jforrester; author: Jforrester):

[integration/config@master] Zuul: [mediawiki/extensions/WikiLambda] Add AF and SB deps

https://gerrit.wikimedia.org/r/1286449

Change #1286449 merged by jenkins-bot:

[integration/config@master] Zuul: [mediawiki/extensions/WikiLambda] Add AF and SB deps

https://gerrit.wikimedia.org/r/1286449

Mentioned in SAL (#wikimedia-releng) [2026-05-12T18:08:00Z] <James_F> Zuul: [mediawiki/extensions/WikiLambda] Add AF and SB deps for T423180

Change #1286889 had a related patch set uploaded (by Jforrester; author: Jforrester):

[integration/config@master] Zuul: [mediawiki/extensions/WikiLambda] Drop AF and SB deps down to phan-only

https://gerrit.wikimedia.org/r/1286889

Change #1286889 merged by jenkins-bot:

[integration/config@master] Zuul: [mediawiki/extensions/WikiLambda] Drop AF and SB deps down to phan-only

https://gerrit.wikimedia.org/r/1286889

Mentioned in SAL (#wikimedia-releng) [2026-05-13T12:37:36Z] <James_F> Zuul: [mediawiki/extensions/WikiLambda] Drop AF and SB deps down to phan-only, for T423180

Change #1277079 merged by jenkins-bot:

[mediawiki/extensions/WikiLambda@master] Allow external URLs in Z89 reference context

https://gerrit.wikimedia.org/r/1277079

This result shows that although the html output looks okay, the UI renderer is still too nervous to show the link?

image.png (272×86 px, 4 KB)

I don't understand. That's not a reference. Do you mean that you want it to appear in main text? That's not what this task was about. Compare with https://www.wikifunctions.org/view/en/Z31906?call=%7B%22Z1K1%22%3A%22Z7%22%2C%22Z7K1%22%3A%22Z31906%22%2C%22Z31906K1%22%3A%22Foo%21+%3Ca+href%3D%5C%22https%3A%2F%2Fwww.bbc.com%5C%22%3EBBC%3C%2Fa%3E.+Bar.%22%7D which works as intended.

Is the definition of a "reference" anything in a <sup>? Yes, links will often be desirable as links in plain/"main" text. See these sections of a featured article, especially the Authority control box:

image.png (971×1,058 px, 159 KB)

Perhaps separately, I don't think the Wikifunctions UI itself should scramble valid HTML links when "rendered". It makes the function-writer think something is wrong with the function output. (With possible exceptions for blacklists or harmful content, but I think we could instead manage that as a community if necessary.)

There are still valid reasons to have external links in text, though they should pass through SpamBlacklist. They are usually discouraged in the article body (i.e. outside navboxes and the external links section), though.

Please file a different task, rather than derailing this one.

Ok, I see you wanted this sent elsewhere. Although IMO it was unfair to call it derailing. To me it was a good suggested amendment to drop what I think is an unnatural restriction to <sup>s. Anyway, I'll start a new task now.

Yes, that would be great. Changes like allowing external-domain links throughout the application often require security reviews and careful consideration, so they should be handled as a separate piece of work. Thanks!