
Implement mechanism to exclude a domain from externallinks database (LinkSearch)
Open, Needs Triage, Public

Description

Problem

The externallinks table is very large on certain wikis. Its size and continued growth make it the subject of urgent optimisation work led by @Ladsgroup in the DBA team.

Read more at: T300222: Implement normalizing MediaWiki link tables, T312666: Remove duplication in externallinks table, T343131: Commons database is growing way too fast, T398709: FY2025-26 WE 6.4.1: Move links tables of commons to a dedicated cluster, and most recently T403397: Externallinks in Russian Wikinews is unusually large.

Use case

This database table stores URLs that represent the outgoing links from articles. This exists primarily for preventing and finding undesirable links.

The SpamBlacklist extension uses it to automatically reject edits that add links to undesirable websites.

The interface at https://en.wikipedia.org/wiki/Special:LinkSearch lets editors search for existing links (i.e. to clean up after adding a new spam filter, or to research and assess impact before creating a filter, or to do manual clean up periodically for sites in the grey area).

It also enables features such as the Conflict of Interest reports on Meta-Wiki, through finding cross-wiki link additions by the same actor that warrant a closer look.


Proposal: Don't index certain sites

The tasks linked above made the database table more efficient; however, this wasn't enough, and in the same tasks we've also started to classify various domains as storing "unneeded links": links to websites that are 1) highly trusted by the community, and 2) extremely widely linked, such that 3) realistically one cannot traverse these or use these to limit a search, as it would often be close to a no-op that includes every page on the wiki.

On Wikimedia Commons, for example, virtually all file description pages have two links to https://creativecommons.org/ for the license, and it is always the same URL, and it gets there via a well-known license template that is centrally maintained and transcluded into each file page.

So far, in these tasks, we've converted them to interwiki links. Given that there is no UI for outgoing interwiki links, this is effectively the same as not storing the data at all.

I propose we instead introduce a configuration variable that controls a list of domains for which we don't index external links. We can then decide, together with the community, for which sites we don't need LinkSearch.

This builds on the existing $wgRegisterInternalExternals feature, which already excludes external links to the current domain (e.g. links to en.wikipedia.org from within en.wikipedia.org).
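
For reference, that existing behaviour amounts to a single setting (shown here with its default value):

// When false (the default), "external" links that point to the wiki's own
// $wgServer are not recorded in the externallinks table.
$wgRegisterInternalExternals = false;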

Note that we still have Global Search, insource, dumps, etc for finding these URLs in other ways. They just won't be in LinkSearch. And — they already aren't there today, because we converted them to interwiki links.

Strawman:

  • Add a configuration variable with a list of things to exclude from the LinkSearch index.
  • This list can be controlled on a per-wiki basis.
  • Apply the filter in ExternalLinksTable (not ParserOutput::addExternalLink), so that anything based on ParserOutput (ParserCache, EditStash, AbuseFilter, etc) is unaffected and continues to see these URLs during edits as part of positive rules (i.e. require a certain thing, or exempt filters if you add a certain thing). See the sketch after this list.
  • On Special:LinkSearch and ApiQueryExtLinksUsage: Add a friendly warning if your search matches the exclusion list, informing you why there are no results.
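
A minimal sketch of the filter referenced above, assuming the hypothetical $wgExternalLinksExcludedDomains setting from Option 1 below (nothing here exists in MediaWiki yet):

use MediaWiki\MediaWikiServices;

/**
 * Sketch only: something like this would run where ExternalLinksTable
 * computes its rows, so that ParserOutput (and thus ParserCache,
 * EditStash, AbuseFilter, etc) still sees every URL during edits.
 */
function shouldIndexExternalLink( string $url ): bool {
	$services = MediaWikiServices::getInstance();
	// Hypothetical setting name; see "Concrete options" below.
	$excluded = $services->getMainConfig()->get( 'ExternalLinksExcludedDomains' );
	// UrlUtils::matchesDomainList() is the same matcher that powers
	// $wgNoFollowDomainExceptions today.
	return !$services->getUrlUtils()->matchesDomainList( $url, $excluded );
}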

Concrete options:

  • Option 1: $wgExternalLinksExcludedDomains: List of domains excluded from the LinkSearch index. Excluding example.com will exclude all protocols, subdomains, and paths under example.com. Excluding sub.example.com will exclude all protocols and paths on that subdomain (and any sub-subdomains). We can re-use UrlUtils::matchesDomainList, which powers $wgNoFollowDomainExceptions today.
  • Option 2: $wgExternalLinkExclusionList: List of domains or URL prefixes to exclude from the LinkSearch index. This supports the same format as LinkSearch itself. So example.com would exclude HTTP+HTTPS, any subdomains, and all paths under example.com, whereas example.org/foo would exclude only links that start with /foo on that specific domain (both HTTP+HTTPS), and https://example.org/foo would limit it to HTTPS only. We can re-use the LinkFilter class, which is also how we format database rows and search queries already for LinkSearch today.
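
To make the two formats concrete, the configuration might look as follows (both variable names are proposals from this task, not existing settings):

// Option 1: domains only, matched with UrlUtils::matchesDomainList().
$wgExternalLinksExcludedDomains = [ 'creativecommons.org' ];

// Option 2: LinkSearch-style patterns, matched via the LinkFilter class.
$wgExternalLinkExclusionList = [
	'example.com',             // all protocols, subdomains, and paths
	'example.org/foo',         // HTTP+HTTPS, only paths starting with /foo
	'https://example.org/foo', // HTTPS only
];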

Alternative: Interwiki links

The strategy we've used so far is to convert them to interwiki links.

At T343131, @Ladsgroup wrote in July 2023:
  • Use interwiki links/pagelinks instead of raw https links.
In T343131#9626539, @LucasWerkmeister wrote in January 2024:

[mediawiki/extensions/WikimediaMessages] Use interwiki to link to Creative Commons
https://gerrit.wikimedia.org/r/991921

Once this rolls out with the train next week, the number of external links to https://creativecommons.org (currently ~146.8 million) should start to go down gradually, as pages are re-parsed for various reasons and use the new version of the message with an interwiki link instead of an external link. […]

At T403397, @Ladsgroup wrote in Sep 2025:

[…] And https://org.wmflabs.tools. and https://org.toolforge.pageviews. and https://org.creativecommons. should switch to interwiki links

This means instead of storing a wide externallinks row, with repeated domain index values for millions of rows:

el_id    | el_from | el_to_domain_index           | el_to_path
42330949 | 9063058 | https://org.creativecommons. | /licenses/by-sa/3.0
42330950 | 9063058 | https://org.creativecommons. | /licenses/by-sa/3.0/deed.en

... we store a notably smaller iwlinks row:

iwl_from  | iwl_prefix | iwl_title
163091007 | ccorg      | licenses/by-sa/3.0
163091007 | ccorg      | licenses/by-sa/3.0/deed.en

And besides each row being individually smaller, the rows move to a table (iwlinks) that is less used and has less in it. This thus not only moves but also splits the data, in a way that is easier to manage for backups/recovery.

However, this introduces UX downsides:

  • All wikis. The Interwiki map on Meta-Wiki applies to all wikis.
  • User-visible. This is not an internal optimization, but a user-visible change. When editing with VisualEditor/Parsoid, external links are automatically replaced by interwiki-style links, and that syntax is then shown in the interface when reviewing edits or comparing revisions in Recent Changes, History, and the Watchlist. These cannot be opened by copying from wikitext to the address bar, for example.
  • Encourages short obscure interwiki prefixes. The existing creativecommons interwiki prefix was not used; a new ccorg prefix was added instead, because it is shorter and saves more space in the database. See also: T343131#9474709
  • URL-encoding. Interwikis are meant for linking to MediaWiki titles, not arbitrary URLs. This means plus (+), underscore (_), and spaces get normalised and changed in ways that — without warning — breaks URLs. For example: T396835: Interwiki links with double underscore get rendered as single underscore.
  • Doesn't work for query strings. URLs with a query string cannot be converted to interwiki links. For example: T343131#9062161, we considered doing this for wikidata.org, but it didn't work.
  • Inconsistent. Given that these are thought of not as titles but as URLs, there is an odd asymmetry: what you enter in VisualEditor is an external link, but for the next editor this is not reversed, so that person cannot e.g. edit the link in VisualEditor and copy the input value to the address bar (if you don't care what's there, you can at least replace it easily, since VE will automatically switch to external link format). In wikitext the inverse problem exists: to update an existing link you either have to mentally apply the mapping yourself to a partial URL, or change the syntax to an external link. In the latter case, the optimisation is implicitly undone and the link is once again stored in the externallinks table.
  • Incomplete. The domain is still in the externallinks table for links that are outside the interwiki prefix, or that were otherwise edited with external-link syntax instead of interwiki syntax. This sets an expectation that, because you can search for it and get results, you get a complete dataset for that domain. One would have to know about the interwiki map, and the editor's syntax choice, to know how to interpret the results, which seems confusing.

There is also no search UI for interwiki links. But that is okay, since we've established that we don't need search for these domains. As such, if we keep them as external links but don't index them, that is even more efficient, and avoids the above problems.

Event Timeline

Noting that we use the externallinks table in Wikilink-Tool to track links to Wikipedia Library partners, who are often interested in how many citations there are to their content on Wikipedia. We would presumably need to make changes to how that data is calculated (we're currently querying the replicas) if this work went forward.

Concrete strawman:
$wgExternalLinksExcludedDomains: List of domains excluded from the LinkSearch index. Excluding example.com will exclude all protocols, subdomains, and paths under example.com. Excluding sub.example.com will exclude all protocols, paths, on that subdomain (and any sub-subdomains). We can re-use UrlUtils::matchesDomainList which powers $wgNoFollowDomainExceptions today.

Your strawman doesn't specify what domains we'd put in the list in production, which makes it difficult to evaluate. Do you just mean domains used via interwikis plus a handful of manual additions like creativecommons.org, or something else?

Concrete strawman: […]

Your strawman doesn't specify what domains we'd put in the list in production, which makes it difficult to evaluate.

There is no list of domains we're turning into interwikis, either. This is a technical intervention used as a last resort. I'm suggesting we agree to stop using interwiki conversion as the intervention, and adopt this mechanism instead. The proposal includes:

We can then decide, together with the community, for which sites we don't need LinkSearch.

Whether or not an individual domain should be included in the LinkSearch index (whether by forced interwiki conversion or by this mechanism) is orthogonal. As of writing, the only domain would be creativecommons.org.

Other domains or prefixes that have been suggested in these various tasks:

  • www.wikidata.org, on commonswiki (Aug 2023, T343131#9061982) — could not be converted to interwikis due to needing query strings.
  • wikimedia.org/api/rest_v1/, on ruwikinews (Sep 2025, T403397) — content is being changed to replace some with /w/api.php (which are already excluded) or removed without replacement.
  • tools.wmflabs.org, on ruwikinews (Sep 2025, T403397)
  • pageviews.toolforge.org, on ruwikinews (Sep 2025, T403397)
Krinkle renamed this task from Exclude trusted domains from externallinks database (LinkSearch) to Implement mechanism to exclude a domain from externallinks database (LinkSearch). Sep 18 2025, 5:44 PM
Krinkle updated the task description.

We can then decide, together with the community, for which sites we don't need LinkSearch. […] As of writing, the only domain would be creativecommons.org. […]

Ack, so in practice this likely won't affect the use for the Wikipedia Library per @Samwalton9-WMF's concern above. Sounds like a reasonable plan.

This database table stores URLs that represent the outgoing links from articles. This exists primarily for preventing and finding undesirable links.

This may have been the original use case, but I challenge the implied assertion that it’s the only use case today. I regularly use Special:LinkSearch with “trusted” domains.

So far, in these tasks, we've converted them to interwiki links. Given that there is no UI for outgoing interwiki links, this is effectively the same as not storing the data at all.

I disagree that it’s like not storing the data at all – there’s an API, and of course Quarry. (And also a task for adding a UI: T68293)


If we go ahead with this task, IMHO there’s no reason why it should be configured by domain. We could just as well exclude specific URLs, such as the CC license URLs (before those were converted to iwlinks), while keeping other, less frequent (and therefore more usably searchable) URLs on the same domain available for LinkSearch. (In this case, we can still add a warning like “results for this search will be incomplete” if the user searches for a domain that would match at least one excludelisted URL.)

[…] We could just as well exclude specific URLs, such as the CC license URLs (before those were converted to iwlinks), while keeping other, less frequent (and therefore more usably searchable) URLs on the same domain available for LinkSearch.

Makes sense. I've added the following to the task description:

Option 2: $wgExternalLinkExclusionList: List of domains or URL prefixes to exclude from the LinkSearch index. This supports the same format as LinkSearch itself. So example.com would exclude HTTP+HTTPS, any subdomains, and all paths under example.com, whereas example.org/foo would exclude only links that start with /foo on that specific domain (both HTTP+HTTPS), and https://example.org/foo would limit it to HTTPS only. We can re-use the LinkFilter class, which is also how we format database rows and search queries already for LinkSearch today.

I totally support the idea. I think we should probably go with a combination of the two options. For example:

  • All links to sister projects or other domains in our infra could be considered internal and not be recorded at all. Currently, for example, any URL to Wikimedia Commons is considered an external link in most wikis:
mysql:research@dbstore1008.eqiad.wmnet [enwiki]> select count(*) from externallinks where el_to_domain_index like 'https://org.wikimedia.%';
+----------+
| count(*) |
+----------+
|  1794229 |
+----------+
1 row in set (1 min 8.091 sec)

This is basically 1% of externallinks in enwiki (and interwiki wouldn't work all the time).

And the other way around:

mysql:research@dbstore1007.eqiad.wmnet [commonswiki]> select count(*) from externallinks where el_to_domain_index like 'https://org.wikipedia.%';
+----------+
| count(*) |
+----------+
|  2500046 |
+----------+
1 row in set (59.545 sec)
  • We could move CC and co to interwiki instead, since they are external and not under our control.

This database table stores URLs that represent the outgoing links from articles. This exists primarily for preventing and finding undesirable links.

This may have been the original use case, but I challenge the implied assertion that it’s the only use case today. I regularly use Special:LinkSearch with “trusted” domains.

I wonder how much of it can also be handled by normal search. Several times I have actually used search instead and it worked fine.

Open question: where should toolforge/WMCS belong? Interwiki, or not recorded at all? I'd go with the latter. It's sorta internal to us.

I want to move forward with implementing an "excluded domains" list for externallinks, which would be completely ignored the same way links to self are ignored, and then add our own domains to that config by default, plus toolforge on certain wikis such as ruwikinews. Any objections to that?

Yes, same as before. This is useful data that I don’t think should be thrown away. (Also, my suggestion to exclude URLs or URL prefixes, as opposed to domains – which, to be clear, I also object to – seems to have gotten lost, as you’re again only talking about domains.)