Problem
The externallinks table is very large on certain wikis. Its size and continued growth make it the subject of urgent optimisation work led by @Ladsgroup in the DBA team.
Read more at: T300222: Implement normalizing MediaWiki link tables, T312666: Remove duplication in externallinks table, T343131: Commons database is growing way too fast, T398709: FY2025-26 WE 6.4.1: Move links tables of commons to a dedicated cluster, and most recently T403397: Externallinks in Russian Wikinews is unusually large.
Use case
This database table stores the URLs of outgoing links from articles. It exists primarily to prevent and find undesirable links.
The SpamBlacklist extension uses it to automatically reject edits that add links to undesirable websites.
The interface at https://en.wikipedia.org/wiki/Special:LinkSearch lets editors search for existing links (e.g. to clean up after adding a new spam filter, to research and assess impact before creating a filter, or to periodically do manual clean-up for sites in a grey area).
It also enables features such as the Conflict of Interest reports on Meta-Wiki, through finding cross-wiki link additions by the same actor that warrant a closer look.
History:
- The database table was introduced by Tim Starling in January 2006. The SpamBlacklist extension was also created by Tim around the same time. From the MediaWiki 1.6 release notes: "A new "externallinks" table tracks URL links; this can be used by a mass spam-cleanup tool in the SpamBlacklist extension."
- A few months later, in June 2006, Brooke implemented the LinkSearch special page to enable proactive searching. This started in the LinkSearch extension and was merged into MediaWiki core in 2008.
Proposal: Don't index certain sites
The tasks linked above made the database table more efficient, but this wasn't enough, and in the same tasks we have also started to classify various domains as storing "unneeded links": links to websites that are 1) highly trusted by the community, and 2) so widely linked that 3) realistically one cannot traverse them or use them to narrow a search, as doing so would often be close to a no-op that includes every page on the wiki.
On Wikimedia Commons, for example, virtually all file description pages have two links to https://creativecommons.org/ for the license. It is always the same URL, and it gets there via a well-known license template that is centrally maintained and transcluded into each file page.
So far, in these tasks, we've converted them to interwiki links. Given that there is no UI for outgoing interwiki links, this is effectively the same as not storing the data at all.
I propose we instead introduce a configuration variable that controls a list of domains for which we don't index external links. We can then decide, together with the community, for which sites we don't need LinkSearch.
This builds on the existing $wgRegisterInternalExternals feature, which already excludes external links to the current domain (e.g. links to en.wikipedia.org from within en.wikipedia.org).
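For illustration, here is a minimal LocalSettings.php sketch of how the two settings could sit side by side. $wgRegisterInternalExternals exists today; the exclusion-list variable name is only the Option 1 proposal below, not an existing MediaWiki setting, and the value is an example.

```php
// LocalSettings.php sketch. $wgRegisterInternalExternals is an existing setting;
// $wgExternalLinksExcludedDomains is only the name proposed in Option 1 below.

// Existing behaviour this proposal builds on: with the default of false,
// external-style links pointing at the wiki's own domain are not recorded
// in the externallinks table.
$wgRegisterInternalExternals = false;

// Proposed: domains whose outgoing links are no longer indexed in externallinks,
// and therefore no longer searchable via Special:LinkSearch on this wiki.
$wgExternalLinksExcludedDomains = [
	'creativecommons.org',
];
```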
Note that we still have Global Search, insource, dumps, etc. for finding these URLs in other ways. They just won't be in LinkSearch. And they already aren't there today, because we converted them to interwiki links.
Strawman:
- Add a configuration variable with a list of things to exclude from the LinkSearch index.
- This list can be controlled on a per-wiki basis.
- Apply the filter in ExternalLinksTable (not ParserOutput::addExternalLink), so that anything based on ParserOutput (ParserCache, EditStash, AbuseFilter, etc.) is unaffected and continues to see these URLs during edits as part of positive rules (i.e. rules that require a certain link, or exempt a filter if a certain link is added). A sketch of this filter step follows this list.
- On Special:LinkSearch and ApiQueryExtLinksUsage: Add a friendly warning if your search matches the exclusion list, informing you why there are no results.
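To make the ExternalLinksTable step concrete, here is a rough, self-contained PHP sketch of the filter. The helper name and its exact placement are assumptions for illustration only; a real patch would reuse UrlUtils::matchesDomainList (Option 1) or the LinkFilter class (Option 2) rather than parse_url.

```php
<?php
// Rough sketch only, not actual MediaWiki code: drop excluded URLs just before the
// externallinks rows are built, leaving ParserOutput (and AbuseFilter etc.) untouched.

/**
 * @param string[] $urls External links collected from ParserOutput
 * @param string[] $excludedDomains e.g. [ 'creativecommons.org' ]
 * @return string[] URLs that should still be indexed in externallinks
 */
function filterIndexedExternalLinks( array $urls, array $excludedDomains ): array {
	return array_values( array_filter(
		$urls,
		static function ( string $url ) use ( $excludedDomains ): bool {
			// Extract and normalise the host; malformed URLs stay indexed as before.
			$host = strtolower( (string)( parse_url( $url, PHP_URL_HOST ) ?: '' ) );
			foreach ( $excludedDomains as $domain ) {
				$domain = strtolower( $domain );
				// Match the excluded domain itself and any subdomain of it.
				if ( $host === $domain || str_ends_with( $host, ".$domain" ) ) {
					return false;
				}
			}
			return true;
		}
	) );
}

// Example: the Creative Commons license link disappears from the index,
// while other external links keep being recorded.
print_r( filterIndexedExternalLinks(
	[
		'https://creativecommons.org/licenses/by-sa/3.0',
		'https://example.org/some/source',
	],
	[ 'creativecommons.org' ]
) );
```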
Concrete options:
- Option 1: $wgExternalLinksExcludedDomains: List of domains excluded from the LinkSearch index. Excluding example.com will exclude all protocols, subdomains, and paths under example.com. Excluding sub.example.com will exclude all protocols and paths on that subdomain (and any sub-subdomains). We can re-use UrlUtils::matchesDomainList, which powers $wgNoFollowDomainExceptions today.
- Option 2: $wgExternalLinkExclusionList: List of domains or URL prefixes to exclude from the LinkSearch index. This supports the same format as LinkSearch itself: example.com would exclude HTTP+HTTPS, any subdomains, and all paths under example.com, whereas example.org/foo would exclude only links whose path starts with /foo on that specific domain (both HTTP and HTTPS), and https://example.org/foo would limit it to HTTPS only. We can re-use the LinkFilter class, which is also how we already format database rows and search queries for LinkSearch today. Example values for both options are sketched below.
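For comparison, here is what the two shapes could look like in LocalSettings.php. Both variable names are the proposals above, not existing settings, and the values are illustrative.

```php
// Option 1: domain-only matching, with the same semantics as $wgNoFollowDomainExceptions.
$wgExternalLinksExcludedDomains = [
	'creativecommons.org', // all protocols, subdomains, and paths
	'sub.example.com',     // that subdomain (and its sub-subdomains) only
];

// Option 2: LinkSearch-style patterns, which can also carry a path prefix
// and an explicit protocol.
$wgExternalLinkExclusionList = [
	'creativecommons.org',     // HTTP+HTTPS, any subdomain, all paths
	'example.org/foo',         // only paths starting with /foo, both protocols
	'https://example.org/foo', // same, but HTTPS only
];
```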
Alternative: Interwiki links
The strategy we've used so far is to convert these links to interwiki links.
This means instead of storing a wide externallinks row, with repeated domain index values for millions of rows:
| el_id | el_from | el_to_domain_index | el_to_path |
|---|---|---|---|
| 42330949 | 9063058 | https://org.creativecommons. | /licenses/by-sa/3.0 |
| 42330950 | 9063058 | https://org.creativecommons. | /licenses/by-sa/3.0/deed.en |
... we store a notably smaller iwlinks row:
| iwl_from | iwl_prefix | iwl_title |
|---|---|---|
| 163091007 | ccorg | licenses/by-sa/3.0 |
| 163091007 | ccorg | licenses/by-sa/3.0/deed.en |
And besides each row being individually smaller, the rows move to a table (iwlinks) that is less used and contains less data. This not only moves but also splits the data, in a way that is easier to manage for backups and recovery.
However, this introduces UX downsides:
- All wikis. The interwiki map on Meta-Wiki applies to all wikis, so a prefix added to optimise one wiki becomes available everywhere.
- User-visible. This is not an internal optimization, but a user-visible change. When editing with VisualEditor/Parsoid, external links are automatically replaced by interwiki-style links, and that syntax is then shown in the interface when reviewing edits or comparing revisions in Recent Changes, History, and the Watchlist. These cannot be opened by copying from wikitext to the address bar, for example.
- Encourages short, obscure interwiki prefixes. The existing creativecommons interwiki prefix was not used; instead a new ccorg prefix was added, because it is shorter and saves more space in the database. See also: T343131#9474709
- URL-encoding. Interwikis are meant for linking to MediaWiki titles, not arbitrary URLs. This means plus (+), underscore (_), and spaces get normalised and changed in ways that break URLs without warning. For example: T396835: Interwiki links with double underscore get rendered as single underscore.
- Doesn't work for query strings. URLs with a query string cannot be converted to interwiki links. For example, in T343131#9062161 we considered doing this for wikidata.org, but it didn't work.
- Inconsistent. Given that these are thought of as URLs rather than titles, there is an odd asymmetry: what you enter in VisualEditor is an external link, but for the next editor the conversion is not reversed, so they cannot, for example, edit the link in VisualEditor and copy the input value to the address bar. (If you don't care what's there, you can at least replace it easily, since VE will automatically switch back to external-link format.) In wikitext the inverse problem exists: to update an existing link you either have to mentally apply the mapping yourself to a partial URL, or change the syntax to an external link. In the latter case, the optimisation is implicitly undone and the link is stored in the externallinks table after all.
- Incomplete data. The domain is still in the externallinks table for links that fall outside the interwiki prefix, or that were otherwise edited with external-link syntax instead of interwiki syntax. This sets an expectation that, because you can search for the domain and get results, you get a complete dataset for it. One would have to know about the interwiki map, and about each editor's syntax choice, to know how to interpret the results, which seems confusing.
There is also no search UI for interwiki links. But that is okay, since we've established that we don't need search for these domains. As such, if we keep them as external links but don't index them, that is even more efficient, and it avoids the problems above.