Implement mechanism to exclude a domain from externallinks database (LinkSearch)
Closed, ResolvedPublic

Description

Problem

The externallinks table is very large on certain wikis. Its size and continued growth make it the subject of urgent optimisation work led by @Ladsgroup in the DBA team.

Read more at: T300222: Implement normalizing MediaWiki link tables, T312666: Remove duplication in externallinks table, T343131: Commons database is growing way too fast, T398709: FY2025-26 WE 6.4.1: Move links tables of commons to a dedicated cluster, and most recently T403397: Externallinks in Russian Wikinews is unusually large.

Use case

This database table stores URLs that represent the outgoing links from articles. This exists primarily for preventing and finding undesirable links.

The SpamBlacklist extension uses it to automatically reject edits that add links to undesirable websites.

The interface at https://en.wikipedia.org/wiki/Special:LinkSearch lets editors search for existing links (e.g. to clean up after adding a new spam filter, to research and assess impact before creating a filter, or to do periodic manual cleanup for sites in the grey area).

It also enables features such as the Conflict of Interest reports on Meta-Wiki, through finding cross-wiki link additions by the same actor that warrant a closer look.

History:

Proposal: Don't index certain sites

The tasks linked above made the database table more efficient, but this wasn't enough. In the same tasks we've also started to classify various domains as storing "unneeded links": links to websites that are 1) highly trusted by the community, and 2) so widely linked that 3) one cannot realistically traverse them or use them to limit a search, as doing so would often be close to a no-op that includes every page on the wiki.

On Wikimedia Commons, for example, virtually all file description pages have two links to https://creativecommons.org/ for the license, and it is always the same URL, and it gets there via a well-known license template that is centrally maintained and transcluded into each file page.

So far, in these tasks, we've converted them to interwiki links. Given that there is no UI for outgoing interwiki links, this is effectively the same as not storing the data at all.

I propose we instead introduce a configuration variable that controls a list of domains for which we don't index external links. We can then decide, together with the community, for which sites we don't need LinkSearch.

This builds on the existing $wgRegisterInternalExternals feature, which already excludes external links to the current domain (e.g. links to en.wikipedia.org from within en.wikipedia.org).

Note that we still have Global Search, insource, dumps, etc for finding these URLs in other ways. They just won't be in LinkSearch. And — they already aren't there today, because we converted them to interwiki links.

Strawman:

  • Add a configuration variable with a list of things to exclude from the LinkSearch index.
  • This list can be controlled on a per-wiki basis.
  • Apply the filter in ExternalLinksTable (not ParserOutput::addExternalLink), so that anything based on ParserOutput (ParserCache, EditStash, AbuseFilter, etc) is unaffected and continues to see these URLs during edits as part of positive rules (i.e. require a certain thing, or exempt filters if you add a certain thing).
  • On Special:LinkSearch and ApiQueryExtLinksUsage: Add a friendly warning if your search matches the exclusion list, informing you why there are no results.
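
The strawman above can be sketched as follows. This is a minimal illustration in Python rather than MediaWiki's PHP; `is_ignored`, `links_for_index`, `search_notice`, and the `IGNORE_DOMAINS` value are hypothetical names for this sketch, not existing MediaWiki APIs. The point is only that the full link list stays available to edit-time consumers, while the filter applies at the step that writes the index.

```python
from typing import Optional
from urllib.parse import urlsplit

# Hypothetical per-wiki setting; the real list would be decided with the community.
IGNORE_DOMAINS = ["creativecommons.org"]


def is_ignored(host: str, ignore_domains: list[str]) -> bool:
    """True if host equals an ignored domain or is a subdomain of one."""
    host = host.lower()
    return any(host == d or host.endswith("." + d) for d in ignore_domains)


def links_for_index(parsed_links: list[str], ignore_domains: list[str]) -> list[str]:
    """Links to write to the index. The input list is not modified, so
    edit-time consumers of the parse result still see every URL."""
    return [
        url for url in parsed_links
        if not is_ignored(urlsplit(url).hostname or "", ignore_domains)
    ]


def search_notice(query_host: str, ignore_domains: list[str]) -> Optional[str]:
    """Explain an empty result when the searched host is on the exclusion list."""
    if is_ignored(query_host, ignore_domains):
        return f"Links to {query_host} are not indexed on this wiki."
    return None
```

Keeping the filter out of the parse result is what lets positive rules (e.g. an AbuseFilter that requires a certain link) continue to work in this strawman.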

Concrete options:

  • Option 1: $wgExternalLinksIgnoreDomains: List of domains excluded from the LinkSearch index. Excluding example.com will exclude all protocols, subdomains, and paths under example.com. Excluding sub.example.com will exclude all protocols and paths on that subdomain (and any sub-subdomains). We can re-use UrlUtils::matchesDomainList, which powers $wgNoFollowDomainExceptions today.
  • Option 2: $wgExternalLinkExclusionList: List of domains or URL prefixes to exclude from the LinkSearch index. This supports the same format as LinkSearch itself. So example.com would exclude HTTP+HTTPS, any subdomains, and all paths under example.com, whereas example.org/foo would exclude only links that start with /foo on that specific domain (both HTTP+HTTPS), and https://example.org/foo would limit it to HTTPS only. We can re-use the LinkFilter class, which is also how we format database rows and search queries already for LinkSearch today.
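
To make the difference between the two options concrete, here is a rough Python sketch of the matching rules as described above. It does not call UrlUtils::matchesDomainList or LinkFilter; the function names are hypothetical, and the Option 2 parsing is an assumption based only on the three examples given (in particular, it assumes subdomains match even when a path is present).

```python
from urllib.parse import urlsplit


def host_matches(host: str, domain: str) -> bool:
    """True if host is the domain itself or any subdomain of it."""
    host, domain = host.lower(), domain.lower()
    return host == domain or host.endswith("." + domain)


def excluded_by_domain(url: str, ignore_domains: list[str]) -> bool:
    """Option 1: protocol and path are ignored, only the host is compared."""
    host = urlsplit(url).hostname or ""
    return any(host_matches(host, d) for d in ignore_domains)


def excluded_by_prefix(url: str, pattern: str) -> bool:
    """Option 2: optional scheme, a host, and an optional path prefix."""
    if "://" in pattern:
        scheme, rest = pattern.split("://", 1)
        schemes = {scheme.lower()}
    else:
        # No scheme in the pattern: treat it as covering HTTP and HTTPS.
        schemes, rest = {"http", "https"}, pattern
    pattern_host, slash, tail = rest.partition("/")
    parts = urlsplit(url)
    if parts.scheme not in schemes:
        return False
    if not host_matches(parts.hostname or "", pattern_host):
        return False
    # An empty prefix (bare domain) matches every path.
    return parts.path.startswith(slash + tail)
```

Under these assumptions, `example.org/foo` excludes `http://example.org/foo/bar` but not `http://example.org/bar`, while `https://example.org/foo` leaves the HTTP variant indexed.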

Alternative: Interwiki links

The strategy we've used so far is to convert them to interwiki links.

At T343131, @Ladsgroup wrote in July 2023:
  • Use interwiki links/pagelinks instead of raw https links.
In T343131#9626539, @LucasWerkmeister wrote in January 2024:

[mediawiki/extensions/WikimediaMessages] Use interwiki to link to Creative Commons
https://gerrit.wikimedia.org/r/991921

Once this rolls out with the train next week, the number of external links to https://creativecommons.org (currently ~146.8 million) should start to go down gradually, as pages are re-parsed for various reasons and use the new version of the message with an interwiki link instead of an external link. […]

At T403397, @Ladsgroup wrote in Sep 2025:

[…] And https://org.wmflabs.tools. and https://org.toolforge.pageviews. and https://org.creativecommons. should switch to interwiki links

This means instead of storing a wide externallinks row, with repeated domain index values for millions of rows:

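+----------+---------+------------------------------+-----------------------------+
| el_id    | el_from | el_to_domain_index           | el_to_path                  |
+----------+---------+------------------------------+-----------------------------+
| 42330949 | 9063058 | https://org.creativecommons. | /licenses/by-sa/3.0         |
| 42330950 | 9063058 | https://org.creativecommons. | /licenses/by-sa/3.0/deed.en |
+----------+---------+------------------------------+-----------------------------+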

... we store a notably smaller iwlinks row:

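+-----------+------------+----------------------------+
| iwl_from  | iwl_prefix | iwl_title                  |
+-----------+------------+----------------------------+
| 163091007 | ccorg      | licenses/by-sa/3.0         |
| 163091007 | ccorg      | licenses/by-sa/3.0/deed.en |
+-----------+------------+----------------------------+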

Besides each row being smaller, the rows move to a table (iwlinks) that is less used and holds less data. This not only moves the data but splits it in a way that is easier to manage for backups and recovery.
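
For readers unfamiliar with the el_to_domain_index values shown in the externallinks rows above: judging from the examples in this task, the host is stored with its labels reversed and a trailing dot, and the rest of the URL goes into el_to_path. The following Python sketch reproduces that shape. It is a simplified illustration, not MediaWiki's actual implementation, and `to_domain_index` is a hypothetical name; ports, user info, and non-HTTP schemes are ignored here.

```python
from urllib.parse import urlsplit


def to_domain_index(url: str) -> tuple[str, str]:
    """Split a URL into (reversed-host index, path incl. query and fragment)."""
    parts = urlsplit(url)
    host = parts.hostname or ""
    # "creativecommons.org" -> "org.creativecommons."
    reversed_host = ".".join(reversed(host.split("."))) + "."
    path = parts.path
    if parts.query:
        path += "?" + parts.query
    if parts.fragment:
        path += "#" + parts.fragment
    return f"{parts.scheme}://{reversed_host}", path


index, path = to_domain_index("https://creativecommons.org/licenses/by-sa/3.0")
print(index, path)  # https://org.creativecommons. /licenses/by-sa/3.0
```

Reversing the labels means a search for a domain and all of its subdomains becomes a simple prefix match on this column, which is also why the same index value repeats for every row pointing at a heavily linked site.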

However, this introduces UX downsides:

  • All wikis. The Interwiki map on Meta-Wiki applies to all wikis.
  • User-visible. This is not an internal optimization, but a user-visible change. When editing with VisualEditor/Parsoid, external links are automatically replaced by interwiki-style links, and that syntax is then shown in the interface when reviewing edits or comparing revisions in Recent Changes, History, and the Watchlist. These cannot be opened by copying from wikitext to the address bar, for example.
  • Encourages short obscure interwiki prefixes. The Interwiki prefix creativecommons was not used in favor of adding a new ccorg prefix, because it is shorter and saves more space in the database. See also: T343131#9474709
  • URL-encoding. Interwikis are meant for linking to MediaWiki titles, not arbitrary URLs. This means plus (+), underscore (_), and spaces get normalised and changed in ways that break URLs without warning. For example: T396835: Interwiki links with double underscore get rendered as single underscore.
  • Doesn't work for query strings. URLs with a query string cannot be converted to interwiki links. For example: T343131#9062161, we considered doing this for wikidata.org, but it didn't work.
  • Inconsistent. Given that these are not thought of as titles, but as URLs, there is an odd asymmetry: what you enter in VisualEditor is an external link, but this is not reversed for the next editor, who cannot e.g. edit the link in VisualEditor and copy the input value to the address bar (if you don't care what's there, you can at least replace it easily, since VE will automatically switch to external link format). In wikitext the inverse problem exists: to update an existing link you either have to mentally apply the mapping yourself to a partial URL, or change the syntax to an external link. In the latter case, the optimisation is implicitly undone and the link is stored in the externallinks table again.
  • The domain is still in the externallinks table for links that are outside the interwiki prefix, or that were otherwise edited with externallink-syntax instead of iw-syntax. This sets an expectation that, because you can search for it and get results, you get a complete dataset for that domain. One would have to know about the interwiki map, and the editor syntax choice, to know how to interpret it, which seems confusing.

There is also no search UI for interwiki links. But, that is okay since we've established that we don't need search for these domains. As such, if we keep them as external links but don't index them, that is even more efficient, and avoids the above problems.

Event Timeline

Noting that we use the externallinks table in Wikilink-Tool to track links to Wikipedia Library partners, who are often interested in how many citations there are to their content on Wikipedia. We would presumably need to make changes to how that data is calculated (we're currently querying the replicas) if this work went forward.

Concrete strawman:
$wgExternalLinksExcludedDomains: List of domains excluded from the LinkSearch index. Excluding example.com will exclude all protocols, subdomains, and paths under example.com. Excluding sub.example.com will exclude all protocols and paths on that subdomain (and any sub-subdomains). We can re-use UrlUtils::matchesDomainList, which powers $wgNoFollowDomainExceptions today.

Your strawman doesn't specify what domains we'd put in the list in production, which makes it difficult to evaluate. Do you just mean domains used via interwikis plus a handful of manual additions like creativecommons.org, or something else?

Concrete strawman: […]

Your strawman doesn't specify what domains we'd put in the list in production, which makes it difficult to evaluate.

There is no list of domains we're turning into interwikis, either. This is a technical intervention used as last resort. I'm suggesting we agree to stop using interwiki conversion as the intervention, and adopt this mechanism instead. The proposal includes:

We can then decide, together with the community, for which sites we don't need LinkSearch.

Whether or not an individual domain should be included in the LinkSearch index (whether by forced interwiki conversion or by this mechanism) is orthogonal. As of writing, the only domain would be creativecommons.org.

Other domains or prefixes that have been suggested in these various tasks:

  • www.wikidata.org, on commonswiki (Aug 2023, T343131#9061982) — could not be converted to interwikis due to needing query strings.
  • wikimedia.org/api/rest_v1/, on ruwikinews (Sep 2025, T403397) — content is being changed to replace some with /w/api.php (which are already excluded) or removed without replacement.
  • tools.wmflabs.org, on ruwikinews (Sep 2025, T403397)
  • pageviews.toolforge.org, on ruwikinews (Sep 2025, T403397)
Krinkle renamed this task from Exclude trusted domains from externallinks database (LinkSearch) to Implement mechanism to exclude a domain from externallinks database (LinkSearch). Sep 18 2025, 5:44 PM
Krinkle updated the task description.

Ack, so in practice this likely won't affect the use for the Wikipedia Library per @Samwalton9-WMF's concern above. Sounds like a reasonable plan.

This database table stores URLs that represent the outgoing links from articles. This exists primarily for preventing and finding undesirable links.

This may have been the original use case, but I challenge the implied assertion that it’s the only use case today. I regularly use Special:LinkSearch with “trusted” domains.

So far, in these tasks, we've converted them to interwiki links. Given that there is no UI for outgoing interwiki links, this is effectively the same as not storing the data at all.

I disagree that it’s like not storing the data at all – there’s an API, and of course Quarry. (And also a task for adding a UI: T68293)


If we go ahead with this task, IMHO there’s no reason why it should be configured by domain. We could just as well exclude specific URLs, such as the CC license URLs (before those were converted to iwlinks), while keeping other, less frequent (and therefore more usably searchable) URLs on the same domain available for LinkSearch. (In this case, we can still add a warning like “results for this search will be incomplete” if the user searches for a domain that would match at least one excludelisted URL.)

[…] We could just as well exclude specific URLs, such as the CC license URLs (before those were converted to iwlinks), while keeping other, less frequent (and therefore more usably searchable) URLs on the same domain available for LinkSearch.

Makes sense. I've added the following to the task description:

Option 2: $wgExternalLinkExclusionList: List of domains or URL prefixes to exclude from the LinkSearch index. This supports the same format as LinkSearch itself. So example.com would exclude HTTP+HTTPS, any subdomains, and all paths under example.com, whereas example.org/foo would exclude only links that start with /foo on that specific domain (both HTTP+HTTPS), and https://example.org/foo would limit it to HTTPS only. We can re-use the LinkFilter class, which is also how we format database rows and search queries already for LinkSearch today.

I totally support the idea. I think we should probably go with a combination of the two options. For example:

  • All links to sister projects or other domains in our infra could be considered internal and not be recorded at all. Currently, any URL to Wikimedia Commons is considered an external link in most wikis. For example:
mysql:research@dbstore1008.eqiad.wmnet [enwiki]> select count(*) from externallinks where el_to_domain_index like 'https://org.wikimedia.%';
+----------+
| count(*) |
+----------+
|  1794229 |
+----------+
1 row in set (1 min 8.091 sec)

This is basically 1% of externallinks in enwiki. (and interwiki wouldn't work all the time).

And the other way around:

mysql:research@dbstore1007.eqiad.wmnet [commonswiki]> select count(*) from externallinks where el_to_domain_index like 'https://org.wikipedia.%';
+----------+
| count(*) |
+----------+
|  2500046 |
+----------+
1 row in set (59.545 sec)
  • We could move CC and co to interwiki instead. Since they are external and not under our control.

This database table stores URLs that represent the outgoing links from articles. This exists primarily for preventing and finding undesirable links.

This may have been the original use case, but I challenge the implied assertion that it’s the only use case today. I regularly use Special:LinkSearch with “trusted” domains.

I wonder how much of it can be also handled by normal search. Several times I actually used search instead and it worked fine.

Open question: where should toolforge/WMCS belong? Interwiki, or not recorded at all? I'd go with the latter. It's sorta internal to us.

I want to move forward with implementing an "excluded domains" setting for externallinks, whose links would be completely ignored the same way links to self are ignored. And then add our domains as defaults to that config, plus add toolforge on certain wikis such as ruwikinews. Any objections to that?

Yes, same as before. This is useful data that I don’t think should be thrown away. (Also, my suggestion to exclude URLs or URL prefixes, as opposed to domains – which, to be clear, I also object to – seems to have gotten lost, as you’re again only talking about domains.)

To which I responded that the data is still available through search, e.g. https://en.wikipedia.org/w/index.php?search=insource%3A%22.toolforge.org%2F%22&title=Special%3ASearch&ns0=1, and you didn't respond to that.

Change #1214683 had a related patch set uploaded (by Ladsgroup; author: Amir Sarabadani):

[mediawiki/core@master] ParserOutput: Allow for ignoring a set of domains for externallinks

https://gerrit.wikimedia.org/r/1214683

Change #1214683 merged by jenkins-bot:

[mediawiki/core@master] ParserOutput: Allow for ignoring a set of domains for externallinks

https://gerrit.wikimedia.org/r/1214683

User notice suggestion:

URLs to other wikimedia projects won't be registered as external links meaning they won't be recorded on externallinks table, won't be searchable in Special:LinkSearch and won't be checked against spam blacklist or abuse filters added_links variable. This is to reduce the undue load on the databases and improve editing save time. Also a limited number of trusted websites that are heavily used might be added to each wiki individually. For example Creative Commons website will be added to the ignore list in Wikimedia Commons due to large number of links to their license pages.

Edit mercilessly

Ladsgroup moved this task from Triage to In progress on the DBA board.

Suggested wording to make this more concise:

To improve database and site performance, URLs to other Wikimedia projects will no longer be stored in the externallinks table. This means they will not be searchable in Special:LinkSearch, and won't be checked by the Spam Blacklist or AbuseFilter. In the future this may be extended to certain highly-linked trusted websites on a per-wiki basis, such as Creative Commons on Wikimedia Commons.

Thanks!

Change #1215666 had a related patch set uploaded (by Ladsgroup; author: Amir Sarabadani):

[mediawiki/core@master] SpecialLinkSearch: Add a message when domains are being ignored

https://gerrit.wikimedia.org/r/1215666

The top notice might draw a bit much visual attention when we list all WMF domains in a <code> markup. In Vector 22 it'll push down the UI by 4 lines.

[Screenshots: T405005-msg-top_code.png, T405005-msg-top_code-vector22.png]

Alt top notice, plain:

[Screenshot: T405005-msg-top_plain.png]

Alt bottom notice, with <code>:

[Screenshots: T405005-msg-bottom_code.png, Screenshot 2025-12-05 at 18.38.09.png]

Change #1215674 had a related patch set uploaded (by Krinkle; author: Krinkle):

[mediawiki/core@master] SpecialLinkSearch: Move ignored-domains msg to bottom

https://gerrit.wikimedia.org/r/1215674

Change #1215666 merged by jenkins-bot:

[mediawiki/core@master] SpecialLinkSearch: Add a message when domains are being ignored

https://gerrit.wikimedia.org/r/1215666

Change #1215674 merged by jenkins-bot:

[mediawiki/core@master] SpecialLinkSearch: Move ignored-domains msg to bottom

https://gerrit.wikimedia.org/r/1215674

List of domains that are used more than 1M times in each wiki:

arwiki
el_to_domain_index count(*)
https://com.google.www. 6083538
https://org.archive.web. 4066707
https://org.wikipedia.en. 1747914
https://com.nytimes.www. 1260370
https://org.jstor.www. 1233171
https://com.google.scholar. 1217232
https://org.wmflabs.wikipedialibrary. 1213924
https://org.wikidata.www. 1144367

arzwiki
el_to_domain_index count(*)
https://org.archive.web. 1063241

cawiki
el_to_domain_index count(*)
https://org.wikidata.www. 5645283

cebwiki
el_to_domain_index count(*)
http://gov.nasa.gsfc.sci.neo. 7977575
http://org.geonames.www. 7402781
https://org.toolforge.geohack. 5732627
http://org.geonames.sws. 3698224
http://org.catalogueoflife.www. 2519899
https://org.doi. 1914953
http://net.hydrol-earth-syst-sci.www. 1914586
http://org.viewfinderpanoramas.www. 1801367
https://org.wikipedia.sv. 1801260
https://org.wmflabs.tools. 1225978
http://org.wmflabs.tools. 1043911

cewiki
el_to_domain_index count(*)
https://org.archive.web. 1130023

commonswiki
el_to_domain_index count(*)
https://org.toolforge.wikimap. 53156292
https://org.toolforge.geohack. 52727502
http://fr.gouv.legifrance.www. 37424319
https://com.flickr.www. 34706337
https://org.wikidata.www. 29462572
https://org.worldcat.www. 27122003
https://org.openstreetmap.www. 25700413
https://com.google.maps. 24320003
https://uk.org.geograph.www. 22788291
https://org.creativecommons. 22711883
https://gov.loc.id. 22612032
https://org.oclc.isni. 21855476
https://org.wikidata.query. 20626537
https://org.toolforge.quickstatements. 20508723
https://au.gov.nla.librariesaustralia. 14905866
https://edu.getty.www. 14554651
https://com.flickr. 13714545
https://chat.libera.web. 13321155
https://org.archive.web. 12057651
https://org.rightsstatements. 9816407
https://la.dp. 8225893
https://org.toolforge.reasonator. 6955830
https://org.toolforge.kmlexport. 6636604
https://info.d-nb. 5775703
http://fr.bnf.catalogue. 5772848
https://org.wmflabs.petscan. 5698897
https://org.wmcloud.wikidocumentaries-demo. 5697991
https://org.toolforge.glamtools. 5683955
https://org.toolforge.locator-tool. 5672621
http://fr.bnf.gallica. 5660468
https://fr.bnf.www. 5601025
http://fr.courdecassation.www. 5341839
https://gov.nasa.www. 4635788
https://gov.govinfo.www. 4604642
http://org.hubblesite. 4603182
https://gov.nasa.jpl.www. 4602921
https://gov.nasa.apod. 4602050
https://gov.nasa.gsfc.nssdc. 4601979
https://gov.nasa.nascom.sohowww. 4601870
https://gov.nasa.images. 4464334
https://org.wikimedia.commons. 4385137
https://gov.nasa.jsc.eol. 4142816
https://gov.archives.catalog. 3913111
https://gov.loc.www. 3518842
https://org.archive. 3479994
https://fr.bnf.catalogue. 2903063
https://org.wikimedia.m.commons. 2592973
http://org.wikimapia. 2425973
https://uk.org.finds. 2402130
https://org.wikimedia.ticket. 2199743

cywiki
el_to_domain_index count(*)
https://org.wikidata.www. 2020452

dewiki
el_to_domain_index count(*)
https://org.toolforge.geohack. 3723829
https://org.wikimedia.commons. 1478030
https://org.archive.web. 1373717

enwiki
el_to_domain_index count(*)
https://org.archive.web. 17232691
https://com.google.www. 14013340
https://org.doi. 3676068
https://org.toolforge.geohack. 3305089
https://org.wmflabs.tools. 3125024
http://com.reverseinternet. 2842023
https://org.jstor.www. 2435972
https://com.google.books. 2395485
https://org.wikidata.www. 2263091
https://com.google.scholar. 2100947
https://org.wmcloud.iabot. 1864808
https://org.worldcat.search. 1713450
https://org.toolforge.spamcheck. 1578852
https://gov.nih.nlm.ncbi.pubmed. 1547249
https://org.toolforge.blocked-links-log. 1542890
https://org.wmflabs.wikipedialibrary. 1522327
https://org.archive. 1268384
https://gov.nih.nlm.ncbi.www. 1187316
https://org.toolforge.cluebotng. 1156043
https://org.viaf. 1058204

enwikinews
el_to_domain_index count(*)
http://org.wikipedia.en. 2731974

eswiki
el_to_domain_index count(*)
https://org.archive.web. 3652610
https://org.wikimedia.commons. 1828675
https://org.wikidata.www. 1679800

fawiki
el_to_domain_index count(*)
https://com.google.www. 3481938
https://org.wikipedia.en. 1094204

frwiki
el_to_domain_index count(*)
https://org.wikidata.www. 5567244
https://fr.insee.www. 1214383
http://org.toolserver. 1208127

glwiki
el_to_domain_index count(*)
https://org.wikidata.www. 1940830

hywiktionary
el_to_domain_index count(*)
http://com.nayiri.www. 1302983

idwiki
el_to_domain_index count(*)
https://org.archive.web. 2073698

idwiktionary
el_to_domain_index count(*)
https://id.go.kemdikbud.kbbi. 1907445

itwiki
el_to_domain_index count(*)
https://org.wikidata.www. 2776954
https://org.archive.web. 2403804
https://org.wikimedia.commons. 1943015

jawiki
el_to_domain_index count(*)
https://org.archive.web. 1036437

metawiki
el_to_domain_index count(*)
https://com.google.www. 10960394
https://com.appspot.wikipediatools. 4349958
http://com.reverseinternet. 4029493
https://org.wikipedia.en. 3129768
https://org.toolforge.iw. 3029118
https://org.toolforge.blocked-links-log. 2907986
https://com.domaintools.whois. 2137898
http://org.wikipedia.en. 2088580
https://org.wikimedia.commons. 1732263
https://org.toolforge.spamcheck. 1452111
https://com.exalead.www. 1449971
https://org.aboutus.www. 1449970
https://com.malwaredomainlist.www. 1449970

mgwiktionary
el_to_domain_index count(*)
http://org.wiktionary.en. 2654850
http://org.wiktionary.fr. 1406306

mswiki
el_to_domain_index count(*)
https://com.google.www. 1351387

nlwiki
el_to_domain_index count(*)
https://org.archive.web. 1091999

nowiki
el_to_domain_index count(*)
https://org.wikidata.www. 1775250

plwiki
el_to_domain_index count(*)
https://org.wmflabs.tools. 1346857
https://org.archive.web. 1152956

ptwiki
el_to_domain_index count(*)
https://com.google.www. 1755722

rowiki
el_to_domain_index count(*)
https://org.wikidata.www. 1199542

ruwiki
el_to_domain_index count(*)
https://org.archive.web. 7450251

ruwikinews
el_to_domain_index count(*)
https://org.archive.web. 6867681
https://org.wmflabs.tools. 5794306
https://com.twitter. 1603663
https://com.facebook.www. 1535426
https://me.t. 1514596
https://com.livejournal.www. 1496910
https://ru.ok.connect. 1496910
https://ru.vkontakte. 1496910

svwiki
el_to_domain_index count(*)
http://org.catalogueoflife.www. 1857462
https://org.archive.web. 1164553
https://org.toolforge.geohack. 1071289

trwiki
el_to_domain_index count(*)
https://org.archive.web. 2149386

ttwiki
el_to_domain_index count(*)
https://org.wikidata. 1574981

ukwiki
el_to_domain_index count(*)
https://org.wikidata.www. 3183979
https://org.archive.web. 2967071

uzwiki
el_to_domain_index count(*)
https://com.google.www. 1063226

viwiki
el_to_domain_index count(*)
https://com.facebook.www. 16721339

warwiki
el_to_domain_index count(*)
https://org.catalogueoflife. 1139841

wikidatawiki
el_to_domain_index count(*)
https://uk.ac.ebi.www. 25299832
https://org.crossref.api. 12155290
https://gov.nih.nlm.ncbi.eutils. 9705427
https://gov.nih.nlm.ncbi.pubmed. 5154719
http://org.europepmc. 4966444
https://org.orcid.pub. 2662612
https://net.opencitations. 2319548
https://org.wikipedia.en. 2230922
https://org.europepmc. 2161752
https://com.elsevier.api. 1839685
http://com.springer.link. 1213426
http://uk.ac.ebi.www. 1125104
https://wiki.fatcat.api. 1006989

zhwiki
el_to_domain_index count(*)
https://org.archive.web. 5858019
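
The query that produced the list above isn't included in this task. A grouping of roughly the following shape would give the same kind of output; this is an assumption, shown here against an in-memory SQLite table with made-up rows and a tiny threshold rather than the production replicas. `heavy_domains` is a hypothetical helper name.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE externallinks ("
    " el_id INTEGER PRIMARY KEY,"
    " el_from INTEGER,"
    " el_to_domain_index TEXT,"
    " el_to_path TEXT)"
)
# Made-up sample rows, only to exercise the query.
sample_rows = [
    (1, "https://org.archive.web.", "/web/1"),
    (2, "https://org.archive.web.", "/web/2"),
    (3, "https://org.archive.web.", "/web/3"),
    (1, "https://org.wikidata.www.", "/wiki/Q1"),
    (2, "https://org.wikidata.www.", "/wiki/Q2"),
    (3, "https://com.example.", "/"),
]
conn.executemany(
    "INSERT INTO externallinks (el_from, el_to_domain_index, el_to_path)"
    " VALUES (?, ?, ?)",
    sample_rows,
)


def heavy_domains(conn: sqlite3.Connection, threshold: int) -> list[tuple[str, int]]:
    """Domain index values that occur more than `threshold` times, largest first."""
    return conn.execute(
        "SELECT el_to_domain_index, COUNT(*) FROM externallinks"
        " GROUP BY el_to_domain_index"
        " HAVING COUNT(*) > ?"
        " ORDER BY COUNT(*) DESC",
        (threshold,),
    ).fetchall()


for domain_index, count in heavy_domains(conn, 1):
    print(domain_index, count)
```

On a real wiki the threshold would be 1000000, and a full-table aggregation like this is slow enough that it belongs on an analytics replica.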

Good that we are excluding en.wikipedia.org and www.wikidata.org, a lot will be removed. geohack.toolforge.org seems to be a sensible one to add too (in some wikis at least).

We also need to purge existing rows from externallinks table.

Not really, with the exception of ruwikinews and maybe Commons.
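
Where an explicit purge is wanted (e.g. ruwikinews), one common approach is to delete in small batches so that each write stays short. The sketch below illustrates that idea against an in-memory SQLite table; it is not an existing maintenance script, `purge_domain` is a hypothetical name, and the `LIMIT` inside the subquery is SQLite syntax that would need rewriting for MariaDB. Note also that http:// and https:// variants are separate index values and would each need purging.

```python
import sqlite3


def purge_domain(conn: sqlite3.Connection, domain_index: str, batch_size: int = 1000) -> int:
    """Delete all rows for one el_to_domain_index value, batch_size rows at a time."""
    deleted = 0
    while True:
        cur = conn.execute(
            "DELETE FROM externallinks WHERE el_id IN ("
            " SELECT el_id FROM externallinks"
            " WHERE el_to_domain_index = ? LIMIT ?)",
            (domain_index, batch_size),
        )
        conn.commit()  # one short transaction per batch
        if cur.rowcount == 0:
            return deleted
        deleted += cur.rowcount


# Small demonstration with made-up rows.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE externallinks ("
    " el_id INTEGER PRIMARY KEY,"
    " el_from INTEGER,"
    " el_to_domain_index TEXT,"
    " el_to_path TEXT)"
)
conn.executemany(
    "INSERT INTO externallinks (el_from, el_to_domain_index, el_to_path)"
    " VALUES (?, ?, ?)",
    [(i, "https://org.wmflabs.tools.", f"/tool/{i}") for i in range(5)]
    + [(9, "https://com.example.", "/keep")],
)
removed = purge_domain(conn, "https://org.wmflabs.tools.", batch_size=2)
remaining = conn.execute("SELECT COUNT(*) FROM externallinks").fetchone()[0]
print(removed, remaining)
```

Without a purge, rows for ignored domains are only removed as pages are re-parsed and their links updates run.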

Change #1218276 had a related patch set uploaded (by Ladsgroup; author: Amir Sarabadani):

[operations/mediawiki-config@master] Set wgExternalLinksIgnoreDomains in production

https://gerrit.wikimedia.org/r/1218276

Change #1218295 had a related patch set uploaded (by Ladsgroup; author: Amir Sarabadani):

[mediawiki/core@wmf/1.46.0-wmf.5] SpecialLinkSearch: Add a message when domains are being ignored

https://gerrit.wikimedia.org/r/1218295

Change #1218296 had a related patch set uploaded (by Ladsgroup; author: Amir Sarabadani):

[mediawiki/core@wmf/1.46.0-wmf.5] ParserOutput: Allow for ignoring a set of domains for externallinks

https://gerrit.wikimedia.org/r/1218296

Change #1218296 merged by jenkins-bot:

[mediawiki/core@wmf/1.46.0-wmf.5] ParserOutput: Allow for ignoring a set of domains for externallinks

https://gerrit.wikimedia.org/r/1218296

Mentioned in SAL (#wikimedia-operations) [2025-12-15T14:58:22Z] <ladsgroup@deploy2002> Started scap sync-world: Backport for [[gerrit:1218296|ParserOutput: Allow for ignoring a set of domains for externallinks (T405005)]]

Mentioned in SAL (#wikimedia-operations) [2025-12-15T15:01:10Z] <ladsgroup@deploy2002> ladsgroup: Backport for [[gerrit:1218296|ParserOutput: Allow for ignoring a set of domains for externallinks (T405005)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there.

Mentioned in SAL (#wikimedia-operations) [2025-12-15T15:06:48Z] <ladsgroup@deploy2002> Finished scap sync-world: Backport for [[gerrit:1218296|ParserOutput: Allow for ignoring a set of domains for externallinks (T405005)]] (duration: 08m 26s)

Change #1218314 had a related patch set uploaded (by Ladsgroup; author: Krinkle):

[mediawiki/core@wmf/1.46.0-wmf.5] SpecialLinkSearch: Move ignored-domains msg to bottom

https://gerrit.wikimedia.org/r/1218314

Change #1218276 merged by jenkins-bot:

[operations/mediawiki-config@master] Set wgExternalLinksIgnoreDomains in production

https://gerrit.wikimedia.org/r/1218276

Mentioned in SAL (#wikimedia-operations) [2025-12-15T15:36:23Z] <ladsgroup@deploy2002> Started scap sync-world: Backport for [[gerrit:1218276|Set wgExternalLinksIgnoreDomains in production (T405005)]]

Mentioned in SAL (#wikimedia-operations) [2025-12-15T15:38:22Z] <ladsgroup@deploy2002> ladsgroup: Backport for [[gerrit:1218276|Set wgExternalLinksIgnoreDomains in production (T405005)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there.

Mentioned in SAL (#wikimedia-operations) [2025-12-15T15:44:43Z] <ladsgroup@deploy2002> Finished scap sync-world: Backport for [[gerrit:1218276|Set wgExternalLinksIgnoreDomains in production (T405005)]] (duration: 08m 20s)

Let the fun begin:

DELETE /* MediaWiki\Deferred\LinksUpdate\LinksTable::doWrites  */ FROM `externallinks` WHERE ((el_from = 75796948 AND ((el_to_domain_index = 'https://org.toolforge.geohack.' AND el_to_path = '/geohack.php?pagename=File:Former_Prange_Way,_Lakeshore_Mall_-_Flickr_-_MichaelSteeber_(2).jpg&params=044.116772_N_-087.637917_E_globe:Earth_type:camera_alt:184_source:exif_heading:42.56&language=en') OR (el_to_domain_index = 'https://org.toolforge.wikimap.' AND el_to_path = '/?wp=false&cluster=false&zoom=16&lat=044.116772&lon=-087.637917'))))
...
DELETE /* MediaWiki\Deferred\LinksUpdate\LinksTable::doWrites  */ FROM `externallinks` WHERE ((el_from = 142810652 AND ((el_to_domain_index = 'https://org.wikidata.www.' AND el_to_path = '/w/index.php?search=The+WIZO+organization+holding+a+party+for+new+immigrants+haswbstatement:P195%3DQ30526831+haswbstatement:P170%3DQ123250780&title=Special:Search&profile=advanced&fulltext=1&ns0=1') OR (el_to_domain_index = 'https://org.wikidata.query.' AND el_to_path = '/#SELECT%20?item%20?itemLabel%20?image%0AWHERE%20%7B%0A%20?item%20wdt:P195%20wd:Q30526831.%0A%20?item%20wdt:P170%20wd:Q123250780.%0A%20OPTIONAL%20%7B%20?item%20wdt:P18%20?image%20%7D%20.%0A%20SERVICE%20wikibase:label%20%7B%20bd:serviceParam%20wikibase:language%20%22en%22.%20%7D%0A%7D') OR (el_to_domain_index = 'https://org.creativecommons.' AND el_to_path = '/licenses/by/4.0/deed.en') OR (el_to_domain_index = 'https://org.wikimedia.ticket.' AND el_to_path = '/otrs/index.pl?Action=AgentTicketZoom&TicketNumber=2021092910005788'))))
...
DELETE /* MediaWiki\Deferred\LinksUpdate\LinksTable::doWrites  */ FROM `externallinks` WHERE ((el_from = 142810647 AND ((el_to_domain_index = 'https://org.wikidata.www.' AND el_to_path = '/w/index.php?search=The+WIZO+organization+holding+a+party+for+new+immigrants+haswbstatement:P195%3DQ30526831+haswbstatement:P170%3DQ123250780&title=Special:Search&profile=advanced&fulltext=1&ns0=1') OR (el_to_domain_index = 'https://org.wikidata.query.' AND el_to_path = '/#SELECT%20?item%20?itemLabel%20?image%0AWHERE%20%7B%0A%20?item%20wdt:P195%20wd:Q30526831.%0A%20?item%20wdt:P170%20wd:Q123250780.%0A%20OPTIONAL%20%7B%20?item%20wdt:P18%20?image%20%7D%20.%0A%20SERVICE%20wikibase:label%20%7B%20bd:serviceParam%20wikibase:language%20%22en%22.%20%7D%0A%7D') OR (el_to_domain_index = 'https://org.creativecommons.' AND el_to_path = '/licenses/by/4.0/deed.en') OR (el_to_domain_index = 'https://org.wikimedia.ticket.' AND el_to_path = '/otrs/index.pl?Action=AgentTicketZoom&TicketNumber=2021092910005788'))))
...
DELETE /* MediaWiki\Deferred\LinksUpdate\LinksTable::doWrites  */ FROM `externallinks` WHERE ((el_from = 142810648 AND ((el_to_domain_index = 'https://org.wikidata.www.' AND el_to_path = '/w/index.php?search=The+WIZO+organization+holding+a+party+for+new+immigrants+haswbstatement:P195%3DQ30526831+haswbstatement:P170%3DQ123250780&title=Special:Search&profile=advanced&fulltext=1&ns0=1') OR (el_to_domain_index = 'https://org.wikidata.query.' AND el_to_path = '/#SELECT%20?item%20?itemLabel%20?image%0AWHERE%20%7B%0A%20?item%20wdt:P195%20wd:Q30526831.%0A%20?item%20wdt:P170%20wd:Q123250780.%0A%20OPTIONAL%20%7B%20?item%20wdt:P18%20?image%20%7D%20.%0A%20SERVICE%20wikibase:label%20%7B%20bd:serviceParam%20wikibase:language%20%22en%22.%20%7D%0A%7D') OR (el_to_domain_index = 'https://org.creativecommons.' AND el_to_path = '/licenses/by/4.0/deed.en') OR (el_to_domain_index = 'https://org.wikimedia.ticket.' AND el_to_path = '/otrs/index.pl?Action=AgentTicketZoom&TicketNumber=2021092910005788'))))
...
DELETE /* MediaWiki\Deferred\LinksUpdate\LinksTable::doWrites  */ FROM `externallinks` WHERE ((el_from = 142810608 AND ((el_to_domain_index = 'https://org.wikidata.www.' AND el_to_path = '/w/index.php?search=The+WIZO+organization+holding+a+party+for+new+immigrants+haswbstatement:P195%3DQ30526831+haswbstatement:P170%3DQ123250780&title=Special:Search&profile=advanced&fulltext=1&ns0=1') OR (el_to_domain_index = 'https://org.wikidata.query.' AND el_to_path = '/#SELECT%20?item%20?itemLabel%20?image%0AWHERE%20%7B%0A%20?item%20wdt:P195%20wd:Q30526831.%0A%20?item%20wdt:P170%20wd:Q123250780.%0A%20OPTIONAL%20%7B%20?item%20wdt:P18%20?image%20%7D%20.%0A%20SERVICE%20wikibase:label%20%7B%20bd:serviceParam%20wikibase:language%20%22en%22.%20%7D%0A%7D') OR (el_to_domain_index = 'https://org.creativecommons.' AND el_to_path = '/licenses/by/4.0/deed.en') OR (el_to_domain_index = 'https://org.wikimedia.ticket.' AND el_to_path = '/otrs/index.pl?Action=AgentTicketZoom&TicketNumber=2021092910005788'))))
...
DELETE /* MediaWiki\Deferred\LinksUpdate\LinksTable::doWrites  */ FROM `externallinks` WHERE ((el_from = 142810634 AND ((el_to_domain_index = 'https://org.wikidata.www.' AND el_to_path = '/w/index.php?search=The+WIZO+organization+holding+a+party+for+new+immigrants+haswbstatement:P195%3DQ30526831+haswbstatement:P170%3DQ123250780&title=Special:Search&profile=advanced&fulltext=1&ns0=1') OR (el_to_domain_index = 'https://org.wikidata.query.' AND el_to_path = '/#SELECT%20?item%20?itemLabel%20?image%0AWHERE%20%7B%0A%20?item%20wdt:P195%20wd:Q30526831.%0A%20?item%20wdt:P170%20wd:Q123250780.%0A%20OPTIONAL%20%7B%20?item%20wdt:P18%20?image%20%7D%20.%0A%20SERVICE%20wikibase:label%20%7B%20bd:serviceParam%20wikibase:language%20%22en%22.%20%7D%0A%7D') OR (el_to_domain_index = 'https://org.creativecommons.' AND el_to_path = '/licenses/by/4.0/deed.en') OR (el_to_domain_index = 'https://org.wikimedia.ticket.' AND el_to_path = '/otrs/index.pl?Action=AgentTicketZoom&TicketNumber=2021092910005788'))))
...
DELETE /* MediaWiki\Deferred\LinksUpdate\LinksTable::doWrites  */ FROM `externallinks` WHERE ((el_from = 142810662 AND ((el_to_domain_index = 'https://org.wikidata.www.' AND el_to_path = '/w/index.php?search=The+WIZO+organization+holding+a+party+for+new+immigrants+haswbstatement:P195%3DQ30526831+haswbstatement:P170%3DQ123250780&title=Special:Search&profile=advanced&fulltext=1&ns0=1') OR (el_to_domain_index = 'https://org.wikidata.query.' AND el_to_path = '/#SELECT%20?item%20?itemLabel%20?image%0AWHERE%20%7B%0A%20?item%20wdt:P195%20wd:Q30526831.%0A%20?item%20wdt:P170%20wd:Q123250780.%0A%20OPTIONAL%20%7B%20?item%20wdt:P18%20?image%20%7D%20.%0A%20SERVICE%20wikibase:label%20%7B%20bd:serviceParam%20wikibase:language%20%22en%22.%20%7D%0A%7D') OR (el_to_domain_index = 'https://org.creativecommons.' AND el_to_path = '/licenses/by/4.0/deed.en') OR (el_to_domain_index = 'https://org.wikimedia.ticket.' AND el_to_path = '/otrs/index.pl?Action=AgentTicketZoom&TicketNumber=2021092910005788'))))
...
DELETE /* MediaWiki\Deferred\LinksUpdate\LinksTable::doWrites  */ FROM `externallinks` WHERE ((el_from = 142810658 AND ((el_to_domain_index = 'https://org.wikidata.www.' AND el_to_path = '/w/index.php?search=The+WIZO+organization+holding+a+party+for+new+immigrants+haswbstatement:P195%3DQ30526831+haswbstatement:P170%3DQ123250780&title=Special:Search&profile=advanced&fulltext=1&ns0=1') OR (el_to_domain_index = 'https://org.wikidata.query.' AND el_to_path = '/#SELECT%20?item%20?itemLabel%20?image%0AWHERE%20%7B%0A%20?item%20wdt:P195%20wd:Q30526831.%0A%20?item%20wdt:P170%20wd:Q123250780.%0A%20OPTIONAL%20%7B%20?item%20wdt:P18%20?image%20%7D%20.%0A%20SERVICE%20wikibase:label%20%7B%20bd:serviceParam%20wikibase:language%20%22en%22.%20%7D%0A%7D') OR (el_to_domain_index = 'https://org.creativecommons.' AND el_to_path = '/licenses/by/4.0/deed.en') OR (el_to_domain_index = 'https://org.wikimedia.ticket.' AND el_to_path = '/otrs/index.pl?Action=AgentTicketZoom&TicketNumber=2021092910005788'))))

I need to be afk, I'll backport the messages once I'm back.

With the current config, we will remove ~170M rows from the externallinks table on Commons.
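For anyone reading the DELETE statements above: the el_to_domain_index values look like the URL scheme followed by the host with its labels reversed and a trailing dot. Below is a minimal Python sketch of that mapping, plus a guess at how an ignore list could be applied. The IGNORED_DOMAINS list and the subdomain-matching rule are assumptions made for illustration, not the actual MediaWiki implementation, which also has to handle ports, IP addresses and non-HTTP schemes.

```python
from urllib.parse import urlsplit

# Hypothetical ignore list, for illustration only. The real value is whatever
# wgExternalLinksIgnoreDomains is set to for a given wiki.
IGNORED_DOMAINS = ["creativecommons.org", "wikidata.org"]


def domain_index(url: str) -> str:
    """Return the scheme plus the reversed host, e.g. 'https://org.wikidata.www.'."""
    parts = urlsplit(url)
    host = (parts.hostname or "").lower()
    reversed_host = ".".join(reversed(host.split(".")))
    return f"{parts.scheme}://{reversed_host}."


def is_ignored(url: str, ignored=IGNORED_DOMAINS) -> bool:
    """Assumed rule: ignore a URL whose host is a listed domain or a subdomain of one."""
    host = (urlsplit(url).hostname or "").lower()
    return any(host == d or host.endswith("." + d) for d in ignored)


if __name__ == "__main__":
    for u in (
        "https://creativecommons.org/licenses/by/4.0/deed.en",
        "https://geohack.toolforge.org/geohack.php",
    ):
        print(domain_index(u), is_ignored(u))
```

Under these assumptions, a link is not written to the table at all if is_ignored() is true, and existing rows for it are dropped the next time the page's links are updated, which is what the DELETEs above show.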

Change #1218295 merged by jenkins-bot:

[mediawiki/core@wmf/1.46.0-wmf.5] SpecialLinkSearch: Add a message when domains are being ignored

https://gerrit.wikimedia.org/r/1218295

Change #1218314 abandoned by Ladsgroup:

[mediawiki/core@wmf/1.46.0-wmf.5] SpecialLinkSearch: Move ignored-domains msg to bottom

Reason:

I need to do this again.

https://gerrit.wikimedia.org/r/1218314

Mentioned in SAL (#wikimedia-operations) [2025-12-16T02:11:20Z] <ladsgroup@deploy2002> Started scap sync-world: Backport for [[gerrit:1218295|SpecialLinkSearch: Add a message when domains are being ignored (T405005)]]

Mentioned in SAL (#wikimedia-operations) [2025-12-16T02:36:39Z] <ladsgroup@deploy2002> ladsgroup: Backport for [[gerrit:1218295|SpecialLinkSearch: Add a message when domains are being ignored (T405005)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there.

Mentioned in SAL (#wikimedia-operations) [2025-12-16T02:50:07Z] <ladsgroup@deploy2002> Finished scap sync-world: Backport for [[gerrit:1218295|SpecialLinkSearch: Add a message when domains are being ignored (T405005)]] (duration: 38m 47s)

Ladsgroup moved this task from In progress to Done on the DBA board.

I let the change that moves the message to the bottom roll out with the train (this week). The feature is working and is removing a lot of rows.

@Ladsgroup where is the list of excluded domains a) published, and b) open for discussion? For example, any Russian Wikipedia article about a geographical object links to that place on Google, Yandex, and OSM maps and on Geohack (a Toolforge service), so we have 670k links to all 4 services; these are the most externally linked services in ruwiki (stats: https://ru.wikipedia.org/wiki/Project:Внешние_ссылки/Статистика ). I think map services could be excluded too.

It's the value of wgExternalLinksIgnoreDomains. So for example: https://noc.wikimedia.org/wiki.php?wiki=ruwiki#wgExternalLinksIgnoreDomains

You can certainly discuss it, and if the community agrees, you can add it. No objection from my side.