Page MenuHomePhabricator

Remove duplication in externallinks table
Closed, ResolvedPublic

Description

Background

MediaWiki's link database tables are among the largest tables of any WMF production database. Optimizing the way we store and retreive this would generally see benefits such as:

  • lower diskspace per database: slower growth, less hardware in the long run (co-locate more wikis together), and ensures each individual replica host can continue to fit and perform well with a full core database of a given wiki.
  • faster backup and recovery,
  • lower replication lag (smaller queries, smaller binlogs).

This is a continuation in spirit of T222224: RFC: Normalize MediaWiki link tables and T300222: Implement normalizing MediaWiki link tables, where a similar optimization and major win was made to the templatelinks table (with upcoming work for pagelinks, imagelinks and categorylinks).

externalinks

This table has grown a lot in production. It's one of the biggest tables for Commons, at 200GB, and will cause more issues in the future.

Unlike other links table, this table can't be normalized using the linktarget feature (T299417). We currently store each external link upto 8 times as part of a single row (and its indexes).

For example, this is a typical row today:

*************************** 4. row ***************************
      el_id: 9
    el_from: 379228
      el_to: http://www.iau.org/public_press/themes/naming/#minorplanets
   el_index: http://org.iau.www./public_press/themes/naming/#minorplanets
el_index_60: http://org.iau.www./public_press/themes/naming/#minorplanets

One external link is three ways in three columns of a given row, and those columns in turn are used in five different indexes.

Option A: Create new table

Create a new table holding the domai names (call it, externaldomain?) and it would hold the domain and its inverse.

Something like:

ed_id: 1234
ed_domain: http://www.iau.org
ed_index: http://org.iau.www.

The row externallinks would be reduced to

       el_id: 9
     el_from: 379228
el_to_domain: 1234 # Key to externaldomain.ed_id
  el_to_path: public_press/themes/naming/#minorplanets

Option B:

Avoid creating a new table by storing the domain inline, using only the index variant. We go from 8 to 3 copies (1 in the columns, plus two large indexes).

             el_id: 9
           el_from: 379228
el_to_domain_index: http://org.iau.www.
        el_to_path: /public_press/themes/naming/#minorplanets

Change the indexes:

- INDEX el_from           (el_from,            el_to(40)),
- INDEX el_from_index_60  (el_from,            el_index_60,         el_id),
+ INDEX el_page_domain_id (el_from,            el_to_domain_index,  el_id),

- INDEX el_to          (el_to(60),          el_from),
+ INDEX el_domain_page (el_to_domain_index, el_from),

- INDEX el_index     (el_index(60)),
- INDEX el_index_60  (el_index_60,        el_id),

Option B-2 (chosen)

Per T312666#8146447, even further minimised the index. This reduces from 8 to 2 copies (1 in the columns, and 1 large index).

             el_id: 9
           el_from: 379228
el_to_domain_index: http://org.iau.www.
        el_to_path: /public_press/themes/naming/#minorplanets

Change the indexes:

- INDEX el_from           (el_from,            el_to(40)),
- INDEX el_from_index_60  (el_from,            el_index_60,         el_id),
+ INDEX el_from           (el_from,            el_id),

- INDEX el_to   (el_to(60),          el_from),
+ INDEX el_to   (el_to_domain_index, el_from),

- INDEX el_index     (el_index(60)),
- INDEX el_index_60  (el_index_60,        el_id),

Prior art: T59176: ApiQueryExtLinksUsage::run query has crazy limit

Related Objects

StatusSubtypeAssignedTask
ResolvedBawolff
OpenNone
ResolvedLadsgroup
ResolvedFeatureLadsgroup
ResolvedLadsgroup
ResolvedLadsgroup
ResolvedLadsgroup
ResolvedLadsgroup
ResolvedLadsgroup
ResolvedLadsgroup
ResolvedLadsgroup
DuplicateLadsgroup
ResolvedLadsgroup
ResolvedLadsgroup
ResolvedLadsgroup
ResolvedLadsgroup
ResolvedAntoine_Quhen
ResolvedLadsgroup
ResolvedLadsgroup
ResolvedLadsgroup
OpenLadsgroup
ResolvedBUG REPORTLadsgroup

Event Timeline

There are a very large number of changes, so older changes are hidden. Show Older Changes

As an aside note, -probably not needed for the new proposal-, but following @Krinkle's comment, el_index_60 functionality, on modern SQL, should be done through a functional index, something like:

el_index_60 VARCHAR(60) AS (LEFT(el_to, 60)) VIRTUAL,
INDEX el_index_60 (el_index_60, el_id)

I don't know if LEFT(el_to, 60) is the right function, but you get the idea- it has the advantage of being able to create custom indexes without storing additional column data.

I think the new proposal doesn't require this, but it felt relevant mentioning it because in other cases virtual columns + functional indexes (supported in all modern sql engines) could be very useful to add functionality with very low disk "cost".

Updated option B-2:

- INDEX el_from           (el_from,            el_to(40)),
- INDEX el_from_index_60  (el_from,            el_index_60,         el_id),
+ INDEX el_from           (el_from,            el_id),

- INDEX el_to   (el_to(60),          el_from),
+ INDEX el_to   (el_to_domain_index, el_from),

- INDEX el_index     (el_index(60)),
- INDEX el_index_60  (el_index_60,        el_id),
  • Applies info from above: The main index for el_from (which I called "el_page_domain_id" above) does not need to include el_domain_index. Amir indicates the number of links per page is low enough that for the specific case of ExternalLinksTable deleting one row for a given page, and ApiQueryExternalLinks filtering out certain URL protocols for a given page; we can let those filesort in memory.
  • Renames indexes back to original names for consistency with other tables.

Updated option B-2:

- INDEX el_from           (el_from,            el_to(40)),
- INDEX el_from_index_60  (el_from,            el_index_60,         el_id),
+ INDEX el_from           (el_from,            el_id),

- INDEX el_to   (el_to(60),          el_from),
+ INDEX el_to   (el_to_domain_index, el_from),

- INDEX el_index     (el_index(60)),
- INDEX el_index_60  (el_index_60,        el_id),
  • Applies info from above: The main index for el_from (which I called "el_page_domain_id" above) does not need to include el_domain_index. Amir indicates the number of links per page is low enough that for the specific case of ExternalLinksTable deleting one row for a given page, and ApiQueryExternalLinks filtering out certain URL protocols for a given page; we can let those filesort in memory.

Yup, thanks.

  • Renames indexes back to original names for consistency with other tables.

It is good but doing schema changes for this usually leads to all sorts of mess, MySQL updater doesn't have an easy way to actually find an index is modified and drop and re-introduce it with new columns. It's much much simpler for schema changes to have a different name for different index. We did a similar thing ipblocks a year ago, quite a mess. if we can rename them to el_from_id and el_to_from, it'd be amazing.

  • Renames indexes back to original names for consistency with other tables.

[…]. if we can rename them to el_from_id and el_to_from, it'd be amazing.

Ack. I mainly hope that we ignore my bad idea that (regretfully) included "prefix" and "page" in the names, which don't appear in any field names. Anything else is fine :)

About the protocol column. Currently to find the specific link you need to search both http and https prefix and moving the protocol out from the index column it would simplify the querying and likely makes it faster.

Ie. my pretty common use case in toolforge is to find images uploaded from Finna, Europeana, and Flickr and try to figure out if images behind the links are the same as images in Commons.

SELECT * FROM externallinks WHERE
     (el_index_60 LIKE "http://fi.finna%" OR el_index_60 LIKE "https://fi.finna%") 
     AND el_to LIKE "%id=hkm.HKMS000005:km0000lgdn%"

About the protocol column. Currently to find the specific link you need to search both http and https prefix and moving the protocol out from the index column it would simplify the querying and likely makes it faster.

Ie. my pretty common use case in toolforge is to find images uploaded from Finna, Europeana, and Flickr and try to figure out if images behind the links are the same as images in Commons.

SELECT * FROM externallinks WHERE
     (el_index_60 LIKE "http://fi.finna%" OR el_index_60 LIKE "https://fi.finna%") 
     AND el_to LIKE "%id=hkm.HKMS000005:km0000lgdn%"

I think we should push for replacing http:// links with https:// if the external site is serving https. It removes a redirect from every reader (it adds up really easily, see T120085), and it's much more secure for the readers as well.

I think we should push for replacing http:// links with https:// if the external site is serving https. It removes a redirect from every reader (it adds up really easily, see T120085), and it's much more secure for the readers as well.

We already have https://www.mediawiki.org/wiki/Extension:SecureLinkFixer that makes rewrites URLs to HTTPS if the domain always requires HTTPS. If we have a list of domains with the same content on http and https, it seems worth doing the same with them as well.

However, at a storage level, we should probably keep the links using the protocol that was used when writing them on the web page (e.g. suppose a domain was removed from the https list)

I think we should push for replacing http:// links with https:// if the external site is serving https. It removes a redirect from every reader (it adds up really easily, see T120085), and it's much more secure for the readers as well.

We already have https://www.mediawiki.org/wiki/Extension:SecureLinkFixer that makes rewrites URLs to HTTPS if the domain always requires HTTPS. If we have a list of domains with the same content on http and https, it seems worth doing the same with them as well.

[off-topic] Originally the plan was to use solely HSTS, but that doesn't cover everything. I would be open to figuring out how we can expand the list, maybe even an on-wiki list. File a task against SLF?

However, at a storage level, we should probably keep the links using the protocol that was used when writing them on the web page (e.g. suppose a domain was removed from the https list)

My understanding is that if SLF changes the link, externallinks will register the HTTPS version, because that's the actual URL that gets linked to. I'm wrong (h/t Krinkle) the original link is registered, see e.g. https://www.mediawiki.org/w/api.php?action=query&titles=User:Legoktm/SLF&prop=extlinks

I think we should push for replacing http:// links with https:// if the external site is serving https. It removes a redirect from every reader (it adds up really easily, see T120085), and it's much more secure for the readers as well.

We already have https://www.mediawiki.org/wiki/Extension:SecureLinkFixer that makes rewrites URLs to HTTPS if the domain always requires HTTPS. If we have a list of domains with the same content on http and https, it seems worth doing the same with them as well.

It's great that it exists!

However, at a storage level, we should probably keep the links using the protocol that was used when writing them on the web page (e.g. suppose a domain was removed from the https list)

Sure it's not really useful to migrate all of links automatically in the provided list, I don't think that'd be a good idea but I think if we just collect statistics and migrate most important ones (e.g. NYT, BBC, etc.), It'll be a massive improvement and I highly doubt any major website roll back from HTTPS and in the off chance it happening, we have to switch everything back to HTTP anyway (which again, is extremely unlikely)

Thanks for working on fixing externallinks. There are a lot of requests around Special:Linksearch that we should see if the new schema will help and hopefully not hinder. Specifically:

Thank you, I looked at them

hmm, this won't be impacted by this redesign. Generally I think this will be mostly addressed by switching to HTTPS as I doubt people would like to see ftp:// links of bbc. It also has a rather easy solution. Just make the domain be IN(...) and then just build list of all acceptable protocols. It's extremely fast compared to something like %com.foo.bar% which implementing this solution would definitely be easier once we store the domain in one (1) place only :P

This won't be impacted by this redesign. It basically needs what we did with templatelinks, add el_from_namespace column.

This will be fixed by this redesign.

This will be fixed by this redesign.

This won't be impacted by this redesign. It basically needs what we did with templatelinks, add el_from_namespace column.

Ack. I will defer to you on whether this is something that makes sense to also bundle into this schema change or defer for another time.

And added the ones that will be fixed as parent tasks, whee!

Change 826371 had a related patch set uploaded (by Ladsgroup; author: Amir Sarabadani):

[mediawiki/core@master] [POC] How the new externallinks would look like

https://gerrit.wikimedia.org/r/826371

I built a POC:

MariaDB [client]> select * from externallinks\G
*************************** 1. row ***************************
             el_id: 14
           el_from: 1
             el_to: https://www.youtube.com/watch?v=o2gGD4ewTNA
          el_index: https://com.youtube.www./watch?v=o2gGD4ewTNA
       el_index_60: https://com.youtube.www./watch?v=o2gGD4ewTNA
el_to_domain_index: https://com.youtube.www.
        el_to_path: /watch?v=o2gGD4ewTNA
1 row in set (0.000 sec)

Queries API and the special page make are something like this:

[DBQuery] ApiQueryExternalLinks::execute [0s] unknown: SELECT el_from,el_to,el_index_60,el_id FROM `externallinks` WHERE el_from = 1 AND ((el_to NOT LIKE '//%' ESCAPE '`' ) OR (el_index_60 LIKE 'http://%' ESCAPE '`' )) ORDER BY el_index_60,el_id LIMIT 11

[DBQuery] SpecialLinkSearch::reallyDoQuery [0s] unknown: SELECT page_namespace AS `namespace`,page_title AS `title`,el_index AS `value`,el_to AS `url` FROM `page`,`externallinks` WHERE (page_id = el_from) AND (el_to_domain_index LIKE 'https://com.youtube.%' ESCAPE '`' ) AND (el_to_path LIKE '' ESCAPE '`' ) ORDER BY el_index_60,el_id LIMIT 51

Somehow the special page eats the el_to_path or something like that, I wonder if we would need an index on el_to_domain_index, el_to_path. In that case, both shouldn't be blobs, But we can have 255 limit for characters in the domain and that should cover more than enough for any kind of search.

I wrote a script and went through the dump of enwiki's externallinks, it broke partially done but you get the idea.

By domain I mean schema + host + port. Path is the rest, including path, query parameters, etc.

It looked at 11,156,314 links, out of which 1,464,260 were unique domains. The ratio of distinct domain to total number of links is 7.62. It's not small but not large (compared to templates which is around 100 and pagelinks which is around 18)

The average length of domains is 24.56 chars and for path it's 38.09 characters.
The percentiles for domain length are: 24 (50 percentile), 27 (75), 31 (90), 38 (99), the max value had 386 characters.
The percentiles for path length are: 32 (50 percentile), 51 (75), 72 (90), 155 (99), the max value had 32948 characters (!!).

It says we can get away with 255 for domain and it will handle basically every case, blob makes more sense for path. It show there is a value in normalization of externallinks and specially given the comparable average size with paths, it means the normalization table even if repeats every domain, it would take away 40% of the actual values and basically split the table into two. Two medium-sized tables are much better than one large one. For alter, maint, etc. That being said, we can do this in multiple steps. first, take the domain into a separate field, clean up indexes and build it properly (option B-2) then if it's considerably big and in a couple of years, let's take the domain out into its own table after addressing other fires.

Change 867740 had a related patch set uploaded (by Ladsgroup; author: Amir Sarabadani):

[operations/mediawiki-config@master] Externallinks: Set Persian Wikiquote to WRITE BOTH

https://gerrit.wikimedia.org/r/867740

Change 889575 had a related patch set uploaded (by Ladsgroup; author: Amir Sarabadani):

[mediawiki/core@master] [WIP] api: Add support for externallinks read new

https://gerrit.wikimedia.org/r/889575

Change 915662 had a related patch set uploaded (by Ladsgroup; author: Amir Sarabadani):

[operations/puppet@production] mariadb: Allow new externallinks fields to be queried in wikireplicas

https://gerrit.wikimedia.org/r/915662

Change 915662 merged by Ladsgroup:

[operations/puppet@production] mariadb: Allow new externallinks fields to be queried in wikireplicas

https://gerrit.wikimedia.org/r/915662

Krinkle updated the task description. (Show Details)

Please see T337149 before deploying this to any more wikis. SCHEMA_COMPAT_READ_NEW is causing major problems with ConfirmEdit, AbuseFilter, and SpamBlacklist.

Here from the wikitech email. Is there any way to identify protocol-less URLs besides a search like this one (which drags up a lot of false positives)?

Here from the wikitech email. Is there any way to identify protocol-less URLs besides a search like this one (which drags up a lot of false positives)?

Since the old columns have not been removed yet, I think you can query with el_to like '//%' and that should give you a list. Otherwise search is your only option.

I dropped the columns in enwiki in beta cluster via this command

ALTER TABLE externallinks
DROP COLUMN el_to,
DROP COLUMN el_index,
DROP COLUMN el_index_60,
DROP INDEX el_to,
DROP INDEX el_index,
DROP INDEX el_index_60,
DROP INDEX el_from_index_60;

it went from 197MB to 61MB

It seems the explanations of the columns on https://www.mediawiki.org/wiki/Manual:Externallinks_table were not yet added. Specifically for el_to_domain_index and el_to_path ?

I've made an attempt, but please correct me if I'm wrong.

It seems the explanations of the columns on https://www.mediawiki.org/wiki/Manual:Externallinks_table were not yet added. Specifically for el_to_domain_index and el_to_path ?

I've made an attempt, but please correct me if I'm wrong.

Thanks. It looks great. I hope eventually we get this in tables.json and then automatically generated. One day.

Noting that externallinks in test wiki has gone from 189MB to 26MB after drop of the old columns in db2109.

And in ruwikinews (the biggest table in s3), went from 62GB to 19GB (without the previous optimization, it was 130GB).

@Danny_B noted on IRC that https://www.mediawiki.org/wiki/Manual:Externallinks_table has not yet been updated to show that el_index and el_index_60 have been/are being dropped. This came up while debugging T344866: Queries to externallinks table fail following schema changes.

Change 954900 had a related patch set uploaded (by Ladsgroup; author: Amir Sarabadani):

[mediawiki/core@master] Schema: Drop old externallinks columns and indexes

https://gerrit.wikimedia.org/r/954900

Change 954900 merged by jenkins-bot:

[mediawiki/core@master] Schema: Drop old externallinks columns and indexes

https://gerrit.wikimedia.org/r/954900

Change 954952 had a related patch set uploaded (by Ladsgroup; author: Amir Sarabadani):

[mediawiki/core@master] ExternalLinks: Drop migration code

https://gerrit.wikimedia.org/r/954952

Change 954952 merged by jenkins-bot:

[mediawiki/core@master] ExternalLinks: Drop migration code

https://gerrit.wikimedia.org/r/954952

Change 826371 abandoned by Umherirrender:

[mediawiki/core@master] [POC] How the new externallinks would look like

Reason:

Task is resolved, work already done

https://gerrit.wikimedia.org/r/826371