This table has grown to a pretty large size in production. It's one of the biggest tables of commons with 200GB and will cause more issues in the future.## Background
Unlike other links table,MediaWiki's link database tables are among the largest tables of any WMF production database. it can't be normalized with linktarget (T299417) but it basically repeats each external link multiple times.
For exampleOptimizing the way we store and retreive this would generally see benefits such as:
* lower diskspace per database: slower growth, less hardware in the long run (co-locate more wikis together), this is a typical row:and ensures each individual replica host can continue to fit and perform well with a full core database of a given wiki.
* faster backup and recovery,
* lower replication lag (smaller queries, smaller binlogs).
This is a continuation in spirit of {T222224} and {T300222}, where a similar optimization and major win was made to the `templatelinks` table (with upcoming work for `pagelinks`, `imagelinks` and `categorylinks`).
## externalinks
This table has grown a lot in production. It's one of the biggest tables for Commons, at 200GB, and will cause more issues in the future.
Unlike other links table, this table can't be normalized using the `linktarget` feature (T299417). We currently store each external link upto 8 times as part of a single row (and its indexes).
For example, this is a typical row today:
```
*************************** 4. row ***************************
el_id: 9
el_from: 379228
el_to: http://www.iau.org/public_press/themes/naming/#minorplanets
el_index: http://org.iau.www./public_press/themes/naming/#minorplanets
el_index_60: http://org.iau.www./public_press/themes/naming/#minorplanets
```
AnOne external link is being repeatedthree ways in three times and those largcolumns of a given row, and those columns are beingin turn are used in **five** different indexes.
The proposal would be something like this.
**Option A: Create new table**
Create a new table holding the domain names (call it, `externaldomain`?) and it would hold the domain and its inverse.
**Option A**
Something like:
```
ed_id: 1234
ed_domain: http://www.iau.org
ed_index: http://org.iau.www.
```
And t
The row externallinks would turn into:be reduced to
```
*************************** 4. row ***************************
el_id: 9
el_from: 379228
el_to_domain: 1234 # Key to externaldomain.ed_id
el_to_path: public_press/themes/naming/#minorplanets
```
**Option B**:**
Avoid creating a new table: by storing the domain inline, using only the index variant. We go from 8 to 3 copies (1 in the columns, plus two large indexes).
```
*************************** 4. row ***************************
el_id: 9
el_from: 379228
el_to_domain_index: http://org.iau.www.
el_to_path: /public_press/themes/naming/#minorplanets
```
Change the indexes:
```lang=diff
- INDEX el_from (el_from, el_to(40)),
- INDEX el_from_index_60 (el_from, el_index_60, el_id),
+ INDEX el_page_domain_id (el_from, el_to_domain_index, el_id),
- INDEX el_to (el_to(60), el_from),
+ INDEX el_domain_page (el_to_domain_index, el_from),
- INDEX el_index (el_index(60)),
- INDEX el_index_60 (el_index_60, el_id),
```
**Option B-2** (chosen)
Further minimised the index. Going from 8 to 2 copies (1 in the columns, and only 1 large index).
```lang=yaml
el_id: 9
el_from: 379228
el_to_domain_index: http://org.iau.www.
el_to_path: /public_press/themes/naming/#minorplanets
```
Change the indexes:
```lang=diff
- INDEX el_from (el_from, el_to(40)),
- INDEX el_from_index_60 (el_from, el_index_60, el_id),
+ INDEX el_from (el_from, el_id),
- INDEX el_to (el_to(60), el_from),
+ INDEX el_to (el_to_domain_index, el_from),
- INDEX el_index (el_index(60)),
- INDEX el_index_60 (el_index_60, el_id),
```
Prior art: {T59176}