Page MenuHomePhabricator

Normalize templatelinks table
Closed, ResolvedPublic

Description

After T299416: Normalize link tables: Create linktarget table is done. To learn how to update your queries, see T299417#7814637.

Related Objects

StatusSubtypeAssignedTask
OpenNone
ResolvedLadsgroup
ResolvedLadsgroup
ResolvedLadsgroup
ResolvedLadsgroup
ResolvedLadsgroup
ResolvedLadsgroup
ResolvedLadsgroup
Resolved Marostegui
Resolved Marostegui
Resolved Marostegui
Resolved Marostegui
Resolved Marostegui
ResolvedLadsgroup
ResolvedLadsgroup
ResolvedLadsgroup
ResolvedLadsgroup
ResolvedLadsgroup
ResolvedLadsgroup
Resolved Marostegui
ResolvedLadsgroup
ResolvedLadsgroup
ResolvedLadsgroup

Event Timeline

There are a very large number of changes, so older changes are hidden. Show Older Changes

Steps that don't have a subticket yet:

  • Turn on read new in beta cluster
  • Turn on read new in production
  • Schema change to make tl_namespace and tl_title nullable
  • Deploying the schema change
  • Write code stopping to write on the old columns via config
  • Turn off writing to the old columns in beta and then production
  • Schema change to drop old columns
  • Deploying the schema change

Could you ellaborate, please, how will the new scheme interpret the red templates transclusion, so that quarry could recognize them?

I'm not 100% sure I understood your usecase but if a query to templatelinks used to look like this:

SELECT tl_from, tl_namespace, tl_title from templatelinks where tl_from = 5974403;

It will change to:

SELECT tl_from, lt_namespace, lt_title from templatelinks join linktarget on lt_target_id = lt_id where tl_from = 5974403;

It seems the new table has not been replicated to the cloud replicas (I'll check and fix it soon).

If this doesn't answer your question, can you elaborate more? possibly giving the query you run usually.

I'm not 100% sure I understood your usecase but if a query to templatelinks used to look like this:

SELECT tl_from, tl_namespace, tl_title from templatelinks where tl_from = 5974403;

It will change to:

SELECT tl_from, lt_namespace, lt_title from templatelinks join linktarget on lt_target_id = lt_id where tl_from = 5974403;

It seems the new table has not been replicated to the cloud replicas (I'll check and fix it soon).

If this doesn't answer your question, can you elaborate more? possibly giving the query you run usually.

Thanks. It doesn't. What happens when the template does not exist?
(For example, here.)

Nothing will change when it comes to existence of a template.
For example, you have this query:

select page_namespace, page_title, tl_title from page inner join templatelinks where page_namespace in (0, 4, 6, 8, 10, 12, 14,
100) and page_id = tl_from and tl_namespace != 8 and not ((page_namespace = 100 and (tl_namespace <> 10 or tl_title like "הידעת_מדינות_%"))
or (page_namespace = 8 and page_title like "%.js") or (page_namespace = 4 and (page_title like "%OTRS%" or page_title like "%בקשות_לבדיקה%"
or page_title like "%דלפק_ייעוץ%" or page_title like "%הכה_את_המומחה%" or page_title like "%זכויות_יוצרים%" or page_title like
"%יומן_מחיקות%" or page_title like "%סדנה_לגרפיקה%"  or (page_title like "%המלצות_קודמות%" and not page_title in
("ערכים_מומלצים/המלצות_קודמות/אפריל_2016", "ערכים_מומלצים/המלצות_קודמות/מרץ_2016", "ערכים_מומלצים/המלצות_קודמות/מאי_2016")) or
page_title like "%אולם_דיונים%" or page_title like "%עבודות_ויקידמיות%" or page_title like "%ארכיון%" or page_title like
"%סקר_ויקיפדיה%" or page_title like "%איך_להקים_אסם%" or page_title like "%מיזמי_ויקיפדיה/נבחרי_ציבור_פוטנציאליים%" or page_title like
"%הומור%" or page_title like "%רשימת%" or page_title like "%חשבון_נפש%" or page_title like "%סופשבוע_ללא_ערכים%" or page_title like
"%ויקימניה%" or page_title like "%תחרות_צילום%" or page_title like "%מתקפת_איכות%" or page_title like "%מזנון%" or page_title like
"%מועדונים%" or page_title like "%מפגשים%" or page_title like "%תבנית%" or page_title like "%שער_לילדים%" or page_title like
"%שיתופי_פעולה%" or page_title like "%מיזמי_ויקיפדיה%" or page_title like "%בוט_החלפות%" or page_title like "הידעת?/סדרה__/%" or
page_title in ("הידעת?/2018/אוגוסט", "חדשות", "הידעת?/2018/ספטמבר", "הידעת?/ממרץ_2011", "ארגז_חול", "הידעת?", "כיכר_העיר", 
"הידעת?/המתנה") or page_title = "ויקימדיה_ישראל/תמונה_נבחרת/גלריה" and page_namespace = 4 and tl_namespace = 4 and tl_title like
"ויקימדיה_ישראל/תמונה_נבחרת/%")) or exists (select * from page where page_namespace = tl_namespace and page_title = tl_title))

In order to make the change easier, move the join condition to ON clause instead of WHERE:

select page_namespace, page_title, tl_title from page inner join templatelinks ON page_id = tl_from where page_namespace in (0, 4, 6, 8, 10, 12, 14,
100) and tl_namespace != 8 and not ((page_namespace = 100 and (tl_namespace <> 10 or tl_title like "הידעת_מדינות_%"))
or (page_namespace = 8 and page_title like "%.js") or (page_namespace = 4 and (page_title like "%OTRS%" or page_title like "%בקשות_לבדיקה%"
or page_title like "%דלפק_ייעוץ%" or page_title like "%הכה_את_המומחה%" or page_title like "%זכויות_יוצרים%" or page_title like
"%יומן_מחיקות%" or page_title like "%סדנה_לגרפיקה%"  or (page_title like "%המלצות_קודמות%" and not page_title in
("ערכים_מומלצים/המלצות_קודמות/אפריל_2016", "ערכים_מומלצים/המלצות_קודמות/מרץ_2016", "ערכים_מומלצים/המלצות_קודמות/מאי_2016")) or
page_title like "%אולם_דיונים%" or page_title like "%עבודות_ויקידמיות%" or page_title like "%ארכיון%" or page_title like
"%סקר_ויקיפדיה%" or page_title like "%איך_להקים_אסם%" or page_title like "%מיזמי_ויקיפדיה/נבחרי_ציבור_פוטנציאליים%" or page_title like
"%הומור%" or page_title like "%רשימת%" or page_title like "%חשבון_נפש%" or page_title like "%סופשבוע_ללא_ערכים%" or page_title like
"%ויקימניה%" or page_title like "%תחרות_צילום%" or page_title like "%מתקפת_איכות%" or page_title like "%מזנון%" or page_title like
"%מועדונים%" or page_title like "%מפגשים%" or page_title like "%תבנית%" or page_title like "%שער_לילדים%" or page_title like
"%שיתופי_פעולה%" or page_title like "%מיזמי_ויקיפדיה%" or page_title like "%בוט_החלפות%" or page_title like "הידעת?/סדרה__/%" or
page_title in ("הידעת?/2018/אוגוסט", "חדשות", "הידעת?/2018/ספטמבר", "הידעת?/ממרץ_2011", "ארגז_חול", "הידעת?", "כיכר_העיר", 
"הידעת?/המתנה") or page_title = "ויקימדיה_ישראל/תמונה_נבחרת/גלריה" and page_namespace = 4 and tl_namespace = 4 and tl_title like
"ויקימדיה_ישראל/תמונה_נבחרת/%")) or exists (select * from page where page_namespace = tl_namespace and page_title = tl_title))

Note that actually the last part of your query (the subquery) is wrong as it doesn't add templatelinks.

And then change things:

select page_namespace, page_title, lt_title from page inner join templatelinks ON page_id = tl_from join linktarget on tl_target_id = lt_id where page_namespace in (0, 4, 6, 8, 10, 12, 14,
100) and lt_namespace != 8 and not ((page_namespace = 100 and (lt_namespace <> 10 or lt_title like "הידעת_מדינות_%"))
or (page_namespace = 8 and page_title like "%.js") or (page_namespace = 4 and (page_title like "%OTRS%" or page_title like "%בקשות_לבדיקה%"
or page_title like "%דלפק_ייעוץ%" or page_title like "%הכה_את_המומחה%" or page_title like "%זכויות_יוצרים%" or page_title like
"%יומן_מחיקות%" or page_title like "%סדנה_לגרפיקה%"  or (page_title like "%המלצות_קודמות%" and not page_title in
("ערכים_מומלצים/המלצות_קודמות/אפריל_2016", "ערכים_מומלצים/המלצות_קודמות/מרץ_2016", "ערכים_מומלצים/המלצות_קודמות/מאי_2016")) or
page_title like "%אולם_דיונים%" or page_title like "%עבודות_ויקידמיות%" or page_title like "%ארכיון%" or page_title like
"%סקר_ויקיפדיה%" or page_title like "%איך_להקים_אסם%" or page_title like "%מיזמי_ויקיפדיה/נבחרי_ציבור_פוטנציאליים%" or page_title like
"%הומור%" or page_title like "%רשימת%" or page_title like "%חשבון_נפש%" or page_title like "%סופשבוע_ללא_ערכים%" or page_title like
"%ויקימניה%" or page_title like "%תחרות_צילום%" or page_title like "%מתקפת_איכות%" or page_title like "%מזנון%" or page_title like
"%מועדונים%" or page_title like "%מפגשים%" or page_title like "%תבנית%" or page_title like "%שער_לילדים%" or page_title like
"%שיתופי_פעולה%" or page_title like "%מיזמי_ויקיפדיה%" or page_title like "%בוט_החלפות%" or page_title like "הידעת?/סדרה__/%" or
page_title in ("הידעת?/2018/אוגוסט", "חדשות", "הידעת?/2018/ספטמבר", "הידעת?/ממרץ_2011", "ארגז_חול", "הידעת?", "כיכר_העיר", 
"הידעת?/המתנה") or page_title = "ויקימדיה_ישראל/תמונה_נבחרת/גלריה" and page_namespace = 4 and lt_namespace = 4 and lt_title like
"ויקימדיה_ישראל/תמונה_נבחרת/%")) or exists (select * from page join linktarget where page_namespace = lt_namespace and page_title = lt_title))

How can it work if linktarget does not have any data about non-existing templates? What will be the id in templatelinks and where will be the tempkate name stored?

How can it work if linktarget does not have any data about non-existing templates?

That's not true. It doesn't care if the page exist or not and store it anyway, it can exist or not. That's exactly one of two reasons we created a new table instead of just using page_id.

I dropped the column (and indexes) in fawiki in beta cluster:

root@deployment-db08:/srv/sqldata/fawiki# ls -Ssh | grep -i templatelinks
9.1M templatelinks.ibd
4.0K templatelinks.frm
root@deployment-db08:/srv/sqldata/fawiki# ls -Ssh | grep -i templatelinks
528K templatelinks.ibd
4.0K templatelinks.frm

5.7% of its original size ^^

FWIW, linktarget is 180K:

root@deployment-db08:/srv/sqldata/fawiki# ls -Ssh | grep -i linktarget
180K linktarget.ibd
4.0K linktarget.frm

Removed it in all of beta cluster now which removed 1GB from beta cluster:

ladsgroup@deployment-db08:~$ df -h
Filesystem                          Size  Used Avail Use% Mounted on
udev                                7.9G     0  7.9G   0% /dev
tmpfs                               1.6G  177M  1.4G  12% /run
/dev/sda2                            19G  4.7G   13G  27% /
tmpfs                               7.9G     0  7.9G   0% /dev/shm
tmpfs                               5.0M     0  5.0M   0% /run/lock
tmpfs                               7.9G     0  7.9G   0% /sys/fs/cgroup
/dev/mapper/vd-second--local--disk  111G   37G   68G  36% /srv
tmpfs                               1.6G     0  1.6G   0% /run/user/0
tmpfs                               1.6G     0  1.6G   0% /run/user/3182
ladsgroup@deployment-db08:~$ df -h
Filesystem                          Size  Used Avail Use% Mounted on
udev                                7.9G     0  7.9G   0% /dev
tmpfs                               1.6G  177M  1.4G  12% /run
/dev/sda2                            19G  4.7G   13G  27% /
tmpfs                               7.9G     0  7.9G   0% /dev/shm
tmpfs                               5.0M     0  5.0M   0% /run/lock
tmpfs                               7.9G     0  7.9G   0% /sys/fs/cgroup
/dev/mapper/vd-second--local--disk  111G   37G   69G  35% /srv
tmpfs                               1.6G     0  1.6G   0% /run/user/0
tmpfs                               1.6G     0  1.6G   0% /run/user/3182

It is now dropped from testwiki in production and it went from 60MB to 20MB there. Given its number of rows and with linear extrapolation we get ~120GB for commons and cebwiki (they are currently on 500GB and 300GB respectively), this means just fixing cebwiki and commonswiki would free 600GB. If you add some other similar wikis like arzwiki, you'd get around a TB of clean up, probably more.

Dropping on s5 now. Started with dbstore1003:3315 and will continue next week (letting it stay there for a while in case writes happen to the old fields).

Impact report:
Before the second schema change, largest wikis of s5:

217G cebwiki/templatelinks.ibd
8.5G dewiki/templatelinks.ibd
6.6G srwiki/templatelinks.ibd
1.9G shwiki/templatelinks.ibd
1.5G mgwiktionary/templatelinks.ibd

In total: 235.5GB

After:

 95G cebwiki/templatelinks.ibd
3.0G dewiki/templatelinks.ibd
2.5G srwiki/templatelinks.ibd
797M shwiki/templatelinks.ibd
729M mgwiktionary/templatelinks.ibd

In total: 102.0GB, 43% of the original size. Note that this is the second schema change, the first one runs optimization on it which already reduced the total size of the db by 13% and 7% in s5 and s4 respectively. The second has done a whooping 21% redaction in size (from 620GB to 490GB) on top of the previous redaction and will definitely trigger an alert in backups.

It has such an impact that dbstore1003 while holding three sections (and we just dropped this on one section only) has reached its lowest storage utilization in the past year meaning all the growth in size in three sections (s1, s5 and s7) over a year has been already offset by this change in s5.

Can you estimate please, when templatelinks column tl_namespace is available again?

Can you estimate please, when templatelinks column tl_namespace is available again?

Never. The column is being removed.

Can you estimate please, when templatelinks column tl_namespace is available again?

It won't be. You need to join your query with linktarget.

e.g.

select tl_from, tl_namespace, tl_title from templatelinks where ....;

Must become

select tl_from, lt_namespace, lt_title from templatelinks join linktarget on tl_target_id = lt_id where ....;

HTH

Thank, but it's a little bit confusing:

So I have to change the query:

SELECT page_title, page_namespace FROM page, templatelinks
WHERE tl_from = page_id
  AND page_namespace IN (0,100)
  AND tl_from_namespace IN (0,100)
  AND tl_namespace = 10
  AND tl_title = 'Review';

But how to do?

Am Fr., 19. Aug. 2022 um 08:42 Uhr schrieb Ladsgroup <
no-reply@phabricator.wikimedia.org>:

Ladsgroup added a comment. View Task
https://phabricator.wikimedia.org/T299417

In T299417#8168022 https://phabricator.wikimedia.org/T299417#8168022,
@doctaxon https://phabricator.wikimedia.org/p/doctaxon/ wrote:

Can you estimate please, when templatelinks column tl_namespace is
available again?

It won't be. You need to join your query with linktarget.

e.g.

select tl_from, tl_namespace, tl_title from templatelinks where ....;

Must become

select tl_from, lt_namespace, lt_title from templatelinks join linktarget on tl_target_id = lt_id where ....;

HTH

*TASK DETAIL*
https://phabricator.wikimedia.org/T299417

*EMAIL PREFERENCES*
https://phabricator.wikimedia.org/settings/panel/emailpreferences/

*To: *Ladsgroup
*Cc: *doctaxon, Marostegui, Universal_Omega, Zabe, Mormegil, Superyetkin,
MusikAnimal, IKhitron, AntiCompositeNumber, RhinosF1, ArielGlenn, Izno,
Aklapper, Ladsgroup, Devnull, LSobanski, Hazizibinmahdi, Iflorez,
Vali.matei, Minhnv-2809, Jay8g, Krenair

This query:

SELECT page_title, page_namespace FROM page, templatelinks
WHERE tl_from = page_id

AND page_namespace IN (0,100)
AND tl_from_namespace IN (0,100)
AND tl_namespace = 10
AND tl_title = 'Review';

Must become this:

SELECT page_title, page_namespace FROM page join templatelinks
on tl_from = page_id join linktarget on tl_target_id = lt_id

AND page_namespace IN (0,100)
AND tl_from_namespace IN (0,100)
AND lt_namespace = 10
AND lt_title = 'Review';

Tested and works fine.

checked, thank you very much

and thanks for your great database changing works. It's very helpful.

Change 830114 had a related patch set uploaded (by Ladsgroup; author: Amir Sarabadani):

[mediawiki/core@master] schema: Drop tl_title and tl_namespace fields from templatelinks

https://gerrit.wikimedia.org/r/830114

Change 830219 had a related patch set uploaded (by Ladsgroup; author: Amir Sarabadani):

[mediawiki/core@REL1_39] schema: Drop tl_title and tl_namespace fields from templatelinks

https://gerrit.wikimedia.org/r/830219

Change 830114 merged by jenkins-bot:

[mediawiki/core@master] schema: Drop tl_title and tl_namespace fields from templatelinks

https://gerrit.wikimedia.org/r/830114

Change 830219 abandoned by Ladsgroup:

[mediawiki/core@REL1_39] schema: Drop tl_title and tl_namespace fields from templatelinks

Reason:

let's try again.

https://gerrit.wikimedia.org/r/830219

Change 830219 restored by Ladsgroup:

[mediawiki/core@REL1_39] schema: Drop tl_title and tl_namespace fields from templatelinks

https://gerrit.wikimedia.org/r/830219

Change 830636 had a related patch set uploaded (by Ladsgroup; author: Amir Sarabadani):

[mediawiki/core@master] Introduce pruneUnusedLinkTargetRows maint script

https://gerrit.wikimedia.org/r/830636

Change 830219 merged by Ladsgroup:

[mediawiki/core@REL1_39] schema: Drop tl_title and tl_namespace fields from templatelinks

https://gerrit.wikimedia.org/r/830219

Change 832157 had a related patch set uploaded (by Ladsgroup; author: Amir Sarabadani):

[operations/mediawiki-config@master] Stop writing to the old templatelinks columns of enwiki

https://gerrit.wikimedia.org/r/832157

Any SQL query involving the templatelinks table (e.g. select *) on eswiktionary and plwiktionary (group1) now returns:

ERROR 1356 (HY000): View 'eswiktionary_p.templatelinks' references invalid table(s) or column(s) or function(s) or definer/invoker of view lack rights to use them

Yeah, I'm trying to run the maintain views there but there are queries that are stuck. I need to depool them which takes a bit.

Change 830636 merged by jenkins-bot:

[mediawiki/core@master] Introduce pruneUnusedLinkTargetRows maint script

https://gerrit.wikimedia.org/r/830636

Change 832563 had a related patch set uploaded (by Jforrester; author: Amir Sarabadani):

[mediawiki/core@REL1_39] Introduce pruneUnusedLinkTargetRows maint script

https://gerrit.wikimedia.org/r/832563

Change 832563 merged by Jforrester:

[mediawiki/core@REL1_39] Introduce pruneUnusedLinkTargetRows maint script

https://gerrit.wikimedia.org/r/832563

We have been dropping the old columns which now affects users. While this was announced a while ago (plus regular updates afterwards) but let's add a line in tech news about this.

Something like:

Two fields of tl_namespace and tl_title in templatelinks table are now being dropped. Queries relying on these fields need to change to use the new normalization field called tl_target_id. See T299417 for more information. This is part of normalization of links tables [1]

(edit mercilessly)

I also wrote a document explaining why we need to do normalization: https://www.mediawiki.org/wiki/User:ASarabadani_(WMF)/Database_for_devs_toolkit/Concepts/Normalization Maybe that'd be useful.

Is the new linktarget table publicly available somewhere? My bot regularly updates maintenance lists (https://fr.wikipedia.org/wiki/Projet:Mod%C3%A8le/Maintenance/Listes) based on the template dump of frwiki (https://dumps.wikimedia.org/frwiki/latest/frwiki-latest-templatelinks.sql.gz). After this update, I don't see a way to reconstruct the fields that were present before, because I don't know where to find the data to resolve target_ids.

@Orlodrim: see T315063; the table will be dumped, but some required work hasn't been done yet. In the meantime, the table is available in the Toolforge database replicas.

Hi, the change also affected Pi bot's deployment of Wikidata Infoboxes on Commons, help rewriting the query would be appreciated at T320314!