Page MenuHomePhabricator

Normalize templatelinks table
Open, MediumPublic

Related Objects

StatusSubtypeAssignedTask
OpenNone
OpenLadsgroup
ResolvedLadsgroup
ResolvedLadsgroup
ResolvedLadsgroup
ResolvedLadsgroup
ResolvedLadsgroup
OpenLadsgroup
Resolved Marostegui
Resolved Marostegui
Resolved Marostegui
Resolved Marostegui
Resolved Marostegui
ResolvedLadsgroup
ResolvedLadsgroup
OpenLadsgroup
ResolvedLadsgroup
ResolvedLadsgroup
OpenLadsgroup
OpenLadsgroup
OpenLadsgroup
OpenLadsgroup

Event Timeline

Restricted Application added a subscriber: Aklapper. · View Herald Transcript
Ladsgroup triaged this task as Medium priority.Jan 18 2022, 4:08 PM

Steps that don't have a subticket yet:

  • Turn on read new in beta cluster
  • Turn on read new in production
  • Schema change to make tl_namespace and tl_title nullable
  • Deploying the schema change
  • Write code stopping to write on the old columns via config
  • Turn off writing to the old columns in beta and then production
  • Schema change to drop old columns
  • Deploying the schema change

Could you ellaborate, please, how will the new scheme interpret the red templates transclusion, so that quarry could recognize them?

I'm not 100% sure I understood your usecase but if a query to templatelinks used to look like this:

SELECT tl_from, tl_namespace, tl_title from templatelinks where tl_from = 5974403;

It will change to:

SELECT tl_from, lt_namespace, lt_title from templatelinks join linktarget on lt_target_id = lt_id where tl_from = 5974403;

It seems the new table has not been replicated to the cloud replicas (I'll check and fix it soon).

If this doesn't answer your question, can you elaborate more? possibly giving the query you run usually.

I'm not 100% sure I understood your usecase but if a query to templatelinks used to look like this:

SELECT tl_from, tl_namespace, tl_title from templatelinks where tl_from = 5974403;

It will change to:

SELECT tl_from, lt_namespace, lt_title from templatelinks join linktarget on lt_target_id = lt_id where tl_from = 5974403;

It seems the new table has not been replicated to the cloud replicas (I'll check and fix it soon).

If this doesn't answer your question, can you elaborate more? possibly giving the query you run usually.

Thanks. It doesn't. What happens when the template does not exist?
(For example, here.)

Nothing will change when it comes to existence of a template.
For example, you have this query:

select page_namespace, page_title, tl_title from page inner join templatelinks where page_namespace in (0, 4, 6, 8, 10, 12, 14,
100) and page_id = tl_from and tl_namespace != 8 and not ((page_namespace = 100 and (tl_namespace <> 10 or tl_title like "הידעת_מדינות_%"))
or (page_namespace = 8 and page_title like "%.js") or (page_namespace = 4 and (page_title like "%OTRS%" or page_title like "%בקשות_לבדיקה%"
or page_title like "%דלפק_ייעוץ%" or page_title like "%הכה_את_המומחה%" or page_title like "%זכויות_יוצרים%" or page_title like
"%יומן_מחיקות%" or page_title like "%סדנה_לגרפיקה%"  or (page_title like "%המלצות_קודמות%" and not page_title in
("ערכים_מומלצים/המלצות_קודמות/אפריל_2016", "ערכים_מומלצים/המלצות_קודמות/מרץ_2016", "ערכים_מומלצים/המלצות_קודמות/מאי_2016")) or
page_title like "%אולם_דיונים%" or page_title like "%עבודות_ויקידמיות%" or page_title like "%ארכיון%" or page_title like
"%סקר_ויקיפדיה%" or page_title like "%איך_להקים_אסם%" or page_title like "%מיזמי_ויקיפדיה/נבחרי_ציבור_פוטנציאליים%" or page_title like
"%הומור%" or page_title like "%רשימת%" or page_title like "%חשבון_נפש%" or page_title like "%סופשבוע_ללא_ערכים%" or page_title like
"%ויקימניה%" or page_title like "%תחרות_צילום%" or page_title like "%מתקפת_איכות%" or page_title like "%מזנון%" or page_title like
"%מועדונים%" or page_title like "%מפגשים%" or page_title like "%תבנית%" or page_title like "%שער_לילדים%" or page_title like
"%שיתופי_פעולה%" or page_title like "%מיזמי_ויקיפדיה%" or page_title like "%בוט_החלפות%" or page_title like "הידעת?/סדרה__/%" or
page_title in ("הידעת?/2018/אוגוסט", "חדשות", "הידעת?/2018/ספטמבר", "הידעת?/ממרץ_2011", "ארגז_חול", "הידעת?", "כיכר_העיר", 
"הידעת?/המתנה") or page_title = "ויקימדיה_ישראל/תמונה_נבחרת/גלריה" and page_namespace = 4 and tl_namespace = 4 and tl_title like
"ויקימדיה_ישראל/תמונה_נבחרת/%")) or exists (select * from page where page_namespace = tl_namespace and page_title = tl_title))

In order to make the change easier, move the join condition to ON clause instead of WHERE:

select page_namespace, page_title, tl_title from page inner join templatelinks ON page_id = tl_from where page_namespace in (0, 4, 6, 8, 10, 12, 14,
100) and tl_namespace != 8 and not ((page_namespace = 100 and (tl_namespace <> 10 or tl_title like "הידעת_מדינות_%"))
or (page_namespace = 8 and page_title like "%.js") or (page_namespace = 4 and (page_title like "%OTRS%" or page_title like "%בקשות_לבדיקה%"
or page_title like "%דלפק_ייעוץ%" or page_title like "%הכה_את_המומחה%" or page_title like "%זכויות_יוצרים%" or page_title like
"%יומן_מחיקות%" or page_title like "%סדנה_לגרפיקה%"  or (page_title like "%המלצות_קודמות%" and not page_title in
("ערכים_מומלצים/המלצות_קודמות/אפריל_2016", "ערכים_מומלצים/המלצות_קודמות/מרץ_2016", "ערכים_מומלצים/המלצות_קודמות/מאי_2016")) or
page_title like "%אולם_דיונים%" or page_title like "%עבודות_ויקידמיות%" or page_title like "%ארכיון%" or page_title like
"%סקר_ויקיפדיה%" or page_title like "%איך_להקים_אסם%" or page_title like "%מיזמי_ויקיפדיה/נבחרי_ציבור_פוטנציאליים%" or page_title like
"%הומור%" or page_title like "%רשימת%" or page_title like "%חשבון_נפש%" or page_title like "%סופשבוע_ללא_ערכים%" or page_title like
"%ויקימניה%" or page_title like "%תחרות_צילום%" or page_title like "%מתקפת_איכות%" or page_title like "%מזנון%" or page_title like
"%מועדונים%" or page_title like "%מפגשים%" or page_title like "%תבנית%" or page_title like "%שער_לילדים%" or page_title like
"%שיתופי_פעולה%" or page_title like "%מיזמי_ויקיפדיה%" or page_title like "%בוט_החלפות%" or page_title like "הידעת?/סדרה__/%" or
page_title in ("הידעת?/2018/אוגוסט", "חדשות", "הידעת?/2018/ספטמבר", "הידעת?/ממרץ_2011", "ארגז_חול", "הידעת?", "כיכר_העיר", 
"הידעת?/המתנה") or page_title = "ויקימדיה_ישראל/תמונה_נבחרת/גלריה" and page_namespace = 4 and tl_namespace = 4 and tl_title like
"ויקימדיה_ישראל/תמונה_נבחרת/%")) or exists (select * from page where page_namespace = tl_namespace and page_title = tl_title))

Note that actually the last part of your query (the subquery) is wrong as it doesn't add templatelinks.

And then change things:

select page_namespace, page_title, lt_title from page inner join templatelinks ON page_id = tl_from join linktarget on tl_target_id = lt_id where page_namespace in (0, 4, 6, 8, 10, 12, 14,
100) and lt_namespace != 8 and not ((page_namespace = 100 and (lt_namespace <> 10 or lt_title like "הידעת_מדינות_%"))
or (page_namespace = 8 and page_title like "%.js") or (page_namespace = 4 and (page_title like "%OTRS%" or page_title like "%בקשות_לבדיקה%"
or page_title like "%דלפק_ייעוץ%" or page_title like "%הכה_את_המומחה%" or page_title like "%זכויות_יוצרים%" or page_title like
"%יומן_מחיקות%" or page_title like "%סדנה_לגרפיקה%"  or (page_title like "%המלצות_קודמות%" and not page_title in
("ערכים_מומלצים/המלצות_קודמות/אפריל_2016", "ערכים_מומלצים/המלצות_קודמות/מרץ_2016", "ערכים_מומלצים/המלצות_קודמות/מאי_2016")) or
page_title like "%אולם_דיונים%" or page_title like "%עבודות_ויקידמיות%" or page_title like "%ארכיון%" or page_title like
"%סקר_ויקיפדיה%" or page_title like "%איך_להקים_אסם%" or page_title like "%מיזמי_ויקיפדיה/נבחרי_ציבור_פוטנציאליים%" or page_title like
"%הומור%" or page_title like "%רשימת%" or page_title like "%חשבון_נפש%" or page_title like "%סופשבוע_ללא_ערכים%" or page_title like
"%ויקימניה%" or page_title like "%תחרות_צילום%" or page_title like "%מתקפת_איכות%" or page_title like "%מזנון%" or page_title like
"%מועדונים%" or page_title like "%מפגשים%" or page_title like "%תבנית%" or page_title like "%שער_לילדים%" or page_title like
"%שיתופי_פעולה%" or page_title like "%מיזמי_ויקיפדיה%" or page_title like "%בוט_החלפות%" or page_title like "הידעת?/סדרה__/%" or
page_title in ("הידעת?/2018/אוגוסט", "חדשות", "הידעת?/2018/ספטמבר", "הידעת?/ממרץ_2011", "ארגז_חול", "הידעת?", "כיכר_העיר", 
"הידעת?/המתנה") or page_title = "ויקימדיה_ישראל/תמונה_נבחרת/גלריה" and page_namespace = 4 and lt_namespace = 4 and lt_title like
"ויקימדיה_ישראל/תמונה_נבחרת/%")) or exists (select * from page join linktarget where page_namespace = lt_namespace and page_title = lt_title))

How can it work if linktarget does not have any data about non-existing templates? What will be the id in templatelinks and where will be the tempkate name stored?

How can it work if linktarget does not have any data about non-existing templates?

That's not true. It doesn't care if the page exist or not and store it anyway, it can exist or not. That's exactly one of two reasons we created a new table instead of just using page_id.

I dropped the column (and indexes) in fawiki in beta cluster:

root@deployment-db08:/srv/sqldata/fawiki# ls -Ssh | grep -i templatelinks
9.1M templatelinks.ibd
4.0K templatelinks.frm
root@deployment-db08:/srv/sqldata/fawiki# ls -Ssh | grep -i templatelinks
528K templatelinks.ibd
4.0K templatelinks.frm

5.7% of its original size ^^

FWIW, linktarget is 180K:

root@deployment-db08:/srv/sqldata/fawiki# ls -Ssh | grep -i linktarget
180K linktarget.ibd
4.0K linktarget.frm

Removed it in all of beta cluster now which removed 1GB from beta cluster:

ladsgroup@deployment-db08:~$ df -h
Filesystem                          Size  Used Avail Use% Mounted on
udev                                7.9G     0  7.9G   0% /dev
tmpfs                               1.6G  177M  1.4G  12% /run
/dev/sda2                            19G  4.7G   13G  27% /
tmpfs                               7.9G     0  7.9G   0% /dev/shm
tmpfs                               5.0M     0  5.0M   0% /run/lock
tmpfs                               7.9G     0  7.9G   0% /sys/fs/cgroup
/dev/mapper/vd-second--local--disk  111G   37G   68G  36% /srv
tmpfs                               1.6G     0  1.6G   0% /run/user/0
tmpfs                               1.6G     0  1.6G   0% /run/user/3182
ladsgroup@deployment-db08:~$ df -h
Filesystem                          Size  Used Avail Use% Mounted on
udev                                7.9G     0  7.9G   0% /dev
tmpfs                               1.6G  177M  1.4G  12% /run
/dev/sda2                            19G  4.7G   13G  27% /
tmpfs                               7.9G     0  7.9G   0% /dev/shm
tmpfs                               5.0M     0  5.0M   0% /run/lock
tmpfs                               7.9G     0  7.9G   0% /sys/fs/cgroup
/dev/mapper/vd-second--local--disk  111G   37G   69G  35% /srv
tmpfs                               1.6G     0  1.6G   0% /run/user/0
tmpfs                               1.6G     0  1.6G   0% /run/user/3182

It is now dropped from testwiki in production and it went from 60MB to 20MB there. Given its number of rows and with linear extrapolation we get ~120GB for commons and cebwiki (they are currently on 500GB and 300GB respectively), this means just fixing cebwiki and commonswiki would free 600GB. If you add some other similar wikis like arzwiki, you'd get around a TB of clean up, probably more.

Dropping on s5 now. Started with dbstore1003:3315 and will continue next week (letting it stay there for a while in case writes happen to the old fields).

Impact report:
Before the second schema change, largest wikis of s5:

217G cebwiki/templatelinks.ibd
8.5G dewiki/templatelinks.ibd
6.6G srwiki/templatelinks.ibd
1.9G shwiki/templatelinks.ibd
1.5G mgwiktionary/templatelinks.ibd

In total: 235.5GB

After:

 95G cebwiki/templatelinks.ibd
3.0G dewiki/templatelinks.ibd
2.5G srwiki/templatelinks.ibd
797M shwiki/templatelinks.ibd
729M mgwiktionary/templatelinks.ibd

In total: 102.0GB, 43% of the original size. Note that this is the second schema change, the first one runs optimization on it which already reduced the total size of the db by 13% and 7% in s5 and s4 respectively. The second has done a whooping 21% redaction in size (from 620GB to 490GB) on top of the previous redaction and will definitely trigger an alert in backups.

It has such an impact that dbstore1003 while holding three sections (and we just dropped this on one section only) has reached its lowest storage utilization in the past year meaning all the growth in size in three sections (s1, s5 and s7) over a year has been already offset by this change in s5.