After T299416: Normalize link tables: Create linktarget table is done.
Description
Event Timeline
Steps that don't have a subticket yet:
- Turn on read new in beta cluster
- Turn on read new in production
- Schema change to make tl_namespace and tl_title nullable
- Deploying the schema change
- Write code stopping to write on the old columns via config
- Turn off writing to the old columns in beta and then production
- Schema change to drop old columns
- Deploying the schema change
Could you ellaborate, please, how will the new scheme interpret the red templates transclusion, so that quarry could recognize them?
I'm not 100% sure I understood your usecase but if a query to templatelinks used to look like this:
SELECT tl_from, tl_namespace, tl_title from templatelinks where tl_from = 5974403;
It will change to:
SELECT tl_from, lt_namespace, lt_title from templatelinks join linktarget on lt_target_id = lt_id where tl_from = 5974403;
It seems the new table has not been replicated to the cloud replicas (I'll check and fix it soon).
If this doesn't answer your question, can you elaborate more? possibly giving the query you run usually.
Thanks. It doesn't. What happens when the template does not exist?
(For example, here.)
Nothing will change when it comes to existence of a template.
For example, you have this query:
select page_namespace, page_title, tl_title from page inner join templatelinks where page_namespace in (0, 4, 6, 8, 10, 12, 14, 100) and page_id = tl_from and tl_namespace != 8 and not ((page_namespace = 100 and (tl_namespace <> 10 or tl_title like "הידעת_מדינות_%")) or (page_namespace = 8 and page_title like "%.js") or (page_namespace = 4 and (page_title like "%OTRS%" or page_title like "%בקשות_לבדיקה%" or page_title like "%דלפק_ייעוץ%" or page_title like "%הכה_את_המומחה%" or page_title like "%זכויות_יוצרים%" or page_title like "%יומן_מחיקות%" or page_title like "%סדנה_לגרפיקה%" or (page_title like "%המלצות_קודמות%" and not page_title in ("ערכים_מומלצים/המלצות_קודמות/אפריל_2016", "ערכים_מומלצים/המלצות_קודמות/מרץ_2016", "ערכים_מומלצים/המלצות_קודמות/מאי_2016")) or page_title like "%אולם_דיונים%" or page_title like "%עבודות_ויקידמיות%" or page_title like "%ארכיון%" or page_title like "%סקר_ויקיפדיה%" or page_title like "%איך_להקים_אסם%" or page_title like "%מיזמי_ויקיפדיה/נבחרי_ציבור_פוטנציאליים%" or page_title like "%הומור%" or page_title like "%רשימת%" or page_title like "%חשבון_נפש%" or page_title like "%סופשבוע_ללא_ערכים%" or page_title like "%ויקימניה%" or page_title like "%תחרות_צילום%" or page_title like "%מתקפת_איכות%" or page_title like "%מזנון%" or page_title like "%מועדונים%" or page_title like "%מפגשים%" or page_title like "%תבנית%" or page_title like "%שער_לילדים%" or page_title like "%שיתופי_פעולה%" or page_title like "%מיזמי_ויקיפדיה%" or page_title like "%בוט_החלפות%" or page_title like "הידעת?/סדרה__/%" or page_title in ("הידעת?/2018/אוגוסט", "חדשות", "הידעת?/2018/ספטמבר", "הידעת?/ממרץ_2011", "ארגז_חול", "הידעת?", "כיכר_העיר", "הידעת?/המתנה") or page_title = "ויקימדיה_ישראל/תמונה_נבחרת/גלריה" and page_namespace = 4 and tl_namespace = 4 and tl_title like "ויקימדיה_ישראל/תמונה_נבחרת/%")) or exists (select * from page where page_namespace = tl_namespace and page_title = tl_title))
In order to make the change easier, move the join condition to ON clause instead of WHERE:
select page_namespace, page_title, tl_title from page inner join templatelinks ON page_id = tl_from where page_namespace in (0, 4, 6, 8, 10, 12, 14, 100) and tl_namespace != 8 and not ((page_namespace = 100 and (tl_namespace <> 10 or tl_title like "הידעת_מדינות_%")) or (page_namespace = 8 and page_title like "%.js") or (page_namespace = 4 and (page_title like "%OTRS%" or page_title like "%בקשות_לבדיקה%" or page_title like "%דלפק_ייעוץ%" or page_title like "%הכה_את_המומחה%" or page_title like "%זכויות_יוצרים%" or page_title like "%יומן_מחיקות%" or page_title like "%סדנה_לגרפיקה%" or (page_title like "%המלצות_קודמות%" and not page_title in ("ערכים_מומלצים/המלצות_קודמות/אפריל_2016", "ערכים_מומלצים/המלצות_קודמות/מרץ_2016", "ערכים_מומלצים/המלצות_קודמות/מאי_2016")) or page_title like "%אולם_דיונים%" or page_title like "%עבודות_ויקידמיות%" or page_title like "%ארכיון%" or page_title like "%סקר_ויקיפדיה%" or page_title like "%איך_להקים_אסם%" or page_title like "%מיזמי_ויקיפדיה/נבחרי_ציבור_פוטנציאליים%" or page_title like "%הומור%" or page_title like "%רשימת%" or page_title like "%חשבון_נפש%" or page_title like "%סופשבוע_ללא_ערכים%" or page_title like "%ויקימניה%" or page_title like "%תחרות_צילום%" or page_title like "%מתקפת_איכות%" or page_title like "%מזנון%" or page_title like "%מועדונים%" or page_title like "%מפגשים%" or page_title like "%תבנית%" or page_title like "%שער_לילדים%" or page_title like "%שיתופי_פעולה%" or page_title like "%מיזמי_ויקיפדיה%" or page_title like "%בוט_החלפות%" or page_title like "הידעת?/סדרה__/%" or page_title in ("הידעת?/2018/אוגוסט", "חדשות", "הידעת?/2018/ספטמבר", "הידעת?/ממרץ_2011", "ארגז_חול", "הידעת?", "כיכר_העיר", "הידעת?/המתנה") or page_title = "ויקימדיה_ישראל/תמונה_נבחרת/גלריה" and page_namespace = 4 and tl_namespace = 4 and tl_title like "ויקימדיה_ישראל/תמונה_נבחרת/%")) or exists (select * from page where page_namespace = tl_namespace and page_title = tl_title))
Note that actually the last part of your query (the subquery) is wrong as it doesn't add templatelinks.
And then change things:
select page_namespace, page_title, lt_title from page inner join templatelinks ON page_id = tl_from join linktarget on tl_target_id = lt_id where page_namespace in (0, 4, 6, 8, 10, 12, 14, 100) and lt_namespace != 8 and not ((page_namespace = 100 and (lt_namespace <> 10 or lt_title like "הידעת_מדינות_%")) or (page_namespace = 8 and page_title like "%.js") or (page_namespace = 4 and (page_title like "%OTRS%" or page_title like "%בקשות_לבדיקה%" or page_title like "%דלפק_ייעוץ%" or page_title like "%הכה_את_המומחה%" or page_title like "%זכויות_יוצרים%" or page_title like "%יומן_מחיקות%" or page_title like "%סדנה_לגרפיקה%" or (page_title like "%המלצות_קודמות%" and not page_title in ("ערכים_מומלצים/המלצות_קודמות/אפריל_2016", "ערכים_מומלצים/המלצות_קודמות/מרץ_2016", "ערכים_מומלצים/המלצות_קודמות/מאי_2016")) or page_title like "%אולם_דיונים%" or page_title like "%עבודות_ויקידמיות%" or page_title like "%ארכיון%" or page_title like "%סקר_ויקיפדיה%" or page_title like "%איך_להקים_אסם%" or page_title like "%מיזמי_ויקיפדיה/נבחרי_ציבור_פוטנציאליים%" or page_title like "%הומור%" or page_title like "%רשימת%" or page_title like "%חשבון_נפש%" or page_title like "%סופשבוע_ללא_ערכים%" or page_title like "%ויקימניה%" or page_title like "%תחרות_צילום%" or page_title like "%מתקפת_איכות%" or page_title like "%מזנון%" or page_title like "%מועדונים%" or page_title like "%מפגשים%" or page_title like "%תבנית%" or page_title like "%שער_לילדים%" or page_title like "%שיתופי_פעולה%" or page_title like "%מיזמי_ויקיפדיה%" or page_title like "%בוט_החלפות%" or page_title like "הידעת?/סדרה__/%" or page_title in ("הידעת?/2018/אוגוסט", "חדשות", "הידעת?/2018/ספטמבר", "הידעת?/ממרץ_2011", "ארגז_חול", "הידעת?", "כיכר_העיר", "הידעת?/המתנה") or page_title = "ויקימדיה_ישראל/תמונה_נבחרת/גלריה" and page_namespace = 4 and lt_namespace = 4 and lt_title like "ויקימדיה_ישראל/תמונה_נבחרת/%")) or exists (select * from page join linktarget where page_namespace = lt_namespace and page_title = lt_title))
How can it work if linktarget does not have any data about non-existing templates? What will be the id in templatelinks and where will be the tempkate name stored?
How can it work if linktarget does not have any data about non-existing templates?
That's not true. It doesn't care if the page exist or not and store it anyway, it can exist or not. That's exactly one of two reasons we created a new table instead of just using page_id.
I dropped the column (and indexes) in fawiki in beta cluster:
root@deployment-db08:/srv/sqldata/fawiki# ls -Ssh | grep -i templatelinks 9.1M templatelinks.ibd 4.0K templatelinks.frm root@deployment-db08:/srv/sqldata/fawiki# ls -Ssh | grep -i templatelinks 528K templatelinks.ibd 4.0K templatelinks.frm
5.7% of its original size ^^
FWIW, linktarget is 180K:
root@deployment-db08:/srv/sqldata/fawiki# ls -Ssh | grep -i linktarget 180K linktarget.ibd 4.0K linktarget.frm
Removed it in all of beta cluster now which removed 1GB from beta cluster:
ladsgroup@deployment-db08:~$ df -h Filesystem Size Used Avail Use% Mounted on udev 7.9G 0 7.9G 0% /dev tmpfs 1.6G 177M 1.4G 12% /run /dev/sda2 19G 4.7G 13G 27% / tmpfs 7.9G 0 7.9G 0% /dev/shm tmpfs 5.0M 0 5.0M 0% /run/lock tmpfs 7.9G 0 7.9G 0% /sys/fs/cgroup /dev/mapper/vd-second--local--disk 111G 37G 68G 36% /srv tmpfs 1.6G 0 1.6G 0% /run/user/0 tmpfs 1.6G 0 1.6G 0% /run/user/3182 ladsgroup@deployment-db08:~$ df -h Filesystem Size Used Avail Use% Mounted on udev 7.9G 0 7.9G 0% /dev tmpfs 1.6G 177M 1.4G 12% /run /dev/sda2 19G 4.7G 13G 27% / tmpfs 7.9G 0 7.9G 0% /dev/shm tmpfs 5.0M 0 5.0M 0% /run/lock tmpfs 7.9G 0 7.9G 0% /sys/fs/cgroup /dev/mapper/vd-second--local--disk 111G 37G 69G 35% /srv tmpfs 1.6G 0 1.6G 0% /run/user/0 tmpfs 1.6G 0 1.6G 0% /run/user/3182
It is now dropped from testwiki in production and it went from 60MB to 20MB there. Given its number of rows and with linear extrapolation we get ~120GB for commons and cebwiki (they are currently on 500GB and 300GB respectively), this means just fixing cebwiki and commonswiki would free 600GB. If you add some other similar wikis like arzwiki, you'd get around a TB of clean up, probably more.
Dropping on s5 now. Started with dbstore1003:3315 and will continue next week (letting it stay there for a while in case writes happen to the old fields).
Impact report:
Before the second schema change, largest wikis of s5:
217G cebwiki/templatelinks.ibd 8.5G dewiki/templatelinks.ibd 6.6G srwiki/templatelinks.ibd 1.9G shwiki/templatelinks.ibd 1.5G mgwiktionary/templatelinks.ibd
In total: 235.5GB
After:
95G cebwiki/templatelinks.ibd 3.0G dewiki/templatelinks.ibd 2.5G srwiki/templatelinks.ibd 797M shwiki/templatelinks.ibd 729M mgwiktionary/templatelinks.ibd
In total: 102.0GB, 43% of the original size. Note that this is the second schema change, the first one runs optimization on it which already reduced the total size of the db by 13% and 7% in s5 and s4 respectively. The second has done a whooping 21% redaction in size (from 620GB to 490GB) on top of the previous redaction and will definitely trigger an alert in backups.
It has such an impact that dbstore1003 while holding three sections (and we just dropped this on one section only) has reached its lowest storage utilization in the past year meaning all the growth in size in three sections (s1, s5 and s7) over a year has been already offset by this change in s5.