Once T299417: Normalize templatelinks table is done.
Description
Details
Status | Subtype | Assigned | Task | ||
---|---|---|---|---|---|
Open | None | T300222 Implement normalizing MediaWiki link tables | |||
Open | None | T299947 Normalize pagelinks table | |||
Resolved | Ladsgroup | T342689 Add pl_target_id column to pagelinks | |||
Resolved | ABran-WMF | T343198 Add pl_target_id column to pagelinks in production | |||
Resolved | Ladsgroup | T345669 Add support for write both for pagelinks | |||
Resolved | Ladsgroup | T345732 Turn on write both for beta and production | |||
Open | Ladsgroup | T345733 Run MigrateLinksTable for pagelinks in beta and production | |||
Resolved | JAllemandou | T345771 Adapt Sqoop to pagelinks schema change | |||
Resolved | Ladsgroup | T346457 Add support for read new for pagelinks | |||
Resolved | SBisson | T346784 Add support for read new for pagelinks in Wikistories | |||
Open | Ladsgroup | T351237 Set beta and production to read new for pagelinks migration | |||
Open | Ladsgroup | T352010 Gradually drop old pagelinks columns | |||
Open | None | T364069 Rebuild pagelinks tables | |||
Declined | Marostegui | T364067 Switchover s6 master (db1173 -> db1231) | |||
Resolved | Marostegui | T363977 Switchover s8 master (db2165 -> db2161) |
Event Timeline
Wikistories queries the pagelinks table directly to surface stories associated with articles. I'll keep an eye on this task so we can migrate our code when the time comes.
FYI, T312666: Remove duplication in externallinks table is wrapping up. I will be starting this next quarter. Thankfully given that LinksMigration class and the whole infra of linktarget exist, it should be less work to migrate mediawiki bits.
@Ladsgroup if we have to do significant migrations on that entire table.. have we considered T14019: ifexist function uses pagelinks table in lieu of better options which also scores high in the wishlists every year ?
Yeah, I was contacted by CommTech about it. I wrote my ideas on T14019#7665054. I don't think it's doable in part of pagelinks work, it really needs its own table. It can use the linktarget from the get go but it's a different project altogether. It just needs dedicated resources.
Change 966213 had a related patch set uploaded (by Ladsgroup; author: Amir Sarabadani):
[operations/puppet@production] wikireplicas: Allow pagelinks.pl_target_id to be replicated to the cloud
Change 966213 merged by Ladsgroup:
[operations/puppet@production] wikireplicas: Allow pagelinks.pl_target_id to be replicated to the cloud
Hey @Ladsgroup , we're using that table in the image suggestions production data pipeline, see this query.
While I'll try to monitor this ticket and mailing lists announcements, it would be great if you could ping me in advance before the breaking change rolls out, so that we can take action.
You could also leave a comment in T350007: [M] Adapt image suggestions to comply with breaking database schema changes if you prefer, thanks!
Before dropping columns, please notify WP:VPT at enwp and similar pages on other projects. Dropping these columns will break a lot of regular and ad hoc queries using Quarry, {{database report}}, etc. Identifying and modifying them will be a major exercise.
I will but noting that this has been already announced multiple times and in multiple venues: https://lists.wikimedia.org/hyperkitty/list/cloud@lists.wikimedia.org/thread/KWWEWNFVXDXOHZBJO32NQLBKC5LUTEAB/
We deployed https://gerrit.wikimedia.org/r/c/operations/puppet/+/966213 today with a maintain-views run on all wikireplica servers.
Change 977218 had a related patch set uploaded (by Ladsgroup; author: Amir Sarabadani):
[operations/mediawiki-config@master] beta: Stop writing to the old columns of pagelinks in fawiki
Change 977218 merged by jenkins-bot:
[operations/mediawiki-config@master] beta: Stop writing to the old columns of pagelinks in fawiki
I changed PK of pagelinks in fawiki in beta cluster:
ALTER TABLE pagelinks DROP PRIMARY KEY, ADD PRIMARY KEY (`pl_from`, `pl_target_id`);
Then, stopped writing the old columns and then dropped them and indexes:
ALTER TABLE pagelinks DROP COLUMN pl_namespace, DROP COLUMN pl_title; DROP INDEX pl_namespace ON pagelinks; DROP INDEX pl_backlinks_namespace ON pagelinks;
Size of the table went from 9.1M to 448K but the catch here is that majority of size redaction came from running optimize when changing the PK, making it go to 996K, so the net gain here is cutting size of the table to around 45% which is pretty good regardless.
Running it on simplewiki gave 38% of the original size. On top, you'd get a clean up due to running optimize table which can in itself cut the size of the table from half to one fifth depending on the wiki.
Fully understand the reasoning behind this normalization. But (unexpectedly) this will result in some queries to run substantially slower. For instance:
SELECT page_title, pl_title, CASE WHEN c2.cl_to IS NULL THEN NULL ELSE '{{yes}}' END AS under_construction FROM page JOIN pagelinks ON page_id = pl_from LEFT JOIN categorylinks c2 ON page_id = c2.cl_from AND c2.cl_to = 'صفحههای_گسترده_در_دست_ساخت' WHERE page_namespace = 0 AND pl_namespace IN (2, 3) AND NOT EXISTS ( SELECT c1.cl_to FROM categorylinks c1 WHERE page_id = c1.cl_from AND c1.cl_to = 'مقالههای_نامزد_حذف_سریع' )
versus"
SELECT page_title, lt_title, CASE WHEN c2.cl_to IS NULL THEN NULL ELSE '{{yes}}' END AS under_construction FROM page JOIN pagelinks ON page_id = pl_from JOIN linktarget ON pl_target_id = lt_id LEFT JOIN categorylinks c2 ON page_id = c2.cl_from AND c2.cl_to = 'صفحههای_گسترده_در_دست_ساخت' WHERE page_namespace = 0 AND lt_namespace IN (2, 3) AND NOT EXISTS ( SELECT c1.cl_to FROM categorylinks c1 WHERE page_id = c1.cl_from AND c1.cl_to = 'مقالههای_نامزد_حذف_سریع' )
It would be nice if we had analytics on the queries run against pagelinks table (both in production and in replicas) and what the average or median query duration looks like over time.
In looking at the EXPLAIN results for those two queries, the first one runs more quickly not only because of denormalization, but also because the WHERE condition on pl_namespace uses the pl_namespace index on the pagelinks table. However, it seems like the secondary does *not* use the lt_namespace_title [[index]] on the linktargets table (where I expected it would be used as a partial index). Any ideas on how to improve my query or the indexes?
Maybe this is what you're looking for:
SELECT page_title, lt_title, CASE WHEN c2.cl_to IS NULL THEN NULL ELSE '{{yes}}' END AS under_construction FROM page JOIN pagelinks ON page_id = pl_from JOIN linktarget ON pl_target_id = lt_id LEFT JOIN categorylinks c2 ON page_id = c2.cl_from WHERE c2.cl_to = 'صفحههای_گسترده_در_دست_ساخت' AND page_namespace = 0 AND lt_namespace IN (2, 3) AND page_id NOT IN ( SELECT c1.cl_from FROM categorylinks c1 WHERE c1.cl_to = 'مقالههای_نامزد_حذف_سریع' )
(if I understood your query correctly, you want articles that are in daste sakht but not naamzade hazfe sari?)
No. The query finds all articles linking to a User or User_talk page, and distinguishes which has an "under construction" category. So the last part is optional, not mandatory. Your suggested query makes it mandatory.
But thanks for taking a look.
ah I see, unfortunately I don't think there is a way to make the query instantly fast as the knowledge that has the highest cardinality (pl from ns + pltarget ns) now is literally sitting in two different table and none of the fields have enough cardinality on their own to make query in the respective tables faster. I think it'll get a bit worse once the data is properly populated in linktarget (and more links to user and user talk ns start being added to linktarget) but OTOH, dropping the old columns makes all queries faster by increasing memory lookups and this took 48s for me:
SELECT page_title, lt_title, CASE WHEN page_id in (SELECT cl_from from categorylinks WHERE cl_to = 'صفحههای_گسترده_در_دست_ساخت') THEN '{{yes}}' ELSE NULL END AS under_construction FROM page JOIN pagelinks ON page_id = pl_from JOIN linktarget ON pl_target_id = lt_id WHERE page_namespace = 0 AND pl_from_namespace = 0 AND lt_namespace IN (2, 3) AND page_id NOT IN ( SELECT c1.cl_from FROM categorylinks c1 WHERE c1.cl_to = 'مقالههای_نامزد_حذف_سریع' )
(adding pl_from_namespace = 0 helps a bit)
@Huji Hi, the data has been fully populated for fawiki. If you want check and switch your tools and queries, it should be fine to do so now.
Change #1024450 had a related patch set uploaded (by Ladsgroup; author: Amir Sarabadani):
[mediawiki/core@master] Run pagelinks migration for third parties too
Change #1024450 merged by jenkins-bot:
[mediawiki/core@master] Run pagelinks migration for third parties too
Does "everywhere" include Manual:Pagelinks table, which has yet to be updated this year and doesn't indicate that any columns are deprecated? Nor document the new way?
I migrated all my code. In most cases, queries are running just a tad slower. In a few cases, they seem to run a tad faster. In all cases, my tests showed the results were identical.
Change #1028778 had a related patch set uploaded (by Ladsgroup; author: Amir Sarabadani):
[operations/mediawiki-config@master] Stop writing to old columns of pagelinks in most wikis
Change #1028778 merged by jenkins-bot:
[operations/mediawiki-config@master] Stop writing to old columns of pagelinks in most wikis
Mentioned in SAL (#wikimedia-operations) [2024-05-07T15:32:32Z] <ladsgroup@deploy1002> Started scap: Backport for [[gerrit:1028778|Stop writing to old columns of pagelinks in most wikis (T352010 T299947)]]
Mentioned in SAL (#wikimedia-operations) [2024-05-07T15:38:11Z] <ladsgroup@deploy1002> ladsgroup: Backport for [[gerrit:1028778|Stop writing to old columns of pagelinks in most wikis (T352010 T299947)]] synced to the testservers (https://wikitech.wikimedia.org/wiki/Mwdebug)
Mentioned in SAL (#wikimedia-operations) [2024-05-07T16:05:02Z] <ladsgroup@deploy1002> Finished scap: Backport for [[gerrit:1028778|Stop writing to old columns of pagelinks in most wikis (T352010 T299947)]] (duration: 32m 29s)
Something along the lines of:
Two columns of pagelinks table (pl_namespace, pl_title) are being dropped soon. Users must use pl_target_id instead (foreign key to linktarget table). See T222224 for reasoning.