Page MenuHomePhabricator

Normalize pagelinks table
Open, Needs TriagePublic

Related Objects

Event Timeline

Wikistories queries the pagelinks table directly to surface stories associated with articles. I'll keep an eye on this task so we can migrate our code when the time comes.

FYI, T312666: Remove duplication in externallinks table is wrapping up. I will be starting this next quarter. Thankfully given that LinksMigration class and the whole infra of linktarget exist, it should be less work to migrate mediawiki bits.

@Ladsgroup if we have to do significant migrations on that entire table.. have we considered T14019: ifexist function uses pagelinks table in lieu of better options which also scores high in the wishlists every year ?

Yeah, I was contacted by CommTech about it. I wrote my ideas on T14019#7665054. I don't think it's doable in part of pagelinks work, it really needs its own table. It can use the linktarget from the get go but it's a different project altogether. It just needs dedicated resources.

Change 966213 had a related patch set uploaded (by Ladsgroup; author: Amir Sarabadani):

[operations/puppet@production] wikireplicas: Allow pagelinks.pl_target_id to be replicated to the cloud

https://gerrit.wikimedia.org/r/966213

Change 966213 merged by Ladsgroup:

[operations/puppet@production] wikireplicas: Allow pagelinks.pl_target_id to be replicated to the cloud

https://gerrit.wikimedia.org/r/966213

Hey @Ladsgroup , we're using that table in the image suggestions production data pipeline, see this query.
While I'll try to monitor this ticket and mailing lists announcements, it would be great if you could ping me in advance before the breaking change rolls out, so that we can take action.
You could also leave a comment in T350007: [M] Adapt image suggestions to comply with breaking database schema changes if you prefer, thanks!

Sure. Will ping before dropping the columns in the ticket.

Before dropping columns, please notify WP:VPT at enwp and similar pages on other projects. Dropping these columns will break a lot of regular and ad hoc queries using Quarry, {{database report}}, etc. Identifying and modifying them will be a major exercise.

Before dropping columns, please notify WP:VPT at enwp and similar pages on other projects. Dropping these columns will break a lot of regular and ad hoc queries using Quarry, {{database report}}, etc. Identifying and modifying them will be a major exercise.

I will but noting that this has been already announced multiple times and in multiple venues: https://lists.wikimedia.org/hyperkitty/list/cloud@lists.wikimedia.org/thread/KWWEWNFVXDXOHZBJO32NQLBKC5LUTEAB/

Change 977218 had a related patch set uploaded (by Ladsgroup; author: Amir Sarabadani):

[operations/mediawiki-config@master] beta: Stop writing to the old columns of pagelinks in fawiki

https://gerrit.wikimedia.org/r/977218

Change 977218 merged by jenkins-bot:

[operations/mediawiki-config@master] beta: Stop writing to the old columns of pagelinks in fawiki

https://gerrit.wikimedia.org/r/977218

I changed PK of pagelinks in fawiki in beta cluster:

ALTER TABLE pagelinks DROP PRIMARY KEY, ADD PRIMARY KEY (`pl_from`, `pl_target_id`);

Then, stopped writing the old columns and then dropped them and indexes:

ALTER TABLE pagelinks DROP COLUMN pl_namespace, DROP COLUMN pl_title;
DROP INDEX pl_namespace ON pagelinks;
DROP INDEX pl_backlinks_namespace ON pagelinks;

Size of the table went from 9.1M to 448K but the catch here is that majority of size redaction came from running optimize when changing the PK, making it go to 996K, so the net gain here is cutting size of the table to around 45% which is pretty good regardless.

Actually my measurements were incorrect, the impact is bigger let me redo that.

Running it on simplewiki gave 38% of the original size. On top, you'd get a clean up due to running optimize table which can in itself cut the size of the table from half to one fifth depending on the wiki.

Fully understand the reasoning behind this normalization. But (unexpectedly) this will result in some queries to run substantially slower. For instance:

old query (3 seconds)
SELECT
  page_title,
  pl_title,
  CASE
    WHEN c2.cl_to IS NULL THEN NULL
    ELSE '{{yes}}'
  END AS under_construction
FROM page JOIN pagelinks
  ON page_id = pl_from
LEFT JOIN categorylinks c2
  ON page_id = c2.cl_from
  AND c2.cl_to = 'صفحه‌های_گسترده_در_دست_ساخت'
WHERE
  page_namespace = 0
  AND pl_namespace IN (2, 3)
  AND NOT EXISTS (
    SELECT c1.cl_to
    FROM categorylinks c1
    WHERE
      page_id = c1.cl_from
      AND c1.cl_to = 'مقاله‌های_نامزد_حذف_سریع'
  )

versus"

new query (3.5 minutes)
SELECT
  page_title,
  lt_title,
  CASE
    WHEN c2.cl_to IS NULL THEN NULL
    ELSE '{{yes}}'
  END AS under_construction
FROM page
JOIN pagelinks
  ON page_id = pl_from
JOIN linktarget 
  ON pl_target_id = lt_id
LEFT JOIN categorylinks c2
  ON page_id = c2.cl_from
  AND c2.cl_to = 'صفحه‌های_گسترده_در_دست_ساخت'
WHERE
  page_namespace = 0
  AND lt_namespace IN (2, 3)
  AND NOT EXISTS (
    SELECT c1.cl_to
    FROM categorylinks c1
    WHERE
      page_id = c1.cl_from
      AND c1.cl_to = 'مقاله‌های_نامزد_حذف_سریع'
  )

It would be nice if we had analytics on the queries run against pagelinks table (both in production and in replicas) and what the average or median query duration looks like over time.

In looking at the EXPLAIN results for those two queries, the first one runs more quickly not only because of denormalization, but also because the WHERE condition on pl_namespace uses the pl_namespace index on the pagelinks table. However, it seems like the secondary does *not* use the lt_namespace_title [[index]] on the linktargets table (where I expected it would be used as a partial index). Any ideas on how to improve my query or the indexes?

Maybe this is what you're looking for:

SELECT
  page_title,
  lt_title,
  CASE
    WHEN c2.cl_to IS NULL THEN NULL
    ELSE '{{yes}}'
  END AS under_construction
FROM page
JOIN pagelinks
  ON page_id = pl_from
JOIN linktarget 
  ON pl_target_id = lt_id
LEFT JOIN categorylinks c2
  ON page_id = c2.cl_from
WHERE
  c2.cl_to = 'صفحه‌های_گسترده_در_دست_ساخت'
  AND page_namespace = 0
  AND lt_namespace IN (2, 3)
  AND page_id NOT IN (
    SELECT c1.cl_from
    FROM categorylinks c1
    WHERE
      c1.cl_to = 'مقاله‌های_نامزد_حذف_سریع'
  )

(if I understood your query correctly, you want articles that are in daste sakht but not naamzade hazfe sari?)

No. The query finds all articles linking to a User or User_talk page, and distinguishes which has an "under construction" category. So the last part is optional, not mandatory. Your suggested query makes it mandatory.

But thanks for taking a look.

ah I see, unfortunately I don't think there is a way to make the query instantly fast as the knowledge that has the highest cardinality (pl from ns + pltarget ns) now is literally sitting in two different table and none of the fields have enough cardinality on their own to make query in the respective tables faster. I think it'll get a bit worse once the data is properly populated in linktarget (and more links to user and user talk ns start being added to linktarget) but OTOH, dropping the old columns makes all queries faster by increasing memory lookups and this took 48s for me:

SELECT
  page_title,
  lt_title,
  CASE
    WHEN page_id in (SELECT cl_from from categorylinks WHERE cl_to = 'صفحه‌های_گسترده_در_دست_ساخت') THEN '{{yes}}'
    ELSE NULL
  END AS under_construction
FROM page
JOIN pagelinks
  ON page_id = pl_from
JOIN linktarget 
  ON pl_target_id = lt_id
WHERE
  page_namespace = 0
  AND pl_from_namespace = 0
  AND lt_namespace IN (2, 3)
  AND page_id NOT IN (
    SELECT c1.cl_from
    FROM categorylinks c1
    WHERE
      c1.cl_to = 'مقاله‌های_نامزد_حذف_سریع'
  )

(adding pl_from_namespace = 0 helps a bit)

@Huji Hi, the data has been fully populated for fawiki. If you want check and switch your tools and queries, it should be fine to do so now.

Change #1024450 had a related patch set uploaded (by Ladsgroup; author: Amir Sarabadani):

[mediawiki/core@master] Run pagelinks migration for third parties too

https://gerrit.wikimedia.org/r/1024450

Change #1024450 merged by jenkins-bot:

[mediawiki/core@master] Run pagelinks migration for third parties too

https://gerrit.wikimedia.org/r/1024450

We are about to start dropping the old columns everywhere.

What is this change, and what should be announced in Tech News (if anything)?

We are about to start dropping the old columns everywhere.

Does "everywhere" include Manual:Pagelinks table, which has yet to be updated this year and doesn't indicate that any columns are deprecated? Nor document the new way?

@Huji Hi, the data has been fully populated for fawiki. If you want check and switch your tools and queries, it should be fine to do so now.

I migrated all my code. In most cases, queries are running just a tad slower. In a few cases, they seem to run a tad faster. In all cases, my tests showed the results were identical.

Change #1028778 had a related patch set uploaded (by Ladsgroup; author: Amir Sarabadani):

[operations/mediawiki-config@master] Stop writing to old columns of pagelinks in most wikis

https://gerrit.wikimedia.org/r/1028778

Change #1028778 merged by jenkins-bot:

[operations/mediawiki-config@master] Stop writing to old columns of pagelinks in most wikis

https://gerrit.wikimedia.org/r/1028778

Mentioned in SAL (#wikimedia-operations) [2024-05-07T15:32:32Z] <ladsgroup@deploy1002> Started scap: Backport for [[gerrit:1028778|Stop writing to old columns of pagelinks in most wikis (T352010 T299947)]]

Mentioned in SAL (#wikimedia-operations) [2024-05-07T15:38:11Z] <ladsgroup@deploy1002> ladsgroup: Backport for [[gerrit:1028778|Stop writing to old columns of pagelinks in most wikis (T352010 T299947)]] synced to the testservers (https://wikitech.wikimedia.org/wiki/Mwdebug)

Mentioned in SAL (#wikimedia-operations) [2024-05-07T16:05:02Z] <ladsgroup@deploy1002> Finished scap: Backport for [[gerrit:1028778|Stop writing to old columns of pagelinks in most wikis (T352010 T299947)]] (duration: 32m 29s)

What is this change, and what should be announced in Tech News (if anything)?

Something along the lines of:

Two columns of pagelinks table (pl_namespace, pl_title) are being dropped soon. Users must use pl_target_id instead (foreign key to linktarget table). See T222224 for reasoning.