Page MenuHomePhabricator

namespaceDupes.php causes duplicate entry db errors on some updates
Open, HighPublic

Description

I guess this is an offshoot of Tims great improvements to namespaceDupes.php, which now fixes link tables

https://github.com/wikimedia/mediawiki/commit/64765720b0b5cf4290a9d8f22bc10f13e3b7c101
https://github.com/wikimedia/mediawiki/commit/9b685a4a52244db1489728f1397a7640a64f90da

pagelinks from=38058985 ns=0 dbk=Special:Contributions/Nabila.selim *** INVALID
A database query error has occurred.
Query: UPDATE  `pagelinks` SET pl_namespace = '4',pl_title = 'OTRS' WHERE pl_namespace = '0' AND pl_title = 'COM:OTRS' AND pl_from = '7962192'
Function: NamespaceConflictChecker::checkLinkTable
Error: 1062 Duplicate entry '7962192-4-OTRS' for key 'pl_from' (10.64.16.29)

Noticed whilst poking at T87645 again.

I guess the query shouldn't just fail, causing the script to be run, the duplicate pl should be presumably be deleted, as the resultant link already exists?

Details

Related Gerrit Patches:

Event Timeline

Reedy created this task.Oct 18 2015, 11:01 PM
Reedy raised the priority of this task from to Needs Triage.
Reedy updated the task description. (Show Details)
Reedy added subscribers: Reedy, tstarling.
Restricted Application added a subscriber: Aklapper. · View Herald TranscriptOct 18 2015, 11:01 PM
Reedy renamed this task from namespaceDupes.php to namespaceDupes.php causes duplicate entry db errors on some updates.Oct 18 2015, 11:01 PM
Reedy set Security to None.
Reedy updated the task description. (Show Details)Oct 18 2015, 11:07 PM
jayvdb added a subscriber: jayvdb.Oct 19 2015, 3:57 AM
Reedy added a comment.Oct 19 2015, 1:16 PM

I guess we could change https://github.com/wikimedia/mediawiki/blob/master/maintenance/namespaceDupes.php#L381 to use a db->replace() instead of db->update()

For WMF usage, it's one query then.. Otherwise we need to do a select for the new row/target, if it exists, delete the one we were going to update

Reedy triaged this task as High priority.Oct 19 2015, 1:43 PM
Reedy added a comment.EditedOct 19 2015, 1:48 PM

I guess we could change https://github.com/wikimedia/mediawiki/blob/master/maintenance/namespaceDupes.php#L381 to use a db->replace() instead of db->update()
For WMF usage, it's one query then.. Otherwise we need to do a select for the new row/target, if it exists, delete the one we were going to update

I guess this is only potentially a problem for the pagelinks and templatelinks tables; the redirect update uses the same function, but no UNIQUE constraint (though, the comment says exactly 1 row... Could we actually end up with > 1 if we have the same problems with the links tables?). It also doesn't have a from_namespace column unlike the other two. But that col is updated in movePage, but with replace, we need to get all data and "reinsert"...

--
-- Track page-to-page hyperlinks within the wiki.
--
CREATE TABLE /*_*/pagelinks (
  -- Key to the page_id of the page containing the link.
  pl_from int unsigned NOT NULL default 0,
  -- Namespace for this page
  pl_from_namespace int NOT NULL default 0,

  -- Key to page_namespace/page_title of the target page.
  -- The target page may or may not exist, and due to renames
  -- and deletions may refer to different page records as time
  -- goes by.
  pl_namespace int NOT NULL default 0,
  pl_title varchar(255) binary NOT NULL default ''
) /*$wgDBTableOptions*/;

CREATE UNIQUE INDEX /*i*/pl_from ON /*_*/pagelinks (pl_from,pl_namespace,pl_title);
CREATE INDEX /*i*/pl_namespace ON /*_*/pagelinks (pl_namespace,pl_title,pl_from);
CREATE INDEX /*i*/pl_backlinks_namespace ON /*_*/pagelinks (pl_from_namespace,pl_namespace,pl_title,pl_from);


--
-- Track template inclusions.
--
CREATE TABLE /*_*/templatelinks (
  -- Key to the page_id of the page containing the link.
  tl_from int unsigned NOT NULL default 0,
  -- Namespace for this page
  tl_from_namespace int NOT NULL default 0,

  -- Key to page_namespace/page_title of the target page.
  -- The target page may or may not exist, and due to renames
  -- and deletions may refer to different page records as time
  -- goes by.
  tl_namespace int NOT NULL default 0,
  tl_title varchar(255) binary NOT NULL default ''
) /*$wgDBTableOptions*/;

CREATE UNIQUE INDEX /*i*/tl_from ON /*_*/templatelinks (tl_from,tl_namespace,tl_title);
CREATE INDEX /*i*/tl_namespace ON /*_*/templatelinks (tl_namespace,tl_title,tl_from);
CREATE INDEX /*i*/tl_backlinks_namespace ON /*_*/templatelinks (tl_from_namespace,tl_namespace,tl_title,tl_from);
-- For each redirect, this table contains exactly one row defining its target
CREATE TABLE /*_*/redirect (
  -- Key to the page_id of the redirect page
  rd_from int unsigned NOT NULL default 0 PRIMARY KEY,

  -- Key to page_namespace/page_title of the target page.
  -- The target page may or may not exist, and due to renames
  -- and deletions may refer to different page records as time
  -- goes by.
  rd_namespace int NOT NULL default 0,
  rd_title varchar(255) binary NOT NULL default '',
  rd_interwiki varchar(32) default NULL,
  rd_fragment varchar(255) binary default NULL
) /*$wgDBTableOptions*/;

CREATE INDEX /*i*/rd_ns_title ON /*_*/redirect (rd_namespace,rd_title,rd_from);

Change 254078 had a related patch set uploaded (by Catrope):
namespaceDupes: Ignore duplicate key errors in link table updates

https://gerrit.wikimedia.org/r/254078

Isn't it sufficient to just add IGNORE?

Isn't it sufficient to just add IGNORE?

Wouldn't it mean we ended up with the old links still around? Which is then just crap in the db that needs cleaning up at a later date when orphaned?

https://dev.mysql.com/doc/refman/5.7/en/update.html

With the IGNORE keyword, the update statement does not abort even if errors occur during the update. Rows for which duplicate-key conflicts occur on a unique key value are not updated.

Change 254078 merged by jenkins-bot:
namespaceDupes: Ignore duplicate key errors in link table updates

https://gerrit.wikimedia.org/r/254078

Isn't it sufficient to just add IGNORE?

Wouldn't it mean we ended up with the old links still around? Which is then just crap in the db that needs cleaning up at a later date when orphaned?
https://dev.mysql.com/doc/refman/5.7/en/update.html

With the IGNORE keyword, the update statement does not abort even if errors occur during the update. Rows for which duplicate-key conflicts occur on a unique key value are not updated.

Yeah, you're right, we should probably do replace or upsert instead.