Page MenuHomePhabricator

dumps - elwikisource XML dump violates UNIQUE KEY constraint
Closed, ResolvedPublic

Description

0) Problem

The elwikisource-20150602-stub-meta-current.xml.gz violates UNIQUE KEY constraint.
This makes trouble for INSERT commands.

  1. Database schema

CREATE TABLE page (
page_id int(10) unsigned NOT NULL AUTO_INCREMENT,
page_namespace int(11) NOT NULL,
page_title varbinary(255) NOT NULL,
...
PRIMARY KEY (page_id),
UNIQUE KEY name_title (page_namespace,page_title),
KEY page_random (page_random),
KEY page_len (page_len),
KEY page_redirect_namespace_len (page_is_redirect,page_namespace,page_len),
KEY page_latest (page_latest)
) ENGINE=InnoDB CHARSET=binary;

  1. Dump contents

(shell) emacs elwikisource-20150602-stub-meta-current.xml.gz
...
<namespace key="108" case="first-letter">Συγγραφέας</namespace>
...
<page>
<title>Συγγραφέας:Πλάτων</title>
<ns>0</ns>
<id>1422</id>
<redirect title="Πλάτων" />
...
<page>
<title>Πλάτων</title>
<ns>0</ns>
<id>21023</id>
<redirect title="Συγγραφέας:Πλάτων" />
...

  1. Database table contents

If I drop the UNIQUE KEY constraint, then I can INSERT both pages:

mysql> select page_id,page_namespace,page_title from elwikisource.page where page_namespace=0 and page_title='Πλάτων';

page_idpage_namespacepage_title
14220Πλάτων
210230Πλάτων

But then we discover that Plato is not the only culprit. This dump file has ten such cases:

mysql> select page_namespace,page_title,count(page_id) as n from elwikisource.page group by page_namespace,page_title having n>1;

page_namespacepage_titlen
0Main_Page2
0Αριστοτέλης2
0Δημοσθένης2
0Πέτρος_Κροπότκιν2
0Πλάτων2
0Σοφοκλής2
0Ἐπίκουρος2
0Ἐπίκτητος2
0Ἡράκλειτος_ὁ_Ἐφέσιος2
0Ὅμηρος2

Or better yet

mysql> select a.page_id,b.page_id,a.page_namespace,a.page_title from elwikisource.page as a, elwikisource.page as b where a.page_namespace=b.page_namespace and a.page_title=b.page_title and a.page_id<b.page_id order by a.page_id;

page_idpage_idpage_namespacepage_title
183790Main_Page
141114320Ἐπίκτητος
1422210230Πλάτων
142516010Ὅμηρος
142619820Ἐπίκουρος
1427210200Πέτρος_Κροπότκιν
1428210210Αριστοτέλης
1434210350Δημοσθένης
143617530Ἡράκλειτος_ὁ_Ἐφέσιος
1626210280Σοφοκλής
  1. Analysis

In nine cases, we see:

  • a pair of <redirect>s that point to each other;
  • where one <page> has the wrong value for <ns> (Συγγραφέας should be 108 not 0).
<id><ns><title><redirect> title=<id><ns><title><redirect> title=
14110ἘπίκτητοςΣυγγραφέας:Επίκτητος14320Συγγραφέας:ἘπίκτητοςἘπίκτητος
14220Συγγραφέας:ΠλάτωνΠλάτων210230ΠλάτωνΣυγγραφέας:Πλάτων
14250Συγγραφέας:ΌμηροςΌμηρος16010ὍμηροςΣυγγραφέας:Όμηρος
14260Συγγραφέας:ΕπίκουροςΕπίκουρος19820ἘπίκουροςΣυγγραφέας:Επίκουρος
14270Συγγραφέας:Πέτρος ΚροπότκινΠέτρος Κροπότκιν210200Πέτρος ΚροπότκινΣυγγραφέας:Πέτρος Κροπότκιν
14280Συγγραφέας:ΑριστοτέληςΑριστοτέλης210210ΑριστοτέληςΣυγγραφέας:Αριστοτέλης
14340Συγγραφέας:ΔημοσθένηςΔημοσθένης210350ΔημοσθένηςΣυγγραφέας:Δημοσθένης
14360Συγγραφέας:ΗράκλειτοςΗράκλειτος17530Ἡράκλειτος ὁ ἘφέσιοςΣυγγραφέας:Ηράκλειτος
16260Συγγραφέας:ΣοφοκλήςΣοφοκλής210280ΣοφοκλήςΣυγγραφέας:Σοφοκλής

The tenth case, Main_Page, I can fix myself.

  • When you set up MediaWiki, a default Main_Page is installed.
  • When you INSERT an XML dump, a second Main_Page is installed.

The fix is to DELETE the first Main_Page. This I will do in the next version of WP-MIRROR.

Event Timeline

wpmirrordev assigned this task to ArielGlenn.
wpmirrordev raised the priority of this task from to Medium.
wpmirrordev updated the task description. (Show Details)
wpmirrordev subscribed.

I guess these may have been caused by creation of a new namespace without cleanup of the old cruft left behind. Maybe from this: https://phabricator.wikimedia.org/T31172 (several years ago!) There used to be a maintenance script that would do this cleanup, maintenance/namespaceDups.php I think. Let me ask around. Adding a couple of random folks in hopes they can say yay or nay.

actually given that 'bad' pages are all redirects, let me get a community admin on this instead; they can just delete.

Which is the list of pages that need deletion? All listed in the tables? If so, maybe if we get a rawlist we can go with the pywiki script and bot-delete them. I see that a sysop is somewhat active there, however as you say it's activity is intermitent. Replied there too. Best.

Which is the list of pages that need deletion? All listed in the tables?

The following nine records should be either fixed or deleted.

  • To fix, change the value of page_namespace to 108;
  • Otherwise, delete the records.
<id><ns><title><redirect> title=
14220Συγγραφέας:ΠλάτωνΠλάτων
14250Συγγραφέας:ΌμηροςΌμηρος
14260Συγγραφέας:ΕπίκουροςΕπίκουρος
14270Συγγραφέας:Πέτρος ΚροπότκινΠέτρος Κροπότκιν
14280Συγγραφέας:ΑριστοτέληςΑριστοτέλης
14320Συγγραφέας:ἘπίκτητοςἘπίκτητος
14340Συγγραφέας:ΔημοσθένηςΔημοσθένης
14360Συγγραφέας:ΗράκλειτοςΗράκλειτος
16260Συγγραφέας:ΣοφοκλήςΣοφοκλής

Sorry for my ignorance. Which ones we need to delete? Those under <title> or <redirect> title= ? Thanks.

@ArielGlenn wouldn't it be easier if you requested temporary sysop for tech activities or assigned sysadmin global flag?

The <id> field is the PRIMARY KEY. If you choose to fix the records, please try:

mysql> update elwikisource.page set page_namespace=108
> where page_id in (1422,1425,1426,1427,1428,1432,1434,1436,1626);
Query OK, 9 rows affected (0.14 sec)
Rows matched: 9 Changed: 9 Warnings: 0

mysql> select page_id as '<id>',page_namespace as '<ns>',page_title as '<title>' from elwikisource.page
> where page_id in (1422,1425,1426,1427,1428,1432,1434,1436,1626);

<id><ns><title>
1422108Συγγραφέας:Πλάτων
1425108Συγγραφέας:Ὅμηρος
1426108Συγγραφέας:Ἐπίκουρος
1427108Συγγραφέας:Πέτρος_Κροπότκιν
1428108Συγγραφέας:Αριστοτέλης
1432108Συγγραφέας:Ἐπίκτητος
1434108Συγγραφέας:Δημοσθένης
1436108Συγγραφέας:Ἡράκλειτος_ὁ_Ἐφέσιος
1626108Συγγραφέας:Σοφοκλής

9 rows in set (0.03 sec)

Closing as resolved then. Feel free to reopen if not fixed. I am tad dissapointed though with @ArielGlenn's lack of response on this bug, meta, talk page and email regarding this issue despite being active elsewhere. Throwing us work without telling us what needs exactly to be done, or ignoring requests for information from us.