EN WP pages articles has duplicate pages
Closed, Declined · Public

Description

Reported by Sebastiano Vigna:

The current dump enwiki-20160204-pages-articles.xml.bz2 contains duplicate pages. In particular, "Total Nonstop Action" and "Ida de Grey" appear twice.
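A check of this kind can be sketched as follows. This is only a minimal illustration, not the reporter's actual tooling: the function name, the inline sample, and the export namespace URI are assumptions, and only the titles and page ids come from this task. It streams the XML and groups page ids by (namespace, title).

```python
import io
import xml.etree.ElementTree as ET
from collections import defaultdict

# Illustrative sample only; a real dump has far more fields per page.
SAMPLE = b"""<mediawiki xmlns="http://www.mediawiki.org/xml/export-0.10/">
  <page><title>Total Nonstop Action</title><ns>0</ns><id>454401</id></page>
  <page><title>Ida de Grey</title><ns>0</ns><id>22871284</id></page>
  <page><title>Total Nonstop Action</title><ns>0</ns><id>2547790</id></page>
</mediawiki>"""


def local(tag):
    """Strip the '{namespace-uri}' prefix ElementTree puts on tag names."""
    return tag.rsplit("}", 1)[-1]


def find_duplicate_titles(stream):
    """Map (ns, title) to the list of page ids, for titles seen more than once."""
    seen = defaultdict(list)
    for _, elem in ET.iterparse(stream, events=("end",)):
        if local(elem.tag) != "page":
            continue
        # Only direct children of <page>, so a revision's <id> is not picked up.
        fields = {local(child.tag): child.text for child in elem}
        seen[(fields.get("ns"), fields.get("title"))].append(fields.get("id"))
        elem.clear()  # drop the page's children to limit memory use
    return {key: ids for key, ids in seen.items() if len(ids) > 1}


if __name__ == "__main__":
    print(find_duplicate_titles(io.BytesIO(SAMPLE)))
```

For a real dump, the stream could be opened with `bz2.open(path, "rb")` (or `gzip.open` for the stub files) and passed to the same function. The dictionary of seen titles stays in memory, so a full enwiki run would need a fair amount of RAM.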

Restricted Application added subscribers: StudiesWorld, Aklapper. Feb 23 2016, 1:42 PM
ArielGlenn moved this task from Backlog to Active on the Dumps-Generation board. Feb 23 2016, 1:44 PM

The page Total Nonstop Action is in there with two different page ids, so that reflects the state of the database when the stubs were dumped.

<title>Total Nonstop Action</title>
<ns>0</ns>
<id>454401</id>

vs.

<title>Total Nonstop Action</title>
<ns>0</ns>
<id>2547790</id>

The page Ida de Grey has the same situation.

<title>Ida de Grey</title>
<ns>0</ns>
<id>22871284</id>

vs.

<title>Ida de Grey</title>
<ns>0</ns>
<id>31345536</id>

This likely is an artifact of how MediaWiki handles certain page moves these days:

08:39, 4 February 2016  Oknazevad  (59,374 bytes)  (Oknazevad moved page Total Nonstop Action to Total Nonstop Action Wrestling over redirect)
08:33, 4 February 2016  Galatz  (59,374 bytes)  (Galatz moved page Total Nonstop Action Wrestling to Total Nonstop Action over redirect)
08:31, 4 February 2016  Oknazevad  (59,352 bytes)  (Oknazevad moved page Total Nonstop Action to Total Nonstop Action Wrestling over redirect)
03:41, 4 February 2016  Anthony Appleyard  (59,374 bytes)  (Anthony Appleyard moved page Total Nonstop Action Wrestling to Total Nonstop Action)

and

22:29, 3 February 2016  Chchn  (6,952 bytes)  (Chchn moved page Ida Cockayne to Ida de Grey over redirect)

I won't be able to fix this in the dumps but I can look at the move-related code and see what happens. I guess that the move consists of multiple transactions and that at some point both the old page (the redirect) and the new page have the same title. This smells to me like a bug indeed but as I don't know that code, I can't be sure.

I checked the MovePage code and it looks like there's an Atomic wrapper around everything so I'm going to pass this to ... who does MW core any more? Anyways, to "someone" to look at possible MW bugs.

ArielGlenn removed ArielGlenn as the assignee of this task. Mar 3 2016, 6:23 PM
ArielGlenn removed a project: Dumps-Generation.
Platonides added a subscriber: Platonides. Edited Mar 12 2016, 7:51 PM

There's a UNIQUE index on the page_namespace, page_title so, even considering that the dumper ran at the same time that the page was moved, I don't see how this could happen.

Does the dumper run at a lower transaction isolation level, or are there several connections when generating the list?

The stubs are generated in 27 parts, all running at once. It's possible for transactions from two separate stub runs to reflect differing data. Weeding that out would be prohibitively slow as the dumps are currently generated. I have not checked to see if these two pages with the same title were in different stub chunks or not, but I can have a look.

That seems the most likely explanation.
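The effect described above can be reproduced in miniature. The sketch below is a toy model under stated assumptions, not the dump code or MediaWiki's move logic: the page ids 1 and 2, the chunk boundaries, the placeholder title, and the idea that two ids simply exchange titles are all hypothetical, and a single SQLite connection read at two points in time stands in for two stub jobs with different views of the database.

```python
import sqlite3

TNA = "Total Nonstop Action"
TNAW = "Total Nonstop Action Wrestling"


def make_db():
    """In-memory stand-in for the page table, with the UNIQUE title index."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE page ("
        " page_id INTEGER PRIMARY KEY,"
        " page_namespace INTEGER NOT NULL,"
        " page_title TEXT NOT NULL,"
        " UNIQUE (page_namespace, page_title))"
    )
    # Hypothetical ids: page 1 starts out holding the short title.
    conn.executemany("INSERT INTO page VALUES (?, 0, ?)", [(1, TNA), (2, TNAW)])
    conn.commit()
    return conn


def dump_chunk(conn, lo, hi):
    """Read one page-id range, as a single stub job might."""
    return conn.execute(
        "SELECT page_id, page_title FROM page"
        " WHERE page_id BETWEEN ? AND ? ORDER BY page_id",
        (lo, hi),
    ).fetchall()


def swap_titles(conn):
    """Stand-in for the page moves: the two ids exchange titles atomically."""
    with conn:  # one transaction, committed on exit
        conn.execute("UPDATE page SET page_title = '(moving)' WHERE page_id = 1")
        conn.execute("UPDATE page SET page_title = ? WHERE page_id = 2", (TNA,))
        conn.execute("UPDATE page SET page_title = ? WHERE page_id = 1", (TNAW,))


def simulate():
    conn = make_db()
    early = dump_chunk(conn, 1, 1)  # this job reads before the move
    swap_titles(conn)
    late = dump_chunk(conn, 2, 2)   # this job reads after the move
    return early + late


if __name__ == "__main__":
    print(simulate())
```

In this model the UNIQUE index is never violated and each read is consistent on its own, yet the concatenated output carries the same title under two page ids, which is the shape of what shows up in the dump.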

Bonzon added a subscriber: Bonzon. Apr 11 2016, 6:04 PM
ArielGlenn triaged this task as Normal priority. Apr 12 2016, 9:07 AM
ArielGlenn claimed this task.
ArielGlenn closed this task as Declined. Apr 12 2016, 9:15 AM

/data/xmldatadumps/public/enwiki/20160204$ for i in $(seq 1 27); do echo "enwiki-20160204-stub-articles${i}.xml.gz"; zcat "enwiki-20160204-stub-articles${i}.xml.gz" | grep 'title>Total Nonstop Action<'; done
enwiki-20160204-stub-articles1.xml.gz
enwiki-20160204-stub-articles2.xml.gz
enwiki-20160204-stub-articles3.xml.gz
enwiki-20160204-stub-articles4.xml.gz
enwiki-20160204-stub-articles5.xml.gz

<title>Total Nonstop Action</title>

enwiki-20160204-stub-articles6.xml.gz
enwiki-20160204-stub-articles7.xml.gz
enwiki-20160204-stub-articles8.xml.gz
enwiki-20160204-stub-articles9.xml.gz
enwiki-20160204-stub-articles10.xml.gz

<title>Total Nonstop Action</title>

enwiki-20160204-stub-articles11.xml.gz
enwiki-20160204-stub-articles12.xml.gz
enwiki-20160204-stub-articles13.xml.gz
enwiki-20160204-stub-articles14.xml.gz
enwiki-20160204-stub-articles15.xml.gz
enwiki-20160204-stub-articles16.xml.gz
enwiki-20160204-stub-articles17.xml.gz
enwiki-20160204-stub-articles18.xml.gz
enwiki-20160204-stub-articles19.xml.gz
enwiki-20160204-stub-articles20.xml.gz
enwiki-20160204-stub-articles21.xml.gz
enwiki-20160204-stub-articles22.xml.gz
enwiki-20160204-stub-articles23.xml.gz
enwiki-20160204-stub-articles24.xml.gz
enwiki-20160204-stub-articles25.xml.gz
enwiki-20160204-stub-articles26.xml.gz
enwiki-20160204-stub-articles27.xml.gz

So that is indeed the explanation: the title shows up in two different stub chunks (5 and 10). These stub jobs, while they do fire off at approximately the same time, may see views of the database that differ by a second or so, and that's enough to make the difference.

This is a wontfix with the current setup; for dumps 2.0 (see the Dumps-Rewrite project) we want internal consistency up front in the design.

ArielGlenn moved this task from Active to Done on the Dumps-Generation board. Apr 25 2016, 2:24 PM