
EN WP pages articles has duplicate pages
Closed, Declined; Public

Description

Reported by Sebastiano Vigna:

The current dump enwiki-20160204-pages-articles.xml.bz2 contains duplicate pages. In particular, "Total Nonstop Action" and "Ida de Grey" appear twice.

Event Timeline

The page Total Nonstop Action is in there with two different page ids, so that reflects the state of the database when the stubs were dumped.

<title>Total Nonstop Action</title>
<ns>0</ns>
<id>454401</id>

vs.

<title>Total Nonstop Action</title>
<ns>0</ns>
<id>2547790</id>

The page Ida de Grey has the same situation.

<title>Ida de Grey</title>
<ns>0</ns>
<id>22871284</id>

vs.

<title>Ida de Grey</title>
<ns>0</ns>
<id>31345536</id>
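
For anyone who wants to check a whole dump for this, here is a minimal sketch, not the tooling the dumps themselves use. It assumes each <title> and <id> element sits on its own line, as in the excerpts above, and that the first <id> after a <title> is the page id. The function names page_titles_and_ids and duplicate_titles are made up for illustration.

```shell
# Print "title<TAB>page id" for every page in a MediaWiki XML stream on stdin.
# Assumption: <title> and <id> each sit on their own line, and the first <id>
# following a <title> is the page id (later <id> lines belong to revisions).
page_titles_and_ids() {
    awk '
        /<title>/ {
            title = $0
            sub(/^.*<title>/, "", title)
            sub(/<\/title>.*$/, "", title)
            want_id = 1
            next
        }
        want_id && /<id>/ {
            id = $0
            sub(/^.*<id>/, "", id)
            sub(/<\/id>.*$/, "", id)
            print title "\t" id
            want_id = 0
        }
    '
}

# Print "title<TAB>id id ..." for every title that occurs more than once.
# Sorting in the C locale keeps equal titles adjacent, so awk only has to
# compare each line with the previous one instead of holding all titles in memory.
duplicate_titles() {
    page_titles_and_ids | LC_ALL=C sort | awk -F '\t' '
        $1 == prev { ids = ids " " $2; n++; next }
        {
            if (n > 1) print prev "\t" ids
            prev = $1; ids = $2; n = 1
        }
        END { if (n > 1) print prev "\t" ids }
    '
}
```

Usage would be something like: bzcat enwiki-20160204-pages-articles.xml.bz2 | duplicate_titles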

This likely is an artifact of how MediaWiki handles certain page moves these days:

08:39, 4 February 2016  Oknazevad          (59,374 bytes)  (Oknazevad moved page Total Nonstop Action to Total Nonstop Action Wrestling over redirect)
08:33, 4 February 2016  Galatz             (59,374 bytes)  (Galatz moved page Total Nonstop Action Wrestling to Total Nonstop Action over redirect)
08:31, 4 February 2016  Oknazevad          (59,352 bytes)  (Oknazevad moved page Total Nonstop Action to Total Nonstop Action Wrestling over redirect)
03:41, 4 February 2016  Anthony Appleyard  (59,374 bytes)  (Anthony Appleyard moved page Total Nonstop Action Wrestling to Total Nonstop Action)

and

22:29, 3 February 2016  Chchn  (6,952 bytes)  (Chchn moved page Ida Cockayne to Ida de Grey over redirect)

I won't be able to fix this in the dumps but I can look at the move-related code and see what happens. I guess that the move consists of multiple transactions and that at some point both the old page (the redirect) and the new page have the same title. This smells to me like a bug indeed but as I don't know that code, I can't be sure.

I checked the MovePage code and it looks like there's an Atomic wrapper around everything so I'm going to pass this to ... who does MW core any more? Anyways, to "someone" to look at possible MW bugs.

There's a UNIQUE index on (page_namespace, page_title), so even if the dumper ran at the same time that the page was being moved, I don't see how this could happen.

Does the dumper run at a lower transaction isolation level, or are there several connections when generating the list?

The stubs are generated in 27 parts, all running at once. It's possible for transactions from two separate stub runs to reflect differing data. Weeding that out would be prohibitively slow as the dumps are currently generated. I have not checked to see if these two pages with the same title were in different stub chunks or not, but I can have a look.

That seems the most likely explanation.

ArielGlenn triaged this task as Medium priority.
ArielGlenn edited projects, added Dumps-Generation; removed MediaWiki-General.

/data/xmldatadumps/public/enwiki/20160204$ for i in $(seq 1 27); do echo "enwiki-20160204-stub-articles${i}.xml.gz"; zcat "enwiki-20160204-stub-articles${i}.xml.gz" | grep 'title>Total Nonstop Action<'; done
enwiki-20160204-stub-articles1.xml.gz
enwiki-20160204-stub-articles2.xml.gz
enwiki-20160204-stub-articles3.xml.gz
enwiki-20160204-stub-articles4.xml.gz
enwiki-20160204-stub-articles5.xml.gz

<title>Total Nonstop Action</title>

enwiki-20160204-stub-articles6.xml.gz
enwiki-20160204-stub-articles7.xml.gz
enwiki-20160204-stub-articles8.xml.gz
enwiki-20160204-stub-articles9.xml.gz
enwiki-20160204-stub-articles10.xml.gz

<title>Total Nonstop Action</title>

enwiki-20160204-stub-articles11.xml.gz
enwiki-20160204-stub-articles12.xml.gz
enwiki-20160204-stub-articles13.xml.gz
enwiki-20160204-stub-articles14.xml.gz
enwiki-20160204-stub-articles15.xml.gz
enwiki-20160204-stub-articles16.xml.gz
enwiki-20160204-stub-articles17.xml.gz
enwiki-20160204-stub-articles18.xml.gz
enwiki-20160204-stub-articles19.xml.gz
enwiki-20160204-stub-articles20.xml.gz
enwiki-20160204-stub-articles21.xml.gz
enwiki-20160204-stub-articles22.xml.gz
enwiki-20160204-stub-articles23.xml.gz
enwiki-20160204-stub-articles24.xml.gz
enwiki-20160204-stub-articles25.xml.gz
enwiki-20160204-stub-articles26.xml.gz
enwiki-20160204-stub-articles27.xml.gz

So that is indeed the explanation: the title shows up in two different stub chunks. These stub jobs, while they do fire off at approximately the same time, may see views of the database that differ by a second or so, and that's enough to make the difference.
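
The loop above only shows which chunk a title turns up in. A small variant can print the page id next to each chunk, which makes it easier to match the hits against the two ids in the combined dump. This is a rough sketch under the same one-element-per-line assumption; chunks_with_title is a made-up name, and the title has to be given as it appears in the XML (so with entities such as &amp;amp; already escaped).

```shell
# Usage: chunks_with_title TITLE FILE.gz [FILE.gz ...]
# Prints "file<TAB>page id" for every gzip-compressed stub file that contains
# a page whose <title> is exactly TITLE.
# The title and file name are passed through the environment rather than
# awk -v, so that backslashes in them are not treated as escape sequences.
chunks_with_title() {
    cwt_title=$1
    shift
    for cwt_file in "$@"; do
        gzip -cd -- "$cwt_file" |
            WANTED=$cwt_title FILE=$cwt_file awk '
                # Exact match on the whole <title> element, not a prefix.
                index($0, "<title>" ENVIRON["WANTED"] "</title>") {
                    want_id = 1
                    next
                }
                # First <id> after the matching title is the page id.
                want_id && /<id>/ {
                    id = $0
                    sub(/^.*<id>/, "", id)
                    sub(/<\/id>.*$/, "", id)
                    print ENVIRON["FILE"] "\t" id
                    want_id = 0
                }
            '
    done
}
```

Usage would be something like: chunks_with_title 'Total Nonstop Action' enwiki-20160204-stub-articles*.xml.gz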

This is a wontfix with the current setup; for dumps 2.0 (see the Dumps-Rewrite project) we want internal consistency up front in the design.

I see this marked as "done", but enwiki-20200401-pages-articles.xml still has this problem.

See my comment above. This is closed as declined (essentially "wontfix") given that it's unsolvable without a consistent view of the database across the entire time that the dump runs, and that is not feasible with the current architecture.