
Several "Duplicate entry for key 'PRIMARY'" errors in enwiki-latest-pages-articles.xml.bz2 (05-Jun-2015 23:45, 11984805689 bytes)
Closed, Resolved · Public


The current latest XML English-language Wikipedia dump
enwiki-latest-pages-articles.xml.bz2 (05-Jun-2015 23:45, 11984805689 bytes)
has several duplicated pages.

This leads to errors when reading its data to populate a MySQL database.

For example, I get

Exception in thread "main" com.mysql.jdbc.exceptions.jdbc4.MySQLIntegrityConstraintViolationException: Duplicate entry '614219339' for key 'PRIMARY'
at org.mediawiki.importer.XmlDumpReader.readDump(
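The importer above is Java (mwdumper), but the workaround is language-agnostic: skip a page whose ID has already been inserted instead of letting the PRIMARY KEY constraint abort the run. A minimal sketch in Python, where `insert_page` is a hypothetical stand-in for the real database insert:

```python
# Hedged sketch: filter out duplicate page IDs before inserting, so a
# repeated <page> block does not raise a duplicate-key error.
# insert_page is a hypothetical callback, not part of mwdumper.

def import_pages(pages, insert_page):
    """pages: iterable of (page_id, payload) tuples parsed from the dump.

    Returns the list of page IDs that were skipped as duplicates.
    """
    seen = set()
    skipped = []
    for page_id, payload in pages:
        if page_id in seen:
            skipped.append(page_id)  # e.g. the second copy of 614219339
            continue
        seen.add(page_id)
        insert_page(page_id, payload)
    return skipped
```

An alternative, if you control the SQL, is `INSERT IGNORE` or `INSERT ... ON DUPLICATE KEY UPDATE`, which pushes the same dedup decision into MySQL itself.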

The list of all the IDs that are duplicated is:

  • 614219339
  • 663854862
  • 359952698
  • 301899471
  • 559375953
  • 603392565
  • 544004224
  • 624437388
  • 37733084

I checked some of these duplicated entries and I saw that the XML content from <page> to </page> is identical.
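A check like the one described above can be done in a single streaming pass, without loading the multi-gigabyte dump into memory. A minimal sketch using Python's `iterparse` (namespace handling is omitted for brevity; a real enwiki dump wraps everything in the MediaWiki export namespace):

```python
# Sketch: stream a dump and report page IDs that occur more than once.
import io
import xml.etree.ElementTree as ET
from collections import Counter

def duplicate_page_ids(xml_stream):
    counts = Counter()
    for event, elem in ET.iterparse(xml_stream, events=("end",)):
        if elem.tag == "page":
            counts[elem.findtext("id")] += 1
            elem.clear()  # keep memory flat on multi-GB dumps
    return sorted(pid for pid, n in counts.items() if n > 1)

sample = io.BytesIO(b"""<mediawiki>
  <page><title>A</title><id>614219339</id></page>
  <page><title>B</title><id>663854862</id></page>
  <page><title>A</title><id>614219339</id></page>
</mediawiki>""")
print(duplicate_page_ids(sample))  # -> ['614219339']
```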

Event Timeline

Easy_mungo raised the priority of this task to Needs Triage.
Easy_mungo updated the task description.
Easy_mungo added a subscriber: Easy_mungo.
Restricted Application added subscribers: Hydriz, Aklapper. Jun 24 2015, 11:07 AM
Hydriz set Security to None.
Hydriz added a subscriber: ArielGlenn.
Hydriz removed a subscriber: Hydriz. Jun 24 2015, 4:23 PM

Can you have a look at the stubs file and see if they are duplicated there?

As per the above, I also observed the same behavior in multiple Swedish-language wiki dumps, as well as in the Spanish Wikipedia dump. The relevant portion is excerpted below.

I checked the stubs for the Swedish wikis and verified that they are duplicated there. See the following links.

Let me know if there's anything else. Thanks.

Example 1:
Title: Audi m8
ID: 18942
SHA1: gd16v3qkmjr2w2j35zhqitjfg97igjt
Note: Last article in dump. Repeated twice

Example 2:
Title: Sommarens tolv månader
ID: 6209
SHA1: 9yibnev7pn3atxicayjoay0ave7pcu6
Note: Last article in dump. Repeated twice

Example 3:
Title: Topologi/Metriska rum
ID: 10001
SHA1: 5zdkpxflzdxhy7gxclludnlasvl6tw3
Note: Last article in dump. Repeated twice

Example 4:
Title: Afhandling om svenska stafsättet/4
ID: 88768
SHA1: 7zyj208ur4vit0t41z7xlftlyl69bo7
Note: Last article in dump. Repeated twice

Example 5:
Title (1): Veguer
Title (2): Promo
Note: duplicates are earlier in the dump (Veguer at the 9% mark and Promo at the 23% mark). There doesn't seem to be a dupe at the end of the dump.
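The stub check described above can be sketched as a line scan over the bzip2-compressed stubs file, counting how often each title of interest occurs. This is a hedged sketch: the filename is an example only, and the line-based `<title>` matching assumes the pretty-printed layout stub dumps normally use:

```python
# Sketch: count occurrences of given titles in a .bz2 stubs file.
import bz2
from collections import Counter

def title_counts(path, titles):
    wanted = set(titles)
    counts = Counter()
    with bz2.open(path, "rt", encoding="utf-8") as f:
        for line in f:
            line = line.strip()
            if line.startswith("<title>") and line.endswith("</title>"):
                t = line[len("<title>"):-len("</title>")]
                if t in wanted:
                    counts[t] += 1
    return counts

# e.g. title_counts("svwiki-latest-stub-articles.xml.bz2",
#                   ["Audi m8", "Sommarens tolv månader"])
# A count of 2 for a title matches the "Repeated twice" notes above.
```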

jayvdb added a subscriber: jayvdb. Jul 7 2015, 12:39 AM

Many dumps have a duplicate final entry:

This comment was removed by wpmirrordev.

The duplicate title at the end of a run looks to me like it's gone; that was fixed in
The duplicate in the middle is due to the same page being dumped at the end of one set of stubs and again at the beginning of the next set. This is a regression: it was introduced because the xml*.py scripts dump up to and including the last page or item specified.
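The boundary bug described above can be illustrated with a toy splitter (this is not the actual dump code): if each batch is dumped up to and including its last page, and the next batch starts at that same page, every batch boundary is emitted twice.

```python
# Toy illustration of the regression: inclusive batch ends that coincide
# with the next batch's start duplicate each boundary page.

def batch_inclusive(first, last, size):
    """Buggy split: each batch ends exactly where the next one begins."""
    bounds = list(range(first, last, size)) + [last]
    return [list(range(lo, hi + 1)) for lo, hi in zip(bounds, bounds[1:])]

batches = batch_inclusive(1, 10, 3)
print(batches)  # -> [[1, 2, 3, 4], [4, 5, 6, 7], [7, 8, 9, 10]]
flat = [p for b in batches for p in b]
dupes = sorted({p for p in flat if flat.count(p) > 1})
print(dupes)    # -> [4, 7]: each boundary page appears in two batches
```

The fix is to make the ranges half-open (dump up to but not including the next batch's first page), so adjacent batches no longer share a page.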