
Several "Duplicate entry for key 'PRIMARY'" errors in enwiki-latest-pages-articles.xml.bz2 (05-Jun-2015 23:45, 11984805689 bytes)
Closed, Resolved · Public

Description

The latest English-language Wikipedia XML dump,
enwiki-latest-pages-articles.xml.bz2 (05-Jun-2015 23:45, 11984805689 bytes),
contains several duplicated pages.

This leads to errors when reading its data to populate a MySQL database.

For example, I get

Exception in thread "main" java.io.IOException: com.mysql.jdbc.exceptions.jdbc4.MySQLIntegrityConstraintViolationException: Duplicate entry '614219339' for key 'PRIMARY'
at org.mediawiki.importer.XmlDumpReader.readDump(XmlDumpReader.java:92)
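
For reference (this sketch is not part of the original report), the duplicate page IDs can be found with a streaming parse of the compressed dump. A minimal Python sketch; the export schema namespace and file name below are assumptions and may need adjusting:

  import bz2
  import xml.etree.ElementTree as ET

  NS = "{http://www.mediawiki.org/xml/export-0.10/}"  # assumed schema version
  seen, duplicates = set(), set()

  with bz2.open("enwiki-latest-pages-articles.xml.bz2") as stream:
      for _, elem in ET.iterparse(stream):
          if elem.tag == NS + "page":
              page_id = elem.find(NS + "id").text  # page-level <id>, not the revision <id>
              if page_id in seen:
                  duplicates.add(page_id)
              else:
                  seen.add(page_id)
              elem.clear()  # release the page subtree while streaming the ~11 GB file

  print(sorted(duplicates, key=int))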

The full list of duplicated page IDs is:

  • 614219339
  • 663854862
  • 359952698
  • 301899471
  • 559375953
  • 603392565
  • 544004224
  • 624437388
  • 37733084

I checked some of these duplicated entries and saw that the XML content from <page> to </page> is identical in each copy.
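
Since the repeated blocks are identical, a possible workaround (again not from the report) is to write a de-duplicated copy of the dump before importing. A rough sketch that relies on the dump's usual one-tag-per-line layout; the output file name is hypothetical:

  import bz2

  seen = set()
  with bz2.open("enwiki-latest-pages-articles.xml.bz2", "rt", encoding="utf-8") as src, \
       bz2.open("enwiki-latest-pages-articles-dedup.xml.bz2", "wt", encoding="utf-8") as dst:
      buffer, in_page, page_id = [], False, None
      for line in src:
          stripped = line.strip()
          if stripped == "<page>":
              in_page, buffer, page_id = True, [line], None
              continue
          if not in_page:
              dst.write(line)  # everything outside <page> blocks passes through unchanged
              continue
          buffer.append(line)
          if page_id is None and stripped.startswith("<id>"):
              page_id = stripped  # the first <id> inside <page> is the page id
          if stripped == "</page>":
              if page_id not in seen:  # drop the second and later copies of a page
                  seen.add(page_id)
                  dst.writelines(buffer)
              in_page = False

The filtered file can then be fed to the importer unchanged.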

Event Timeline

Easy_mungo raised the priority of this task to Needs Triage.
Easy_mungo updated the task description.
Easy_mungo subscribed.

Can you have a look at the stubs file and see if they are duplicated there?

As per https://lists.wikimedia.org/pipermail/xmldatadumps-l/2015-July/001149.html, I also observed the same behavior in multiple Swedish wiki dumps, as well as the Spanish Wikipedia dump. I excerpt the relevant portion below.

I checked the stubs for the Swedish wikis and verified that they are duplicated there. See the following links.

Let me know if there's anything else. Thanks.

Example 1:
URL: http://dumps.wikimedia.org/svwikiversity/20150602/svwikiversity-20150602-pages-articles.xml.bz2
Title: Audi m8
ID: 18942
SHA1: gd16v3qkmjr2w2j35zhqitjfg97igjt
Note: Last article in dump. Repeated twice

Example 2:
URL: http://dumps.wikimedia.org/svwikiquote/20150602/svwikiquote-20150602-pages-articles.xml.bz2
Title: Sommarens tolv månader
ID: 6209
SHA1: 9yibnev7pn3atxicayjoay0ave7pcu6
Note: Last article in dump. Repeated twice

Example 3:
URL: http://dumps.wikimedia.org/svwikibooks/20150602/svwikibooks-20150602-pages-articles.xml.bz2
Title: Topologi/Metriska rum
ID: 10001
SHA1: 5zdkpxflzdxhy7gxclludnlasvl6tw3
Note: Last article in dump. Repeated twice

Example 4:
URL: http://dumps.wikimedia.org/svwikisource/20150602/svwikisource-20150602-pages-articles.xml.bz2
Title: Afhandling om svenska stafsättet/4
ID: 88768
SHA1: 7zyj208ur4vit0t41z7xlftlyl69bo7
Note: Last article in dump. Repeated twice

Example 5:
URL: http://dumps.wikimedia.org/eswiki/20150602/eswiki-20150602-pages-articles.xml.bz2
Title (1): Veguer
Title (2): Promo
Note: duplicates are earlier in the dump (Veguer at the 9% mark and Promo at the 23% mark). There doesn't seem to be a dupe at the end of the dump.

Unaffected:
* http://dumps.wikimedia.org/svwiki/20150602/svwiki-20150602-pages-articles.xml.bz2
* http://dumps.wikimedia.org/svwiktionary/20150603/svwiktionary-20150603-pages-articles.xml.bz2
* http://dumps.wikimedia.org/svwikinews/20150602/svwikinews-20150602-pages-articles.xml.bz2

Many dumps have a duplicate final entry:

wiki | id | sha1
betawikiversity-20150602-stub-meta-current.xml.gz | 29172 | qj736ayooxudbk4jwvlufhkrfxztexl
cawikisource-20150602-stub-meta-current.xml.gz | 38171 | 8k8ka9anu9gzrd8hvtvmlplc3en0iim
elwikisource-20150602-stub-meta-current.xml.gz | 22723 | c9tyqn20o99k5v2sdutakx1ykqi07lz
foundationwiki-20150602-stub-meta-current.xml.gz | 101982 | inhryb63iuh859qv6v44k55ygs8v4zv
glwikisource-20150602-stub-meta-current.xml.gz | 3662 | 9tk9uyd7e73pwbohjg2rrr6f22rf0eh
simplewiktionary-20150602-stub-meta-current.xml.gz | 37265 | ekmywczb6olyzg7it8k4twiz17z4pul
zuwiki-20150603-stub-meta-current.xml.gz | 6679 | roeo5vuvy4coe53v6aoxqd00i49e9vq
zuwiktionary-20150602-stub-meta-current.xml.gz | 3544 | 0lzqcxomsynenvxr53e17hk7l2jw2xk
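
One way to confirm a repeated final entry in a stubs file (not part of the task, and the schema namespace is again an assumption) is to keep only the last two (id, sha1) pairs while streaming it:

  import gzip
  from collections import deque
  import xml.etree.ElementTree as ET

  NS = "{http://www.mediawiki.org/xml/export-0.10/}"  # assumed schema version

  def last_two_pages(path):
      tail = deque(maxlen=2)  # only the final two pages matter here
      with gzip.open(path) as stream:
          for _, elem in ET.iterparse(stream):
              if elem.tag == NS + "page":
                  page_id = elem.find(NS + "id").text
                  sha1 = elem.find(NS + "revision/" + NS + "sha1")
                  tail.append((page_id, sha1.text if sha1 is not None else None))
                  elem.clear()
      return list(tail)

  pages = last_two_pages("zuwiki-20150603-stub-meta-current.xml.gz")
  print("duplicate final entry" if len(pages) == 2 and pages[0] == pages[1] else "looks fine")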

The duplicate title at the end of a run looks to me like it's gone; that was fixed in https://gerrit.wikimedia.org/r/#/c/216416/
The duplicate in the middle is due to the same page being dumped both at the end of one set of stubs and at the beginning of the next set. This is a regression introduced in https://gerrit.wikimedia.org/r/#/c/215666/: xmlstubs.py and the other xml*.py scripts dump up to and including the last page or item specified.
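
To illustrate that off-by-one (this is not the actual xmlstubs.py code): if each batch of page ids is dumped up to and including its end value, and the next batch starts at that same value, the boundary page lands in two batches; starting the next batch one past the previous end avoids the overlap.

  def chunk_inclusive(first_id, last_id, size):
      # Regression behaviour: each chunk ends on `end` inclusive, and the next
      # chunk starts at that same `end`, so the boundary page is dumped twice.
      start = first_id
      while start < last_id:
          end = min(start + size, last_id)
          yield (start, end)
          start = end

  def chunk_fixed(first_id, last_id, size):
      # Non-overlapping alternative: the next chunk begins after the last dumped page.
      start = first_id
      while start <= last_id:
          end = min(start + size - 1, last_id)
          yield (start, end)
          start = end + 1

  print(list(chunk_inclusive(1, 10, 4)))  # [(1, 5), (5, 9), (9, 10)] -- ids 5 and 9 appear twice
  print(list(chunk_fixed(1, 10, 4)))      # [(1, 4), (5, 8), (9, 10)] -- no overlap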