Author: elephantus_l
Description:
I downloaded two consecutive Spanish Wikipedia XML dump files
(eswiki-20090504-pages-articles.xml.bz2 and, before that, eswiki-20090421-pages-articles.xml.bz2). When I imported them into WikiTaxi, a large number of pages showed a strange error: the titles and the contents of the pages were mixed up. That is, the title would be one thing and the text would obviously be from a different page (or a combination of two pages). So I looked into the original XML file itself, and this is what I found, for example:
<page>
<title>Gómez Plata</title>
<id>454035</id>
<revision>
<id>25156038</id>
<timestamp>2009-03-28T06:38:04Z</timestamp>
<contributor>
<username>SajoR</username>
<id>130444</id>
</contributor>
<minor />
<comment>leve mejora</comment>
<text xml:space="preserve">'''Montserrat Domínguez''' ([[Madrid]], [[1963]]) es una [[periodismo|periodista]] [[España|española]].Considera que la primera obligación de un periodista es ser crítico con el poder y es optimista respecto a la situación actual del periodismo. Su trabajo le ofrece, en su opinión, "un motor de vida".
Es aficionada a la [[lectura]] y a los viajes.
Biografía
Estudió [[Ciencias de la Información]] por la [[Universidad Complutense de Madrid]]. Posteriormente cursó un Master en Periodismo por la [[Universidad de Columbia]].
So the title of the page is Gómez Plata (a municipality in Colombia), but the page is about a Spanish journalist.
This didn't happen with the other Wikipedia dumps I downloaded (en, de, nl, sv). Could someone please look into this problem? Thank you.
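In case it helps with reproducing this, here is a rough sketch of how a dump could be scanned for pages like the one above. It is only a crude heuristic: it assumes an article usually opens with its own name in bold ('''...'''), so it will also flag redirects, disambiguation pages and perfectly valid articles, and the results need checking by hand. The names iter_pages, first_bold, suspicious and scan_dump are made up for this illustration and are not part of WikiTaxi or of the dump tooling.

```python
import bz2
import io
import re
import xml.etree.ElementTree as ET

BOLD = re.compile(r"'''(.+?)'''")


def local(tag):
    """Strip the XML namespace: '{ns}page' -> 'page'."""
    return tag.rsplit("}", 1)[-1]


def iter_pages(stream):
    """Yield (title, text) for every <page> element in a file-like object."""
    for _event, elem in ET.iterparse(stream, events=("end",)):
        if local(elem.tag) != "page":
            continue
        title, text = None, ""
        for node in elem.iter():
            name = local(node.tag)
            if name == "title":
                title = node.text
            elif name == "text":
                text = node.text or ""
        yield title, text
        elem.clear()  # drop the page's children; dumps are large


def first_bold(text):
    """Return the first '''bold''' term in the wikitext, or None."""
    match = BOLD.search(text)
    return match.group(1) if match else None


def suspicious(title, text):
    """Heuristic only: the first bold term and the title share nothing."""
    bold = first_bold(text)
    if title is None or bold is None:
        return False
    t, b = title.casefold(), bold.casefold()
    return t not in b and b not in t


def scan_dump(path):
    """Yield titles of suspicious pages in a .xml.bz2 dump (path is a placeholder)."""
    with bz2.open(path, "rb") as stream:
        for title, text in iter_pages(stream):
            if suspicious(title, text):
                yield title


# Tiny in-memory sample modelled on the excerpt above. The namespace URI is
# illustrative; local() ignores whichever one the dump actually uses.
SAMPLE = (
    '<mediawiki xmlns="http://www.mediawiki.org/xml/export-0.3/">'
    "<page><title>Gómez Plata</title><revision>"
    "<text>'''Montserrat Domínguez''' es una periodista.</text>"
    "</revision></page>"
    "<page><title>Madrid</title><revision>"
    "<text>'''Madrid''' es una ciudad.</text>"
    "</revision></page>"
    "</mediawiki>"
).encode("utf-8")

if __name__ == "__main__":
    for title, text in iter_pages(io.BytesIO(SAMPLE)):
        if suspicious(title, text):
            print("possible mismatch:", title)
```

Running scan_dump on both the 20090421 and 20090504 files and comparing the flagged titles might show whether the same pages are affected in each dump.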
Version: unspecified
Severity: normal