Author: elephantus_l
Description:
I downloaded two consecutive Spanish Wikipedia XML dump files (eswiki-20090504-pages-articles.xml.bz2 and, before that, eswiki-20090421-pages-articles.xml.bz2).
When I imported the dump into WikiTaxi, a large number of pages showed a strange error: the titles and the content of the pages were mixed up. That is, the title would belong to one page while the text was obviously from a different page (or a combination of two pages). So I looked into the original XML file itself, and this is what I found, for example:
<page>
  <title>Gómez Plata</title>
  <id>454035</id>
  <revision>
    <id>25156038</id>
    <timestamp>2009-03-28T06:38:04Z</timestamp>
    <contributor>
      <username>SajoR</username>
      <id>130444</id>
    </contributor>
    <minor />
    <comment>leve mejora</comment>
    <text xml:space="preserve">'''Montserrat Domínguez''' ([[Madrid]], [[1963]]) es una [[periodismo|periodista]] [[España|española]].
Considera que la primera obligación de un periodista es ser crítico con el poder y es optimista respecto a la situación actual del periodismo. Su trabajo le ofrece, en su opinión, "un motor de vida".
Es aficionada a la [[lectura]] y a los viajes.
Biografía
Estudió [[Ciencias de la Información]] por la [[Universidad Complutense de Madrid]]. Posteriormente cursó un Master en Periodismo por la [[Universidad de Columbia]].
So the title of the page is Gómez Plata (a municipality in Colombia), but the text is about Montserrat Domínguez, a Spanish journalist.
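To get a rough idea of how many pages are affected, something like the following sketch might help. It streams the dump and flags pages whose first bolded phrase ('''...''') differs from the page title. This is only a heuristic and an assumption on my part: many articles open with the title in bold, but not all do, so it will produce false positives. The function names `find_suspect_pages` and `scan_dump` are made up for illustration.

```python
import bz2
import re
import xml.etree.ElementTree as ET

# First '''bold''' phrase in the wikitext (heuristic: usually the article subject).
BOLD = re.compile(r"'''(.+?)'''")


def local_name(tag):
    """Drop the '{namespace}' prefix that ElementTree puts on tag names."""
    return tag.rsplit("}", 1)[-1]


def find_suspect_pages(stream):
    """Yield (title, first_bold_phrase) for pages where the two differ.

    `stream` is any binary file-like object containing the dump XML.
    """
    title = None
    for _event, elem in ET.iterparse(stream, events=("end",)):
        name = local_name(elem.tag)
        if name == "title":
            title = elem.text or ""
        elif name == "text":
            match = BOLD.search(elem.text or "")
            if title is not None and match:
                bold = match.group(1).strip()
                if bold.lower() != title.strip().lower():
                    yield title, bold
        elif name == "page":
            # Page finished: reset state and free the parsed subtree.
            title = None
            elem.clear()


def scan_dump(path):
    """Print every suspect page found in a .xml.bz2 dump at `path`."""
    with bz2.open(path, "rb") as dump:
        for title, bold in find_suspect_pages(dump):
            print(f"{title} -> {bold}")
```

Calling `scan_dump("eswiki-20090504-pages-articles.xml.bz2")` would list the candidates; for the page quoted above it should report the title Gómez Plata next to the bolded name Montserrat Domínguez. Pages without any bold text (redirects, for instance) are skipped.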
This didn't happen when I downloaded other Wikipedia dumps (en, de, nl, sv). Could someone please look into this problem? Thank you.
Version: unspecified
Severity: normal