
mwdumper does not generate some page_ids
Closed, Declined | Public | Bug Report

Description

Author: zantezuken

Description:
I have an XML dump of ruwiki. To reduce the time required to import the content, I converted the XML into an SQL script. After executing that script I noticed (via MediaWiki) that some articles are missing, although the 'text' table has all the data for these articles. After some searching I found the cause: the 'page' table has no rows with these articles' names and IDs. mwdumper simply did not generate an 'INSERT INTO page' command for some articles (approx. 40% of the dump).

I used the unofficial mwdumper build from bug 18328, because it is the only up-to-date jar build I have (the official one is from 2006 and is not compatible with current XML dumps).


Version: unspecified
Severity: major
OS: Windows Vista
Platform: PC

Details

Reference
bz21917

Event Timeline

bzimport raised the priority of this task to Medium. Nov 21 2014, 10:57 PM
bzimport set Reference to bz21917.

Which xml dump is it?
Can you provide some of those missing articles?

zantezuken wrote:

It's http://download.wikimedia.org/ruwiki/20091207/ruwiki-20091207-pages-articles.xml.bz2

Article "Операционная система" for example (line number 515495 in the dump).

zantezuken wrote:

OK, here is one of the missing articles:
http://shinra.ru/kein/w/operating_system.xml (40 KB, UTF-8)

I don't know how else I can help :< It's a really annoying bug; it makes the dumps useless, since I can't import things properly ;<

I do find the insert for Операционная система in the page table:

$ bzcat ruwiki-20091207-pages-articles.xml.bz2|java -jar mwdumper.jar --format=sql:1.5 | grep -m 1 "'Операционная_система'"

INSERT INTO page (page_id,page_namespace,page_title,page_restrictions,page_counter,page_is_redirect,page_is_new,page_random,page_touched,page_latest,page_len) VALUES (3428,0,'Эмоция','',0,0,0,RAND(),DATE_ADD('1970-01-01', INTERVAL UNIX_TIMESTAMP() SECOND)+0,20389302,28622),(3432,0,'Человек_разумный','',0,0,0,RAND(),DATE_ADD('1970-01-01', INTERVAL UNIX_TIMESTAMP() SECOND)+0,20412105,62890), ...

... (4590,0,'1545_год','',0,0,0,RAND(),DATE_ADD('1970-01-01', INTERVAL UNIX_TIMESTAMP() SECOND)+0,17287964,3074),
(4591,0,'Операционная_система','',0,0,0,RAND(),DATE_ADD('1970-01-01', INTERVAL UNIX_TIMESTAMP() SECOND)+0,20354505,39406),
(4593,0,'Рим','',0,0,0,RAND(),DATE_ADD('1970-01-01', INTERVAL UNIX_TIMESTAMP() SECOND)+0,20389427,116221),(4595,0,'Двоичные_приставки','',0,0,0,RAND(),DATE_ADD('1970-01-01', INTERVAL UNIX_TIMESTAMP() SECOND)+0,19830413,15461)...

...(4904,0,'23_января','',0,0,0,RAND(),DATE_ADD('1970-01-01', INTERVAL UNIX_TIMESTAMP() SECOND)+0,20288736,13963),(4905,0,'24_января','',0,0,0,RAND(),DATE_ADD('1970-01-01', INTERVAL UNIX_TIMESTAMP() SECOND)+0,20288479,14120),(4906,0,'25_января','',0,0,0,RAND(),DATE_ADD('1970-01-01', INTERVAL UNIX_TIMESTAMP() SECOND)+0,20288290,29154),(4907,0,'26_января','',0,0,0,RAND(),DATE_ADD('1970-01-01', INTERVAL UNIX_TIMESTAMP() SECOND)+0,20313506,16559),(4908,0,'27_января','',0,0,0,RAND(),DATE_ADD('1970-01-01', INTERVAL UNIX_TIMESTAMP() SECOND)+0,20313334,14701),(4909,0,'28_января','',0,0,0,RAND(),DATE_ADD('1970-01-01', INTERVAL UNIX_TIMESTAMP() SECOND)+0,20420267,22861),(4910,0,'29_января','',0,0,0,RAND(),DATE_ADD('1970-01-01', INTERVAL UNIX_TIMESTAMP() SECOND)+0,20352730,19695);

Maybe mysql didn't accept the full line for some reason?

zantezuken wrote:

No, as I said, I didn't get an INSERT for that page at all.
Can you compile the latest mwdumper for Windows, please? Then I can test it.

I used the same mwdumper.jar as you. Jar files work cross platform. How were you looking for the insert?

zantezuken wrote:

I searched the whole dump for an INSERT into page with 'Операционная система'. I found many articles that I already have in the DB, but the articles that are missing from my current DB are missing from the generated SQL dump as well.
Well, the only thing missing is the INSERT into page; old_text, old_data and old_id are there.

Don't look for 'Операционная система'; you must look for 'Операционная_система'. It will be in DB form, with spaces converted into underscores.
There will be three instances: [[Операционная система]], which is the insert line I included above, [[Category:Операционная система]] and [[Template:Операционная система]].
All of them are "INSERT INTO page" lines, albeit really long ones.
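
The lookup described above can be sketched in Python. This is only an illustration, not part of mwdumper: the function names `title_to_dbkey` and `find_page_inserts` are made up here, and the pattern assumes the tuple layout shown in the output quoted earlier (page_id, page_namespace, page_title first). Titles containing quotes or backslashes are escaped in the SQL and would need extra handling.

```python
import re


def title_to_dbkey(title: str) -> str:
    """Convert a display title to DB form: spaces become underscores."""
    return title.strip().replace(" ", "_")


def find_page_inserts(sql_lines, title):
    """Yield (line_number, page_id, page_namespace) for every tuple in an
    'INSERT INTO page' statement whose page_title equals the DB-form title."""
    key = title_to_dbkey(title)
    # Assumed tuple layout: (page_id,page_namespace,'page_title',...
    pattern = re.compile(r"\((\d+),(\d+),'" + re.escape(key) + r"',")
    for lineno, line in enumerate(sql_lines, start=1):
        # Skip inserts into other tables such as text or revision.
        if not line.startswith("INSERT INTO page "):
            continue
        for match in pattern.finditer(line):
            yield lineno, int(match.group(1)), int(match.group(2))
```

Because the page table stores the namespace as a number in page_namespace and the title without its prefix, the article, category and template of the same name would all match, distinguished only by the namespace value in the result.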

zantezuken wrote:

Yeah, I found 'Операционная_система' in the SQL dump, but that's weird... the whole INSERT into page was skipped; I can't find any page_ids for all these articles in that INSERT script. Weird. Though I do have [[Category:Операционная система]] and [[Template:Операционная система]] ;<
Annoying.
Well, anyway, the bug is INVALID, sorry for the false report.

zantezuken wrote:

Or perhaps it is valid, since mwdumper does not generate a correct SQL dump. There are too many duplicate entries, but why? The DB is empty. It looks like mwdumper does something wrong.

zantezuken wrote:

Here is the [http://shinra.ru/kein/out.7z full log].
As you can see, all errors relate to 'rev_comment' only, so if the generated SQL
is correct there should not be any issue with missing articles. But there is ;<

Which option did you select for the database? utf-8, binary or backwards-compatible with mysql4?

Works for me with ruwiki-20100331-pages-articles.xml.

Do the tables page, revision and text all have the same number of rows? (1478943)
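
A minimal sketch of that row-count check, assuming a Python DB-API 2.0 connection to the wiki database. The helper names `row_counts` and `mismatched` are invented for this example, and which driver supplies the connection is left open.

```python
TABLES = ("page", "revision", "text")


def row_counts(conn, tables=TABLES):
    """Return {table: number of rows} using a DB-API 2.0 connection."""
    counts = {}
    cur = conn.cursor()
    for table in tables:
        # Table names come from the fixed tuple above, never from user input.
        cur.execute(f"SELECT COUNT(*) FROM {table}")
        counts[table] = cur.fetchone()[0]
    return counts


def mismatched(counts):
    """Return True when the tables do not all hold the same number of rows."""
    return len(set(counts.values())) > 1
```

If the page count comes out lower than the revision and text counts, that would match the symptom reported here: the text was imported but the page rows were not.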

Maybe that is an encoding problem. Try appending
&characterEncoding=utf8
to the --output parameter
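
As a small illustration of that suggestion, the following Python helper appends the parameter to an output URL. The helper name `with_character_encoding` and the example URL (host, database name, credentials) are placeholders invented here, not values from this report.

```python
def with_character_encoding(output_url: str, encoding: str = "utf8") -> str:
    """Append characterEncoding=<encoding> to an --output URL unless it is
    already present."""
    if "characterEncoding=" in output_url:
        return output_url
    # Use '&' when the URL already has a query string, '?' otherwise.
    separator = "&" if "?" in output_url else "?"
    return f"{output_url}{separator}characterEncoding={encoding}"


# Placeholder URL for illustration only.
example = with_character_encoding(
    "mysql://127.0.0.1/testwiki?user=wiki&password=wiki"
)
```

When typing the resulting value on a command line, quote it so the shell does not interpret the `&` characters.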

brion set Security to None.
Aklapper lowered the priority of this task from Medium to Low. Apr 23 2016, 9:10 AM
Aklapper changed the subtype of this task from "Task" to "Bug Report". Feb 6 2022, 5:56 PM
hashar subscribed.

mwdumper can no longer process dumps generated since MediaWiki 1.31 (released in June 2018). The tool was started in 2005 and is no longer maintained, so it is being archived; see T351228 for reference.