
Error while importing xml dump
Closed, Declined | Public | Bug Report


While importing an XML dump via mwdumper I got an "incorrect string value" error. I tried changing the default charset to utf8mb4, but that seemed to break even more things.

Clipboard01.png (586×641 px, 1 MB)

Event Timeline

ArielGlenn triaged this task as Medium priority. Dec 22 2016, 7:43 AM
ArielGlenn added a project: Dumps-Generation.

Can you give the command you were running with the exact arguments, and also a file containing the XML entry for the page causing the problem (and the xml namespace/header lines and footer lines so that it's a complete if tiny file)?

I was importing the latest XML pages dump, namely itwiki-20161201-pages-meta-current.xml.bz2 (MD5 checked). I think you can look for the relevant line in the dump more efficiently than I can; otherwise I'll try. The import seems to hit a character that needs a fourth byte to be encoded. I've subscribed Ori since he seems to have already dealt with a similar issue in another context.

C:\Users\USERNAME\Desktop>java -classpath C:\Users\USERNAME\Desktop\mariadb-java-client-1.5.4.jar;C:\Users\USERNAME\Desktop\mwdumper.jar org.mediawiki.dumper.Dumper --output="mysql://" --format=sql:1.5 itwiki-20161201-pages-meta-current.xml.bz2

show variables like 'char%';






You actually ran across as far as I see. The page in question would be then.

As a workaround, you might want to just manually remove that from the dump or change the page title.
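
To find which titles the workaround would apply to, the dump can be scanned for page titles that contain a character outside the Basic Multilingual Plane. The following is only a rough sketch, not part of mwdumper: the function names `titles_with_four_byte_chars` and `scan_dump` are made up for illustration, and it assumes the dump is a standard bz2-compressed MediaWiki XML export with `<page>` and `<title>` elements.

```python
import bz2
import xml.etree.ElementTree as ET


def titles_with_four_byte_chars(stream):
    """Yield page titles that contain a code point above U+FFFF.

    Such characters take 4 bytes in UTF-8. `stream` is any binary
    file-like object holding MediaWiki export XML (assumption).
    """
    for _event, elem in ET.iterparse(stream, events=("end",)):
        # Dump elements carry an XML namespace, so compare the local name only.
        local_name = elem.tag.rsplit("}", 1)[-1]
        if local_name == "title":
            title = elem.text or ""
            if any(ord(ch) > 0xFFFF for ch in title):
                yield title
        elif local_name == "page":
            # Drop the finished page's content to keep memory use low.
            elem.clear()


def scan_dump(path):
    """Open a .xml.bz2 dump (hypothetical helper) and list offending titles."""
    with bz2.open(path, "rb") as stream:
        return list(titles_with_four_byte_chars(stream))
```

Calling `scan_dump("itwiki-20161201-pages-meta-current.xml.bz2")` would then return the candidate titles. A full pages-meta-current dump is large, so expect the scan to take a while.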

I'm still puzzled; it seems to be a true 4-byte character being inserted into a column whose encoding allows at most 3 bytes per character.
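
For illustration, the 3-byte versus 4-byte distinction can be checked directly. MySQL's legacy `utf8` charset stores at most 3 bytes per character, which covers code points up to U+FFFF only; anything above that needs 4 bytes in UTF-8. The helper names below are hypothetical and the sample strings are not taken from the dump.

```python
def utf8_length(ch):
    """Return how many bytes a single character takes in UTF-8."""
    return len(ch.encode("utf-8"))


def four_byte_chars(text):
    """Return the characters in `text` above U+FFFF (4 bytes in UTF-8)."""
    return [ch for ch in text if ord(ch) > 0xFFFF]


def fits_three_byte_utf8(text):
    """True if every character fits a 3-byte-per-character column."""
    return not four_byte_chars(text)


if __name__ == "__main__":
    # Sample characters chosen for illustration only.
    for ch in ("a", "\u00e8", "\u20ac", "\U00010348"):
        print(f"U+{ord(ch):04X} takes {utf8_length(ch)} byte(s)")
```

A title for which `fits_three_byte_utf8` returns False is one that a 3-byte `utf8` VARCHAR column would be expected to reject with an "incorrect string value" error.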

I eventually managed to fix it: WMF's install actually uses VARBINARY(255) instead of VARCHAR(255) (MediaWiki's default) for the page_title column. An ALTER TABLE `page` CHANGE COLUMN `page_title` `page_title` VARBINARY(255) NOT NULL COLLATE 'binary'; did the trick. I'll keep testing and will eventually update the mwdumper guide.

Aklapper changed the subtype of this task from "Task" to "Bug Report". Feb 6 2022, 5:56 PM
hashar subscribed.

mwdumper is no longer able to process dumps generated since MediaWiki 1.31 (released in June 2018). The tool was started in 2005 and is no longer maintained; it is therefore being archived. See T351228 for reference.