
Error while importing xml dump
Open, Normal, Public


While importing an XML dump via mwdumper I got an "incorrect string value" error. I tried changing the default charset to utf8mb4, but that seemed to break even more things.

Event Timeline

Vituzzu created this task. Dec 21 2016, 11:48 PM
Restricted Application added a subscriber: Aklapper. Dec 21 2016, 11:48 PM
ArielGlenn triaged this task as Normal priority. Dec 22 2016, 7:43 AM
ArielGlenn added a project: Dumps-Generation.

Can you give the command you were running with the exact arguments, and also a file containing the XML entry for the page causing the problem (and the xml namespace/header lines and footer lines so that it's a complete if tiny file)?

Vituzzu added a comment. Edited Dec 22 2016, 10:25 AM

I was importing the latest XML pages dump, namely itwiki-20161201-pages-meta-current.xml.bz2 (MD5 checked). You can probably find the relevant line in the dump more efficiently than I can; otherwise I'll try. The import seems to hit a character that needs a fourth byte to be encoded. I've subscribed Ori since he seems to have already dealt with a similar issue in another context.

C:\Users\USERNAME\Desktop>java -classpath C:\Users\USERNAME\Desktop\mariadb-java-client-1.5.4.jar;C:\Users\USERNAME\Desktop\mwdumper.jar org.mediawiki.dumper.Dumper --output="mysql://" --format=sql:1.5 itwiki-20161201-pages-meta-current.xml.bz2

show variables like 'char%';






hoo added a subscriber: hoo. Dec 22 2016, 10:46 AM

You actually ran across as far as I see. The page in question would be then.

As a workaround, you might want to just manually remove that from the dump or change the page title.
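To locate such pages before editing the dump, something like the following Python sketch could help. It streams the compressed dump and lists page titles containing a character above U+FFFF, which is the range that needs four bytes in UTF-8. The function names and the idea of scanning only `<title>` elements are assumptions made for illustration; they are not part of mwdumper.

```python
import bz2
import xml.etree.ElementTree as ET


def local_name(tag):
    """Strip the '{namespace}' prefix that ElementTree puts on tag names."""
    return tag.rsplit("}", 1)[-1]


def titles_needing_4_bytes(stream):
    """Yield page titles that contain a character above U+FFFF."""
    for _event, elem in ET.iterparse(stream, events=("end",)):
        name = local_name(elem.tag)
        if name == "title":
            title = elem.text or ""
            if any(ord(ch) > 0xFFFF for ch in title):
                yield title
        elif name == "page":
            # Drop the finished page's children (title, revisions) to save memory.
            elem.clear()


def scan_dump(path):
    """Hypothetical helper: scan a .xml.bz2 dump file on disk."""
    with bz2.open(path, "rb") as stream:
        return list(titles_needing_4_bytes(stream))
```

Calling `scan_dump("itwiki-20161201-pages-meta-current.xml.bz2")` would return the list of affected titles; on a full dump this can take a while, since the whole file has to be decompressed and parsed.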

I'm still puzzled: it seems to be a true 4-byte character being inserted into a column whose encoding allows at most 3 bytes per character.
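The byte counts involved can be checked with a small Python sketch. MySQL/MariaDB's `utf8` charset (utf8mb3) stores at most three bytes per character, so any character whose UTF-8 encoding is four bytes long is rejected with "incorrect string value". The helper names below are made up for illustration, and the sample characters are arbitrary, not taken from the dump.

```python
def utf8_length(ch):
    """Number of bytes needed to encode one character in UTF-8."""
    return len(ch.encode("utf-8"))


def fits_three_byte_utf8(text):
    """True if every character fits in a 3-byte-per-character utf8 column."""
    return all(utf8_length(ch) <= 3 for ch in text)


if __name__ == "__main__":
    # U+00E8, U+20AC and U+10348 are illustrative sample characters.
    for ch in ["\u00e8", "\u20ac", "\U00010348"]:
        print(f"U+{ord(ch):04X}: {utf8_length(ch)} byte(s)")
```

A binary column sidesteps the check entirely, because it stores the raw bytes without validating them against a character set.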

I eventually managed to fix it: WMF's install actually uses VARBINARY(255) instead of VARCHAR(255) (MediaWiki's default) for the page_title column. Running ALTER TABLE `page` CHANGE COLUMN `page_title` `page_title` VARBINARY(255) NOT NULL COLLATE 'binary'; did the trick. I'll keep testing and will eventually update the mwdumper guide.