
Error while importing xml dump
Open, Normal, Public

Description

While importing an XML dump via mwdumper I got an "incorrect string value" error. I tried changing the default charset to utf8mb4, but that seemed to break even more things.

Event Timeline

Vituzzu created this task. Dec 21 2016, 11:48 PM
Restricted Application added a subscriber: Aklapper. Dec 21 2016, 11:48 PM
ArielGlenn triaged this task as Normal priority. Dec 22 2016, 7:43 AM
ArielGlenn added a project: Dumps-Generation.

Can you give the command you were running with the exact arguments, and also a file containing the XML entry for the page causing the problem (and the xml namespace/header lines and footer lines so that it's a complete if tiny file)?

Vituzzu added a comment. Edited Dec 22 2016, 10:25 AM

I was importing the latest XML pages dump for it.wiki, namely itwiki-20161201-pages-meta-current.xml.bz2 (MD5 checked). You can probably find the relevant line in the dump more efficiently than I can; otherwise I'll try. The import seems to hit a https://chars.suikawiki.org/char/1050A character, which needs a fourth byte to be encoded. I've subscribed Ori since he seems to have already dealt with a similar issue in another context.

C:\Users\USERNAME\Desktop>java -classpath C:\Users\USERNAME\Desktop\mariadb-java-client-1.5.4.jar;C:\Users\USERNAME\Desktop\mwdumper.jar org.mediawiki.dumper.Dumper --output="mysql://127.0.0.1/DBNAME?user=USERNAME&password=PASSWORD" --format=sql:1.5 itwiki-20161201-pages-meta-current.xml.bz2

show variables like 'char%';

+--------------------------+---------------------------------+
| Variable_name            | Value                           |
+--------------------------+---------------------------------+
| character_set_client     | utf8                            |
| character_set_connection | utf8                            |
| character_set_database   | utf8                            |
| character_set_filesystem | binary                          |
| character_set_results    | utf8                            |
| character_set_server     | utf8                            |
| character_set_system     | utf8                            |

[...]

hoo added a subscriber: hoo. Dec 22 2016, 10:46 AM

You actually ran across https://chars.suikawiki.org/char/1D50A as far as I see. The page in question would be https://it.wikipedia.org/wiki/Discussioni_utente:%F0%9D%94%8A then.
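To check which character a percent-encoded title contains, the URL can be decoded and each character's code point and UTF-8 length printed. This is only a minimal sketch in Python; the helper name describe_title is made up for illustration and is not part of mwdumper or MediaWiki.

```python
from urllib.parse import unquote


def describe_title(percent_encoded):
    """Return (code point, UTF-8 byte length) for each character of a title.

    describe_title is a hypothetical helper used only for this illustration.
    """
    title = unquote(percent_encoded)  # unquote decodes the bytes as UTF-8 by default
    return [(f"U+{ord(ch):04X}", len(ch.encode("utf-8"))) for ch in title]


# The title part of the URL quoted above.
print(describe_title("%F0%9D%94%8A"))  # prints [('U+1D50A', 4)]
```

A length of 4 means the character lies outside the Basic Multilingual Plane, which MySQL's 3-byte utf8 charset cannot store.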

As a workaround, you might want to just manually remove that from the dump or change the page title.
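Before editing the dump by hand, it may help to list every affected title. The following is a rough sketch, assuming each <title> element sits on a single line (as it usually does in MediaWiki dumps); the function names are hypothetical, and it only inspects titles, not other fields that might also hold such characters.

```python
import bz2
import re

# Assumption: <title>...</title> opens and closes on the same line.
TITLE_RE = re.compile(r"<title>(.*?)</title>")


def has_supplementary_char(text):
    """True if text contains a code point above U+FFFF (4 bytes in UTF-8)."""
    return any(ord(ch) > 0xFFFF for ch in text)


def titles_needing_four_bytes(lines):
    """Yield page titles that contain at least one 4-byte character."""
    for line in lines:
        match = TITLE_RE.search(line)
        if match and has_supplementary_char(match.group(1)):
            yield match.group(1)


def scan_dump(path):
    """Stream a .xml.bz2 dump line by line and collect the affected titles."""
    with bz2.open(path, "rt", encoding="utf-8") as dump:
        return list(titles_needing_four_bytes(dump))


# Example call (not executed here):
# scan_dump("itwiki-20161201-pages-meta-current.xml.bz2")
```

Titles are returned as they appear in the XML, so entities such as &amp;amp; are not unescaped.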

I'm still puzzled: it seems to be a true 4-byte character being inserted into a column whose encoding allows at most 3 bytes per character.

I eventually managed to fix it: WMF's install actually uses VARBINARY(255) instead of VARCHAR(255) (MediaWiki's default) for the page_title column. Running ALTER TABLE `page` CHANGE COLUMN `page_title` `page_title` VARBINARY(255) NOT NULL COLLATE 'binary'; did the trick. I'll keep testing and will eventually update mediawiki.org's mwdumper guide.