
Error while importing xml dump
Closed, Declined | Public | Bug Report

Description

While importing an XML dump via mwdumper I got an "incorrect string value" error. I tried changing the default charset to utf8mb4, but that seemed to break even more things.

Clipboard01.png (586×641 px, 1 MB)

Event Timeline

ArielGlenn triaged this task as Medium priority. Dec 22 2016, 7:43 AM
ArielGlenn added a project: Dumps-Generation.

Can you give the command you were running with the exact arguments, and also a file containing the XML entry for the page causing the problem (and the xml namespace/header lines and footer lines so that it's a complete if tiny file)?

I was importing the latest it.wiki pages dump, namely itwiki-20161201-pages-meta-current.xml.bz2 (MD5 checked). You can probably find the relevant line in the dump more efficiently than I can; otherwise I'll try. It seems to hit a https://chars.suikawiki.org/char/1050A character, which needs a fourth byte to be encoded. I've subscribed Ori since he seems to have already dealt with a similar issue in another context.

C:\Users\USERNAME\Desktop>java -classpath C:\Users\USERNAME\Desktop\mariadb-java-client-1.5.4.jar;C:\Users\USERNAME\Desktop\mwdumper.jar org.mediawiki.dumper.Dumper --output="mysql://127.0.0.1/DBNAME?user=USERNAME&password=PASSWORD" --format=sql:1.5 itwiki-20161201-pages-meta-current.xml.bz2

show variables like 'char%';

+--------------------------+---------------------------------+
| Variable_name            | Value                           |
+--------------------------+---------------------------------+
| character_set_client     | utf8                            |
| character_set_connection | utf8                            |
| character_set_database   | utf8                            |
| character_set_filesystem | binary                          |
| character_set_results    | utf8                            |
| character_set_server     | utf8                            |
| character_set_system     | utf8                            |

[...]

You actually ran across https://chars.suikawiki.org/char/1D50A as far as I see. The page in question would be https://it.wikipedia.org/wiki/Discussioni_utente:%F0%9D%94%8A then.
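For anyone wanting to double-check the identification, here is a minimal Python sketch. It decodes the percent-encoded part of the page URL above and counts the UTF-8 bytes of the resulting character. The helper name `utf8_width` and the constant `MAX_UTF8MB3_BYTES` are made up for illustration; the 3-byte limit refers to MySQL/MariaDB's legacy `utf8` charset.

```python
from urllib.parse import unquote

# MySQL/MariaDB's legacy "utf8" charset stores at most 3 bytes per character.
MAX_UTF8MB3_BYTES = 3


def utf8_width(ch: str) -> int:
    """Return how many bytes a string occupies when encoded as UTF-8."""
    return len(ch.encode("utf-8"))


# Percent-encoded user name taken from the page URL quoted in this thread.
title = unquote("%F0%9D%94%8A")

code_point = f"U+{ord(title):04X}"
fits = utf8_width(title) <= MAX_UTF8MB3_BYTES

print(code_point, utf8_width(title), fits)  # prints "U+1D50A 4 False"
```

The character lies outside the Basic Multilingual Plane, so it cannot be stored in a column limited to 3 bytes per character.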

As a workaround, you might want to just manually remove that from the dump or change the page title.
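To locate such titles before editing the dump, something along these lines might help. It is only a sketch: it assumes each `<title>` element sits on a single line of the dump, and the function names (`has_astral_char`, `astral_titles`, `scan_dump`) are hypothetical, not part of mwdumper or MediaWiki.

```python
import bz2
import re
from typing import Iterable, Iterator, List

# Assumes <title>...</title> never spans more than one line.
TITLE_RE = re.compile(r"<title>(.*?)</title>")


def has_astral_char(text: str) -> bool:
    """True if any code point lies above U+FFFF (4 bytes in UTF-8)."""
    return any(ord(ch) > 0xFFFF for ch in text)


def astral_titles(lines: Iterable[str]) -> Iterator[str]:
    """Yield page titles that contain at least one 4-byte character."""
    for line in lines:
        match = TITLE_RE.search(line)
        if match and has_astral_char(match.group(1)):
            yield match.group(1)


def scan_dump(path: str) -> List[str]:
    """Stream a .xml.bz2 dump and collect the offending titles."""
    with bz2.open(path, "rt", encoding="utf-8") as handle:
        return list(astral_titles(handle))
```

Calling `scan_dump("itwiki-20161201-pages-meta-current.xml.bz2")` would stream the compressed file line by line without unpacking it to disk first. It only reports titles and does not look at page text.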

I'm still puzzled: it seems to be a genuine 4-byte character being inserted into a column whose encoding allows at most 3 bytes per character.

I eventually managed to fix it: WMF's install actually uses VARBINARY(255) for the page_title column instead of VARCHAR(255) (as in MediaWiki's default). An ALTER TABLE `page` CHANGE COLUMN `page_title` `page_title` VARBINARY(255) NOT NULL COLLATE 'binary'; did the trick. I'll keep testing and will eventually update mediawiki.org's mwdumper guide.

Aklapper changed the subtype of this task from "Task" to "Bug Report". Feb 6 2022, 5:56 PM
hashar subscribed.

mwdumper can no longer process dumps generated since MediaWiki 1.31 (released in June 2018). The tool was started in 2005 and is no longer maintained; it is therefore being archived, see T351228 for reference.