Author: fleming
Description:
Should maintenance/postgres/mediawiki_mysql2postgres.pl use "--compatible=postgres" for mysqldump?
I was migrating a wiki from MySQL to PostgreSQL. My MySQL database (stock Debian install) contains UTF-8-encoded text, although every column of every table I checked was set to use "Character Set" "cp1252". I don't know why, and I don't know what that setting affects, but my MediaWiki installation had no problem pulling the UTF-8 data from MySQL and presenting it properly to the web browser.
mysqldump, using the options specified in mediawiki_mysql2postgres.pl, seems to take the UTF-8-encoded data, treat it as latin1 (or cp1252?), and re-encode that text as UTF-8, so the dump ends up double-encoded. When PostgreSQL loads that file into a UTF8-mode database, it decodes the UTF-8 only once, so what gets stored is the original UTF-8 byte stream with each byte treated as a separate character. The result is that the web page displays those bytes as individual characters instead of the intended text.
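To illustrate the double encoding, here is a minimal Python sketch. It only models the effect I observed and is not what mysqldump does internally. The function name double_encode is hypothetical, and the assumption is that the dump step reads the stored bytes as cp1252.

```python
def double_encode(text: str, assumed_charset: str = "cp1252") -> bytes:
    """Model the suspected dump behaviour (an assumption, not mysqldump's code)."""
    # Bytes as MediaWiki originally stored them: real UTF-8.
    stored = text.encode("utf-8")
    # The dump step is assumed to misread those bytes as cp1252 characters.
    # Note: Python's cp1252 codec rejects a few undefined bytes (e.g. 0x81),
    # so this sketch does not cover every possible input.
    misread = stored.decode(assumed_charset)
    # Those characters are then encoded as UTF-8 a second time.
    return misread.encode("utf-8")


if __name__ == "__main__":
    original = "\u00e9"  # e with acute accent, 2 bytes in UTF-8
    dumped = double_encode(original)
    # Decoding once, as a UTF8-mode database would, yields two characters.
    print(len(original), len(dumped.decode("utf-8")))  # prints "1 2"
    # Pure ASCII text passes through unchanged.
    print(double_encode("wiki") == b"wiki")  # prints "True"
```

Decoding the dumped bytes once gives the two-character sequence U+00C3 U+00A9 rather than the single accented letter, which matches the garbage I see in the browser.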
Sorry, it's hard to describe this comprehensibly yet concisely. Basically, the UTF-8 text is needlessly getting UTF-8-encoded a second time by mysqldump, and the --compatible=postgres option stops it from doing that. It might have to do with my original MySQL database thinking its columns were cp1252, but it was MediaWiki that originally created that schema (ca. Mar. 2004), so I think I'm unlikely to be the only one with this problem. I'm guessing it wouldn't affect wikis that contain only ASCII text.
Sorry if this is a duplicate bug. I was not able to find any existing bugs about mediawiki_mysql2postgres.pl.
FWIW, mediawiki_mysql2postgres.pl in MW 1.11.0 still does not use "--compatible=postgres".
Version: 1.10.x
Severity: normal