mwdumper crashes on non-latin input characters
Closed, Declined | Public | Bug Report

Description

Author: jymj2002

Description:
I downloaded the latest version of the Spanish articles in XML and the latest version of mwdumper (2008-04-13):

eswiki-20080507-pages-articles.xml.bz2

and I followed the instructions to load it into a MySQL database. The exact command line I typed is:

java -client -classpath mwdumper.jar;mysql-connector-java-3.1.12-bin.jar org.mediawiki.dumper.Dumper "--output=mysql://127.0.0.1/wikidb?user=<user>&password=<password>" "--format=sql:1.5" "C:\eswiki-20080507-pages-articles.xml.bz2"

(where <user> and <password> are correctly specified).

Everything seems to work ok, the output I get is:

1.000 pages (249,004/sec), 1.000 revs (249,004/sec)

and similar lines starting with 2.000, 3.000... till it reaches the line starting with 17.000. At this point I get the following message:

17.000 pages (366,08/sec), 17.000 revs (366,08/sec)
Exception in thread "main" java.io.IOException: java.sql.SQLException: Duplicate entry '0-?' for key 2

(and then the typical exception stack trace).

I think it could be something to do with the encoding of Spanish accents (á, é, ...) or special characters such as 'ñ', so I tried creating the database with other charsets, but I get the same error.


See Also: T11279: mwdumper direct MySQL connection needs to distinguish UTF-8 and compat schemas
Version: unspecified
Severity: normal
OS: Windows XP
Platform: PC

Details

Reference
bz14379

Event Timeline

bzimport raised the priority of this task to Medium. (Nov 21 2014, 10:13 PM)
bzimport set Reference to bz14379.

Please provide the stack trace.

Offhand it's likely an encoding issue; a possibly-default "Latin-1" schema will cause failure with this direct connection as the data will be converted from UTF-8 and titles will start to conflict when non-Latin-1 chars come in. A "UTF-8" schema may similarly cause failures when a title with a non-BMP character in it comes along, as MySQL's UTF-8 charset support is incomplete.

If using the binary schema, things _should_ work.
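
The failure mode described above can be illustrated outside MySQL. The following is a minimal Python sketch, not mwdumper's or MySQL's actual code: it assumes the lossy conversion behaves like Python's `errors="replace"` encoding, and the sample titles and the helper names `to_latin1_key` and `needs_four_byte_utf8` are made up for illustration.

```python
def to_latin1_key(namespace, title):
    """Mimic a lossy UTF-8 -> Latin-1 conversion of a (namespace, title) key.

    Assumption: characters that Latin-1 cannot represent become '?',
    which is what Python's errors="replace" does when encoding.
    """
    return (namespace, title.encode("latin-1", errors="replace"))


def needs_four_byte_utf8(title):
    """True if any character lies outside the Basic Multilingual Plane."""
    return any(ord(ch) > 0xFFFF for ch in title)


# Hypothetical single-character titles, all in namespace 0.
titles = ["ñ", "Ω", "Я", "𝄞"]
keys = [to_latin1_key(0, t) for t in titles]

# 'ñ' exists in Latin-1, so its key stays distinct.
print(keys[0])
# The other three titles all collapse to the same key (0, b'?').
print(keys[1] == keys[2] == keys[3])
# Titles that a 3-byte-only UTF-8 charset could not store.
print([t for t in titles if needs_four_byte_utf8(t)])
```

If the real conversion works along these lines, any two single-character titles outside Latin-1 in namespace 0 would end up with the same unique key, which would match the shape of the `'0-?'` in the reported error.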

jymj2002 wrote:

Thank you very much for your help, but it still doesn't work...

The stack trace is:

17.000 pages (87,04/sec), 17.000 revs (87,04/sec)
Exception in thread "main" java.io.IOException: java.sql.SQLException: Duplicate entry '0-?' for key 2
at org.mediawiki.importer.XmlDumpReader.readDump(Unknown Source)
at org.mediawiki.dumper.Dumper.main(Unknown Source)
Caused by: org.xml.sax.SAXException: java.sql.SQLException: Duplicate entry '0-?' for key 2
at org.mediawiki.importer.XmlDumpReader.endElement(Unknown Source)
at org.apache.xerces.parsers.AbstractSAXParser.endElement(Unknown Source)
at org.apache.xerces.impl.XMLDocumentFragmentScannerImpl.scanEndElement(Unknown Source)
at org.apache.xerces.impl.XMLDocumentFragmentScannerImpl$FragmentContentDispatcher.dispatch(Unknown Source)
at org.apache.xerces.impl.XMLDocumentFragmentScannerImpl.scanDocument(Unknown Source)
at org.apache.xerces.parsers.XML11Configuration.parse(Unknown Source)
at org.apache.xerces.parsers.XML11Configuration.parse(Unknown Source)
at org.apache.xerces.parsers.XMLParser.parse(Unknown Source)
at org.apache.xerces.parsers.AbstractSAXParser.parse(Unknown Source)
at org.apache.xerces.jaxp.SAXParserImpl$JAXPSAXParser.parse(Unknown Source)
at org.apache.xerces.jaxp.SAXParserImpl.parse(Unknown Source)
at javax.xml.parsers.SAXParser.parse(Unknown Source)
... 2 more
Caused by: java.io.IOException: java.sql.SQLException: Duplicate entry '0-?' for key 2
at org.mediawiki.importer.SqlServerStream.writeStatement(Unknown Source)
at org.mediawiki.importer.SqlWriter.flushInsertBuffers(Unknown Source)
at org.mediawiki.importer.SqlWriter.checkpoint(Unknown Source)
at org.mediawiki.importer.SqlWriter15.updatePage(Unknown Source)
at org.mediawiki.importer.SqlWriter15.writeEndPage(Unknown Source)
at org.mediawiki.importer.MultiWriter.writeEndPage(Unknown Source)
at org.mediawiki.importer.PageFilter.writeEndPage(Unknown Source)
at org.mediawiki.importer.XmlDumpReader.closePage(Unknown Source)
... 14 more

Sorry for my inexperience, but Brion, what do you mean by a "binary schema"? I have 4 parameters which could be "binary":

  • MySQL connection collation (could be set from phpMyAdmin)
  • Database collation (set while creating the 'wikidb' database)
  • MySQLCharSet (is set to UTF-8 Unicode but I can't change it from phpMyAdmin. Should I change it? How can I change it?)
  • Database Character Set (I can set it in the MediaWiki configuration page with options: Backwards-compatible UTF-8, Experimental MySQL 4.1/5.0 UTF-8, or Experimental MySQL 4.1/5.0 binary)

I tried many configurations of these parameters but the problem persists. Could you help me, please?

Thank you very much.

*** Bug 14958 has been marked as a duplicate of this bug. ***

Try to set the default-character-set in the my.ini or my.cnf (mysql\bin) of mysql to

default-character-set="utf8"

and restart the server.

(In reply to comment #5)
> Try to set the default-character-set in the my.ini or my.cnf (mysql\bin) of
> mysql to
> default-character-set="utf8"
> and restart the server.

You can also append

&characterEncoding=UTF-8

to the --output parameter.
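
For reference, a small Python sketch of how the parameter could be appended to an existing `--output` URL. The helper name `add_character_encoding` is hypothetical; `characterEncoding` is the option named in the comment above, and the URL with its placeholder credentials is copied from the original report.

```python
def add_character_encoding(output_url, encoding="UTF-8"):
    """Append characterEncoding=<encoding> to a mysql:// output URL.

    Returns the URL unchanged if the parameter is already present.
    """
    if "characterEncoding=" in output_url:
        return output_url
    # Use '&' if the URL already has a query string, otherwise start one.
    separator = "&" if "?" in output_url else "?"
    return f"{output_url}{separator}characterEncoding={encoding}"


url = "mysql://127.0.0.1/wikidb?user=<user>&password=<password>"
print("--output=" + add_character_encoding(url))
```

As in the original command line, the whole argument needs to be quoted so that the shell does not interpret the `&` characters.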

I've hit a similar encoding bug while importing enwiki. I was piping to sql using this cmdline:

bunzip2 -c enwiki-20120104-pages-articles.xml.bz2 | mwdumper --format=sql:1.5 > out.sql
Exception in thread "main" java.io.IOException: not a name start character: "U+26"
   at org.mediawiki.importer.XmlDumpReader.readDump(mwdumper)
   at org.mediawiki.dumper.Dumper.main(mwdumper)
Caused by: org.xml.sax.SAXParseException: not a name start character: "U+26"
   at gnu.xml.stream.SAXParser.parse(libgcj.so.10)
   at javax.xml.parsers.SAXParser.parse(libgcj.so.10)
   at javax.xml.parsers.SAXParser.parse(libgcj.so.10)
   at org.mediawiki.importer.XmlDumpReader.readDump(mwdumper)
   ...1 more
Caused by: javax.xml.stream.XMLStreamException: not a name start character: "U+26"
   at gnu.xml.stream.XMLParser.error(libgcj.so.10)
   at gnu.xml.stream.XMLParser.readNmtoken(libgcj.so.10)
   at gnu.xml.stream.XMLParser.readNmtoken(libgcj.so.10)
   at gnu.xml.stream.XMLParser.readCharData(libgcj.so.10)
   at gnu.xml.stream.XMLParser.next(libgcj.so.10)
   at gnu.xml.stream.SAXParser.parse(libgcj.so.10)
   ...4 more

EDIT: This was definitely unrelated. It looks like the issue was caused by attempting to read a corrupted XML dump.
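
Since the cause here turned out to be a corrupted dump, it may be worth checking that the file decompresses and parses before starting a long import. Below is a minimal Python sketch using only the standard library. The function name `check_dump` is made up, and it only checks that the XML is well-formed, not that it matches the MediaWiki export schema.

```python
import bz2
import xml.sax


def check_dump(path):
    """Return None if the .bz2 dump decompresses and is well-formed XML,
    otherwise a short description of the first problem found."""
    try:
        with bz2.open(path, "rb") as stream:
            # ContentHandler's default callbacks do nothing, so this
            # streams through the file without building a tree.
            xml.sax.parse(stream, xml.sax.ContentHandler())
    except xml.sax.SAXParseException as exc:
        return f"XML error at line {exc.getLineNumber()}: {exc.getMessage()}"
    except (OSError, EOFError) as exc:
        # Raised by the bz2 module for invalid or truncated data.
        return f"decompression error: {exc}"
    return None
```

A call such as `check_dump("enwiki-20120104-pages-articles.xml.bz2")` reads the whole file, so on a full dump it can take a long time.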

brooke set Security to None.

Reading brion's comment again, I think we can work around the default encoding, either by explicitly declaring the encoding in SQL, or by making the &characterEncoding=UTF-8 option on the --output URL the default.

I haven't been able to reproduce the issue yet. The error is caused by the unique key on page (page_namespace, page_title) colliding when page_title is collated differently in the source and destination databases.

My SQL-fu is probably not up to the task... I thought it would be enough to define, for example, pages entitled 'Bar' and 'Bär', then damage the target database with,

alter table page convert to character set latin1 collate latin1_german1_ci;

however, the varbinary column type seems to prevent the duplicate key collision.
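
The difference between the two kinds of comparison can be sketched in Python. This is only a rough stand-in for MySQL: `collation_key` is a hypothetical helper that approximates an accent- and case-insensitive collation by stripping combining marks, which is not an exact model of `latin1_german1_ci`, and `binary_key` stands in for a `varbinary` column compared byte by byte.

```python
import unicodedata


def binary_key(title):
    """Byte-wise key, as a varbinary column would compare it."""
    return title.encode("utf-8")


def collation_key(title):
    """Rough approximation of an accent- and case-insensitive collation.

    Assumption: decomposing to NFKD and dropping combining marks is
    close enough for illustration; real MySQL collations have their
    own rules.
    """
    decomposed = unicodedata.normalize("NFKD", title)
    stripped = "".join(ch for ch in decomposed if not unicodedata.combining(ch))
    return stripped.casefold()


a, b = "Bar", "Bär"
# The byte sequences differ, so a byte-wise unique key sees two titles.
print(binary_key(a) == binary_key(b))
# Both titles fold to the same key, so a unique key on it would collide.
print(collation_key(a) == collation_key(b))
```

Under these assumptions, the collision would only appear when the title column is compared through a text collation, which is consistent with the varbinary column avoiding it.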

@awight: This issue was assigned to you two years ago.
Could you please share a status update? Are you still working (or still plan to work) on this issue? Is there anything that others could help with?
Only in case you do not plan to work on this issue anymore, should you be removed as assignee (via 'Assign / Claim' in the 'Actions' dropdown menu)?

@awight: I am resetting the assignee of this task because there have been no signs of progress lately (please correct me if I'm wrong).
Resetting the assignee avoids the impression that somebody is already working on fixing this task and it also allows anybody else to potentially work towards fixing this task.
Please claim this task again when you plan to fix this task (via 'Assign / Claim' in the 'Actions' dropdown menu) - it would be very welcome!
Thanks for your understanding!

Aklapper changed the subtype of this task from "Task" to "Bug Report". (Feb 6 2022, 5:56 PM)
hashar subscribed.

mwdumper can no longer process dumps generated since MediaWiki 1.31 (released in June 2018). The tool was started in 2005 and is no longer maintained, so it is being archived; see T351228 for reference.