
May dump - MW Dumper exception
Closed, Declined (Public)

Description

While importing the latest (May) dump, we are getting an exception during processing:

Exception in thread "main" java.lang.ArrayIndexOutOfBoundsException: 8192

at com.sun.org.apache.xerces.internal.impl.io.UTF8Reader.read(UTF8Reader.java:543)
at com.sun.org.apache.xerces.internal.impl.XMLEntityScanner.load(XMLEntityScanner.java:1742)
at com.sun.org.apache.xerces.internal.impl.XMLEntityScanner.arrangeCapacity(XMLEntityScanner.java:1619)
at com.sun.org.apache.xerces.internal.impl.XMLEntityScanner.skipString(XMLEntityScanner.java:1657)
at com.sun.org.apache.xerces.internal.impl.XMLDocumentFragmentScannerImpl.scanEndElement(XMLDocumentFragmentScannerImpl.java:1748)
at com.sun.org.apache.xerces.internal.impl.XMLDocumentFragmentScannerImpl$FragmentContentDriver.next(XMLDocumentFragmentScannerImpl.java:2939)
at com.sun.org.apache.xerces.internal.impl.XMLDocumentScannerImpl.next(XMLDocumentScannerImpl.java:647)
at com.sun.org.apache.xerces.internal.impl.XMLDocumentFragmentScannerImpl.scanDocument(XMLDocumentFragmentScannerImpl.java:511)
at com.sun.org.apache.xerces.internal.parsers.XML11Configuration.parse(XML11Configuration.java:808)
at com.sun.org.apache.xerces.internal.parsers.XML11Configuration.parse(XML11Configuration.java:737)
at com.sun.org.apache.xerces.internal.parsers.XMLParser.parse(XMLParser.java:119)
at com.sun.org.apache.xerces.internal.parsers.AbstractSAXParser.parse(AbstractSAXParser.java:1205)
at com.sun.org.apache.xerces.internal.jaxp.SAXParserImpl$JAXPSAXParser.parse(SAXParserImpl.java:522)
at javax.xml.parsers.SAXParser.parse(SAXParser.java:395)
at javax.xml.parsers.SAXParser.parse(SAXParser.java:198)
at org.mediawiki.importer.XmlDumpReader.readDump(XmlDumpReader.java:88)
at org.mediawiki.dumper.Dumper.main(Dumper.java:142)

ERROR 1064 (42000) at line 147960: You have an error in your SQL syntax; check the manual that corresponds to your MySQL server version for the right syntax to use near ''{{Administrators\' noticeboard navbox all}}\n\n== JHunterJ ==\n\nI am very sad ' at line 1

As a result, not all of the pages are being loaded into MySQL:
select count(*) from page;
+----------+
| count(*) |
+----------+
| 14795000 |
+----------+
1 row in set (3.12 sec)
versus 17497776 titles in the dump.
Has anyone else seen this issue?
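
For anyone wanting to reproduce the title count independently of mwdumper, here is a minimal sketch in Python. It assumes that each page's <title> element sits on its own line, as it does in the standard MediaWiki XML dumps; the function name count_titles is made up for illustration. It reads raw bytes, so a bad UTF-8 sequence in the dump should not interrupt the count.

```python
import bz2


def count_titles(dump_path):
    """Count lines that contain a <title> element in a bz2-compressed XML dump.

    Assumption: every page has exactly one <title> element on its own line.
    """
    count = 0
    # Read bytes, not text, so malformed UTF-8 cannot abort the scan.
    with bz2.open(dump_path, "rb") as stream:
        for line in stream:
            if b"<title>" in line:
                count += 1
    return count
```

The result can then be compared against the select count(*) from page figure above.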

Event Timeline

I've guessed at the page around which the error occurs, and put together a much smaller XML file of page content for around 100 pages around the problem area: https://people.wikimedia.org/~ariel/enwiki-pages-p45253660p45253760.xml.gz Can you check and see if you get the same error? The page referenced in the error seems to be https://en.wikipedia.org/wiki/Wikipedia:Administrators%27_noticeboard/IncidentArchive871 but it's hard to know whether that was the issue or some page a little earlier.

We tried the smaller dump https://people.wikimedia.org/~ariel/enwiki-pages-p45253660p45253760.xml.gz and it succeeded:

We got: 27 pages (42.722/sec), 27 revs (42.722/sec)

We have tried this file: https://dumps.wikimedia.org/enwiki/20170501/enwiki-20170501-pages-articles27.xml-p44163464p45663464.bz2

It failed again, this time with a different syntax error:

ERROR 1064 (42000) at line 2755: You have an error in your SQL syntax; check the manual that corresponds to your MySQL server version for the right syntax to use near ''ut' at line 1

The file with one insert

I see the last set of values in that file appears to be truncated:

(775820569,'{{Infobox mountain\n| name = Ch\'uñuna\n| photo = \n| photo_caption = \n| elevation_m = 5100\n| elevation_ref = <ref name=map>escale.minedu.gob.pe/ UGEL map of the Quispicanchi Province 1 (Cusco Region)]</ref>\n| prominence_m = \n| prominence_ref = \n| range = [[Andes]], [[Willkanuta mountain range|Willkanuta]]\n| listing = \n| location = [[Peru]], [[Cusco Region]], [[Quispicanchi Province]]\n| map = Peru\n| range_coordinates = \n| map_caption = Peru\n| map_size = 200\n| label_position = \n| coordinates = {{coord|13|33|39|S|71|08|54|W|type:mountain_region:PE_scale:100000|format=dms|display=inline,title}}\n| coordinates_ref = \n| topo = \n| type = \n| age = \n| first_ascent = \n| easiest_route = \n}}\n\'\'\'Ch\'uñuna\'\'\' ([[Quechua language|Quechua]] \'\'[[Chuño|ch\'uñu]]\'\' a freeze-dried potato, \'\'-na\'\' a [[suffix]],<ref>{{Ref Laime}}</ref> \"where \'\'ch\'uñu\'\' is made\", also spelled \'\'Chuñuna\'\') is mountain in the [[Willkanuta mountain range|Willkanuta]] mountain range in the [[Andes]] of [[Peru]], about {{convert|5100|m|ft|0}} high. It is located in the [[Cusco Region]], [[Quispicanchi Province]], on the border of the districts of [[Marcapata District|Marcapata]] and [[Ocongate District|Ocongate]]. Ch\'uñuna lies southwest of [[Anka Wachana (Quispicanchi)|Anka Wachana]] and southeast of  [[Qullqip\'unqu]] and [[Wilaquta (Cusco)|Wilaquta]].<ref name=map/>\n\n== References ==\n{{reflist}}\n\n{{DEFAULTSORT:Chununa}}\n[[Category:Mountains of Cusco Region]]\n[[Category:Mountains of Peru]]\n\n\n{{Cusco-geo-stub}}','ut

Can you verify that this happens every time at the same line number?
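
One way to check this is to look at the reported line of the generated SQL file on each run and see whether it is cut off. Below is a rough sketch, assuming that mwdumper writes each statement on a single line terminated by a semicolon (newlines inside page text appear escaped as \n in the sample above). The helper name check_statement_line is hypothetical.

```python
def check_statement_line(sql_path, line_number):
    """Return (is_complete, tail) for the 1-based line_number of sql_path.

    is_complete is True when the line ends with ';' (assumed statement
    terminator); tail holds the last 60 bytes of the line for inspection.
    """
    with open(sql_path, "rb") as stream:
        for current, line in enumerate(stream, start=1):
            if current == line_number:
                stripped = line.rstrip(b"\r\n")
                return stripped.endswith(b";"), stripped[-60:]
    raise ValueError("file has fewer than %d lines" % line_number)
```

For example, check_statement_line("sqlout.txt", 2755) would show whether line 2755 is truncated in a given run; comparing the tail across runs shows whether the cut happens at the same place.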

I've downloaded the file, checked that it's intact, run it through mwdumper, and verified that it completes, processes all pages, and writes out the above entry to completion and continues on.

The last update to mwdumper was March 7, a couple of months ago; in case you don't have the latest version, can you try with it?

To build it, run git clone https://gerrit.wikimedia.org/r/mediawiki/tools/mwdumper and then mvn package. I've been running my copy via

java -jar /mnt/hd/ariel/src/wmf/mediawiki/tools/mwdumper/mwdumper/target/mwdumper-1.25-jar-with-dependencies.jar --format=mysql:1.5 /home/ariel/wmf/dumps/mwdumper-issues/enwiki-20170501-pages-articles27.xml-p44163464p45663464.bz2 > sqlout.txt

and it's working for me, as described earlier.

I cloned and ran mwdumper against the file enwiki-20170501-pages-articles27.xml-p44163464p45663464.bz2 and it worked.
I am now going to run it against the full dump: /enwiki-20170501-pages-articles.xml.bz2

I ran mwdumper and it failed again on the full dump:

/usr/bin/java -server -classpath /data/servers/data_load/lib/commons-compress.jar:/data/servers/mwdumper/mwdumper-1.25.jar org.mediawiki.dumper.Dumper --format=sql:1.5 /data/servers/data_load/en/20170501/enwiki-20170501-pages-articles.xml.bz2 > wikimirror_en.sql^C
[root@wikimirror-article-en-m01 data_load]# /usr/bin/java -version
java version "1.8.0_45"
Java(TM) SE Runtime Environment (build 1.8.0_45-b14)
Java HotSpot(TM) 64-Bit Server VM (build 25.45-b02, mixed mode)

4,820,000 pages (1,443.016/sec), 4,820,000 revs (1,443.016/sec)
Exception in thread "main" java.io.IOException: Invalid byte 2 of 4-byte UTF-8 sequence.
at org.mediawiki.importer.XmlDumpReader.readDump(XmlDumpReader.java:95)
at org.mediawiki.dumper.Dumper.main(Dumper.java:142)
Caused by: org.xml.sax.SAXParseException; lineNumber: 316480135; columnNumber: 287; Invalid byte 2 of 4-byte UTF-8 sequence.
at org.apache.xerces.util.ErrorHandlerWrapper.createSAXParseException(Unknown Source)
at org.apache.xerces.util.ErrorHandlerWrapper.fatalError(Unknown Source)
at org.apache.xerces.impl.XMLErrorReporter.reportError(Unknown Source)
at org.apache.xerces.impl.XMLErrorReporter.reportError(Unknown Source)
at org.apache.xerces.impl.XMLDocumentFragmentScannerImpl$FragmentContentDispatcher.dispatch(Unknown Source)
at org.apache.xerces.impl.XMLDocumentFragmentScannerImpl.scanDocument(Unknown Source)
at org.apache.xerces.parsers.XML11Configuration.parse(Unknown Source)
at org.apache.xerces.parsers.XML11Configuration.parse(Unknown Source)
at org.apache.xerces.parsers.XMLParser.parse(Unknown Source)
at org.apache.xerces.parsers.AbstractSAXParser.parse(Unknown Source)
at org.apache.xerces.jaxp.SAXParserImpl$JAXPSAXParser.parse(Unknown Source)
at org.apache.xerces.jaxp.SAXParserImpl.parse(Unknown Source)
at javax.xml.parsers.SAXParser.parse(Unknown Source)
at org.mediawiki.importer.XmlDumpReader.readDump(XmlDumpReader.java:91)
... 1 more
Caused by: org.apache.xerces.impl.io.MalformedByteSequenceException: Invalid byte 2 of 4-byte UTF-8 sequence.
at org.apache.xerces.impl.io.UTF8Reader.invalidByte(Unknown Source)
at org.apache.xerces.impl.io.UTF8Reader.read(Unknown Source)
at org.apache.xerces.impl.XMLEntityScanner.load(Unknown Source)
at org.apache.xerces.impl.XMLEntityScanner.scanContent(Unknown Source)
at org.apache.xerces.impl.XMLDocumentFragmentScannerImpl.scanContent(Unknown Source)
... 11 more

Would you be able to track down the area where the exception happens? If you could get it down even to 1000 pages that would be very helpful. A first easy step would be to download the enwiki-20170501-pages-articles<number>...bz2 files and see which of them fails.
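
To help narrow this down, it may be worth checking whether the decompressed dump itself contains invalid UTF-8 at all, independently of the XML parser. The following is a minimal sketch, not part of mwdumper; the function name first_invalid_utf8 is hypothetical. It decodes the stream line by line, which works because the newline byte never occurs inside a valid multi-byte UTF-8 sequence.

```python
import bz2


def first_invalid_utf8(dump_path):
    """Return (line_number, byte_column) of the first invalid UTF-8 sequence.

    line_number is 1-based; byte_column is the 0-based byte offset within
    that line. Returns None when the whole decompressed file decodes cleanly.
    """
    with bz2.open(dump_path, "rb") as stream:
        for line_number, line in enumerate(stream, start=1):
            try:
                line.decode("utf-8")
            except UnicodeDecodeError as err:
                return line_number, err.start
    return None
```

If this reports a position, its line number can be compared with the lineNumber in the SAXParseException above. If it returns None, the file decodes as valid UTF-8, which would suggest looking at the parser or the Java setup rather than at the dump.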

Aklapper changed the task status from Open to Stalled. Jan 10 2018, 12:28 PM

Would you be able to track down the area where the exception happens?

@DianaArq: Could you answer that question, please?

Unfortunately closing this Phabricator task as no further information has been provided.

@DianaArq: After you have provided the information asked for and if this still happens, please set the status of this task back to "Open" via the Add Action...Change Status dropdown. Thanks!