
PHP Warning: XMLReader::read():
Closed, ResolvedPublicPRODUCTION ERROR

Description

Error

MediaWiki version: 1.36.0-wmf.34

message
PHP Warning: XMLReader::read():

or

message2
PHP Warning: XMLReader::read(): uploadsource://fe5d561d543c83217faaa6401ab0bea6:1: parser error : Extra content at the end of the document

Impact

Log spam.

Notes

About 1000 of these in the last 2 hours. All against trv.wikipedia.org. Looks like an automated process.

Details

Request ID
YFNtTdItSiQp5cp5Nq5JQgAAABM
Request URL
https://trv.wikipedia.org/w/api.php
Stack Trace
exception.trace
from /srv/mediawiki/php-1.36.0-wmf.34/includes/import/WikiImporter.php(615)
#0 [internal function]: MWExceptionHandler::handleError(integer, string, string, integer, array)
#1 /srv/mediawiki/php-1.36.0-wmf.34/includes/import/WikiImporter.php(615): XMLReader->read()
#2 /srv/mediawiki/php-1.36.0-wmf.34/includes/api/ApiImport.php(97): WikiImporter->doImport()
#3 /srv/mediawiki/php-1.36.0-wmf.34/includes/api/ApiMain.php(1612): ApiImport->execute()
#4 /srv/mediawiki/php-1.36.0-wmf.34/includes/api/ApiMain.php(591): ApiMain->executeAction()
#5 /srv/mediawiki/php-1.36.0-wmf.34/includes/api/ApiMain.php(562): ApiMain->executeActionWithErrorHandling()
#6 /srv/mediawiki/php-1.36.0-wmf.34/api.php(90): ApiMain->execute()
#7 /srv/mediawiki/php-1.36.0-wmf.34/api.php(45): wfApiMain()
#8 /srv/mediawiki/w/api.php(3): require(string)
#9 {main}

Event Timeline

Urbanecm subscribed.

Interesting.... I doubt this has anything to do with Tool-wiki-importer, as that's a Toolforge tool that makes use of the MW importing API. The tool is used to import a new Wikipedia from incubator.wikimedia.org to xx.wikipedia.org. Creating new projects happens only occasionally, so I wouldn't rely on timing to decide when this started happening. Moving to Radar on the tool's workboard and tagging MediaWiki-Core-Snapshots, as the issue is in MW core.

The first request in the sequence for trvwiki is reqId:YFNWt-vAsYz6D7qdpaX-rgAAAAA with the rather odd message

PHP Warning: XMLReader::read(): <mediawiki xmlns="http://www.mediawiki.org/xml/export-0.10/" xmlns:xsi="http://w

I presume it's truncated, but clearly it didn't like something in the XML. It might be nice to start there somehow.
Stack trace for completeness:

from /srv/mediawiki/php-1.36.0-wmf.34/includes/import/WikiImporter.php(587)
#0 [internal function]: MWExceptionHandler::handleError(integer, string, string, integer, array)
#1 /srv/mediawiki/php-1.36.0-wmf.34/includes/import/WikiImporter.php(587): XMLReader->read()
#2 /srv/mediawiki/php-1.36.0-wmf.34/includes/import/WikiImporter.php(884): WikiImporter->nodeContents()
#3 /srv/mediawiki/php-1.36.0-wmf.34/includes/import/WikiImporter.php(832): WikiImporter->handleRevision(array)
#4 /srv/mediawiki/php-1.36.0-wmf.34/includes/import/WikiImporter.php(649): WikiImporter->handlePage()
#5 /srv/mediawiki/php-1.36.0-wmf.34/includes/api/ApiImport.php(97): WikiImporter->doImport()
#6 /srv/mediawiki/php-1.36.0-wmf.34/includes/api/ApiMain.php(1612): ApiImport->execute()
#7 /srv/mediawiki/php-1.36.0-wmf.34/includes/api/ApiMain.php(591): ApiMain->executeAction()
#8 /srv/mediawiki/php-1.36.0-wmf.34/includes/api/ApiMain.php(562): ApiMain->executeActionWithErrorHandling()
#9 /srv/mediawiki/php-1.36.0-wmf.34/api.php(90): ApiMain->execute()
#10 /srv/mediawiki/php-1.36.0-wmf.34/api.php(45): wfApiMain()
#11 /srv/mediawiki/w/api.php(3): require(string)
#12 {main}

Interesting.... I doubt this has anything to do with Tool-wiki-importer, as that's a Toolforge tool that makes use of the MW importing API.

But where does it get its XML from? XML is usually not user input, so it seems to me that the log entry is likely correct, and the problem is with whatever is generating the XML, not with import.

Interesting.... I doubt this has anything to do with Tool-wiki-importer, as that's a Toolforge tool that makes use of the MW importing API.

But where does it get its XML from? XML is usually not user input, so it seems to me that the log entry is likely correct, and the problem is with whatever is generating the XML, not with import.

From MW's export feature. The tool takes pages from incubator, cleans the XML (applies a few regexes to remove prefixes, etc.) and imports it. The logic is in https://github.com/wikimedia/labs-tools-wiki-importer/blob/master/src/app.py#L202.

Processed XMLs that were imported can be found under /data/project/wiki-importer/data on Toolforge (not yet automatically deleted, for reasons much like those behind this task).

@jhsoby developed the XML cleanup logic, I developed the rest of the tool.
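The cleanup described above amounts to something like this (a minimal sketch; these patterns and the `trv` prefix are hypothetical illustrations, not the tool's actual regexes from app.py):

```python
import re

def clean_incubator_xml(xml_text):
    # Hypothetical cleanup: strip the incubator "Wp/trv/" page prefix
    # from <title> elements and from wikilinks in the page text.
    # Running regexes over the serialized XML string is exactly the
    # approach the next comment objects to: a pattern that happens to
    # match across markup can leave the document ill-formed.
    xml_text = re.sub(r'<title>Wp/trv/', '<title>', xml_text)
    xml_text = re.sub(r'\[\[Wp/trv/([^]|]+)', r'[[\1', xml_text)
    return xml_text

print(clean_incubator_xml('<title>Wp/trv/Foo</title>'))  # <title>Foo</title>
```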

That's not a safe way to process XML.

tstarling@tools-sgebastion-07:/data/project/wiki-importer/data/trvwiki$ for f in *.xml; do
> xmllint --noout "$f"
> done
02aed0c4a5971800bc1f5978a11777f6.xml:3512: parser error : Extra content at the end of the document
<mediawiki xmlns="http://www.mediawiki.org/xml/export-0.10/" xmlns:xsi="http://w
^
037ea4c6058871c008d91ea755681db7.xml:223: parser error : Extra content at the end of the document
<mediawiki xmlns="http://www.mediawiki.org/xml/export-0.10/" xmlns:xsi="http://w
^
0907ef57082964dc6158b0531bee6f4c.xml:178: parser error : Extra content at the end of the document
<mediawiki xmlns="http://www.mediawiki.org/xml/export-0.10/" xmlns:xsi="http://w
^
0a7a655c265cc8f84cae7a69aac80af6.xml:147: parser error : Extra content at the end of the document
<mediawiki xmlns="http://www.mediawiki.org/xml/export-0.10/" xmlns:xsi="http://w
^
...

You should use SAX or expat. With a streaming XML parser it should be possible to modify text nodes and reserialize without storing a DOM tree in memory.
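The streaming approach suggested above could look roughly like this (a sketch using Python's built-in `xml.sax`; `transform_text` is a hypothetical stand-in for the tool's prefix-stripping logic):

```python
import io
import xml.sax
from xml.sax.saxutils import XMLGenerator

def transform_text(text):
    # Hypothetical stand-in for the tool's cleanup rules.
    return text.replace("Wp/trv/", "")

class RewritingHandler(XMLGenerator):
    """Stream the document through, rewriting only text nodes.

    XMLGenerator reserializes SAX events as they arrive, so no DOM
    tree is held in memory and the output stays well-formed: markup
    is passed through untouched, and rewritten text is re-escaped."""
    def characters(self, content):
        super().characters(transform_text(content))

src = io.StringIO('<page><title>Wp/trv/Foo</title><text>Wp/trv/Bar</text></page>')
out = io.StringIO()
xml.sax.parse(src, RewritingHandler(out, encoding="utf-8"))
print(out.getvalue())
```

Because only text-node events are touched, there is no way for the cleanup to delete or mangle a tag, which is the failure mode of the regex-on-serialized-XML approach.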

Sorry, made the same mistake. I'm not sure how the jump was made from the prod error to this Toolforge tool, but I'm guessing it was in the referrer or user-agent string. I agree the input seems invalid and that should be fixed.

However, I'm not adding it back. Looking closely here, this remains a production error. If the data we pass to XMLReader is expected to sometimes be invalid, we should silence this production warning. That's what this task is about.

Is there a good way to validate the xml on processing or before processing?

The API's action=import and Special:Import both catch exceptions from the import process and report a failure to the user, but this is a PHP warning, which still gets logged.

LIBXML_NOERROR | LIBXML_NOWARNING exists, but isn't used here. Maybe that's a way to stop the log spam?

Sorry, made the same mistake. I'm not sure how the jump was made from the prod error to this Toolforge tool, but I'm guessing it was in the referrer or user-agent string. I agree the input seems invalid and that should be fixed.

The import log on trvwiki only has @jhsoby's import with WikiImporter 1.2, then @Urbanecm's import with WikiImporter 1.3. So it must have been that tool. The source of the tool shows that it is still broken.

However, I'm not adding it back. Looking closely here, this remains a production error. If the data we pass to XMLReader is expected to sometimes be invalid, we should silence this production warning. That's what this task is about.

We shouldn't just throw away the error and accept invalid XML. An error should be shown to the user. I think it's reasonable to also log errors, especially in the interwiki case, since if Special:Export generates invalid XML, we presumably want to know about it. But maybe there should be a special channel.

Is there a good way to validate the xml on processing or before processing?

We could validate the file contents and then rewind the stream. Even in the HTTP case, a temporary file is used and so the stream is seekable. Just ingest with XMLReader while capturing errors. There is XMLReader::isValid().
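The validate-then-rewind idea sketched in Python (expat here stands in for the PHP XMLReader pass; the actual fix would be in MW core):

```python
import io
import xml.parsers.expat

def is_well_formed(stream):
    """One full parse pass purely to check well-formedness, then
    rewind so the real import can read from the start again."""
    parser = xml.parsers.expat.ParserCreate()
    try:
        parser.ParseFile(stream)
        return True
    except xml.parsers.expat.ExpatError:
        return False
    finally:
        stream.seek(0)  # possible because the upload is a seekable temp file

good = io.BytesIO(b'<mediawiki><page/></mediawiki>')
bad = io.BytesIO(b'<mediawiki/><page/>')  # extra content after the root
print(is_well_formed(good), is_well_formed(bad))  # True False
```

The malformed example reproduces the "Extra content at the end of the document" class of error from the logs: a complete root element followed by trailing content.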

Is there a good way to validate the xml on processing or before processing?

We could validate the file contents and then rewind the stream. Even in the HTTP case, a temporary file is used and so the stream is seekable. Just ingest with XMLReader while capturing errors. There is XMLReader::isValid().

That sounds like a way, but it would count against the PHP timeout as well; for huge documents that could be problematic. Also, unclosed documents exist from some other bugs (T31961). I'm not sure whether such documents are processed correctly after validation, but it seems that when an import fails, not all revisions are rolled back.

isValid() checks for validity in the XML sense: it validates against the DTD, which we don't need. We only need to check for well-formedness.

That sounds like a way, but it would count against the PHP timeout as well; for huge documents that could be problematic.

A quick benchmark shows that my laptop can validate about 200MB of MediaWiki XML per second.
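A throughput check along these lines (a sketch only, not the benchmark actually run; the synthetic document, page shape, and sizes are made up):

```python
import io
import time
import xml.parsers.expat

# Build a synthetic export-like document of a few MB.
page = ('<page><title>T</title><revision><text>'
        + 'x' * 1000 + '</text></revision></page>')
doc = ('<mediawiki>' + page * 5000 + '</mediawiki>').encode()

# Time a single well-formedness pass over the whole document.
parser = xml.parsers.expat.ParserCreate()
start = time.perf_counter()
parser.ParseFile(io.BytesIO(doc))
elapsed = time.perf_counter() - start

print(f"{len(doc) / 1e6:.1f} MB in {elapsed:.3f}s "
      f"({len(doc) / 1e6 / elapsed:.0f} MB/s)")
```

At hundreds of MB/s, the extra validation pass adds a negligible fraction of the request's time budget for any realistic import.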

Also, unclosed documents exist from some other bugs (T31961). I'm not sure whether such documents are processed correctly after validation, but it seems that when an import fails, not all revisions are rolled back.

I think it would be fine to reject those documents. It's not possible to keep a transaction open during an import of many revisions, because the long lock times would risk bringing the site down.

Change 881720 had a related patch set uploaded (by Umherirrender; author: Umherirrender):

[mediawiki/core@master] import: Add a syntax check for xml imports

https://gerrit.wikimedia.org/r/881720

Change 881720 merged by jenkins-bot:

[mediawiki/core@master] import: Add a syntax check for xml imports

https://gerrit.wikimedia.org/r/881720