GWToolset can not determine the file extension from the file URL
Closed, Resolved, Public

Description

Hello. After asking for a domain to be whitelisted (T250646), I prepared an XML file of nearly 1,800 records. After completing all the steps, when I clicked "Preview batch" in GWToolset, I got this message:
"The file extension could not be determined from the file URL: https://media.api.aucklandmuseum.com/id/media/v/614581"
I then tried manually uploading one of the files listed in my XML file, and that worked: https://commons.wikimedia.org/wiki/File:Tucetona_laticostata_(Quoy_and_Gairmard,_1835)_(AM_MA98708-1).jpg
To reproduce the issue, here is one example record from my XML file:

<?xml version="1.0" encoding="UTF-8"?>
<metadata xmlns:dc="http://purl.org/dc/elements/1.1/">
	<record>
		<taxon>Tucetona laticostata</taxon>
		<authority>(Quoy &amp; Gairmard, 1835)</authority>
		<description>Tucetona laticostata (Quoy &amp; Gairmard, 1835)</description>
		<identifier>MA98708</identifier>
		<institution>Auckland War Memorial Museum</institution>
		<license>http://creativecommons.org/licenses/by/4.0/</license>
		<source>https://www.aucklandmuseum.com/collections-research/collections/record/am_naturalsciences-object-220360</source>
		<title>Tucetona laticostata (Quoy and Gairmard, 1835) (AM MA98708-2)</title>
		<url>https://media.api.aucklandmuseum.com/id/media/v/614581</url>
	</record>
</metadata>

And the following metadata mappings can be used:
https://commons.wikimedia.org/wiki/GWToolset:Metadata_Mappings/Christian_Ferrer/Invertebrate_Zoology_Yale.json

Event Timeline

Reedy subscribed.
	/**
	 * attempts to get the file extension of a media file url using the
	 * $options provided. it will first look for a valid file extension in the
	 * url; if none is found it will fallback to an appropriate file extention
	 * based on the content-type
	 *
	 * @param array $options
	 *   ['url'] final url to the media file
	 *   ['content-type'] content-type of that final url
	 *
	 * @throws GWTException
	 * @return null|string
	 */
	protected function getFileExtension( array $options ) {
		global $wgFileExtensions;
		$result = null;

		if ( empty( $options['url'] ) ) {
			throw new GWTException(
				[
					'gwtoolset-mapping-media-file-url-bad' =>
					[ $options['url'], '' ]
				]
			);
		}

		if ( empty( $options['content-type'] ) ) {
			throw new GWTException(
				[
					'gwtoolset-mapping-media-file-no-content-type' =>
					[ $options['url'] ]
				]
			);
		}

		$pathinfo = pathinfo( $options['url'] );
		$mimeAnalyzer = \MediaWiki\MediaWikiServices::getInstance()->getMimeAnalyzer();

		if ( !empty( $pathinfo['extension'] )
			&& in_array( $pathinfo['extension'], $wgFileExtensions )
			&& strpos( $mimeAnalyzer->getTypesForExtension( $pathinfo['extension'] ),
					$options['content-type']
				) !== false
		) {
			$result = $pathinfo['extension'];
		} elseif ( !empty( $options['content-type'] ) ) {
			$result = explode( ' ', $mimeAnalyzer->getExtensionsForType( $options['content-type'] ) );

			if ( !empty( $result ) ) {
				$result = $result[0];
			}
		}

		return $result;
	}

It determines the extension from the URL or, failing that, from the content-type. That URL obviously has no extension.
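
As a quick illustration of why the URL branch cannot succeed here, this is a minimal sketch of what PHP's `pathinfo()` returns for the failing URL. The second URL, on `example.org`, is a made-up comparison case and does not come from the report.

```php
<?php
// The media URL from the report: the last path segment has no dot.
$url = 'https://media.api.aucklandmuseum.com/id/media/v/614581';
$pathinfo = pathinfo( $url );

// pathinfo() omits the 'extension' key entirely when the basename has no dot,
// so the !empty( $pathinfo['extension'] ) check in getFileExtension() fails.
var_dump( isset( $pathinfo['extension'] ) ); // bool(false)

// Hypothetical comparison URL that does end in a file name with an extension.
$withExt = pathinfo( 'https://example.org/media/614581.jpg' );
var_dump( $withExt['extension'] ); // string(3) "jpg"
```

So for this URL the outcome depends entirely on the content-type branch.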

% curl -I https://media.api.aucklandmuseum.com/id/media/v/614581  
HTTP/1.1 200 OK
Access-Control-Allow-Headers: Content-Type, X-Requested-With, Accept, Origin, Access-Control-Request-Method, Access-Control-Request-Headers
Access-Control-Allow-Methods: GET, POST
Access-Control-Allow-Origin: *
Cache-Control: max-age=3600, public
Content-Length: 21530270
Content-Type: image/jpeg;charset=utf-8
Date: Fri, 15 May 2020 19:04:58 GMT
Last-Modified: Fri, 10 Aug 2018 01:47:04 GMT
Server: Apache-Coyote/1.1
Connection: keep-alive

The Content-Type header is present, but the charset=utf-8 parameter seems to upset the lookup:

reedy@deploy1001:~$ mwscript eval.php enwiki
> $mimeAnalyzer = \MediaWiki\MediaWikiServices::getInstance()->getMimeAnalyzer();

> var_dump( $mimeAnalyzer->getExtensionsForType( 'image/jpeg;charset=utf-8' ) );
NULL

> var_dump( $mimeAnalyzer->getExtensionsForType( 'image/jpeg' ) );
string(25) "jpeg jpg jpe jpeg jpg jpe"

>

getExtensionsForType is very simple: it does an exact lookup, so if the string doesn't match, it won't find anything. Not sure offhand whether this is a bug in getExtensionsForType or whether the wrong function is being called...

	public function getExtensionsForType( $mime ) {
		$mime = strtolower( $mime );

		// Check the mime-to-ext map
		if ( isset( $this->mimeToExt[$mime] ) ) {
			return $this->mimeToExt[$mime];
		}

		// Resolve the MIME type to the canonical type
		if ( isset( $this->mimeTypeAliases[$mime] ) ) {
			$mime = $this->mimeTypeAliases[$mime];
			if ( isset( $this->mimeToExt[$mime] ) ) {
				return $this->mimeToExt[$mime];
			}
		}

		return null;
	}
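
The exact-key behaviour can be reproduced without MediaWiki. The sketch below is a simplified stand-in, not the real class: `$mimeToExt` and `lookupExtensions()` are hypothetical names, the map has a single illustrative entry, and the alias step is left out.

```php
<?php
// Hypothetical one-entry stand-in for MimeAnalyzer's mime-to-extension map.
$mimeToExt = [ 'image/jpeg' => 'jpeg jpg jpe' ];

/**
 * Simplified imitation of the lookup above: lowercase the input,
 * then require an exact key match.
 */
function lookupExtensions( array $map, string $mime ): ?string {
	$mime = strtolower( $mime );
	return $map[$mime] ?? null;
}

// The full header value includes a parameter, so the key does not exist.
var_dump( lookupExtensions( $mimeToExt, 'image/jpeg;charset=utf-8' ) ); // NULL

// The bare media type matches.
var_dump( lookupExtensions( $mimeToExt, 'image/jpeg' ) );
```

Any parameter appended to the media type turns it into a different array key, which matches the NULL seen in the eval.php session.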

I do see this in detectMimeType (called via guessMimeType):

			$m = preg_replace( '![;, ].*$!', '', $m ); # strip charset, etc

But that seems to rely on the file existing on disk...
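
Whatever detectMimeType itself needs, the substitution is plain string work, so the same pattern can be tried directly on a header value. This is only a sketch of that idea, not the merged change: `stripMimeParameters()` is a hypothetical helper name, and the `text/html` input is a made-up extra case.

```php
<?php
/**
 * Hypothetical helper: drop everything from the first ';', ',' or space
 * onwards, using the same pattern quoted from detectMimeType.
 */
function stripMimeParameters( string $contentType ): string {
	return preg_replace( '![;, ].*$!', '', $contentType );
}

// Header value returned by the Auckland Museum media server.
var_dump( stripMimeParameters( 'image/jpeg;charset=utf-8' ) ); // string(10) "image/jpeg"

// A parameter separated by "; " is removed as well.
var_dump( stripMimeParameters( 'text/html; charset=UTF-8' ) ); // string(9) "text/html"

// A bare media type passes through unchanged.
var_dump( stripMimeParameters( 'image/png' ) ); // string(9) "image/png"
```

A value cleaned this way is the form that getExtensionsForType matched in the eval.php session above.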

Change 596736 had a related patch set uploaded (by Reedy; owner: Reedy):
[mediawiki/extensions/GWToolset@master] Remove charset from Content-Type

https://gerrit.wikimedia.org/r/596736

Change 596736 merged by jenkins-bot:
[mediawiki/extensions/GWToolset@master] Strip charset from Content-Type before parsing as mime type

https://gerrit.wikimedia.org/r/596736

Reedy claimed this task.