GWToolset can not determine the file extension from the file URL
Closed, Resolved · Public

Authored By

	ChristianFerrer
	May 15 2020, 7:02 PM

Description

Hello. After asking for a domain to be whitelisted (T250646), I prepared an XML file (nearly 1800 records). After completing all the steps, when I clicked "Preview batch" in GWToolset, I got this kind of message:
"The file extension could not be determined from the file URL: https://media.api.aucklandmuseum.com/id/media/v/614581"
So I tried to manually upload one of the files from my XML file, and it succeeded: https://commons.wikimedia.org/wiki/File:Tucetona_laticostata_(Quoy_and_Gairmard,_1835)_(AM_MA98708-1).jpg
If you want to reproduce the issue, one example record from my XML file is:

<?xml version="1.0" encoding="UTF-8"?>
<metadata xmlns:dc="http://purl.org/dc/elements/1.1/">
	<record>
		<taxon>Tucetona laticostata</taxon>
		<authority>(Quoy &amp; Gairmard, 1835)</authority>
		<description>Tucetona laticostata (Quoy &amp; Gairmard, 1835)</description>
		<identifier>MA98708</identifier>
		<institution>Auckland War Memorial Museum</institution>
		<license>http://creativecommons.org/licenses/by/4.0/</license>
		<source>https://www.aucklandmuseum.com/collections-research/collections/record/am_naturalsciences-object-220360</source>
		<title>Tucetona laticostata (Quoy and Gairmard, 1835) (AM MA98708-2)</title>
		<url>https://media.api.aucklandmuseum.com/id/media/v/614581</url>
	</record>
</metadata>

And the following metadata mappings can be used:
https://commons.wikimedia.org/wiki/GWToolset:Metadata_Mappings/Christian_Ferrer/Invertebrate_Zoology_Yale.json

Details

	Subject	Repo	Branch	Lines +/-
	Strip charset from Content-Type before parsing as mime type	mediawiki/extensions/GWToolset	master	+4 -1


Related Objects

Mentioned In: T252929: MimeAnalyser: Support mapping Content-Type to mime type and/or file extension
Mentioned Here: T250646: Add media.api.aucklandmuseum.com to $wgCopyUploadsDomains

Event Timeline

ChristianFerrer created this task. May 15 2020, 7:02 PM

Restricted Application added a subscriber: Aklapper. May 15 2020, 7:02 PM

Reedy added a project: MediaWiki-extensions-GWToolset. May 15 2020, 7:03 PM

Reedy updated the task description.

Restricted Application added a project: Commons. May 15 2020, 7:03 PM

	/**
	 * attempts to get the file extension of a media file url using the
	 * $options provided. it will first look for a valid file extension in the
	 * url; if none is found it will fallback to an appropriate file extention
	 * based on the content-type
	 *
	 * @param array $options
	 *   ['url'] final url to the media file
	 *   ['content-type'] content-type of that final url
	 *
	 * @throws GWTException
	 * @return null|string
	 */
	protected function getFileExtension( array $options ) {
		global $wgFileExtensions;
		$result = null;

		if ( empty( $options['url'] ) ) {
			throw new GWTException(
				[
					'gwtoolset-mapping-media-file-url-bad' =>
					[ $options['url'], '' ]
				]
			);
		}

		if ( empty( $options['content-type'] ) ) {
			throw new GWTException(
				[
					'gwtoolset-mapping-media-file-no-content-type' =>
					[ $options['url'] ]
				]
			);
		}

		$pathinfo = pathinfo( $options['url'] );
		$mimeAnalyzer = \MediaWiki\MediaWikiServices::getInstance()->getMimeAnalyzer();

		if ( !empty( $pathinfo['extension'] )
			&& in_array( $pathinfo['extension'], $wgFileExtensions )
			&& strpos( $mimeAnalyzer->getTypesForExtension( $pathinfo['extension'] ),
					$options['content-type']
				) !== false
		) {
			$result = $pathinfo['extension'];
		} elseif ( !empty( $options['content-type'] ) ) {
			$result = explode( ' ', $mimeAnalyzer->getExtensionsForType( $options['content-type'] ) );

			if ( !empty( $result ) ) {
				$result = $result[0];
			}
		}

		return $result;
	}

It takes the extension from the URL or, failing that, from the Content-Type. Obviously, this URL has no extension:
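As a minimal sketch of the first branch, this is what pathinfo() reports for the failing URL compared with a URL that ends in a file name. The helper urlExtension() and the example.org URL are hypothetical, for illustration only; they are not part of GWToolset.

```php
<?php
// Hypothetical helper (not GWToolset code): return the extension that
// pathinfo() finds in a URL, or null when the last segment has no dot.
function urlExtension( string $url ): ?string {
	$pathinfo = pathinfo( $url );
	// pathinfo() only sets the 'extension' key when the basename contains a dot.
	return $pathinfo['extension'] ?? null;
}

// The URL from this task: the last path segment is "614581", so no extension.
$noExt = urlExtension( 'https://media.api.aucklandmuseum.com/id/media/v/614581' );

// An assumed URL with a conventional file name, for comparison.
$withExt = urlExtension( 'https://example.org/images/shell.jpg' );

var_dump( $noExt, $withExt );
```

With no extension in the URL, getFileExtension has to fall through to the content-type branch.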

% curl -I https://media.api.aucklandmuseum.com/id/media/v/614581  
HTTP/1.1 200 OK
Access-Control-Allow-Headers: Content-Type, X-Requested-With, Accept, Origin, Access-Control-Request-Method, Access-Control-Request-Headers
Access-Control-Allow-Methods: GET, POST
Access-Control-Allow-Origin: *
Cache-Control: max-age=3600, public
Content-Length: 21530270
Content-Type: image/jpeg;charset=utf-8
Date: Fri, 15 May 2020 19:04:58 GMT
Last-Modified: Fri, 10 Aug 2018 01:47:04 GMT
Server: Apache-Coyote/1.1
Connection: keep-alive

The Content-Type header is there, but the ;charset=utf-8 parameter seems to upset the lookup:

reedy@deploy1001:~$ mwscript eval.php enwiki
> $mimeAnalyzer = \MediaWiki\MediaWikiServices::getInstance()->getMimeAnalyzer();

> var_dump( $mimeAnalyzer->getExtensionsForType( 'image/jpeg;charset=utf-8' ) );
NULL

> var_dump( $mimeAnalyzer->getExtensionsForType( 'image/jpeg' ) );
string(25) "jpeg jpg jpe jpeg jpg jpe"

>

getExtensionsForType is very simple: it does an exact lookup on the lowercased string, so a value that still carries ;charset=utf-8 never matches. Not sure offhand whether this is a bug in getExtensionsForType or whether the wrong function is being called...

	public function getExtensionsForType( $mime ) {
		$mime = strtolower( $mime );

		// Check the mime-to-ext map
		if ( isset( $this->mimeToExt[$mime] ) ) {
			return $this->mimeToExt[$mime];
		}

		// Resolve the MIME type to the canonical type
		if ( isset( $this->mimeTypeAliases[$mime] ) ) {
			$mime = $this->mimeTypeAliases[$mime];
			if ( isset( $this->mimeToExt[$mime] ) ) {
				return $this->mimeToExt[$mime];
			}
		}

		return null;
	}

I do see this in detectMimeType (called via guessMimeType):

			$m = preg_replace( '![;, ].*$!', '', $m ); # strip charset, etc

But that code path seems to rely on the file existing on disk...
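To see what that pattern does on its own, here is a small standalone sketch that applies the same regular expression to a few header values. The wrapper function stripMimeParameters() and the sample strings other than the one from the curl output are assumptions made for this example.

```php
<?php
// Hypothetical wrapper around the pattern quoted from detectMimeType:
// drop everything from the first ";", "," or space to the end of the string.
function stripMimeParameters( string $m ): string {
	return (string)preg_replace( '![;, ].*$!', '', $m );
}

$samples = [
	'image/jpeg;charset=utf-8',   // the header value from this task
	'image/jpeg; charset=utf-8',  // same, with the optional space
	'image/jpeg',                 // already bare, left unchanged
];

$stripped = array_map( 'stripMimeParameters', $samples );
var_dump( $stripped );
```

All three inputs reduce to image/jpeg, which is the form that the exact lookup in getExtensionsForType expects.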

Change 596736 had a related patch set uploaded (by Reedy; owner: Reedy):
[mediawiki/extensions/GWToolset@master] Remove charset from Content-Type

https://gerrit.wikimedia.org/r/596736

gerritbot added a project: Patch-For-Review. May 15 2020, 7:29 PM

Krinkle mentioned this in T252929: MimeAnalyser: Support mapping Content-Type to mime type and/or file extension. May 15 2020, 11:14 PM

Change 596736 merged by jenkins-bot:
[mediawiki/extensions/GWToolset@master] Strip charset from Content-Type before parsing as mime type

https://gerrit.wikimedia.org/r/596736
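The merged change itself is not quoted in this task, so the following is only a rough sketch of the idea named in its subject line: drop the parameters from the Content-Type value, then look up an extension. The constant MIME_TO_EXT and both function names are hypothetical stand-ins, not the GWToolset or MimeAnalyzer code.

```php
<?php
// Hypothetical stand-in for the analyzer's mime-to-extension map.
const MIME_TO_EXT = [
	'image/jpeg' => 'jpeg jpg jpe',
	'image/png' => 'png',
];

// Keep only the "type/subtype" part of a Content-Type header value.
function normaliseContentType( string $contentType ): string {
	$type = explode( ';', $contentType, 2 )[0];
	return strtolower( trim( $type ) );
}

// Return the first known extension for a header value, or null if unknown.
function extensionForContentType( string $contentType ): ?string {
	$extensions = MIME_TO_EXT[ normaliseContentType( $contentType ) ] ?? null;
	if ( $extensions === null ) {
		return null;
	}
	return explode( ' ', $extensions )[0];
}

var_dump( extensionForContentType( 'image/jpeg;charset=utf-8' ) );
```

Returning null for an unknown type, rather than an empty string, keeps the "no extension found" case easy to detect by the caller.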

Reedy closed this task as Resolved. May 16 2020, 12:03 AM

Reedy claimed this task.

Maintenance_bot removed a project: Patch-For-Review. May 16 2020, 12:10 AM
