MimeMagic doesn't like it when webservers return content-type: image/pjpeg
Closed, Resolved, Public

Description

I'm noticing a bunch of this in the log:

2015-06-11T12:59:40 Wittylama (talk | contribs) mediafile job failed. Metadata record 160. The file extension could not be determined from the file URL: http://www.europeana1914-1918.eu/attachments/121556/11960.121556.full.jpg. (Batch upload of 773 images from Europeana 1914-18. Progress of post-upload metadata checking at Commons:Europeana/Europeana 1914-1918 batch upload)

I'm guessing gwtoolset doesn't like having multiple periods in the filename, I guess it treats each one as an extension (?). That should be fixed.

Event Timeline

Bawolff raised the priority of this task from to Needs Triage.
Bawolff updated the task description.
Bawolff added subscribers: Bawolff, Wittylama.
Restricted Application added a subscriber: Aklapper.

The GWT has known issues with various symbols and characters, described in the documentation here: https://www.mediawiki.org/wiki/Help:Extension:GWToolset#Common_xml_problems
However, in the example you've given I don't understand what I did wrong, as there's not a 'double period' in the raw metadata that I have for that particular item (below).

Aside from the bug report, I'm interested to know how I can identify the items that failed in this manner - so I can not do it again, but also so I can fix their metadata and upload them properly. Suggestions welcome! I'm tracking all the items that I uploaded for this project here: https://commons.wikimedia.org/wiki/Commons:Europeana/Europeana_1914-1918_batch_upload

				<dc:dc>
					<dc:contributor>Archives départementales du Cher</dc:contributor>
					<dc:coverage>Western Front</dc:coverage>
					<commonscat>World War I</commonscat>
					<commonscat>Europeana 1914-1918 batch upload: needs checking</commonscat>
					<dc:description lang="fr">René Courcou est né le 17 juin 1898 à Vierzon-Ville (Cher). Inspecteur de l'Assistance publique à Limoges, il est mobilisé en 1916 et sert au 311e régiment d'infanterie. Il reçoit la croix de guerre pour son action comme agent de liaison en octobre 1918 au bois de la Garenne à Nanteuil-sur-Aisne (Ardennes). Le contributeur est le fils de René Courcou. </dc:description>
					<dc:identifier>121556</dc:identifier>
					<dc:identifier>11960</dc:identifier>
					<dc:relation>http://www.europeana1914-1918.eu/en/contributions/11960</dc:relation>
					<dc:source>UGC</dc:source>
					<dc:subject>Trench Life</dc:subject>
					<dc:subject>World War I</dc:subject>
					<dc:subject>http://dbpedia.org/page/World_War_I</dc:subject>
					<dcterms:alternative/>
					<dc:title>Couteau de poche de René Courcou, 2</dc:title>
					<dc:type>item</dc:type>
					<europeana:collectionName>2020601_Ag_ErsterWeltkrieg_EU</europeana:collectionName>
					<europeana:completeness>10</europeana:completeness>
					<europeana:country>europe</europeana:country>
					<europeana:dataProvider>Europeana 1914-1918</europeana:dataProvider>
					<europeana:photographer>Europeana staff photographer</europeana:photographer>
					<europeana:isShownAt>http://www.europeana1914-1918.eu/en/contributions/11960</europeana:isShownAt>
					<europeana:isShownBy>http://www.europeana1914-1918.eu/attachments/121556/11960.121556.full.jpg</europeana:isShownBy>
					<europeana:provider>Europeana 1914-1918</europeana:provider>
					<europeana:rights>http://creativecommons.org/licenses/by-sa/3.0/</europeana:rights>
					<europeana:cc_rights>{{Cc-by-sa-3.0}}</europeana:cc_rights>
					<europeana:photodate>2014</europeana:photodate>
					<europeana:ugc>true</europeana:ugc>
					<europeana:wikimediarights>{{Europeana 1914-1918}} {{Copyright information|object={{Europeana 1914-1918 PD-own upload}}|photograph={{cc-by-sa-3.0|Europeana 1914-1918 project}}}}</europeana:wikimediarights>
					<europeana:uri>http://www.europeana.eu/resolve/record/2020601/attachments_121556_11960_121556_original_121556_jpg</europeana:uri>
					<europeana:object>http://www.europeana1914-1918.eu/attachments/121556/11960.121556.full.jpg</europeana:object>
					<type>IMAGE</type>
				</dc:dc>

I have not investigated, so this is speculation, but I suspect there's a bug in how GWToolset determines if the file has an allowed file type.

So the url of the file is: http://www.europeana1914-1918.eu/attachments/121556/11960.121556.full.jpg

I suspect gwtoolset is seeing "11960.121556.full.jpg", and either thinking the type is ".121556.full.jpg" instead of ".jpg", or trying to match each "121556", "full" and "jpg" separately.
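
A quick way to test this guess is to see what a plain last-dot split makes of the URL. The sketch below uses Python's standard library as a stand-in; it is not GWToolset's code, and the helper name `url_extension` is made up for illustration.

```python
import posixpath
from urllib.parse import urlsplit


def url_extension(url):
    """Return the text after the last dot of the URL's path, lower-cased."""
    path = urlsplit(url).path
    # splitext only splits on the final dot, so earlier dots stay in the stem
    return posixpath.splitext(path)[1].lstrip('.').lower()


url = 'http://www.europeana1914-1918.eu/attachments/121556/11960.121556.full.jpg'
print(url_extension(url))  # prints "jpg"
```

If GWToolset's extension lookup behaves like this, the extra periods alone would not explain the failure.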

It can't be that (I say, pretending to know what I'm talking about...) because ALL of the files that I've uploaded with this batch have a similar URL structure for where the image is originally stored.

For example:
This file, https://commons.wikimedia.org/wiki/File:FRAD022_-_Julien_PASCO,_item_19.jpg , which successfully uploaded (one of many in that batch) has this original location in the XML file:

					<europeana:isShownBy>http://www.europeana1914-1918.eu/attachments/135270/13256.135270.full.jpg</europeana:isShownBy>

Never mind, I'm wrong. It appears to be related to the webserver returning a MIME type of image/pjpeg (for progressive JPEG, I guess) instead of image/jpeg.

Change 217592 had a related patch set uploaded (by Brian Wolff):
Add image/pjpeg as an alias for image/jpeg

https://gerrit.wikimedia.org/r/217592

Bawolff renamed this task from "gwtoolset doesn't like files with periods in the name" to "gwtoolset doesn't like it when webservers return content-type: image/pjpeg". (Jun 11 2015, 7:07 PM)
Bawolff set Security to None.

For reference, a list of all your failed gwtoolset uploads is at https://commons.wikimedia.org/w/api.php?action=query&list=logevents&leuser=Wittylama&letype=gwtoolset&leaction=gwtoolset/mediafile-job-failed&lelimit=max (unfortunately, there's not really a ui friendly version of that yet).
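
Since the raw API output is hard to read, here is a rough sketch of how that query could be built and its JSON response boiled down to a few fields. The function names are made up for illustration, the sample response is invented, and where exactly the failure message sits in each entry (`comment`, `params`, or elsewhere) is not confirmed here, so the sketch simply surfaces both.

```python
from urllib.parse import urlencode

API = 'https://commons.wikimedia.org/w/api.php'


def failed_uploads_url(user):
    """Build the logevents query quoted above, asking for JSON output."""
    params = {
        'action': 'query',
        'list': 'logevents',
        'leuser': user,
        'letype': 'gwtoolset',
        'leaction': 'gwtoolset/mediafile-job-failed',
        'lelimit': 'max',
        'format': 'json',
    }
    return API + '?' + urlencode(params)


def summarise_failures(response):
    """Return (timestamp, comment, params) tuples from a decoded response."""
    events = response.get('query', {}).get('logevents', [])
    return [(e.get('timestamp', ''), e.get('comment', ''), e.get('params', {}))
            for e in events]


# Made-up response fragment, shaped like the API's JSON output.
sample = {'query': {'logevents': [
    {'timestamp': '2015-06-11T12:59:40Z', 'comment': 'Batch upload of 773 images'},
]}}
for timestamp, comment, params in summarise_failures(sample):
    print(timestamp, comment, params)
```

Fetching `failed_uploads_url('Wittylama')` and passing the decoded JSON to `summarise_failures` would give one line per failed media file job.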

GWToolset uses GWToolset\Handlers:UploadHandler:getFileExtension() to attempt to determine the media file’s extension. It uses PHP’s pathinfo and MediaWiki’s MimeMagic to determine a valid file extension. If either of those methods fails, the file extension is assumed to be invalid.

Based on @Bawolff’s Gerrit patch above, it looks like the MIME info available to MimeMagic didn’t recognise the non-standard image/pjpeg MIME type, which is the most likely cause of the failures.
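
To illustrate the idea (not the actual PHP implementation), here is a minimal Python sketch of a check that only accepts a URL's extension when the served MIME type is known to allow it. The tiny `MIME_TO_EXTENSIONS` table, the `extension_is_valid` helper and the `aliases` parameter are all assumptions made for this example.

```python
import posixpath
from urllib.parse import urlsplit

# Hypothetical, deliberately tiny MIME table; MediaWiki's real data is far larger.
MIME_TO_EXTENSIONS = {
    'image/jpeg': {'jpg', 'jpeg', 'jpe'},
    'image/png': {'png'},
}


def extension_is_valid(url, content_type, aliases=None):
    """Accept the URL's extension only if the served MIME type allows it."""
    extension = posixpath.splitext(urlsplit(url).path)[1].lstrip('.').lower()
    mime = content_type.split(';')[0].strip().lower()
    mime = (aliases or {}).get(mime, mime)  # map e.g. image/pjpeg to image/jpeg
    return extension in MIME_TO_EXTENSIONS.get(mime, set())


url = 'http://www.europeana1914-1918.eu/attachments/121556/11960.121556.full.jpg'
print(extension_is_valid(url, 'image/pjpeg'))  # prints False
print(extension_is_valid(url, 'image/pjpeg', {'image/pjpeg': 'image/jpeg'}))  # prints True
```

Under these assumptions, an unknown MIME type makes an otherwise fine .jpg URL fail, and registering the alias is enough to make it pass.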

Bawolff, thanks for your help on this and for the patch.

Sorry... but I have no idea how to "read" that report... Can you tell me if the failures of these items were something I could/should have fixed in my pre-upload metadata wrangling (and how to not do it again in the future)?

I can work out which items failed to upload by seeing which ones never turn into blue links (and thumbnails) on my tracking page: https://commons.wikimedia.org/wiki/Commons:Europeana/Europeana_1914-1918_batch_upload. I'll just upload those remaining ones with the normal upload wizard.
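
The same "which links are still red" check could in principle be scripted. The sketch below assumes the default JSON layout of an `action=query&titles=...` request, where nonexistent pages carry a `missing` key; the helper name and the sample data (including the second title) are invented for illustration.

```python
def missing_titles(response):
    """Return titles that the API reports as nonexistent.

    Expects the default JSON layout of action=query&titles=..., where
    'pages' is a dict and nonexistent pages carry a 'missing' key.
    """
    pages = response.get('query', {}).get('pages', {})
    return sorted(p['title'] for p in pages.values() if 'missing' in p)


# Made-up response fragment; the second title is invented for illustration.
sample = {'query': {'pages': {
    '123': {'pageid': 123, 'ns': 6, 'title': 'File:The Elgin Regiment.jpg'},
    '-1': {'ns': 6, 'title': 'File:Hypothetical missing upload.jpg', 'missing': ''},
}}}
print(missing_titles(sample))  # prints ['File:Hypothetical missing upload.jpg']
```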

9 were due to a file already existing at the title, 3 were due to duplicate files already on Commons, and the rest were due to this bug report (which is not something you could have done anything about).

Change 217592 merged by jenkins-bot:
Add image/pjpeg as an alias for image/jpeg

https://gerrit.wikimedia.org/r/217592

Bawolff claimed this task.

The fix for this bug is now live on beta commons, and should make it to real commons come Wednesday, roughly around 20:00 UTC

dan-nl renamed this task from "gwtoolset doesn't like it when webservers return content-type: image/pjpeg" to "MimeMagic doesn't like it when webservers return content-type: image/pjpeg". (Jun 13 2015, 5:28 AM)

Just to confirm....
has this patch now been merged to the production website, and therefore the issue won't appear anymore?
Following that, if so, can I now re-run the batch upload that caused these ~80 failures, so that they upload successfully this time?
(this should cause the other ~700 files to fail because there's already the same file uploaded to Commons, but that's ok, right?)

Yep, https://commons.wikimedia.org/wiki/Special:Version says wmf10, so the patch should be there.

So your plan of just re-running the batch should work fine (Hopefully anyways, I never actually did test gwtoolset with the actual metadata file you were using, so there's a small chance the problem was misdiagnosed, but I think that's unlikely).

On this basis I've just pressed 'go' on a re-run of the same dataset to try to catch the previously failed uploads (and leave the ones that did upload successfully alone).

However! It looks like the GWT is now undoing all the post-upload metadata improvements I've made on the files that did work!!! Help! E.g. in this case it is UNDOING the addition of the improved license text that I added after the first upload: https://commons.wikimedia.org/w/index.php?title=File%3AThe_Elgin_Regiment.jpg&type=revision&diff=164160341&oldid=163117317
That's not supposed to happen - and I don't want the tool to revert all the other files I've worked on post-upload! I just wanted it to add the files that failed because of this bug! What do I do?!

I thought this was the default behaviour too, though that contradicts T93581.

In the absence of a Stop button (see T100972), I believe the best course of action is to let the GWT run through its batch, in the hope that it will at least upload the files it failed to upload earlier, and then get a bot to revert the GWT edits that overwrote the metadata improvements.

I can confirm that the new batch HAS successfully uploaded at least 1 of the files that previously failed: https://commons.wikimedia.org/w/index.php?title=File:Speld_van_Pierre_(Louis)_Robeyns.jpeg&action=history

So, at least that means that this bug has been successfully patched to resolve the issue that originally caused the images to fail. And, therefore, @JeanFred's suggestion would seem the best course of action.

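Some rough, untested code for tracking down and reverting the changes (once GWT is done). Not sure I'll have the time to run it, so I'm sticking it here in the hope that it will make it quicker for someone else.

import requests

API = 'https://commons.wikimedia.org/w/api.php'
session = requests.Session()  # log this session in first; undoing needs edit rights

def api_get(params):
	# GET the API and return the decoded JSON response
	return session.get(API, params=dict(params, format='json')).json()

# run for all files in Category:Images uploaded by Europeana
pageids = []
query = {'action': 'query', 'list': 'categorymembers', 'cmtitle': 'Category:Images uploaded by Europeana',
	'cmprop': 'ids', 'cmnamespace': 6, 'cmlimit': 'max'}
while True:
	response = api_get(query)
	pageids += [member['pageid'] for member in response['query']['categorymembers']]
	if 'continue' not in response:
		break
	query.update(response['continue'])  # carry on where the last request stopped

# find any with multiple revisions where the latest was done by GWT
hitlist = {}
for pageid in pageids:
	response = api_get({'action': 'query', 'prop': 'revisions', 'rvlimit': 2, 'pageids': pageid})
	revs = response['query']['pages'][str(pageid)]['revisions']  # pages are keyed by page id, as a string
	if len(revs) > 1 and revs[0].get('comment', '').startswith('[[Commons:GWT|GWToolset]]:'):
		hitlist[pageid] = revs[0]['revid']

# get yourself a token
response = api_get({'action': 'query', 'meta': 'tokens', 'type': 'csrf'})
token = response['query']['tokens']['csrftoken']

# undo all in hitlist
for pageid, revid in hitlist.items():
	data = {
		'action': 'edit',
		'format': 'json',
		'pageid': pageid,
		'undo': revid,
		'token': token
	}
	# POST the undo request to /w/api.php
	session.post(API, data=data)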

Just to clarify, is there more that needs doing on this bug, or have the additional issues been split off into other bugs?

Unassigning from self so I'm not on the wall of shame at https://lists.wikimedia.org/pipermail/wikitech-l/2015-September/083363.html ;)

I don't know if @Wittylama manually reverted the undesired changes, or if someone else did.

I don't remember anymore, it was a while ago! I do recall manually moving several dozen files from a .jpeg extension to a .jpg extension and reintroducing all the changes to license tags etc. that had been undone by the re-upload. So I'll take a guess that I manually reverted the undesired changes myself.

Lokal_Profil claimed this task.

In that case I think it is probably safe to close this task.