'\n' are added to various elements in CommonsMetadata output
Closed, Resolved, Public

Details

Reference
bz57458

Event Timeline

bzimport raised the priority of this task to Needs Triage. Nov 22 2014, 2:38 AM
bzimport added a project: CommonsMetadata.
bzimport set Reference to bz57458.

Looking around further, '\n' is added to several values:

See https://commons.wikimedia.org/wiki/Special:ApiSandbox#action=query&prop=imageinfo&format=json&iiprop=extmetadata&iilimit=10&titles=File%3ACompans%20lake%20-%20Anas%20platyrhynchos%2007.JPG :

"Credit": {
"value": "\nSelf-photographed",
"source": "commons-desc-page",
"hidden": ""
},
"LicenseUrl": {
"value": "http://creativecommons.org/licenses/by-sa/3.0\n",
"source": "commons-desc-page",
"hidden": ""
},
"LicenseShortName": {
"value": "CC-BY-SA-3.0\n",
"source": "commons-desc-page",
"hidden": ""
},
"UsageTerms": {
"value": "Creative Commons Attribution-Share Alike 3.0\n",
"source": "commons-desc-page",
"hidden": ""
},
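
Affected fields can be spotted by comparing each value with its trimmed form. The following is a minimal Python sketch, not part of CommonsMetadata: the helper name find_untrimmed is hypothetical, and the fragment above is wrapped in braces only so that it parses as JSON.

```python
import json

# Fragment of the extmetadata output quoted above, wrapped in braces so it parses.
# The raw string keeps the \n escapes intact for the JSON parser to decode.
SAMPLE = r'''
{
  "Credit": {"value": "\nSelf-photographed", "source": "commons-desc-page", "hidden": ""},
  "LicenseUrl": {"value": "http://creativecommons.org/licenses/by-sa/3.0\n", "source": "commons-desc-page", "hidden": ""},
  "LicenseShortName": {"value": "CC-BY-SA-3.0\n", "source": "commons-desc-page", "hidden": ""},
  "UsageTerms": {"value": "Creative Commons Attribution-Share Alike 3.0\n", "source": "commons-desc-page", "hidden": ""}
}
'''


def find_untrimmed(extmetadata):
    """Return names of fields whose string value has leading or trailing whitespace."""
    return [
        name
        for name, field in extmetadata.items()
        if isinstance(field.get("value"), str)
        and field["value"] != field["value"].strip()
    ]


if __name__ == "__main__":
    # Prints the names of all four fields from the sample.
    print(find_untrimmed(json.loads(SAMPLE)))
```

All four sample fields are reported: Credit for its leading newline, the other three for their trailing newline.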

Change 97743 had a related patch set uploaded by Gergő Tisza:
Trim HTML-based metadata values

https://gerrit.wikimedia.org/r/97743

Change 97743 abandoned by Gergő Tisza:
Trim HTML-based metadata values

Reason:
Abandoning this change since InformationParser has been completely rewritten in the meantime.

https://gerrit.wikimedia.org/r/97743

Change 120948 had a related patch set uploaded by Gergő Tisza:
Clean parsed HTML

https://gerrit.wikimedia.org/r/120948

Change 120948 merged by jenkins-bot:
Clean parsed HTML

https://gerrit.wikimedia.org/r/120948

This issue is occurring again. See e.g. https://commons.wikimedia.org/w/api.php?action=query&prop=imageinfo&format=json&iiprop=commonmetadata|extmetadata&iilimit=1&titles=File%3ALandsort%20Lighthouse%20August%202013%2009.jpg

where
"LicenseShortName": {
"value": "CC-BY-SA-3.0\n",
"source": "commons-desc-page",
"hidden": ""
},
"UsageTerms": {
"value": "Creative Commons Attribution-Share Alike 3.0\n",
"source": "commons-desc-page",
"hidden": ""
},
"LicenseUrl": {
"value": "http://creativecommons.org/licenses/by-sa/3.0\n",
"source": "commons-desc-page",
"hidden": ""
},

Looking at the HTML source of the example above [1], there is no trace of these newline characters. So this might not be a cleaning/trimming issue in the TemplateParser; perhaps the newlines are inserted by it instead?

[1] https://commons.wikimedia.org/wiki/File:Landsort_Lighthouse_August_2013_09.jpg

Bug 69497 has been marked as a duplicate of this bug.

As stated in bug 69497, these newlines are in the license template, and the code doing the HTML scraping there should remove them.

The code that removes them is in https://gerrit.wikimedia.org/r/#/c/120948/1/TemplateParser.php which at a glance seems correct to me. Also, Lokal_Profil is right that the newline is not always present in the HTML code. I'll test locally with the examples mentioned here.

This code does _not_ look good. '/^\s+(.*)\s+$/' is wrong. It fails to trim if there are no leading blanks (or no trailing blanks). And watch out for the greedy (.*), that also looks wrong.
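
Both failure modes can be reproduced outside PHP. The sketch below uses Python's re module, which treats this pattern the same way PCRE does. How the pattern was applied in TemplateParser.php is an assumption here (the whole match replaced by capture group 1), and fixed_trim is only one common alternative, not necessarily what any later patch does. Both function names are hypothetical.

```python
import re

# Pattern quoted in the comment above, with the PHP/PCRE delimiters dropped.
BROKEN = re.compile(r'^\s+(.*)\s+$')


def broken_trim(value):
    # Assumed usage: replace the whole match with capture group 1.
    # If the pattern does not match, the value comes back unchanged.
    return BROKEN.sub(r'\1', value)


def fixed_trim(value):
    # Strip leading and trailing whitespace independently of each other.
    return re.sub(r'^\s+|\s+$', '', value)


if __name__ == "__main__":
    for sample in ["CC-BY-SA-3.0\n", "\nSelf-photographed", "  foo  "]:
        print(repr(sample), repr(broken_trim(sample)), repr(fixed_trim(sample)))
```

With only trailing or only leading whitespace, the broken pattern does not match at all, so "CC-BY-SA-3.0\n" and "\nSelf-photographed" come back unchanged. With whitespace on both sides, the greedy (.*) gives back only the single character that \s+ needs, so "  foo  " becomes "foo " rather than "foo".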

(In reply to Tisza Gergő from comment #10)

Also, Lokal_Profil is right that the newline is not always present in the HTML code. I'll test locally with the examples mentioned here.

Not correct. See

https://commons.wikimedia.org/w/api.php?action=query&prop=imageinfo&iiprop=extmetadata&format=jsonfm&titles=File:Landsort_Lighthouse_August_2013_09.jpg

That request returns the same trailing newlines for UsageTerms and LicenseUrl.

Change 155901 had a related patch set uploaded by TheDJ:
TemplateParser: Fix whitespace trim

https://gerrit.wikimedia.org/r/155901

Change 155901 merged by jenkins-bot:
TemplateParser: Fix whitespace trim

https://gerrit.wikimedia.org/r/155901

(In reply to Lupo from comment #11)

This code does _not_ look good. '/^\s+(.*)\s+$/' is wrong. It fails to trim if there are no leading blanks (or no trailing blanks). And watch out for the greedy (.*), that also looks wrong.

D'oh, that was stupid. Thanks for fixing, Lupo & TheDJ!

Bug 66652 has been marked as a duplicate of this bug.
Gilles triaged this task as Unbreak Now! priority. Dec 4 2014, 10:11 AM
Gilles moved this task from Untriaged to Done on the Multimedia board.
Gilles lowered the priority of this task from Unbreak Now! to Needs Triage. Dec 4 2014, 11:23 AM