'\n' are added to various elements in CommonsMetadata output
Closed, Resolved (Public)

Details

Reference
bz57458

Event Timeline

bzimport raised the priority of this task from to Needs Triage.
bzimport set Reference to bz57458.

Looking around further, '\n' is added to several values:

See https://commons.wikimedia.org/wiki/Special:ApiSandbox#action=query&prop=imageinfo&format=json&iiprop=extmetadata&iilimit=10&titles=File%3ACompans%20lake%20-%20Anas%20platyrhynchos%2007.JPG :

"Credit": {
    "value": "\nSelf-photographed",
    "source": "commons-desc-page",
    "hidden": ""
},
"LicenseUrl": {
    "value": "http://creativecommons.org/licenses/by-sa/3.0\n",
    "source": "commons-desc-page",
    "hidden": ""
},
"LicenseShortName": {
    "value": "CC-BY-SA-3.0\n",
    "source": "commons-desc-page",
    "hidden": ""
},
"UsageTerms": {
    "value": "Creative Commons Attribution-Share Alike 3.0\n",
    "source": "commons-desc-page",
    "hidden": ""
},
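
For anyone wanting to check which fields of a response are affected, here is a minimal Python sketch. It assumes the extmetadata object has already been pulled out of the API response as a dict shaped like the excerpt above; the helper name fields_with_stray_whitespace is made up for illustration and is not part of CommonsMetadata or the API.

```python
import json

# Sample copied from the excerpt above. The raw string keeps "\n" as a JSON
# escape, so json.loads turns it into a real newline character.
SAMPLE = json.loads(r'''
{
  "Credit": {"value": "\nSelf-photographed", "source": "commons-desc-page", "hidden": ""},
  "LicenseUrl": {"value": "http://creativecommons.org/licenses/by-sa/3.0\n", "source": "commons-desc-page", "hidden": ""},
  "LicenseShortName": {"value": "CC-BY-SA-3.0\n", "source": "commons-desc-page", "hidden": ""},
  "UsageTerms": {"value": "Creative Commons Attribution-Share Alike 3.0\n", "source": "commons-desc-page", "hidden": ""}
}
''')


def fields_with_stray_whitespace(extmetadata):
    """Return the names of fields whose string value has leading or trailing whitespace.

    Hypothetical helper: it only inspects the "value" key of each field.
    """
    affected = []
    for name, field in extmetadata.items():
        value = field.get("value")
        if isinstance(value, str) and value != value.strip():
            affected.append(name)
    return affected


if __name__ == "__main__":
    print(fields_with_stray_whitespace(SAMPLE))
```

On the excerpt above this reports all four fields, with the newline leading in Credit and trailing in the other three.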

Change 97743 had a related patch set uploaded by Gergő Tisza:
Trim HTML-based metadata values

https://gerrit.wikimedia.org/r/97743

Change 97743 abandoned by Gergő Tisza:
Trim HTML-based metadata values

Reason:
Abandoning this change since InformationParser has been completely rewritten in the meantime.

https://gerrit.wikimedia.org/r/97743

Change 120948 had a related patch set uploaded by Gergő Tisza:
Clean parsed HTML

https://gerrit.wikimedia.org/r/120948

Change 120948 merged by jenkins-bot:
Clean parsed HTML

https://gerrit.wikimedia.org/r/120948

This issue is occurring again. See e.g. https://commons.wikimedia.org/w/api.php?action=query&prop=imageinfo&format=json&iiprop=commonmetadata|extmetadata&iilimit=1&titles=File%3ALandsort%20Lighthouse%20August%202013%2009.jpg

where
"LicenseShortName": {
    "value": "CC-BY-SA-3.0\n",
    "source": "commons-desc-page",
    "hidden": ""
},
"UsageTerms": {
    "value": "Creative Commons Attribution-Share Alike 3.0\n",
    "source": "commons-desc-page",
    "hidden": ""
},
"LicenseUrl": {
    "value": "http://creativecommons.org/licenses/by-sa/3.0\n",
    "source": "commons-desc-page",
    "hidden": ""
},

Looking at the HTML source of the example above [1], there is no trace of these newline characters. So this might not be a cleaning/trimming issue in the TemplateParser; could the newlines instead be inserted by it?

[1] https://commons.wikimedia.org/wiki/File:Landsort_Lighthouse_August_2013_09.jpg

Tgr added a comment. Aug 14 2014, 1:56 PM

Bug 69497 has been marked as a duplicate of this bug.

Lupo added a comment. Aug 14 2014, 1:58 PM

As stated in bug 69497, these newlines are in the license template, and the code doing the HTML scraping there had better remove them.

Tgr added a comment. Aug 14 2014, 2:05 PM

The code that removes them is in https://gerrit.wikimedia.org/r/#/c/120948/1/TemplateParser.php, which at a glance seems correct to me. Also, Lokal_Profil is right that the newline is not always present in the HTML code. I'll test locally with the examples mentioned here.

Lupo added a comment. Aug 14 2014, 2:41 PM

This code does _not_ look good. '/^\s+(.*)\s+$/' is wrong. It fails to trim if there are no leading blanks (or no trailing blanks). And watch out for the greedy (.*), that also looks wrong.
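
Both problems can be reproduced outside PHP. The following is a minimal Python sketch, not the TemplateParser code itself: it assumes the pattern was used as a "replace the whole match with the captured group" trim, and the names broken_trim and fixed_trim are made up for illustration. The alternative pattern shown is one common way to trim each end independently and is not necessarily what the eventual fix does.

```python
import re

# The pattern quoted above, carried over to Python's `re` module.
BROKEN = re.compile(r'^\s+(.*)\s+$')


def broken_trim(value: str) -> str:
    """Replace the whole match with the captured group, as assumed above."""
    return BROKEN.sub(r'\1', value)


def fixed_trim(value: str) -> str:
    """Hypothetical alternative: strip each end independently."""
    return re.sub(r'\A\s+|\s+\Z', '', value)


if __name__ == "__main__":
    # No leading whitespace: ^\s+ cannot match, so the newline survives.
    print(repr(broken_trim("CC-BY-SA-3.0\n")))
    # No trailing whitespace: the final \s+ cannot match, same result.
    print(repr(broken_trim("\nSelf-photographed")))
    # Whitespace on both ends: the greedy (.*) keeps all but one trailing space.
    print(repr(broken_trim("  CC-BY-SA-3.0  ")))
    # Trimming each end on its own handles all three inputs.
    print(repr(fixed_trim("CC-BY-SA-3.0\n")))
```

The first two calls return their input unchanged, which matches the trailing and leading newlines seen in the API output; the third returns the value with one trailing space still attached.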

Lupo added a comment. Aug 14 2014, 2:48 PM

(In reply to Tisza Gergő from comment #10)

Also, Lokal_Profil is right that the newline is
not always present in the HTML code. I'll test locally with the examples
mentioned here.

Not correct. See

https://commons.wikimedia.org/w/api.php?action=query&prop=imageinfo&iiprop=extmetadata&format=jsonfm&titles=File:Landsort_Lighthouse_August_2013_09.jpg

Returns the same trailing newlines for UsageTerms and LicenseUrl.

Change 155901 had a related patch set uploaded by TheDJ:
TemplateParser: Fix whitespace trim

https://gerrit.wikimedia.org/r/155901

Change 155901 merged by jenkins-bot:
TemplateParser: Fix whitespace trim

https://gerrit.wikimedia.org/r/155901

Tgr added a comment. Aug 23 2014, 4:44 PM

(In reply to Lupo from comment #11)

This code does _not_ look good. '/^\s+(.*)\s+$/' is wrong. It fails to trim
if there are no leading blanks (or no trailing blanks). And watch out for
the greedy (.*), that also looks wrong.

D'oh, that was stupid. Thanks for fixing, Lupo & TheDJ!

Tgr added a comment. Sep 5 2014, 12:27 PM

Bug 66652 has been marked as a duplicate of this bug.

Gilles moved this task from Untriaged to Done on the Multimedia board. Dec 4 2014, 10:11 AM
Gilles triaged this task as Unbreak Now! priority.
Gilles lowered the priority of this task from Unbreak Now! to Needs Triage. Dec 4 2014, 11:23 AM