Insertion of <i> tag
Closed, ResolvedPublic8 Estimated Story Points
Actions

Assigned To

Authored By

	• gpaumier
	May 26 2015, 10:44 PM

Description

On the French Wikipedia, when a page title contains text in another language, the title is usually altered to include a language span and to be formatted with italics (example).

In this edit, the user added a reference to such a page, and Citoid inserted an <i> tag that should probably be converted to wikitext.

(I've checked that this is always reproducible in Citoid, so I'm confident this was inserted with Citoid and not manually).

There's also the issue of the double language span, but that's present in the linked page's <h1> so it's likely due to the double use of a displaytitle template.

Details

	Subject	Repo	Branch	Lines +/-
	Clean html tags out of top level fields	mediawiki/services/citoid	master	+119 -89

Customize query in gerrit

Related Objects
Search...

Status	Assigned	Task
Resolved	Mvolz	T100464 Insertion of <i> tag
Resolved	Mvolz	T101878 Improve Zotero Validation
Resolved	Mvolz	T95295 Create a cleaner import and export of natively scraped and zotero scraped citations
Resolved	Mvolz	T75993 Promisify Citoid and base it on service-template-node / service-runner
Resolved	Mvolz	T96363 Promisify html-metadata

Event Timeline

• gpaumier created this task.May 26 2015, 10:44 PM

• gpaumier raised the priority of this task from to Needs Triage.

• gpaumier updated the task description. (Show Details)

• gpaumier added a project: Citoid.

• gpaumier subscribed.

Restricted Application added a subscriber: Aklapper. · View Herald TranscriptMay 26 2015, 10:44 PM

• gpaumier mentioned this in T94767: Perform a weekly review of edits made with VisualEditor.May 26 2015, 10:48 PM

• Elitre subscribed.May 27 2015, 7:24 AM

This output is coming from Zotero, which contains the html.

@csteipp, seems like a potential security issue to me as well? @mobrovac, maybe we should add a task to clean any html out of zotero responses?

@Esanders, should the template model be checking for html in fields?

Mvolz added subscribers: • mobrovac, • csteipp.May 27 2015, 9:43 AM

Mvolz added a subscriber: Esanders.

Reported issue to translators: https://github.com/zotero/translators/issues/892

In T100464#1314098, @Mvolz wrote:

This output is coming from Zotero, which contains the html.

@csteipp, seems like a potential security issue to me as well? @mobrovac, maybe we should add a task to clean any html out of zotero responses?

So I can understand what's going on, when citoid add the html into the citation, that's treated as wikitext, correct? In this specific case, <i> in wikitext is converted to <i> in html which is safe, while <script> would be converted to <script>. But if citoid is somehow inserting html that doesn't go through the wikitext => html conversion, then that would be a major security issue.

Citoid is letting most Zotero json go through as is, we only have some minor validation for dates and things, and some more planned.

So on the citoid end, it's letting raw html get returned in the JSON, as long as it's coming from Zotero. If Zotero somehow managed to pass us a script tag, which I can imagine there being such a scenario, since it looks like Zotero does no upstream validation of what is passed to it in translators making it theoretically possible.

If a title tag like

<script>alert(1);</script><i>Elephants</i>

is scraped in our *native* scraper, the end result in citoid response is

"title":"alert(1);Elephants"

as all html tags are striped

But what is happening here is that raw html is coming through in the json directly from zotero, i.e.:

title":"<i><span class=\"lang-en\" lang=\"en\"><span class=\"lang-en\" lang=\"en\">Ninja Turtles</span></span></i> (film)"

http://citoid.wikimedia.org/api?format=mediawiki&search=http%3A%2F%2Ffr.wikipedia.org%2Fw%2Findex.php%3Ftitle%3DNinja_Turtles_%28film%29

That's all on the service end. Regardless of what the front end, it must be a security issue to not validate the Zotero fields since we can't be guaranteed it's script tag free- right?

On the extension end, I'm sure parsoid must strip out anything offending, but I'll need to investigate more (and is why I cced @Esanders to see what things should be like there). Presumably when adding fields in the template, they should be removing or converting html tags. I know we previously had a similar issue with pipes, which was dealt with.

Mvolz added a project: VisualEditor.May 27 2015, 7:57 PM

Mvolz set Security to None.

In T100464#1315453, @Mvolz wrote:

That's all on the service end. Regardless of what the front end, it must be a security issue to not validate the Zotero fields since we can't be guaranteed it's script tag free- right?

Whatever consumes zotero and might pass zotero data on to the client should be aware that this is possible and have a way to sanitize it. I think we now do a reasonable job of protecting against reflected xss in the citoid endpoint, but I'm not sure if it would ever be sanitized before being added into the dom on the browser. Once we save the edit, there shouldn't be an issue since it goes through a wikitext conversion.

I'd say that, from a general standpoint, we should check Zotero returns since it's a service we run without modifications. We could perhaps reuse HTML stripping we do in the native scraper.

• gpaumier added a project: VisualEditor 2014/15 Q4 blockers.May 28 2015, 6:02 PM

Jdforrester-WMF moved this task from Nominated to Backlog: Urgent Fixes on the VisualEditor 2014/15 Q4 blockers board.May 28 2015, 6:09 PM

Why wouldn't Parsoid convert these <i> tags? They shouldn't have data-parsoid entries (?) so won't be considered original wikitext as raw <i> tags and should be turned into '' ?

Parsoid emits contents of "wt" property of data-mw as is (since that is wikitext). See below. So, from what I can see in https://fr.wikipedia.org/w/index.php?diff=115270236 this is the same thing going on.

[subbu@earth lib] cat /tmp/x.html
<p about="#mwt1" typeof="mw:Transclusion" data-mw='{"parts":[{"template":{"target":{"wt":"echo","href":"./Template:Echo"},"params":{"1":{"wt":"<i>foo</i>"}},"i":0}}]}'>foo</p>
[subbu@earth lib] node parse --html2wt < /tmp/x.html
{{echo|<i>foo</i>}}

However, if you provide the html property, that html property is independently serialized to wikitext. See below:

[subbu@earth lib] cat /tmp/x.html
<p about="#mwt1" typeof="mw:Transclusion" data-mw='{"parts":[{"template":{"target":{"wt":"echo","href":"./Template:Echo"},"params":{"1":{"html":"<i>foo</i>"}},"i":0}}]}'>foo</p>
[subbu@earth lib] node parse --html2wt  < /tmp/x.html
{{echo|''foo''}}

If you provide both wt and html, the wt parameter takes priority and html is ignored. This is a slight inconsistency here since for extensions, if both html and extsrc properties are provided, html takes priority. But, this is tangential to the main problem in this ticket.

ssastry removed a project: Parsoid.Jun 3 2015, 9:42 PM

Jdforrester-WMF moved this task from To Triage to External and Administrivia on the VisualEditor board.Jun 3 2015, 10:00 PM

Mvolz moved this task from Backlog to Zotero on the Citoid board.Jun 9 2015, 7:34 PM

Mvolz added a subtask: T101878: Improve Zotero Validation.

Jdforrester-WMF triaged this task as Medium priority.Jun 18 2015, 12:32 AM

Jdforrester-WMF moved this task from External and Administrivia to TR1: Releases on the VisualEditor board.

Mvolz moved this task from Zotero to Site specific issues on the Citoid board.Jun 18 2015, 6:14 PM

Jdforrester-WMF edited a custom field.Jun 18 2015, 10:35 PM

Change 221131 had a related patch set uploaded (by Mvolz):
Clean html tags out of top level fields

https://gerrit.wikimedia.org/r/221131

gerritbot added a project: Patch-For-Review.Jun 26 2015, 3:10 PM

Jdforrester-WMF assigned this task to Mvolz.Jun 26 2015, 3:26 PM