Page MenuHomePhabricator

Insertion of <i> tag
Closed, ResolvedPublic8 Estimated Story Points

Description

On the French Wikipedia, when a page title contains text in another language, the title is usually altered to include a language span and to be formatted with italics (example).

In this edit, the user added a reference to such a page, and Citoid inserted an <i> tag that should probably be converted to wikitext.

(I've checked that this is always reproducible in Citoid, so I'm confident this was inserted with Citoid and not manually).

There's also the issue of the double language span, but that's present in the linked page's <h1> so it's likely due to the double use of a displaytitle template.

Event Timeline

gpaumier raised the priority of this task from to Needs Triage.
gpaumier updated the task description. (Show Details)
gpaumier added a project: Citoid.
gpaumier subscribed.

This output is coming from Zotero, which contains the html.

@csteipp, seems like a potential security issue to me as well? @mobrovac, maybe we should add a task to clean any html out of zotero responses?

@Esanders, should the template model be checking for html in fields?

This output is coming from Zotero, which contains the html.

@csteipp, seems like a potential security issue to me as well? @mobrovac, maybe we should add a task to clean any html out of zotero responses?

So I can understand what's going on, when citoid add the html into the citation, that's treated as wikitext, correct? In this specific case, <i> in wikitext is converted to <i> in html which is safe, while <script> would be converted to &lt;script&gt;. But if citoid is somehow inserting html that doesn't go through the wikitext => html conversion, then that would be a major security issue.

Citoid is letting most Zotero json go through as is, we only have some minor validation for dates and things, and some more planned.

So on the citoid end, it's letting raw html get returned in the JSON, as long as it's coming from Zotero. If Zotero somehow managed to pass us a script tag, which I can imagine there being such a scenario, since it looks like Zotero does no upstream validation of what is passed to it in translators making it theoretically possible.

If a title tag like

<script>alert(1);</script><i>Elephants</i>

is scraped in our *native* scraper, the end result in citoid response is

"title":"alert(1);Elephants"

as all html tags are striped

But what is happening here is that raw html is coming through in the json directly from zotero, i.e.:

title":"<i><span class=\"lang-en\" lang=\"en\"><span class=\"lang-en\" lang=\"en\">Ninja Turtles</span></span></i> (film)"

http://citoid.wikimedia.org/api?format=mediawiki&search=http%3A%2F%2Ffr.wikipedia.org%2Fw%2Findex.php%3Ftitle%3DNinja_Turtles_%28film%29

That's all on the service end. Regardless of what the front end, it must be a security issue to not validate the Zotero fields since we can't be guaranteed it's script tag free- right?

On the extension end, I'm sure parsoid must strip out anything offending, but I'll need to investigate more (and is why I cced @Esanders to see what things should be like there). Presumably when adding fields in the template, they should be removing or converting html tags. I know we previously had a similar issue with pipes, which was dealt with.

That's all on the service end. Regardless of what the front end, it must be a security issue to not validate the Zotero fields since we can't be guaranteed it's script tag free- right?

Whatever consumes zotero and might pass zotero data on to the client should be aware that this is possible and have a way to sanitize it. I think we now do a reasonable job of protecting against reflected xss in the citoid endpoint, but I'm not sure if it would ever be sanitized before being added into the dom on the browser. Once we save the edit, there shouldn't be an issue since it goes through a wikitext conversion.

I'd say that, from a general standpoint, we should check Zotero returns since it's a service we run without modifications. We could perhaps reuse HTML stripping we do in the native scraper.

Jdforrester-WMF subscribed.

Why wouldn't Parsoid convert these <i> tags? They shouldn't have data-parsoid entries (?) so won't be considered original wikitext as raw <i> tags and should be turned into '' ?

Parsoid emits contents of "wt" property of data-mw as is (since that is wikitext). See below. So, from what I can see in https://fr.wikipedia.org/w/index.php?diff=115270236 this is the same thing going on.

[subbu@earth lib] cat /tmp/x.html
<p about="#mwt1" typeof="mw:Transclusion" data-mw='{"parts":[{"template":{"target":{"wt":"echo","href":"./Template:Echo"},"params":{"1":{"wt":"<i>foo</i>"}},"i":0}}]}'>foo</p>
[subbu@earth lib] node parse --html2wt < /tmp/x.html
{{echo|<i>foo</i>}}

However, if you provide the html property, that html property is independently serialized to wikitext. See below:

[subbu@earth lib] cat /tmp/x.html
<p about="#mwt1" typeof="mw:Transclusion" data-mw='{"parts":[{"template":{"target":{"wt":"echo","href":"./Template:Echo"},"params":{"1":{"html":"<i>foo</i>"}},"i":0}}]}'>foo</p>
[subbu@earth lib] node parse --html2wt  < /tmp/x.html
{{echo|''foo''}}

If you provide both wt and html, the wt parameter takes priority and html is ignored. This is a slight inconsistency here since for extensions, if both html and extsrc properties are provided, html takes priority. But, this is tangential to the main problem in this ticket.

Change 221131 had a related patch set uploaded (by Mvolz):
Clean html tags out of top level fields

https://gerrit.wikimedia.org/r/221131

Change 221131 merged by Mobrovac:
Clean html tags out of top level fields

https://gerrit.wikimedia.org/r/221131

Deployed, resolving. If the issue still persists, please reopen the ticket.