
Parsoid converts <a name="foo"></a> to [ ]
Closed, Resolved (Public)

Description

I need to convert HTML meeting notes to wikitext (steps). @GWicke suggested I use Parsoid, so I used rest_v1's /html/to/wikitext/{title}/{revision} transform. It worked, but I got some unexpected wikitext.

Steps to reproduce:
  1. Get typical HTML output from a tool, e.g.
<h2 class="c0"><a name="h.83986x7wjhtl"></a><span>Pending action items</span></h2>
<a name="foo"></a>Now some text.
  2. Visit mediawiki.org/api/rest_v1
  3. Open Transforms > [Post] /transform/html/to/wikitext/{title}/{revision}
  4. Paste the HTML above into the html field.
  5. Click [Try it out!]
Results:
== [ ]<span>Pending action items</span> ==
[ ]Now some text.
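
For anyone who would rather script the reproduction than click through the API sandbox, here is a minimal sketch in Python. It only builds the POST request and leaves sending it to the caller. Assumptions: the endpoint accepts a JSON body with an "html" key (the sandbox form only tells us the field is called html), the host is www.mediawiki.org, and the function name build_transform_request plus the title "Sandbox" and revision 1 in the usage note are hypothetical placeholders.

```python
import json
import urllib.parse
import urllib.request

# Base URL taken from step 2 above; the "www." host prefix is an assumption.
BASE = "https://www.mediawiki.org/api/rest_v1"


def build_transform_request(html, title, revision):
    """Build (but do not send) a POST request for the html-to-wikitext transform.

    Hypothetical helper: the JSON body with an "html" key is an assumption
    based on the "html" field shown in the API sandbox form.
    """
    path = "/transform/html/to/wikitext/{}/{}".format(
        urllib.parse.quote(title, safe=""), int(revision)
    )
    body = json.dumps({"html": html}).encode("utf-8")
    return urllib.request.Request(
        BASE + path,
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )
```

To actually run the transform, pass the request to urllib.request.urlopen and decode the response body, e.g. urllib.request.urlopen(build_transform_request('<a name="foo"></a>Now some text.', "Sandbox", 1)).read().decode("utf-8").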

The square brackets are unexpected. A <a href="some/url"> hyperlink is represented in wikitext using square brackets, but an anchor name is never represented that way.

Expected behavior:

Is Parsoid even supposed to work on arbitrary HTML?

{{Anchor}} wikitext templates typically output <span id="foo"></span>, but maybe Parsoid transforming anchor HTML into span HTML is also unexpected.
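
As a possible client-side workaround, the HTML could be preprocessed before it is sent to the transform, rewriting named anchors without an href into the span form that {{Anchor}} produces. The sketch below uses only Python's standard html.parser module. It is not how Parsoid handles this, the names AnchorToSpan and anchors_to_spans are hypothetical, and it assumes that dropping any other attributes on the rewritten anchor is acceptable.

```python
from html import escape
from html.parser import HTMLParser


class AnchorToSpan(HTMLParser):
    """Re-emit HTML, turning <a name="x"> (no href) into <span id="x">."""

    def __init__(self):
        super().__init__(convert_charrefs=False)
        self.out = []
        self.a_stack = []  # one bool per open <a>: True if it was rewritten

    def handle_starttag(self, tag, attrs):
        attr_map = dict(attrs)
        if tag == "a":
            rewrite = "href" not in attr_map and "name" in attr_map
            self.a_stack.append(rewrite)
            if rewrite:
                # Assumption: other attributes on the anchor can be dropped.
                name = escape(attr_map["name"] or "", quote=True)
                self.out.append('<span id="{}">'.format(name))
                return
        self.out.append(self.get_starttag_text())

    def handle_startendtag(self, tag, attrs):
        # Self-closing tags such as <br/> are passed through untouched.
        self.out.append(self.get_starttag_text())

    def handle_endtag(self, tag):
        if tag == "a" and self.a_stack and self.a_stack.pop():
            self.out.append("</span>")
        else:
            self.out.append("</{}>".format(tag))

    def handle_data(self, data):
        self.out.append(data)

    def handle_entityref(self, name):
        self.out.append("&{};".format(name))

    def handle_charref(self, name):
        self.out.append("&#{};".format(name))

    def handle_comment(self, data):
        self.out.append("<!--{}-->".format(data))


def anchors_to_spans(html):
    parser = AnchorToSpan()
    parser.feed(html)
    parser.close()
    return "".join(parser.out)
```

Anchors that do carry an href are left as they are, so ordinary hyperlinks would still be serialized as wikitext links.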

Event Timeline

Spage raised the priority of this task from to Needs Triage.
Spage updated the task description.
Spage added a project: Parsoid.
Spage added subscribers: Spage, GWicke.

To your question about arbitrary HTML: no, Parsoid right now doesn't handle arbitrary HTML very well. We have made some progress over the years to improve robustness, but there is still a fair bit more to be done until we can take all kinds of crappy HTML and generate clean wikitext from it. There will be some compromises to be made in the bargain, especially about html2html rendering roundtripping vs. use of native wikitext constructs vs. HTML tags, etc. We will likely have to implement a cleanup step to scrub HTML we get from tools like Google Docs, Word, OpenOffice, etc.

But these bug reports are welcome, so we know what the biggest blockers are along the way to using this on arbitrary HTML.

ssastry triaged this task as Medium priority. (Sep 22 2015, 5:06 AM)
ssastry set Security to None.

Change 345570 had a related patch set uploaded (by Arlolra):
[mediawiki/services/parsoid@master] T112043: Drop if extlink will be serialized w/o an href

https://gerrit.wikimedia.org/r/345570

Change 345570 merged by jenkins-bot:
[mediawiki/services/parsoid@master] T112043: Handle anchors without hrefs

https://gerrit.wikimedia.org/r/345570

Arlolra claimed this task.