Provide an API flag to suppress auto-generated <references />
Closed, ResolvedPublic0 Story Points

Description

Parsoid clients sometimes parse a fragment of wikitext during edit sessions and insert the generated HTML in the full page DOM. For this use case, it doesn't make sense to auto-generate the <references /> tag. See T95851 for an example bug report that is affected by this.

There are a couple ways of doing this:

  1. Provide a new API flag. It is likely that this will always be used along with the bodyOnly API flag.
  2. Provide a new fragment parse API end point that implicitly provides this functionality. This would eliminate the bodyOnly API flag since that option is also implied.

In either case, <ref> tags found in the wikitext in this mode would have their HTML inlined in data-mw instead of having an id pointer to the DOM in the <references /> section.

Option 1 is the easier path in terms of minimal disruption. Option 2. seems cleaner but requires work on the part of all Parsoid clients + a change in the RESTBase API. I've cc-ed VE, Flow, CX and Services teams in case they want to weigh in on this.

ssastry created this task.Sep 22 2015, 4:52 AM
ssastry updated the task description. (Show Details)
ssastry raised the priority of this task from to Normal.
ssastry added a subscriber: ssastry.
Restricted Application added a project: Collaboration-Team-Triage. · View Herald TranscriptSep 22 2015, 4:52 AM
Restricted Application added a subscriber: Aklapper. · View Herald Transcript

We probably actually want the auto-generated <references /> for Flow.

In T113331#1706668, @Mattflaschen wrote:

We probably actually want the auto-generated <references /> for Flow.

This is the default behavior, but, in several edit scenarios (ex: when you are parsing a fragment, say, an edited transclusion or a new transclusions in VE), you don't want to auto-generate the refs. See the merged tasks (in retrospect, I should have made this a blocking task on them).

GWicke added a subscriber: GWicke.Oct 8 2015, 2:16 AM

There seem to be two main directions for making some content elements (like references, navboxes, infoboxes) optional:

  1. API flags that lets us parametrize the returned content, based on a fresh parse.
  2. Markup for such elements that makes it easy to extract / remove that content in a consumer or post-processing layer.

An example of 2) would be custom elements, as discussed in T105845#1650013. A potential advantage of this approach is the ability to share stored / cached content between different consumers, while being able to apply cheap post-processing steps (facilitated by appropriate syntax) to get the desired variant. A disadvantage is the need to apply such post-processing.

For references, there is also the option of potentially exposing them as separate metadata, as discussed in T102867 and T105845.

Arlolra claimed this task.Oct 16 2015, 3:42 PM
cscott added a subscriber: cscott.Oct 16 2015, 3:44 PM

I like the idea of a new flag, but let's call it "fragment", rather than proliferate the set of "body_only", "no_references", etc flags. If you hit the wt2html endpoint with "fragment=true" it implies body_only, no_references, and whatever other future fragment-specific parsing tweaks we want to do.

ssastry set Security to None.

Change 249683 had a related patch set uploaded (by Arlolra):
T113331: Provide an API flag for fragment parsing

https://gerrit.wikimedia.org/r/249683

Here is a summary of a lengthy IRC discussion on #mediawiki-parsoid.

@GWicke raises the concern that we shouldn't be implementing any new uncached endpoints that can be abused. I remain unconvinced about this theoretical concern since the existing endpoint (/transform/..) to parse a wikitext fragment will be uncached anyway. @GWicke's concern is that someone who might want the entire DOM of the page (minus the <references /> section) will use this new API flag instead of fetching the stored DOM and stripping the <references /> output from it. I continue to be unconvinced.

The custom tag marker solution that @GWicke is advocating for (that this will make it easier for clients to strip out <references /> section instead of parsing the DOM) is an orthogonal solution for the use cases we are targeting. Flow, CX, VE and other internal clients will be parsing the DOM in any case and this is a non-issue. They have to do additional work in either case and deleting the <ol>-dom-tree with a mw:Extension/References typeof is 2 lines of code at most.

So, to summarize, here are the 2 possible solutions that I see:

(1) Rename bodyOnly to parseFragment (or some other name that we can bikeshed about) where Parsoid parses the wikitext, doesn't generate the <references /> output, and returns the children of <body> serialized to HTML
(2) Don't do anything. Ask VE, Flow, CX to drop the <ol> that has the mw:Extensions/References typeof attribute.

@GWicke is advocating that we should wrap references output with a custom tag so that it is simpler for clients to strip them out. But custom tags in output is a separate discussion that doesn't need to be tied into this ticket. The custom tags doesn't fundamentally change the fact that VE, CX, Flow are now responsible for dropping information from the DOM that they get from Parsoid.

I am personally agnostic about solution (1) or (2). I would like to hear from @Esanders, @Catrope, @santhosh (/cc @Jdforrester-WMF, @MattFlaschen). If you guys can handle this on your end, we'll close this ticket as declined. It does seem like a lot of code ( https://gerrit.wikimedia.org/r/#/c/249683/ ) to support the renamed flag and will require changes to RESTBase codebase as well to deal with the flag rename besides changes in all your codebases.

<James_F> subbu: Doing the same thing in every client sucks.
<James_F> subbu: Also, the name in VE for the function to call this endpoint in Parsoid is indeed "parseFragment".
<subbu> 2 lines of code as far as i can tell.
<James_F> But… eh. I don't care much either way.

@ssastry, CX needs the <references/> in the output and at this point we don't have any usescase to parse the fragments or to avoid the references in the output. I also don't see any usecase in near future we will require this.

ssastry closed this task as Declined.Nov 5 2015, 3:07 AM

From IRC conversations, it seems that Flow doesn't need this either.

Given this, it seems best for VE to strip the references subtree by looking for mw:Extension/References from the DOM it receives from Parsoid. We'll leave bodyOnly as is.

Change 249683 abandoned by Arlolra:
T113331: Provide an API flag for fragment parsing

Reason:
Task was declined.

https://gerrit.wikimedia.org/r/249683

Esanders reopened this task as Open.Feb 9 2016, 11:55 PM

I was looking at doing the stripping in VE (T101553) but it turns that without data-parsoid we can't tell the difference between an auto-generated reflist (e.g. from an infobox) which we want to strip, and a real references tag (e.g. from {{reflist}}) that we want to keep.

I was looking at doing the stripping in VE (T101553) but it turns that without data-parsoid we can't tell the difference between an auto-generated reflist (e.g. from an infobox) which we want to strip, and a real references tag (e.g. from {{reflist}}) that we want to keep.

I don't follow ... how would you determine this when you make a call to Parsoid to not generate references ... i.e. whatever logic you use to pass the new api flag to parsoid should be sufficient to let you figure out when to strip it, it seems.

Well given the initial approach was declined I was suggesting that we at least have a way of distinguishing the two.

Well given the initial approach was declined I was suggesting that we at least have a way of distinguishing the two.

Ah ... I see. I was confused because you reopened the ticket and I thought you wanted us to implement that API option.

You are saying that we set that flag in data-parsoid, and you don't have access to that flag and so you want a way of distinguishing between the two scenarios.

Tbayer removed a subscriber: Tbayer.Feb 11 2016, 7:11 PM

I'm suggesting data-mw='{"auto":"true"}' (or data-mw='{"synthetic":"true"}'?) as the not-perfectly-great-but-good-enough-for-now solution.

@cscott There's already something set in data-parsoid, we could just move that definition to data-mw

Change 279446 had a related patch set uploaded (by Subramanya Sastry):
T113331: Move auto-generated refs flag from data-parsoid to data-mw

https://gerrit.wikimedia.org/r/279446

Change 279446 merged by jenkins-bot:
T113331: Move auto-generated refs flag from data-parsoid to data-mw

https://gerrit.wikimedia.org/r/279446

ssastry closed this task as Resolved.Mar 25 2016, 7:47 PM

Parsoid now sets autoGenerated: true in data-mw for auto-generated references.

This is not yet deployed but would likely go out on Monday. Note that till the HTML turns over because of edits, you will continue to receive the older version where this flag is not set in data-mw.

In the next Parsoid deploy, we are going to bump the version number to 1.2.1. At the same time, we are going to update the accept header to actually point to the HTML specs based on the version number. The new header will look like 'text/html; charset=utf-8; profile="https://www.mediawiki.org/wiki/Specs/HTML/1.2.1"'

If you want to receive the new version right away (without waiting for the HTML to turn over in RESTBase), you can update the accept header in VE once Parsoid and RESTBase updates are deployed (likely on Monday, latest by Wednesday).

If you want to receive the new version right away (without waiting for the HTML to turn over in RESTBase), you can update the accept header in VE once Parsoid and RESTBase updates are deployed (likely on Monday, latest by Wednesday).

You can (and should) update the accept header, but be aware that you are *not* guaranteed to receive content in version 1.2.1 at this point. The requisite logic for version upgrades is not yet implemented in Parsoid and RESTBase, and will take at least a few more weeks to build.

Jdforrester-WMF set the point value for this task to 0.