Page MenuHomePhabricator

Table of contents in Parsoid output
Closed, ResolvedPublic

Description

Parsoid currently does not generate the table of contents table in its output, on the grounds that it is redundant and non-editable.

On the other hand, we need to generate this somehow for parsoid read views. Maybe a(nother) post-processing pass?

If discussion tools uses Parsoid read views for talk pages, this is probably required for that as well.

Related: T269630: Parsoid should support section editing links.

Related Objects

Event Timeline

ssastry triaged this task as Medium priority.Dec 16 2020, 4:37 PM
ssastry moved this task from Needs Triage to Missing Functionality on the Parsoid board.
Jdlrobson subscribed.

We are rethinking the table of contents for the desktop improvements project which may impact your implementation and make your lives easier as it would move it outside the parser output. We should definitely talk.

Please see our plans in https://phabricator.wikimedia.org/T283505
Making it possible to request parser content without the table of contents and accessing the underlying data structure instead is going to be very helpful...

Hey @cscott the table of contents is going to present a problem for the Desktop improvements project over the next 3 months using the current parser so I want to make sure whatever we do there is compatible with the new parser.

Lots of skins (Vector, mediawiki.org/wiki/Skin:Citizen) are actually moving towards having the table of contents outside the parser content so I think having more flexibility about where the table of contents goes is going to be important. Should we meet to talk about it?

We can simply move it to ParserOutput as one more derived output property and have clients do what they want to do with it. Between Platform Engineering and us, we are going to be working on T287216 which effectively will create a base class / interface for the current ParserOutput class and that would be a good time to slot it there.

Right now the method ParserOutput::getTOCHTML exists but ideally we'd love that as a JSON describing the heading structure. I believe that exists somewhere as an internal. have opened T287767 to describe the use case.

Change 733032 had a related patch set uploaded (by C. Scott Ananian; author: C. Scott Ananian):

[mediawiki/core@master] Use a <meta> tag for the TOC_PLACEHOLDER

https://gerrit.wikimedia.org/r/733032

Current plan is to deprecate and remove ParserOutput::getTOCHTML(): T293513: Deprecate and remove ParserOutput::setTOCHTML()

Change 748890 had a related patch set uploaded (by C. Scott Ananian; author: C. Scott Ananian):

[mediawiki/core@master] parser: Prepare to use a <meta> tag for the internal TOC_PLACEHOLDER

https://gerrit.wikimedia.org/r/748890

This change is split out from change 721115, where I raised the following concern, which I believe has not yet been addressed or otherwise explained away:

The proposed <meta> tag seems problematic because:

  1. This isn't a page property. Those a very different things. I don't know if Subbu meant the suggestion literally, or as example of an existing meta tag that Parsoid uses (afaik to represent the __TOC__ magic word in VisualEditor, which controls where the TOC appears). Afaik we should be able to tell apart whether something is a page property, including to avoid internal ambiguity and for round-tripping.
  2. [..]
  3. The use of void elements and multiple attributes seems risky […] It's not something we've done before despite having several things like it that have the same requirements. So maybe not the best time to do that right now. It also suggests, incorrectly, that what we contractualize is the attribute value, whereas it's actually a byte-identical string match that we are contractualizing. That is likely going to confuse people at some point.

I'd suggest either using a comment of sorts like before, or something like <mw:foobar></mw:foobar>.

This is an internal Skin contract, not a Parse API contract. When we get to that stage, there may indeed be an unformatted version of this where a <meta> tag gets involved at some point internally for cases where the TOC is explicitly positioned by wikitext to match Parsoid, but I think there are sufficient differences right now that matching it here would add mismatching expectations and confusion, and something we'd likely have to change/break again before we ever make use of it.

In a nut shell, I see these as two significantly different concepts. I personally don't see the benefit or appeal of using tags with attribute values here, and while I do see it as slightly confusing and more fragile within string handling (do we want to pay for DOM parsing/querying/manipulating/stringifiying here?), I certainly have no objection to the proposed "meta tag as magic string" either from my point of view.

My only recommendation would be for this purely-internal never-exposed placeholder in the parser to use a value that cannot be mistaken for a mw:PageProp. Anything else is fine, e.g. mw:ParserInternal/toc or some such.

Briefly -- using a mw:PageProp is because the __TOC__ marker (which is a page prop) *also* marks the desired location of a ToC in the generated output. So the proposal is to use the same <meta property="mw:PageProp/toc"> for both, and (in Parsoid) just mark one as "synthetic" in the same way we generate a synthetic <references/> tag if the user didn't explicitly write one.

Basically the mental model is the same as for <references/> -- if the user explicitly wrote one, we leave it be and use that location for the references. If the user didn't explicitly write one, we insert a synthetic <references/> at the article end. But in both cases, we insert the references at the location of the (user-generated or synthetic) <references/> tag.

That avoids having to expose the multiple cases in ParserOutput::getText() -- and with an additional improvement over the <references/> case, because for ToC there are a number of "suppress ToC" cases too (including the presence of NOTOC) and we handle all of those by *not* generating the synthetic tag. So by using (effectively) the presence of absence of mw:PageProp/toc in the cached HTML we can uniformly handle all the various-locations-of and presence-or-absence-of ToC cases uniformly.

And, because Parsoid is a round-tripping parser, Parsoid just has to look at the presence/absence of the "is synthetic" metadata in data-parsoid to know if it should serialize this as a literal page property or not.

Briefly -- using a mw:PageProp is because the __TOC__ marker (which is a page prop) […]

By what definition is TOC a page prop? My understanding of page props is key-value pairs retrieved by the PageProps service, as stored by LinksUpdate into the page_props table, typically after being built up in a ParserOutput object by the Parser or a parser extension.

I checked both locally and in prod, that __TOC__ does not result in a row in page_props. Upn reviewing the Parser code in question, I did learn that indeed there appears to be an intent for most double-underscore magic words to be reflected as a page property, such as NOTOC and NOINDEX. I wasn't aware of this being done (almost) universally, and not just for those that have a use case to be queried as meta data during page rendering. Eg. this makes sense for NOINDEX which affects logic outside the rendered HTML. I think if it were done explicitly we probably would not do this for TOC/NOTOC etc which only play a role in internal rendering and have no need to be retreived as metadata by itself. I would worry about setting a predecent or awkward requirement that in the future any page prop must be a metatag inside the body HTML portion (as opposed to head metadata that we only expose during editing and naturally strip or don't serialize elsewhere), and vise versa that any internal placeholder for rendering must be persisted as an indexed and searchable page prop in the database.

As for references, I could not find in the Parsoid codebase that this uses a page prop currently. Perhaps that's a future plan. Within my very narrow and limited understanding of Parsoid through the lense of VisualEditor is that the references list is a generated node, much like a transclusion. It produces output that we'd want to recognise as coming from a certain wikitext incantation. The default appended node similarly could have a null representation that I guess either way VE and Parsoid would have to have special knowledge of to know to strip out if/when an explicit one is added.

In any event, so long as we're confident this won't be exposed as the (first) meta HTML tag of its kind in public APIs etc, I don't mind really what you end up using internally. It doesn't seem to match my mental model, but it's more important that it makes sense to you.

It's a good catch that __TOC__ is the only exception to the "all double-underscores are also page properties", caused by this excerpt from Parser::handleDoubleUnderscore:

	private function handleDoubleUnderscore( $text ) {
		# The position of __TOC__ needs to be recorded
		$mw = $this->magicWordFactory->get( 'toc' );
		if ( $mw->match( $text ) ) {
			$this->mShowToc = true;
			$this->mForceTocPosition = true;

			# Set a placeholder. At the end we'll fill it in with the TOC.
			$text = $mw->replace( self::TOC_PLACEHOLDER, $text, 1 );

			# Only keep the first one.
			$text = $mw->replace( '', $text );
		}

		# Now match and remove the rest of them
		$mwa = $this->magicWordFactory->getDoubleUnderscoreArray();
		$this->mDoubleUnderscores = $mwa->matchAndRemove( $text );

                .... various ad hoc handling of specific double underscores ....

		# Cache all double underscores in the database
		foreach ( $this->mDoubleUnderscores as $key => $val ) {
			$this->mOutput->setPageProperty( $key, '' );
		}

		// @phan-suppress-next-line PhanTypeMismatchReturnNullable False positive
		return $text;
	}

So it does seem like perhaps $this->mOutput->setPageProperty('toc', '') should be added to the clause at the top of the file so that all double underscores are uniformly reflected in page properties.

However, your larger point about Parsoid HTML mw:PageProp being confusing/confusable for the ParserOutput "page property" concept is a very sound one. Parsoid uses mw:PageProp for quite a number of things (https://www.mediawiki.org/wiki/Specs/HTML/) and I bet __TOC__ is not the only one which wasn't actually a "ParserOutput page property". I suspect some historical confusion there, as originally Parsoid didn't have a clean separation between "ParserOutput metadata" (jsconfigvars, extensiondata, page properties, categories, ...) and the HTML so tried to embed every bit of metadata directly into the HTML. We may wish to actually rename mw:PageProp in the future and/or more clearly differentiate the Parsoid HTML property from the ParserOutput concept. That's probably a biggish future fix, however; we'd almost certainly have to bump the major revision of the Parsoid HTML spec for that.

WRT <references> I was referring mostly to Parsoid's internal data-parsoid tracking info, and how we handle a variety of "not really there" tags. Not just <references> but also "auto close tags" (when the author forgets a </b> or </a> and it is closed by the end of the paragraph instead, eg). So Parsoid can distinguish between a <meta> tag corresponding to a literal __TOC__ in the source text and a <meta> tag which was automatically inserted because of the lack of a literal __TOC__ location. It is true that the regular expressions in ParserOutput to find and replace the appropriate <meta> might eventually need to be broadened to accomodate the particular form of the <meta> tag which Parsoid emits, which might contain additional attributes, but that can be done at the appropriate future time.

Change 823267 had a related patch set uploaded (by C. Scott Ananian; author: C. Scott Ananian):

[mediawiki/core@master] Add a page property for __TOC__

https://gerrit.wikimedia.org/r/823267

Change 748890 merged by jenkins-bot:

[mediawiki/core@master] parser: Prepare to use a <meta> tag for the internal TOC_PLACEHOLDER

https://gerrit.wikimedia.org/r/748890

Change 733032 merged by jenkins-bot:

[mediawiki/core@master] parser: Use a <meta> tag for the internal TOC_PLACEHOLDER

https://gerrit.wikimedia.org/r/733032

Change 832530 had a related patch set uploaded (by C. Scott Ananian; author: C. Scott Ananian):

[mediawiki/core@master] Remove transitional TOC_START/TOC_END support

https://gerrit.wikimedia.org/r/832530

Assigned to Scott for review.

Change 864864 had a related patch set uploaded (by Subramanya Sastry; author: Subramanya Sastry):

[mediawiki/services/parsoid@master] WIP: Generate section metadata for TOC

https://gerrit.wikimedia.org/r/864864

Change 864864 merged by jenkins-bot:

[mediawiki/services/parsoid@master] Generate section metadata for TOC

https://gerrit.wikimedia.org/r/864864

Change 888790 had a related patch set uploaded (by Subramanya Sastry; author: Subramanya Sastry):

[mediawiki/vendor@master] Bump wikimedia/parsoid to 0.17.0-a15

https://gerrit.wikimedia.org/r/888790

Change 888790 merged by jenkins-bot:

[mediawiki/vendor@master] Bump wikimedia/parsoid to 0.17.0-a15

https://gerrit.wikimedia.org/r/888790

Change 823267 merged by jenkins-bot:

[mediawiki/core@master] Add a page property for __TOC__

https://gerrit.wikimedia.org/r/823267

Change 832530 abandoned by C. Scott Ananian:

[mediawiki/core@master] Remove transitional TOC_START/TOC_END support

Reason:

Replaced by I3254d0acba31e107b50767797a2b0ad28aba59ee

https://gerrit.wikimedia.org/r/832530