Thoughts on element IDs, sections, incremental parsing and fast section editing
Closed, Invalid · Public · 0 Estimated Story Points

Description

Some rough ideas on how we could integrate several independent ideas / requirements around section editing, element ids and incremental parsing:

  • As discussed in T78676, only setting element ids on top-level sections would reduce the compressed html size by about 20%. We could use a path scheme to identify information for nested elements. Something like an index array ['mw1234',0,2,5], which evaluates as document.getElementById('mw1234').childNodes[0].childNodes[2].childNodes[5].
  • Section editing by top-level section can be efficiently implemented with an offset index and string-based operations. The granularity of edits would still be reasonably small (apart from huge tables, perhaps). Using a top-level wikitext section offset index, we could even serialize only the modified top-level sections, and reuse the wikitext wholesale for the unmodified sections without ever loading the full DOM (which currently accounts for about half of the html2wt time).
  • Mobile web & apps would like top-level sections (those defined by headings, so often multi-paragraph) to be wrapped into a <section> element for rendering purposes: T78734. They would also like to be able to retrieve the lead section separately from other sections, especially for apps. This can again be supported efficiently with an offset index.
  • Incremental parsing in parsoid could be section-based too. This would also align with the expectation of wikitext section edits not affecting other parts of the page.
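The path scheme from the first bullet can be sketched roughly as follows. This is only an illustration under stated assumptions: `resolvePath` and `getById` are hypothetical names (not Parsoid or VisualEditor APIs), and a small stand-in tree replaces the real DOM so the snippet runs on its own.

```javascript
// Resolve a path such as ['mw1234', 0, 2, 5]: the first entry is an element id,
// the remaining entries are successive childNodes indexes.
// `resolvePath` and `getById` are hypothetical names used for illustration.
function resolvePath(path, getById) {
  const [id, ...indexes] = path;
  let node = getById(id);
  for (const i of indexes) {
    if (!node || !node.childNodes || i >= node.childNodes.length) {
      return null; // the path no longer matches the tree
    }
    node = node.childNodes[i];
  }
  return node || null;
}

// Tiny stand-in tree so the sketch runs without a DOM.
const leaf = { nodeName: 'B', childNodes: [] };
const section = {
  id: 'mw1234',
  childNodes: [
    { nodeName: 'P', childNodes: [{ nodeName: '#text', childNodes: [] }, leaf] },
  ],
};
const byId = (id) => (id === section.id ? section : null);

const found = resolvePath(['mw1234', 0, 1], byId); // the <b> stand-in
```

In a browser the lookup would presumably be `resolvePath(path, (id) => document.getElementById(id))`. Note that such a path is only valid for the exact tree it was computed against; any insertion before the target shifts the indexes.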

See Also (regarding fast section editing):
T52206: No loading animation when editing a section deep down a page
T55217: VisualEditor: When editing a section, don't wait for loading to be complete before scrolling to the section
T88613: Page bump when editing a section in VE

Event Timeline

GWicke raised the priority of this task from to Needs Triage.
GWicke updated the task description.
GWicke subscribed.
GWicke edited projects, added VisualEditor-Performance; removed VisualEditor.
GWicke set Security to None.
GWicke edited subscribers, added: ssastry, ori, tstarling and 5 others; removed: Aklapper.
GWicke renamed this task from Thoughts on element IDs, incremental parsing and section editing / -markup to Thoughts on element IDs, sections, incremental parsing and fast section editing. (Jan 26 2015, 6:39 AM)
GWicke updated the task description.

This is currently talking solely about top-level sections (and top-level headings). Can you elaborate why? (Is this a technical limitation, or are you just listing primary use-cases for now?)

@matmarex, we currently emit ids for each element in the DOM, and *could* support editing at that level. However, it is difficult to do so efficiently without loading a full DOM of the page on the server. All those id attributes also blow up the size of the HTML by about 20% (see T78676).

For these reasons, I am wondering if limiting the granularity to top-level sections would be a good compromise between network transfer size & API performance. You can still edit smaller features of course, but you'd send back the section containing it to the API.
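To make the offset-index idea from the task description a bit more concrete, here is a minimal sketch of string-based section replacement. It assumes, for illustration only, that a top-level section starts at a level-2 heading line of the form `== Heading ==`; real wikitext has many more cases (nowiki, templates, HTML headings), and `buildSectionIndex` / `replaceSection` are hypothetical names, not existing API.

```javascript
// Build [start, end) character offsets of top-level sections in a wikitext string.
// Assumption: only lines like "== Heading ==" (level 2) start a top-level section.
function buildSectionIndex(wikitext) {
  const starts = [0]; // section 0 is the lead section (or the first heading's section)
  const re = /^==[^=].*==[ \t]*$/gm;
  let m;
  while ((m = re.exec(wikitext)) !== null) {
    if (m.index !== 0) starts.push(m.index);
  }
  return starts.map((start, i) => ({
    start,
    end: i + 1 < starts.length ? starts[i + 1] : wikitext.length,
  }));
}

// Splice new wikitext for section n; all other sections are reused as-is,
// without parsing them or loading a DOM.
function replaceSection(wikitext, index, n, newText) {
  const { start, end } = index[n];
  return wikitext.slice(0, start) + newText + wikitext.slice(end);
}

const page = 'Lead text.\n== A ==\nalpha\n=== A1 ===\nsub\n== B ==\nbeta\n';
const index = buildSectionIndex(page);
const lead = page.slice(index[0].start, index[0].end);
const edited = replaceSection(page, index, 1, '== A ==\nedited\n');
```

The same index would also let a client fetch only the lead section (`index[0]`), as mobile apps requested. After a splice the offsets of later sections are stale, so the index would need to be rebuilt or shifted.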

Do you think section edit granularity would be too coarse in general?

I honestly don't know. I think that emitting proper <section/> tag wrappers for the sections (all sections, subsections too) [1], and then providing section-level editing (subsections too) would be great, but maybe that's just because I am used to this. I am sure that it would make many people really happy, irrespective of the actual utility :)

[1] Whenever possible, that is. Obviously you can't do this for a section that starts in the middle of a table; not generating wrappers, and not providing section editing, for these wouldn't be unreasonable.

Tim by email:

Why is it necessary to ensure that template output is balanced? If a template starts out as a single DOM subtree, and it is changed so that it is a new DOM subtree, then surely you can just replace the old subtree in the output with the new one. If you reparse the template and it turns out to be unbalanced when it was previously balanced, then you can reparse the whole page. An unbalanced template would not generate the same sort of annotation in the DOM output, so if a template changes which was previously unbalanced, no subtree could be found to selectively replace, and so the whole page would be reparsed.
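The decision rule described in this email can be written down as a small sketch. It takes the balanced/unbalanced flags as given inputs, which is an assumption: the discussion that follows points out that computing them reliably is the hard part. `planTemplateUpdate` is a hypothetical name.

```javascript
// Decide how to update page output after a template changed.
// wasBalanced: the old expansion formed a single DOM subtree.
// isBalanced: the new expansion forms a single DOM subtree.
// Both flags are assumed to be known here; detecting them is not shown.
function planTemplateUpdate(wasBalanced, isBalanced) {
  if (wasBalanced && isBalanced) {
    return 'replace-subtree'; // swap the old subtree for the new one
  }
  // Either the template became unbalanced, or it was unbalanced before
  // and so has no annotated subtree to replace selectively.
  return 'reparse-page';
}
```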

Gabriel's reply by IRC:
<gwicke> btw, re the dom fragment question: IIRC the main issue is determining whether some bit of HTML is ultimately balanced or not without hacking up the HTML tree builder library
<gwicke> even a stray closing tag could combine with earlier sections
<gwicke> depending on which kind of element it is etc
<TimStarling> and how hard would it be to hack up the HTML tree builder library?
<gwicke> it might not be impossible, but it's pretty ugly as we have so far avoided forking the upstream library
<TimStarling> so, hundreds of lines?
<gwicke> the other issue is that VE previews for example parse a new template in one dom parse
<gwicke> if templates can be unbalanced, then it won't necessarily be WYSIWYG
<gwicke> hundreds of lines that we'll have to maintain against an upstream library
<TimStarling> got it
<TimStarling> if you're previewing, say, a table header, just closing off the tags would give you a reasonable preview
<gwicke> that's what happens by default
<gwicke> the issue isn't the balancing, but discovering that it happened and figuring out whether it'll matter in a full-page re-parse
<gwicke> most of the balancing does not affect the outer context at all; it is also ubiquitous
<gwicke> so just marking everything that was balanced won't be helpful
<TimStarling> so this is parsoid/node_modules/html5?
<gwicke> yup
<gwicke> there are also cases where the wikitext will combine with other syntax in a way that won't cause an unbalanced DOM
<gwicke> we could make the argument that we don't care about those
<gwicke> example: [{{echo|[}}foo]]
<gwicke> there are less far-fetched real-life templates in common use on nlwiki
<gwicke> they produce the attributes (but not the start tag) of a table tag followed by a newline and table contents
<gwicke> parsoid detects when page content ends up in a dom with template content

Regarding this comment in the description: "Using a top-level wikitext section offset index, we could even serialize only the modified top-level sections, and reuse the wikitext wholesale for the unmodified sections without ever loading the full DOM (which currently accounts for about half of the html2wt time)."

DOM loading is < 20% of the total time spent inside Parsoid (~50ms out of about ~300ms?) and a fairly small part of the total time spent in the VE -> save wikitext lifecycle. So, in my opinion, we should pick an id assignment strategy that simplifies the implementation.

@ssastry, I think the exact assignment strategy is fairly orthogonal to section serialization. The id assignment strategy discussed above is primarily aimed at reducing the size overhead we introduced by adding a random ID attribute on each element.

In any case, are your timings the cumulative DOM parse times for both the original *and* the modified HTML, or only for one of them? Only serializing a section would of course also reduce the time spent in DOM diffing and selser itself. The 99th percentile for html2wt is around 2s, surely mostly driven by large pages. Getting that down should be useful, especially for small section edits and micro-contributions.

Okay .. I didn't read carefully .. I was mostly concerned that id assignment and section serialization were related. I was concerned because the proposed id assignment scheme has the potential of mixing up metadata on edits (as we discussed on T94422). But, yes they are indeed orthogonal things in which case we can deal with the id assignment issues separately from section editing.

We (@ssastry, @Catrope, @GWicke) just discussed this a bit on IRC. As a first step, we agreed that it would be useful to build an API for retrieval and saving of direct child elements of body. The wider issue around section-based parsing and markup can then be tackled independently in a second phase. If we end up introducing a <section> wrapper element, then this will continue to work with a top-level element based API.

Pchelolo subscribed.

I believe this is invalid after so many years and changes in thinking.