Thoughts on element IDs, sections, incremental parsing and fast section editing
Closed, Invalid · Public · 0 Estimated Story Points

Assigned To

None

Authored By

	• GWicke
	Jan 26 2015, 6:34 AM

Description

Some rough ideas on how we could integrate several independent ideas / requirements around section editing, element ids and incremental parsing:

As discussed in T78676, only setting element ids on top-level sections would reduce the compressed html size by about 20%. We could use a path scheme to identify information for nested elements. Something like an index array ['mw1234',0,2,5], which evaluates as document.getElementById('mw1234').childNodes[0].childNodes[2].childNodes[5].
Section editing by top-level section can be efficiently implemented with an offset index and string-based operations. The granularity of edits would still be reasonably small (apart from huge tables, perhaps). Using a top-level wikitext section offset index, we could even serialize only the modified top-level sections, and reuse the wikitext wholesale for the unmodified sections without ever loading the full DOM (which currently accounts for about half of the html2wt time).
Mobile web & apps would like top-level sections (those defined by headings, so often multi-paragraph) to be wrapped into a <section> element for rendering purposes: T78734. They would also like to be able to retrieve the lead section separately from other sections, especially for apps. This can again be supported efficiently with an offset index.
Incremental parsing in parsoid could be section-based too. This would also align with the expectation of wikitext section edits not affecting other parts of the page.
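
A minimal sketch of the path scheme from the first point, assuming a path is an element id followed by child indexes. The helper names `resolvePath` and `pathFor` are hypothetical and not part of Parsoid or RESTBase; `getById` stands in for `document.getElementById` so the sketch also runs outside a browser.

```javascript
// Hypothetical helper: resolve a path such as ['mw1234', 0, 2, 5] to a node.
// Returns null when the path no longer matches the tree (a stale path).
function resolvePath(getById, path) {
  const [id, ...indexes] = path;
  let node = getById(id);
  for (const index of indexes) {
    if (!node || !node.childNodes) {
      return null;
    }
    node = node.childNodes[index];
  }
  return node || null;
}

// Hypothetical inverse: walk up to the nearest ancestor-or-self that has an
// id, recording the child index taken at each level.
function pathFor(node) {
  const indexes = [];
  while (node && !node.id) {
    const parent = node.parentNode;
    if (!parent) {
      return null; // reached the root without finding an id
    }
    indexes.unshift(Array.prototype.indexOf.call(parent.childNodes, node));
    node = parent;
  }
  return node ? [node.id, ...indexes] : null;
}
```

In a browser this would be called as `resolvePath((id) => document.getElementById(id), ['mw1234', 0, 2, 5])`. Note that index paths are only valid against the exact render they were computed from, since any inserted or removed sibling shifts the indexes.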

Related Objects

Status	Assigned	Task
Resolved	• GWicke	T144814 Services Team Goals 2016/2017 Q2: October - December
Resolved	• mobrovac	T136942 Services Team Goals July - September 2016 (Q1 2016-17)
Resolved	• mobrovac	T118871 Services team goals April - June 2016 (Q4 2015/16)
Resolved	None	T118868 Services team goals January - March 2016 (Q3 2015/16)
Resolved	• GWicke	T111819 Services team goals October - December 2015 (Q2 2015/16)
Resolved	• GWicke	T102306 Services team roadmap July - September 2015 (Q1 2015/16)
Resolved	None	T92468 Services Roadmap April - June 2015 (Q4 2014/2015)
Resolved	• GWicke	T91533 Services team Q3 (Jan - March 2015) quarterly goal tracking
Invalid	None	T87556 Thoughts on element IDs, sections, incremental parsing and fast section editing
Resolved	• Pchelolo	T94890 RFC: API for retrieval and saving of top-level HTML elements / sections by element ID
Resolved	Arlolra	T96279 Provide data-section-offsets with HTML and WT offsets for immediate children of <body>
Resolved	• mobrovac	T101501 RFC: HTML and wikitext save API end-points

Event Timeline

• GWicke created this task. · Jan 26 2015, 6:34 AM

• GWicke raised the priority of this task to Needs Triage.

• GWicke updated the task description.

• GWicke added projects: Parsoid, RESTBase-API, VisualEditor.

• GWicke subscribed.

Restricted Application added a subscriber: Aklapper. · Jan 26 2015, 6:34 AM

• GWicke updated the task description. · Jan 26 2015, 6:39 AM

• GWicke edited projects, added VisualEditor-Performance; removed VisualEditor.

• GWicke set Security to None.

• GWicke edited subscribers, added: ssastry, ori, tstarling and 5 others; removed: Aklapper.

Restricted Application added a project: VisualEditor. · Jan 26 2015, 6:39 AM

• GWicke renamed this task from Thoughts on element IDs, incremental parsing and section editing / -markup to Thoughts on element IDs, sections, incremental parsing and fast section editing. · Jan 26 2015, 6:39 AM

• GWicke updated the task description.

• GWicke updated the task description. · Jan 26 2015, 6:55 AM

fbstj awarded a token. · Jan 26 2015, 7:33 AM

Liuxinyu970226 subscribed. · Jan 26 2015, 9:27 AM

This is currently talking solely about top-level sections (and top-level headings), can you elaborate why? (Is this a technical limitation, or are you just listing primary use-cases for now?)

@matmarex, we currently emit ids for each element in the DOM, and *could* support editing at that level. However, it is difficult to do so efficiently without loading a full DOM of the page on the server. All those id attributes also blow up the size of the HTML by about 20% (see T78676).

For these reasons, I am wondering if limiting the granularity to top-level sections would be a good compromise between network transfer size & API performance. You can still edit smaller features of course, but you'd send back the section containing it to the API.

Do you think section edit granularity would be too coarse in general?

I honestly don't know. I think that emitting proper <section/> tag wrappers for the sections (all sections, subsections too) [1], and then providing section-level editing (subsections too) would be great, but maybe that's just because I am used to this. I am sure that it would make many people really happy, irrespective of the actual utility :)

[1] Whenever possible, obviously you can't do this for a section that starts in the middle of a table – not generating wrappers, and not providing section editing for these, wouldn't be unreasonable.
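
A rough sketch of the <section> wrapping discussed above, assuming the input is the flat list of direct children of <body> as `{ tag, html }` records. The function name `wrapSections` and the record shape are hypothetical, and this is not how Parsoid implements it. Because only top-level nodes are inspected, a heading that sits inside a table is never seen and so gets no wrapper, which matches the restriction in [1].

```javascript
// Hypothetical sketch: wrap a flat list of top-level nodes into nested
// <section> elements, opening a new section at each heading.
function wrapSections(nodes) {
  const out = ['<section>']; // the lead section comes first
  const open = [];           // heading levels of the sections currently open
  let leadOpen = true;
  for (const node of nodes) {
    const m = /^h([1-6])$/.exec(node.tag);
    if (m) {
      const level = Number(m[1]);
      if (leadOpen) {
        out.push('</section>');
        leadOpen = false;
      }
      // A heading closes every open section at the same or a deeper level.
      while (open.length && open[open.length - 1] >= level) {
        out.push('</section>');
        open.pop();
      }
      out.push('<section>');
      open.push(level);
    }
    out.push(node.html);
  }
  if (leadOpen) {
    out.push('</section>');
  }
  while (open.length) {
    out.push('</section>');
    open.pop();
  }
  return out.join('');
}
```

A page that starts directly with a heading gets an empty lead section here; whether to emit or drop it is a design choice this sketch leaves open.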

matmarex awarded a token. · Jan 26 2015, 10:29 PM

• brooke subscribed. · Jan 26 2015, 11:00 PM

• GWicke updated the task description. · Jan 27 2015, 3:05 AM

• GWicke updated the task description. · Jan 27 2015, 3:08 AM

• GWicke updated the task description.

• GWicke mentioned this in T86564: 2015 MediaWiki Developer Summit session proposal: Long-term plan for content representation, editing, caching and skins. · Jan 27 2015, 3:13 AM

Liuxinyu970226 awarded a token. · Jan 27 2015, 4:02 AM

• ssastry moved this task from Needs Triage to In Progress on the Parsoid board. · Jan 31 2015, 1:18 AM

Jdforrester-WMF moved this task from To Triage to Bug Fixes on the VisualEditor board. · Jan 31 2015, 3:59 AM

Jdforrester-WMF triaged this task as Medium priority. · Feb 5 2015, 12:37 AM

Jdforrester-WMF added a project: VisualEditor 2014/15 Q3 blockers. · Feb 10 2015, 8:39 PM

Jdforrester-WMF edited a custom field. · Feb 10 2015, 9:04 PM

Jdforrester-WMF moved this task from Nominated to Dependencies on the VisualEditor 2014/15 Q3 blockers board. · Feb 11 2015, 12:33 AM

Aklapper mentioned this in T88613: Page bump when editing a section in VE. · Feb 11 2015, 7:31 PM

Aklapper mentioned this in T55217: VisualEditor: When editing a section, don't wait for loading to be complete before scrolling to the section.

Aklapper updated the task description. · Feb 11 2015, 7:34 PM

Tim by email:

Why is it necessary to ensure that template output is balanced? If a template starts out as a single DOM subtree, and it is changed so that it is a new DOM subtree, then surely you can just replace the old subtree in the output with the new one. If you reparse the template and it turns out to be unbalanced when it was previously balanced, then you can reparse the whole page. An unbalanced template would not generate the same sort of annotation in the DOM output, so if a template changes which was previously unbalanced, no subtree could be found to selectively replace, and so the whole page would be reparsed.

Gabriel's reply by IRC:
<gwicke> btw, re the dom fragment question: IIRC the main issue is determining whether some bit of HTML is ultimately balanced or not without hacking up the HTML tree builder library
<gwicke> even a stray closing tag could combine with earlier sections
<gwicke> depending on which kind of element it is etc
<TimStarling> and how hard would it be to hack up the HTML tree builder library?
<gwicke> it might not be impossible, but it's pretty ugly as we have so far avoided forking the upstream library
<TimStarling> so, hundreds of lines?
<gwicke> the other issue is that VE previews for example parse a new template in one dom parse
<gwicke> if templates can be unbalanced, then it won't necessarily be WYSIWYG
<gwicke> hundreds of lines that we'll have to maintain against an upstream library
<TimStarling> got it
<TimStarling> if you're previewing, say, a table header, just closing off the tags would give you a reasonable preview
<gwicke> that's what happens by default
<gwicke> the issue isn't the balancing, but discovering that it happened and figuring out whether it'll matter in a full-page re-parse
<gwicke> most of the balancing does not affect the outer context at all; it is also ubiquitous
<gwicke> so just marking everything that was balanced won't be helpful
<TimStarling> so this is parsoid/node_modules/html5?
<gwicke> yup
<gwicke> there are also cases where the wikitext will combine with other syntax in a way that won't cause an unbalanced DOM
<gwicke> we could make the argument that we don't care about those
<gwicke> example: [{{echo|[}}foo]]
<gwicke> there are less far-fetched real-life templates in common use on nlwiki
<gwicke> they produce the attributes (but not the start tag) of a table tag followed by a newline and table contents
<gwicke> parsoid detects when page content ends up in a dom with template content

• GWicke mentioned this in T78676: Store & load data-mw separately. · Feb 12 2015, 5:57 AM

Jdforrester-WMF added a project: Blocked-on-Parsoid. · Feb 13 2015, 9:39 PM

• Elitre subscribed. · Feb 27 2015, 3:53 PM

• GWicke added a parent task: T91533: Services team Q3 (Jan - March 2015) quarterly goal tracking. · Mar 4 2015, 6:41 PM

• GWicke mentioned this in T91533: Services team Q3 (Jan - March 2015) quarterly goal tracking. · Mar 4 2015, 6:51 PM

Jdforrester-WMF removed a project: VisualEditor 2014/15 Q3 blockers. · Mar 10 2015, 10:45 PM

• GWicke mentioned this in T92468: Services Roadmap April - June 2015 (Q4 2014/2015). · Mar 13 2015, 5:00 PM

• GWicke mentioned this in T94422: Consistently use the same render for html2wt processing after an edit. · Mar 30 2015, 6:50 PM

• GWicke mentioned this in T93715: [EPIC] Make Parsoid HTML output completely deterministic. · Mar 30 2015, 7:10 PM

Regarding this comment in the description: "Using a top-level wikitext section offset index, we could even serialize only the modified top-level sections, and reuse the wikitext wholesale for the unmodified sections without ever loading the full DOM (which currently accounts for about half of the html2wt time)."

Loading the DOM is < 20% of the total time spent inside Parsoid (~50ms out of about ~300ms?) and fairly small in the total time spent in the VE -> save wikitext lifecycle. So, in my opinion, we should pick an id assignment strategy that simplifies the implementation.

@ssastry, I think the exact assignment strategy is fairly orthogonal to section serialization. The id assignment strategy discussed above is primarily aimed at reducing the size overhead we introduced by adding a random ID attribute on each element.

In any case, are your timings the cumulative DOM parse times for both the original *and* the modified HTML, or only for one of them? Only serializing a section would of course also reduce the time spent in DOM diffing and selser itself. The 99th percentile for html2wt is around 2s, surely mostly driven by large pages. Getting that down should be useful especially for small section edits and micro-contributions.
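
To make the "serialize only the modified section" idea concrete, here is a sketch using plain string operations. It assumes a hypothetical index format of one `{ start, end }` pair per top-level section (offsets into the page wikitext, `end` exclusive) and that the sections cover the page contiguously. The names `getSection` and `replaceSections` are made up for illustration.

```javascript
// Hypothetical: fetch one top-level section's wikitext without parsing the page.
function getSection(wikitext, index, i) {
  return wikitext.slice(index[i].start, index[i].end);
}

// Hypothetical: splice edited sections into the page. `edits` maps a section
// number to its new wikitext; unmodified sections are copied over verbatim.
// Returns the new page text together with a rebuilt offset index.
function replaceSections(wikitext, index, edits) {
  const parts = [];
  const newIndex = [];
  let pos = 0;
  index.forEach((section, i) => {
    const text = edits.has(i)
      ? edits.get(i)
      : wikitext.slice(section.start, section.end);
    parts.push(text);
    newIndex.push({ start: pos, end: pos + text.length });
    pos += text.length;
  });
  return { wikitext: parts.join(''), index: newIndex };
}
```

Only the edited section would need to go through html2wt; everything else is a string copy. The offsets here are JavaScript string (UTF-16) offsets, so an index stored in bytes would need converting first.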

Okay .. I didn't read carefully .. I was mostly concerned that id assignment and section serialization were related. I was concerned because the proposed id assignment scheme has the potential of mixing up metadata on edits (as we discussed on T94422). But yes, they are indeed orthogonal things, in which case we can deal with the id assignment issues separately from section editing.

We (@ssastry, @Catrope, @GWicke) just discussed this a bit on IRC. As a first step, we agreed that it would be useful to build an API for retrieval and saving of direct child elements of body. The wider issue around section-based parsing and markup can then be tackled independently in a second phase. If we end up introducing a <section> wrapper element, then this will continue to work with a top-level element based API.