Detect when pages have generated content, and figure out how to inject this content into the Read HTML
Closed, Declined (Public)


The Parsing team has identified that generated content is omitted from Parsoid output. See here

This means that for the Reading use case we have to figure out when this content is missing and then inject it in a way that makes sense for the Reading HTML.

Two examples of this:
T151223: Category listing pages are not populated properly in Parsoid
T148118: Parsoid doesn't include the main image for a File page

Questions to answer for this ticket:

  1. It is implied that this is related to specific namespaces. But is this something that happens in the main namespace?
  2. Is there anything that can be gleaned from the Parsoid output to let us know it is missing?
  3. If not, is there a definitive list of namespaces that we need to cover?
  4. Is this a project-specific thing? Is it different in each project?
  5. Once we determine if there is generated content, we need to figure out how to generate it and inject it into the HTML - can we use the MW API? Is it specific to the type of content?
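As a starting point for questions 3 and 5, here is a rough sketch of a namespace-based check plus the MW API query that might supply the missing content. The namespace IDs (6 for File, 14 for Category) are the standard MediaWiki values, but the set itself is only an assumption drawn from the two example tickets above, not a definitive list. The function names and the `api_base` parameter are hypothetical, and whether `list=categorymembers` and `prop=imageinfo` are the right API modules for the Reading HTML is still an open question.

```python
from typing import Optional
from urllib.parse import urlencode

# Standard MediaWiki namespace IDs. Treating only these two as having
# generated content is an assumption based on T151223 and T148118.
NS_FILE = 6
NS_CATEGORY = 14
GENERATED_CONTENT_NAMESPACES = {NS_FILE, NS_CATEGORY}


def needs_generated_content(namespace_id: int) -> bool:
    """Guess from the namespace alone whether Parsoid output lacks content."""
    return namespace_id in GENERATED_CONTENT_NAMESPACES


def build_api_query(api_base: str, namespace_id: int, title: str) -> Optional[str]:
    """Build a MW Action API URL that could supply the missing content.

    Returns None for namespaces this sketch does not know about.
    """
    if namespace_id == NS_CATEGORY:
        params = {
            "action": "query",
            "list": "categorymembers",
            "cmtitle": title,
            "format": "json",
        }
    elif namespace_id == NS_FILE:
        params = {
            "action": "query",
            "prop": "imageinfo",
            "iiprop": "url",
            "titles": title,
            "format": "json",
        }
    else:
        return None
    return f"{api_base}?{urlencode(params)}"
```

The sketch only builds the URL. Fetching the response and deciding where the result goes in the Reading HTML would still depend on the type of content, which is what question 5 asks.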

Event Timeline

Notes from @ssastry

Some old notes from our parsing team meeting, in case it helps ... we are going to be brainstorming about a parser extension mechanism while at the hackathon, so this might make a cameo appearance there.

  • maybe treat as wt = page-wt + <namespace ns="..", title=".."></namespace>
  • (but on file pages the custom wikitext comes last; on category pages it comes first)
  • cscott's idea is to try to encapsulate structured data inside the opaque "extension" so that we can let folks write specialized editors as "extensions"
  • pretend the wikitext is: <special type="category">category description in wikitext</special>, so that you could then write an "extension" to edit the wikitext portion w/o directly exposing it.
  • file pages would have more complicated structured data "inside".
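To make the first two bullets concrete, here is a minimal sketch of the "wt = page-wt + placeholder" idea. It assumes "custom wikitext" means the user-authored page wikitext, so the placeholder for generated content goes after it on category pages and before it on file pages. The `compose_wikitext` name and the use of namespace names as `ns` values are hypothetical; the notes leave `ns=".."` unspecified.

```python
from html import escape


def compose_wikitext(page_wt: str, ns: str, title: str) -> str:
    """Combine page wikitext with a placeholder tag for generated content.

    The <namespace> tag mirrors the notation in the notes; it is not an
    existing Parsoid or MediaWiki extension tag.
    """
    placeholder = (
        f'<namespace ns="{escape(ns, quote=True)}" '
        f'title="{escape(title, quote=True)}"></namespace>'
    )
    if ns == "File":
        # File pages: generated content first, custom wikitext last.
        return placeholder + "\n" + page_wt
    # Category pages (and the default here): custom wikitext first.
    return page_wt + "\n" + placeholder


print(compose_wikitext("Articles about physics.", "Category", "Category:Physics"))
```

Keeping the placeholder opaque is what would let an extension own the structured data inside it without exposing it to Parsoid core.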

this is a bit foggy, have to think about how to preserve appropriate abstraction so that PHP side can expose structured data w/o (necessarily) exposing it to Parsoid core... but leave it in a form that would let a Parsoid extension optionally work on it.

Another important note:

This output is NOT useful for editing clients, only for reading clients. So this is why we are investigating it within the scope of the PCS and not within Parsoid at this time.