Parsoid and the legacy parser should emit exactly the same ParserOutput metadata
Open, Medium, Public

Description

Based on input from @Anomie: we need to be able to get the following data from the output of a Parsoid parse for insertion into the database:

  • All pages linked (namespace number + dbkey) or otherwise depended on for existence checks
    • "otherwise depended on" can include things like #ifexist and some kinds of access via Scribunto.
    • For redirects this includes only the redirect, not its target.
    • This is used to know which pages need updating when a page is created or deleted.
  • All templates transcluded (namespace number + dbkey)
    • This includes templates transcluded via other templates.
    • This includes both the redirect and its target, e.g. if Template:Cn redirects to Template:Citation_needed, a page using {{cn}} would record both.
    • This includes other page content dependencies from things like Scribunto loading a page's content for processing.
    • This is used to know which pages need updating when a page is edited, and for cascading protection.
  • All files embedded (dbkey, namespace number is always 6).
    • This includes both the redirect and its target.
    • This includes other file dependencies from things like Scribunto accessing file metadata.
    • This is used to know which pages need updating when a file is uploaded, and for cascading protection.
  • All categories the page is supposed to be in (dbkey, namespace number is always 14), including the raw sortkey for each.
    • This includes tracking categories that aren't directly present in the wikitext. T137584 requests making it possible to add such categories from Scribunto.
    • This is used to produce the list of pages shown when viewing the category, the list of categories at the bottom of the page, and the "category watching" feature.
  • All language links (prefix and target string).
    • I'm not sure whether Wikibase provides its langlinks separately or injects them into the ParserOutput during the parse.
    • This is used for the sidebar.
  • All interwiki links (prefix and target string).
  • All external link URLs. This does not include language links, interwiki links, and so on.
  • Any "page properties" set during the parse. A page property has a name (max 60 bytes) and a value (max 255 bytes).
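
To make the shape of this data concrete, here is a minimal Python sketch of a container for the database-bound metadata listed above. The class name `PageMetadata` and its methods are hypothetical and are not part of MediaWiki or Parsoid. The sketch assumes page property values are strings measured in UTF-8 bytes, and it rejects over-limit properties, which is only one possible policy (the list above states the limits but not what happens when they are exceeded).

```python
from dataclasses import dataclass, field

MAX_PROP_NAME_BYTES = 60    # limit stated in the list above
MAX_PROP_VALUE_BYTES = 255  # limit stated in the list above


@dataclass
class PageMetadata:
    """Hypothetical container for the database-bound metadata of one parse."""
    links: set = field(default_factory=set)            # (namespace, dbkey)
    templates: set = field(default_factory=set)        # (namespace, dbkey)
    files: set = field(default_factory=set)            # dbkey only; namespace is always 6
    categories: dict = field(default_factory=dict)     # dbkey -> raw sortkey
    language_links: set = field(default_factory=set)   # (prefix, target)
    interwiki_links: set = field(default_factory=set)  # (prefix, target)
    external_links: set = field(default_factory=set)   # URL strings
    page_props: dict = field(default_factory=dict)     # name -> value

    def add_template(self, namespace, dbkey, redirect_target=None):
        """Record a transclusion; for a redirect, record both redirect and target."""
        self.templates.add((namespace, dbkey))
        if redirect_target is not None:
            self.templates.add(redirect_target)

    def set_page_prop(self, name, value):
        """Store a page property; rejecting over-limit input is an assumed policy."""
        if len(name.encode("utf-8")) > MAX_PROP_NAME_BYTES:
            raise ValueError("page property name exceeds 60 bytes")
        if len(value.encode("utf-8")) > MAX_PROP_VALUE_BYTES:
            raise ValueError("page property value exceeds 255 bytes")
        self.page_props[name] = value


meta = PageMetadata()
# {{cn}} where Template:Cn redirects to Template:Citation_needed (Template is namespace 10)
meta.add_template(10, "Cn", redirect_target=(10, "Citation_needed"))
meta.set_page_prop("defaultsort", "Example")  # illustrative property name and value
```

Using sets for the link tables means a page that links to or transcludes the same target several times still yields one record per target.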

This data might be collected during the parse, or determined by crawling the DOM after the fact, or really any other method as long as the necessary information is available.
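
As an illustration of the "crawl the DOM after the fact" option, the following Python sketch walks an HTML string and collects external links and categories. It assumes that Parsoid marks external links with `rel="mw:ExtLink"` and categories with `<link rel="mw:PageProp/Category" href="./Category:Name#sortkey">`; treat those markers, the `MetadataCrawler` class, and the sample HTML as assumptions for illustration rather than a complete account of Parsoid's output.

```python
from html.parser import HTMLParser
from urllib.parse import unquote


class MetadataCrawler(HTMLParser):
    """Hypothetical post-parse crawler collecting two kinds of metadata."""

    def __init__(self):
        super().__init__()
        self.external_links = set()  # URL strings
        self.categories = {}         # category name -> raw sortkey

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        rels = (attrs.get("rel") or "").split()
        href = attrs.get("href") or ""
        if tag == "a" and "mw:ExtLink" in rels:
            self.external_links.add(href)
        elif tag == "link" and "mw:PageProp/Category" in rels:
            # Assumed convention: the sortkey travels in the URL fragment.
            target, _, sortkey = href.partition("#")
            name = unquote(target).removeprefix("./Category:")
            self.categories[name] = unquote(sortkey)


# Sample input written by hand for this sketch, not real Parsoid output.
html = (
    '<p><a rel="mw:ExtLink" href="https://example.org/">example</a></p>'
    '<link rel="mw:PageProp/Category" href="./Category:Foo#Bar"/>'
)
crawler = MetadataCrawler()
crawler.feed(html)
```

A crawl like this only sees what survives into the HTML. Dependencies such as #ifexist checks or Scribunto content loads leave no link element behind, so they would presumably still need to be recorded during the parse.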

There's also some data recorded in the ParserOutput object so it can be used when the output is pulled from the parser cache, but not otherwise stored in the database:

  • Page status indicators: An associative array of keys to HTML blobs to render right-floated next to the page title. This is separate from the main content.
  • An associative array of HTML blobs to add to the <head> tag.
  • ResourceLoader:
    • modules to load for the page
    • CSS modules to load for the page
    • Key-value pairs for page-specific entries in mw.config.
  • An array of wikitext blobs to display as "warnings" at the top of the edit page preview.
  • Table of contents data, available as a structured array and as HTML.
  • Parser options used during the parse, to reduce cache fragmentation (see T247788#5976651 for a brief explanation).
  • Limit report data, e.g.
    • Time taken for the parse
    • Values of counters for various limits
    • Scribunto profiling and log output
  • Flags like whether to emit X-Frame-Options: DENY and whether the page should be noindexed.
  • CSP data: extra values for script-src, default-src, and style-src.
  • If the parse had to guess at values for {{REVISIONID}}, {{PAGEID}}, and/or {{REVISIONTIMESTAMP}} for a preview parse, the values that were guessed. Used to avoid re-parsing on save in case it turns out to have guessed correctly.
  • Timestamp of the parse, used for things like {{CURRENTMONTH}}.
  • Recommended cache TTL, e.g. based on use of things like {{CURRENTMONTH}}.
  • Key-value pairs holding arbitrary data set by extensions, similar to the page properties but not stored separately in the DB.
  • Getters for some specific page properties.
  • Revision IDs of templates transcluded, used by FlaggedRevs and UploadWizard.
  • SHA-1 hash and timestamp of files embedded, used by FlaggedRevs and MultimediaViewer.
  • There's a system for registering pre-output processing hooks, only used by Extension:LanguageSelector (not WMF deployed).
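
Two of the cache-only items above, the recommended cache TTL and the guessed preview values, can be sketched in a few lines of Python. The class `CacheOnlyMetadata` and its methods are hypothetical. Taking the minimum of all requested TTLs, and treating a parse that made no guess as always reusable, are assumptions made for this sketch and are not confirmed by the task description.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class CacheOnlyMetadata:
    """Hypothetical holder for two pieces of cache-only parse metadata."""
    cache_ttl: int                             # recommended cache lifetime, in seconds
    guessed_revision_id: Optional[int] = None  # set only if the parse guessed {{REVISIONID}}

    def lower_ttl(self, seconds):
        """Assumed policy: a time-dependent construct may shorten the TTL, never extend it."""
        self.cache_ttl = min(self.cache_ttl, seconds)

    def reusable_after_save(self, actual_revision_id):
        """True if the preview output can be reused instead of re-parsing on save."""
        return (self.guessed_revision_id is None
                or self.guessed_revision_id == actual_revision_id)


# Illustrative numbers only.
preview = CacheOnlyMetadata(cache_ttl=86400, guessed_revision_id=1001)
preview.lower_ttl(3600)   # e.g. the page used something like {{CURRENTMONTH}}
preview.lower_ttl(7200)   # a longer request does not raise the TTL again
```

With these values, `preview.reusable_after_save(1001)` is true and `preview.reusable_after_save(1002)` is false, which mirrors the "guessed correctly" case described above.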

Again, these might be collected during the parse or determined by crawling the DOM after the fact. Some might even be deprecated or change significantly for Parsoid, as long as any calling code is updated.
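
One way to check the goal in the task title is to compare the metadata from both parsers field by field. The Python sketch below assumes each parser's metadata has already been flattened into a plain dictionary; the function name `diff_metadata` and the field names in the sample data are hypothetical.

```python
def diff_metadata(legacy, parsoid):
    """Return {field: (legacy_value, parsoid_value)} for every field that differs."""
    differences = {}
    for key in sorted(set(legacy) | set(parsoid)):
        if legacy.get(key) != parsoid.get(key):
            differences[key] = (legacy.get(key), parsoid.get(key))
    return differences


# Hand-written sample data for illustration.
legacy_meta = {
    "categories": {"Foo": "Bar"},
    "external_links": {"https://example.org/"},
}
parsoid_meta = {
    "categories": {"Foo": "Bar"},
    "external_links": set(),
}
mismatches = diff_metadata(legacy_meta, parsoid_meta)
```

An empty result would mean the two outputs agree on every field present in either dictionary. Here only `external_links` is reported.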

Event Timeline

ssastry triaged this task as Medium priority. Jun 14 2022, 9:25 PM

Change 888058 had a related patch set uploaded (by C. Scott Ananian; author: C. Scott Ananian):

[mediawiki/services/parsoid@master] Add comments about maintaining language link metadata

https://gerrit.wikimedia.org/r/888058

Change 888058 merged by jenkins-bot:

[mediawiki/services/parsoid@master] Add comments about maintaining language link metadata

https://gerrit.wikimedia.org/r/888058

Change 901245 had a related patch set uploaded (by Arlolra; author: Arlolra):

[mediawiki/vendor@master] Bump parsoid to 0.18.0-a2

https://gerrit.wikimedia.org/r/901245

Change 901245 merged by jenkins-bot:

[mediawiki/vendor@master] Bump parsoid to 0.18.0-a2

https://gerrit.wikimedia.org/r/901245