Parsoid HTML contains a good amount of information in its head section:
<head prefix="mwr: http://en.wikipedia.org/wiki/Special:Redirect/">
  <meta property="mw:TimeUuid" content="2153e39e-a974-11e5-b4f1-0512e7f3ec96">
  <meta property="mw:articleNamespace" content="0">
  <link rel="dc:replaces" resource="mwr:revision/696077222">
  <meta property="dc:modified" content="2015-12-23T12:52:52.000Z">
  <meta about="mwr:user/9455233" property="dc:title" content="M2545">
  <link rel="dc:contributor" resource="mwr:user/9455233">
  <meta property="mw:revisionSHA1" content="eec7863a2b6aa4e6913cead6b286110f1d8457f0">
  <meta property="dc:description" content="/* History */">
  <meta property="mw:parsoidVersion" content="0">
  <link rel="dc:isVersionOf" href="//en.wikipedia.org/wiki/San_Francisco">
  <title>San_Francisco</title>
  <base href="//en.wikipedia.org/wiki/">
  <link rel="stylesheet" href="//en.wikipedia.org/w/load.php?modules=mediawiki.legacy.commonPrint,shared|mediawiki.skinning.elements|mediawiki.skinning.content|mediawiki.skinning.interface|skins.vector.styles|site|mediawiki.skinning.content.parsoid|ext.cite.style|mediawiki.raggett&only=styles&skin=vector">
  <style type="text/css">:root .ext-quick-survey-panel {display:none !important;}</style>
</head>
We designed this head section in an early phase of the Parsoid project, before we had actual users apart from VisualEditor. Now that we have actual users, I think it's worth revisiting which of this metadata turns out to be useful in its current form.
Doubts and issues
Basically all revision-related information is already available separately as JSON revision metadata. Those API endpoints implement features like user name suppression, which is necessary when legally sensitive information is embedded in user names. Suppression is not implemented for Parsoid HTML, and it seems unlikely that the added complexity would be worth it.
Page-specific information like per-page styles isn't provided in a format that is very useful for composing content from several elements along the lines of T105845. It would be desirable to provide this information in a more structured form, so that it can be aggregated and processed while composing content.
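To illustrate the aggregation problem, here is a rough Python sketch of what a client composing content from several pages has to do today: pick the `modules` query parameter out of each stylesheet href and merge the entries by hand. The function names and the two shortened example URLs are hypothetical, and the sketch treats each pipe-separated entry (including comma-packed ones like `mediawiki.legacy.commonPrint,shared`) as an opaque string rather than expanding it.

```python
from urllib.parse import parse_qs, urlsplit


def style_modules(stylesheet_href):
    """Return the pipe-separated entries of the `modules` query parameter."""
    query = parse_qs(urlsplit(stylesheet_href).query)
    entries = []
    for value in query.get("modules", []):
        entries.extend(part for part in value.split("|") if part)
    return entries


def merge_style_modules(hrefs):
    """Union the module entries of several hrefs, keeping first-seen order."""
    seen = {}  # dict used as an ordered set
    for href in hrefs:
        for entry in style_modules(href):
            seen.setdefault(entry, None)
    return list(seen)


# Hypothetical, shortened stylesheet hrefs from two different pages.
page_a = "//en.wikipedia.org/w/load.php?modules=site|ext.cite.style&only=styles&skin=vector"
page_b = "//en.wikipedia.org/w/load.php?modules=site|mediawiki.skinning.content.parsoid&only=styles&skin=vector"

merged = merge_style_modules([page_a, page_b])
print("|".join(merged))
```

If the module list were exposed as structured data instead, this URL parsing step would not be needed.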
Other bits are in the head primarily to make the page overall a valid RDFa object. While attractive in theory, it seems unclear if the semantics exposed here are actually relevant to any "semantic web" project, and if anybody actually processes this information using generic RDFa tools, rather than custom mappings to internal representations.
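For comparison, a "custom mapping" of the kind mentioned above might look like the following Python sketch, which uses only the standard library's `html.parser` and no RDFa tooling. The class name, the function name and the three result dictionaries are assumptions made for illustration; this is not how any particular client actually does it. Note that the sketch has to special-case the `about` attribute, which a generic RDFa processor would handle as part of the data model.

```python
from html.parser import HTMLParser


class HeadMetadataParser(HTMLParser):
    """Collect <meta property=...> and <link rel=...> values from <head>."""

    def __init__(self):
        super().__init__()
        self.in_head = False
        self.meta = {}    # property -> content (statements about the page)
        self.scoped = {}  # (about, property) -> content (about something else)
        self.links = {}   # rel -> list of resource/href values

    def handle_starttag(self, tag, attrs):
        attr = dict(attrs)
        if tag == "head":
            self.in_head = True
            return
        if not self.in_head:
            return
        if tag == "meta" and "property" in attr:
            if "about" in attr:
                key = (attr["about"], attr["property"])
                self.scoped[key] = attr.get("content")
            else:
                self.meta[attr["property"]] = attr.get("content")
        elif tag == "link" and "rel" in attr:
            target = attr.get("resource") or attr.get("href")
            self.links.setdefault(attr["rel"], []).append(target)

    def handle_endtag(self, tag):
        if tag == "head":
            self.in_head = False


def extract_head_metadata(html):
    parser = HeadMetadataParser()
    parser.feed(html)
    return parser.meta, parser.scoped, parser.links


# A shortened copy of the head shown at the top of this post.
sample = (
    '<head prefix="mwr: http://en.wikipedia.org/wiki/Special:Redirect/">'
    '<link rel="dc:replaces" resource="mwr:revision/696077222">'
    '<meta property="dc:modified" content="2015-12-23T12:52:52.000Z">'
    '<meta about="mwr:user/9455233" property="dc:title" content="M2545">'
    '<link rel="dc:isVersionOf" href="//en.wikipedia.org/wiki/San_Francisco">'
    '</head>'
)

meta, scoped, links = extract_head_metadata(sample)
print(meta["dc:modified"])
print(scoped[("mwr:user/9455233", "dc:title")])
print(links["dc:isVersionOf"])
```

If most consumers end up with something like this, the RDFa framing may not be buying much over a plain JSON structure.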
It is entirely possible that, apart from <title>, <base href> and styles, most of this information has no actual users.
Discuss!
Let us know
- which parts of the Parsoid <head> information you use,
- if the way this information is exposed fits your use case,
- if you use generic RDFa tools to extract information from Parsoid HTML, and
- which bits you could do without.