metadata
Open, Medium, Public, 0 Estimated Story Points

Description

Parsoid HTML contains a good amount of information in its head section:

<head prefix="mwr: http://en.wikipedia.org/wiki/Special:Redirect/">
  <meta property="mw:TimeUuid" content="2153e39e-a974-11e5-b4f1-0512e7f3ec96">
  <meta property="mw:articleNamespace" content="0">
  <link rel="dc:replaces" resource="mwr:revision/696077222">
  <meta property="dc:modified" content="2015-12-23T12:52:52.000Z">
  <meta about="mwr:user/9455233" property="dc:title" content="M2545">
  <link rel="dc:contributor" resource="mwr:user/9455233">
  <meta property="mw:revisionSHA1" content="eec7863a2b6aa4e6913cead6b286110f1d8457f0">
  <meta property="dc:description" content="/* History */">
  <meta property="mw:parsoidVersion" content="0">
  <link rel="dc:isVersionOf" href="//en.wikipedia.org/wiki/San_Francisco">
  <title>San_Francisco</title>
  <base href="//en.wikipedia.org/wiki/">
  <link rel="stylesheet" href="//en.wikipedia.org/w/load.php?modules=mediawiki.legacy.commonPrint,shared|mediawiki.skinning.elements|mediawiki.skinning.content|mediawiki.skinning.interface|skins.vector.styles|site|mediawiki.skinning.content.parsoid|ext.cite.style|mediawiki.raggett&amp;only=styles&amp;skin=vector">    
  <style type="text/css">:root .ext-quick-survey-panel
{display:none !important;}</style>
</head>

We designed this head section in an early phase of the Parsoid project, before we had actual users apart from VisualEditor. Now that we have actual users, I think it's worth revisiting which of this metadata turns out to be useful in its current form.

Doubts and issues

Basically all revision-related information is already available separately as JSON revision metadata. Those API end points implement features like user name suppression, which is necessary when legally sensitive information is embedded in user names. This feature is not implemented for Parsoid HTML, and it seems unlikely that the complexity would be worth it.

Page-specific information like per-page styles isn't provided in a format that is very useful for composing content from several elements along the lines of T105845. It would be desirable to provide this information in a more structured form, so that it can be aggregated and processed while composing content.

Other bits are in the head primarily to make the page overall a valid RDFa object. While attractive in theory, it seems unclear if the semantics exposed here are actually relevant to any "semantic web" project, and if anybody actually processes this information using generic RDFa tools, rather than custom mappings to internal representations.

It is entirely possible that apart from <title>, <base href> and styles, most of this information might not have any actual users.

Discuss!

Let us know

which parts of the Parsoid <head> information you use,
if the way this information is exposed fits your use case,
if you use generic RDFa tools to extract information from Parsoid HTML, and
which bits you could do without.
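
As a point of reference for these questions, here is a minimal sketch of the "custom mapping" style of extraction, as opposed to generic RDFa tooling. The helper name `extractHeadMeta` is hypothetical, and a regular expression is used only to keep the example dependency-free; a real client would more likely use an HTML parser. The sample input is a shortened copy of the head shown above.

```javascript
// Hypothetical helper: collect <meta property="..." content="..."> pairs
// from a Parsoid <head> string into a plain object.
// Note: the `about` attribute is ignored here, so this is not RDFa-aware.
function extractHeadMeta(headHtml) {
	const meta = {};
	const tagRe = /<meta\b[^>]*>/gi;
	for (const tag of headHtml.match(tagRe) || []) {
		const property = /\bproperty="([^"]*)"/.exec(tag);
		const content = /\bcontent="([^"]*)"/.exec(tag);
		if (property && content) {
			meta[property[1]] = content[1];
		}
	}
	return meta;
}

// Shortened sample taken from the head section in the description.
const sampleHead = [
	'<head prefix="mwr: http://en.wikipedia.org/wiki/Special:Redirect/">',
	'<meta property="mw:TimeUuid" content="2153e39e-a974-11e5-b4f1-0512e7f3ec96">',
	'<meta property="dc:modified" content="2015-12-23T12:52:52.000Z">',
	'<title>San_Francisco</title>',
	'</head>'
].join('\n');

console.log(extractHeadMeta(sampleHead));
```

A client written this way depends only on the property names it reads, which is the kind of usage this task is trying to inventory.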

Related Objects

		Status	Subtype	Assigned	Task
		Open		None	T122390 Is RDFa metadata in Parsoid HTML head actually useful to you / no user name & edit comment suppression in Parsoid <head> metadata
		Resolved		• ssastry	T125266 Remove user name and edit comment from html <head>

Event Timeline

• GWicke created this task. Dec 24 2015, 3:32 AM

• GWicke raised the priority of this task from to Medium.

• GWicke updated the task description.

• GWicke added projects: Parsoid, ContentTranslation-CXserver, VisualEditor, RESTBase, Web-Team-Backlog, Mobile-Content-Service.

• GWicke added subscribers: • GWicke, Bianjiang, Bianjiang1 and 6 others.

Restricted Application added a subscriber: Aklapper. Dec 24 2015, 3:32 AM

Could someone explain what is meant by "Basically all revision-related information is already available separately as JSON revision metadata."? E.g., by pointing me to the (RESTBase) API to get that data?

FWIW, I've found the head metadata very useful when writing client-side reader APIs. It lets me do a single RESTBase query to get all the relevant metadata for the page. The alternative would be to parse the RESTBase redirect to get the revision id and then to do a separate query to the mediawiki action API to pull the rest of the revision metadata. (As I understand it, RESTBase itself does some rewriting of the HEAD to add its own revision UUID to the <head> metadata.)

One alternative would be to create a separate RESTBase endpoint for the revision metadata for the "latest" version of a page, which would basically return the <head> (except possibly as a JSON object instead of RDFa/XML), and then query RESTBase specifically for the UUID returned by that first query. But that adds a round-trip latency and slows down page load.

jmadler subscribed. Jan 6 2016, 5:11 AM

@Bianjiang, revision metadata is available

a) from the action api, via action=query, and
b) from an experimental RESTBase end point at https://en.wikipedia.org/api/rest_v1/?doc#!/Page_content/get_page_revision_revision.

There is also an entry point to retrieve the latest revision information for a given title at https://en.wikipedia.org/api/rest_v1/?doc#!/Page_content/get_page_title_title.

One alternative would be to create a separate RESTBase endpoint for the revision metadata for the "latest" version of a page, which would basically return the <head> (except possibly as a JSON object instead of RDFa/XML), and then query RESTBase specifically for the UUID returned by that first query. But that adds a round-trip latency and slows down page load.

@cscott, I believe https://en.wikipedia.org/api/rest_v1/?doc#!/Page_content/get_page_title_title already covers your use case. The request for the metadata can be sent off in parallel with the HTML request.
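
A minimal sketch of sending both requests in parallel might look like the following. The function name `loadPageWithMetadata` is hypothetical, and the `/page/html/{title}` and `/page/title/{title}` paths are assumptions inferred from the endpoint names linked above. `fetchFn` is injected so that the sketch does not depend on network access; it is assumed to behave like the standard `fetch`.

```javascript
// Hypothetical helper: request the HTML and the revision metadata for a
// title concurrently, so neither request waits on the other.
// Assumption: `fetchFn` has the same interface as the standard fetch().
async function loadPageWithMetadata(title, fetchFn, base = 'https://en.wikipedia.org/api/rest_v1') {
	const encoded = encodeURIComponent(title);
	const [htmlRes, metaRes] = await Promise.all([
		fetchFn(`${base}/page/html/${encoded}`),
		fetchFn(`${base}/page/title/${encoded}`)
	]);
	return {
		html: await htmlRes.text(),
		// The ETag identifies which revision/render the HTML corresponds to.
		etag: htmlRes.headers.get('etag'),
		metadata: await metaRes.json()
	};
}

// Possible usage, where a global fetch is available:
// loadPageWithMetadata('San_Francisco', fetch).then((page) => console.log(page.etag));
```

Because the two responses are produced independently, a client would still need to check that both refer to the same revision.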

• GWicke set Security to None. Jan 12 2016, 9:09 PM

• GWicke mentioned this in T120409: RESTBase should honor wiki-wide deletion/suppression of users.

• GWicke edited subscribers, added: • csteipp; removed: • Bianjiangwiki, Bianjiang1.

• GWicke renamed this task from "Is RDFa metadata in Parsoid HTML head actually useful to you?" to "Is RDFa metadata in Parsoid HTML head actually useful to you / no user name & edit comment suppression in Parsoid <head> metadata". Jan 12 2016, 9:22 PM

Krenair subscribed. Jan 12 2016, 9:43 PM

Independent of the format of the metadata (RDFa or custom format), and where the metadata is stored (parsoid <head> or RESTBase api or core-MW api), an orthogonal concern has arisen around some bits of information being exposed there.

I hear that there is a legal concern around suppressing certain bits of information on request. Specifically, creatively crafted user names are one way in which sensitive information is sometimes released (imagine a user-id such as "User:123-456-789-is-the-phone-number-of-subbu"). This then leads to requests to suppress that information from metadata. I don't know all the details, but when such requests come in, those bits need to be removed from the metadata that is exposed.

So, instead of having RESTBase write suppression routines (mimicking functionality that is part of core-MW), it makes sense not to provide this as part of Parsoid in the first place. Specifically, these 2 bits of information:

<meta about="mwr:user/9455233" property="dc:title" content="M2545">
<link rel="dc:contributor" resource="mwr:user/9455233">

So, at first blush, it seems that the simplest approach would be to not add this to Parsoid metadata. However, this immediately raises a question: should we just let the core MW api continue to serve the metadata (which also handles this legal scenario of suppressing sensitive information)?

So, to summarize, here are the options:

(1) Suppress the sensitive pieces of metadata from the Parsoid <head> section. In that case, if any applications depend on having such information, they would need to get it from the core MW api.

(2) Remove all revision-specific metadata from Parsoid and have them be served from the existing core-MW api.

Is there a reason to do (1) and not (2)?

Jdforrester-WMF moved this task from To Triage to TR0: Interrupt on the VisualEditor board. Jan 19 2016, 8:08 PM

@Catrope will be best placed to answer this, but I can have a look at the code.

Another reason to potentially remove metadata from the html head:

We have recently started to look into more efficient storage for old revisions (see T122028). A promising strategy is to only store changed chunks or sections. However, the metadata in the head of the document currently always causes the first section to differ, rendering that optimization less effective.

Looking at the snippet and our MW code nothing immediately stands out...

• GWicke mentioned this in T125266: Remove user name and edit comment from html <head>. Jan 29 2016, 10:58 PM

In T122390#1956753, @Esanders wrote:

@Catrope will be best placed to answer this, but I can have a look at the code.

Of the RDFa things, only mw:TimeUuid comes to mind. This was relied on at one point as a substitute / back-compat thing for the E-Tag header. VE now has E-Tag support, so removing the mw:TimeUuid fallback should not break anything, but it's good to be careful as the consequences of dropping this information are nasty but sometimes subtle. Note that there is no explicit knowledge about this tag in the VE code, because it's not needed: VE simply sends back the <head> unmodified, so if Parsoid sends mw:TimeUuid and expects to get it back, that'll work. I would suggest checking whether Parsoid receives any serialization requests that have no E-Tag but do have mw:TimeUuid before removing it.

I don't believe any of the other RDFa attributes are used, or have ever been used by VE. Many of them either duplicate information, or contain information that api.php and mw.config already provide. Things VE cares about are:

Possibly mw:TimeUuid, see above. Not used directly, but removing it may break VE if its E-Tag support turns out to be broken, or may break other clients that don't know about E-Tag.
<base>: this is essential, and it's the original reason why VE cares about the <head>. But you're probably not thinking about removing or changing this, given how many things would be affected (everything related to links, for example).
The about attribute on the <html> tag (not part of the <head>, but worth mentioning). This contains the revid, which is used to determine whether the HTML VE received actually corresponds to the revid it was expecting to get (*)
<title> is not currently used by VE, but could conceivably be used in the future if we wanted to have really really good displaytitle support or something

(*) Explanation of why this is needed: when the user is viewing the latest version (i.e. it's not an ?oldid= page view) and clicks edit, we send a request to RESTBase without a revid (i.e. asking for the latest revision), and we send a request for metadata to api.php (also asking for the latest revision). The responses to these two requests could be for two different revisions (usually this would happen because a new edit just went through, so api.php returns metadata about the new revision while RESTBase serves the old one; but a race condition can also cause the reverse to happen). For VE to be able to detect this situation, both requests have to return the revid. The requests are then retried with an explicit oldid parameter, asking for the lower of the two revids.
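
The mismatch check described here can be sketched roughly as follows. This is an illustration of the logic only, not VE's actual code; the function name `resolveRevisionMismatch` and the shape of its return value are hypothetical.

```javascript
// Hypothetical helper: compare the revid reported by the HTML response with
// the one reported by the api.php metadata response. If they differ, the
// caller should retry both requests with an explicit oldid set to the
// lower of the two revids.
function resolveRevisionMismatch(htmlRevId, apiRevId) {
	if (!Number.isInteger(htmlRevId) || !Number.isInteger(apiRevId)) {
		throw new TypeError('both revision ids must be integers');
	}
	if (htmlRevId === apiRevId) {
		return { consistent: true, retryOldid: null };
	}
	return { consistent: false, retryOldid: Math.min(htmlRevId, apiRevId) };
}

// Illustrative values: the HTML lags one edit behind the metadata.
console.log(resolveRevisionMismatch(696077222, 696077300));
```

Retrying with the lower revid works in either direction of the race, because both services are expected to be able to serve the older revision.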

In T122390#2012355, @Catrope wrote:

The about attribute on the <html> tag (not part of the <head>, but worth mentioning). This contains the revid, which is used to determine whether the HTML VE received actually corresponds to the revid it was expecting to get (*)

The served rev_id is returned in the ETag header by RESTBase (the ETag format is <rev_id>/<time_uuid>). Given that VE already uses the ETag for the TimeUUID, it could use it for revision-collision detection as well.

In T122390#2012969, @mobrovac wrote:

In T122390#2012355, @Catrope wrote:

The about attribute on the <html> tag (not part of the <head>, but worth mentioning). This contains the revid, which is used to determine whether the HTML VE received actually corresponds to the revid it was expecting to get (*)

The served rev_id is returned in the ETag header by RESTBase (the ETag format is <rev_id>/<time_uuid>). Given that VE already uses the ETag for the TimeUUID, it could use it for revision-collision detection as well.

Ooh, nice, I didn't realize that. That's also less ugly to parse than about="http://en.wikipedia.org/wiki/Special:Redirect/revision/540443854". Currently that's being parsed with:

		aboutDoc = this.doc.documentElement.getAttribute( 'about' );
		if ( aboutDoc ) {
			docRevIdMatches = aboutDoc.match( /revision\/([0-9]*)$/ );
			if ( docRevIdMatches.length >= 2 ) {
				docRevId = parseInt( docRevIdMatches[ 1 ] );
			}
		}

So VE should probably switch to that.
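
For comparison, reading the revid from the ETag could look roughly like this. It is a sketch that assumes the <rev_id>/<time_uuid> format described above; the function name `parseRevisionETag` is hypothetical, and the handling of surrounding quotes and a weak-validator `W/` prefix is an assumption about how the header may arrive.

```javascript
// Hypothetical helper: parse an ETag of the form "<rev_id>/<time_uuid>".
// Returns null instead of throwing when the header is missing or malformed.
function parseRevisionETag(etag) {
	if (typeof etag !== 'string') {
		return null;
	}
	// Strip an optional weak-validator prefix and the surrounding quotes.
	const value = etag.replace(/^W\//, '').replace(/^"|"$/g, '');
	const match = /^(\d+)\/([0-9a-fA-F-]{36})$/.exec(value);
	if (!match) {
		return null;
	}
	return { revId: parseInt(match[1], 10), timeUuid: match[2] };
}

// Illustrative value combining the revid and TimeUuid quoted in this task.
console.log(parseRevisionETag('"540443854/2153e39e-a974-11e5-b4f1-0512e7f3ec96"'));
```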

Amire80 added a project: ContentTranslation. Feb 18 2016, 1:13 PM

• ssastry closed subtask T125266: Remove user name and edit comment from html <head> as Resolved. Feb 22 2016, 10:11 PM

• AlexMonk-WMF subscribed. Feb 23 2016, 7:46 PM

• AlexMonk-WMF removed a subscriber: Krenair. Feb 23 2016, 7:49 PM

Amire80 moved this task from Needs Triage to Upstream/Other teams on the ContentTranslation board. Feb 24 2016, 11:57 AM

In T122390#2013477, @Catrope wrote:

So VE should probably switch to that.

Filed as T127941

...except that the ETag isn't available if you don't have RESTBase.

In T122390#2059265, @Esanders wrote:

...except that the ETag isn't available if you don't have RESTBase.

Right. As in https://gerrit.wikimedia.org/r/#/c/271929/ where I updated the content-types for HTML and data-parsoid in <head>, I think we should leave this about information in place.

So, it looks like we can remove everything except the about attribute on the <html> tag, <title>, <base>, and the <meta> tags with the html and data-parsoid versions.

Right now, RESTBase is inserting the mw:TimeUuid tag, but Parsoid should be able to do that instead. It just fell off our radar.

Right now, RESTBase is inserting the mw:TimeUuid tag, but Parsoid should be able to do that instead. It just fell off our radar.

Parsoid can't insert this, as it doesn't know the timeuuid that will be used for storage. mw:TimeUuid was a temporary work-around, so that we can retrieve the original data-parsoid, html etc during html2wt processing. We should check if there are still clients not sending the ETag back, and remove mw:TimeUuid once there aren't any left.

Parsoid can and probably should emit an ETag of its own. The syntax we have been using is "{revision}/{timeuuid}".

• bearND moved this task from Incoming to Tracking on the Mobile-Content-Service board. Mar 10 2016, 5:33 PM

Amire80 moved this task from Backlog to Compatibility on the ContentTranslation-CXserver board. Apr 25 2016, 9:29 AM

Jdforrester-WMF set the point value for this task to 0. May 16 2016, 8:30 AM

dr0ptp4kt moved this task from Incoming to 2014-15 Q4 on the Web-Team-Backlog board. Aug 5 2016, 4:51 PM

Jdforrester-WMF moved this task from TR0: Interrupt to External and Administrivia on the VisualEditor board. Aug 9 2016, 7:36 PM

• GWicke added a project: Services. Oct 12 2016, 11:19 PM

@ssastry, any news from your side on this?

• GWicke edited projects, added Services (watching); removed Services. Oct 12 2016, 11:48 PM

• NHarateh_WMF added a project: Product-Infrastructure-Team-Backlog-Deprecated. Apr 25 2017, 12:24 PM

• NHarateh_WMF moved this task from Needs triage to Tracking on the Product-Infrastructure-Team-Backlog-Deprecated board. Apr 25 2017, 12:26 PM

• NHarateh_WMF moved this task from Tracking to Backlog on the Mobile-Content-Service board. Apr 25 2017, 4:31 PM