Is RDFa metadata in Parsoid HTML head actually useful to you / no user name & edit comment suppression in Parsoid <head> metadata
Open, Medium, Public, 0 Estimated Story Points

Description

Parsoid HTML contains a good amount of information in its head section:

<head prefix="mwr: http://en.wikipedia.org/wiki/Special:Redirect/">
  <meta property="mw:TimeUuid" content="2153e39e-a974-11e5-b4f1-0512e7f3ec96">
  <meta property="mw:articleNamespace" content="0">
  <link rel="dc:replaces" resource="mwr:revision/696077222">
  <meta property="dc:modified" content="2015-12-23T12:52:52.000Z">
  <meta about="mwr:user/9455233" property="dc:title" content="M2545">
  <link rel="dc:contributor" resource="mwr:user/9455233">
  <meta property="mw:revisionSHA1" content="eec7863a2b6aa4e6913cead6b286110f1d8457f0">
  <meta property="dc:description" content="/* History */">
  <meta property="mw:parsoidVersion" content="0">
  <link rel="dc:isVersionOf" href="//en.wikipedia.org/wiki/San_Francisco">
  <title>San_Francisco</title>
  <base href="//en.wikipedia.org/wiki/">
  <link rel="stylesheet" href="//en.wikipedia.org/w/load.php?modules=mediawiki.legacy.commonPrint,shared|mediawiki.skinning.elements|mediawiki.skinning.content|mediawiki.skinning.interface|skins.vector.styles|site|mediawiki.skinning.content.parsoid|ext.cite.style|mediawiki.raggett&amp;only=styles&amp;skin=vector">    
  <style type="text/css">:root .ext-quick-survey-panel
{display:none !important;}</style>
</head>

We designed this head section in an early phase of the Parsoid project, before we had actual users apart from VisualEditor. Now that we have actual users, I think it's worth revisiting which of this metadata turns out to be useful in its current form.

Doubts and issues

Basically all revision-related information is already available separately as JSON revision metadata. Those API end points implement features like user name suppression, which is necessary when legally sensitive information is embedded in user names. This feature is not implemented for Parsoid HTML, and it seems unlikely that the complexity would be worth it.

Page-specific information like per-page styles isn't provided in a format that is very useful for composing content from several elements along the lines of T105845. It would be desirable to provide this information in a more structured form, so that it can be aggregated and processed while composing content.

Other bits are in the head primarily to make the page overall a valid RDFa object. While attractive in theory, it seems unclear if the semantics exposed here are actually relevant to any "semantic web" project, and if anybody actually processes this information using generic RDFa tools, rather than custom mappings to internal representations.

It is entirely possible that apart from <title>, <base href> and styles, most of this information might not have any actual users.
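
As a rough illustration of the "custom mapping" approach mentioned above, a client might flatten the `<meta>` and `<link>` elements into a plain object instead of running a generic RDFa processor. This is only a sketch: `parseAttributes` and `headToObject` are hypothetical names, and the regex scan assumes the flat, machine-generated `<head>` shown in the description rather than arbitrary HTML.

```javascript
// Sketch only: a regex scan that is adequate for the flat <head> shown in
// this task, not a general HTML or RDFa parser.
function parseAttributes(tagSource) {
  const attrs = {};
  const attrPattern = /([\w:-]+)="([^"]*)"/g;
  let match;
  while ((match = attrPattern.exec(tagSource)) !== null) {
    attrs[match[1]] = match[2];
  }
  return attrs;
}

// Hypothetical helper: maps property -> content for <meta> elements and
// rel -> resource for <link> elements.
function headToObject(headHtml) {
  const result = {};
  const tagPattern = /<(meta|link)\b[^>]*>/g;
  let tag;
  while ((tag = tagPattern.exec(headHtml)) !== null) {
    const attrs = parseAttributes(tag[0]);
    if (tag[1] === 'meta' && attrs.property && !attrs.about) {
      // Skip statements about other subjects (e.g. about="mwr:user/...").
      result[attrs.property] = attrs.content;
    } else if (tag[1] === 'link' && attrs.rel && attrs.resource) {
      // Links that only carry href (stylesheet, dc:isVersionOf) are ignored.
      result[attrs.rel] = attrs.resource;
    }
  }
  return result;
}
```

A mapping like this ignores the RDFa `prefix` declaration entirely, which is the kind of usage the question above is trying to gauge.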

Discuss!

Let us know

  • which parts of the Parsoid <head> information you use,
  • if the way this information is exposed fits your use case,
  • if you use generic RDFa tools to extract information from Parsoid HTML, and
  • which bits you could do without.

Event Timeline

Could someone explain what "Basically all revision-related information is already available separately as JSON revision metadata." means, e.g. by pointing me to the (RESTBase) API to get that data?

FWIW, I've found the head metadata very useful when writing client-side reader APIs. It lets me do a single RESTBase query to get all the relevant metadata for the page. The alternative would be to parse the RESTBase redirect to get the revision id and then to do a separate query to the mediawiki action API to pull the rest of the revision metadata. (As I understand it, RESTBase itself does some rewriting of the <head> to add its own revision UUID to the metadata.)

One alternative would be to create a separate RESTBase endpoint for the revision metadata for the "latest" version of a page, which would basically return the <head> (except possibly as a JSON object instead of RDFa/XML), and then query RESTBase specifically for the UUID returned by that first query. But that adds a round-trip latency and slows down page load.

@Bianjiang, revision metadata is available

a) from the action api, via action=query, and
b) from an experimental RESTBase end point at https://en.wikipedia.org/api/rest_v1/?doc#!/Page_content/get_page_revision_revision.

There is also an entry point to retrieve the latest revision information for a given title at https://en.wikipedia.org/api/rest_v1/?doc#!/Page_content/get_page_title_title.

@cscott, I believe https://en.wikipedia.org/api/rest_v1/?doc#!/Page_content/get_page_title_title already covers your use case. The request for the metadata can be sent off in parallel with the HTML request.
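
A minimal sketch of sending both requests in parallel, so that no extra round trip is added. The URL layout (`/page/html/{title}` and `/page/title/{title}`) is an assumption inferred from the documentation anchors linked above, and `buildUrls` and `fetchPageWithMetadata` are hypothetical names. The fetch implementation is injectable so that the sketch can be exercised without network access.

```javascript
// Assumed URL layout, inferred from the documentation anchors cited above.
const REST_BASE = 'https://en.wikipedia.org/api/rest_v1';

function buildUrls(title) {
  const encoded = encodeURIComponent(title);
  return {
    html: `${REST_BASE}/page/html/${encoded}`,
    metadata: `${REST_BASE}/page/title/${encoded}`,
  };
}

// fetchImpl defaults to the global fetch (browsers, recent Node.js).
async function fetchPageWithMetadata(title, fetchImpl = globalThis.fetch) {
  const urls = buildUrls(title);
  // Both requests are started before either one is awaited.
  const [htmlResponse, metadataResponse] = await Promise.all([
    fetchImpl(urls.html),
    fetchImpl(urls.metadata),
  ]);
  return {
    html: await htmlResponse.text(),
    metadata: await metadataResponse.json(),
  };
}
```

Note that the two responses can still describe different revisions if an edit lands between them, so a client would want to compare the revision ids it gets back.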

GWicke renamed this task from Is RDFa metadata in Parsoid HTML head actually useful to you? to Is RDFa metadata in Parsoid HTML head actually useful to you / no user name & edit comment suppression in Parsoid <head> metadata. Jan 12 2016, 9:22 PM

Independent of the format of the metadata (RDFa or custom format), and where the metadata is stored (parsoid <head> or RESTBase api or core-MW api), an orthogonal concern has arisen around some bits of information being exposed there.

I hear that there is a legal concern around suppressing certain bits of information on request. Specifically, creatively crafted user names are one way in which sensitive information is sometimes released (imagine a user name such as "User:123-456-789-is-the-phone-number-of-subbu"). This then leads to requests to suppress that information from metadata. I don't know all the details, but when such requests come in, those bits need to be removed from the metadata that is exposed.

So, instead of having RESTBase write suppression routines (mimicking functionality that is part of core-MW), it makes sense not to provide this as part of Parsoid in the first place. Specifically, these 2 bits of information:

<meta about="mwr:user/9455233" property="dc:title" content="M2545">
<link rel="dc:contributor" resource="mwr:user/9455233">

So, at first blush, it seems that the simplest approach would be to not add this to Parsoid metadata. However, this immediately raises the question: should we just let the core MW api continue to serve the metadata (which also handles this legal scenario of suppressing sensitive information)?

So, to summarize, here are the options:

  1. Suppress the sensitive pieces of metadata from the Parsoid <head> section. In that case, if any applications depend on having such information, they would need to get it from the core MW api.
  2. Remove all revision-specific metadata from Parsoid and have it served from the existing core-MW api.

Is there a reason to do (1) and not (2)?
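
For reference, option (1) could look roughly like the following if it were applied as a post-processing step on the `<head>` markup. This is a sketch, not how Parsoid would necessarily implement it: `stripUserMetadata` is a hypothetical name, and the regexes assume the flat, one-element-per-line `<head>` shown in the description.

```javascript
// Hypothetical helper: drops the two user-identifying elements named above.
// Regex-based, so it is only suitable for the machine-generated <head> in
// this task, not for arbitrary HTML.
function stripUserMetadata(headHtml) {
  return headHtml
    // <meta about="mwr:user/..." property="dc:title" content="...">
    .replace(/[ \t]*<meta\b[^>]*\babout="mwr:user\/[^"]*"[^>]*>\r?\n?/g, '')
    // <link rel="dc:contributor" resource="mwr:user/...">
    .replace(/[ \t]*<link\b[^>]*\brel="dc:contributor"[^>]*>\r?\n?/g, '');
}
```

Option (2) would need no such filter, since none of the revision-specific elements would be emitted in the first place.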

@Catrope will be best placed to answer this, but I can have a look at the code.

Another reason to potentially remove metadata from the html head:

We have recently started to look into more efficient storage for old revisions (see T122028). A promising strategy is to only store changed chunks or sections. However, the metadata in the head of the document currently always causes the first section to differ, rendering that optimization less effective.
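
A toy illustration of why per-render metadata defeats section-level deduplication. The storage model here (a revision as a map from section name to HTML string) and the name `changedSections` are assumptions made for the example; they are not taken from T122028.

```javascript
// Hypothetical storage model: a revision is a map from section name to HTML.
function changedSections(previous, next) {
  return Object.keys(next).filter((name) => previous[name] !== next[name]);
}

// Two renders of a page where only the "history" section was edited.
// The first section carries the <head>, whose mw:TimeUuid differs per render.
const olderRender = {
  lead: '<meta property="mw:TimeUuid" content="uuid-1"><p>Intro</p>',
  history: '<p>Old text</p>',
  geography: '<p>Hills</p>',
};
const newerRender = {
  lead: '<meta property="mw:TimeUuid" content="uuid-2"><p>Intro</p>',
  history: '<p>New text</p>',
  geography: '<p>Hills</p>',
};

console.log(changedSections(olderRender, newerRender)); // [ 'lead', 'history' ]
```

Without the metadata, only `history` would need to be stored again; with it, the first section is rewritten on every render even though its visible content is unchanged.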

Looking at the snippet and our MW code nothing immediately stands out...

Of the RDFa things, only mw:TimeUuid comes to mind. This was relied on at one point as a substitute / back-compat thing for the E-Tag header. VE now has E-Tag support, so removing the mw:TimeUuid fallback should not break anything, but it's good to be careful as the consequences of dropping this information are nasty but sometimes subtle. Note that there is no explicit knowledge about this tag in the VE code, because it's not needed: VE simply sends back the <head> unmodified, so if Parsoid sends mw:TimeUuid and expects to get it back, that'll work. I would suggest checking whether Parsoid receives any serialization requests that have no E-Tag but do have mw:TimeUuid before removing it.
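
The check suggested above could be sketched as a predicate over incoming serialization requests. Two assumptions are made here that the discussion does not confirm: that a client sends the ETag back in an `If-Match` header, and that the request is available as an object with `headers` and `html` fields. `reliesOnTimeUuidFallback` is a hypothetical name.

```javascript
// Assumption: the ETag is sent back in an If-Match header; the discussion
// only says "E-Tag" and does not name the request header.
// Assumption: request has the shape { headers: {...}, html: '...' }.
function reliesOnTimeUuidFallback(request) {
  const headers = request.headers || {};
  const hasEtag = Object.keys(headers).some(
    (name) => name.toLowerCase() === 'if-match' && Boolean(headers[name])
  );
  const hasTimeUuid = /<meta\b[^>]*\bproperty="mw:TimeUuid"/.test(request.html || '');
  // True only for requests that would break if mw:TimeUuid were removed.
  return !hasEtag && hasTimeUuid;
}
```

Counting the requests for which this returns true over some period would indicate whether any clients still depend on the fallback.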

I don't believe any of the other RDFa attributes are used, or have ever been used by VE. Many of them either duplicate information, or contain information that api.php and mw.config already provide. Things VE cares about are:

  • Possibly mw:TimeUuid, see above. Not used directly, but removing it may break VE if its E-Tag support turns out to be broken, or may break other clients that don't know about E-Tag.
  • <base>: this is essential, and it's the original reason why VE cares about the <head>. But you're probably not thinking about removing or changing this, given how many things would be affected (everything related to links, for example).
  • The about attribute on the <html> tag (not part of the <head>, but worth mentioning). This contains the revid, which is used to determine whether the HTML VE received actually corresponds to the revid it was expecting to get (*)
  • <title> is not currently used by VE, but could conceivably be used in the future if we wanted to have really really good displaytitle support or something

(*) Explanation of why this is needed: when the user is viewing the latest version (i.e. it's not an ?oldid= page view) and clicks edit, we send a request to RESTBase without a revid (i.e. asking for the latest revision), and we send a request for metadata to api.php (also asking for the latest revision). The responses to these two requests could be for two different revisions (usually this would happen because a new edit just went through, so api.php returns metadata about the new revision while RESTBase serves the old one; but a race condition can also cause the reverse to happen). For VE to be able to detect this situation, both requests have to return the revid. The requests are then retried with an explicit oldid parameter, asking for the lower of the two revids.
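
The decision described in the footnote boils down to a small comparison. The sketch below is not VE's actual code; `revisionToRetry` is a hypothetical name, and the revision ids are assumed to have been parsed into numbers already.

```javascript
// Returns null when both responses describe the same revision, otherwise
// the revision id (the lower of the two) that both requests should be
// retried with as an explicit oldid.
function revisionToRetry(htmlRevId, apiRevId) {
  if (htmlRevId === apiRevId) {
    return null;
  }
  return Math.min(htmlRevId, apiRevId);
}
```

Picking the lower id means both retried requests ask for a revision that both backends have already shown they know about.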

The served rev_id is returned in the ETag header by RESTBase (the ETag format is <rev_id>/<time_uuid>). Given that VE already uses the ETag for the TimeUUID, it could use it for revision-collision detection as well.

Ooh, nice, I didn't realize that. That's also less ugly to parse than about="http://en.wikipedia.org/wiki/Special:Redirect/revision/540443854". Currently that's being parsed with:

		aboutDoc = this.doc.documentElement.getAttribute( 'about' );
		if ( aboutDoc ) {
			docRevIdMatches = aboutDoc.match( /revision\/([0-9]*)$/ );
			if ( docRevIdMatches.length >= 2 ) {
				docRevId = parseInt( docRevIdMatches[ 1 ] );
			}
		}

So VE should probably switch to that.

...except that the ETag isn't available if you don't have RESTBase.

Right. As in https://gerrit.wikimedia.org/r/#/c/271929/ where I updated the content-types for HTML and data-parsoid in <head>, I think we should leave this about information behind.
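
Putting the two points together, a client could prefer the ETag when it is present and fall back to the about attribute otherwise. The sketch below assumes the `<rev_id>/<time_uuid>` ETag format described above, optionally wrapped in quotes or carrying a weak-validator `W/` prefix; the function names are hypothetical.

```javascript
// Parses an ETag of the form <rev_id>/<time_uuid>. Surrounding quotes and a
// W/ prefix are tolerated as an assumption about how the header may arrive.
function parseEtag(etag) {
  const match = /^(?:W\/)?"?(\d+)\/([0-9a-fA-F-]+)"?$/.exec(etag || '');
  return match ? { revId: parseInt(match[1], 10), timeUuid: match[2] } : null;
}

// Extracts the revision id from an about value such as
// http://en.wikipedia.org/wiki/Special:Redirect/revision/540443854
function revIdFromAbout(about) {
  const match = /revision\/(\d+)$/.exec(about || '');
  return match ? parseInt(match[1], 10) : null;
}

// Prefers the ETag; falls back to the about attribute for setups without
// RESTBase. Returns null if neither yields a revision id.
function servedRevId(etag, about) {
  const parsed = parseEtag(etag);
  return parsed ? parsed.revId : revIdFromAbout(about);
}
```

Unlike a bare `match()` followed by a length check, both helpers return null rather than throwing when the input is missing or does not have the expected shape.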

So, it looks like we can remove everything except the about attribute on the <html> tag, <title>, <base>, and the <meta> tags with the html and data-parsoid versions.

Right now, RESTBase is inserting the mw:TimeUuid tag, but Parsoid should be able to do that instead. It just fell off our radar.

Parsoid can't insert this, as it doesn't know the timeuuid that will be used for storage. mw:TimeUuid was a temporary work-around, so that we can retrieve the original data-parsoid, html etc during html2wt processing. We should check if there are still clients not sending the ETag back, and remove mw:TimeUuid once there aren't any left.

Parsoid can and probably should emit an ETag of its own. The syntax we have been using is "{revision}/{timeuuid}".