
Collect meta info and serialize it in the head or elsewhere
Closed, ResolvedPublic

Description

We need to collect various meta info, such as the page title and revision id, very similar to the <page> structure in http://en.wikipedia.org/wiki/Special:Export/OLPC. To make our lives easier and to abstract over the serialization, we can use an env.page.meta object for this purpose. This can then be passed to a 'buildHead' method for the in-DOM-head serialization. Later on we might want to keep some of that info out of the DOM, which can then be done without changing all internal users of this meta info.
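
The separation between the meta object and its serialization could look roughly like the sketch below. It is written in Python purely for illustration and is not Parsoid's code: the dict layout (`title`, `properties`) and the `build_head` function are assumptions standing in for env.page.meta and 'buildHead'.

```python
from html import escape


def build_head(meta):
    """Serialize a page-metadata dict into a <head> string.

    `meta` is a hypothetical stand-in for env.page.meta:
    {"title": str, "properties": {property_name: value}}.
    """
    lines = ["<head>", '<meta charset="UTF-8">']
    for prop, value in meta["properties"].items():
        # Escape both the property name and the value for use in attributes.
        lines.append(
            '<meta property="%s" content="%s">'
            % (escape(prop, quote=True), escape(str(value), quote=True))
        )
    lines.append("<title>%s</title>" % escape(meta["title"]))
    lines.append("</head>")
    return "\n".join(lines)


page_meta = {
    "title": "Main Page",
    "properties": {"mw:articleNamespace": 0, "mw:revisionSHA1": "59b13d0d"},
}
print(build_head(page_meta))
```

Internal users would read only the dict, so moving a property out of the DOM later would mean changing the serializer alone.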


Version: unspecified
Severity: enhancement
URL: https://www.mediawiki.org/wiki/Parsoid/Page_metadata

Details

Reference
bz45206

Event Timeline

bzimport raised the priority of this task to Low. Nov 22 2014, 1:42 AM
bzimport added a project: Parsoid.
bzimport set Reference to bz45206.

Related URL: https://gerrit.wikimedia.org/r/57703 (Gerrit Change Ib7b866b899cfbcf47d8a06190a411d1a3d46dd50)

Related URL: https://gerrit.wikimedia.org/r/57820 (Gerrit Change Ie6de8d6ece925c752c1d3a7d9b6f4e3d4b67a09e)

https://gerrit.wikimedia.org/r/57703 (Gerrit Change Ib7b866b899cfbcf47d8a06190a411d1a3d46dd50) | change APPROVED and MERGED [by jenkins-bot]

https://gerrit.wikimedia.org/r/57820 (Gerrit Change Ie6de8d6ece925c752c1d3a7d9b6f4e3d4b67a09e) | change APPROVED and MERGED [by jenkins-bot]

This bug is almost done; we're just trying to finalize the exact format of the metadata in <head>. Currently we have:

<html data-parsoid="{}" prefix="mw: http://mediawiki.org/rdf/">
<head data-parsoid="{}" prefix="schema: http://schema.org/">
<meta charset="UTF-8">
<meta property="mw:articleNamespace" content="0">
<meta property="schema:CreativeWork/version" content="550389492">
<meta property="schema:CreativeWork/version/parent" content="550001092">
<meta property="schema:CreativeWork/dateModified" content="2013-04-15T00:02:02.000Z">
<meta property="schema:CreativeWork/contributor/username" content="en.wikipedia.org/wiki/User:Tariqabjotu">
<meta property="schema:CreativeWork/contributor" content="en.wikipedia.org/wiki/Special:UserById/153365">
<meta property="mw:revisionSHA1" content="59b13d0d38a8f8992d5f7231f439c8805656f71c">
<meta property="schema:CreativeWork/comment" content="replacing with permanent TAFI code">
<title>Main Page</title>
<base href="//en.wikipedia.org/wiki/">
</head>

The schema:CreativeWork/contributor and schema:CreativeWork/contributor/username attributes might need revision. In particular the Special:UserById/xxx URL is implemented in https://gerrit.wikimedia.org/r/59050 and an alternative Special:Redirect/user/by-id/xxx URL format is implemented in https://gerrit.wikimedia.org/r/59572 --- one or the other of those will get merged (hopefully!), and we might have to tweak the link to match it.

Also, schema:CreativeWork/contributor is a Person (see http://schema.org/Person), and technically we are specifying the 'url' property of that Person. It's not clear that the tags above express that correctly. We might also want to shift from <meta> to <link> tags; the href attribute of a <link> is guaranteed to be typed as a URL, which ensures that relative paths get resolved. Relatedly, most of our DOM uses relative paths; our metadata probably should as well.

cscott added a comment.May 1 2013, 3:22 PM

Special:Redirect was merged, and https://gerrit.wikimedia.org/r/61603 tweaks our <head> to use it.

The only remaining issues are the type-system issues (is the contributor a Person or a URL or what) and whether we should be using <link> rather than <meta> tags in some places.

Proposal for new metadata (hammered out with Daniel Friesen's help):

<html data-parsoid="{}" prefix="mw: http://mediawiki.org/rdf/">

<head data-parsoid="{}" prefix="dc: http://purl.org/dc/terms/ mwr: http://en.wikipedia.org/wiki/Special:Redirect/">
  <meta charset="UTF-8">
  <meta property="dc:isFormatOf" resource="mwr:revision/555790550">
  <meta about="mwr:revision/555790550"
        property="mw:articleNamespace" content="0"
        datatype="xsd:integer">
  <meta about="mwr:revision/555790550"
        property="dc:modified" content="2013-04-15T00:02:02.000Z"
        datatype="xsd:dateTime">
  <meta about="mwr:revision/555790550"
        property="dc:replaces" resource="mwr:revision/552548208">
  <meta about="mwr:revision/555790550"
        property="mw:revisionSHA1"
        content="59b13d0d38a8f8992d5f7231f439c8805656f71c">
  <meta about="mwr:revision/555790550"
        property="dc:description"
        content="replacing with permanent TAFI code">
  <meta about="mwr:revision/555790550"
        property="dc:contributor" resource="mwr:user/153365">
  <meta about="mwr:user/153365"
        property="dc:title" content="Tariqabjotu">

<title>Main Page</title>

<base href="//en.wikipedia.org/wiki/">
</head>

This uses the Dublin Core metadata terms exclusively, and describes the Parsoid output as "a format of" a particular revision of the Wikipedia article. That revision is also described as "replacing" an earlier version of the article. The Wikipedia user is identified by userid and described as a "contributor" to the particular revision of the Wikipedia article. The username is then given as the "title" associated with that userid (which is how Dublin Core recommends names of people be described).

We use CURIEs for the repeated http://en.wikipedia.org/wiki/Special:Redirect/* urls in order to describe the relations more compactly.
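
To make the compaction concrete, here is a minimal sketch of how a prefixed name expands back into a full URL. It assumes the prefix attribute holds whitespace-separated `name: IRI` pairs; `parse_prefixes` and `expand_curie` are hypothetical helper names, and this is not a full RDFa processor.

```python
def parse_prefixes(prefix_attr):
    """Turn 'dc: http://... mwr: http://...' into {'dc': 'http://...', ...}."""
    tokens = prefix_attr.split()
    # Tokens alternate between 'name:' and the IRI it maps to.
    return {
        tokens[i].rstrip(":"): tokens[i + 1]
        for i in range(0, len(tokens) - 1, 2)
    }


def expand_curie(curie, prefixes):
    """Expand 'prefix:reference' using the map; return unknown input unchanged."""
    prefix, sep, reference = curie.partition(":")
    if sep and prefix in prefixes:
        return prefixes[prefix] + reference
    return curie


prefixes = parse_prefixes(
    "dc: http://purl.org/dc/terms/ "
    "mwr: http://en.wikipedia.org/wiki/Special:Redirect/"
)
print(expand_curie("mwr:revision/555790550", prefixes))
print(expand_curie("dc:contributor", prefixes))
```

With the prefixes from the proposal, `mwr:user/153365` stands for http://en.wikipedia.org/wiki/Special:Redirect/user/153365.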

Overall this looks sensible to me.

The structure can likely be compressed a bit by setting the default about on the head element to the revision.

As a minor niggle about terminology, I'm not convinced that the prefixed URLs are technically CURIEs since I seem to remember that those are not allowed in HTML5 + RDFa.

http://www.w3.org/TR/html-rdfa/ seems to indicate that CURIEs are fine in HTML5/RDFa.

However, re-reading indicates that the 'resource' attribute should really only be used with the <link> element, so some of the <meta>s above should probably still change to <link>. Also, <base> sets the default subject for RDFa processing, so we need to be careful to set an appropriate @about on the isFormatOf property (and then setting @about on <head> to default to the revision is a good idea). The result is:

<html data-parsoid="{}" prefix="mw: http://mediawiki.org/rdf/">

<head data-parsoid="{}" prefix="dc: http://purl.org/dc/terms/ mwr: http://en.wikipedia.org/wiki/Special:Redirect/"
      about="mwr:revision/555790550">
  <meta charset="UTF-8">
  <link about="http://parsoid.wmflabs.org/en/Main_Page"
        rel="dc:isFormatOf" resource="mwr:revision/555790550">
  <link rel="dc:isVersionOf" href="http://en.wikipedia.org/wiki/Main_Page">
  <meta property="mw:articleNamespace" content="0"
        datatype="xsd:integer">
  <meta property="dc:modified" content="2013-04-15T00:02:02.000Z"
        datatype="xsd:dateTime">
  <link rel="dc:replaces" resource="mwr:revision/552548208">
  <meta property="mw:revisionSHA1"
        content="59b13d0d38a8f8992d5f7231f439c8805656f71c">
  <meta property="dc:description"
        content="replacing with permanent TAFI code">
  <link rel="dc:contributor" resource="mwr:user/153365">
  <meta about="mwr:user/153365"
        property="dc:title" content="Tariqabjotu">

  <title>Main Page</title>
  <base href="http://en.wikipedia.org/wiki/Main_Page">

</head>

<body>

<div>Welcome to <a rel="mw:WikiLink" href="./Wikipedia" >Wikipedia</a></div>

</body>

This talks about three pages:
http://parsoid.wmflabs.org/en/Main_Page
http://en.wikipedia.org/wiki/Main_Page
http://en.wikipedia.org/wiki/Special:Redirect/revision/555790550

Note that WikiLinks and other content in the <body> are relations on wikipedia pages (not parsoid output).
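
The effect of <base> on links in the <body> can be checked with a short sketch using Python's standard URL resolution. The `resolve_link` helper is hypothetical; the URLs are the ones from the example above.

```python
from urllib.parse import urljoin


def resolve_link(document_url, base_href, href):
    """Resolve href the way a browser would when a <base> element is present."""
    # The <base> href is itself resolved against the document's own URL first.
    effective_base = urljoin(document_url, base_href)
    return urljoin(effective_base, href)


print(
    resolve_link(
        "http://parsoid.wmflabs.org/en/Main_Page",
        "http://en.wikipedia.org/wiki/Main_Page",
        "./Wikipedia",
    )
)
```

The relative WikiLink resolves to a Wikipedia URL rather than to a URL on the Parsoid host, which is why the <body> relations are relations on Wikipedia pages.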

Related URL: https://gerrit.wikimedia.org/r/66300 (Gerrit Change I7da7762462635530189c0d994e89f83a38c1f5ff)

New version removes the controversial dc:isFormatOf triple, a property of "this document" (which is hard to talk about in RDFa if you have a <base> element). The head now looks like:

<!DOCTYPE html>
<html prefix="dc: http://purl.org/dc/terms/
              mw: http://mediawiki.org/rdf/
              mwr: http://en.wikipedia.org/wiki/Special:Redirect/"
      about="mwr:revision/560327612">

<head>

<meta charset="UTF-8">
<meta property="mw:articleNamespace" datatype="xsd:integer" content="0">
<link rel="dc:replaces" resource="mwr:revision/560314723">
<meta property="dc:modified" datatype="xsd:dateTime" content="2013-06-17T17:55:30.000Z">
<meta about="mwr:user/1624037" property="dc:title" content="Edokter">
<link rel="dc:contributor" resource="mwr:user/1624037">
<meta property="mw:revisionSHA1" content="e0564e710b93f998658bd5527f0042eaba6d6c87">
<meta property="dc:description" content="Undid revision 560314723 by [[Special:Contributions/Meno25|Meno25]] ([[User talk:Meno25|talk]]) Sync structure to other main page pages. Don't make null-edits either; post on talk instead.">
<link rel="dc:isVersionOf" href="//en.wikipedia.org/wiki/Main_Page">
<title>Main Page</title>
<base href="//en.wikipedia.org/wiki/Main_Page">

</head>
<body>

<div>Welcome to <a rel="mw:WikiLink" href="./Wikipedia" >Wikipedia</a></div>

</body>
</html>
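
How the default subject set via @about applies to the tags in <head> can be illustrated with a rough sketch. This is a deliberate simplification and not a conforming RDFa processor: it only looks at @about, @property, @rel, @content, @resource and @href, does not expand CURIEs or resolve relative URLs, and the `HeadTriples` class is a hypothetical name.

```python
from html.parser import HTMLParser


class HeadTriples(HTMLParser):
    """Collect (subject, predicate, object) tuples from <meta> and <link> tags."""

    def __init__(self):
        super().__init__()
        self.default_subject = None
        self.triples = []

    def handle_starttag(self, tag, attrs):
        a = dict(attrs)
        if tag in ("html", "head") and "about" in a:
            # @about on <html> or <head> becomes the subject for tags without one.
            self.default_subject = a["about"]
        elif tag == "meta" and "property" in a:
            subject = a.get("about", self.default_subject)
            self.triples.append((subject, a["property"], a.get("content")))
        elif tag == "link" and "rel" in a:
            subject = a.get("about", self.default_subject)
            obj = a.get("resource", a.get("href"))
            self.triples.append((subject, a["rel"], obj))


parser = HeadTriples()
parser.feed(
    '<html about="mwr:revision/560327612"><head>'
    '<meta charset="UTF-8">'
    '<link rel="dc:replaces" resource="mwr:revision/560314723">'
    '<meta about="mwr:user/1624037" property="dc:title" content="Edokter">'
    '<link rel="dc:contributor" resource="mwr:user/1624037">'
    "</head></html>"
)
for triple in parser.triples:
    print(triple)
```

Only the username tag carries its own @about; the other statements fall back to the revision named on <html>.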

[Parsoid component reorg by merging JS/General and General. See bug 50685 for more information. Filter bugmail on this comment. parsoidreorg20130704]

Change 66300 merged by jenkins-bot:
Tweak RDFa markup of page metadata in <head>.

https://gerrit.wikimedia.org/r/66300