Bring schema.org markup on articles in line with RDF export of sitelinks
Open, Needs Triage · Public

Description

Sorry to be “backseat driving” or “armchair reviewing” here, but I just saw the parent task and have a few comments on the markup proposed there. Feel free to split this into subtasks, ignore everything you don’t agree with, or mangle this task beyond recognition to better fit your workflow :)

In general, I’m comparing this to the existing RDF we output for sitelinks as part of the Wikibase RDF output. See, for example, Q500927.ttl for the same item as in T209352.

  • Since the JSON-LD contains the article link as the url and not as the @id of the JSON object, all the triples described in it actually have a blank node as the subject (see the N-Quads output in the JSON-LD playground). In theory, this means that all the other points I mention here don’t matter as much, since this describes a different subject than the sitelinks in Wikibase’s RDF export, but it’s still a bit odd.
  • While the schema.org documentation [claims that http://schema.org/ and https://schema.org/ are both fine](https://schema.org/docs/faq.html#19), in general, RDF URIs must match exactly in order to describe the same resource (including the protocol), and most RDF-based tools will not recognize https://schema.org/Article as the http://schema.org/Article resource they’re familiar with. In Wikibase, we declare the schema: prefix to mean http://schema.org/.
  • The JSON-LD connects the article and its Wikibase item via schema:sameAs, whereas in the Wikibase RDF export, they’re distinct entities, connected via schema:about.
  • When the JSON-LD is converted to other RDF syntaxes (see the playground link above), the datePublished and dateModified values get the data type schema:Date, whereas in the Wikibase RDF export, the dateModified value (of an item – we don’t emit it for sitelinks) has the data type xsd:dateTime. But the schema:Date type doesn’t seem to be specified in the JSON-LD itself, so I’m not sure if this isn’t a bug (problem?) with schema.org itself, actually.
  • In the Wikibase RDF output, the article title (schema:name) is a language-tagged string ("World"@en), whereas in the JSON-LD it seems to be a plain string ("World").
    • The schema:headline is not part of the Wikibase RDF output, but should probably also be language-tagged in the JSON-LD. In fact, it might make sense to declare a language for the whole @context, though in that case the author and publisher names should be reset to English if they’re not translated (are they?).
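To make the points above concrete, here is a minimal sketch of what JSON-LD following all of them might look like, built as a plain Python dictionary. It is not the markup that is actually emitted: the function name, the article URL, the item URI (Q0) and the timestamp are placeholders chosen for illustration, and an inline @context is used so that the example does not depend on the remote schema.org context.

```python
import json

XSD = "http://www.w3.org/2001/XMLSchema#"


def build_article_jsonld(article_url, item_uri, title, language, modified):
    """Return a JSON-LD dict for an article (hypothetical helper).

    All arguments are plain strings; `modified` is expected to be an
    xsd:dateTime lexical form such as "2020-01-02T03:04:05Z".
    """
    return {
        "@context": {
            # http, not https, to match the schema: prefix used by Wikibase
            "@vocab": "http://schema.org/",
            "xsd": XSD,
            # default language: plain strings become language-tagged
            "@language": language,
        },
        # article URL as the subject, so the triples do not hang off a blank node
        "@id": article_url,
        "@type": "Article",
        "name": title,
        # article and item stay distinct entities, linked via schema:about
        "about": {"@id": item_uri},
        # explicit data type instead of the implied schema:Date
        "dateModified": {"@value": modified, "@type": "xsd:dateTime"},
    }


if __name__ == "__main__":
    # Placeholder values, not taken from any real page.
    example = build_article_jsonld(
        "https://en.wikipedia.org/wiki/Example",
        "http://www.wikidata.org/entity/Q0",
        "Example",
        "en",
        "2020-01-02T03:04:05Z",
    )
    print(json.dumps(example, indent=2))
```

Pasting the printed output into the JSON-LD playground should show the article URL as the subject and a language tag on the name. Whether search engines treat this form the same way as the current markup is a separate question.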

In general, I’m approaching this from a very RDF-based perspective, whereas you seem to be more oriented towards search engines – for all I know, any of the points I raise here could actually hurt SEO, so don’t follow me blindly here :) It would probably also be good to get some input from other people more familiar with RDF and especially Wikibase’s RDF output, such as @Smalyshev.

Event Timeline

The JSON-LD connects the article and its Wikibase item via schema:sameAs

I think that's wrong: these are completely different data sets. The article is human-readable text on a specific language wiki, and the Wikibase item is a structured data set. If we want a link from the article to the item, we could use subjectOf (also not entirely correct), isBasedOn (reverse direction), or about (also reverse).

However, if search engines use it this way, we may have to leave it in.

https://schema.org/Article as the http://schema.org/Article

I prefer http, since this is a URI used as an identifier, not a data access protocol, but I don't think it matters too much here. A proper tool should be able to handle both.
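As a small illustration of the point under discussion: RDF compares IRIs as plain strings, so the two schema.org spellings only coincide if a consumer normalises them first. The sketch below assumes a consumer that chooses to rewrite the https namespace to the http one; the function names are made up for this example and do not come from any particular RDF library.

```python
HTTP_NS = "http://schema.org/"
HTTPS_NS = "https://schema.org/"


def same_resource(iri_a, iri_b):
    """RDF-style IRI equality: exact string comparison, scheme included."""
    return iri_a == iri_b


def to_http_schema_org(iri):
    """Rewrite an https schema.org IRI to its http form; leave others alone."""
    if iri.startswith(HTTPS_NS):
        return HTTP_NS + iri[len(HTTPS_NS):]
    return iri


if __name__ == "__main__":
    declared = "https://schema.org/Article"
    expected = "http://schema.org/Article"
    print(same_resource(declared, expected))
    print(same_resource(to_http_schema_org(declared), expected))
```

A strict tool stops at the first comparison and sees two unrelated resources; a tolerant one has to carry a special case like the second function.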

the data type schema:Date

I'm not sure whether schema:Date is meant to be used as a data type for literals. I suspect it is not, but I can't find any conclusive proof. xsd:dateTime is a well-established date literal type, so I'd probably use it instead.
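For reference, a rough sketch of how an xsd:dateTime literal could be rendered in N-Triples syntax from a Python datetime. The helper name is hypothetical and the timestamp is an arbitrary placeholder; the helper expects a timezone-aware value and normalises it to UTC.

```python
from datetime import datetime, timezone

XSD_DATETIME = "http://www.w3.org/2001/XMLSchema#dateTime"


def typed_datetime_literal(moment):
    """Render a timezone-aware datetime as an N-Triples xsd:dateTime literal."""
    if moment.tzinfo is None:
        raise ValueError("expected a timezone-aware datetime")
    # Convert to UTC and use the "Z" suffix in the lexical form.
    lexical = moment.astimezone(timezone.utc).strftime("%Y-%m-%dT%H:%M:%SZ")
    return f'"{lexical}"^^<{XSD_DATETIME}>'


if __name__ == "__main__":
    # Arbitrary placeholder timestamp.
    print(typed_datetime_literal(datetime(2020, 1, 2, 3, 4, 5, tzinfo=timezone.utc)))
```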

Lucas_Werkmeister_WMDE raised the priority of this task from Medium to Needs Triage. Nov 14 2018, 11:12 AM

While the schema.org documentation claims that http://schema.org/ and https://schema.org/ are both fine, in general, RDF URIs must match exactly in order to describe the same resource (including the protocol), and most RDF-based tools will not recognize https://schema.org/Article as the http://schema.org/Article resource they’re familiar with. In Wikibase, we declare the schema: prefix to mean http://schema.org/.

The Google Developer Console reports all Schema.org references as parsed with HTTP, regardless of whether they were declared with HTTPS, so changing to HTTP would presumably have no effect on SEO.

Change 719530 had a related patch set uploaded (by Alexandros Kosiaris; author: Alexandros Kosiaris):

[operations/puppet@production] Update jabram's ssh key

https://gerrit.wikimedia.org/r/719530

Ignore my silly typo, thanks!

Jdlrobson subscribed.

Hey @Lucas_Werkmeister_WMDE, three years later, I'm wondering if there are any action items I can help with in this ticket?

@Lucas_Werkmeister_WMDE: Could you please answer the last comment? Thanks in advance!

As far as I can tell, all the points in the task description still stand: the triples still start with blank nodes; the declared schema.org namespace still uses HTTPS; article and item are still connected using schema:sameAs; date values still use schema:Date instead of xsd:dateTime; the article title is still not language-tagged. Whether you want to address any of these issues is up to you… I think I agree with Stas that the schema:sameAs seems like the most serious issue, since Wikidata items and Wikipedia articles really shouldn’t be the same entity.