Add Schema property 'sameAs' pointing to Wikidata entries
Closed, Resolved | Public | 5 Story Points

Description

Outcome from 2018 SEO project with Go Fish Digital:

Schema.org is an open standard that allows website owners to add structured annotations to their pages. Aside from being useful in their own right, Schema properties are used by search engine crawlers to understand pages, and adding them generally boosts a page's ranking in search results. Wikimedia sites have so far been lucky that crawlers understand them; explicitly adding Schema annotations improves the chances that crawlers will continue to understand our sites.

Go Fish Digital recommended that we add the Schema property sameAs to our pages, pointing to Wikidata. This is a very important property, and whilst other properties are useful, they think sameAs will have the biggest impact, so let's start there.

For example, for the page Yegor_Khokhlov on the English Wikipedia, the Schema annotation in JSON-LD would look something like this:

<script type="application/ld+json">
{
  "@context": "http://schema.org",
  "name": "Yegor Khokhlov",
  "url": "https://en.wikipedia.org/wiki/Yegor_Khokhlov",
  "sameAs": [
    "https://www.wikidata.org/wiki/Q8051149"
  ]
}
</script>

Instead of adding a JavaScript block, we will put the tag inside the HTML.

See https://schema.org/sameAs

Acceptance criteria

  • The meta tags are configurable by a config flag
  • The meta tags are disabled by default
  • When enabled, the meta tags appear in the HTML
  • If a page has no corresponding Wikidata entry, there is no metadata tag.
  • sameAs tag present without JavaScript.
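
The criteria above can be sketched as a single server-side function. This is only an illustration in Python, not the WikibaseClient implementation: the function name build_same_as_tag, its parameters, and the boolean flag are hypothetical, and the real feature flag lives in the extension's configuration.

```python
import json
from typing import Optional


def build_same_as_tag(
    enabled: bool,
    title: str,
    page_url: str,
    item_id: Optional[str],
) -> Optional[str]:
    """Return the JSON-LD script tag, or None when nothing should be emitted."""
    # Disabled by default: the caller must opt in via the (hypothetical) flag.
    if not enabled:
        return None
    # No corresponding Wikidata entry means no metadata tag at all.
    if item_id is None:
        return None
    schema = {
        "@context": "http://schema.org",
        "name": title,
        "url": page_url,
        "sameAs": ["https://www.wikidata.org/wiki/" + item_id],
    }
    # Escape "</" so the payload cannot close the script element early;
    # "\/" is a valid JSON escape for "/".
    payload = json.dumps(schema, ensure_ascii=False).replace("</", "<\\/")
    return '<script type="application/ld+json">' + payload + "</script>"
```

Because the tag is built as a string on the server, it is present in the HTML without any client-side JavaScript running, and a page with no linked item yields no tag.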

Developer notes

Per T204070 we think this would be best inside the WikibaseClient extension

Pros: The code relies on Wikibase item data set by Wikibase Client, so it might be more convenient to put it in the same codebase behind a feature flag rather than creating a new extension that depends on Wikibase Client. The Wikibase extension has already been set up and is used in many wikis.

Event Timeline

As I mentioned during today's standup ritual, there's an additional component that should be taken into account when deciding how to implement this: we'd like this change to be live before the December code freeze.

I misspoke here. There is a risk, however small, that this change could negatively impact the December fundraiser (and any other campaigns before it). Consequently, we'd like this change to be live early enough that we can measure its impact, if any, and revert it if necessary.

@dr0ptp4kt mentioned that it could take longer than two weeks (20 days?) for this change to fully (?) take effect. With this in mind, we should be aiming to deploy this by the middle of October.

Of course, this is based on the premise that reverting this change will rectify any negative impact that the change had. AIUI we can't validate this premise before deploying this change.

What is the rush here?
Can we not deploy this post-fundraiser?

@Jdlrobson - we have committed to completing this before the end of the year, which basically means we need to have it live before the fundraiser.

Discussed with @dr0ptp4kt today and we estimated that it would be best to have this change made by the end of October (two weeks after the initial estimate made above). From there we will analyze the effect of the change and revert if necessary, although I feel that is not very likely.

Change 465547 had a related patch set uploaded (by Niedzielski; owner: Stephen Niedzielski):
[mediawiki/extensions/Wikibase@master] WIP: Update: add schema

https://gerrit.wikimedia.org/r/465547

Per discussion, I've continued with the minimal JSON-LD approach. I've uploaded a WIP with a Q number hardcoded while we figure out the linking issue subtasked. I've staged the change. It literally injects:

<script type="application/ld+json">{"@context":"http:\/\/schema.org","name":"Show Me the Money (U.S. game show)","url":"http:\/\/readers-web-stephen.wmflabs.org\/wiki\/Show_Me_the_Money_(U.S._game_show)","sameAs":["http:\/\/readers-web-stephen.wmflabs.org\/wiki\/Q7503010"]}</script>

Or, after pretty print and unescaping:

<script type="application/ld+json">
{
  "@context": "http://schema.org",
  "name": "Show Me the Money (U.S. game show)",
  "url": "http://readers-web-stephen.wmflabs.org/wiki/Show_Me_the_Money_(U.S._game_show)",
  "sameAs": [
    "http://readers-web-stephen.wmflabs.org/wiki/Q7503010"
  ]
}
</script>
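
For anyone who wants to reproduce the unescaping step, here is a small Python sketch. It relies only on the fact that "\/" is a valid JSON escape for "/", so any JSON parser reads the compact form and the pretty-printed form as the same data. The string is copied from the injected block above; nothing else is assumed.

```python
import json

# The compact payload as injected; "\/" is a valid JSON escape for "/".
compact = (
    r'{"@context":"http:\/\/schema.org",'
    r'"name":"Show Me the Money (U.S. game show)",'
    r'"url":"http:\/\/readers-web-stephen.wmflabs.org\/wiki\/Show_Me_the_Money_(U.S._game_show)",'
    r'"sameAs":["http:\/\/readers-web-stephen.wmflabs.org\/wiki\/Q7503010"]}'
)

# Parsing resolves the escaped slashes; re-serializing with indent pretty-prints.
data = json.loads(compact)
pretty = json.dumps(data, indent=2)
print(pretty)
```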

This is verbatim the recommendation provided. However, Google's structured data testing tool flags a missing @type field as an error. I think this should be "instance of" or "subclass of". I'll look into how to query this extra information next.

The WIP now records description and @type fields. The former is not required but seems to be recommended for the microdata format we were exploring previously. The latter is flagged as an error by Google's structured data testing tool when omitted.

I am currently using instance of and subclass of for @type which is also not understood by Google's structured data testing tool. I don't know if this is the correct vocabulary for the near term. Dan Brickley didn't seem to think it was an issue in this StackOverflow response to use unrecognized strings but I'm just guessing.

@Tbayer has given me access to the desktop and mobile en/eswiki Google search consoles.

I've made a general request for help in the Wikidata IRC channel. @ovasileva, if there's a contact I should work with on this task in Wikidata that would be immensely helpful to ensuring it's done properly and as quickly as possible.

I'm somewhat blocked on T206567 and T206570. I haven't started writing unit tests but being unable to link items impedes development.

@Niedzielski @type is tricky. The @context key refers to the vocabulary used to define the structured data. The problem is that http://schema.org and http://wikidata.org are different vocabularies, so a @type defined by Wikidata may be different from one defined by schema.org. Where schema.org defines "TVSeries", Wikidata defines "Q5398426" (television series). I don't think many search engines recognize the http://wikidata.org vocabulary, so "@context": "http://wikidata.org", is invalid. Omitting the @context key does validate the markup, but I'm not sure there's much value without it.

Jdrewniak added a comment. Edited Oct 11 2018, 11:27 AM

An update (I love to nerd out on this stuff). We're not the only ones to run into this problem. There are long-standing discussions about the relation between schema.org and Wikidata. Reading through https://www.wikidata.org/wiki/Wikidata:Schema.org led me to this monster GitHub issue, https://github.com/schemaorg/schemaorg/issues/280, where I noticed the use of the equivalentClass property. It turns out that in Wikidata, the equivalentClass property exists for many entities and actually points to a schema.org entity, see https://www.wikidata.org/wiki/Q5398426 . I think where that property exists, we could do something like "get instanceof, and then get equivalentClass of that instanceof", and that might give us the valid schema.org @type for that entity.

@Jdrewniak, thanks for looking into this. FWIW, @Tbayer also alluded to these mapping concerns in his comment. What do you think we should do for the many entities that do not have an equivalent class? For example, kitten is not an instance of anything but is a subclass of house cat, and that has no equivalent class. Should I fall back to instance of and then to subclass of?
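
One possible reading of the lookup discussed in these two comments, sketched in Python over an in-memory table rather than the real Wikibase API. The property IDs are the Wikidata ones (P31 instance of, P279 subclass of, P1709 equivalent class). Everything else is an assumption for illustration: the function names, the item IDs Q_SHOW, Q_KITTEN and Q_HOUSE_CAT, and the exact equivalent-class URI shown for Q5398426.

```python
from typing import Dict, List, Optional

# Wikidata property IDs.
INSTANCE_OF = "P31"
SUBCLASS_OF = "P279"
EQUIVALENT_CLASS = "P1709"
SCHEMA_PREFIXES = ("http://schema.org/", "https://schema.org/")

# item ID -> property ID -> list of values (hypothetical in-memory shape).
Claims = Dict[str, Dict[str, List[str]]]


def schema_type_of(class_id: str, claims: Claims) -> Optional[str]:
    """Return the schema.org type name that class_id declares as equivalent, if any."""
    for uri in claims.get(class_id, {}).get(EQUIVALENT_CLASS, []):
        for prefix in SCHEMA_PREFIXES:
            if uri.startswith(prefix):
                return uri[len(prefix):]
    return None


def resolve_type(item_id: str, claims: Claims) -> Optional[str]:
    """Try the item's 'instance of' classes first, then its 'subclass of' classes."""
    statements = claims.get(item_id, {})
    for prop in (INSTANCE_OF, SUBCLASS_OF):
        for class_id in statements.get(prop, []):
            found = schema_type_of(class_id, claims)
            if found is not None:
                return found
    return None


# Toy data: only Q5398426 comes from the thread; the rest is made up.
example_claims: Claims = {
    "Q_SHOW": {INSTANCE_OF: ["Q5398426"]},
    "Q5398426": {EQUIVALENT_CLASS: ["https://schema.org/TVSeries"]},
    "Q_KITTEN": {SUBCLASS_OF: ["Q_HOUSE_CAT"]},
    "Q_HOUSE_CAT": {},
}

print(resolve_type("Q_SHOW", example_claims))
print(resolve_type("Q_KITTEN", example_claims))
```

In this sketch the kitten case resolves to nothing, which leaves the caller to decide on a default; it does not answer the open question of what that default should be.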

I also worked on proper JSON+LD statements in T44063: [Epic] Provide a plain linked data interface for accessing entities and T164655: Store and serve annotations in W3C standard format. FWIW, with https://gerrit.wikimedia.org/r/384050 you get the following JSONLD for Q100:

{
    "@graph": [
        {
            "@id": "wdata:Q100",
            "@type": "schema:Dataset",
            "about": "wd:Q100",
            "license": "http://creativecommons.org/publicdomain/zero/1.0/",
            "softwareVersion": "0.1.0",
            "version": 2799,
            "dateModified": "2018-06-21T00:08:11Z",
            "statements": 85,
            "identifiers": 0,
            "sitelinks": 184
        },
        {
            "@id": "wd:Q100",
            "@type": "wikibase:Item"
        },
        {
            "@id": "https://sv.wikivoyage.org/wiki/Boston",
            "@type": "schema:Article",
            "about": "wd:Q100",
            "inLanguage": "sv",
            "isPartOf": "https://sv.wikivoyage.org/",
            "name": {
                "@language": "sv",
                "@value": "Boston"
            }
        },
  ...etc...
}

That suggests that the proper @type is schema:Article, although it is pointing to a wikibase:Item.

@cscott, thank you very much for these references.

Niedzielski added a subscriber: pmiazga.

Patch revised. @ovasileva, moving to blocked pending feedback from Wikibase folks. @Jdlrobson, @pmiazga, reviews appreciated in the meantime!

mpopov added a subscriber: mpopov. Oct 12 2018, 5:51 PM
Addshore moved this task from incoming to in progress on the Wikidata board. Oct 16 2018, 1:49 PM
Niedzielski added a subscriber: BBlack.

Just to keep this ticket somewhat in sync with the patch, the latest patchset produces schemas like:

<script type="application/ld+json">
{
  "@context": "https://schema.org",
  "@type": "Article",
  "name": "Douglas Adams",
  "url": "https://de.wikipedia.org/wiki/Douglas_Adams",
  "sameAs": "https://www.wikidata.org/entity/Q42",
  "mainEntity": "https://www.wikidata.org/entity/Q42",
  "author": {
    "@type": "Organization",
    "name": "Contributors to Wikimedia projects"
  },
  "publisher": {
    "@type": "Organization",
    "name": "Wikimedia Foundation, Inc.",
    "logo": {
      "@type": "ImageObject",
      "url": "https://www.wikidata.org/extensions/Wikibase/client/assets/wikimedia.png"
    }
  },
  "datePublished": "2002-05-27T18:26:23Z",
  "dateModified": "2018-09-28T20:16:12Z",
  "image": "https://upload.wikimedia.org/wikipedia/commons/c/c0/Douglas_adams_portrait_cropped.jpg",
  "headline": "British author and humorist (1952–2001)"
}
</script>

Thanks largely to schema feedback from @cscott and improvements from @pmiazga. The current version of the patchset requires further advisement from WMDE.

Since this change would increase the size of every page in the main namespace, I've added the performance team and @BBlack for any caching concerns.

A spike for server-side A/B tests, which we haven't done before, is tracked in T206868. Otherwise it's fully on or off per wiki.

Tarrow added a subscriber: Tarrow. Oct 17 2018, 10:01 AM

@Addshore, I wanted to thank you for your help formally on this task and general Wikidataness. I've greatly benefited from your gentle guidance :]

We talked about this work in our standup today and I also wanted to share that conversation. We have an A/B test spike that we will be working on in parallel with this task. We actually haven't done an A/B test for server rendered pages previously, hence the spike, but believe it would be very valuable to evaluate how the schema addition impacts SEO. The timeline for both this task and the A/B test is "coming soon" with our drop dead deploy date being November 15th. Given the scope, you can imagine that we're eager to complete this work much sooner, hopefully this month if at all possible.

No one on the Web team has worked in these extensions previously so any continued help you can provide would be greatly welcomed. The codebase is so well separated that it's been great to write in but we're more familiar with MediaWiki proper and less with Wikidata so further guidance is needed.

Thank you again for all of your help and patience. Between you and @cscott, I hope good things will come of this work.

Change 465547 merged by jenkins-bot:
[mediawiki/extensions/Wikibase@master] Update: add Article schema to pages for SEO

https://gerrit.wikimedia.org/r/465547

@Jdlrobson, @pmiazga, @Jdrewniak, @nray I can pair with one of you to help QA as needed.

Gilles added a subscriber: Gilles. Oct 23 2018, 7:19 AM

It seems like this extra content would likely repeat strings present elsewhere in the HTML, which means that this additional content should compress well. Could you look at how much extra weight it adds to the page when gzipped on a couple of articles (big and small)? This should help put the cost into perspective.
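
A rough way to take that measurement offline, sketched in Python. The helper name gzipped_overhead and the choice to insert the block just before </head> are assumptions for illustration; production compression settings may differ from Python's gzip defaults, so the numbers are only indicative.

```python
import gzip


def gzipped_overhead(page_html: str, schema_block: str) -> int:
    """Bytes added to the gzipped page when schema_block is inserted before </head>."""
    with_schema = page_html.replace("</head>", schema_block + "</head>", 1)
    before = len(gzip.compress(page_html.encode("utf-8")))
    after = len(gzip.compress(with_schema.encode("utf-8")))
    return after - before


# Toy page whose body repeats strings that also appear in the schema block.
page = (
    "<html><head><title>Douglas Adams</title></head><body>"
    + "Douglas Adams https://de.wikipedia.org/wiki/Douglas_Adams " * 50
    + "</body></html>"
)
block = (
    '<script type="application/ld+json">'
    '{"name":"Douglas Adams","url":"https://de.wikipedia.org/wiki/Douglas_Adams"}'
    "</script>"
)

print(len(block.encode("utf-8")), gzipped_overhead(page, block))
```

Running it against saved HTML of a large and a small article, with the real generated block, would give the comparison asked for above.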

https://www.wikidata.org/extensions/Wikibase/client/assets/wikimedia.png is a very unusual location for a static image. Was this vetted by Traffic? This image being consumed by bots/crawlers means a long-term commitment to that URL working. I would have expected it to be housed in /static/ where all the logos are, including Wikidata's own, i.e. something like https://www.wikidata.org/static/images/project-logos/wikimedia.png or https://www.wikidata.org/static/images/wikimedia.png

Change 469214 had a related patch set uploaded (by Niedzielski; owner: Stephen Niedzielski):
[operations/mediawiki-config@master] Update: add Wikimedia logo for SEO

https://gerrit.wikimedia.org/r/469214

Change 469215 had a related patch set uploaded (by Niedzielski; owner: Stephen Niedzielski):
[mediawiki/extensions/Wikibase@master] Hygiene: move SEO asset to mediawiki-config

https://gerrit.wikimedia.org/r/469215

Change 469215 merged by jenkins-bot:
[mediawiki/extensions/Wikibase@master] Hygiene: move SEO asset to mediawiki-config

https://gerrit.wikimedia.org/r/469215

Thanks @Gilles. Patches are up.

@Jdlrobson, any suggestion for who to ping on https://gerrit.wikimedia.org/r/#/c/operations/mediawiki-config/+/469214/? I'm happy to iterate on the patch if further work is needed but it's currently blocking the resolution of this card with no feedback.

Change 469214 merged by jenkins-bot:
[operations/mediawiki-config@master] Update: add Wikimedia logo for SEO

https://gerrit.wikimedia.org/r/469214

Mentioned in SAL (#wikimedia-operations) [2018-10-31T23:44:10Z] <tgr@deploy1001> Synchronized static/images/wmf-hor-googpub.png: SWAT: [[gerrit:469214|Update: add Wikimedia logo for SEO (T198946, T207790)]] (duration: 00m 53s)

This looks unblocked?
Can we move this to QA/sign off?

I've moved T207790 to a subtask of T208755 and removed T206567 and T206570. Resolution of the latter two would surely help with verification of this task but can be considered separately.

Tpt added a subscriber: Tpt. Nov 6 2018, 12:39 PM
ovasileva assigned this task to pmiazga. Nov 7 2018, 6:07 PM
Elitre removed a subscriber: Elitre. Nov 8 2018, 4:42 PM

I've removed T198970 as a parent since this is a child of T208755

Tested on the beta cluster; works as expected. I only did smoke tests; the proper, thorough testing will be tracked in T208772.

pmiazga closed this task as Resolved. Nov 12 2018, 8:27 PM
pmiazga removed pmiazga as the assignee of this task.
pmiazga updated the task description.

Change 477522 had a related patch set uploaded (by Lucas Werkmeister (WMDE); owner: Lucas Werkmeister (WMDE)):
[operations/mediawiki-config@master] Fix Wikidata base URI in client config

https://gerrit.wikimedia.org/r/477522

Change 477522 merged by jenkins-bot:
[operations/mediawiki-config@master] Fix Wikidata base URI in client config

https://gerrit.wikimedia.org/r/477522

Mentioned in SAL (#wikimedia-operations) [2019-02-04T12:40:05Z] <lucaswerkmeister-wmde@deploy1001> Synchronized wmf-config/InitialiseSettings.php: SWAT: [[gerrit:477522|Fix Wikidata base URI in client config (T198946)]] (duration: 00m 46s)