
Redesign Wikimedia Enterprise baseline schema to schema.org standards
Closed, Resolved (Public)

Description

Current:

{
  "qid": "Q...",
  "wpid": "###",
  "title": "TITLE",
  "url": "URL Link",
  "db_name": "WIKI PROJECT CODE",
  "lang": "LANG ISO CODE",
  "revision_dt": "TIMESTAMP",
  "content": {
      "license": ["CC BY-SA"],
      "html": "STRING BLOCK",
      "wikitext": "STRING BLOCK"
  }
}

Details

Due Date
Apr 30 2021, 4:00 AM

Event Timeline

RBrounley_WMF triaged this task as High priority.
RBrounley_WMF set Due Date to Apr 30 2021, 4:00 AM.
RBrounley_WMF updated the task description.

So here's a draft of what we're thinking... curious to hear other folks' thoughts. A few nitpicks I have below:

{
  "name": "Basic_access_authentication",
  "identifier": 1214512,
  "version": 1014850939,
  "dateModified": "2021-03-29T12:52:41Z",
  "url": "https://en.wikipedia.org/wiki/Basic_access_authentication",
  "inLanguage": {
    "name": "English",
    "identifier": "en"
  },
  "mainEntity": {
    "identifier": "Q766839"
  },
  "project": {
    "name": "Wikipedia",
    "identifier": "enwiki"
  },
  "articleBody": {
    "html": "...html of the page",
    "wikitext": "...wikitext of the page"
  },
  "license": [
    {
      "identifier": "CC BY-SA",
      "name": "Creative Commons Attribution–ShareAlike License"
    }
  ]
}
  1. Are there standards for licensing? For instance, is there programmatic shorthand for "CC BY-SA"? Tagging @mwilliams @Seddon on this in case they know.
  2. identifier as the MW Page ID isn't super clear, we can definitely clean this up in docs. But on a read, do we know what things are? Is that something worth changing?
  3. We went ahead and embedded wikitext into the articleBody object. This is better than duplication from our POV
  4. For name, removing the underscores is ideal, but in the interest of keeping page titles consistent as a key, should we adjust this?
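For comparison with the current format in the task description, here is a rough sketch of how the old keys might map onto the draft. It assumes "wpid" is the page ID and "db_name" is the project identifier, which the thread does not state outright. The function name `to_draft` is hypothetical, and "version" is left out because the current format has no counterpart for it.

```python
def to_draft(current: dict) -> dict:
    """Map the current flat record onto the draft schema.org-style layout (sketch)."""
    content = current["content"]
    return {
        "name": current["title"],
        "identifier": int(current["wpid"]),  # assumption: wpid is the page ID
        "dateModified": current["revision_dt"],
        "url": current["url"],
        "inLanguage": {"identifier": current["lang"]},
        "mainEntity": {"identifier": current["qid"]},
        "project": {"identifier": current["db_name"]},  # assumption: db_name is the project code
        "articleBody": {"html": content["html"], "wikitext": content["wikitext"]},
        "license": [{"identifier": code} for code in content["license"]],
    }


old = {
    "qid": "Q766839",
    "wpid": "1214512",
    "title": "Basic_access_authentication",
    "url": "https://en.wikipedia.org/wiki/Basic_access_authentication",
    "db_name": "enwiki",
    "lang": "en",
    "revision_dt": "2021-03-29T12:52:41Z",
    "content": {"license": ["CC BY-SA"], "html": "...", "wikitext": "..."},
}
draft = to_draft(old)
print(draft["mainEntity"])  # {'identifier': 'Q766839'}
```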

Are there standards for licensing? For instance, is there programmatic shorthand for "CC BY-SA"? Tagging @mwilliams @Seddon on this in case they know.

https://spdx.org/licenses/ is probably what you want; but CC-BY-SA (with a version on the end too) is probably about as standardised shorthand as you're going to get
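To make the SPDX suggestion concrete, here is a minimal sketch of turning the loose "CC BY-SA" label plus a version into an SPDX-style identifier. The helper `to_spdx` is hypothetical, and the set below is only a subset of the CC BY-SA entries on the SPDX license list (ported variants are left out).

```python
import re

# Subset of CC BY-SA identifiers from the SPDX license list (unported versions only).
SPDX_CC_BY_SA = {
    "CC-BY-SA-1.0",
    "CC-BY-SA-2.0",
    "CC-BY-SA-2.5",
    "CC-BY-SA-3.0",
    "CC-BY-SA-4.0",
}


def to_spdx(shorthand: str, version: str) -> str:
    """Build an SPDX-style id from a loose label and a version string."""
    # "CC BY-SA" -> "CC-BY-SA": uppercase and replace whitespace with hyphens.
    base = re.sub(r"\s+", "-", shorthand.strip().upper())
    candidate = f"{base}-{version}"
    if candidate not in SPDX_CC_BY_SA:
        raise ValueError(f"not a known SPDX CC BY-SA identifier: {candidate}")
    return candidate


print(to_spdx("CC BY-SA", "3.0"))  # CC-BY-SA-3.0
```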

identifier as the MW Page ID isn't super clear, we can definitely clean this up in docs. But on a read, do we know what things are? Is that something worth changing?

https://en.wikipedia.org/wiki/Paris?action=info - we call it page id in docs/interfaces/APIs, and it can be used in certain locations, but I wouldn't necessarily call it an "identifier", as that's not language that MediaWiki uses. Changing it in okapi is much easier than changing it in MediaWiki (which isn't likely to happen).

identifier as the MW Page ID isn't super clear, we can definitely clean this up in docs. But on a read, do we know what things are? Is that something worth changing?

I would agree that it's not super clear, but it's clear enough. If we are to stick with schema.org, this is probably the best way to go. It might not be super clear at the beginning, but once you get the gist of it, it makes sense and gives you the structure to understand new properties or objects we may add here. On the other hand, as soon as we make an exception to the rule, it can cause confusion. So my argument is: don't make exceptions unless we absolutely need to, because consistency makes the schema easy to expand and read.

Also one more property to consider is namespace; we'll need to add it into the schema as well. My thoughts are either simply presenting it as a number:

{
  ...
  "namespace": 0,
  ...
}

Or with nested object:

{
  ...
  "namespace": {
    "identifier": 0,
    "name": "Article"
  },
  ...
}

I have not found a similar property on schema.org. I may need to double check, but these are my initial thoughts. The second one seems more informative, but whether that information is needed or the number alone is enough is another question...
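If both shapes end up in circulation, a consumer could normalize them to the nested form. A minimal sketch follows, with a hypothetical helper `expand_namespace`. The numbers 0, 6 and 14 are MediaWiki's main, File and Category namespaces; the label "Article" for namespace 0 simply follows the example above.

```python
# Small lookup for illustration only; a real table would cover every namespace.
NAMESPACE_NAMES = {0: "Article", 6: "File", 14: "Category"}


def expand_namespace(value):
    """Accept the bare number or the nested object; return the nested object."""
    if isinstance(value, dict):
        return value
    # Unknown numbers get an empty name rather than a guessed one.
    return {"identifier": value, "name": NAMESPACE_NAMES.get(value, "")}


print(expand_namespace(0))  # {'identifier': 0, 'name': 'Article'}
```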

@Protsack.stephan I prefer the second one there mostly because I think it's understandable outside of the wiki-verse. But curious to see Diana's thoughts as well.

@Reedy I agree with the above. We're attempting to make a consistent schema with the users of our content in mind, rather than mapping directly back to the MW terms. In our docs we'll convert, but we're trying to make some of our APIs reflect some standards... we've been leveraging schema.org for now. This doesn't imply any change to MW.

Thank you for the link to https://spdx.org/licenses/ - that is precisely what I was looking for :)

Looks good, @RBrounley_WMF! Unfortunately, schema.org doesn't have project as a property of Article, so we're using isPartOf for the wiki:

{
  "name": "Basic access authentication",
  "identifier": 1214512,
  "version": 1014850939,
  "dateModified": "2020-08-26T18:48:58Z",
  "url": "https://en.wikipedia.org/wiki/Basic_access_authentication",
  "inLanguage": {
    "identifier": "en"
  },
  "mainEntity": {
    "identifier": "Q3756157"
  },
  "isPartOf": [
    {
      "identifier": "enwiki"
    }
  ],
  "articleBody": {
    "html": "...page html...",
    "wikitext": "...page wikitext..."
  },
  "license": [
    {
      "identifier": "CC-BY-SA-3.0"
    }
  ]
}
  1. +1 to using the codes from https://spdx.org/licenses
  2. If the intent is to use schema.org, we should stick with identifier.
  3. +1 to this approach
  4. Can you elaborate on this? I'm not sure what you mean. In the model, name is in reading format, with spaces instead of underscores.

Also one more property to consider is namespace

I hadn't considered adding namespace since I didn't think it would be useful outside the context of MediaWiki. But if it's something your users need, both of the formats you suggested look good to me, @Protsack.stephan!

This is what we came up with. License does not correspond to https://spdx.org/licenses yet; we need to double check which specific one(s) we'll use, so we're leaving it generic for now. We're also setting isPartOf as an object for now, since we want to be clear that at this point it is the project the page belongs to. If we need to expand this field, we'll turn it into an array as suggested above.

{
  "name": "Delhi",
  "identifier": 200617,
  "version": 4401275,
  "dateModified": "2018-04-24T16:52:59Z",
  "url": "https://en.wikinews.org/wiki/Delhi",
  "namespace": {
    "name": "Article",
    "identifier": 0
  },
  "mainEntity": {
    "identifier": "Q17610398"
  },
  "inLanguage": {
    "name": "English",
    "identifier": "en"
  },
  "isPartOf": {
    "name": "Wikinews",
    "identifier": "enwikinews"
  },
  "articleBody": {
    "html": "...html...",
    "wikitext": "...wikitext..."
  },
  "license": [
    {
      "name": "Creative Commons Attribution Share Alike",
      "identifier": "CC-BY-SA"
    }
  ]
}

Should we use https://en.wikipedia.org/wiki/JSON-LD if we're going for schema.org?
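As a rough idea of what the JSON-LD question would mean in practice, here is a minimal sketch that adds the two JSON-LD keywords to a page object. Using "Article" as the type is an assumption based on the earlier mention of Article in this thread, and `as_json_ld` is a hypothetical name. Custom properties such as namespace would still need their own context definitions, which this sketch does not attempt.

```python
import json


def as_json_ld(page: dict) -> dict:
    """Prefix a page object with a JSON-LD context and type (sketch only)."""
    return {"@context": "https://schema.org", "@type": "Article", **page}


page = {
    "name": "Delhi",
    "identifier": 200617,
    "url": "https://en.wikinews.org/wiki/Delhi",
}
doc = as_json_ld(page)
print(json.dumps(doc, indent=2))
```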

"name": "Basic_access_authentication"

This is a DB key of the article, right? An article can also have a human-readable name (with spaces instead of underscores), and some articles have a special display title, in case editors decided the article needs some specialized formatting of the title. The DB key is kinda internal to the MW database layer. In practice it leaks everywhere though. 'name' is not known in the wiki codebase. There's no such thing as a 'name' of an article. An article has a title; it doesn't have a name.
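The DB key versus human-readable distinction comes down to underscores versus spaces, which can be sketched as below. The function name is hypothetical. A display title set by editors cannot be derived this way, so this only covers the plain reading form.

```python
def db_key_to_reading_title(db_key: str) -> str:
    """Convert a DB-key style title to its reading form (underscores to spaces)."""
    return db_key.replace("_", " ")


print(db_key_to_reading_title("Basic_access_authentication"))
# Basic access authentication
```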

"version": 4401275,

I would assume you are renaming revision ID to a version? Can you please elaborate on the reasons for this change? Revision is one of the most standard parts of the vocabulary in the wiki ecosystem, so renaming 'revision' to 'version' will create A LOT of confusion.

@Pchelolo thanks for this. Addressing below:

  • JSON-LD, we should probably do this but haven't had the need or request from users...will investigate further, thanks for flagging. @apaskulin - thoughts on this?
  • You're right that "name" is the title of an article; this changed to conform to schema.org, using the CreativeWork nomenclature rather than "title". This is where the schema.org status quo will drift away from current wiki norms. There's a larger discussion to be had on this, but since this is mostly outward-facing data, the reasoning is that staying consistent with schema.org gives us clear decision-making on what to call things. It's also familiar in industry.
  • Version vs. Revision - yes, same as above.

Since this is confusing without the docs - here is a pasted reference from our docs (you can view here):

Property      Type              Description
name          Text              Page title in reading-friendly format (spaces instead of underscores)
identifier    Integer           Page ID (MediaWiki page ID)
version       Integer           MediaWiki revision ID
dateModified  Text              Timestamp of latest revision in ISO 8601 format (DateTime)
url           Text              Complete URL for the page
namespace     Object            Type of Wikimedia page (File, Article, Category, etc.)
mainEntity    Thing             Primary subject of the page (Wikidata ID)
inLanguage    Language          Human language the page is written in
isPartOf      array of Project  Wiki the page belongs to
articleBody   Text              Article content in HTML and Wikitext
license       array of License  Content license, including name and url properties
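As a quick way to check a payload against the properties listed in the docs reference, here is a minimal sketch. The property names are copied from the table; the helper name is hypothetical, and treating every property as required is an assumption the docs excerpt does not confirm.

```python
# Property names in the order the docs table lists them.
DOCUMENTED_PROPERTIES = (
    "name", "identifier", "version", "dateModified", "url", "namespace",
    "mainEntity", "inLanguage", "isPartOf", "articleBody", "license",
)


def missing_properties(page: dict) -> list:
    """Return documented properties absent from a page object, in table order."""
    return [key for key in DOCUMENTED_PROPERTIES if key not in page]


delhi = {
    "name": "Delhi",
    "identifier": 200617,
    "version": 4401275,
    "dateModified": "2018-04-24T16:52:59Z",
    "url": "https://en.wikinews.org/wiki/Delhi",
    "namespace": {"name": "Article", "identifier": 0},
    "mainEntity": {"identifier": "Q17610398"},
    "inLanguage": {"name": "English", "identifier": "en"},
    "isPartOf": {"name": "Wikinews", "identifier": "enwikinews"},
    "articleBody": {"html": "...html...", "wikitext": "...wikitext..."},
    "license": [{"name": "Creative Commons Attribution Share Alike", "identifier": "CC-BY-SA"}],
}
print(missing_properties(delhi))  # []
```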

I'm not sure if this is the right place, but, to the extent possible, I would strongly advise against use of the internal dbname/wikiid value in public APIs and metadata. We have made this mistake a dozen times before, and paid for it with painful migrations many times since.

In general the canonical domain name should be used in new developments.

Hello!

In case you haven't seen, we spent a lot of time bikeshedding and deciding on general purpose and MW specific event schema conventions, documented at https://wikitech.wikimedia.org/wiki/Event_Platform/Schemas/Guidelines.

The MediaWiki specific conventions are here: https://wikitech.wikimedia.org/wiki/Event_Platform/Schemas/Guidelines#Common_Mediawiki_event_fields

But, most importantly, I'd really like to stress that capital letters in data keys are a really bad idea.
https://wikitech.wikimedia.org/wiki/Event_Platform/Schemas/Guidelines#Identifier_Naming_Rules

Data moves around. It will be used in different languages with different typing and different naming rules. It will certainly be used in SQL systems, which are for the most part case insensitive. The only common identifier naming rule that will function in all of these systems is snake_case.

Any time data passes through a case insensitive system, it will be normalized, most likely to all lower case.
Fields like isPartOf and mainEntity will become ispartof and mainentity. Longer names that include acronyms get even worse. In camelCase, it isn't clear what the acronym capitalization rules are. E.g. HTTPURLID? HttpUrlId? Whatever the camelCase acronym rule is, the name will be normalized in SQL systems to e.g. httpurlid. Data integration automation code has to reason about which fields are the same. If ingesting data that has capital letters, it is possible that two different fields end up normalized to the same lower cased name. Then we just have to guess about how to ingest data.
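The collision risk can be illustrated with a small check that groups field names by their lower-cased form, which is roughly what a case insensitive system would do to them. The function name `case_collisions` is hypothetical.

```python
from collections import defaultdict


def case_collisions(field_names):
    """Group field names that collapse to the same lower-cased identifier."""
    groups = defaultdict(list)
    for name in field_names:
        groups[name.lower()].append(name)
    # Keep only the groups where two or more distinct spellings collide.
    return {key: names for key, names in groups.items() if len(names) > 1}


print(case_collisions(["HTTPURLID", "HttpUrlId", "isPartOf", "mainEntity"]))
# {'httpurlid': ['HTTPURLID', 'HttpUrlId']}
```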

Every time someone needs to move camelCased data identifiers in case insensitive systems, they will have to write code that reasons about the case changes. If we avoid upper cased field names in our schemas, we are less likely to encounter bugs and breakages in data pipelines.

Additionally, I've heard that camelCase can be difficult for non native English speakers. incomingHTTPRequestIpAddress (which is normalized to incominghttprequestipaddress) is (subjectively) more difficult to read than incoming_http_request_ip_address.
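For reference, one common way to convert camelCase names to snake_case is a two-pass regular expression, sketched below with a hypothetical function name. Note that an all-caps run such as HTTPURLID has no recoverable word boundaries, which is the acronym ambiguity described above.

```python
import re


def to_snake_case(name: str) -> str:
    """Convert a camelCase identifier to snake_case."""
    # Split an acronym run from a following capitalized word: "HTTPRequest" -> "HTTP_Request".
    step1 = re.sub(r"([A-Z]+)([A-Z][a-z])", r"\1_\2", name)
    # Split a lowercase letter or digit from a following capital: "isPart" -> "is_Part".
    step2 = re.sub(r"([a-z0-9])([A-Z])", r"\1_\2", step1)
    return step2.lower()


print(to_snake_case("isPartOf"))                      # is_part_of
print(to_snake_case("incomingHTTPRequestIpAddress"))  # incoming_http_request_ip_address
print(to_snake_case("HTTPURLID"))                     # httpurlid
```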

Thanks for sharing this feedback, all! For some general context, the Architecture Team has posted the knowledge store data model on wiki. We’re planning to take this data model through the technical decision-making process, but for any additional feedback in the meantime, feel free to post to the talk page there.

@Pchelolo

Should we use https://en.wikipedia.org/wiki/JSON-LD if we're going for schema.org?

I definitely think there’s a benefit to the context provided by JSON-LD. We’ll be looking into how we can add something like this to the data model. But for now, we’re focusing on using the terminology from schema.org to build a standalone schema instead of trying to implement schema.org within the HTML.

This is a DB key of the article, right? An article can also have a human-readable name (with spaces instead of underscores), and some articles have a special display title, in case editors decided the article needs some specialized formatting of the title.

Good point! Our intention was to use the human-readable name as the name, but I've added a note to the wiki page to consider using the display title instead.

There's no such thing as a 'name' of an article. An article has a title; it doesn't have a name.

True, but our goal here is to use standardized terms from schema.org, and schema.org doesn’t have a title property, only a name.

I would assume you are renaming revision ID to a version? Can you please elaborate on the reasons for this change?

The intention here is not to change the concept of a revision within MediaWiki, but to use standard terminology for outward-facing data. The purpose of the knowledge store data model is to structure content in a way that is predictable and optimized for distribution outside of Wikimedia. The Architecture Team has decided to do this by mapping our existing MediaWiki concepts onto properties from schema.org. version is the closest approximation in schema.org to our concept of a revision ID.

@Krinkle

I would strongly advise against use of the internal dbname/wikiid value in public APIs and metadata. We have made this mistake a dozen times before, and paid for it with painful migrations many times since. In general the canonical domain name should be used in new developments.

Sounds good to me. So for example, this would be using “en.wikipedia” instead of “enwiki”?

@Ottomata, thanks for this context! We're looking into changing to snake case.

Disclaimer: I think this ticket is a wrong place to raise my more generic concerns. Okapi should do whatever okapi needs to do :) but since there’s an active discussion here, I’ll leave the comment here. Please point me to a more appropriate place if one exists.

I would like to raise one more question: why are we forcing everything into schema.org at all?

From schema.org website:

What is the purpose of schema.org?
Schema.org is a joint effort, in the spirit of sitemaps.org, to improve the web by creating a structured data markup schema supported by major search engines. On-page markup helps search engines understand the information on web pages and provide richer search results. A shared markup vocabulary makes it easier for webmasters to decide on a markup schema and get the maximum benefit for their efforts. Search engines want to make it easier for people to find relevant information on the web. Markup can also enable new tools and applications that make use of the structure.

TLDR it’s a vocabulary designed for formatting data in a way search engines can understand. They themselves state it does not aim to be a general purpose ontology language. Which might be exactly correct for okapi.

I do however feel hesitant about making it a general-purpose standard for our internal events and data structures. As we have already seen, forcing everything into it makes us rename some of the most basic concepts we’ve been using for the past 15 years, and we’ve barely scratched the surface here. I anticipate that if we wanted to standardize around schema.org, we’d need to rename everything.

This begs the question: how deep do you think we should drag schema.org terminology into our stack? Are we going to rename our database fields? I hope not. This means that if we do decide to use schema.org terms, we need to know how to transform from well-established MediaWiki terms into schema.org. No matter what, in practice there will have to be a layer that knows that “version” is a “revision of” and “name” is a “page_id”.

So we need to define not only the schema for schema.org but, probably more importantly, the mappings. If we had an easy transformation, supporting schema.org would become a question of rendering: we would be free to generate schema.org markup on the edge for external consumers that want or need it.
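A mapping layer of that kind could be as small as a lookup table applied at render time. The sketch below is only an illustration: the internal field names on the left are assumptions chosen to look like MediaWiki terms, the right-hand side follows the property table pasted earlier in the thread, and `render_schema_org` is a hypothetical name.

```python
# Internal (MediaWiki-style) term -> schema.org property, per the docs table above.
MEDIAWIKI_TO_SCHEMA_ORG = {
    "page_title": "name",
    "page_id": "identifier",
    "rev_id": "version",
    "rev_timestamp": "dateModified",
}


def render_schema_org(internal: dict) -> dict:
    """Rename known internal fields to schema.org terms; drop anything unmapped."""
    return {
        MEDIAWIKI_TO_SCHEMA_ORG[key]: value
        for key, value in internal.items()
        if key in MEDIAWIKI_TO_SCHEMA_ORG
    }


record = {
    "page_title": "Delhi",
    "page_id": 200617,
    "rev_id": 4401275,
    "rev_timestamp": "2018-04-24T16:52:59Z",
}
print(render_schema_org(record))
# {'name': 'Delhi', 'identifier': 200617, 'version': 4401275, 'dateModified': '2018-04-24T16:52:59Z'}
```

Keeping the table as data rather than code means the internal names never have to change; only the rendering step knows about schema.org.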

Every time someone needs to move camelCased data identifiers in case insensitive systems, they will have to write code that reasons about the case changes.

For reference, here's an example of case aware code we had to write to automate ingestion of event data into Hive. I'd much prefer if we never have to write that again!

I agree with the above points, @Pchelolo - we're deciding now whether it's the best approach for Okapi as well. I think what it provides for us is a clear path for naming things, but it is full of gaps and holes for data we need to add to the schema. The Wikidata front is clearer, in terms of flattening the ontology into schema.org.

We're adding a few new fields and I'm working through the right approach now, we have a clear window right now to make that decision. I'll share with you all as I get through what that might look like.

@Krinkle wrote:

I would strongly advise against use of the internal dbname/wikiid value in public APIs and metadata. We have made this mistake a dozen times before, and paid for it with painful migrations many times since. In general the canonical domain name should be used in new developments.

Sounds good to me. So for example, this would be using “en.wikipedia” instead of “enwiki”?

Yes, and more specifically en.wikipedia.org (I'm not aware of APIs where we return or expect it without TLD).
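Since each page object already carries a full url, one way to get the canonical domain without touching dbname at all is to read it off that URL. A minimal sketch using the standard library follows; the function name is hypothetical.

```python
from urllib.parse import urlsplit


def canonical_domain(page_url: str) -> str:
    """Return the host part of a page URL, e.g. 'en.wikipedia.org'."""
    host = urlsplit(page_url).hostname
    if host is None:
        raise ValueError(f"no host found in URL: {page_url!r}")
    return host


print(canonical_domain("https://en.wikipedia.org/wiki/Basic_access_authentication"))
# en.wikipedia.org
```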

Actually, this function is probably more interesting.

@RBrounley_WMF, @Protsack.stephan: Hi, the Due Date set for this open task was a while ago.
Can you please either update or reset the Due Date (by clicking Edit Task), or set the status of this task to resolved in case this task is done? Thanks.