
Knowledge store data model
Open, Needs Triage, Public

Description

Decision Statement Overview

Text pasted below as well.

What?
Our evolving target architecture introduces an abstraction layer: a modern platform serving collections of knowledge and information, created from multiple trusted sources, to nearly infinite product experiences and other platforms (ref). We need a predictably structured, technology-agnostic data model, in an industry-standard format, at the center of this architecture.

What does the future look like if this is achieved?
This modern platform, acting between the trusted sources and the nearly infinite products and other platforms, transforms knowledge (and data about that knowledge) into predictably structured, technology-agnostic collections. Knowledge can be consumed by, or pushed to, products and platforms. And those products and platforms can ask for only the knowledge they want.

What happens if we do nothing?
We will fail to achieve “moving from viewing Wikipedia as solely a website, to developing, supporting, and maintaining the Wikimedia ecosystem as a collection of knowledge, information, and insights with infinite possible product experiences and applications” [citation]

Without adopting new service-friendly patterns, we’ll struggle to become the world’s infrastructure for free knowledge. As one Wikipedian put it, accessing the knowledge currently means “I need to have a PhD in Mediawiki.”

Why?

  • User Value/Organization Value: Evolve the Wikimedia platform to use modern patterns
    Objective it supports and How: Platform evolution: Evolutionary architecture. Evolve the Wikimedia systems architecture to use more modern patterns. This allows engineering teams to work more independently and build software features more reliably and predictably.
  • User Value/Organization Value: Becoming the infrastructure of free knowledge
    Objective it supports and How: Knowledge as a service: “To serve our users, we will become a platform that serves open knowledge to the world across interfaces and communities. We will build tools for allies and partners to organize and exchange free knowledge beyond Wikimedia. Our infrastructure will enable us and others to collect and use different forms of free, trusted knowledge.” https://meta.wikimedia.org/wiki/Strategy/Wikimedia_movement/2017/Direction#Our_strategic_direction:_Service_and_Equity
  • User Value/Organization Value: Moving from solely a website to an ecosystem with infinite possible consumers
    Objective it supports and How: Modernize our product experience: We will make contributor and reader experiences useful and joyful; moving from viewing Wikipedia as solely a website, to developing, supporting, and maintaining the Wikimedia ecosystem as a collection of knowledge, information, and insights with infinite possible product experiences and applications. https://meta.wikimedia.org/wiki/Wikimedia_Foundation_Medium-term_plan_2019#Goals

Why are you bringing this decision to the Technical Forum?
We would like specific feedback about the data model, which is standardized using a known, documented and expanding vocabulary. We have designed two prototypes using the evolving data model and are planning a bigger, cross-functional initiative. Wikimedia Enterprise is launching using this data model, as it fits the “other platforms” criteria.

Event Timeline

Where do I apply for my PhD in MediaWiki?

Fairly sure you're a professor already :)

Would someone be able to put links to the two prototypes or to overviews of them, since what is requested is feedback about the data model specifically? Without that I'm not sure how you will get the feedback you desire.

In principle, a well-defined schema for a structured knowledge repository is an excellent concept. But modeling it is a herculean task, depending on the granularity, scope and format. Is this concept at a stage where some more details can be shared?

We can define a data model for deep concepts like Game, Chess game, etc. (for example in RDF or JSON-LD). Or we can define models for high-level Wikipedia entities like article, user, edit, category, discussion and history with a platform-agnostic schema. It may help if we clarify the scope and depth here. Some examples can also help. Since we have Wikidata and Abstract Wikipedia, with possibly intersecting concepts, it may be helpful to explain the connections, if any, as well. Thanks.
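To illustrate the kind of example that would help, here is a purely hypothetical, schema.org-flavoured JSON-LD sketch of a high-level entity (an article); the type and property choices are guesses for illustration, not the model being proposed:

```python
import json

# Hypothetical example only: a schema.org-flavoured JSON-LD description of a
# high-level wiki entity (an article). Types and properties are illustrative
# guesses, not the data model under discussion.
article = {
    "@context": "https://schema.org",
    "@type": "Article",
    "name": "Douglas Adams",
    "url": "https://en.wikipedia.org/wiki/Douglas_Adams",
    "dateModified": "2021-06-01T12:00:00Z",
    "isPartOf": {"@type": "WebSite", "name": "Wikipedia", "url": "https://en.wikipedia.org/"},
    "hasPart": [
        {"@type": "WebPageElement", "name": "Early life"},
        {"@type": "WebPageElement", "name": "Environmental activism"},
    ],
}

print(json.dumps(article, indent=2))
```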

I'd love to be a part of this design

Yes to all of the above -- we definitely want to make sure we're iterating on a data model, and it's definitely important to decide on the standard. We do have a couple of artifacts touching on this and we'd love to share those.

I'd just like to point out, process-wise, that this specific ticket is part of the "decision overview" part of the decision process: so this ticket's goal is to *generally* describe the problem, its need, and its scope, without (yet) going into specific solutioning, which is why we didn't get into the details yet.

This is meant to define that the problem is a necessary one to solve and a good one to delve into, what bodies/authorities/people to consult or who would want to be involved, etc.

After defining the problem and its scope (in this ticket + the feedback form and discussion that follow) we will then follow up with a "Decision Record" (and probably several of them) that then deal with specific solutions and delve into discussions about specifics, alternatives, iterations, etc.

Does this clarify things? Of course we'd love to share the more specific thing, but we didn't want to detract from the specific focus that the Decision process was aiming for in the general overview.

<snip>

Does this clarify things? Of course we'd love to share the more specific thing, but we didn't want to detract from the specific focus that the Decision process was aiming for in the general overview.

I get this. The difficulty for me is that this feels so open-ended if there are no examples, that I wouldn't know if I could help provide good feedback, or who could. If folks don't want to reference specific prototypes, which is understandable given your framing of the issue above, then I would remove the reference to those from the statement and maybe name a few of the sorts of knowledge that could be covered, just to give an idea of the scope and help us get our minds around it. Does that make sense?

Yes to all of the above -- we definitely want to make sure we're iterating on a data model, and it's definitely important to decide on the standard. We do have a couple of artifacts touching on this and we'd love to share those.

I'd just like to point out, process-wise, that this specific ticket is part of the "decision overview" part of the decision process: so this ticket's goal is to *generally* describe the problem, its need, and its scope, without (yet) going into specific solutioning, which is why we didn't get into the details yet.

This is meant to define that the problem is a necessary one to solve and a good one to delve into, what bodies/authorities/people to consult or who would want to be involved, etc.

After defining the problem and its scope (in this ticket + the feedback form and discussion that follow) we will then follow up with a "Decision Record" (and probably several of them) that then deal with specific solutions and delve into discussions about specifics, alternatives, iterations, etc.

Does this clarify things? Of course we'd love to share the more specific thing, but we didn't want to detract from the specific focus that the Decision process was aiming for in the general overview.

A bit, but I'm still unsure of how to proceed.

My problem with this approach is that, reading the decision statement above, I would have no clear answer to the questions you want answered. But as you said, here is where we should define the scope of the problem, so let me ask a few clarifying questions:

  • I'm not sure how much of a change compared to our current status quo we're proposing. We've built a Modern Event Platform over the last few years, and we already have well-defined schemas for the various types of events. Do we want to improve and rationalize those models, or rethink them from the ground up?
  • If it's an improvement we're searching for, I think it would be easier to give an opinion with specifics about what's lacking from the current MEP that we need, and why.
  • If it's a radical rethinking of the MEP, I think we'd need an even stronger justification of why our current event/data models are incomplete and why our needs cannot be met with improvements to the current MEP.

Hi,

Disclaimer: total MW noob here :).

Could you maybe clarify the definition of "knowledge store"? From the decision statement, and this thread, my understanding is that the goal is to define a technology agnostic data model.

I'm probably missing context, but I don't fully understand the target. Is the goal to follow and implement known representation patterns (I'm thinking knowledge graphs), or should I read the word "store" in more abstract terms? Would approaches taken by wikidata (and triple stores in general) satisfy the desired data model (at some early iteration level)?

Thank you for this feedback. Some of it mirrors feedback from the initial review, so it will show up in Revision 1 (after the holiday). I will ping when that's ready.
Some notes:

  • We are crafting a foundational standard for the evolving target architecture (linked in the doc if you haven't seen it.) @gmodena your description is exactly right - noob and all :)
  • @Joe This new schema, designed for distributing content, is the heart of a platform that will mature alongside the existing MEP schemas. We rely on event messages to do everything. The goal is to leverage these patterns as we move towards decoupling.
  • The bigger questions about “do we even need this target architecture, why don’t we use what we have?” are valuable, and will continue, but are out of scope for this discussion. This decision process is not reviewing the architecture mission as a whole but the data model at its heart. At the moment, this does not impact the current platform.
  • Artifacts on the previous prototype and the one in progress are here and here. @ArielGlenn Sharing this is premature (they'll be part of the solution document for decision making). But why not?!
  • The schema scope is focused on knowledge (breaking down a page). Enterprise is using the initial version. There are lots of other types of data and the goal is to iterate towards understanding and defining them (much of that by the data team). (@santhosh though I think the target artifacts help as well)
  • @Joe This new schema, designed for distributing content, is the heart of a platform that will mature alongside the existing MEP schemas. We rely on event messages to do everything. The goal is to leverage these patterns as we move towards decoupling.

Sorry, I'm not sure I understood what you meant. Specifically, what do you mean by "distributing content"? Do you mean distributing to the public or sending updates to other services? The former would indeed be a completely new initiative with little to do with the MEP; the latter is more or less the current duty of our platform already, hence my question.
Also, what do you mean by "decoupling"? It can have various meanings and I'm not sure how to interpret it in this context.

Also, second question: "we rely on events to do everything"... can you define "everything" in this context? I'm a big fan of using events for asynchronous processing (so, updates, synchronization, batch processing), but I have had multiple bad experiences with using events as the only mediators between services: when people decide to use messages even for synchronous requests (e.g. ones coming from a live user), they're substituting a queue for TCP as the mediator, and that introduces more latency, more complexity and a huge SPOF.

As a side note: the target architecture, as far as I can see, is a few diagrams and slides, which don't really allow me to get a grasp of what the idea is. As an outsider to all the process, it's very hard to get a good grasp of what the proposal of the architecture team is. It makes it even harder to give constructive feedback and also to get on board with the plan.

As an outsider to all the process, it's very hard to get a good grasp of what the proposal of the architecture team is. It makes it even harder to give constructive feedback and also to get on board with the plan.

I respectfully submit that maybe instead of an outsider, you're one of the central insiders. MediaWiki architecture work has been going on for 20 years. I think what's happening right now is that all teams are jumping into this process with different perspectives on what a shared, broader architecture would look like. MEP is our team's attempt at engaging with the architecture of the guts of free knowledge, but we have more ideas. This is the Architecture team's attempt. I think we need a broader conversation spanning from high-level strategy down to specifics. That's going to be involved, and I think we need to make space for it.

Some of the other stuff I'd like to bring in here is the data architecture overall, thinking of it in terms of data pipelines with sources, processing, and serving layers satisfying different SLOs for different consumers. And the knowledge store is a central part of that. I'm not sure if its schema is the right place to start, but I suspect if 100 people look at a blank canvas they'd pick 100 different places to start.

I'm excited for this work, I think it's gonna be great, and a big part of that is going to be your perspective, @Joe. This comment is mostly to make space and point out how big of a canvas we're looking at.

After last week's break and catching up, I've had a chance to look at the two pointers you provided above in T284258#7194754. Thanks! Do I understand correctly that the proposal in this task is coming out of https://www.mediawiki.org/wiki/Architecture_Repository/Strategy/Goals_and_initiatives/Structured_content_proof_of_value#Next_steps ? And the data model would cover structuring the information in a page (wiki article) in various ways, whether by section/subsection, template content or anything else? Is that the basic idea?

The decision document states

We would like specific feedback about the data model, which is standardized using a known, documented and expanding vocabulary.

It is not particularly clear where to find the actual "data model". After digging around I have found this: https://www.mediawiki.org/wiki/Architecture_Repository/Systems/Data_models/Knowledge_store Is this the correct link to provide specific feedback about?

General comments:

  • It's still not clear what the purpose of this is. Standardization is a good thing, but I feel we're moving towards https://xkcd.com/927/. For example, for the Page entity we already have the Action API response schema, the MW REST API response, the RESTBase API response, the MEP event property schema, and probably a few more schemas to represent a page. Is this schema being designed to be 'the one true schema everything else should try moving towards'? Or is this going to be the 15th standard we adopt, used in some corner cases but not others?
  • I'm conflicted on the use of schema.org. Schema.org was not designed as a general-purpose ontology, but as a search engine optimization technique. I agree that integrating better with search engines is a good thing, but I fail to see what benefits dragging schema.org deep inside our stack would provide us. I do see a lot of downsides, mainly that in many places it looks like you are trying to fit a cylinder into a square hole, cramming our terminology into schema.org's. It is already confusing, and these schemas only cover the easiest 1% of a wiki - the further we go with this, the bigger the sacrifices we will need to make in order to fit schema.org.
  • Is this supposed to cover all sister projects, or just Wikipedia?
  • Splitting up the article into sections. What is the definition of a 'section'? AFAIK, it's much harder than it looks. If you consider projects other than Wikipedia, it becomes even harder. How are the section identifiers defined? Are they stable across revisions?
  • Renaming of well-established wiki terms. Revisions become versions, touched becomes date_modified, title becomes name, etc. There are two problems with this: for people with a PhD in MediaWiki, it's very confusing. But even if you come from the outside with a clean slate, it's hard to imagine that all your needs will be satisfied by what's covered in this schema - the wiki universe is vast. So you will need to interact with other APIs and offerings, and you will suddenly find yourself even more confused because everything is named differently. (See the sketch after this list.)
  • Extensive use of HTML. If the 'Page' entity is supposed to be a general 'item of knowledge', not all knowledge is best represented as HTML or even as text. Hard-coding the content of a section under a 'text' property seems to narrow its possible scope a lot. The purpose of the 'text' property is also not clear - is it how the content should be presented to the user? Are machines supposed to be reading this?
  • Inconsistent typing and structure between the various schemas. identifier is the biggest issue here. This especially becomes an issue when identifiers are used in is_part_of and has_part - the consumer must have custom handling for these for each entity, depending on the expected type.
  • keywords can just be an array instead of a comma-separated list. What if a keyword contains a comma?
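To make the renaming and keywords points concrete, here is a small hypothetical comparison; the exact record shapes on both sides are guesses for illustration, not the actual prototype schema:

```python
# Hypothetical illustration only: the same page described with familiar
# MediaWiki-style names and with the renamed, schema.org-style names.
# Field choices are assumptions, not the prototype schema under review.
mediawiki_style = {
    "title": "Douglas Adams",
    "revision": 1019384729,
    "touched": "2021-06-01T12:00:00Z",
}

renamed_style = {
    "name": "Douglas Adams",                   # was: title
    "version": 1019384729,                     # was: revision
    "date_modified": "2021-06-01T12:00:00Z",   # was: touched
    # A comma-separated string is ambiguous if a keyword itself contains a comma:
    "keywords": "English writers, The Hitchhiker's Guide to the Galaxy, humour",
}

# An array avoids the delimiter problem entirely:
keywords_as_array = ["English writers", "The Hitchhiker's Guide to the Galaxy", "humour"]
```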

@Pchelolo currently we are simply framing the problem, not the solution. In your statement are you saying you don't think there is an issue in not having a standard schema for clients to consume knowledge?

The reason you had to dig is that we aren't solutioning yet.

@kchapman I was trying to answer to this:

We would like specific feedback about the data model, which is standardized using a known, documented and expanding vocabulary.

so I was trying to give specific feedback about the prototype.

In your statement are you saying you don't think there is an issue in not having a standard schema for clients to consume knowledge?

If the question is "do we need a data model in general", my thoughts are a bit more generic:

  • I do not think our problem is a lack of data models; I think we already have too many, and all of them are different. We have a lot of various API families (action, REST, RESTBase, etc.), we have events, we have recent changes streams, etc. - all of these are different ways of consuming knowledge in a structured way. I am worried that adding one more data model will just make us end up with 15 different standards instead of 14, unless we are very bold and claim that "this is the one, all the rest are deprecated, we will spend Herculean effort and standardize all of our existing knowledge consumption points". Is the new thing being proposed as a replacement for everything we've got so far, or is it yet another thing?
  • I do think that if we decide to make a standard, we must make an effort not to depart too far from what already exists unless there is a VERY good reason to do so. We can take event schemas as a base and improve them, or REST API responses, or whatever else we already have. We should preserve existing well-established terminology as much as possible. Making the new data model a true standard and making things consistent between new and existing offerings is already like climbing Mount Everest; the less we change, the easier it is. Renaming 'revision' to 'version' seems like an easy thing on paper; in reality it would add human-work-years to the standardization project.
  • I do not think it is necessary to follow any existing third-party standard unless it fits us very, very well. It's easy to convert between various formats (for example, convert from MEP events to schema.org format), but every sacrifice we need to make to conform our internals to a third party will come at a very steep cost.

TLDR: I think having a standard is a good thing. I think a standard should be based on some other model we already have and not be a brand new thing. I do not think it will be beneficial to follow schema.org or any other third-party format; it will bring more problems than solutions. I do think that the project is only worthwhile if we truly commit to standardizing (or deprecating) our existing stuff, otherwise this will just become one more model used in one more corner of the infrastructure.

Another clarifying point is that this isn't about changing our internal nomenclature "deep inside our stack". This is a solution meant for clients like Wikimedia Enterprise: a distribution method that provides a layer between the information we have in our databases, which uses terms and nomenclature very specific to MediaWiki operations, and consumers who want to digest pieces of the data using as much commonly "human-digestible" terminology and structure as possible.

For that matter, that's the benefit of schema.org - while it might not capture every specific nuance, it does provide an initial level of standardized structure intended to describe the structure of pieces of knowledge. Wikipedia's unique content means that nothing external is ever going to be a 100% fit out of the box, but starting with a schema that has some standardization built into it is not a bad start.

This document lays out the general need to have this layer and looks for the interested and impacted parties that should be involved in the follow-up decisions of implementation, so we can proceed with technical decisions about this overlay layer.

It does not describe changes to our current underlying systems, databases, or event schemas; it describes an overlay layer for systems that are directly meant for distribution, rather than strictly editing within the context of MediaWiki.

Does this clarify some of the concerns?

If the question is "do we need a data model in general", my thoughts are a bit more generic:

  • I do not think our problem is a lack of data models; I think we already have too many, and all of them are different. We have a lot of various API families (action, REST, RESTBase, etc.), we have events, we have recent changes streams, etc. - all of these are different ways of consuming knowledge in a structured way. I am worried that adding one more data model will just make us end up with 15 different standards instead of 14, unless we are very bold and claim that "this is the one, all the rest are deprecated, we will spend Herculean effort and standardize all of our existing knowledge consumption points". Is the new thing being proposed as a replacement for everything we've got so far, or is it yet another thing?

When we talk about a modern way of consuming our data and about decoupling the UI from the data so that products (anywhere, including consumers of the Wikimedia Enterprise schema, or volunteer developers, or anyone around the world) can utilize powerful aspects of our data to create meaningful experiences, we want to have a consolidated single schema that is consistent, machine- and human-readable, and available without having consumers do full-on research on individual bits of our systems to consume different aspects of our stack differently.

This is the whole goal of having a unified consolidated data layer, meant for these types of consumers, as a distribution system.

  • I do think that if we decide to make a standard, we must make an effort not to depart too far from what already exists unless there is a VERY good reason to do so. We can take event schemas as a base and improve them, or REST API responses, or whatever else we already have. We should preserve existing well-established terminology as much as possible. Making the new data model a true standard and making things consistent between new and existing offerings is already like climbing Mount Everest; the less we change, the easier it is. Renaming 'revision' to 'version' seems like an easy thing on paper; in reality it would add human-work-years to the standardization project.

I have to challenge here -- why? If we're talking about two essentially different audience-uses here, then why not create an optimized delivery standard for the audience that wants an optimized distribution system? A lot of the terminology and deeply nested technology and APIs we have serve the multiplex of combined purposes of editing (either the wikitext itself or the data around it, like categories, etc) or consuming the knowledge as web pages. That behavior suits specific needs, but the needs we're trying to answer now are different, and the consumers with those needs have different expectations. In many ways, providing an overlaying system that is optimized for this (very large) audience means that we can decouple a lot of the technical concerns that are not relevant for *that* audience (but very much are for the other audiences we currently have).

I think my biggest question here is:

Doesn't having systems that are optimized for their intended audiences and consumers mean a better experience (both for the consumer and for the maintainers) than having one monolithic system that needs to answer all needs all the time everywhere?

  • I do not think it is necessary to follow any existing third-party standard unless it fits us very, very well. It's easy to convert between various formats (for example, convert from MEP events to schema.org format), but every sacrifice we need to make to conform our internals to a third party will come at a very steep cost.

That's a good point for us to discuss in the implementation step, but I think we also have a different view of what would need to be translated to what else; we don't have to convert the MEP events to schema.org right now in order to ship a dynamic but consistent human-readable canonical data model in the distribution system layer.

TLDR: I think having a standard is a good thing. I think a standard should be based on some other model we already have and not be a brand new thing. I do not think it will be beneficial to follow schema.org or any other third-party format; it will bring more problems than solutions. I do think that the project is only worthwhile if we truly commit to standardizing (or deprecating) our existing stuff, otherwise this will just become one more model used in one more corner of the infrastructure.

That's a good point; I think we might differ on what we consider something "we already have and not a brand new thing". When thinking about the consumer of this data structure, our internal systems and structure aren't really "what they expect"; while schema.org isn't perfect, it *is* a bit closer to what they'd expect as a standard than our internal systems' way of structuring our data (a structure that is most of the time meant for consuming *edits* rather than distributing pieces of knowledge, whatever "pieces" means or will mean in the future).

Ok. I think I finally understand :) Sorry it took me so much text to arrive here, I at least hope that my misunderstandings were shared by someone else and this all will help them figure it out as well.

My understanding so far:

This is the proposal that a specific product (knowledge store) needs a data model. It's not a proposal of a specific model - that will come later. It is not touching other parts of the system and it is not intended as a replacement for anything. The product is intended for a certain audience, and you established that our existing standards will not be suitable for that intended audience, thus a new standard.

Now it all makes sense. In all my prior comments I was first assuming this is a request for comments on the specifics of the prototype, and then assuming this proposed standard would become "the one ultimate standard". Since I think both of my assumptions were incorrect, I don't really have much left to say - it's a bit worrying that we need to develop yet another representation of our data, that will definitely add to the overall confusion of things, but if you have established that it's required to serve a potential audience that we are not able to serve with existing standards - oh well, let's add one more :)

It does not describe changes to our current underlying systems, databases, or event schemas; it describes an overlay layer for systems that are directly meant for distribution, rather than strictly editing within the context of MediaWiki.

That is a very important point. I think a lot of confusion arises from the term "knowledge store". That sounds like the objective is to create the canonical representation for persisting "the sum of all human knowledge". If what we are aiming for is really a model for an interoperability layer with existing infrastructure designed to consume "knowledge" in the form of (hyper)text (aka search engines and related technologies), that's a very different problem. Would it be appropriate to call it the "knowledge dissemination data model"? To me at least this would have made things a lot clearer from the start.

Trying to convey sufficient context without discussing Everything was super tricky. Thank you, truly, @Pchelolo for helping to bring context into view.

This is the proposal that a specific product (knowledge store) needs a data model. It's not a proposal of a specific model - that will come later. It is not touching other parts of the system and it is not intended as a replacement for anything. The product is intended for a certain audience, and you established that our existing standards will not be suitable for that intended audience, thus a new standard.

Rather than a product, I would say a platform, or an abstraction layer, needs a data model. But this may be semantics. (Personal side note: in yesterday's Enterprise discussion, this would refer to the WME stream specifically.) The platform/layer protects us from needing One Ultimate Standard because it maps data from sources (a wiki, ORES, a 3rd party, wikidata) to a standard. Consumers can consume it ... but the source data is also still available, as it is, where it is.

create the canonical representation for persisting "the sum of all human knowledge".

I confess, @daniel, I do dream about that ;) But for many reasons, doubt it'll ever be possible in the real world.

Rather than a product, I would say a platform, or an abstraction layer, needs a data model. But this may be semantics. (Personal side note: in yesterday's Enterprise discussion, this would refer to the WME stream specifically.) The platform/layer protects us from needing One Ultimate Standard because it maps data from sources (a wiki, ORES, a 3rd party, wikidata) to a standard. Consumers can consume it ... but the source data is also still available, as it is, where it is.

Thank you, that clarifies the objective a lot.

I think I have really allowed myself to be misled by the name "knowledge store". It made me think about a data model that represents the entities and actions we use to manage knowledge, one that retains as much detail and complexity as possible, so it can be queried, mapped, transformed and re-used in various ways. A model that is optimized for expressiveness and flexibility at the expense of simplicity.

But for the WME stream, we need the opposite: here, we want a model that is streamlined for a specific use case, hiding much of the underlying complexity, which makes it easy to understand and consume.

Do I understand this correctly?

Ok. I think I finally understand :) Sorry it took me so much text to arrive here, I at least hope that my misunderstandings were shared by someone else and this all will help them figure it out as well.

My understanding so far:

This is the proposal that a specific product (knowledge store) needs a data model. It's not a proposal of a specific model - that will come later. It is not touching other parts of the system and it is not intended as a replacement for anything. The product is intended for a certain audience, and you established that our existing standards will not be suitable for that intended audience, thus a new standard.

Now it all makes sense. In all my prior comments I was first assuming this is a request for comments on the specifics of the prototype, and then assuming this proposed standard would become "the one ultimate standard". Since I think both of my assumptions were incorrect, I don't really have much left to say - it's a bit worrying that we need to develop yet another representation of our data, that will definitely add to the overall confusion of things, but if you have established that it's required to serve a potential audience that we are not able to serve with existing standards - oh well, let's add one more :)

Thanks @Pchelolo for this comment ^ - I have read this task many times to try to get the initial idea and the motivation behind this new initiative, and I failed to get it. Your comments, and this particular comment, have helped me to understand it a bit better. I had exactly the same concerns you wrote about (as I understood the initial concept pretty much like you did).

I still have concerns about having yet another standard to consume data from though.

+1 that knowledge store is a misleading name - this seems to be more about a new view layer on top of the existing store(s), much like how the Analytics data lake or Page Content Services are view layers.

I'd be wary of relying on surface transformations on the data instead of making the real knowledge store(s) more structured and granular, in a way that editors and editing tools can integrate with, as I think the latter is a prerequisite for many of the transformations sought by the movement strategy. But I guess that's more a resourcing question than an engineering one; engineering-wise, those two types of changes don't hinder each other.

I have really allowed myself to be misled by the name "knowledge store"

I have too, as well as the phrasings of 'infinite product experiences' and 'future state architecture'. I think some of the friction around this project really comes from grand wordings clashing with what is really being proposed.

Rather than a product, I would say a platform, or an abstraction layer, needs a data model.

I also find this confusing, and it took me a while to understand. I currently understand "knowledge store" to mean a specific dataset transformed from (mostly, for now) MediaWiki data, with a nice GraphQL API on top. This sounds like an API product to me :)

I'd be wary of relying on surface transformations on the data instead of making the real knowledge store(s) more structured and granular, in a way that editors and editing tools can integrate with

+1 on this. I'm assuming that "products and platforms ... can ask for only the knowledge they want" means, in the context of a Wikipedia, consumers asking for knowledge below the page level, for example sections. A prototype that serves existing sections out of the context of a page is probably fine, but if it's something we want to commit to long term then we absolutely need to consider making sections a Real Thing inside MediaWiki.
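Purely as a hypothetical illustration of "asking for only the knowledge they want" - the query shape, field names and endpoint here are invented, not a proposed API:

```python
import json

# Hypothetical only: what a section-level request against a GraphQL-style API
# might look like. The schema below is invented for illustration.
query = """
{
  page(name: "Douglas Adams") {
    name
    date_modified
    section(name: "Environmental activism") {
      name
      text
    }
  }
}
"""

# The body a client would POST to such a (hypothetical) endpoint:
print(json.dumps({"query": query}))
```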

but if it's something we want to commit to long term then we absolutely need to consider making sections a Real Thing inside MediaWiki

Hm, I disagree. Sometimes it will be necessary to have 'materialized views' of data outside of the monolith, in separate places. Wikidata Query Service is a great example (and the basic event-driven architecture is similar to what is proposed for this Knowledge Store, IIUC). It runs a transformed view of Wikibase content to serve a different query model. Building everything in MediaWiki isn't going to scale (performance-wise and people-wise). We should make it easier for more products (that are transformed materialized views of MW data) to be built outside of MediaWiki.
(To clarify, I am not suggesting we should fully re-architect MediaWiki; only that new products should not necessarily be in MediaWiki.)
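As a minimal sketch of the pattern I mean - the event fields below are invented for illustration (real MEP schemas differ); it just shows consuming change events and maintaining a transformed view outside the monolith:

```python
# Hypothetical sketch of an event-driven materialized view kept outside MediaWiki.
# The event fields are invented for illustration; real MEP schemas differ.

view = {}  # the "materialized view": page title -> transformed record

def transform(event: dict) -> dict:
    """Deterministically map a change event to the view's record shape."""
    return {
        "name": event["page_title"],
        "version": event["rev_id"],
        "date_modified": event["rev_timestamp"],
    }

def handle_event(event: dict) -> None:
    """Upsert the transformed record whenever a change event arrives."""
    view[event["page_title"]] = transform(event)

# Replaying the event history rebuilds the same view, since the transform is deterministic.
events = [
    {"page_title": "Douglas Adams", "rev_id": 1001, "rev_timestamp": "2021-06-01T12:00:00Z"},
    {"page_title": "Douglas Adams", "rev_id": 1002, "rev_timestamp": "2021-06-02T08:30:00Z"},
]
for e in events:
    handle_event(e)

print(view["Douglas Adams"]["version"])  # 1002
```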

Anyway, I guess this ticket / DSO isn't so much about the Knowledge Store implementation, but about whether a new data model is a good idea?

but if it's something we want to commit to long term then we absolutely need to consider making sections a Real Thing inside MediaWiki

Hm, I disagree. Sometimes it will be necessary to have 'materialized views' of data outside of the monolith, in separate places. Wikidata Query Service is a great example (and the basic event-driven architecture is similar to what is proposed for this Knowledge Store, IIUC). It runs a transformed view of Wikibase content to serve a different query model. Building everything in MediaWiki isn't going to scale (performance-wise and people-wise). We should make it easier for more products (that are transformed materialized views of MW data) to be built outside of MediaWiki.

When I think about exposing derived data, I think the critical question to keep in mind is "where is the edit button"? In other words: Is there an easy way for the consumer of the derived data to discover a way to modify the things they are consuming? Allowing people to easily modify whatever they see is at the heart of the "wiki way".

I think it's crucial that the "back channel" is architected in. It cannot be an afterthought. It's an essential part of what makes a wiki a wiki.

In that sense, if we send paragraphs to consumers, mediawiki needs to offer a way to edit these paragraphs. They have to be "a real thing" somehow. Which they sort-of kind-of are, but not really.

Hm, I disagree. Sometimes it will be necessary to have 'materialized views' of data outside of the monolith, in separate places. Wikidata Query Service is a great example (and the basic event-driven architecture is similar to what is proposed for this Knowledge Store, IIUC). It runs a transformed view of Wikibase content to serve a different query model.

I don't see how we create a consistent-over-time 'materialized view' of an article section if there is no concept of an article section inside MW. A data transform might identify a section by its title, and the title changes, for example. Maybe that's a failure of imagination on my part though.

As you say, this isn't exactly what this ticket is about, but I think establishing the boundaries of what's possible with potential new data models is probably necessary.

In that sense, if we send paragraphs to consumers, mediawiki needs to offer a way to edit these paragraphs

WDQS GUI links back to the Wikidata item page, which can be edited. I believe the Knowledge Store is really just a GraphQL API, so I'm not sure where an 'edit button' would go, but I suppose if there was a UI it could link back to the wiki article?

I don't see how we create a consistent-over-time 'materialized view' of an article section if there is no concept of an article section inside MW

Why not? Whatever the view is, is up to the creator of the view. Anyone can consume content and transform it to look like something else. As long as the function doing the transformation is deterministic, that specific view will be 'consistent', no? (Especially if the view can always be regenerated from the history :) )

I mean consistent (or persistent) over time.

For example, take the section on Douglas Adams's environmental activism: https://en.wikipedia.org/wiki/Douglas_Adams#Environmental_activism ATM the only way we have of identifying sections is via their section title. If an editor edits that paragraph and changes the title to "Wildlife conservation" (and maybe also moves it within the article), how does the transformer know it's the same thing? If it just creates a new set of materialized views of sections for the article, what happens to consumers who expect the old ones to be there?
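To make the identity problem concrete, a toy sketch (assuming a naive title-keyed transform; a real transform would work on parsed HTML, not a dict like this):

```python
# Toy illustration of the section-identity problem: if the only key for a
# section is its title, a rename looks like a deletion plus an unrelated addition.

def sections_by_title(revision: dict) -> dict:
    """Naive 'materialized view': section title -> section text."""
    return {s["title"]: s["text"] for s in revision["sections"]}

old_rev = {"sections": [{"title": "Environmental activism", "text": "Adams campaigned for ..."}]}
new_rev = {"sections": [{"title": "Wildlife conservation", "text": "Adams campaigned for ..."}]}

old_view = sections_by_title(old_rev)
new_view = sections_by_title(new_rev)

# A consumer holding the old identifier now finds nothing, even though the
# content is essentially the same section under a new name:
print("Environmental activism" in new_view)                                    # False
print(old_view["Environmental activism"] == new_view["Wildlife conservation"])  # True
```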

Ah, I see. Indeed, that is a Q for the Knowledge Store folks.

There are a lot of really good questions in this thread that are more about the general knowledge store than they are specifically about the data model (so just pointing out potential scope creep of the ticket, specifically regarding the DSO), *but* I have some general points that I thought would be useful to raise in this context too.

Regarding the 'edit' functionality -- obviously, a wiki's power is the editing operation, but as is evident in this ticket, the questions about what that means if we are trying to represent the content in different ways (be it splitting into sections or anything else) lead to questions about what that even means for the way editing works at all. How would we empower users to edit the pieces that many "modern consumers" (as in, not-web-page) end up reading?

I don't think we know the answer to that yet, and we should probably not force a specific answer we *think* should happen just based on guesswork, since any impact on the editing process is a big impact that will disrupt the way we do things (even if the change to the actual edit process is minor, we need to make sure it fits our communities' workflows, tooling, etc).

So I have to say, I think the idea of starting from "a store that is primarily used for consumption" -- deferring the editing operation to the source (so, the store can display whatever "cuts" of data, and editing, for now, is done as-is right now in mediawiki) sounds like a good first step to understand how consumption WORKS so it can inform how editing can/should work.

This iteration costs us a lot less than a more massive (and honestly less known and predictable) change to MW, but can lead to great discoveries about how to approach this as a next step, and deliver good "views" / "entrypoints" into our content, allowing us easier experimentation with how best to consume it -- so it can inform what needs to (and can) be done to the way we edit.

Especially when dealing with the monolith of MediaWiki, this sounds like a good way to iteratively (and with much less risk) figure out how to break out of the monolith, rather than push things INTO the monolith when we aren't yet totally sure what they should look like.

I mean consistent (or persistent) over time.

For example, take the section on Douglas Adams's environmental activism: https://en.wikipedia.org/wiki/Douglas_Adams#Environmental_activism ATM the only way we have of identifying sections is via their section title. If an editor edits that paragraph and changes the title to "Wildlife conservation" (and maybe also moves it within the article), how does the transformer know it's the same thing? If it just creates a new set of materialized views of sections for the article, what happens to consumers who expect the old ones to be there?

I love this question mostly because these are exactly the questions we need to start asking in general about our content, and they will inform exactly the points you're raising about consistency of editing, etc.

But I also love this question because it's so philosophical about "what is a section" in general. So, if I go by your example, this reminds me of the Ship of Theseus philosophical question:

Is a section still the same section if you renamed it and moved it? How about if I didn't rename it, but I edited its content? How much of an edit to the section content would make it essentially a different section, even if it's still named the same? Would a couple of word replacements make it different? Probably not, *but* what if those word changes are enough to change the general meaning? What if I added or moved paragraphs and sentences inside it so that it now delivers a slightly different point?

etc etc etc.

My point here is that I think these are brilliant questions not only about "a knowledge store" -- but about what is our content, and how we are going to even consider the delivery *and editing* of the content we have if we want to try and display it as more than "just" one-big-web-page.

If nothing else, the fact we're talking about these questions is more important (and impactful, honestly!) than how, exactly, a knowledge store stores or whether the name is a store or a platform or a layer.

There's a bunch of tech out there that consumes our data in ways that outright take control away from us and our editors; either because they decide on their own where to "chop off" pieces (like Google's search sidebar) or because they do their own manipulation of the data (like some ML features to get Alexa to give you answers that are half-based on Wikipedia).

If we want to reclaim control and bring the control of the content BACK to our communities -- which is the power of Wikipedia -- then the questions of "what is a piece of knowledge", "how do others start consuming it" and "how do we retain control so we make sure it's up to our standards" are all great and valuable questions that we have to start asking.

And so, we have to start somewhere; both because it's not a bad idea to iterate on and test some of these unknowns (and having this as a separate system outside MW makes that less of a risk than pushing things INTO MW and having a lot of trouble changing them later), and in terms of finding out which blockers/questions/challenges come up when we start thinking about our content through different views, regardless of what technology is used.

My point here is that I think these are brilliant questions not only about "a knowledge store" -- but about what is our content, and how we are going to even consider the delivery *and editing* of the content we have if we want to try and display it as more than "just" one-big-web-page.

I definitely agree, this is the crux of the problem. To me there are two open questions:

  1. Are there abstract, philosophical if you will, answers to these questions?
  2. Is it feasible to implement these answers in our context, with our resources?

And so, we have to start somewhere; both because it's not a bad idea to iterate on and test some of these unknowns (and having this as a separate system outside MW makes that less of a risk than pushing things INTO MW and having a lot of trouble changing them later), and in terms of finding out which blockers/questions/challenges come up when we start thinking about our content through different views, regardless of what technology is used.

I think this is relevant work: https://meta.wikimedia.org/wiki/Research:WikiCredit. It's almost a totally separate context, but it's beginning to look at these hard questions of what is content? How do I track a specific piece of content over time, as it changes? That's what WikiCredit proposed to do, follow individual contributions over time and define in concrete terms when a piece of content "disappears".

Having looked at WikiCredit in detail, my intuition is that the answers to the above questions are both yes. But I think this is far from obvious and maybe we should dive into it more, together? We could do a little reading group for Aaron's work or start somewhere else if others have more relevant examples.

If the Knowledge Store is essentially a presentation layer on top of our current system, how do editors work through that layer back to our content? How about folks who donate, organize community events, administer the wikis, etc? These questions may all be the same. Basically, if we figure out a way to better understand our content, we can bring it to more people. And then we can bring more of those people back into our greater community.

But that's only if we successfully create a near-perfect automatic transformation from our content in its current shape to the Knowledge Store model. Any imperfection in that transformation will become work for us or our beloved volunteers. So I think we need to consider that carefully too, and not get too comfortable with the fact that the Knowledge store is just a layer organizing our content. Because once you connect that layer to consumers, it becomes someone's responsibility to keep it working. Historically, when we make a mistake with something like that, our community picks up the slack.

My point here is that I think these are brilliant questions not only about "a knowledge store" -- but about what is our content, and how we are going to even consider the delivery *and editing* of the content we have if we want to try and display it as more than "just" one-big-web-page.

I definitely agree, this is the crux of the problem. To me there are two open questions:

  1. Are there abstract, philosophical if you will, answers to these questions?

Maybe not directly and prematurely, but we can probably find answers to those as we experiment further; there are a whole bunch of questions around this specific philosophical question that are both less abstract *and* more immediate, which means we can iterate around them to find our way towards answering the bigger question.

I will make a point though that while that bigger philosophical question should be one to *drive us* -- it doesn't necessarily have to *have* a solution. The question itself depends on a lot of social behavior -- which in itself changes -- which might mean that there's never going to be an absolute actionable immediate answer, but rather a continuous direction to strive for and iterate on. In that case, designing for something that allows us to safely experiment, iterate, and have others utilize our data/content in novel and unexpected ways *is* the way to go. After all, it has been what we've been doing for 20 years and it led to a lot of unexpected and pretty brilliant innovations that power Wikipedia. We are now just talking about taking it further with tools that the modern internet (and modern tech in general) allows for.

  2. Is it feasible to implement these answers in our context, with our resources?

Following from the above -- why not? If we're looking to answer immediately and completely the absolute and encompassing question -- probably not. Then again, if we're looking to answer a lot of smaller but tangible (and super impactful!) questions that will help us go towards this more abstract direction, then yes. The question in my mind is more how do we explore the new ways that we *should* think about (re that philosophical question) without sacrificing the current (working!!) system while we explore.

And that question is a separate one we can discuss in terms of actual implementation, resourcing, direction, strategy, infrastructure, means of operation, general plan, etc etc.

What we lack right now, in my opinion, is a safe way to explore these at all. If we always go back to "we must do this inside MediaWiki" then the price -- technical and social -- is super high, because what you change inside MediaWiki touches everything. Experiments inside MediaWiki either tend to become permanent (which goes against the point of actually experimenting for the sake of figuring out the proper way to go), or risk damaging current behavior and touching a LOT of other systems/teams/etc, which becomes a huge exercise in politics and inter-team/inter-department operations.

Instead, we can iterate with systems that are tangential, like the knowledge store (and other ideas; it's not the only one). If we go with the idea of making a system that is easier to iterate and experiment with, because doing so doesn't mean rewriting/touching half of Wikipedia's delicate operation, we can start utilizing it for some user-facing products and actually measure how people use it, what their expectations are, what other systems use it, how things work in terms of the need to edit, etc.

Then iterating on top of that is significantly less risky, and brings us to a place where we can actually make progress in finding answers that are actionable.

A side bonus would be that we might actually figure out ways to decouple some of the internal behavior of MediaWiki (whatever components or services or literal PHP classes, etc), regardless of whether we change the way content is consumed, which can then also help us in the general quest we've been pursuing for many, many years of detangling the monolithic code anyway.

That's basically what I mean by "asking these questions is the important thing" -- more important than the specific implementation. The road gets us to a lot more benefit (if we allow it to) than if we sit down and decide on a solution.

And so, we have to start somewhere; both because it's not a bad idea to iterate on and test some of these unknowns (and having this as a separate system outside MW makes that less of a risk than pushing things INTO MW and having a lot of trouble changing them later), and in terms of finding out which blockers/questions/challenges come up when we start thinking about our content through different views, regardless of what technology is used.

I think this is relevant work: https://meta.wikimedia.org/wiki/Research:WikiCredit. It's almost a totally separate context, but it's beginning to look at these hard questions of what is content? How do I track a specific piece of content over time, as it changes? That's what WikiCredit proposed to do, follow individual contributions over time and define in concrete terms when a piece of content "disappears".

Having looked at WikiCredit in detail, my intuition is that the answers to the above questions are both yes. But I think this is far from obvious and maybe we should dive into it more, together? We could do a little reading group for Aaron's work or start somewhere else if others have more relevant examples.

I have to admit I haven't heard about this before, so I'll need to delve deeper, but the idea that there's a lot of research-related questions to this philosophical question doesn't surprise me at all. That, in my mind, emphasizes even *more* why we should be okay with making something that allows us to experiment to find solutions, rather than talk necessarily about what is the absolute solution.

If the Knowledge Store is essentially a presentation layer on top of our current system, how do editors work through that layer back to our content? How about folks who donate, organize community events, administer the wikis, etc? These questions may all be the same. Basically, if we figure out a way to better understand our content, we can bring it to more people. And then we can bring more of those people back into our greater community.

Valid questions, all, but I'd pose that in order to answer any of these editing/donation questions, we first need to see what it even means to deliver our content not as a full-article-web-page. If we get some sense into that, we can look at what it would mean for editing and donation, etc, to make sure editors have *actual* control over not only the content, but the content that consumers actually consume.

For that matter, btw, I'm not even talking about a theoretical impact -- we already have multiple systems out there that are delivering readers and consumers "broken up" content from our own, content we *do not* control. For example, Google's sidebar is what most readers are exposed to first, before the Wikipedia article, and it shows people only a piece of the introductory paragraph from our articles. This literally means the current editors have much less control and authority over how our content is consumed. Other examples are Siri, Google Home and Alexa. All of those read our content and choose which pieces to deliver to the reader/listener, outside of our control... etc.

So the idea here is that we're already in a universe where our content is not always being consumed as our editors are editing it -- we don't yet know what that would completely mean for how editing should be done if we want to make sure our editors HAVE that control back -- and what we can do, then, is look into at least regaining control on how to serve that information (if we experiment on what it means when we divide that data, connect it, etc, with our users' concerns in mind, which other external companies don't care about) -- and out of that, see what editing means.

Does this make sense? I'm agreeing with your general sentiments, I'm trying to make a point about immediacy (the problem is already happening, and will just get worse if we don't address it) and the need to be able to iterate towards a solution (which we are not doing very well right now in the movement because touching *anything* means, most times, touching *everything*)

But that's only if we successfully create a near-perfect automatic transformation from our content in its current shape to the Knowledge Store model. Any imperfection in that transformation will become work for us or our beloved volunteers. So I think we need to consider that carefully too, and not get too comfortable with the fact that the Knowledge store is just a layer organizing our content. Because once you connect that layer to consumers, it becomes someone's responsibility to keep it working. Historically, when we make a mistake with something like that, our community picks up the slack.

One could ask if the transformation will always be automatic. I don't know. Maybe not? Maybe we can let our users make distinct decisions about what information looks like when it's "cut" into bits, or whatever it ends up looking like. But how it's done depends on how the consumption of those bits happens, what those "bits" are, whether there are multiple options for those fragments/bits, etc etc etc. So... we need to start somewhere. Starting somewhere in order to test out what this would look like in terms of consumption, and in terms of actually allowing ourselves to experiment (again, not quite done right now with the way we work), sounds like one of the best ways to get somewhere. Otherwise, what would be the alternative? We need these questions answered, we need to start reclaiming control over manipulation of our content that is *already done* right now... if we forever fear the permanence of all experiments, we will never explore actual solutions here.

So... I go back to... we need to start somewhere, and iterate.

There are a lot of clarifications and explanations in the comments (thanks for those! ❤). I was wondering if the authors of the DSO could update the task description / DSO document with that updated information, and maybe even something like an FAQ could help future readers (or return readers, like me) of this task?

So... I go back to... we need to start somewhere, and iterate.

So (forgive me, there's a lot to parse here and I might have the wrong end of the stick) I think this ticket is asking the question

"Do we need a data model, standardized using a known, documented and expanding vocabulary, in order to begin serving 'only the knowledge they want' to products and platforms?"

Is that the specific question we're being asked here? If so I'd say "probably not", and would like to see us experimenting/exploring first, and building a data model when we have a better idea of the territory

(nitpick: can we stop using the word "infinite"?)

As one of the original Wikidata developers (2014–2018) I would like to, well, express my confusion.

  • In Wikidata we have something that's literally called Data Model. From this task's title alone it sounds like it aims to replace Wikidata with an API that extracts data from Wikipedia articles. Which by the way is what DBpedia does. I'm sure this is a misunderstanding. Picking a different title might help.
  • A lot of what's said in this proposal really sounds like it describes Wikidata.
  • Repeated slogans like "infinite possible product experiences" make me feel like I'm reading a sales pitch. What does it mean? As someone who writes software for a living I know there are infinite possibilities in the machine. What is so special about this proposal that we need to highlight this so much?
  • What does it mean when the proposal talks about "knowledge and information", even "insights"? As far as I'm concerned, all we can do with software is manage data. Not information. Not knowledge. Briefly explained: information forms when data is combined and put into a context. Knowledge is something that forms in my head, but can't be extracted or stored. All I can do is talk and write and hope that the stream of words I produce conveys enough information so the same knowledge forms in other people's heads. While this is more a philosophical question, I feel it's a really important one in the context of this proposal.
  • Wikidata does pretty much exactly what this proposal asks for: All data is in a predictable format. The format does not need to change all the time, but is flexible enough to adapt and "describe itself". Users can ask for the information they need. The fact that this isn't easy is not a failure of Wikidata, in my opinion, but unavoidable when dealing with data that is as complex as the world.
  • The proposal leaves me with the impression that it wants to do something that is even more complex than Wikidata, but make it much easier at the same time. Personally, I believe Wikidata would have achieved something like this in the 18 years it has existed if it were possible. Or to ask this as a question: how many resources do you think are needed to achieve what this proposal asks for?
  • Do the possible consumers of this service have a specific "industry-standard format" in mind?

Does this make sense? I'm agreeing with your general sentiments, I'm trying to make a point about immediacy

@Mooeypoo sorry I'm so late to answer, but yes, it makes sense, and I'm sorry if my comment read as a blocker to immediacy. I'm all for immediacy, and especially for experiments; I was just trying to add context I thought would be useful. Since I wrote those comments I've been pushing hard on the efforts that talk about getting data out of MW reliably to external services. Andrew's latest DSO is where some of that conversation is happening: T291120