
RFC: Content model version field to accompany content model
Closed, Declined · Public

Description

Our context is that I'm planning for content migrations in the new Jade namespace, and would like to have a version indicator embedded in or otherwise coupled to our JSON content. This would be helpful at migration time, but more importantly, seems absolutely indispensable for processing old revisions.

We could include a version field in the JSON content itself, but this is a bit of a hack and wouldn't work well for some content formats. My hastily-thought-through proposal is to add a version field to the content table, probably using semantic versioning, though it would also be acceptable to have a single integer which is only incremented for breaking changes.

If others support the idea, I'll go ahead and work the proposal up into something more concrete.

Problem statement

The content model for pages in the Jade namespace may evolve over time in backward-incompatible ways. When we do so, we need a way to introduce newer versions of models without breaking the rendering of existing content.

Event Timeline

I have thought about content model versioning myself in the past, primarily in the context of Wikibase. Wikibase has sidestepped this by auto-detecting the model version by looking at the JSON. Making it explicit would be nice, though.

On the other hand, adding a field to the content table is costly, and perhaps is not needed: the version could just be a suffix of the content model name. A ContentHandler would have to be registered for each version separately, but that may be an advantage rather than a disadvantage: each version gets its own ContentHandler instance, and extensions are free to decide whether these should be instances of the same class, or different classes.

All that is needed would be a convention for encoding the version. E.g. using jade/1 or jade#1.3 or whatever.
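To make the naming convention concrete, here is a minimal sketch in Python of splitting a versioned model name such as jade/1 into a base name and an integer version. The function name, the `ModelId` type, and the choice of "/" as separator are assumptions for illustration only; nothing like this exists in MediaWiki core, and the thread does not settle on a separator.

```python
from typing import NamedTuple, Optional


class ModelId(NamedTuple):
    base: str               # e.g. "jade"
    version: Optional[int]  # None when the name carries no version suffix


def parse_model_id(name: str, sep: str = "/") -> ModelId:
    """Split 'jade/1' into ('jade', 1); leave unversioned names untouched."""
    base, found, suffix = name.rpartition(sep)
    if found and base and suffix.isdecimal():
        return ModelId(base, int(suffix))
    # No separator, or a non-numeric suffix: treat the whole name as the base.
    return ModelId(name, None)


print(parse_model_id("jade/1"))    # ModelId(base='jade', version=1)
print(parse_model_id("wikitext"))  # ModelId(base='wikitext', version=None)
```

A dotted version such as jade#1.3 would need a different suffix parser; this sketch only covers the single-integer variant.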

On the other hand, adding a field to the content table is costly, and perhaps is not needed: the version could just be a suffix of the content model name.
[...]
All that is needed would be a convention for encoding the version. E.g. using jade/1 or jade#1.3 or whatever.

+1 to what @daniel said, particularly this. I note the following constraints:

  • Pre-MCR columns limit the name to 32 bytes. Eventually they'll be removed from the database, but they aren't yet.
  • The content_models name table in MCR limits the name to 64 bytes.
  • The ID field on content_models being smallint limits us to 32767 models total.
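As a rough illustration of the first two constraints, the following Python sketch checks whether a candidate model name fits the 32-byte and 64-byte limits mentioned above. The function and constant names are hypothetical, and measuring the length in UTF-8 bytes is an assumption about how the columns store the name.

```python
LEGACY_LIMIT_BYTES = 32  # pre-MCR columns, as stated in the list above
MCR_LIMIT_BYTES = 64     # content_models name table, as stated in the list above


def fits_name_limits(model_name: str) -> dict:
    """Report the encoded size of a model name and which limits it satisfies."""
    size = len(model_name.encode("utf-8"))  # assumption: limits count UTF-8 bytes
    return {
        "bytes": size,
        "legacy_ok": size <= LEGACY_LIMIT_BYTES,
        "mcr_ok": size <= MCR_LIMIT_BYTES,
    }


print(fits_name_limits("jade/1"))
# {'bytes': 6, 'legacy_ok': True, 'mcr_ok': True}
```

A short suffix like /1 costs only a few bytes, so the name limits matter mainly for models whose base name is already long.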

OTOH, if you are using JSON or some other structured representation where a version number is easily added, then just having the version number in there seems more straightforward.
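For comparison, a minimal Python sketch of the embedded-version approach. The field name `schema_version` and the default of 1 for content that predates the field are assumptions made for this example, not part of any Jade schema.

```python
import json


def read_version(serialized: str, default: int = 1) -> int:
    """Return the version recorded inside a JSON document.

    Assumes the content is a JSON object with an optional top-level
    'schema_version' field (a hypothetical name).
    """
    data = json.loads(serialized)
    return int(data.get("schema_version", default))


print(read_version('{"schema_version": 2, "judgments": []}'))  # 2
print(read_version('{"judgments": []}'))                       # 1
```

The trade-off is that the content has to be loaded and parsed before the version is known.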

OTOH, if you are using JSON or some other structured representation where a version number is easily added, then just having the version number in there seems more straightforward.

I guess the question is, do you need the version in situations where you don't want to retrieve the content (e.g. history view)? If not, I'd also use a JSON field for it.

I guess the question is, do you need the version in situations where you don't want to retrieve the content (e.g. history view)?

Probably not - the version is mainly useful for interpreting the serialized content. I see no other use for it.

If not, I'd also use a JSON field for it.

But what if you don't have JSON? What if we want to version wikitext, for instance? I think that's not only reasonable, it's actually a very tempting use for this.

I'm in favor of having the version in the meta-data, not the content itself. I see no reason to have it in a separate field though, and I think nearly all code should just treat it as a completely separate model. Probably all code except the content handler itself. Not conflating different versions of a model is really a good reason not to add a version field, but to include the version in the model name.

Having the version in the content type means you cannot do a quiet upgrade on edit, you need to change the content type (which normal users do not have the rights for). For something like wikitext where the version change is very obvious to users that's probably for the best. For something where users only interact with the content via some kind of custom GUI and versions are mostly incomprehensible to them, it might be less ideal.

Having the version in the content type means you cannot do a quiet upgrade on edit, you need to change the content type (which normal users do not have the rights for).

If the editing component (EditPage? API module?) just produces a Content object with the new model, this should just work. Model changes are not prevented by the storage layer.

For something where users only interact with the content via some kind of custom GUI and versions are mostly incomprehensible to them, it might be less ideal.

If the user doesn't see the raw data, the user doesn't care at all about the content model, and it can just change in the background.

Thanks for all the helpful thoughts! For JADE, we're okay with a version field in the JSON content; it seems helpful in case the raw content gets separated from its metadata. Although it feels polite to keep the version in metadata to allow for content formats that we don't control, I'd like to see a theoretical use case first. For something like (bad example) CSS2 vs CSS3, I agree with Daniel that the version number is appropriate to include in the content type title component.

I'll outline one possible migration timeline, to see the implications for various versioning proposals.

  • Send out deprecation announcement, explaining the v2 format and explaining what happens on date A and date B, below.
  • On date A, the v2 format is valid and consumers must be ready to parse new data.
  • We begin migrating data in batches using a maintenance script. Minor errors are corrected; for severe errors we consider rolling back the migrated data.
  • During this transitional period, a client should be allowed to write in either the v1 or v2 format.
  • On date B, we'll discontinue any support for the v1 format, and writes must be in the v2 format.

There are a few difficult points above:

  • How does the client specify whether it's writing in the old or new format? That's not part of the ApiEdit spec. It seems that newly created pages would be assumed to have the new format, but editing an old-format page you would want to keep the same version of data. This is trivial if the version field is included in the raw data.
  • Rollback is difficult if the schema version has changed. What if there were new edits in v2 but we need to roll back to v1? This could be a blindly destructive rollback, e.g. for emergency un-deployment of v2, or a potentially lossy reverse migration to convert v2 into v1 data. After the migration date B, rollbacks are not allowed to use format v1… None of our proposals make these issues any easier, although it's slightly simpler to roll back if the version is embedded in the data, since nothing needs to be changed in the metadata.
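The write rules implied by the timeline can be sketched as a small Python function. This is only one reading of the timeline: the function name and parameters are hypothetical, and the rule that edits to an existing page keep that page's version during the transition is taken from the first bullet above rather than from any agreed design.

```python
from datetime import date
from typing import Optional


def choose_write_version(
    existing: Optional[int], today: date, date_a: date, date_b: date
) -> int:
    """Pick the format version for a write.

    existing is the version of the page's current revision, or None
    when the page is being created.
    """
    if today < date_a:
        return 1          # v2 is not valid yet
    if today >= date_b:
        return 2          # v1 writes are no longer supported
    if existing is None:
        return 2          # new pages are assumed to use the new format
    return existing       # transitional period: keep the page's version


A, B = date(2030, 1, 1), date(2030, 7, 1)  # placeholder dates
print(choose_write_version(1, date(2030, 3, 1), A, B))     # 1
print(choose_write_version(None, date(2030, 3, 1), A, B))  # 2
```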

More observations:

  • v1 should never be used after date B. Machinery to read and write can be removed from all code except for special history-processing use cases, like dump analysis for research.
  • Even after date B, it's useful to keep version information embedded in either data or metadata, so that dump processing can associate content with its schema version.
  • It occurs to me that we might want to include a generic MIME type such as application/json to our content handler type, as a clue to e.g. clients that want to render the data but don't support our specific and evolving schema.
  • It would also be nice to include a JSON schema URL like https://phabricator.wikimedia.org/diffusion/EJAD/browse/master/jsonschema/judgment/v1.json?view=raw as normalized metadata on the v1 version of content type—although that would also be appropriate to include in a $schema field embedded in the content data, just bloaty.

@awight You are assuming that old revisions get ported to a new format. At least so far, that's a no-go. The content of old revisions is considered read-only, messing with it is Not Done (tm). The storage infrastructure is append-only.

So far, this means that all consumers have to support the old formats forever. Maybe that could be circumvented by allowing content to be converted on load, but that would mean that the Content object returned would have a different model than what is recorded in the content table. This should probably mostly work...

As to specifying the version on write: ApiEditPage has a contentmodel parameter. If the version is part of the model's name, it can be specified there.

@awight You are assuming that old revisions get ported to a new format. At least so far, that's a no-go. The content of old revisions is considered read-only, messing with it is Not Done (tm). The storage infrastructure is append-only.

That makes sense. What I mean is different, though: every page's current revision will have v2 content by date B, because a batch migration process will slowly upgrade pages until complete.

So far, this means that all consumers have to support the old formats forever. Maybe that could be circumvented by allowing content to be converted on load, but that would mean that the Content object returned would have a different model than what is recorded in the content table. This should probably mostly work...

I'm not sure that consumers need to support the old formats. Consumers that read old history will have to support old formats forever, true. But consumers that only read and write current data could drop support for old formats, assuming the difficulties I mentioned earlier are resolved.

Also, as a courtesy we could provide custom dumps where all old revisions are migrated to the new format, although that could run into validation problems e.g. a new, required field.

As to specifying the version on write: ApiEditPage has a contentmodel parameter. If the version is part of the model's name, it can be specified there.

Great, thanks for noting!

Having the version in the content type means you cannot do a quiet upgrade on edit, you need to change the content type (which normal users do not have the rights for).

If the editing component (EditPage? API module?) just produces a Content object with the new model, this should just work. Model changes are not prevented by the storage layer.

The problem is that EditPage would see an incoming Content object for model "foobar/2" while the page is currently "foobar/1". It'd have to know that this change is OK to allow without the changecontentmodels userright, while changing "wikitext/1" to "wikitext/2" probably isn't.

We probably could make it happen if we really wanted to, perhaps by having the permissions-checker actually load the content from the page and see if it got auto-upgraded to "foobar/2" or perhaps by letting the ContentHandler specify which upgrades are allowed. But it might be better to avoid the added complexity in the permission check.

I'm not sure that consumers need to support the old formats. Consumers that read old history will have to support old formats forever, true. But consumers that only read and write current data could drop support for old formats, assuming the difficulties I mentioned earlier are resolved.

Right. Most importantly, our own code has to support both formats indefinitely, to be able to display, diff, restore, and undelete old revisions.

Also, as a courtesy we could provide custom dumps where all old revisions are migrated to the new format, although that could run into validation problems e.g. a new, required field.

There is a hook that allows re-serialization for dumps. So no extra dumps would strictly speaking be required.

Note however that this means that restoring from dumps would restore to equivalent content, not the same content. Probably not a problem in practice, but technically a change in semantics.

The problem is that EditPage would see an incoming Content object for model "foobar/2" while the page is currently "foobar/1". It'd have to know that this change is OK to allow without the changecontentmodels userright, while changing "wikitext/1" to "wikitext/2" probably isn't.

We probably could make it happen if we really wanted to, perhaps by having the permissions-checker actually load the content from the page and see if it got auto-upgraded to "foobar/2" or perhaps by letting the ContentHandler specify which upgrades are allowed. But it might be better to avoid the added complexity in the permission check.

This is the kind of thing for which it would make sense to have some notion of model versions baked into core. This would make it easy to check whether two models are versions of the same base model, and thus conversion needs no special rights.
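A minimal Python sketch of such a check, assuming the version is encoded as a "/" suffix of the model name. The function names are hypothetical and this is not how MediaWiki's permission checks work today; whether a same-base change should always be exempt from the special right is exactly the policy question left open in this discussion.

```python
def base_of(model_id: str) -> str:
    """Return the part of the model ID before the first '/' (assumed convention)."""
    return model_id.split("/", 1)[0]


def needs_model_change_right(current: str, incoming: str) -> bool:
    """True when the edit changes to a model with a different base."""
    if current == incoming:
        return False
    return base_of(current) != base_of(incoming)


print(needs_model_change_right("foobar/1", "foobar/2"))  # False
print(needs_model_change_right("foobar/1", "wikitext"))  # True
```

Under this rule a change from wikitext/1 to wikitext/2 would also pass without the right, so a real implementation would likely need the handler to opt in per model.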

That semantics makes sense to me, but it's indeed one of the things that would need more discussion in this RFC, to cover all the edge cases, e.g.: is it always safe to assume that conversion to a newer version of the same model needs no special permissions? How about conversion to an older format, perhaps when doing a rollback? How about a "full" undo to a previous revision (=restoring the revision)?...

I agree with Daniel that introducing a new top-level primitive for "content model version" seems undesirable as it creates additional expectations and things that can go wrong in our system without reaping benefits in turn for that added cost.

If we do want the same content "purpose" to have multiple variants of its content model, I think the content model ID is the appropriate place to do that. The existing system fully covers all the use cases for that:

  • Be able to register a different content handler for each unique content model ID.
  • Be able to internally re-use code between content handlers, through subclasses, or even by literally assigning the same handler twice if their variable purpose can be detected by other means. Wikidata already does this to some extent through its various Wikibase-related content models.
  • Make sure that things fail if an unknown content model ID is encountered.
  • Make sure that things can't accidentally invoke an unrelated content handler for a different content model ID.

All points are addressed by using the content model ID. The only thing we'd need is some (social) convention for naming of related content models for the case where one is newer than the other (from a developer perspective). Something like foo/2 seems adequate, though I don't have any preference here.
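The four properties can be illustrated with a plain lookup table keyed by the full content model ID. This Python sketch is a simplified stand-in, not MediaWiki's actual handler registration; the class and function names are hypothetical.

```python
class JadeHandler:
    """One class reused for several model IDs; the version selects behaviour."""

    def __init__(self, version: int) -> None:
        self.version = version


# Each full model ID maps to exactly one handler instance.
HANDLERS = {
    "jade/1": JadeHandler(1),
    "jade/2": JadeHandler(2),
}


def handler_for(model_id: str) -> JadeHandler:
    """Look up a handler by exact ID and fail loudly on unknown IDs."""
    try:
        return HANDLERS[model_id]
    except KeyError:
        raise LookupError(f"unknown content model: {model_id}") from None


print(handler_for("jade/2").version)  # 2
```

Because the lookup is by exact ID, jade/1 content can never reach the jade/2 handler by accident, and an unregistered ID such as jade/3 raises instead of falling back silently.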

The last two points in particular are at risk if we introduce a new primitive given there'd be a way to use the "wrong" version. I think this can and should be treated similarly to MIME types. Any relation between types is knowledge that should be contained in the handler layer. The system doesn't and shouldn't need to know about that.

As for migrating older content, I too am concerned about this. There are some (minor) technical reasons against doing this, but overall I don't think there's a technical reason for why we can't. We're just struggling with the why.

I don't think there will ever be a case where we drop support for handling content models that once existed in production, which means we're not going to save on the number of classes we have, nor the size of model>handler mapping. Think about content dumps, exports, and imports, and other archival use cases.

As such, it circles me back to the "Why". What do we accomplish by retroactively changing older content? Given revisions are immutable, they would never be able to utilize newer content model features. They'd only be able to map from whatever already existed in the old schema, to a different way of expressing the same thing. As such, that seems like something that would be more appropriate for a UI layer or API layer, not something at the storage level.

For example, if the viewing of the Jade page gains a new feature or changes how something is visualised, that doesn't need to happen in lockstep with the storage. The view can visualise the data however it wants to. There would be an interpreter for both the old and new content model, both producing the same. Perhaps the latest would always be simple if it is mostly aligned with the visualisation, and the older ones would massage the data as needed. The same for any API module.
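As a sketch of this idea in Python: one interpreter per stored version, all producing the same view structure, with the older one massaging its data into the newer shape. The two data shapes below are invented purely for illustration and do not reflect the real Jade schema.

```python
def view_v2(data: dict) -> dict:
    """The latest interpreter stays simple: v2 is close to the view."""
    return {"score": data["judgment"]["score"]}


def view_v1(data: dict) -> dict:
    """The older interpreter reshapes v1 data and delegates to v2."""
    return view_v2({"judgment": {"score": data["score"]}})


INTERPRETERS = {1: view_v1, 2: view_v2}


def render(version: int, data: dict) -> dict:
    """Produce the common view from stored data of either version."""
    return INTERPRETERS[version](data)


old = render(1, {"score": 0.9})
new = render(2, {"judgment": {"score": 0.9}})
print(old == new)  # True
```

Stored revisions are never rewritten here; only the read path knows about both shapes.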

I'll also note that to a first approximation, Jade isn't in production yet, so a conversation about introducing a completely new system in MediaWiki for maintaining and migrating multiple versions of content models seems a bit premature. Given the number of things on-going, and the absence of anything blocked on this, with various viable workarounds for short-mid term, I'd recommend moving this back a few quarters at least.

I don't think there will ever be a case where we drop support for handling content models that once existed in production, which means we're not going to save on the number of classes we have, nor the size of model>handler mapping. Think about content dumps, exports, and imports, and other archival use cases.

We might have, if for example MediaWiki-extensions-EducationProgram had had a content model (see T125618: Deprecate and remove the EducationProgram extension from Wikimedia servers after June 30, 2018).

But I agree that's not much of a reason to go rewriting old revisions.

I don't think there will ever be a case where we drop support for handling content models that once existed in production, which means we're not going to save on the number of classes we have, nor the size of model>handler mapping. Think about content dumps, exports, and imports, and other archival use cases.

We might have, if for example MediaWiki-extensions-EducationProgram had had a content model (see T125618: Deprecate and remove the EducationProgram extension from Wikimedia servers after June 30, 2018).

But I agree that's not much of a reason to go rewriting old revisions.

It would be pretty easy to create a generic dummy ContentHandler for text-based content that just shows the raw text, and doesn't allow editing. A dummy handler for non-textual content would also be possible, but it wouldn't be able to show anything useful to the user. It would still allow revisions to be "viewed" (and diffed and deleted and undeleted, etc.), but all the user would see would be a message saying that this kind of content is no longer supported.
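A rough Python sketch of the two fallback handlers described here. The class names, the `editable` flag, and the `render` method are hypothetical and only mirror the behaviour in the paragraph above; they are not the ContentHandler interface.

```python
class FallbackTextHandler:
    """Read-only handler for retired text-based models: show the raw text."""

    editable = False

    def render(self, raw: str) -> str:
        return raw


class FallbackOpaqueHandler:
    """Read-only handler for retired non-textual models: show a notice only."""

    editable = False

    def render(self, raw: bytes) -> str:
        return "This kind of content is no longer supported."


print(FallbackTextHandler().render('{"score": 0.9}'))
print(FallbackOpaqueHandler().render(b"\x00\x01"))
```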

Krinkle renamed this task from Content model version field to accompany content model to RFC: Content model version field to accompany content model. Apr 4 2020, 2:34 AM
Aklapper triaged this task as Lowest priority. May 23 2021, 10:54 PM