Introduce a ContentStore service to allow certain types of content to not be stored as serialized blobs.
Open, Low, Public

Description

Currently, RevisionStore directly uses a BlobStore to store slot content, calling Content::serialize() to turn a Content object into a blob.

There is no reason to hard code this, however. Some kinds of Content may better be stored in a different way, e.g. using a dedicated SQL schema, or in a column store. To allow this, RevisionStore should not use a BlobStore directly, but rather use a ContentStore, which may or may not be implemented based on a BlobStore.

The ContentStore interface would be very similar to the BlobStore interface, but would take Content objects instead of raw data blobs:

	public function getContent( $contentAddress, $queryFlags = 0 ): Content;

	public function storeContent( Content $content, $hints = [] ): string;

The default implementation would be based on BlobStore, Content::serialize, and ContentHandler::unserializeContent. To allow other storage mechanisms to be applied, a wrapper could be used that delegates to the correct ContentStore based on the content model (when writing) and address prefix (when reading).
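The arrangement described above could be sketched roughly as follows. This is only an illustration under stated assumptions, not the experimental patch: `Content` and `BlobStore` are reduced stand-ins for the MediaWiki interfaces, `BlobContentStore`, `DispatchingContentStore` and `register()` are hypothetical names, the unserializer is passed in as a callable in place of ContentHandler::unserializeContent, and the address prefix is assumed to be everything before the first colon.

```php
<?php
declare( strict_types=1 );

// Hypothetical stand-ins for MediaWiki's Content and BlobStore, reduced to what the sketch needs.
interface Content {
	public function getModel(): string;
	public function serialize(): string;
}

interface BlobStore {
	public function getBlob( string $blobAddress, int $queryFlags = 0 ): string;
	public function storeBlob( string $data, array $hints = [] ): string;
}

// The interface proposed in the task description, with scalar types added for the sketch.
interface ContentStore {
	public function getContent( string $contentAddress, int $queryFlags = 0 ): Content;
	public function storeContent( Content $content, array $hints = [] ): string;
}

// Default implementation: serialize to a blob on write, unserialize on read.
final class BlobContentStore implements ContentStore {
	private BlobStore $blobStore;
	/** @var callable(string): Content assumed stand-in for ContentHandler::unserializeContent */
	private $unserialize;

	public function __construct( BlobStore $blobStore, callable $unserialize ) {
		$this->blobStore = $blobStore;
		$this->unserialize = $unserialize;
	}

	public function getContent( string $contentAddress, int $queryFlags = 0 ): Content {
		return ( $this->unserialize )( $this->blobStore->getBlob( $contentAddress, $queryFlags ) );
	}

	public function storeContent( Content $content, array $hints = [] ): string {
		return $this->blobStore->storeBlob( $content->serialize(), $hints );
	}
}

// Wrapper: picks a store by content model when writing and by address prefix when reading.
final class DispatchingContentStore implements ContentStore {
	private ContentStore $default;
	/** @var ContentStore[] keyed by content model */
	private array $byModel = [];
	/** @var ContentStore[] keyed by address prefix */
	private array $byPrefix = [];

	public function __construct( ContentStore $default ) {
		$this->default = $default;
	}

	public function register( string $model, string $prefix, ContentStore $store ): void {
		$this->byModel[$model] = $store;
		$this->byPrefix[$prefix] = $store;
	}

	public function getContent( string $contentAddress, int $queryFlags = 0 ): Content {
		// Assumption: the prefix is everything before the first ':' (e.g. "tt" in "tt:123").
		$prefix = strstr( $contentAddress, ':', true );
		$store = $prefix !== false && isset( $this->byPrefix[$prefix] )
			? $this->byPrefix[$prefix]
			: $this->default;
		return $store->getContent( $contentAddress, $queryFlags );
	}

	public function storeContent( Content $content, array $hints = [] ): string {
		$store = $this->byModel[$content->getModel()] ?? $this->default;
		return $store->storeContent( $content, $hints );
	}
}
```

Anything not registered falls through to the default store, so existing blob addresses would keep working unchanged in this sketch.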

Event Timeline

Pinging some folks who have in the past expressed interest in this idea: @Fjalapeno @WMDE-leszek @Halfak.

daniel updated the task description.

It should probably have a mass fetch/store interface too, to avoid the problems every similar service has predictably run into.

> It should probably have a mass fetch/store interface too, to avoid the problems every similar service has predictably run into.

True. Though for the blob store, we have so far not encountered a use case for mass write. Batch reads are needed by some maintenance scripts.
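For batch reads through a dispatching store, one possible building block is to split the requested addresses by prefix, so that each backend receives a single batch call. The helper below is a hypothetical sketch: `groupAddressesByPrefix` is not an existing function, and the prefix is assumed to be the part of the address before the first colon.

```php
<?php
declare( strict_types=1 );

/**
 * Group content addresses by prefix (the part before the first ':').
 * Addresses without a prefix are collected under the empty string.
 * Hypothetical helper, not part of MediaWiki.
 *
 * @param string[] $addresses
 * @return array<string,string[]>
 */
function groupAddressesByPrefix( array $addresses ): array {
	$groups = [];
	foreach ( $addresses as $address ) {
		// strstr() returns false when the address contains no ':'.
		$prefix = strstr( $address, ':', true );
		$groups[ $prefix === false ? '' : $prefix ][] = $address;
	}
	return $groups;
}
```

Each group could then be handed to the store registered for that prefix, with the results merged back into one map keyed by address.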

Experimental patch (old, linking in the hope it might be useful for discussion): https://gerrit.wikimedia.org/r/c/mediawiki/core/+/631180

> To allow other storage mechanisms to be applied, a wrapper could be used that delegates to the correct ContentStore based on the content model (when writing) and address prefix (when reading).

How would this look in practice? At first I assumed that it could make sense to have each ContentHandler know how to store its corresponding Content (i.e. choosing a ContentStore implementation), so it would have a method that returns either a ContentStore instance, or a class name. But then I wondered if ContentHandler is the right place for that: ContentStore already depends on ContentHandler, plus ContentHandler is already pretty big and might as well be broken down into separate interfaces. Also, we need to handle custom prefixes for read operations.

An alternative could be a ContentStoreRegistry, where custom stores could be registered similar to custom slots, i.e.

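$services->addServiceManipulator(
	'ContentStoreRegistry',
	// The manipulator receives the service being set up and the service container.
	function ( ContentStoreRegistry $registry, MediaWikiServices $services ) {
		$registry->registerStore(
			CONTENT_MODEL_MYCONTENT,
			MY_CONTENT_BLOB_PREFIX,
			// 'MyContentContentStore' is a hypothetical service name.
			$services->getService( 'MyContentContentStore' )
		);
	}
);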

>> To allow other storage mechanisms to be applied, a wrapper could be used that delegates to the correct ContentStore based on the content model (when writing) and address prefix (when reading).

> How would this look in practice? At first I assumed that it could make sense to have each ContentHandler know how to store its corresponding Content (i.e. choosing a ContentStore implementation), so it would have a method that returns either a ContentStore instance, or a class name. But then I wondered if ContentHandler is the right place for that: ContentStore already depends on ContentHandler, plus ContentHandler is already pretty big and might as well be broken down into separate interfaces. Also, we need to handle custom prefixes for read operations.

My intention was to have a DispatchingContentStore which can delegate to different implementations based on the content model, or based on the role hint.

> My intention was to have a DispatchingContentStore which can delegate to different implementations based on the content model, or based on the role hint.

That would make sense, but how would the DispatchingContentStore know how to associate a model (or prefix) to a specific ContentStore implementation? Which is equivalent to asking: how would an extension register another ContentStore? I assume that the DispatchingContentStore would either act as a registry itself (and perhaps run some hooks to allow customizations), or have a registry injected (and then such a registry would be the one I mentioned in T209044#7707580).

> That would make sense, but how would the DispatchingContentStore know how to associate a model (or prefix) to a specific ContentStore implementation? Which is equivalent to asking: how would an extension register another ContentStore?

I'd suggest an extension.json attribute like we do for REST routes. Alternatively, a factory hook could be used. Either way, I suppose we'd cache a ContentStore instance per combination of slot and model.

(I originally thought "service manipulators" were a good way to do that, but I now think that was a mistake.)

> I assume that the DispatchingContentStore would either act as a registry itself (and perhaps run some hooks to allow customizations), or have a registry injected (and then such a registry would be the one I mentioned in T209044#7707580).

Correct.
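The caching per combination of slot and model mentioned above might look something like the sketch below. Everything here is an assumption for illustration: the class name reuses `ContentStoreRegistry` from the earlier comment, `getStore()` is a hypothetical method, and the per-model factory callables stand in for whatever an extension.json attribute or a factory hook would supply.

```php
<?php
declare( strict_types=1 );

// Hypothetical registry: creates stores lazily and caches one per (slot role, content model) pair.
final class ContentStoreRegistry {
	/** @var array<string,callable> factories keyed by content model */
	private array $factories;
	/** @var callable fallback factory, e.g. for a blob-backed store */
	private $defaultFactory;
	/** @var array<string,array<string,object>> cached stores, indexed by role and then model */
	private array $instances = [];

	public function __construct( array $factories, callable $defaultFactory ) {
		$this->factories = $factories;
		$this->defaultFactory = $defaultFactory;
	}

	public function getStore( string $role, string $model ): object {
		if ( !isset( $this->instances[$role][$model] ) ) {
			// Models without a registered factory fall back to the default.
			$factory = $this->factories[$model] ?? $this->defaultFactory;
			$this->instances[$role][$model] = $factory( $role, $model );
		}
		return $this->instances[$role][$model];
	}
}
```

A DispatchingContentStore could either hold such a registry or implement the same lookup itself; the sketch does not take a position on which.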

>> That would make sense, but how would the DispatchingContentStore know how to associate a model (or prefix) to a specific ContentStore implementation? Which is equivalent to asking: how would an extension register another ContentStore?

> I'd suggest an extension.json attribute like we do for REST routes.

That seems reasonable. One thing I don't like about extension.json (in general) is that you can't use PHP constants, which is probably fine-ish if it's just for the content model, a bit less so if it's also for the prefix. I was also thinking about a more long-term idea of having an extension.json entry for defining new content types that would allow you to register ContentHandlers, ContentStores and any other thing we may want to add in the future. Something like:

"Contents": {
    "MyContent": {
        "handler": "MyContentHandler",
        "store": "MyContentStore",
        "blobprefix": "my-prefix"
    }
}

but this really is beyond the scope of this task.

> I was also thinking about a more long-term idea of having an extension.json entry for defining new content types that would allow you to register ContentHandlers, ContentStores and any other thing we may want to add in the future.

That sounds quite nice!

I've created T301891 about the generic extension.json proposal. In the meantime, I'd like to try and finish the experimental ContentStore patch.

Change 631180 had a related patch set uploaded (by Daniel Kinzler; author: Daniel Kinzler):

[mediawiki/core@master] EXPERIMENTAL: ContentStore

https://gerrit.wikimedia.org/r/631180