Maniphest T209044

Introduce a ContentStore service to allow certain types of content to not be stored as serialized blobs.
Open, Low, Public

Assigned To

None

Authored By

	daniel
	Nov 8 2018, 11:45 AM

Description

Currently, RevisionStore directly uses a BlobStore to store slot content, calling Content::serialize() to turn a Content object into a blob.

There is no reason to hard code this, however. Some kinds of Content may better be stored in a different way, e.g. using a dedicated SQL schema, or in a column store. To allow this, RevisionStore should not use a BlobStore directly, but rather use a ContentStore, which may or may not be implemented based on a BlobStore.

The ContentStore interface would be very similar to the BlobStore interface, but would take Content objects instead of raw data blobs:

	public function getContent( $contentAddress, $queryFlags = 0 ): Content;

	public function storeContent( Content $content, $hints = [] ): string;

The default implementation would be based on BlobStore, Content::serialize, and ContentHandler::unserializeContent. To allow other storage mechanisms to be applied, a wrapper could be used that delegates to the correct ContentStore based on the content model (when writing) and address prefix (when reading).
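A minimal sketch of the BlobStore-based default, under stated assumptions. `SketchContent`, `SketchBlobStore`, `SketchContentStore`, `InMemoryBlobStore` and `BlobContentStore` are hypothetical, heavily reduced stand-ins rather than MediaWiki classes. Storing the content model in front of the serialized data is also an assumption made only to keep the example self-contained. The unserializer callbacks play the role that ContentHandler::unserializeContent would play in core.

```php
<?php
// Hypothetical, reduced stand-ins for MediaWiki's Content and BlobStore.
interface SketchContent {
	public function getModel(): string;
	public function serialize(): string;
}

interface SketchBlobStore {
	public function getBlob( string $blobAddress, int $queryFlags = 0 ): string;
	public function storeBlob( string $data, array $hints = [] ): string;
}

// The interface proposed in the description, with type hints added.
interface SketchContentStore {
	public function getContent( string $contentAddress, int $queryFlags = 0 ): SketchContent;
	public function storeContent( SketchContent $content, array $hints = [] ): string;
}

final class SketchTextContent implements SketchContent {
	private $text;
	public function __construct( string $text ) {
		$this->text = $text;
	}
	public function getModel(): string {
		return 'text';
	}
	public function serialize(): string {
		return $this->text;
	}
}

// Toy blob store that keeps blobs in an array and hands out "mem:<n>" addresses.
final class InMemoryBlobStore implements SketchBlobStore {
	private $blobs = [];
	public function getBlob( string $blobAddress, int $queryFlags = 0 ): string {
		if ( !isset( $this->blobs[$blobAddress] ) ) {
			throw new RuntimeException( "No such blob: $blobAddress" );
		}
		return $this->blobs[$blobAddress];
	}
	public function storeBlob( string $data, array $hints = [] ): string {
		$address = 'mem:' . count( $this->blobs );
		$this->blobs[$address] = $data;
		return $address;
	}
}

final class BlobContentStore implements SketchContentStore {
	private $blobStore;
	private $unserializers;

	/** @param callable[] $unserializers Map of content model => function ( string ): SketchContent */
	public function __construct( SketchBlobStore $blobStore, array $unserializers ) {
		$this->blobStore = $blobStore;
		$this->unserializers = $unserializers;
	}

	public function getContent( string $contentAddress, int $queryFlags = 0 ): SketchContent {
		$blob = $this->blobStore->getBlob( $contentAddress, $queryFlags );
		// Assumption: the blob is "<model>\n<serialized data>", as written by storeContent().
		[ $model, $data ] = explode( "\n", $blob, 2 );
		if ( !isset( $this->unserializers[$model] ) ) {
			throw new RuntimeException( "Unknown content model: $model" );
		}
		return ( $this->unserializers[$model] )( $data );
	}

	public function storeContent( SketchContent $content, array $hints = [] ): string {
		$blob = $content->getModel() . "\n" . $content->serialize();
		return $this->blobStore->storeBlob( $blob, $hints );
	}
}

// Round trip: store a Content object, then load it back by address.
$store = new BlobContentStore( new InMemoryBlobStore(), [
	'text' => static function ( string $data ): SketchContent {
		return new SketchTextContent( $data );
	},
] );
$address = $store->storeContent( new SketchTextContent( "Hello\nworld" ) );
$copy = $store->getContent( $address );
```

Callers only ever see Content objects and opaque addresses, so a different ContentStore could keep the same data in a dedicated SQL schema without RevisionStore noticing.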

Details

	Subject	Repo	Branch	Lines +/-
	EXPERIMENTAL: ContentStore	mediawiki/core	master	+391 -69

Related Objects

Status	Subtype	Assigned	Task
Declined		dchen	T118706 Conduct heuristic evaluation of image upload and insert flow in VisualEditor
Open		None	T115858 Design improvements for mw.ForeignStructuredUpload.BookletLayout
Open		None	T115865 Insert image in content immediately after it's uploaded, skipping the "General settings" step
Duplicate		None	T115864 Figure out if the description of the image can be used as the caption on-wiki
Open	Feature	None	T53032 When inserting an image, set its caption by default to be the Commons image description
Open	BUG REPORT	None	T12863 Links on commons upload summaries do not link to commons
Open		None	T5498 Image history is confusing
Stalled		None	T96384 Integrate file revisions with description page history
Open	Feature	None	T39534 Wikimedia Commons should support searching by color
Duplicate		None	T39535 Wikimedia Commons should support filtering by color
Resolved		None	T19503 Provide metadata support on Wikimedia Commons
Open		None	T122038 Moving/Deleting a video file does not move/delete subtitles files, nor does it inform the file mover of their existence
Open		None	T135221 Make TimedText content an integral part of the File page
Open	Feature	None	T2167 Use a dedicated interface for adding meta-data like interwiki links, rather than wikitext
Open		None	T132072 Integrate page meta-data as a new content model revision slot for consistency and atomicity
Resolved		None	T51662 VisualEditor: Use Multimedia/Wikidata's proposed rich structured meta-data in the image insertion dialog
Resolved		None	T68108 [Epic] Store media information for files on Wikimedia Commons as structured data
Open	Feature	None	T14963 allow per-page exceptions to spam blacklist
Open		None	T203157 Make the spam whitelist its own slot
Open	Feature	None	T56140 Move TemplateData to its own JSON-content namespace and associate with Template-namespace, or to its own TemplateData content model and revision slot
Stalled		None	T174043 Deploy Multi-Content Revisions
Open		None	T174022 Implement multi-content revisions
Open		None	T209044 Introduce a ContentStore service to allow certain types of content to not be stored as serialized blobs.

Event Timeline

daniel created this task. Nov 8 2018, 11:45 AM

Restricted Application added a subscriber: Aklapper. Nov 8 2018, 11:45 AM

daniel triaged this task as Low priority. Nov 8 2018, 11:45 AM

daniel added a parent task: T107595: [RFC] Multi-Content Revisions.

Pinging some folks who have in the past expressed interest in this idea: @Fjalapeno @WMDE-leszek @Halfak.

daniel updated the task description. Nov 8 2018, 11:49 AM

daniel updated the task description.

daniel updated the task description. Nov 8 2018, 11:51 AM

Addshore awarded a token. Nov 8 2018, 11:58 AM

CCicalese_WMF added projects: Platform Team Legacy (Next), Multi-Content-Revisions (New Features). Nov 8 2018, 1:18 PM

CCicalese_WMF edited parent tasks, added: T174022: Implement multi-content revisions; removed: T107595: [RFC] Multi-Content Revisions. Nov 21 2018, 4:16 PM

CCicalese_WMF removed a project: Platform Team Legacy (Next). Jul 14 2019, 3:19 PM

WDoranWMF moved this task from MCR to mop on the Platform Engineering board. Jul 26 2019, 6:41 PM

WDoranWMF edited projects, added Core Platform Team Initiatives (MCR); removed Platform Engineering (MCR).

It should probably have a mass fetch/store interface too, to avoid the problems every similar service has predictably run into.

In T209044#6088746, @Tgr wrote:

It should probably have a mass fetch/store interface too, to avoid the problems every similar service has predictably run into.

True. Though for the blob store, we have so far not encountered a use case for mass write. Batch reads are needed by some maintenance scripts.
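One possible shape for a batch-read method, as a sketch only. The task does not specify names or return types, so `SketchBatchContentReader`, `getContentBatch()` and `ArrayBatchContentReader` are hypothetical. The example returns serialized strings rather than Content objects to stay short, and a plain array stands in for a backend that can answer one batched query.

```php
<?php
// Hypothetical batch-read interface; names and return type are assumptions.
interface SketchBatchContentReader {
	/**
	 * @param string[] $contentAddresses
	 * @return array Map of address => serialized content, or null if not found
	 */
	public function getContentBatch( array $contentAddresses, int $queryFlags = 0 ): array;
}

// Toy backend: a single array lookup pass stands in for one batched database query.
final class ArrayBatchContentReader implements SketchBatchContentReader {
	private $rows;

	public function __construct( array $rows ) {
		$this->rows = $rows;
	}

	public function getContentBatch( array $contentAddresses, int $queryFlags = 0 ): array {
		// Start with null for every requested address so callers can detect misses.
		$result = array_fill_keys( $contentAddresses, null );
		// Overwrite the entries that actually exist in the backend.
		return array_replace( $result, array_intersect_key( $this->rows, $result ) );
	}
}

$reader = new ArrayBatchContentReader( [ 'tt:1' => 'one', 'tt:2' => 'two', 'tt:3' => 'three' ] );
$batch = $reader->getContentBatch( [ 'tt:3', 'tt:1', 'tt:9' ] );
```

Returning an explicit null for missing addresses lets a maintenance script continue past individual failures instead of aborting the whole batch.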

daniel updated the task description. Apr 28 2020, 5:52 PM

Addshore subscribed. Jul 17 2020, 12:33 PM

Experimental patch (old, linking in the hope it might be useful for discussion): https://gerrit.wikimedia.org/r/c/mediawiki/core/+/631180

To allow other storage mechanisms to be applied, a wrapper could be used that delegates to the correct ContentStore based on the content model (when writing) and address prefix (when reading).

How would this look in practice? At first I assumed that it could make sense to have each ContentHandler know how to store its corresponding Content (i.e. choosing a ContentStore implementation), so it would have a method that returns either a ContentStore instance, or a class name. But then I wondered if ContentHandler is the right place for that: ContentStore already depends on ContentHandler, plus ContentHandler is already pretty big and might as well be broken down into separate interfaces. Also, we need to handle custom prefixes for read operations.

An alternative could be a ContentStoreRegistry, where custom stores could be registered similar to custom slots, i.e.

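$services->addServiceManipulator(
	'ContentStoreRegistry',
	// The manipulator is called with the service instance and the service container.
	function ( ContentStoreRegistry $registry, MediaWikiServices $services ) {
		$registry->registerStore(
			CONTENT_MODEL_MYCONTENT,
			MY_CONTENT_BLOB_PREFIX,
			// Assumes the extension wires a service under this (hypothetical) name.
			$services->getService( 'MyContentContentStore' )
		);
	}
);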

In T209044#7707580, @Daimona wrote:

To allow other storage mechanisms to be applied, a wrapper could be used that delegates to the correct ContentStore based on the content model (when writing) and address prefix (when reading).

How would this look in practice? At first I assumed that it could make sense to have each ContentHandler know how to store its corresponding Content (i.e. choosing a ContentStore implementation), so it would have a method that returns either a ContentStore instance, or a class name. But then I wondered if ContentHandler is the right place for that: ContentStore already depends on ContentHandler, plus ContentHandler is already pretty big and might as well be broken down into separate interfaces. Also, we need to handle custom prefixes for read operations.

My intention was to have a DispatchingContentStore which can delegate to different implementations based on the content model, or based on the role hint.

In T209044#7707817, @daniel wrote:

My intention was to have a DispatchingContentStore which can delegate to different implementations based on the content model, or based on the role hint.

That would make sense, but how would the DispatchingContentStore know how to associate a model (or prefix) to a specific ContentStore implementation? Which is equivalent to asking: how would an extension register another ContentStore? I assume that the DispatchingContentStore would either act as a registry itself (and perhaps run some hooks to allow customizations), or have a registry injected (and then such a registry would be the one I mentioned in T209044#7707580).

In T209044#7707885, @Daimona wrote:

That would make sense, but how would the DispatchingContentStore know how to associate a model (or prefix) to a specific ContentStore implementation? Which is equivalent to asking: how would an extension register another ContentStore?

I'd suggest an extension.json attribute like we do for REST routes. Alternatively, a factory hook could be used. Either way, I suppose we'd cache a ContentStore instance per combination of slot and model.

(I originally thought "service manipulators" were a good way to do that, but I now think that was a mistake.)

I assume that the DispatchingContentStore would either act as a registry itself (and perhaps run some hooks to allow customizations), or have a registry injected (and then such a registry would be the one I mentioned in T209044#7707580).

Correct.
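The dispatching idea discussed above can be sketched as follows. This is an illustration under assumptions, not the experimental patch: all class and method names here are hypothetical, the dispatcher acts as its own registry, and addresses are assumed to have the form `<prefix>:<rest>`. How stores get registered (extension.json attribute, hook, or otherwise) is left open.

```php
<?php
// Hypothetical, reduced stand-ins; the real Content interface is much larger.
interface SketchContent {
	public function getModel(): string;
	public function serialize(): string;
}

interface SketchContentStore {
	public function getContent( string $contentAddress, int $queryFlags = 0 ): SketchContent;
	public function storeContent( SketchContent $content, array $hints = [] ): string;
}

final class SketchSimpleContent implements SketchContent {
	private $model;
	private $data;
	public function __construct( string $model, string $data ) {
		$this->model = $model;
		$this->data = $data;
	}
	public function getModel(): string {
		return $this->model;
	}
	public function serialize(): string {
		return $this->data;
	}
}

// Toy store that keeps content in memory and hands out "<prefix>:<n>" addresses.
final class InMemoryContentStore implements SketchContentStore {
	private $prefix;
	private $items = [];
	public function __construct( string $prefix ) {
		$this->prefix = $prefix;
	}
	public function getContent( string $contentAddress, int $queryFlags = 0 ): SketchContent {
		if ( !isset( $this->items[$contentAddress] ) ) {
			throw new RuntimeException( "No such content: $contentAddress" );
		}
		return $this->items[$contentAddress];
	}
	public function storeContent( SketchContent $content, array $hints = [] ): string {
		$address = $this->prefix . ':' . count( $this->items );
		$this->items[$address] = $content;
		return $address;
	}
}

final class DispatchingContentStore implements SketchContentStore {
	private $fallback;
	private $byModel = [];
	private $byPrefix = [];

	public function __construct( SketchContentStore $fallback ) {
		$this->fallback = $fallback;
	}

	public function registerStore( string $model, string $prefix, SketchContentStore $store ): void {
		$this->byModel[$model] = $store;
		$this->byPrefix[$prefix] = $store;
	}

	public function getContent( string $contentAddress, int $queryFlags = 0 ): SketchContent {
		// Reading: the part of the address before the first colon selects the store.
		$prefix = explode( ':', $contentAddress, 2 )[0];
		$store = $this->byPrefix[$prefix] ?? $this->fallback;
		return $store->getContent( $contentAddress, $queryFlags );
	}

	public function storeContent( SketchContent $content, array $hints = [] ): string {
		// Writing: the content model selects the store.
		$store = $this->byModel[$content->getModel()] ?? $this->fallback;
		return $store->storeContent( $content, $hints );
	}
}

$dispatcher = new DispatchingContentStore( new InMemoryContentStore( 'tt' ) );
$dispatcher->registerStore( 'my-content', 'my-prefix', new InMemoryContentStore( 'my-prefix' ) );
$a = $dispatcher->storeContent( new SketchSimpleContent( 'wikitext', 'plain' ) );
$b = $dispatcher->storeContent( new SketchSimpleContent( 'my-content', 'custom' ) );
```

Registering the model and the prefix together keeps the write path and the read path in agreement, which is the pairing the registry proposal earlier in the thread also relies on.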

In T209044#7708856, @daniel wrote:

In T209044#7707885, @Daimona wrote:

That would make sense, but how would the DispatchingContentStore know how to associate a model (or prefix) to a specific ContentStore implementation? Which is equivalent to asking: how would an extension register another ContentStore?

I'd suggest an extension.json attribute like we do for REST routes.

That seems reasonable. One thing I don't like about extension.json (in general) is that you can't use PHP constants, which is probably fine-ish if it's just for the content model, a bit less so if it's also for the prefix. I was also thinking about a more long-term idea of having an extension.json entry for defining new content types that would allow you to register ContentHandlers, ContentStores and any other thing we may want to add in the future. Something like:

"Contents": {
    "MyContent": {
        "handler": "MyContentHandler",
        "store": "MyContentStore",
        "blobprefix": "my-prefix"
    }
}

More keys could be added there in the future, but this really is beyond the scope of this task.

In T209044#7712531, @Daimona wrote:

I was also thinking about a more long-term idea of having an extension.json entry for defining new content types that would allow you to register ContentHandlers, ContentStores and any other thing we may want to add in the future.

That sounds quite nice!

I've created T301891 about the generic extension.json proposal. In the meantime, I'd like to try and finish the experimental ContentStore patch.

Daimona mentioned this in T302040: [Request for Comment] New "Event" and "Event Talk" namespaces. Apr 15 2022, 2:29 PM

Change 631180 had a related patch set uploaded (by Daniel Kinzler; author: Daniel Kinzler):

[mediawiki/core@master] EXPERIMENTAL: ContentStore

https://gerrit.wikimedia.org/r/631180

gerritbot added a project: Patch-For-Review. Sep 16 2022, 1:00 PM

Jdforrester-WMF subscribed. Sep 19 2022, 1:18 PM

Daimona mentioned this in T322657: Investigation: Can we display event registration actions in event page history?. Nov 11 2022, 5:11 PM

Addshore unsubscribed. Jun 27 2023, 12:43 PM
