Page MenuHomePhabricator

Create batch access interface for page content
Closed, ResolvedPublic

Description

RevisionStore does not provide a way to access page content (or slot meta-data) in bulk. Code that has been using the 'text' for getQueryInfo() for bulk access to page content can no longer be used, because now there can be multiple slots for each revision.

As a replacement, we need a way to construct a list if RevisionRecord objects in a way that optimizes the query for the associated information from the slots and content tables.

Design draft:
(rewritten 2019-08-20 after discussion with Anomie on this ticket)

Introduce a new method in RevisionStore:
RevisionStore::newRevisionsFromBatch( $rows, array $options = [] ): RevisionRecord[].

Contract:
Construct a RevisionRecord instance for each row in $rows, and return them as an associative array indexed by revision ID.

Signature:

  • $rows is a Traversable of anonymous objects representing rows from a query result. Each row is expected to conform to the requirements of RevisionStore::newRevisionFromRow(). In practice, $rows is typically an IResultWrapper containing the result of a query that was based on RevisionStore::getQueryInfo.
  • $options is an associative array supporting the following options:
    • 'slots': whether meta-data about slots (and slot content) should be loaded immediately, instead of relying on the lazy loading mechanism in RevisionStoreRecord. This allows all the meta-data for all slots to be loaded in a single query. The value is either truthy to indicate that all slot should be preloaded, or falsy to indicate that no slot meta data should be preloaded, or a list (array) or slot role names, to indicate for which slots the meta data should be preloaded. The default is false.
    • 'content': whether the actual content blobs should be preloaded, instead of relying on lazy loading as usual. The value is truthy or falsy, default is false. Only applies to slots which have been selected for preloading via the 'slots' option. This option has no effect until BlobStore also gets a batch interface (see T230834: Create batch access interface for BlobStore).
  • returns an associative array of Revision(Store)Records, indexed by revision ID.

Implementation idea:

  • At first, just loop over $rows and call RevisionStore::newRevisionFromRow on each row. This would be the baseline version of the method. This already satisfies the contract, but passes on the opportunity to optimize by preloading.
  • In the next iteration, if preloading of slot meta-data is requested:
    • first pull out the revision ID from each row in "$rows". Remember the row in an associative array indexed by revision ID
    • Run a query that selects all the requested slot (and slot content) meta-data; let's call the result of that query $slots.
    • Loop through the result set, grouping together slot meta-data belonging to the same revision. Use RevisionStore::newRevisionFromRowAndSlots to construct a RevisionStoreRecord from that plus the respective row from $rows.
  • In a following iteration, use a batch query to also preload the Content objects for each slot, by content ID. Details TBD, see T230834.
  • Old stale patch implementing something similar for use in WikiExporter https://gerrit.wikimedia.org/r/c/mediawiki/core/+/513073
  • WikiExporter should be changed to use the new system as well. Doing that should validate the approach and the interface.

Related Objects

StatusSubtypeAssignedTask
ResolvedNone
ResolvedZabe
Resolveddaniel
ResolvedCCicalese_WMF
Resolveddaniel
ResolvedNone
ResolvedNone
ResolvedCCicalese_WMF
ResolvedCCicalese_WMF
Resolveddaniel
Resolved Pchelolo
Resolveddaniel
ResolvedBPirkle
Resolved Pchelolo
Resolved Pchelolo
Resolved Pchelolo
Resolved Pchelolo
Resolveddaniel
Resolveddaniel
ResolvedNone

Event Timeline

which returns a RevisionQuery object

Let's not introduce a weird new method for doing queries, different from how bulk queries are done by everything else in MediaWiki and incompatible with use cases much more complicated than a basic query against the revision table.

Counterproposals

Not necessarily mutually exclusive.

  1. Add RevisionStore::newRevisionsFromRows( array $rows, array $options = [] ) or the like, that has options to indicate whether it should bulk-query and populate the slots and content in the fetched RevisionRecords.
  2. Add RevisionStore::bulkLoadRevisionSlots( RevisionRecord[] $revs, array $options = [] ) that bulk-queries the slots (and, optionally, content) to populate into the passed RevisionRecords.

which returns a RevisionQuery object

Let's not introduce a weird new method for doing queries, different from how bulk queries are done by everything else in MediaWiki and incompatible with use cases much more complicated than a basic query against the revision table.

I think we should rethink how bulk queries are done in MediaWiki. The current way makes schema changes very hard. More encapsulation means more freedom to change the schema.

That said, both counterproposals seem acceptable as well.

  1. Add RevisionStore::newRevisionsFromRows( array $rows, array $options = [] ) or the like, that has options to indicate whether it should bulk-query and populate the slots and content in the fetched RevisionRecords.
  2. Add RevisionStore::bulkLoadRevisionSlots( RevisionRecord[] $revs, array $options = [] ) that bulk-queries the slots (and, optionally, content) to populate into the passed RevisionRecords.

The first some seems preferable, because in order to do number 2, we'd have to have a way to inject the slots into RevisionStoreRecords, which are immutable objects and should not have setters.

I think we should rethink how bulk queries are done in MediaWiki. The current way makes schema changes very hard. More encapsulation means more freedom to change the schema.

On the other hand, more encapsulation also makes it harder to do things that cross boundaries. If you need to query based on page and revision and file data, do you use "PageQuery" or "RevisionQuery" or "FileQuery" or does someone have to implement some additional interface to smash the different things together?

More encapsulation also makes it harder to optimize queries. For example, whether you should include the namespace and/or title in the ORDER BY for MySQL depends on which pages exactly you're querying, and making the wrong choice will filesort. I doubt the encapsulation is going to handle that for you, so you'll have to duplicate work done by your setWherePages(). For another example, index forcing, while frowned upon, is sometimes necessary when MySQL is choosing stupid plans and you didn't provide any way to do that.

And it also makes it harder to do things outside of what was considered by the person writing the encapsulation. For example, at a quick glance I see there's no way in your proposal to do DISTINCT or GROUP BY or to use the query produced by your RevisionQuery as a subquery in some other query. Sure, you could probably add those things now that I pointed them out, but how many other use cases did I not think of just now?

Rather than trying to make a general-purpose encapsulation of "revision" or whatever, I think we'd be better served by encapsulating specific use cases to share logic between special pages and API modules that both serve the use case.

More encapsulation also makes it harder to optimize queries.

That's always the tradeoff. The fastest code is probably hand-optimized assembly. Abstraction comes at the cost of performance, but makes things more flexible, testable, and maintainable. But when taken too far, things get unpredictable and potentially slow (see the annoyance that is ORM). Striking the right balance is always a value judgement.

Rather than trying to make a general-purpose encapsulation of "revision" or whatever, I think we'd be better served by encapsulating specific use cases to share logic between special pages and API modules that both serve the use case.

That should definitely be done in any case. The idea behind my proposal is to reduce the number of classes that have to know about the revision storage schema.

I don't have strong feelings about introducing RevisionQuery. I think it would be a bit cleaner, would make paging easy, and the same approach could be shared for other cases where we want to iterate things, like pages or users. But newRevisionsFromRows() will serve the purpose as well, and is closer to How Things Were Always Done.

If you do want to encapsulate things more, a less all-out option might be something like rMW948d136243da: [WIP] Introduce QueryInfo and providing helper methods to create a revision-specific query info object which can than still be relatively easily altered if some less common behavior is needed (possibly in an abstract way, like it's done with the migration helpers).

Change 531308 had a related patch set uploaded (by Ppchelko; owner: Ppchelko):
[mediawiki/core@master] WIP: Introduce RevisionStore::newRevisionsFromBatch

https://gerrit.wikimedia.org/r/531308

If you do want to encapsulate things more, a less all-out option might be something like rMW948d136243da: [WIP] Introduce QueryInfo and providing helper methods to create a revision-specific query info object which can than still be relatively easily altered if some less common behavior is needed (possibly in an abstract way, like it's done with the migration helpers).

You mean, like my original proposal? (click "old" on https://phabricator.wikimedia.org/transactions/detail/PHID-XACT-TASK-mps2egoasz73r2d/)

I mean it's less all-out than the original proposal. The query info is not bound to a single model and can be easily customized, while able to provide some basic consistency checks. And it's not a huge change since much of our code already follows that pattern, except it's arrayly typed.

daniel added a subscriber: ArielGlenn.

Marking as blocked externally on CPT Clinic Duty board, since @ArielGlenn should have a look before we merge.

Marking as blocked externally on CPT Clinic Duty board, since @ArielGlenn should have a look before we merge.

I'm taking an actual honest to goodness vacation for the next 10 days so while I'll drop in from time to time to make sure nothing's busted, I won't be looking at any code until Ι'm back.

I think WikiExporter and xmlDumpWriter both need to be much better about indicating straight up in the code whether we want content retrieval or not, and whether we want revision/slot retrieval (or already have all of the info), so that folks updating the code in the future don't have (as much of) a headache around these issues. I'll look at that when I return.

Change 531308 merged by jenkins-bot:
[mediawiki/core@master] Introduce RevisionStore::newRevisionsFromBatch

https://gerrit.wikimedia.org/r/531308

We have the batch API to grab revisions from rows and the batch API to fetch blobs T230834, but we haven't really connected the 2 together yet, so no, this is not entirely done.

We have the batch API to grab revisions from rows and the batch API to fetch blobs T230834, but we haven't really connected the 2 together yet, so no, this is not entirely done.

That is correct, but the current state may be sufficient to unblock the MCR schema migration. The next step should be to use the new method in the Translate extension, and benchmark whether the performance is acceptable. If it is, we can use it to remove the dependency on the text table there, allowing us to move forward in the schema migration.

Pchelolo claimed this task.

I've created subtasks for pieces of followup work, but the interface is created, so this is done.

Change 538928 had a related patch set uploaded (by Daniel Kinzler; owner: Daniel Kinzler):
[mediawiki/core@master] Improve documentation of newRevisionsFromBatch()

https://gerrit.wikimedia.org/r/538928

Change 538932 had a related patch set uploaded (by Daniel Kinzler; owner: Daniel Kinzler):
[mediawiki/core@master] newRevisionsFromBatch: don't throw on duplicate row.

https://gerrit.wikimedia.org/r/538932

Change 538928 merged by jenkins-bot:
[mediawiki/core@master] Improve documentation of newRevisionsFromBatch()

https://gerrit.wikimedia.org/r/538928

Change 539194 had a related patch set uploaded (by Daniel Kinzler; owner: Ppchelko):
[mediawiki/core@master] RevisionStore::newRevisionFromBatch should use Title::newFromRow

https://gerrit.wikimedia.org/r/539194

Change 539324 had a related patch set uploaded (by Daniel Kinzler; owner: Daniel Kinzler):
[mediawiki/core@master] RevisionSTore: Introduce getSlotRowsForBatch

https://gerrit.wikimedia.org/r/539324

Change 538932 merged by jenkins-bot:
[mediawiki/core@master] newRevisionsFromBatch: don't throw on duplicate row.

https://gerrit.wikimedia.org/r/538932