Page MenuHomePhabricator

Batching of Linker::formatComment()
Open, Needs TriagePublic

Description

Links in comments are a major source of DB queries and presumably contribute to the slowness of the history page and other large lists of changes or events (e.g. T284274). The obvious solution is to format all the comments in a batch, so that page existence checks can be done in a batch.

The complications are:

  • Data flow
  • Helper classes that intervene between result wrappers and HTML: CommentStore, RevisionRecord, RecentChange, etc.
  • DELETED_COMMENT flags

The case studies below are the interesting cases pulled out of a caller survey of core and some extensions.

IndexPager

Direct Linker::formatComment() callers in pagers are mostly pretty easy:

class SomePager extends IndexPager {
	...
	public function preprocessResults( $result ) {
		$this->formattedComments = $this->commentFormatter->formatRows( 'some_comment', $result );
	}

	public function formatValue( $name, $value ) {
		switch ( $name ) {
			case 'some_comment':
				$formatted = $this->formattedComments[$this->getResultOffset()];
				break;
		}
	}
}

CommentFormatter calls CommentStore on the rows with "some_comment" being passed through as $key. IndexPager::getResultOffset() is trivial. So far, CommentFormatter only depends on CommentStore.

HistoryPager

revComment() is more difficult because it gets the comment text from a RevisionRecord. CommentFormatter in this case would need a RevisionStore, and all the RevisionRecord objects would be created twice, potentially with performance consequences.

class HistoryPager extends ReverseChronologicalPager {
	...
	protected function doBatchLookups() {
		...
		$this->formattedComments = $this->commentFormatter->formatRevisionRows(
			$this->mResult, false, true, false );
	}
	...

	private function historyLine( $row, $next, $notificationtimestamp, $resultOffset ) {
		...
		$s2 = $this->formattedComments[$resultOffset];
		...
	}
}

The parameters to formatRevisionRows() as proposed above are the same as Linker::revComment() except with the RevisionRecord replaced with a ResultWrapper. Linker::revComment() knows about deleted flags, whereas Linker::commentBlock() callers do their own deleted flag handling -- an inconsistency which is more obvious in the new interface.

To avoid unnecessary dependencies, we could have RevisionCommentFormatter which depends on RevisionStore. If you just want to format some strings, there would be another formatter class that is constructible without a RevisionStore.

RevisionList

The RevisionItem/RevisionList/RevDel* hierarchy is somewhat complicated. If we spray some around some @internal annotations, and break b/c, then I think it's feasible. There are some ways to do it in a backwards-compatible way, but I don't think extensions actually use these classes.

(Note: These classes do not provide a generic revision array backend. RevisionItemBase::getHTML() is abstract, firmly tying these classes to the index.php UI.)

Currently RevisionItemBase subclasses receive a RevisionListBase and a row as parameters. Most of the logic is implemented in the item subclasses -- lists don't really know how to get comments out of rows. So we'd need a list method that preprocesses results, and the comment extraction logic would move to there. Then items would receive a formatted comment as a constructor parameter.

A backwards-compatible way to do it would be to mutate the rows before passing them to the item classes, effectively adding a fake formatted_comment column to the result set.

ChangesList

The poorly-named ChangesList::insertComment() takes a RecentChange object and returns an HTML string. It's overridden by AbuseFilter. There is a place to put batch queries: ChangesList::initChangesListRows(). We could do the comment batch there, and store the formatted comments indexed by rc_id or something. ChangesList::insertComment() would call an accessor method which gets formatted comments out of the batch.

Event Timeline

Change 708241 had a related patch set uploaded (by Tim Starling; author: Tim Starling):

[mediawiki/core@master] [WIP] CommentFormatter

https://gerrit.wikimedia.org/r/708241

Not sure if that makes things to complex within this task:
The comments are stored in its own table and always joined in RevisionStore over the comment id. The new formatter/batch class could load the comment itself depending on the ids and than returning the formatted comment depending on the id, which takes already care of deduplicate comments before parsing. This can avoid OOM when there are 5000 revision record created for the batch to get the comments from.
A single mode is okay for diff view.

Change 714447 had a related patch set uploaded (by Tim Starling; author: Tim Starling):

[mediawiki/core@master] Add benchmark for Linker::formatComment() in preparation for refactor

https://gerrit.wikimedia.org/r/714447

Change 714447 merged by jenkins-bot:

[mediawiki/core@master] Add benchmark for Linker::formatComment() in preparation for refactor

https://gerrit.wikimedia.org/r/714447

Change 708241 merged by jenkins-bot:

[mediawiki/core@master] Introduce CommentFormatter

https://gerrit.wikimedia.org/r/708241