Replace usage of VirtualRESTServiceClient in Flow extension
Closed, Resolved · Public

Description

Flow creates a VRS client based on $wgVirtualRestConfig. If the config sets $wg...['modules']['restbase'], RestbaseVirtualRestService could be used, but it is disabled for Flow because the /restbase/ module drops data-parsoid mappings after transforming wikitext -> HTML; Flow therefore falls back to ParsoidVirtualRestService, which is what is being used now.
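
For reference, a minimal sketch of what such a VRS config might look like (the URL and domain are placeholders, not the production values):

$wgVirtualRestConfig['modules']['parsoid'] = [
	// Base URL of the Parsoid service the VRS client contacts
	'url' => 'http://localhost:8000',
	// Wiki domain used for title and link resolution
	'domain' => 'localhost',
];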

The main reason Flow contacts Parsoid via RESTBase is to transform content between formats (wikitext -> HTML, HTML -> wikitext, etc.):

$url = '/restbase/local/v1/transform/' . $from . '/to/' . $to . '/' .
			urlencode( $prefixedDbTitle );
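
For example, a wikitext -> HTML request for the page Talk:Example would hit a path like:

/restbase/local/v1/transform/wikitext/to/html/Talk%3AExample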

Recently, this logic has moved into MediaWiki core. Based on the options passed along the Flow\Conversion\Utils::convert() path ($from, $to, etc.), we can apply core's Parsoid logic to perform the conversion directly and cut ties with RESTBase.

NOTE: If Parsoid isn't configured, Flow\Conversion\Utils::convert() falls back to the legacy MediaWiki parser, which only supports wikitext -> HTML.
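
For illustration, a call along that path might look like the following (the exact signature lives in Flow\Conversion\Utils; $title here is assumed to be the Title of the discussion page):

use Flow\Conversion\Utils;

// Converts a comment's wikitext to HTML; falls back to the legacy parser
// when Parsoid isn't configured.
$html = Utils::convert( 'wikitext', 'html', $wikitext, $title );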

Acceptance Criteria

  • Put tests in place to capture the different outputs after the transformation is applied with the RESTBase configs in place
  • Remove the usage of VirtualRestService and replace it with core logic
  • Ensure the outputs returned by core code after the transformation match the above (tests shouldn't break) - WIP: https://gerrit.wikimedia.org/r/c/mediawiki/extensions/Flow/+/927619
  • Review to make sure no VRS logic remains in the extension
  • Test the extension manually to make sure it still works as expected
  • Once deployed, monitor sites for any spikes or failures for a week or two.

Event Timeline

Implementation plan for "Remove usage of VirtualRestService and replace with core logic":

Step 1: In Flow\Conversion\Utils::convert, replace the if ( $from === 'wikitext' || $from === 'html' ) conditional section as follows:

		if ( $from === 'wikitext' && $to === 'html' ) {
			return self::wikitext2html( $content, $title );
		} elseif ( $from === 'html' && $to === 'wikitext' ) {
			return self::html2wikitext( $content, $title );
		}

The wikitext2html and html2wikitext methods should be private and contain the old logic, e.g.:

				if ( self::isParsoidConfigured() ) {
					return self::parsoid( 'wikitext', 'html', $content, $title );
				} else {
					return self::parser( 'wikitext', 'html', $content, $title );
				}
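
For context, isParsoidConfigured() boils down to a check against the VRS config. A rough sketch of the idea (not the exact implementation in Flow\Conversion\Utils):

	// Sketch only: true when a Parsoid-capable VRS module is configured.
	private static function isParsoidConfigured(): bool {
		global $wgVirtualRestConfig;
		return isset( $wgVirtualRestConfig['modules']['restbase'] )
			|| isset( $wgVirtualRestConfig['modules']['parsoid'] );
	}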

For now, the two methods will have nearly identical code. Next, we replace the code in wikitext2html first, and then the code in html2wikitext:

Step 2: Replace the code in wikitext2html:

	private static function wikitext2html( string $wikitext, PageIdentity $title ): string {
		$parserOptions = ParserOptions::newFromAnon();
		$parserOptions->setRenderReason( __METHOD__ );

		// Use core's Parsoid-backed parser instead of going through RESTBase.
		$parser = MediaWikiServices::getInstance()->getParsoidParserFactory()->create();
		$output = $parser->parse( $wikitext, $title, $parserOptions );

		return $output->getRawText();
	}

Step 3: Replace the code in html2wikitext:

	private static function html2wikitext( string $html, PageIdentity $title ): string {
		$transform = MediaWikiServices::getInstance()->getHtmlTransformFactory()
			->getHtmlToContentTransform( $html, $title );

		$transform->setOptions( [
			'contentmodel' => CONTENT_MODEL_WIKITEXT
		] );

		$content = $transform->htmlToContent();

		// TODO: Throw if $content isn't a WikitextContent instance
		return $content->getText();
	}
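
To resolve the TODO above, the guard could look something like the following sketch. It assumes core's WikitextContent class; the exception type here is a placeholder (Flow's own WikitextException may be a better fit):

		// Bail out if the transform produced an unexpected content model.
		if ( !$content instanceof WikitextContent ) {
			throw new RuntimeException(
				'Expected WikitextContent, got ' . get_class( $content )
			);
		}

		return $content->getText();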

This should be done in three different patches.

Before we can replace the implementation, we probably need tests that cover more edge cases. We can get started working on the code already, but before we merge it, we need to be sure we are not going to break conversion for more complex content.
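
As a starting point for those tests, an edge-case round-trip check might look something like this (a sketch; the test names and inputs are illustrative, and it assumes Utils::convert() keeps its current signature):

	public function provideRoundTripContent() {
		yield 'bold and link' => [ "'''bold''' and [[Main Page|a link]]" ];
		yield 'template call' => [ '{{echo|hello}}' ];
		yield 'magic word' => [ '{{PAGENAME}}' ];
	}

	/**
	 * @dataProvider provideRoundTripContent
	 */
	public function testRoundTrip( string $wikitext ) {
		$title = Title::newFromText( 'Talk:Flow_QA' );
		$html = Utils::convert( 'wikitext', 'html', $wikitext, $title );
		$roundTripped = Utils::convert( 'html', 'wikitext', $html, $title );
		$this->assertSame( $wikitext, trim( $roundTripped ) );
	}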

Change 923556 had a related patch set uploaded (by Richika Rana; author: Richika Rana):

[mediawiki/extensions/Flow@master] Refactor utils::convert()

https://gerrit.wikimedia.org/r/923556

Change 923648 had a related patch set uploaded (by Richika Rana; author: Richika Rana):

[mediawiki/extensions/Flow@master] Refactor utils::wikitextToHTML

https://gerrit.wikimedia.org/r/923648

Change 923651 had a related patch set uploaded (by Richika Rana; author: Richika Rana):

[mediawiki/extensions/Flow@master] Refactor utils::htmlToWikitext

https://gerrit.wikimedia.org/r/923651

Change 924478 had a related patch set uploaded (by Richika Rana; author: Richika Rana):

[mediawiki/extensions/Flow@master] Removing the usage of VirtualRestService.

https://gerrit.wikimedia.org/r/924478

Change 923556 merged by jenkins-bot:

[mediawiki/extensions/Flow@master] Refactor utils::convert()

https://gerrit.wikimedia.org/r/923556

Quick note that this topic came up on Slack last week.

A couple of potentially relevant notes:

  • IIRC, the content is stored as HTML and only converted back to wikitext as needed (via Parsoid) when editing.
  • Every individual comment is a single entry in ExternalStore (I vaguely remember storing just one document per entire thread being considered, but I'm fairly certain that never happened)
  • There are a couple of content formats: in addition to ‘html’ and ‘wikitext’, there are also ‘topic-title-html’ and ‘topic-title-wikitext’ (these are captured in rev_flags); fewer things were supported in titles (only stuff like bold and links, I think). These appear to use CommentFormatter rather than the full parser (see Flow\Conversion\Utils)
  • Looks like the <body> node is stored and Parsoid attributes are present. E.g. run the snippet below via eval.php, where the URL is one of the rows from SELECT rev_content, rev_flags FROM flow_revision; (to be executed against flowdb on x1):
$content = ExternalStore::fetchFromURL( 'DB://cluster27/397816490' );
$content = \MediaWiki\MediaWikiServices::getInstance()->getBlobStoreFactory()->newSqlBlobStore()->decompressData( $content, [ 'utf-8', 'gzip' ] );
var_dump( $content );

e.g.

string(1309) "<body id="mwAA" lang="en" dir="ltr" data-parsoid="{&quot;dsr&quot;:[0,1027,0,0]}" parsoid-version="2.0.0" base-url="//www.wikidata.org/wiki/"><section data-mw-section-id="0" id="mwAQ" data-parsoid="{}"><p id="mwAg" data-parsoid="{&quot;dsr&quot;:[0,1027,0,0]}">Já bych spíše předpokládal, že častěji je položka obecního úřadu zakládána ke kategorii fotografií obecního úřadu na Commons. Myslel bych si, že fotografií obecních úřadů máme podstatně víc než např. položek pro periodika vydávaná obcemi. A ano, vydavatelem těch periodik skutečně bude většinou obec, ať už je vydává prostřednictvím obecního úřadu, nebo prostřednictvím např. redakce podléhající přímo samosprávě, zatímco fotky obecního úřadu jakožto budovy jsou prostě fotkami obecního úřadu - a není výstižnější je označovat jako fotky obce. V naprosté většině menších obcí je obecní úřad reprezentován právě svou jednou sídelní budovou - jen v pár větších městech úřad zabírá více budov, a v takových případech bývá jen jedna z budov sídelní. Ve většině případů obecních úřadů je praktičtější mít jedinou položku zároveň pro budovu i instituci, dělení do dvou položek by data zbytečně tříštilo.</p></section></body>"
  • Templates & magic words are supported, and Parsoid receives the title of the discussion page (i.e. Talk:My-page, not Topic:Xyzaskdjnasric) for context
  • I can’t remember Flow actually expecting anything specific in the HTML or wikitext content. As long as it could round-trip between all formats and showed content that made sense, Flow didn’t rely on anything in particular; it only really mattered to Parsoid and its ability to make sense of the content again for conversion in the other direction. I’d probably create a couple of posts on a live copy of Flow + Parsoid, hit them with every parser-related thing I can think of, then round-trip those with the new parser (see the eval.php sketch after this list). If that works, I see no reason for Flow to trip over anything.
  • The only PITA I remember is context for parser functions, templates, etc., but the relevant page title is included in all calls now, IIRC.
  • Looking at Utils.php, there is some mild manipulation of content going on, most importantly patching up relative URLs (in encodeHeadInfo, and probably createRelativeTitle too): since content can be visible both in the discussion namespace and the Topic: namespace, relative paths may not otherwise resolve accurately.
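
A quick manual round-trip along those lines, again via eval.php (assuming Utils::convert() keeps its current signature):

$title = Title::newFromText( 'Talk:Sandbox' );
$html = \Flow\Conversion\Utils::convert( 'wikitext', 'html', "'''bold''' {{echo|test}}", $title );
var_dump( \Flow\Conversion\Utils::convert( 'html', 'wikitext', $html, $title ) );
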
DAlangi_WMF changed the task status from Open to In Progress. Jun 5 2023, 8:14 AM
DAlangi_WMF triaged this task as Medium priority.

Change 923648 merged by jenkins-bot:

[mediawiki/extensions/Flow@master] Use ParsoidParser to convert wikitext to HTML.

https://gerrit.wikimedia.org/r/923648

Change 927619 had a related patch set uploaded (by D3r1ck01; author: Derick Alangi):

[mediawiki/extensions/Flow@master] Fix Flow failing tests due to Parsoid being enabled (by default)

https://gerrit.wikimedia.org/r/927619

Change 928474 had a related patch set uploaded (by D3r1ck01; author: Derick Alangi):

[mediawiki/extensions/Flow@master] Skip affected tests unconditionally until gerrit->927619 lands

https://gerrit.wikimedia.org/r/928474

Change 928474 merged by jenkins-bot:

[mediawiki/extensions/Flow@master] Skip affected tests unconditionally until gerrit->927619 lands

https://gerrit.wikimedia.org/r/928474

Change 923651 merged by jenkins-bot:

[mediawiki/extensions/Flow@master] Use ParsoidParser to convert HTML to Wikitext.

https://gerrit.wikimedia.org/r/923651

Change 924478 merged by jenkins-bot:

[mediawiki/extensions/Flow@master] Remove usage of VirtualRestService from Flow.

https://gerrit.wikimedia.org/r/924478

DAlangi_WMF assigned this task to R_Rana.
DAlangi_WMF updated the task description.