Page MenuHomePhabricator

The MW Rest API does not normalize its string request parameters
Closed, ResolvedPublic

Description

When using the rest.php endpoint it appears that string parameters are not normalized the same way than the action API.

  • no NFC normalization
  • no � replacement on control chars such as the NUL character

Basically everything that \UtfNormal\Validator::cleanUp would usually do is not done when going through rest.php.

I believe that most of the internal MW apis do expect parameters to be normalized the same way as the action API.

Concretely the Search rest endpoint allows the NUL char to be passed through the internal SearchEngine class and then fails when hitting the search backend.

Details

Related Changes in Gerrit:

Event Timeline

Pinging Wikibase REST API (WPP) as it might possibly introduce discrepancies between edits made via action=wb* APIs and the new REST api.

API Platform dug into this issue in Code Mob today, using the specific error from T339293: Wikimedia\Assert\PostconditionException: Postcondition failed: Regex failed: 4 as a concrete example of this issue. tl;dr: @dcausse is correct.

The call chain that Action API uses to validate this parameter is:

api.php =>
ApiMain::execute =>
ApiMain::executeActionWithErrorHandling =>
ApiMain::executeAction =>
ApiOpenSearch::execute =>
ApiBase::ExtractRequestParams =>
ApiBase::getParameterFromSettings =>
ApiParamValidator::getValue =>
ParamValidator::getValue =>
TypeDef::getValue =>
ApiParamValidatorCallback::getValue =>
WebRequest::normalizeUnicode =>
Language::normalize =>
UtfNormal\Validator::cleanUp

The warning David mentioned in T339293 triggers in ApiParamValidatorCallback::getValue, if it detects that the raw and normalized strings differ.

I'm no Action API expert, but from what I can tell, the "search" parameter by which the unhappy value is passed to the endpoint is registered in SearchApi::BuildCommonApiParams, with the "sr" prefix being added in ApiQuerSearch.php. The notable thing here is this block in SearchApi.php:

			'search' => [
				ParamValidator::PARAM_TYPE => 'string',
				ParamValidator::PARAM_REQUIRED => true,
			],

There's nothing in the special about this parameter being associated with searches. It is just defined as a string. In other words, this is the same validation that is being done by the Action API on every string parameter, not just search strings. This is confirmed by hitting a completely unrelated Action API endpoint, while passing the same problematic value as a string parameter there:

https://www.mediawiki.org/w/api.php?action=searchtranslations&language=fr&query=%88

You'll see the same warning, indicating that the same validation is being done. This means that for rest.php, we could have run into similar errors with any of the endpoints that accept a string. This is just the one we happened to notice it on first.

This all confirms that the fix (per the task description written by @dcausse, thank you) is to add similar validation to the rest.php parameter validation system.

FWIW the REST API uses MediaWiki\Rest\Validator\ParamValidatorCallbacks so the question is how similar should that be to MediaWiki\Api\Validator\ApiParamValidatorCallbacks (especially in how it implements getValue() and recordCondition())?

The REST API also has a different mechanism for validating data passed in the POST body, see MediaWiki\Rest\Validator\BodyValidator and MediaWiki\Rest\Validator\JsonBodyValidator. Probably that should have similar behavior.

@BPirkle thanks for the investigations! Your conclusions do seem correct to me, I'm not being very knowledgeable in the MW API subsystems but I'm with you and Gergo here and I believe that since the REST api is "relatively new" it might be safer to apply the same normalization steps for all string params that the action API does. In fact in this case I believe that we somehow got "lucky" that CirrusSearch was very picky (not very robust to be more correct :) by failing on such utf8 sequences but I imagine that in other scenarios/endpoint the internal MW API could simply silently accept such strings and possibly update the database with them and pollute our datastores with these "bad" utf chars and sequences.

FWIW the REST API uses MediaWiki\Rest\Validator\ParamValidatorCallbacks

Yeah, in fact there's this fun little comment in ParamValidatorCallbacks.php, in the getValue function:

		// @todo Should normalization to NFC UTF-8 be done here (much like in the
		// action API and the rest of MW), or should it be left to handlers to
		// do whatever normalization they need?

Just a little ping to note that I'm on train duty this week and the Postcondition failed: Regex failed: 4 error is showing up as the first or second most common error in logspam-watch at the moment.

FWIW the REST API uses MediaWiki\Rest\Validator\ParamValidatorCallbacks

Yeah, in fact there's this fun little comment in ParamValidatorCallbacks.php, in the getValue function:

		// @todo Should normalization to NFC UTF-8 be done here (much like in the
		// action API and the rest of MW), or should it be left to handlers to
		// do whatever normalization they need?

Sounds like we are getting to the point where we should just go ahead and do that.

FJoseph-WMF changed the task status from Open to In Progress.Apr 25 2024, 3:32 PM
daniel lowered the priority of this task from High to Medium.

The call chain that Action API uses to validate this parameter is:
[...]
ApiParamValidator::getValue =>
ParamValidator::getValue =>
TypeDef::getValue =>
ApiParamValidatorCallback::getValue =>
WebRequest::normalizeUnicode =>
Language::normalize =>
UtfNormal\Validator::cleanUp

This is 9only the handling for the if ( substr( $rawValue, 0, 1 ) === "\x1f" ) special case. The "normal" validation is hidden in the call to $value = $this->apiMain->getVal( $name, $default ); at the top of ApiParamValidatorCallback::getValue:

  • ApiParamValidatorCallback::getValue ->
  • ApiMain::getVal ->
  • WebRequest::getVal ->
  • WebRequest::getGPCVal ->
  • WebRequest::normalizeUnicode ->
  • Language::normalize ->
  • UtfNormal\Validator::cleanUp

In MediaWiki\Rest\Validator\ParamValidatorCallbacks::getValue we can probably just call UtfNormal\Validator::cleanUp for now. But:

  • we should have a TODO there for applying language specific cleanup based on the content language
  • We should have a TODO saying that we should somehow warn if normalization was applied (do we want to introduce warning headers?)
  • Like ApiParamValidatorCallback::getValue, we need to skip cleanup if $options['raw'] is set
  • We should skip this step if $options['source'] is 'body', and implement normalization for the request body in Handler::parseBodyData instead.

For Handler::parseBodyData, there is two cases:

  • reyling on $request->getPostParams(). In that case, we should process each field in the same way we do in MediaWiki\Rest\Validator\ParamValidatorCallbacks::getValue
  • using json_decode - in that case, we should apply unicode normalization before decoding.
  • again, we need a TODO about warning when nromalization weas applied

Change #1051158 had a related patch set uploaded (by WQuarshie; author: WQuarshie):

[mediawiki/core@master] Normalise string params in MW Rest API

https://gerrit.wikimedia.org/r/1051158

Change #1051158 had a related patch set uploaded (by Daniel Kinzler; author: WQuarshie):

[mediawiki/core@master] Normalise string params in MW Rest API

https://gerrit.wikimedia.org/r/1051158

Change #1051158 merged by jenkins-bot:

[mediawiki/core@master] Normalise string params in MW Rest API

https://gerrit.wikimedia.org/r/1051158