Page MenuHomePhabricator

[ACTION-API] [TECH] Wikibase doesn’t validate formatter options, can crash with different TypeErrors
Open, Needs TriagePublic

Description

The wbformatvalue API has an options parameter that allows users to specify the FormatterOptions that will be used to format the value, as JSON. Several of Wikibase’s value formatters don’t really validate these options, so if an API user puts garbage in them, they can provoke various TypeErrors,

The purpose of this task is to find affected formatters and options, so that we can fix them later.

Possible investigation approaches: go through all ValueFormatters subclasses, or go through all FormatterOptions::getOption() callers. Not sure which is nicer / easier to do.


For example, LatLongFormatter has options to customize the strings used for N/E/S/W, °/'/", set the format to float/dms/dm/dd and determine the floating point precision, where to put spacing, whether to use -1° or 1° S, and what the separator between latitude and longitude is. (Don’t ask me why this class is so amazingly configurable.) By setting the “north” option to an int instead, we get a type error.

api.php?action=wbformatvalue&format=json&generate=text%2Fplain&datavalue=%7B%0A%20%20%22value%22%3A%20%7B%0A%20%20%20%20%22latitude%22%3A%2027.988055555556%2C%0A%20%20%20%20%22longitude%22%3A%2086.925277777778%2C%0A%20%20%20%20%22altitude%22%3A%20null%2C%0A%20%20%20%20%22precision%22%3A%200.00027777777777778%2C%0A%20%20%20%20%22globe%22%3A%20%22http%3A%2F%2Fwww.wikidata.org%2Fentity%2FQ2%22%0A%20%20%7D%2C%0A%20%20%22type%22%3A%20%22globecoordinate%22%0A%7D%0A&options=%7B%22north%22%3A%201%7D&formatversion=2

datavalue={
  "value": {
    "latitude": 27.988055555556,
    "longitude": 86.925277777778,
    "altitude": null,
    "precision": 0.00027777777777778,
    "globe": "http://www.wikidata.org/entity/Q2"
  },
  "type": "globecoordinate"
}

options={"north": 1}

Exception caught: DataValues\\Geo\\Formatters\\LatLongFormatter::makeDirectionalIfNeeded(): Argument #2 ($positiveSymbol) must be of type string, int given

Event Timeline

Task Review Notes:

  • We will investigate the impact breadth of this issue, to try a gauge our LOE and scope these bugfixes might entail.
  • This ticket will be rewritten to reflect the investigation into the topic.

Prio Notes:

  • Affect end users / production
  • Does not affect monitoring
  • Does not affect development efforts
  • Does not affect onboarding efforts
  • Does not affect additional stakeholders
ItamarWMDE renamed this task from Wikibase doesn’t validate formatter options, can crash with different TypeErrors to [TECH][ACTION-API] Wikibase doesn’t validate formatter options, can crash with different TypeErrors.Aug 2 2023, 2:05 PM
ItamarWMDE moved this task from Incoming to [DOT] Prioritized on the wmde-wikidata-tech board.
ItamarWMDE added a project: Wikidata Dev Team.
Lucas_Werkmeister_WMDE renamed this task from [TECH][ACTION-API] Wikibase doesn’t validate formatter options, can crash with different TypeErrors to Wikibase doesn’t validate formatter options, can crash with different TypeErrors.Aug 2 2023, 2:13 PM
Lucas_Werkmeister_WMDE updated the task description. (Show Details)
Lucas_Werkmeister_WMDE renamed this task from Wikibase doesn’t validate formatter options, can crash with different TypeErrors to [TECH][ACTION-API] Wikibase doesn’t validate formatter options, can crash with different TypeErrors.Aug 2 2023, 2:15 PM
ItamarWMDE renamed this task from [TECH][ACTION-API] Wikibase doesn’t validate formatter options, can crash with different TypeErrors to [ACTION-API] [TECH] Wikibase doesn’t validate formatter options, can crash with different TypeErrors.Sep 20 2023, 2:52 PM

Had a quick look at this today. It's pretty easy to reproduce the bug with just the library code on its own - see this fork/branch on Github.

My suggested approach would be first to make the libraries themselves robust and prevent them throwing type errors. To do this, I would deprecate the generic setOption and getOption functions, and replace them with {get,set}{String,Int,Bool}Option functions that are robust to bad input. Then we can catch InvalidFormatterOptions in our API code and do something more graceful there. Would be happy to take a look at this.

Not sure if this code search is exhaustive, but it doesn't look like too many references outside of the library code.

My suggested approach would be first to make the libraries themselves robust and prevent them throwing type errors. To do this, I would deprecate the generic setOption and getOption functions, and replace them with {get,set}{String,Int,Bool}Option functions that are robust to bad input. Then we can catch InvalidFormatterOptions in our API code and do something more graceful there. Would be happy to take a look at this.

I’m not sure what the benefit of typed setter functions is. The options can be set via an API parameter, and the API module doesn’t have any idea what the expected type is; it could use the type of the given value, but then I don’t think “call setStringOption() if it’s a string, setBoolOption() if it’s a bool, etc.” has any benefit over “call setOption() with whatever the value is”. Typed getter functions, on the other hand, could be useful to reduce duplicate code in the formatters. But I’m not yet sure what the getter should do if the option has been set to a different type than expected: return a default value (e.g. null) that the caller can handle (e.g. $options->getString( self::OPT_NORTH_SYMBOL ) ?? 'N'), or just throw an exception? I think we should check if there’s any existing code that already handles invalid options, and whether it does that consistently.

I also notice that the formatters sometimes define default options upfront, like here in LatLongFormatter’s constructor:

$this->defaultOption( self::OPT_NORTH_SYMBOL, 'N' );
$this->defaultOption( self::OPT_EAST_SYMBOL, 'E' );
$this->defaultOption( self::OPT_SOUTH_SYMBOL, 'S' );
$this->defaultOption( self::OPT_WEST_SYMBOL, 'W' );

Arguably we could infer the expected type here – if the default is a string, it’s reasonable to assume that any custom value must also be a string (perhaps with an escape hatch in case other types should be supported after all). I also have some open work (from a previous GB&C event) to replace defaultOption() (mutates instance) with withDefaultOption() (returns new copy), see pull request, that I still want to get back to; perhaps we can combine this somehow.

As I understand your comment, you are not sure what the best approach would be here, but you nevertheless have some opinions about what it might look like and should not look like. If I'm going to pick the ticket up and work on it, it would be helpful for me to understand:

  • What do you insist is part of any proposed solution, and want do you insist must not be part of the solution?
    • Are you against typed setters? Or just don't see the benefit?
    • Are you for typed getters? Or are they just "nice to have"?
    • Do you insist that the solution should throw an exception? Or swallow exceptions? Or are you open to both, or neither?
  • In terms of checking existing code, that seems quite a broad search scope. Do you have any concrete examples in mind of best practices that you would like to follow / worst practices you would like to avoid? What evidence could I supply to convince you that there is no such code?
  • Is it mandatory for you that any proposed solution includes your withDefaultOption change?

Basically I would like to avoid that I spend time working on a solution that are going to reject. If there's not enough information now for you to be clear what you want, please try and scope the investigation so that we can make it into a concrete spike that I can work on.

I guess I should start with what I had in mind in some form – I imagined the implementation would look something like this, repeated a lot of times:

diff --git a/src/Formatters/LatLongFormatter.php b/src/Formatters/LatLongFormatter.php
index a10970aef8..765f672218 100644
--- a/src/Formatters/LatLongFormatter.php
+++ b/src/Formatters/LatLongFormatter.php
@@ -194,9 +194,14 @@ private function getSpacing( string $spacingLevel ): string {
 	}
 
 	private function formatLatitude( float $latitude, float $precision ): string {
+		$northSymbol = $this->options->getOption( self::OPT_NORTH_SYMBOL );
+		if ( !is_string( $northSymbol ) ) {
+			throw new InvalidArgumentException( self::OPT_NORTH_SYMBOL . ' must be a string' );
+		}
+
 		return $this->makeDirectionalIfNeeded(
 			$this->formatCoordinate( $latitude, $precision ),
-			$this->options->getOption( self::OPT_NORTH_SYMBOL ),
+			$northSymbol,
 			$this->options->getOption( self::OPT_SOUTH_SYMBOL )
 		);
 	}

Relative to this, an $options->getStringOption() method could be seen as a way to reduce code duplication – whereas it’s not clear to me how setStringOption() fits in. That doesn’t necessarily mean I’m vetoing it, I just don’t understand your proposal yet.

Are you for typed getters? Or are they just "nice to have"?

Nice to have, I guess, but there’s probably not really a reason not to have them? Unless we change the approach in some way I’m not seeing at the moment.

Do you insist that the solution should throw an exception? Or swallow exceptions? Or are you open to both, or neither?

I guess that’s actually a product question, in the end…? When users make API requests with invalid options, should we return an error or ignore the invalid options? (I think I would prefer throwing an exception which the API turns into an error, but I’d be fine with both.)

In terms of checking existing code, that seems quite a broad search scope. Do you have any concrete examples in mind of best practices that you would like to follow / worst practices you would like to avoid? What evidence could I supply to convince you that there is no such code?

That’s a fair question ^^ I sidestepped it by just taking a look myself now, with grep -rFA5 --color=yes '>getOption(' | less -R in vendor/data-values/. I found three places that seemed to do any kind of checking of options:

  • DmsCoordinateParser, OPT_DEGREE_SYMBOLthrows an exception if the degree symbol isn’t found in the “coordinate segment”, whatever exactly this means. (Note that this is technically a parser, with ParserOptions instead of FormatterOptions. But they’re similar enough in principle that I think it’s still useful here.)
  • LatLongFormatter, OPT_PRECISIONignores the option if it’s not a string, float or int.
  • LatLongFormatter, OPT_FORMATthrows an exception if the format isn’t one of four well-known strings (though it also depends on the precision, I think?).

Tons of other getOption() calls had no obvious error handling or type checking of any kind. So at the moment my impression is there is barely any existing code, and what little code exists isn’t even consistent, so we’re free to come up with whatever we want.

Is it mandatory for you that any proposed solution includes your withDefaultOption change?

Definitely not. If you can find a good use for it, maybe it’s worth including, otherwise it can stay shelved until some future GB&C as far as I’m concerned.

Okay. So since the code currently throws an exception in these cases, and the exception is presented to the users in the document response, it seems that catching that in the library and throwing a more specific exception is not a product-relevant change. Let's tidy up how the library handles invalid input, make the exceptions regular, then file another bug for the wbformatvalue action to process the exceptions in a more friendly way.

Per comment from in tech grooming @Lucas_Werkmeister_WMDE , we probably need to introduce the new exception class in the DataValues top-level library, and then throw it from the specific module (e.g. GeoValues).

Change 994017 had a related patch set uploaded (by Arthur taylor; author: Arthur taylor):

[mediawiki/extensions/Wikibase@master] Catch TypeErrors in value formatters for wbformatvalue

https://gerrit.wikimedia.org/r/994017

I had a chat with the maintainer of the DataValues library in this pull request. They were not interested in refining the way the library processes errors, so for now the proposed patch just catches raw TypeErrors that the ValueFormatters might generate.

The code for DataValues is pretty static. It would not be a problem just to fork the library and make the changes that we want. We would need to decide how much effort we want to put into making wbformatvalue a tidy API, vs. just ring-fencing the issues with the library and maintaining a stable API interface on our side.

According to T260553#9498017, the set of formatting options people actually use, at least via the API, is pretty limited. (According to T323776, the Lua interface doesn’t permit configuring the options; I’m not aware of any other place that makes the options configurable either.) If Jeroen doesn’t want the DataValues libraries to be improved, then I think we should just hard-code the set of options supported by wbformatvalue, with the allowed types / values for each option, and throw away any other options (perhaps with a warning).

(In the longer term, it probably makes sense for us to fork the libraries, but I don’t think it’s worth it at the moment.)

@Lucas_Werkmeister_WMDE - Okay. That does sound like a product decision though. Are you against merging the proposed changes to resolve this error before we start the discussion about changing the API?

@Lucas_Werkmeister_WMDE - Okay. That does sound like a product decision though.

Yeah, that’s fair.

Are you against merging the proposed changes to resolve this error before we start the discussion about changing the API?

Yes, I don’t think fixing this is urgent at all.

Pinging @Arian_Bozorg for product :) my summary:

The wbformatvalue API has an options parameter that can be used to set various options that affect value formatting. Some of these options can theoretically be abused to trigger errors (which are mostly harmless except for logspam). According to tracking we added two years ago, only a few options are actually being used by API users.

We suggest to change the API so that only those options are allowed. (The format of the options parameter would stay the same, so that nobody has to update their code.) Anyone who tried to use the more “exotic” options would get an API error, even if the options were in theory supported by the underlying formatter. This would probably be a breaking change and should be announced as such; on the other hand, I think it also gives us an opportunity to document the supported options at all for the first time (I don’t think they’re currently documented).

Does that sound okay?

Just to reiterate my perspective here - it would be straightforward to fork DataValues or merge the code into our codebase, tidy up the exception handling and resolve this issue from the tech side. Regardless of what happens with the options parameter, it would be good to update DataValues not to throw unsanitised TypeErrors. I agree that this is not urgent, but I see it as a separate question from what we want to do with the wbformatvalue API (which is certainly an interesting, but also a much longer conversation).

Quick response from a dev who worked with this code back in 2014 and is at least partially responsible for the mess. 😇️

The formatter options are (mis)used for two, maybe three different purposes.

  1. Only very few options are meant to accept values the user can freely choose from.
  2. Many options are never presented to the user. They are part of a round-trip where the decision is made in the backend, send as part of the HTML/JS to the client, where they become part of formatter calls in the UI. I think nobody bothered adding user-friendly validation here because it would be unreachable in all real-world scenarios.
  3. Some options are not meant to be accessible from the outside at all. They became an option either because it was unclear that this would make it public, or because we didn't know better, ignored YAGNI, and made it public anyway. Please feel free to identify these, remove them from the options system and turn them into dedicated constructor parameters.