Page MenuHomePhabricator

Use phan setting 'scalar_implicit_cast` = false without using strict_types = 1 in mediawiki-phan-config?
Closed, ResolvedPublic

Description

With T232949 and the release of mediawiki/mediawiki-phan-config to 0.9.0 the phan setting scalar_implicit_cast was set to false.

Explaination:
scalar_implicit_cast: as the name would suggest, it considers scalar types to be equivalent. While MW is pretty lax, this could hide bugs, and there are cases where scalars are not equivalent

This makes phan strict about string <-> int mismatches (or all other scalar type mismatches with bool and float) and enforce to add explict cast operator (int) or intval() to ensure that the correct type is used.

One discussion can be read on https://gerrit.wikimedia.org/r/#/c/mediawiki/core/+/560430/

  • There are some situation where this can help to choose the right function (in context of WebRequest to use getInt or getBool or getVal) and avoids injection when the backend does not escape correctly.
  • It can also makes it harder to document function which takes all scalar types or arrays with all scalar types int|string|float|(int|string|float)[]
  • When calculate with timestamp, than wfTimestamp( TS_UNIX ) needs a cast, because it returns a string

How to handle this in the future?

Event Timeline

Thanks @Umherirrender for opening the task. I'd like to add a couple of words, mostly what I wrote on gerrit.

  • First, I'd like to note that the setting is configurable, hence it may be selectively enabled/disabled in each repo. We just turned it off by default, and while doing that, I thought that we would've probably ended up with keeping it enabled somewhere (e.g. core) due to the high amount of issues.
  • This setting brings the benefits of strict_types (i.e. stricter control on scalar types), but without failing if there's a mismatch at runtime. It also helps us see where we have scalar mismatches, and take appropriate action.
  • Sometimes, the issues will be annoying. For instance, the example with wfTimestamp in the task description. Sometimes, they will unveil actual bugs; T232651 is an example; we've also found other examples while upgrading phan, e.g. some code that was localizing a number, then passing it to a method that expected an actual number.
  • As I said, the setting is configurable, but there's a problem with core. If a core method has an incomplete docblock (e.g. a scalar type missing from a @param annotation), every caller will be marked as "wrong" even outside of core. Hence, it's still necessary to fix some core methods: the ones with a mismatch that are called often enough. Fortunately, there aren't many.
  • Lastly, and for completeness, I'd like to note that this setting was disabled together with null_casts_as_any_type. As the name would suggest, this setting is specific for null, and causes it to be treated as any other scalar type. The difference with the other setting can especially be seen with typehints: suppose we have the following code
/**
 * @param int $par
 */
function foo( $par ) {
}

foo( '42' );
foo( null );

Disabling scalar_implicit_cast would flag the first call, while disabling null_casts_as_any_type would flag the second. Now, regardless of phan, someone may be tempted to typehint $par as integer. After doing so, the first call would still succeed (and $par be cast to int implicitly). However, the second would fail hard. Hence I find this setting more important than the other.

I worked on a few patches that involved version 0.9.0 of the Phan config which started reporting implicit casts. I was really open and curious. In the end I must say I was disappointed.

Here is a made-up example:

function example( int $b ) : int {
    return strlen( $b );
}

Questions:

  • Does the ) : int return type make this code better? My opinion: yea, a little bit. It's a guarantee that can not be violated. The type hint is equivalent to having an assertion in a test. Even better, because it is guaranteed to cover all (future) use cases, not only the few that are run as part of the test. Sure, in this trivial example it is very much obvious that this method can not return anything but an int. But not all code is so obvious.
  • Does the int $b type make this code better? My opinion: Yes, a lot. I find it very significant to be able to design code to be as crystal clear as possible. Even if the implementation in my example might be able to consume other things, it's not designed to do this. This function is designed to consume integers. It helps if the function header can say this.
  • And finally, does a strlen( (string)$b ) cast make this code better? My opinion: No. It's not even relevant for the design of this functions signature. It doesn't add any additional guarantee. All we need to know about this code is already there: $b is guaranteed to be an integer, and strlen() is guaranteed to cast it to a string.

My conclusion: Please roll the change back.

Here is a made-up example:

I know that I'm going to comment on a made-up example (which, consequently, isn't realistic), but:

  • And finally, does a strlen( (string)$b ) cast make this code better? My opinion: No. It's not even relevant for the design of this functions signature. It doesn't add any additional guarantee. All we need to know about this code is already there: $b is guaranteed to be an integer, and strlen() is guaranteed to cast it to a string.

While I agree with this, here the correct solution is not to add a (string) cast. If the method only needs to use $b as a string (like the made-up example), then $b should be typehinted/documented as string. If both int and string can be accepted (granted that it makes sense for the method in question and there's no alternative), then I don't think it should have an int typehint. Now, for your example it wouldn't probably make much difference, but if things were the other way around:

function example( string $x ) {
  // ...
  // Here goes something that assumes $x is an integer; e.g:
  return $x > 0;
  // Or
  return Language::MONTH_MESSAGES[$x];
  // Or
  return Language::MONTH_MESSAGES[$x + 1];
  // Or whatnot
}

then bad things could happen on the caller side. For instance:

example( '1' ); // This will work
example( '1 bottle of beer' ); // And what about this? When $x is implicitly cast, it will have the same effect as the previous call, without even emitting a PHP notice. But was it intended?
example( ( new LanguageAr )->formatNum( 9 ) ); // Uh-oh. And this is a real-world example. If, for instance, the method above was named "getMonthName", there may well be callers like this one.

This phan setting would force us to question whether our typehints/docs are correct. And while, sometimes, the right solution is adding explicit casts, sometimes it's not. The three lines above wouldn't raise any notice, and phan is the only tool we have that is capable of reporting them (together with tests, of course, but code coverage isn't really high at the moment).

My conclusion: Please roll the change back.

Note that we could also simply override the config where we don't want to fix these errors (see codesearch). Sure, we could also rollback, and re-enable the setting where it's already passing, but that would require more work.

My opinion: If a code base is intending to use strict_types, then setting Phan's scalar_implicit_cast to false makes sense. If it's not, then setting Phan that way is most likely just noise.

Since the default for MediaWiki development seems to be to not use strict_types, the default Phan configuration should set scalar_implicit_cast to true. As noted, for code that actually wants this behavior it can be individually configured to false.

I'm not convinced by the "sometimes, they will unveil actual bugs". Yes, sometimes it will, but it seems much more often it's just encouraging the pointless explicit casts that are the reason we tend not to want to use strict_types.

[…] $b should be typehinted/documented as string.

Uh, sorry. It seems my example was not good enough. The point I tried to make was: In the example, the function is intentionally designed to only accept integers. The fact it does a string cast is an implementation detail. That should not be exposed.

In this context, how does it help to force me adding a (string) token to the implementation?

return $x > 0;
return Language::MONTH_MESSAGES[$x];
return Language::MONTH_MESSAGES[$x + 1];

I honestly don't see what the problem with this code is. Sure, it might accidentally interpret strings like "9a" as 9. So what? How does this change by adding an (int) cast?

we could also rollback, and re-enable the setting where it's already passing, but that would require more work.

We don't need to do that. Again, what would be the benefit? The contrary. Enforcing this setting just because it passes now does not mean it will always pass with code that is going to be written in the future. We will face all the same problems then.

Let's please not force devs to add pointless type casts that don't do anything. If the developers of a codebase actively ask for it, they can turn these warnings on.

Change 769514 had a related patch set uploaded (by Umherirrender; author: Umherirrender):

[mediawiki/core@master] phan: Enable strict checks

https://gerrit.wikimedia.org/r/769514

Change 771705 had a related patch set uploaded (by Umherirrender; author: Umherirrender):

[mediawiki/core@master] phan: Disable scalar_implicit_cast setting

https://gerrit.wikimedia.org/r/771705

Change 771705 merged by jenkins-bot:

[mediawiki/core@master] phan: Disable scalar_implicit_cast setting

https://gerrit.wikimedia.org/r/771705

Umherirrender claimed this task.
Umherirrender edited projects, added MediaWiki-General; removed Patch-For-Review.

Change 769514 merged by jenkins-bot:

[mediawiki/core@master] phan: Disable null_casts_as_any_type setting

https://gerrit.wikimedia.org/r/769514