Page MenuHomePhabricator

Application Security Review Request : theresnotime/ipa-validator PHP library
Closed, ResolvedPublic

Description

Project Information

Description of the tool/project:
Composer package for validating and normalizing IPA. Authored by @TheresNoTime (WMF staff).

Description of how the tool will be used at WMF:
To normalize IPA input in the MediaWiki-extensions-Phonos extension. See r829068; we're only using the normalization bits, not validation. We have spoken with @sbassett who said this probably would not require a full vendor security review, and @Reedy concurred it is a low-risk package. This task was filed by request so as to leave a paper trail.

Dependencies
None.

Has this project been reviewed before?
No.

Working test environment

Please link or describe setup process for setting up a test environment.

Post-deployment

Name of team responsible for tool/project after deployment and primary contact.

Details

Other Assignee
TheresNoTime
Risk Rating
Low
Related Changes in Gerrit:

Event Timeline

Big +1 to all Librarization efforts, but why is this repository being maintained in a personal GitHub instead of in Gerrit under mediawiki/libs/?

Big +1 to all Librarization efforts, but why is this repository being maintained in a personal GitHub instead of in Gerrit under mediawiki/libs/?

Not sure, but it should be fairly easy to move/fork to gerrit? We could definitely add that as a recommendation/best practice within the forthcoming security review.

Change 831998 had a related patch set uploaded (by MusikAnimal; author: MusikAnimal):

[mediawiki/vendor@master] Add theresnotime/ipa-validator

https://gerrit.wikimedia.org/r/831998

this report will be delivered by November 4th at the latest. Please let us know if that still works with your timeline

JMcLeod_WMF subscribed.

@Mstyles, November 4th does work with our timeline. Are we on track to meet this date?

Security Review Summary - T316913 - 2022-11-04

Overall, the current vendor code under consideration...
has an overall risk rating of: low.
Typically projects with such a low contributor count are considered high risk, but since the contributors are WMF staff, the risk is slightly
mitigated. However those staff members could still leave the foundation, so the team should really consider keeping an eye on this project as it could
quickly become unmaintained.

ipa-validator PHP library

General Security Information

Statistic/InfoValueRisk
Repositoryhttps://github.com/theresnotime/php-ipa-validator none
Relevant tag/branchmaster none
Last commit reviewed (if relevant)e8c34df none
Recent contributions to code (6 months)20 medium
Active developers with > 10 commits1 medium
Current overall usage2 stars, 3 forks medium
Current open security issues0 none

Vulnerable Packages
no vulnerable packages found with php security checker none
no vulnerable packages found with snyk test none

several php deprecation notices, as listed below

PHP Deprecated:  Return type of Symfony\Component\Console\Helper\HelperSet::getIterator() should either be compatible with IteratorAggregate::getIterator(): Traversable, or the #[\ReturnTypeWillChange] attribute should be used to temporarily suppress the notice in phar:///usr/local/Cellar/composer/2.0.8/bin/composer/vendor/symfony/console/Helper/HelperSet.php on line 112
Deprecated: Return type of Symfony\Component\Console\Helper\HelperSet::getIterator() should either be compatible with IteratorAggregate::getIterator(): Traversable, or the #[\ReturnTypeWillChange] attribute should be used to temporarily suppress the notice in phar:///usr/local/Cellar/composer/2.0.8/bin/composer/vendor/symfony/console/Helper/HelperSet.php on line 112
Deprecation Notice: strlen(): Passing null to parameter #1 ($string) of type string is deprecated in phar:///usr/local/Cellar/composer/2.0.8/bin/composer/vendor/justinrainbow/json-schema/src/JsonSchema/Constraints/Constraint.php:48
Deprecation Notice: Return type of Composer\Repository\ArrayRepository::count() should either be compatible with Countable::count(): int, or the #[\ReturnTypeWillChange] attribute should be used to temporarily suppress the notice in phar:///usr/local/Cellar/composer/2.0.8/bin/composer/src/Composer/Repository/ArrayRepository.php:277
Deprecation Notice: Return type of Composer\Repository\ArrayRepository::count() should either be compatible with Countable::count(): int, or the #[\ReturnTypeWillChange] attribute should be used to temporarily suppress the notice in phar:///usr/local/Cellar/composer/2.0.8/bin/composer/src/Composer/Repository/ArrayRepository.php:277
Deprecation Notice: Return type of Composer\Repository\ArrayRepository::count() should either be compatible with Countable::count(): int, or the #[\ReturnTypeWillChange] attribute should be used to temporarily suppress the notice in phar:///usr/local/Cellar/composer/2.0.8/bin/composer/src/Composer/Repository/ArrayRepository.php:277
Deprecation Notice: Return type of Composer\Repository\CompositeRepository::count() should either be compatible with Countable::count(): int, or the #[\ReturnTypeWillChange] attribute should be used to temporarily suppress the notice in phar:///usr/local/Cellar/composer/2.0.8/bin/composer/src/Composer/Repository/CompositeRepository.php:180

Outdated Packages
As reported via composer outdated:
(no explicit vulnerabilities reported, simply noting for completeness' sake.)

PackageCurrentWantedDescription
composer/pcre1.0.13.0.2PCRE wrapping library that offers type-safe preg_*
composer/xdebug-handler2.0.53.0.3Restarts a process without Xdebug.
mediawiki/mediawiki-codesniffer39.0.040.0.1MediaWiki CodeSniffer Standards
mediawiki/mediawiki-phan-config0.11.10.12.0Standard MediaWiki phan configuration
mediawiki/phan-taint-check-plugin3.3.24.0.0A Phan plugin to do security checking
microsoft/tolerant-php-parser0.1.10.1.2Tolerant PHP-to-AST parser designed for IDE usage
phan/phan5.2.05.4.1A static analyzer for PHP
phpdocumentor/type-resolver1.6.11.6.2A PSR-5 based resolver
phpunit/php-code-coverage9.2.179.2.18Library that provides collection, processing and more
phpunit/phpunit9.5.249.5.26The PHP Unit Testing framework.
psr/log2.0.03.0.0Common interface for logging libraries
sabre/event5.1.46.0.0sabre/event is a library for lightweight event-based pr...
sebastian/comparator4.0.64.0.8Provides the functionality to compare PHP values for eq...
sebastian/exporter4.0.44.0.5Provides the functionality to export PHP variables for ...
sebastian/type3.1.03.2.0Collection of value objects that represent the types of...
squizlabs/php_codesniffer3.6.23.7.1PHP_CodeSniffer tokenizes PHP, JavaScript and CSS files...
symfony/console5.4.126.1.7Eases the creation of beautiful and testable command li...
symfony/deprecation-contracts3.0.23.1.1A generic function and convention to trigger deprecatio...
symfony/service-contracts3.0.23.1.1Generic abstractions related to writing services
symfony/string6.0.126.1.7Provides an object-oriented API to strings and deals wi...

Static Analysis Findings

  1. gitleaks returned no results. low risk
  2. git secrets run with a custom pattern set, returned no actionable results. low risk
  3. whispers returned no issues higher than medium. low risk
  4. scan scan:latest returned no results low risk
  5. semgrep 0.120.0 was run w 869 rules on 13 files and returned 2 low risk findings noted below low risk

    Details: https://sg.run/0Qzw
src/Validator.php
   php.lang.security.preg-replace-eval.preg-replace-eval
      Calling preg_replace with user input in the pattern can lead to arbitrary code execution. The
      eval modifier (`/e`) evaluates the replacement argument as code.

       80┆ $this->normalizedIPA = preg_replace( $this->diacriticsRegex, '', $this->normalizedIPA );
        ⋮┆----------------------------------------
      141┆ $this->normalizedIPA = preg_replace( $this->stripRegex, '', $this->normalizedIPA );

General Security Issues

  1. No critical or high issues were found with this library
  2. The highest risk is the project becoming unmaintained due to very few contributors
sbassett changed the task status from Open to In Progress.Nov 7 2022, 8:55 PM
sbassett triaged this task as Low priority.
sbassett moved this task from In Progress to Waiting on the secscrum board.

Thanks for the review, @Mstyles! I think you're saying it passed? (As I'm writing this I see sbasset changed the status to "In progress", so I guess the review isn't complete yet)

The deprecations appear to be dev dependencies and won't appear in production, but we can get those fixed anyway. In production, only the Validator.php class is used and it has no dependencies.

The patch to add this lib to mediawiki/vendor has now been updated https://gerrit.wikimedia.org/r/c/mediawiki/vendor/+/831998. I assume your team will do the +2'ing whenever the time comes?

Hey @MusikAnimal -

So I set the task as "in progress" as the review is now in your team's court, so to speak. Yes, the review came in as low risk, so technically no further action is required as that level of risk is automatically accepted by the WMF. Though we do want to allow code owners and maintainers an opportunity to address any issues surfaced during the review, either as related sub-tasks/change sets or as WONTFIX items.

Okay thanks for clarifying :)

The only issue I see worth investigating further is the callout about "preg_replace with user input in the pattern can lead to arbitrary code execution". That obviously sounds bad, but we don't appear to be using the /e eval modifier, nor is there user input on the pattern. So I'm guessing that note was just an FYI and not actually an actual concern? If that's the case, I think we're all set on our end.

I guess before formally asking for a +2 on https://gerrit.wikimedia.org/r/c/mediawiki/vendor/+/831998, let me ping @TheresNoTime to see if they want to move the code under the Wikimedia umbrella as suggested at T316913#8207507? In the process we might as well update the dev-only packages that are apparently causing the deprecation warnings.

Adding to the sprint since this may need attention from us.

The only issue I see worth investigating further is the callout about "preg_replace with user input in the pattern can lead to arbitrary code execution". That obviously sounds bad, but we don't appear to be using the /e eval modifier, nor is there user input on the pattern. So I'm guessing that note was just an FYI and not actually an actual concern? If that's the case, I think we're all set on our end.

Yep, that's a false positive for sure. Semgrep likely just sees two dynamic variables and assumes they could be tainted with user input. I'm hopeful that as semgrep's newer taint mode becomes more widely employed, it will lead to fewer issues like this, as $stripRegex = "/[\/\[\]]/ui"; and $diacriticsRegex = "/[\x{0300}-\x{036f}]/ui"; are obviously both fine.

I guess before formally asking for a +2 on https://gerrit.wikimedia.org/r/c/mediawiki/vendor/+/831998, let me ping @TheresNoTime to see if they want to move the code under the Wikimedia umbrella as suggested at T316913#8207507? In the process we might as well update the dev-only packages that are apparently causing the deprecation warnings.

Sounds good. Again, just to be clear, the Security-Team rates things in terms of risk as opposed to providing a +2 or not. I'd imagine @Mstyles or @Reedy or myself would be fine +2'ing that change set if you'd like, but risk is the Security-Team's primary unit of evaluation.

Okay thanks for clarifying :)

The only issue I see worth investigating further is the callout about "preg_replace with user input in the pattern can lead to arbitrary code execution". That obviously sounds bad, but we don't appear to be using the /e eval modifier, nor is there user input on the pattern. So I'm guessing that note was just an FYI and not actually an actual concern? If that's the case, I think we're all set on our end.

I guess before formally asking for a +2 on https://gerrit.wikimedia.org/r/c/mediawiki/vendor/+/831998, let me ping @TheresNoTime to see if they want to move the code under the Wikimedia umbrella as suggested at T316913#8207507? In the process we might as well update the dev-only packages that are apparently causing the deprecation warnings.

Adding to the sprint since this may need attention from us.

Happy to move it under the wikimedia org :)

It should be hosted on Gerrit under mediawiki/libs/ unless there's some compelling reason to not do so.

sgtm! I'll leave that to whomever to sort (as I have no idea how to even go about that... would I first need to request a gerrit repo..?)


Now at https://gerrit.wikimedia.org/g/mediawiki/libs/IPAValidator — waiting on T322736: Create project tag for IPA-Validator before logging tasks to "un-GitHub-ify" the repo (remove GitHub actions, update README etc.)

Hey @MusikAnimal -

I see @Reedy left a couple of comments on the gerrit change set. Would you or @TheresNoTime be able to provide a comprehensive diff of any changes to the package since the security review @maryum posted above on 11/4/2022? And any of the administrative updates @TheresNoTime was working on above? Thanks.

Using Maryum's commit listed of e8c34df vs HEAD of writing, d6881a4 - https://github.com/wikimedia/mediawiki-libs-IPAValidator/compare/e8c34df...d6881a4 the only "code" change I see is the namespacing as part of the move to gerrit etc.

The rest is mostly packaging and the like.

Using Maryum's commit listed of e8c34df vs HEAD of writing, d6881a4 - https://github.com/wikimedia/mediawiki-libs-IPAValidator/compare/e8c34df...d6881a4 the only "code" change I see is the namespacing as part of the move to gerrit etc.

The rest is mostly packaging and the like.

Fair enough. Then there isn't really anything blocking the merge of c831998 from a security standpoint. So once any administrative, packaging, cleanup issues are resolved, it should be fine to merge through standard code review, IMO.

As a frequent editor of IPA-related articles on Wikipedia, I'd like to note that the whole idea of converting an allophonic non-TTS-compatible transcription to a phonemic TTS-compatible one makes no sense. Depending on the conventions used, the intent of the transcriber, and the context, e.g. the allophonic transcription [pet] could be a transcription of any of pet /pɛt/, bed /bɛd/, pate /peɪt/, bade /beɪd/, and more (here using the conventions on English Wikipedia). This is even if you know the accent being transcribed. So if you include various accents, all bets are off.

The point is, you cannot convert a transcription to a TTS-compatible one without knowing the conventions being used, and what conventions are used should be within the purview of the editors of each project, so it should not be hard-coded in software. It should be done in templates and modules.

The idea of "validating" an IPA transcription should also be taken with extreme caution. What does one mean by "an IPA transcription"? Linguists use the IPA for all sorts of purposes. In archiphonemic representation, for instance, IPA letters are mixed with capital letters. There are the extensions to the IPA, some of which are used for non-disordered speech (e.g. Korean) as well. "ʱ" is extremely common even though it's never appeared on the official IPA chart (Unicode recently adopted superscript versions of all IPA letters—they should probably be included if you want to be comprehensive). This question needs to be answered first: when you say "IPA validator", what are you validating the input to be?

As a frequent editor of IPA-related articles on Wikipedia, I'd like to note that the whole idea of converting an allophonic transcription to a phonemic one makes no sense. Depending on the conventions used, the intent of the transcriber, and the context, the allophonic transcription [pet] could be a transcription of any of pet /pɛt/, bed /bɛd/, pate /peɪt/, bade /beɪd/, and more. This is even if you know the accent being transcribed. So if you include various accents, all bets are off.

The point is, you cannot convert a transcription to a TTS-compatible one without knowing the conventions being used, and what conventions are used should be within the purview of the editors of each project, so it should not be hard-coded in software. It should be done in templates and modules.

The idea of "validating" an IPA transcription should also be taken with extreme caution. What does one mean by "an IPA transcription"? Linguists use the IPA for all sorts of purposes. In archiphonemic representation, for instance, IPA letters are mixed with capital letters. There are the extensions to the IPA, some of which are used for non-disordered speech (e.g. Korean) as well. "ʱ" is extremely common even though it's never appeared on the official IPA chart (Unicode recently adopted superscript versions of all IPA letters—they should probably be included if you want to be comprehensive). This question needs to be answered first: when you say "IPA validator", what are you validating the input to be?

Thank you for raising these concerns (and for reaching out personally prior to doing so!) - I've cut this to T323912: IPA validation and conversion is flawed/unnecessary so it can get the specific attention it requires.

@TheresNoTime @MusikAnimal - Just reaching out to see if there's any update on whether or not this library is still planned for wikimedia production use. This is more just for the Security-Team's curiosity at this point as the security review has already come back as low risk.

@TheresNoTime @MusikAnimal - Just reaching out to see if there's any update on whether or not this library is still planned for wikimedia production use. This is more just for the Security-Team's curiosity at this point as the security review has already come back as low risk.

We'll not be using it anymore. I'm sorry we didn't realize this before the security review happened

sbassett lowered the priority of this task from Low to Lowest.
sbassett moved this task from Waiting to Our Part Is Done on the secscrum board.

Change 831998 abandoned by MusikAnimal:

[mediawiki/vendor@master] Add theresnotime/ipa-validator 1.1.1

Reason:

https://gerrit.wikimedia.org/r/831998