Page MenuHomePhabricator

Security review for TextCat library
Closed, ResolvedPublic

Description

Requesting security review for PHP port of TextCat language detection library. The library is located in wikimedia/textcat repository, https://github.com/wikimedia/wikimedia-textcat

Projected usage is detecting language of queries and potentially other texts. The main class to be uses is TextCat.php, other PHP files are command-line utilities.

Event Timeline

Smalyshev raised the priority of this task from to Needs Triage.
Smalyshev updated the task description. (Show Details)
Smalyshev added a subscriber: Smalyshev.
Smalyshev set Security to None.

300 lines of php, should be a quick review.

This has been reviewed. It looks good, generally, and seems to be quite an elegant solution. Only one minor issue was found:

General Observations

  • Positive
    • Limited attack surface
    • Code is well-documented and highly readable
  • Negative
    • None
  • Neutral
    • None

Issues

LOW

LM File Loading May Provide Vector for Malicious Code Execution
 - LM files are plain PHP. If the package is deployed directly from Github, ensure that any stock LM files are reviewed prior to use to ensure that malicious code has not been inserted.

Configuration Recommendations

None

Files

./catus.php

OK

./felis.php

OK

./lm2php.php

OK

./TextCat.php

LOW
- LM files loaded via `include()` may allow code execution
dpatrick triaged this task as Medium priority.
dpatrick added a subscriber: dpatrick.

Yes, I know the direct file loading is tricky, the reason it is so if because loading PHP file is the fastest way to load data into PHP, there's a lot of data to be loaded and using caching infrastructure of HHVM would be very beneficial. We do not yet have finalized the LM models, that's T123537. Once it is done, we'll ping you again.