Page MenuHomePhabricator

Security review for TextCat library
Closed, ResolvedPublic

Description

Requesting security review for PHP port of TextCat language detection library. The library is located in wikimedia/textcat repository, https://github.com/wikimedia/wikimedia-textcat

Projected usage is detecting language of queries and potentially other texts. The main class to be uses is TextCat.php, other PHP files are command-line utilities.

Event Timeline

Smalyshev raised the priority of this task from to Needs Triage.
Smalyshev updated the task description. (Show Details)
Smalyshev added a subscriber: Smalyshev.
Restricted Application added subscribers: StudiesWorld, Aklapper. · View Herald TranscriptJan 13 2016, 8:18 PM
Smalyshev updated the task description. (Show Details)Jan 13 2016, 8:18 PM
Smalyshev set Security to None.

300 lines of php, should be a quick review.

Smalyshev moved this task from Needs triage to Search on the Discovery board.Jan 15 2016, 10:10 PM

This has been reviewed. It looks good, generally, and seems to be quite an elegant solution. Only one minor issue was found:

General Observations

  • Positive
    • Limited attack surface
    • Code is well-documented and highly readable
  • Negative
    • None
  • Neutral
    • None

Issues

LOW

LM File Loading May Provide Vector for Malicious Code Execution
 - LM files are plain PHP. If the package is deployed directly from Github, ensure that any stock LM files are reviewed prior to use to ensure that malicious code has not been inserted.

Configuration Recommendations

None

Files

./catus.php

OK

./felis.php

OK

./lm2php.php

OK

./TextCat.php

LOW
- LM files loaded via `include()` may allow code execution
dpatrick triaged this task as Medium priority.
dpatrick added a subscriber: dpatrick.
Smalyshev added a comment.EditedJan 22 2016, 8:16 PM

Yes, I know the direct file loading is tricky, the reason it is so if because loading PHP file is the fastest way to load data into PHP, there's a lot of data to be loaded and using caching infrastructure of HHVM would be very beneficial. We do not yet have finalized the LM models, that's T123537. Once it is done, we'll ping you again.

@dpatrick language models now all updated.

dpatrick moved this task from Incoming to Waiting on the Security-Team board.
dpatrick moved this task from Waiting to Done on the Security-Team board.
Smalyshev closed this task as Resolved.Feb 11 2016, 7:34 PM