Application Security Review Request : language-data library
Closed, Resolved, Public

Description

Project Information

$ scc /home/abijeet/Projects/Wikimedia/language-data/ --exclude-dir=docs,data
───────────────────────────────────────────────────────────────────────────────
Language            Files       Lines    Blanks  Comments       Code Complexity
───────────────────────────────────────────────────────────────────────────────
JSON                    3         127         1         0        126          0
PHP                     3         782       112       223        447         31
YAML                    3          69        11         4         54          0
JavaScript              2         519        42       164        313         50
Markdown                2         500        75         0        425          0
XML                     2          28         0         0         28          0
License                 1         339        58         0        281          0
───────────────────────────────────────────────────────────────────────────────
Total                  16       2,364       299       391      1,674         81
───────────────────────────────────────────────────────────────────────────────
Estimated Cost to Develop (organic) $46,402
Estimated Schedule Effort (organic) 4.28 months
Estimated People Required (organic) 0.96
───────────────────────────────────────────────────────────────────────────────
Processed 92149 bytes, 0.092 megabytes (SI)
───────────────────────────────────────────────────────────────────────────────

Description of the tool/project:
Quoting from the README file of the language-data library:

This library contains language related data, and utility libraries written in PHP and Node.js to interact with that data.

Here's a link to the PHP library: https://language-data.readthedocs.io/en/latest/index.html#using-the-php-library

Volunteers contribute language information to this YAML file: https://github.com/wikimedia/language-data/blob/master/data/langdb.yaml. This script is then run to generate a JSON file: https://github.com/wikimedia/language-data/blob/master/data/language-data.json. Here's a sample PR: https://github.com/wikimedia/language-data/pull/453

This data is then loaded into the PHP library via the following code:

$this->data = json_decode( file_get_contents( __DIR__ . '/' . self::LANGUAGE_DATA_PATH ) );

Similarly in JavaScript:

const languageData = require( '../data/language-data.json' );
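Both loading lines assume the data file exists and parses cleanly. In PHP, `json_decode` returns `null` on malformed input unless `JSON_THROW_ON_ERROR` is passed, so a failed load can go unnoticed. The following is a minimal Node.js sketch of an explicit load that fails loudly. `loadJsonData`, the temporary file, and its keys are hypothetical and are not part of the language-data API.

```javascript
const fs = require('node:fs');
const os = require('node:os');
const path = require('node:path');

/**
 * Read and parse a JSON data file, throwing a descriptive error instead of
 * returning a partial or empty value.
 * Hypothetical helper for illustration; not part of language-data.
 */
function loadJsonData(filePath) {
  let text;
  try {
    text = fs.readFileSync(filePath, 'utf8');
  } catch (err) {
    throw new Error(`Cannot read ${filePath}: ${err.message}`);
  }

  let data;
  try {
    data = JSON.parse(text);
  } catch (err) {
    throw new Error(`Invalid JSON in ${filePath}: ${err.message}`);
  }

  // The data file is expected to hold a single top-level object.
  if (data === null || typeof data !== 'object' || Array.isArray(data)) {
    throw new Error(`Expected a JSON object in ${filePath}`);
  }
  return data;
}

// Demo against a throwaway file; the key below is made up for the example.
const demoDir = fs.mkdtempSync(path.join(os.tmpdir(), 'langdata-'));
const demoFile = path.join(demoDir, 'demo.json');
fs.writeFileSync(demoFile, JSON.stringify({ example: ['value'] }));
const demoData = loadJsonData(demoFile);
console.log(Object.keys(demoData));
```

Node's `require()` on a `.json` file already throws on a missing file or a syntax error, so the explicit version mainly helps when the path is computed at runtime or a clearer error message is wanted.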

The library is maintained by the Language and Product Localization team at the Wikimedia Foundation.

A new version of the library is released every 6 months, but we will release more often once the library is integrated into MediaWiki.

The language-data library's PHP API currently uses mustangostang/spyc, but we plan to make this a dev-only dependency soon.

Related task: T190129: Consolidate language metadata into a 'language-data' library and use in MediaWiki

Description of how the tool will be used at WMF:

The library will support the language selector that will also be bundled with MediaWiki core, and its use will be extended in the future. See: T190129: Consolidate language metadata into a 'language-data' library and use in MediaWiki

Dependencies
This library has no runtime dependency but has the following dev dependencies:

npm

  • eslint 8.57.0
  • eslint-config-wikimedia 0.31.0
  • mocha 10.6.0

composer

  • ext-curl
  • phpunit/phpunit 9.6.20
  • mediawiki/mediawiki-codesniffer 47.0.0
  • mustangostang/spyc 0.6.3

Has this project been reviewed before?
No, it hasn't.

Working test environment

Post-deployment
The Language and Product Localization team will continue to maintain the library.
Contacts:

Details

Risk Rating
Low

Event Timeline

abi_ renamed this task from Application Security Review Request : ... to Application Security Review Request : language-data library. (Nov 28 2025, 2:01 PM)
sbassett changed the task status from Open to In Progress. (Jan 5 2026, 5:55 PM)
sbassett assigned this task to Mstyles.
sbassett triaged this task as Medium priority.
sbassett moved this task from Upcoming Quarter Planning Queue to In Progress on the secscrum board.
sbassett moved this task from Incoming to In Progress on the Security-Team board.

@abi_ Is this project still scheduled for deployment on Jan 31? I wanted to follow up on the timeline.

No, we've missed the deadline. It would be good to have this reviewed this quarter. We will potentially start integrating it into MediaWiki core towards the end of the quarter.

@abi_ Great, I'll post the review by the end of February so you have plenty of time.

Is there an updated status on the security review?

@Nikerabbit sorry I've been out sick but will post by tomorrow

Security Review Summary - T411267 - 2026-Mar-06
Last commit reviewed: aa1f8b6

Summary

Overall Risk Rating: low

  • No critical production-impacting vulnerabilities detected in application code.
  • Several moderate-to-high risks exist in third-party dependencies.
  • Code hygiene issues (use of potentially dangerous PHP functions) present minor risk if misused. See table below.

PHP Code Hygiene Issues – Dangerous Functions / Static Analysis Findings

| File | Function | Line(s) | Description / Risk | Example Fix / Safer Usage |
| --- | --- | --- | --- | --- |
| LanguageUtil.php | file_get_contents() | 53 | Reads language data file; may allow path traversal or invalid file access. | `$path = realpath(__DIR__ . '/' . self::LANGUAGE_DATA_PATH); if ($path && str_starts_with($path, __DIR__)) { $this->data = json_decode(file_get_contents($path)); } else { throw new Exception("Invalid file path"); }` |
| ulsdata2json.php | file_get_contents() | 20 | Reads langdb.yaml; may allow path traversal or read of unintended files. | `if (is_readable(DATA_DIRECTORY . '/langdb.yaml')) { $yamlLangdb = file_get_contents(DATA_DIRECTORY . '/langdb.yaml'); } else { throw new Exception("Cannot read langdb.yaml"); }` |
| ulsdata2json.php | fopen() | 29 | Opens supplemental data file for writing; may overwrite critical files. | `$supplementalDataFile = fopen($supplementalDataFilename, 'w'); if (!$supplementalDataFile) { throw new Exception("Cannot open file for writing"); }` Ensure proper path validation and permissions. |
| ulsdata2json.php | file_put_contents() | 112 | Writes JSON language data; may overwrite files if path not validated. | `$outputFile = DATA_DIRECTORY . '/language-data.json'; if (str_starts_with(realpath(dirname($outputFile)), DATA_DIRECTORY)) { file_put_contents($outputFile, $jsonVerbose, LOCK_EX); } else { throw new Exception("Invalid output path"); }` |
| ulsdata2json.php | require_once | 15 | Loads vendor/autoload.php; dynamic paths can be unsafe. | `$safeFile = __DIR__ . '/../../vendor/autoload.php'; require_once $safeFile;` Use fixed paths. |
| ulsdata2json.php | print | 19, 34, 42, 50, 109, 114 | Direct output may leak internal state. | `use Psr\Log\LoggerInterface; $logger->info("Reading langdb.yaml...");` Replace print with structured logging. |
| ulsdata2json.php | echo | 81, 88 | Direct output may leak internal state. | `$logger->warning("Unknown language $language for territory $territoryCode");` Replace with logging. |
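
The containment checks suggested in the table compare a resolved path against a directory prefix. A plain prefix comparison also accepts sibling directories whose names merely begin with the same characters, so a separator-aware check may be worth considering. The following is a minimal sketch of that idea in Node.js (the library's other runtime), not code from the repository. `isInsideDirectory` and the `/srv/data` paths are hypothetical, and the check works on path strings only, without resolving symlinks.

```javascript
const path = require('node:path');

/**
 * Return true when `candidate` resolves to `baseDir` itself or to a path
 * inside it. Hypothetical helper for illustration.
 */
function isInsideDirectory(baseDir, candidate) {
  const base = path.resolve(baseDir);
  const target = path.resolve(base, candidate);
  const relative = path.relative(base, target);

  // '' means the same directory. A leading '..' segment, or an absolute
  // result (a different drive on Windows), means the target escapes the base.
  if (relative === '') {
    return true;
  }
  const escapes = relative === '..' ||
    relative.startsWith('..' + path.sep) ||
    path.isAbsolute(relative);
  return !escapes;
}

console.log(isInsideDirectory('/srv/data', 'language-data.json'));  // true
console.log(isInsideDirectory('/srv/data', '../data-evil/x.json')); // false

// For comparison, a bare prefix test accepts the sibling directory:
console.log('/srv/data-evil/x.json'.startsWith('/srv/data'));       // true
```

In PHP the equivalent would be to compare against the base directory with a trailing directory separator appended, or to compare path segments rather than raw strings.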

Vulnerable Packages - PHP

| Package | Version | Severity | CVE / Advisory | Notes |
| --- | --- | --- | --- | --- |
| phpunit/phpunit | 9.6.20 | High | CVE-2026-24765 | Unsafe deserialization in PHPT code coverage handling. Upgrade to latest patched version (>=9.6.34). |
| phpunit/php-code-coverage | 9.2.32 | Moderate | N/A | Upgrade to 10.1.16 to match PHPUnit updates. |
| phpunit/php-file-iterator | 3.0.6 | Moderate | N/A | Upgrade to 4.1.0. |
| phpunit/php-invoker | 3.1.1 | Moderate | N/A | Upgrade to 4.0.0. |

Vulnerable Packages - JavaScript

| Package | Version | Severity | Advisory / Notes | Recommended Fix |
| --- | --- | --- | --- | --- |
| ajv | <6.14.0 | Moderate | ReDoS via $data option | npm audit fix |
| jsdiff | 5.0.0 - 5.2.1 | Moderate | Denial of Service in parsePatch / applyPatch | npm audit fix |
| js-yaml | 4.0.0 - 4.1.0 | Moderate | Prototype pollution via merge (<<) | npm audit fix |
| lodash | 4.0.0 - 4.17.21 | Moderate | Prototype pollution in _.unset / _.omit | npm audit fix |
| minimatch | <=3.1.3, 5.0.0-5.1.7, 9.0.0-9.0.6 | High | ReDoS via repeated wildcards and GLOBSTAR combinatorial backtracking | npm audit fix |
| serialize-javascript | <=7.0.2 | High | RCE via RegExp.flags and Date.prototype.toISOString() | npm audit fix --force (may require Mocha downgrade) |
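
The affected ranges above are compared numerically per component, so for example 6.9.0 falls below 6.14.0 even though it sorts later as a string. As a rough illustration of how such a range can be read, here is a small dependency-free sketch. `compareVersions` and `inRange` are hypothetical helpers, the sample versions are made up, and pre-release or build suffixes are not handled. A real project would more likely rely on the `semver` package (`semver.satisfies`).

```javascript
/**
 * Compare two dotted numeric versions such as "6.9.0" and "6.14.0".
 * Returns a negative number, zero, or a positive number.
 * Hypothetical helper; ignores pre-release and build metadata.
 */
function compareVersions(a, b) {
  const partsA = a.split('.').map(Number);
  const partsB = b.split('.').map(Number);
  const length = Math.max(partsA.length, partsB.length);
  for (let i = 0; i < length; i++) {
    // Missing components count as 0, so "6.14" equals "6.14.0".
    const diff = (partsA[i] || 0) - (partsB[i] || 0);
    if (diff !== 0) {
      return diff;
    }
  }
  return 0;
}

/** True when low <= version <= high, inclusive, as in "5.0.0 - 5.2.1". */
function inRange(version, low, high) {
  return compareVersions(version, low) >= 0 &&
    compareVersions(version, high) <= 0;
}

console.log(compareVersions('6.9.0', '6.14.0') < 0); // true: 9 < 14 numerically
console.log(inRange('5.2.0', '5.0.0', '5.2.1'));     // true
console.log(inRange('5.2.2', '5.0.0', '5.2.1'));     // false
```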

Outdated Packages

npm outdated returned no results.

As reported via composer outdated:

| Package | Current Version | Latest Version | Notes |
| --- | --- | --- | --- |
| phpunit/phpunit | 9.6.20 | 10.5.63 | PHP Unit Testing framework |
| phpcsstandards/phpcsextra | 1.4.0 | 1.5.0 | Collection of sniffs/standards for PHP_CodeSniffer |
| phpcsstandards/phpcsutils | 1.1.1 | 1.2.2 | Utility functions for PHP_CodeSniffer |
| phpunit/php-code-coverage | 9.2.32 | 10.1.16 | Library for code coverage collection and rendering |
| phpunit/php-file-iterator | 3.0.6 | 4.1.0 | Filters files based on a list of patterns |
| phpunit/php-invoker | 3.1.1 | 4.0.0 | Invoke callables with timeout |
| phpunit/php-text-template | 2.0.4 | 3.0.1 | Simple template engine |
| phpunit/php-timer | 5.0.3 | 6.0.0 | Timing utility class |
| sebastian/cli-parser | 1.0.2 | 2.0.1 | CLI options parser |
| sebastian/code-unit | 1.0.8 | 2.0.0 | Represents PHP code units |
| sebastian/code-unit-reverse-lookup | 2.0.3 | 3.0.0 | Maps lines to functions/methods |
| sebastian/comparator | 4.0.9 | 5.0.5 | Compare PHP values for equality |
| sebastian/complexity | 2.0.3 | 3.2.0 | Calculate complexity of code units |
| sebastian/diff | 4.0.6 | 5.1.1 | Diff implementation |
| sebastian/environment | 5.1.5 | 6.1.0 | Handle HHVM/PHP environments |
| sebastian/exporter | 4.0.8 | 5.1.4 | Export PHP variables for visualization |
| sebastian/global-state | 5.0.8 | 6.0.2 | Snapshot global state |
| sebastian/lines-of-code | 1.0.4 | 2.0.2 | Count lines of code |
| sebastian/object-enumerator | 4.0.4 | 5.0.0 | Enumerate arrays and object graphs |
| sebastian/object-reflector | 2.0.4 | 3.0.0 | Reflect object attributes |
| sebastian/recursion-context | 4.0.6 | 5.0.1 | Recursively process PHP variables |
| sebastian/type | 3.2.1 | 4.0.0 | Represents PHP types |
| sebastian/version | 3.0.2 | 4.0.1 | Version management for Git-hosted projects |
| squizlabs/php_codesniffer | 3.13.2 | 4.0.1 | Tokenizes PHP, JS, CSS files to detect issues |
| symfony/console | 5.4.47 | 6.4.34 | CLI interface utilities |
| symfony/string | 6.4.30 | 6.4.34 | String API utilities |
| symfony/yaml | 5.4.45 | 6.4.34 | YAML load/dump library |
| theseer/tokenizer | 1.3.1 | 2.0.1 | Convert tokenized PHP code to XML |

Gitleaks Scan

  • Target: . (current repository)
  • Commits scanned: 466
  • Size scanned: ~1.75 MB
  • Leaks found: 0 ✅

Trivy File System Scan

  • Scanners used: vuln, secret, misconfig
  • Version: 0.68.2 (v0.69.3 available)
  • Issues found: 0 ✅

I'll leave this open for a week for feedback/questions, but it's okay to just note the results since this is marked as low risk.

Could you please edit your comment above to properly display the remarkup rather than including it as code?

@Pppery sorry for the markup issue, fixed now

@Mstyles Thanks for your review.

I've addressed most of the comments.

For this npm dependency:

| serialize-javascript | <=7.0.2 | High | RCE via RegExp.flags and Date.prototype.toISOString() | npm audit fix --force (may require Mocha downgrade) |

As per: https://github.com/mochajs/mocha/issues/5781#issue-4020034511 this does not impact mocha because:

Requires untrusted data

We've incorporated all the inputs from the security review and released a new version of the language data library. See: https://github.com/wikimedia/language-data/releases/tag/1.1.10

Thank you for the review. Feel free to mark this ticket as resolved.