
MediaWiki support for Composer equivalent for JavaScript packages
Closed, Declined (Public)

Description

  • Affected components: TBD.
  • Engineer(s) or team for initial implementation: TBD.
  • Code steward: TBD.

Motivation

We should have an equivalent for JavaScript to what we have with Composer for PHP: a simple way to manage dependencies and fetch required JavaScript code automatically.

Both MediaWiki core and extensions depend on various third-party JavaScript libraries. We currently don't have a way to manage those dependencies.

Who will benefit:

  1. Developers: Development efficiency
    • More predictable and maintainable code due to reduced code complexity
    • Fewer worries about conflicts across different MW extensions, and a simple way to share JS code between them
  2. Security: Simpler to keep an eye on security issues in 3rd party libraries and to deal with them
  3. Possibly, WMF Deployment team.
Problems
  1. Libraries get duplicated into multiple repositories.
  2. Libraries need to be manually registered with PHP code and conditionals. Example 1, Example 2.
  3. It is hard to know the current version of a library in Git (manually inspect source files and/or git commit log)
  4. Libraries are difficult to update (location of external source varies, typically one or more JS files downloaded from the libraries' official web site)
  5. There is no easy way to automatically scan for vulnerabilities (via something like nsp)
  6. Libraries are hard to register with ResourceLoader (each shipped file needs to be listed)
Requirements
  1. Libraries should ideally be deduplicated (flattened dependency trees).
  2. Library versions should be pinned, and not subject to a version range or dependency tree that is unpredictable or can vary over time, e.g. by committing a lock file or shrinkwrap file to Git.
  3. Deployment must only depend on stuff that's in git - nothing external.
  4. Tarball must not require running npm.
  5. We need to be able to patch libraries in cases of emergencies (or major regressions) without waiting on a new upstream release

User story: Add an external package with a dependency to MediaWiki core, or an extension, as follows:

  • Add two package names and versions to <a file>, where B depends on A (see the sketch after this list).
  • Run <some command>.
  • (optional) Register names and files with ResourceLoader.
  • Package B can call require('a');.
  • Our own code can call require('b');.
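A minimal sketch of what that file could contain, assuming it is a package.json-style manifest and using a and b as stand-ins for real package names (the version numbers are placeholders):

    {
      "dependencies": {
        "a": "1.2.3",
        "b": "2.0.0"
      }
    }

Running <some command> (for instance npm install) would then fetch both packages into a predictable location.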

Exploration

(TBD: Use this space for data gathering, status quo, proposals, other considerations etc.)

See also: T174922: Decide what to do with Wikibase JS-only libraries regarding the build/deployment of Wikidata code

Event Timeline


fxpio/composer-asset-plugin is a Composer plugin that allows assets to be defined in standard npm/bower format; it then parses package.json, transforms the declarations into Composer format, and calls Composer to install those packages.

On one hand, that sounds scary. On the other hand, it's zero effort on our part (if it works; haven't tested), works everywhere where Composer does, and this task has been blocked since forever. So maybe worth considering?

Thanks for testing. It seems like Composer plugins that are not yet installed when Composer starts get loaded too late to influence the dependency resolution process (which makes sense in hindsight).

Something installed globally feels icky... Especially as something people might need to install/run separately outside MW to get composer-ed things to work...

composer-merge-plugin does some gymnastics to get around this bootstrapping problem. That sort of solution could be upstreamed to fxp/composer-asset-plugin if it would work there as well.

> composer-merge-plugin does some gymnastics to get around this bootstrapping problem. That sort of solution could be upstreamed to fxp/composer-asset-plugin if it would work there as well.

AIUI the problem is that composer-asset-plugin never gets installed in the first place because Composer tries dependency resolution first, and that errors out if there are any bower/npm "virtual packages" present. The composer-merge-plugin approach would work if those virtual packages were stored in some place that Composer normally ignores (like extra) but they are stored under require/require-dev so Composer tries to look them up on packagist and fails.

@Tgr I agree that plugin could come in handy, but in my opinion it is only a small part of the puzzle, and a part I am not currently worried about solving, given plenty of alternate options (including a plain npm install based on a separate package.json file we'd keep somewhere).

Specifying package names/versions and distributing the files is only part of the problem for this task. I haven't looked far into the mentioned Composer plugin, but as advertised it does look fine to me, and certainly one of the best I've seen so far, especially given that it would allow for keeping only one entry point for "external updates", and one entry point for packaging them for tarballs and wmf-production (mediawiki/vendor.git). That said, fetching the files from npm is relatively easy and could (for the sake of argument) even be a shell script that runs several commands of the form npm install <package>@<version> in sequence and then moves the results to a special subdirectory inside vendor/ (if we want to re-use that). It wouldn't be very pretty, but it would work, and could even be run automatically as a shell script from a hook run by composer update, without a plugin.

The other part of the problem is file discovery and registering with ResourceLoader in a way that is intuitive, scalable, and easy to maintain.

For PHP code, the fetching is pretty much all that needs to happen, because we can use an autoloader that knows where all files are, and the (same) autoloader registry is allowed to load additional files on-demand as needed at run-time.

For JavaScript, on the other hand, we need to register all JS and CSS files ahead of time for bundling and serving to the browser. And while an autoloader registry is a standard and mandatory part of composer.json, a typical package.json file for Node only contains a main entry point. All other references happen at run-time via relative file-path require() calls.
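For illustration, here is that difference in minimal form; both manifests are hypothetical examples, not taken from any real package. A composer.json carries an autoload map covering all of its files, while a package.json typically names only a single entry point:

    composer.json:

    {
      "autoload": {
        "psr-4": { "Example\\Widget\\": "src/" },
        "files": [ "src/functions.php" ]
      }
    }

    package.json:

    {
      "name": "example-widget",
      "main": "index.js"
    }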

Task T133462 deals with adding support for multi-file packages in a way that allows for partial execution (e.g. not just plain js concatenation). That would be a prerequisite for this task. But even with that solved, we'd still need each file (or directory) to be specified manually (short of shipping entire directories to the client, at the risk of sending too much).

We could use some sort of text scanning for require() calls and try to find all references. But short of that, we'd be forced to list each relevant file somewhere, which means we need not just a registry of package names and versions, but also an enumeration of file paths, which seems fragile without the files themselves being committed to the repository.

Anyhow, regardless of what we may be blocked on, I propose we start exploring what automation we can already apply to our current approach.

The update-oojs.sh maintenance script in MediaWiki core is an example of such automation.

It'd be nice if we can first get to a point where all our external JS/CSS libraries are listed in a single file with name, version, fetch source, and ResourceLoader-compatible definition. And have automation in place that can easily update or verify those dependencies (even whilst still committed to Git, for the time being).

TechCom triage: seems ready for an IRC meeting.

The post "I'm harvesting credit card numbers and passwords from your site. Here's how." makes some interesting points on the security aspects of using npm modules (which also underscores the importance of using our own bundling infrastructure and only relying on npm/whatever for sources).

That is a nice article. The points and dangers are generally applicable to any package manager, to running/serving untrusted code, and to deep dependency chains.

Thanks for sharing that article, @Tgr!

Problems we face right now that I'd like to solve:

  • Hard to figure out what JavaScript libraries we use.
  • Updating / tracking out of date libraries is ad hoc at best, non-existent for most.
  • No automated security scanning for dependencies

Basic requirements:

  • Deployment can only depend upon stuff that's in git - nothing external
  • Tarball must not require running npm
  • Dependencies should be pinned
  • Automated security scanning, probably with nsp?

Tooling: Ideally we would use something close to what everyone else uses. So I think using composer to manage JavaScript libraries is effectively ruled out. We already use npm for CI, so expanding our usage of it seems natural. Regarding yarn, after our experience with HHVM, I don't think I'm ready to add another Facebook-managed dependency to our stack.

Implementation:
The simplest implementation to get us started would be to just create a package.json file, and commit the results of npm install (example for jquery: https://gerrit.wikimedia.org/r/#/c/407055/). Extensions can also adopt the same pattern very easily, without us needing to invent something like npm-merge-plugin. Putting it inside resources/ avoids needing to open up a new top-level directory for web access. This would not address the deduplication problem, we would continue to rely upon ResourceLoader modules having unique names.

Committing the dependencies also means that people installing from git don't need to run npm locally (only developers who want to update something). I think this is necessary given that Debian doesn't have npm packaged, Ubuntu ships an older version that is not compatible with package-lock.json, and so on.

In my example commit (https://gerrit.wikimedia.org/r/#/c/407055/) it would be nice if we could just commit README/LICENSE and the dist/ folder, and not all of src/.
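For reference, a rough sketch of what the committed manifest could look like, assuming the resources/ location from the example commit and a pinned (exact) version; the version number here is illustrative only:

    {
      "private": true,
      "dependencies": {
        "jquery": "3.3.1"
      }
    }

The exact version (no ^ or ~ range) keeps installs predictable, and "private": true prevents the tree from being published to the registry by accident.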

> Problems we face right now that I'd like to solve:

I'd also like to add "ability to share JavaScript modules across extensions without having to put code in core."
Right now the only way to share code in JS is to get it inside core, which seems to work against the goal of making core smaller. Minerva skin, for example, depends on MobileFrontend as it uses a common JavaScript bundle that's defined in MobileFrontend. I'd like to solve this problem, as this dependency is forced by our infrastructure and not necessary.

> Deployment can only depend upon stuff that's in git - nothing external
> Tarball must not require running npm

Can you elaborate on these requirements and where they come from?

> Basic requirements:
>
>   • Deployment can only depend upon stuff that's in git - nothing external
>   • Tarball must not require running npm
>   • Dependencies should be pinned
>   • Automated security scanning, probably with nsp?

The Parsoid repo satisfies all these requirements. We use shrinkwrap to pin versions and dependencies. Our deploy repo has the package code in Git, which is what we deploy. We use nsp and it runs as part of Jenkins CI tests.

>> Problems we face right now that I'd like to solve:
>
> I'd also like to add "ability to share JavaScript modules across extensions without having to put code in core."
> Right now the only way to share code in JS is to get it inside core, which seems to work against the goal of making core smaller. Minerva skin, for example, depends on MobileFrontend as it uses a common JavaScript bundle that's defined in MobileFrontend. I'd like to solve this problem, as this dependency is forced by our infrastructure and not necessary.

I think what you're asking for is the deduplication support mentioned above? Where A and B both depend on library C, and right now we have A depend upon B, which depends upon library C, but really we should be able to express that A depends on library C and B depends on library C and we only include one copy of library C.

>> Deployment can only depend upon stuff that's in git - nothing external
>> Tarball must not require running npm
>
> Can you elaborate on these requirements and where they come from?

I came up with them based on the requirements we had when we introduced composer dependencies to MediaWiki core in 1.25. I don't think there's anything controversial about them - for composer we deploy from mediawiki/vendor (git commit of composer dependencies) and bundle that repo into the tarball so people don't have to run composer.

If you want to avoid npm install <dependencies>, maybe you can use npm pack <dependencies> when _generating_ the tarball? This would allow the tarball to be unpacked without NPM.

> we should be able to express that A depends on library C and B depends on library C and we only include one copy of library C.

Exactly. Given our restrictions in ResourceLoader, we'd probably want to support shipping multiple versions of a library, e.g. C might refer to a ResourceLoader module Library.C@1.0.1 or Library.C@1.2.0, but provide some kind of logging that says when this is the case (I'd imagine in production we'd never want to have multiple versions of a library running and would need to work out how to deal with that). It's the only thing blocking me from making Minerva not depend on MobileFrontend, so something I'm keen to rectify.

Just use the same approach as for composer? npm installs into vendor, that gets reviewed and committed, and used in WMF production and in tarballs. Extension tarballs and third-party installs of extensions would still require running npm but they require running composer as well so not much difference there.

> Just use the same approach as for composer? npm installs into vendor, that gets reviewed and committed, and used in WMF production and in tarballs. Extension tarballs and third-party installs of extensions would still require running npm but they require running composer as well so not much difference there.

I considered that, but I think we run into a bunch of weird path problems: is the library in ./../vendor/node_modules or ./node_modules?

Also, the levels of distro packaging of composer and npm are different, to the point where I'm not sure it's reasonable to be asking people (who aren't developers) to run npm locally.

Tgr updated the task description.

Following today's IRC meeting, we've collected a wide range of problems we want to solve, and reached an informal agreement to first solve a subset of these, with two first steps as a possible solution.

Specifically, the current proposed solution is to:

  • Track names and versions of 3rd-party javascript libraries. – using package.json
  • Fetch dependencies from their external source(s) in an automated fashion. – using npm.

This would address the following 3 problems (out of 6 total):

  • It is hard to know the current version of a library. The library versions will be self-documented, in package.json.
  • Libraries are difficult to update. Git commit authors update them using the npm CLI.
  • There is no easy way to automatically scan for vulnerabilities. By using package.json, most standard scan tools will be able to find vulnerabilities.

A few catches to think about:

  • We currently mostly update libraries based on official non-npm distributions from the websites of the upstream libraries. These are typically a single file. If we standardise on npm as the source of module code, we'll also have to deal with the fact that packages published there, compared to how they are published on their own websites or on CDNs, will likely consist of multiple files. Thus, as long as we manually register modules and their files, we'll need to carefully enumerate each file. For files that are only conditionally accessed, this could easily lead to issues in production.
    • This could be addressed in the future by using an automated file scanner in PHP that looks for require() calls. This could be based on, or PHP-ported from, logic that already exists in JS-based bundling tools like WebPack.
  • ..

Meeting started by TimStarling at 22:00:34 UTC. The full logs are available at https://tools.wmflabs.org/meetbot/wikimedia-office/2018/wikimedia-office.2018-01-31-22.00.log.html.

Meeting summary

  • RFC: MediaWiki support for Composer equivalent for JavaScript packages (TimStarling, 22:00:54)
    • https://phabricator.wikimedia.org/T107561 (Krinkle, 22:01:03)
    • Problem: We manually download upstream libs and commit to Git. We'd like to automate this to some extent. Whether outside Git, or within git with a script. (Krinkle, 22:06:37)
    • Problem: Updating / tracking out of date libraries is ad hoc at best, non-existent for most. (legoktm, 22:07:11)
    • Problem (cont'd): Lib versions are inferred based on file contents and/or commit log, not a self-documenting manifest of sort. (Krinkle, 22:08:10)
    • Problem: No automated security scanning for JavaScript dependencies (legoktm, 22:08:15)
    • Requirement: Consider ease of forking/patching of current approach. (Krinkle, 22:09:05)
    • Problem: JS package management is MW specific instead of canonical JS. (niedzielski, 22:09:11)
    • Problem: Enabling extension and skin authors to depend on npm libraries without duplication (joakino, 22:10:55)
    • Requirement: Support extensions as well (de-duplicate, still easy to fetch all at once for installed extensions). (Krinkle, 22:11:09)
    • Problem: Currently, MW extensions using js libs need to conditionally register their module based on whether another extension has already registered it. This means version-needs are not honoured or even checked. (Krinkle, 22:17:41)
    • As first step, we might want to keep things in Git like now. But focus on automating the fetching of things from npm, and focus on declaring the package names/versions in a machine-readable and standard format (eg. package.json) (Krinkle, 22:38:17)
    • <joakino> for committing packages, it'd probably be better to just commit the dist files going to be used, so instead of committing the full node_modules/jquery we could commit just resources/node_modules/jquery/dist/jquery.js (legoktm, 22:43:14)
    • Problem: Npm module names support namespaces with an '@', which RL has already reserved. Think about... (Krinkle, 22:43:18)
    • <Krinkle> legoktm: We can also upstream improvements to package.json for distributing fewer files (legoktm, 22:43:22)
    • <legoktm> joakino: yeah, I'm leaning towards that. just commit README, LICENSE, and dist/ (legoktm, 22:43:40)
    • AGREED: we'll use npm and we'll commit the result of running npm to the extension or core git repo (TimStarling, 23:04:10)

I added one more restriction that I forgot about earlier today:

> We need to be able to patch libraries in cases of emergencies (or major regressions) without waiting on a new upstream release

As long as we follow the "must be in git" requirement, this restriction will be taken care of too.

>   • Track names and versions of 3rd-party javascript libraries. – using package.json

Was there agreement on the location of this file? As lego proposed putting it into ./resources.
When we use a file, we should pretty much know where to put it, otherwise it's still problematic :)

I missed the IRC meeting yesterday, but if I can add my 2 cents, I'd like to add as a requirement that we should verify the downloaded packages before including them in our repos.
This is not an easy task, as proper signatures and lists of official signing keys for the upstream packages might not be available.

In addition, npm packages are not yet signed (AFAIK), although there has been a discussion about it since early 2015. Recently there have even been some attempts to do it externally, like pkgsign.

Automatically including external libraries without any sort of integrity check is intrinsically dangerous and should be avoided IMHO. Moreover, npm has already had some security incidents that involved code published by non-maintainers.

@Volans while in theory I agree with you, in practice... none of our current JS code EVER passed a signature check either. So if that's something we want to do, then we should be pushing npm, because requirements without infrastructure are just plain impossible in practice.

> @Volans while in theory I agree with you, in practice... none of our current JS code EVER passed a signature check either. So if that's something we want to do, then we should be pushing npm, because requirements without infrastructure are just plain impossible in practice.

I think that what @Volans was suggesting is to think of a way to check such dependencies without having to rely on the non-compromised state of npm.

Now I understand we probably face the same issue right now, but since we're institutionalizing the use of npm, it is important to also figure out how to mitigate this type of risk.

I don't think this should block the RfC though.

> @Volans while in theory I agree with you, in practice... none of our current JS code EVER passed a signature check either. So if that's something we want to do, then we should be pushing npm, because requirements without infrastructure are just plain impossible in practice.

Presuming that the package.json file would not only list dependencies, but also the specific version of the dep that should be used, it ought to also be possible to include an MD5 or other checksum. It doesn't eliminate the risk, but at least ensures that a package that's been inspected by a developer hasn't been altered upstream.

Regarding integrity, package-lock.json has sha1/sha512 hashes in it (how they're used, I haven't researched yet).

I updated https://gerrit.wikimedia.org/r/407055 to not include all the extraneous files with some .gitignore rules. Do we want an extra copy of jquery-slim in the repository?

Also, the first two non-jQuery projects I tried migrating had dirty diffs compared with what we had committed, and it seemed like the project maintainer had just uploaded the wrong thing to npm. mustache 0.8.2 on npm produced a JavaScript file that included 0.8.1 in the source, among other changes. html5shiv 3.7.3 on npm produced a JavaScript file that included "3.7.3-pre" as the version. Maybe just a bad coincidence? :( I'll try a few more packages.

From https://docs.npmjs.com/files/package-lock.json, on the integrity field:

> For bundled dependencies this is not included, regardless of source.
> For registry sources, this is the integrity that the registry provided, or if one wasn't provided the SHA1 in shasum.
> For git sources this is the specific commit hash we cloned from.
> For remote tarball sources this is an integrity based on a SHA512 of the file.
> For local tarball sources: This is an integrity field based on the SHA512 of the file.
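For illustration, a registry-sourced entry in package-lock.json looks roughly like the following fragment (package name, version, URL, and hash are placeholders, not real values):

    "jquery": {
      "version": "3.3.1",
      "resolved": "https://registry.npmjs.org/jquery/-/jquery-3.3.1.tgz",
      "integrity": "sha512-<base64-encoded hash>"
    }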

Note: I suspect that package-lock.json is not supported by the version of npm that we currently run in production?

> Note: I suspect that package-lock.json is not supported by the version of npm that we currently run in production?

Correct (see T179229: Decide whether we want the package-lock.json to commit or ignore), but we would spot it locally, if not in CI.

Krinkle moved this task from P1: Define to P2: Resource on the TechCom-RFC board.

Obsoleted by ForeignResourceManager, which (unlike npm or composer) meets the operational requirements for security and independence, and supports both npm and other distribution sources of packages.

Hey Krinkle, any chance you can point me to the details of the ForeignResourceManager solution? I haven't had much luck finding any relevant documentation, and I'm curious what has happened since the RFC log from 2018, where npm/git was the preferred solution.

To get started, follow https://www.mediawiki.org/wiki/ResourceLoader/Foreign_resources.

When working on specific changes, there is also:

  • Reference docs in the comment of the /resources/lib/foreign-resources.yaml file in MediaWiki core.
  • See also maintenance/manageForeignResources.php --help.