Set up a packagist mirror for Wikimedia
Open, Low, Public

Description

Looks pretty trivial to get this started in Cloud Services for now. Ideally this will help with CI reliability, remove an external dependency, and use our resources to help the rest of the ecosystem.

Event Timeline

Legoktm created this task. Sep 5 2018, 2:59 AM
Restricted Application added a subscriber: Aklapper. Sep 5 2018, 2:59 AM

Mentioned in SAL (#wikimedia-cloud) [2018-11-26T05:54:50Z] <legoktm> created packagist-mirror1, cloned https://github.com/Webysther/packagist-mirror and started mirror creation script in a screen (T203529)

So the mirror creation is done, I installed apache2, and symlinked /var/www/html to /srv/packagist-mirror/public/ and verified that curl localhost works. But I'm unable to reach port 80 from outside of the instance, whether via DNS proxy or from inside another cloud vps (toolforge in this case). Probably an issue with security groups, but I don't see anything obvious. On IRC, arturo offered to take a look <3.

I added a rule allowing HTTP (80/tcp) traffic to the instance from anywhere to the security group.

Before and after:

aborrero@tools-bastion-03:~$ telnet packagist-mirror1.eqiad.wmflabs 80
Trying 172.16.1.212...
^C
aborrero@tools-bastion-03:~$ telnet packagist-mirror1.eqiad.wmflabs 80
Trying 172.16.1.212...
Connected to packagist-mirror1.eqiad.wmflabs.
Escape character is '^]'.
^C^C^C
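
The same before/after check can be scripted instead of run by hand with telnet. This is a minimal sketch using only the Python standard library; the function name `port_is_open` and the default timeout are my own choices, not something from this task.

```python
import socket


def port_is_open(host: str, port: int, timeout: float = 3.0) -> bool:
    """Return True if a TCP connection to host:port succeeds within timeout."""
    try:
        # create_connection resolves the name and performs the TCP handshake.
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts (e.g. packets dropped by a
        # security group) and name resolution failures.
        return False
```

Calling `port_is_open("packagist-mirror1.eqiad.wmflabs", 80)` from another instance would correspond to the telnet sessions above: a hang followed by a timeout before the rule was added, and a successful connection after.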

In fact, I just created a separate security group, called HTTP/HTTPS, instead of adding rules to the default one.

This is the resulting instance configuration:

aborrero@tools-bastion-03:~$ telnet packagist-mirror1.eqiad.wmflabs 80
Trying 172.16.1.212...
Connected to packagist-mirror1.eqiad.wmflabs.
Escape character is '^]'.
GET /
<!DOCTYPE html>
<html>
[...]
                    <p>
                        This is PHP package repository Packagist.org mirror site.
                    </p>
[...]
hashar added a subscriber: hashar. Jun 24 2019, 8:08 PM

packagist.org had some issue on Friday (T226253), I guess it is a project we might want to revive?

I remember getting stuck somewhere, let me see where I was...

OK, a few hours of fiddling with Apache, and I've gotten this working!

Copying it here just in case so it doesn't get lost.

<Directory /srv/packagist-mirror/public>
    Require all granted
    RewriteEngine on

    # Serve correct content types, and prevent mod_deflate double gzip.
    RewriteRule "\.json$" "-" [T=application/json,E=no-gzip:1]

    <FilesMatch "\.json$">
        # Serve correct encoding type.
        Header append Content-Encoding gzip

        # Force proxies to cache gzipped &
        # non-gzipped json files separately.
        Header append Vary Accept-Encoding
    </FilesMatch>
</Directory>
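
To sanity-check a config like this, one could look at the response headers of any .json URL and compare them with what the comments say should happen. The sketch below is only an illustration: `json_header_problems` is a hypothetical helper, and its three rules are my reading of the config comments, not an official Composer or Packagist requirement.

```python
def json_header_problems(headers: dict) -> list:
    """Return a list of problems found in the response headers of a .json URL."""
    # HTTP header names are case-insensitive, so normalise before comparing.
    h = {name.lower(): value.lower() for name, value in headers.items()}
    problems = []

    if not h.get("content-type", "").startswith("application/json"):
        problems.append("Content-Type is not application/json")

    # "gzip, gzip" would suggest the body was compressed twice.
    encodings = [e.strip() for e in h.get("content-encoding", "").split(",") if e.strip()]
    if encodings.count("gzip") != 1:
        problems.append("Content-Encoding should list gzip exactly once")

    vary = [v.strip() for v in h.get("vary", "").split(",")]
    if "accept-encoding" not in vary:
        problems.append("Vary does not include Accept-Encoding")

    return problems
```

An empty list means the headers match the intent of the config; the input dict can be filled from the output of a HEAD request against the mirror.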

OK, it works. Instructions at https://packagist-mirror.wmflabs.org/

A word of caution: if the mirror script were malicious, it could basically inject or execute malicious code. I've finished auditing the mirror code itself, but not its dependencies. I sent a PR to get rid of one dependency so far.

km@km-pt ~> curl --compressed -I 'https://packagist-mirror.wmflabs.org/packages.json' | grep last-modified
last-modified: Wed, 26 Jun 2019 16:33:02 GMT

It seems like every time the mirror script runs (every 5 minutes), it's touching all the files, updating last-modified values, and preventing basic caching from working...
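
If a script really does rewrite files whose content has not changed, one common way to keep Last-Modified stable is to skip the write when the bytes on disk are already identical. The sketch below shows that general idea only; `write_if_changed` is a hypothetical helper and not part of the mirror script.

```python
def write_if_changed(path: str, data: bytes) -> bool:
    """Write data to path only if it differs from the current file content.

    Returns True if the file was written, False if it was left untouched.
    An untouched file keeps its mtime, so a web server that derives
    Last-Modified from the mtime keeps serving the same value.
    """
    try:
        with open(path, "rb") as f:
            if f.read() == data:
                return False
    except FileNotFoundError:
        pass  # New file: fall through and create it.

    with open(path, "wb") as f:
        f.write(data)
    return True
```

This only helps for files whose content is genuinely unchanged between runs; a file that changes on every run will still get a new Last-Modified value each time.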

So far illuminate/support and nesbot/carbon don't seem to be used (issue).

Remaining stuff:

✔️ guzzlehttp/guzzle                6.3.3    MIT
✔️ guzzlehttp/promises              v1.3.1   MIT
✔️ guzzlehttp/psr7                  1.4.2    MIT
   league/flysystem                 1.0.52   MIT
   league/flysystem-cached-adapter  1.0.9    MIT
✔️ php-snippets/circular-array      v1.0.0   MIT
✔️ psr/cache                        1.0.1    MIT
✔️ psr/http-message                 1.0.1    MIT
✔️ psr/log                          1.1.0    MIT
✔️ sebastian/version                2.0.1    BSD-3-Clause
✔️ symfony/console                  v3.4.27  MIT
✔️ symfony/debug                    v4.2.8   MIT
✔️ symfony/polyfill-mbstring        v1.11.0  MIT
   vlucas/phpdotenv                 v2.4.0   BSD-3-Clause-Attribution

(I didn't actually review the symfony stuff, just assuming it's safe)

Something I have noticed is that the files listing the packages keep changing. Entries appear and disappear as code is updated or new tags are pushed. One example is rackbeat/laravel-morph-where-has, which cut a new tag after a year or so of inactivity while I was testing my theory. Its definition thus moved from p-provider-2018-07.json to p-provider-latest.json, forcing both files to be downloaded again.

The /packages.json file is an index of all those provider files and lists their checksums. Thus each time a package somewhere is updated, one of the provider files changes and packages.json gets an updated sha256.

So I guess it is working as intended?

Besides that, the zip/tarballs are reused from the local cache.
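
The effect described above can be reproduced with a toy model. This is a simplified sketch, not the real packages.json schema: `build_index` and `changed_files` are hypothetical helpers, and the package versions and the `example/...` packages are made up for illustration.

```python
import hashlib
import json


def build_index(providers: dict) -> dict:
    """Map each provider file name to the sha256 of its serialised content."""
    index = {}
    for filename, packages in providers.items():
        # sort_keys makes the serialisation, and so the hash, deterministic.
        body = json.dumps(packages, sort_keys=True).encode("utf-8")
        index[filename] = hashlib.sha256(body).hexdigest()
    return index


def changed_files(old: dict, new: dict) -> set:
    """Return the file names whose hash differs between two indexes."""
    return {name for name in old.keys() | new.keys() if old.get(name) != new.get(name)}


# Before: the package sits in the 2018-07 provider file (versions are made up).
before = build_index({
    "p-provider-2018-07.json": {"rackbeat/laravel-morph-where-has": "1.0.0",
                                "example/other": "1.0.0"},
    "p-provider-latest.json": {"example/new": "2.0.0"},
})
# After a new tag, its definition moves to the "latest" provider file.
after = build_index({
    "p-provider-2018-07.json": {"example/other": "1.0.0"},
    "p-provider-latest.json": {"example/new": "2.0.0",
                               "rackbeat/laravel-morph-where-has": "1.1.0"},
})
print(sorted(changed_files(before, after)))
```

Both provider files are reported as changed, so an index built from their checksums changes as well, even though only one package was updated.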

If you need help, I'm here. I'm already helping with photos and edits, but never imagined that I could help with OSS inside Wikimedia.

Yes, composer is optimized for downloading compressed files, and the cache is only optimized for zip/tarballs. JSON without *.gz is only there for legacy support.

Hey guys, does anyone know the server location(s)?

How often is the mirror updated?

Hey!

Hey guys, does anyone know the server location(s)?

Geographically? It's in Virginia, USA. Specifically https://wikitech.wikimedia.org/wiki/Eqiad_cluster

How often is the mirror updated?

It should be running every 5 minutes.

That's what I needed, thanks!

Is there a fork of the repository somewhere?

No, it's using your code off of github, I just hadn't pulled it in a while (just did so now).