
The restricted/mediawiki-webserver image should include skins and resources
Open, HighPublic

Description

The image as built right now does not include any PHP-related files; however, the docroots link to files in the /srv/mediawiki/php directory:

www-data@mediawiki-pinkunicorn-b5db99bf9-fp27f:/srv/mediawiki/docroot/wikimediafoundation.org/static/current$ ls -lart
total 0
lrwxrwxrwx 1 somebody somebody 24 Feb 23 23:42 skins -> /srv/mediawiki/php/skins
lrwxrwxrwx 1 somebody somebody 28 Feb 23 23:42 resources -> /srv/mediawiki/php/resources
lrwxrwxrwx 1 somebody somebody 29 Feb 23 23:42 extensions -> /srv/mediawiki/php/extensions

and those links are broken within the image. We need to add them back somehow.
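Broken links like these can be spotted with find's -xtype test. The sketch below demonstrates it on a temp directory; the real check would target /srv/mediawiki/docroot inside the container:

```shell
# Demo of detecting broken symlinks, using a temp dir as a stand-in
# for the image filesystem.
tmp=$(mktemp -d)
ln -s /nonexistent-target "$tmp/broken-link"  # like the skins/resources links
ln -s /etc "$tmp/valid-link"                  # a link whose target exists
# -xtype l matches symlinks whose target cannot be resolved
find "$tmp" -xtype l
rm -rf "$tmp"
```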

Outstanding problematic URLs:
https://en.wikipedia.org/favicon.ico (due to T288848)

Event Timeline

Which image are you looking at? I'm looking at a recent image published as docker-registry.discovery.wmnet/restricted/mediawiki-multiversion:2021-06-22-155254-publish and I can list the contents of those directories by way of the symlink.

dduvall@releases1002:~$ docker run --rm -it --entrypoint /bin/bash localhost/plib-image-n89hwxz6 -c 'ls /srv/mediawiki/docroot/wikimediafoundation.org/static/current/skins'
CologneBlue  Modern    Nostalgia  Timeless  WikimediaApiPortal
MinervaNeue  MonoBook  README	  Vector

Which image are you looking at? I'm looking at a recent image published as docker-registry.discovery.wmnet/restricted/mediawiki-multiversion:2021-06-22-155254-publish and I can list the contents of those directories by way of the symlink.

dduvall@releases1002:~$ docker run --rm -it --entrypoint /bin/bash localhost/plib-image-n89hwxz6 -c 'ls /srv/mediawiki/docroot/wikimediafoundation.org/static/current/skins'
CologneBlue  Modern    Nostalgia  Timeless  WikimediaApiPortal
MinervaNeue  MonoBook  README	  Vector

The mediawiki-webserver one (currently commented out in the pipeline build, and due to be re-enabled), from which we exclude the php directory. That is the image that runs apache2 in the mediawiki pod, and it needs to have all static assets included.

We will need a more complex build for the webserver image, one that includes both these default static assets and a checkout of the wwwportals directory; see T285325

Got it! Thanks for clarifying. I'll look into fixing up that image build.

Just to verify something further, @Joe, does the image also require code for all active MediaWiki versions? I'm looking at the following and seeing that a version-specific includes/WebStart.php is included as part of the w/static.php codepath.

https://gerrit.wikimedia.org/r/plugins/gitiles/operations/puppet/+/refs/heads/production/modules/mediawiki/templates/apache/mediawiki-vhost.conf.erb#57
https://gerrit.wikimedia.org/r/plugins/gitiles/operations/mediawiki-config/+/refs/heads/master/w/static.php#39

Having to include the versions atop two base images (php-fpm and httpd) will greatly increase the build times since they will not share cache chains. If there's no way around this, we'll JFDI, but I'm wondering if there's some straightforward way we can restructure the images.

One idea that comes to mind is to combine the php-fpm and httpd base images and build a single multiversion image that can be run as either the webserver or the backend (httpd or fpm respectively).

One idea that comes to mind is to combine the php-fpm and httpd base images and build a single multiversion image that can be run as either the webserver or the backend (httpd or fpm respectively).

To illustrate this proposed change a bit more:

We currently structure the images with the following layers (each branch constituting an independent cache chain):

                 buster
               ⇙         ⇘
       php-fpm            mediawiki-httpd
          ↓                       ↓
       php deps               httpd deps
          ↓                       ↓
 define fpm entrypoint   define httpd entrypoint
          ↓                       ↓
 add mediawiki-config     add mediawiki-config
          ↓                       ↓
add mediawiki versions   add mediawiki versions

Doing the following instead would lead to a bulkier base image but would result in only a single branch of layers (and one cache chain) up to where the entrypoint is defined.

                 buster
                   ↓
          mediawiki-httpd-php-fpm
                   ↓
               httpd deps
                   ↓
                php deps
                   ↓
           add mediawiki-config
                   ↓
          add mediawiki versions
          ⇙                   ⇘
define fpm entrypoint   define httpd entrypoint
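As a rough sketch, this combined layering could be expressed as a multi-stage Dockerfile in which only the final entrypoint stage differs between the two images. All image names, package names, and paths below are illustrative, not the actual pipeline configuration:

```docker
# Illustrative only; the real images are built by the deployment pipeline.
FROM buster AS base                         # hypothetical shared base
RUN apt-get update && \
    apt-get install -y apache2 php7.3-fpm   # httpd deps + php deps in one chain
COPY mediawiki-config/ /srv/mediawiki-config/
COPY mediawiki-versions/ /srv/mediawiki/

# Only the entrypoint layer differs, so everything above is shared cache:
FROM base AS fpm
ENTRYPOINT ["php-fpm7.3", "--nodaemonize"]

FROM base AS httpd
ENTRYPOINT ["apache2ctl", "-DFOREGROUND"]
```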

Hi @dduvall sorry I missed your comments.

I hope we don't need to include the full multiversion stuff in the httpd image; if we do, we should really stop and consider whether we should just mount the mediawiki code into the images as a volume from the host.

Let me list what needs to be in the httpd images:

  • Any php file that is called *directly* from the web request, which basically means what is in /w in /srv/mediawiki
  • Any static asset accessed *directly* from the web, like the timeless css sheet, *not* stuff that is only accessed via /w/static.php or /w/load.php. those static assets need to either be present under /srv/mediawiki/docroot or be linked therein.

I will do some log collection to get a list of URLs that need to be answered directly by the webserver; it's not too much stuff, probably under 100 MB anyway.

So my idea was that we can use the multiversion image as a build base for the webserver image, and then just copy over the stuff we need.

I'll get a clearer picture of all that's needed today.

So after some more scavenging, we need the following directories to be included:

  • php/skins, which is about 18 MB
  • php/resources, which is about 17 MB
  • all the non-PHP, non-boilerplate files in php/extensions. I don't know if there is a way to extract the list of static assets for each extension, but even with a simple script it can be slimmed down to about 200 MB. With some further work it can probably be reduced below 180 MB or so.

The script I used was simply:

# delete all PHP sources, then drop i18n and test directories,
# which hold no web-served static assets
find . -type f -name "*.php" -delete
find . -type d -name "i18n" -print0 | xargs -0 rm -rf
find . -type d -name "tests" -print0 | xargs -0 rm -rf

So it should be doable to have one build step after creating multiversion that copies multiversion over but then removes all the unneeded stuff, and finally copies stuff over to the httpd image.
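That step could be sketched as a multi-stage build along these lines (stage names and image tags are hypothetical, not the actual pipeline configuration):

```docker
# Illustrative sketch, not the actual pipeline configuration.
FROM restricted/mediawiki-multiversion AS slim   # hypothetical tag
# Strip PHP sources and non-asset directories from the extensions tree
RUN cd /srv/mediawiki/php/extensions && \
    find . -type f -name '*.php' -delete && \
    find . -type d \( -name i18n -o -name tests \) -print0 | xargs -0 rm -rf

FROM restricted/mediawiki-webserver AS webserver  # hypothetical tag
COPY --from=slim /srv/mediawiki/php/skins      /srv/mediawiki/php/skins
COPY --from=slim /srv/mediawiki/php/resources  /srv/mediawiki/php/resources
COPY --from=slim /srv/mediawiki/php/extensions /srv/mediawiki/php/extensions
```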

I should also add that I'm starting to feel the benefits of having the code inside the images are now offset by an ever-more-complex build system: we need to optimize both how we distribute images and how we build them, and I have little confidence I haven't missed a few things too.

I am starting to think that we should just mount a volume from the local disk for some time, and in the meantime work on:

  • Rationalizing how static assets are managed
  • Allowing us to NOT run multiversion and instead have separate deployments for group0/1/2
  • Modernizing how mediawiki is configured so we don't need a code deployment at every turn.

I know we decided to go the other way around but, with the current organization of the code, it's proving quite hard to do things the way we wanted. I'll bring this up at our next IC meeting.

Just to clarify: if it wasn't for managing the code split inside the containers, we would already be able to serve live traffic from kubernetes, instead of having to wait for:

  • Our registry scaling up, and us finishing our tests with dragonfly etc
  • That we add a step to deploy to k8s for every scap run
  • That we figure out with 100% confidence which files we need to include where.

Just to clarify further: I don't think we can manage having to pull two 5 GB images per pod every time one config file changes, so we need optimizations.

I am starting to think that we should just mount a volume from the local disk for some time, and in the meantime work on:

  • Rationalizing how static assets are managed
  • Allowing us to NOT run multiversion and instead have separate deployments for group0/1/2
  • Modernizing how mediawiki is configured so we don't need a code deployment at every turn.

I don't have a strong opinion about the code living in the images but I'd love to see progress on those bullet points.

So after some more scavenging, we need the following directories to be included:

  • php/skins, which is about 18 MB
  • php/resources, which is about 17 MB
  • all the non-PHP, non-boilerplate files in php/extensions. I don't know if there is a way to extract the list of static assets for each extension, but even with a simple script it can be slimmed down to about 200 MB. With some further work it can probably be reduced below 180 MB or so.

[...]

So it should be doable to have one build step after creating multiversion that copies multiversion over but then removes all the unneeded stuff, and finally copies stuff over to the httpd image.

That would result in a much slimmer webserver image. In aggregate, though, it still duplicates image content and incurs more computational cost at build time.

What do you think of the alternate image composition I proposed above? As I mentioned, it would start with larger base images, but all layers between the two images are identical save for the final entrypoint definition, which is of negligible size. It would reduce the number of copy operations at build time, as well as transfer and unpacking costs, and possibly even memory usage on nodes thanks to more efficient page caching (I think, at least if we're using the overlayfs driver).

I should also add that I'm starting to feel the benefits of having the code inside the images are now offset by an ever-more-complex build system: we need to optimize both how we distribute images and how we build them, and I have little confidence I haven't missed a few things too.

I disagree here. It is certainly a complex optimization problem due to the issues you've mentioned, but the build process itself is fairly straightforward in its form, and it will get progressively saner as we solve those other things.

Using a shared volume between nodes would, IMO, result in an in-between state, the worst of both the k8s and legacy worlds, where half the stack must be managed with scap or other rsync-based tooling and the other half with k8s-based tooling. That would preclude us from making any real advances in simplifying the aggregate toolchain and from using the deployment strategies that k8s would otherwise afford us. I think it's still very much worthwhile to figure out what to do about these large images.

If we can figure out a better solution to building and distributing the l10n cache, I honestly think that's enough on its own—but maybe also in tandem with the image composition alternative I mentioned. It makes up the vast majority of the image size and incurs a substantial cost during the build.

Helm supports hooks. What if we define a pre-install hook and a k8s Job in the chart templates that generates the l10n cache upon deployment and stores the results on a shared persistent volume mounted read-only by the application pods? The cache path on the shared volume would be keyed by deployment name, not simply MW version, which retains the atomicity of deployments. The generated cache path would be injected into mediawiki-config during deployment. Unlike with application code, I think this is an approach appropriate for a cache, and I can't think of a real downside other than the need for garbage collection. What do you think?

What I'm describing is similar to the initContainer approach that @dancy experimented with previously. However, it only has to run in a single place per deployment and doesn't run up against the hostPath/futex issues that arose in the previous experiment.
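A rough sketch of what such a hook could look like in the chart templates. The Job name, image tag, maintenance invocation, and PVC name are all hypothetical, and the wiring of the cache directory into MediaWiki configuration is left out:

```yaml
# Hypothetical pre-install hook Job; not an existing chart resource.
apiVersion: batch/v1
kind: Job
metadata:
  name: {{ .Release.Name }}-l10n-rebuild
  annotations:
    "helm.sh/hook": pre-install,pre-upgrade
    "helm.sh/hook-delete-policy": hook-succeeded
spec:
  template:
    spec:
      restartPolicy: Never
      containers:
        - name: rebuild-l10n
          image: restricted/mediawiki-multiversion:latest  # hypothetical tag
          # Exact invocation assumed; the output dir would be keyed by
          # release name to retain deployment atomicity.
          command: ["php", "maintenance/rebuildLocalisationCache.php"]
          volumeMounts:
            - name: l10n-cache
              mountPath: /srv/mediawiki/cache/l10n
      volumes:
        - name: l10n-cache
          persistentVolumeClaim:
            claimName: l10n-shared  # hypothetical shared PV claim
```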

I am starting to think that we should just mount a volume from the local disc for some time, and in the meantime work on:

  • Rationalizing how static assets are managed

This is a bit annoying for sure. It would be so nice if extensions and skins declared in some static form exactly what static assets they hold. The same goes for l10n files—it's better than it used to be with extension.json but FWICT there's still the option of using PHP to dynamically declare a list of l10n files.

  • Allowing us to NOT run multiversion and instead have separate deployments for group0/1/2

Also a laudable long-term goal for sure. We've been trying to imagine what this might look like in our weekly RelEng k8s check-ins, but we know it's solidly SRE domain. Still, we'd love to actively work towards it.

  • Modernizing how mediawiki is configured so we don't need a code deployment at every turn.

Also something we're definitely all thinking about. We can work on this in parallel/tandem.

I know we decided to go the other way around but, with the current organization of the code, it's proving quite hard to do things the way we wanted. I'll bring this up at our next IC meeting.

Let's schedule some time to talk about these problems synchronously when you're back, during the next IC meeting or even before.

Helm supports hooks. What if we define a pre-install hook and a k8s Job in the chart templates that generates the l10n cache upon deployment and stores the results on a shared persistent volume mounted read-only by the application pods? The cache path on the shared volume would be keyed by deployment name, not simply MW version, which retains the atomicity of deployments. The generated cache path would be injected into mediawiki-config during deployment. Unlike with application code, I think this is an approach appropriate for a cache, and I can't think of a real downside other than the need for garbage collection. What do you think?

What I'm describing is similar to the initContainer approach that @dancy experimented with previously. However, it only has to run in a single place per deployment and doesn't run up against the hostPath/futex issues that arose in the previous experiment.

After discussing this more with @dancy and @jeena it seemed the initContainer approach was still feasible but in a slightly different form using local storage persistent volumes. I've started a new task to track discussion of this idea (T286952).

Hi and sorry for the late replies, just got back from my break and I'm catching up with the backlog.

Just to clarify my point of view: I am not advocating for using a hostPath for serving the code as distributed by scap as a long-term solution. I just want to unblock a relatively important part of the work SRE need to do in the short term.

Specifically things we don't have at the moment:

  • We don't run the same code as production most of the time.
  • The mediawiki-webserver image has not been properly updated since April, IIRC
  • Deploying the new versions of the code right now requires an SRE to run cumin incantations to download the image without having the kubernetes api time out
  • Not all static assets are present in the webserver image (the focus of this task)
  • We can't deploy to multiple kubernetes nodes at the same time to avoid network issues

Some of the above issues are easier to solve; others are quite hard and time-consuming.

So I think that using this as a temporary workaround while we evaluate the solutions we might use (initContainer to generate the localization cache and/or get away from multiversion; the p2p docker pull with dragonfly) would allow SREs to continue working on performance testing, providing an mwdebug environment, etc. without waiting for the solutions to be ready.

Given that implementing this would require me ~ 1 day of work, it seems like a good way to unblock our work.

I remain fully convinced that our end goal should be including the code in the images we build. In fact, SRE and Release Engineering spent a significant amount of time working in that direction and I'm confident we'll end up having an efficient process to deploy to kubernetes.

I like the approach of using a Job to populate the l10n cache on each node, but I am not 100% sure of how we can ensure we run one job per node; that's a technical detail though.

As for setting up a meeting: I'd be happy to talk this week.

So after discussion yesterday, it appears we've reached a consensus: given that we're now building incremental images for smaller code changes, slow pulls will only be an issue when we do a full train deployment, which will probably require pre-pulling the image on all nodes. We can therefore proceed without hostPath or initContainer trickery and just keep all the code together.

Now this means that we need to get back to the original intent of this task.

I think we need at the very minimum to fix the broken links in the docroots, but potentially we need to also include static assets from extensions (although I suspect those are mostly reached via static.php).

I'll scavenge the production logs to confirm what gets handled directly by apache.

Things that get served statically include:

That's about it. Most static assets are served via load.php and static.php as they should be.

Most interestingly, I think we can get away with not including the static assets in extensions in this webserver image, but just check out the portals repository and copy over the current static assets from the multiversion image.

This doesn't work:

curl -v -H 'X-Wikimedia-Debug: backend=k8s-experimental' https://en.wikipedia.org/favicon.ico

Responds with an HTTP 500 with text:

<!DOCTYPE html>
<p>Failed to fetch URL &quot;https://en.wikipedia.org/static/favicon/wikipedia.ico&quot;</p>

This works:

curl -v -H 'X-Wikimedia-Debug: backend=k8s-experimental' https://en.wikipedia.org/static/favicon/wikipedia.ico

Hmm. What's going on?

/favicon.ico is (or should be) rewritten to w/favicon.php.

I think the issue is that the php-fpm container processing the request is trying to make an outgoing HTTP request to https://en.wikipedia.org/static/favicon/wikipedia.ico and it is being blocked.

Note: There is always a delay of 3 seconds before the 500 response is returned.

favicon.ico issue is an example of T288848

Change 713488 had a related patch set uploaded (by Ahmon Dancy; author: Ahmon Dancy):

[operations/puppet@production] httpbb: Add check for https://en.wikipedia.org/favicon.ico

https://gerrit.wikimedia.org/r/713488

Change 713488 merged by RLazarus:

[operations/puppet@production] httpbb: Add check for https://en.wikipedia.org/favicon.ico

https://gerrit.wikimedia.org/r/713488

@Joe As of docker-registry.wikimedia.org/restricted/mediawiki-webserver:2021-08-04-134912-webserver it looks like all necessary files are included in the image. If this is not the case, please respond with examples of URLs that do not work correctly. https://en.wikipedia.org/favicon.ico is one.

Change 720817 had a related patch set uploaded (by Ahmon Dancy; author: Ahmon Dancy):

[operations/puppet@production] Add tests to exercise uses of the php symlink in operations/mediawiki-config

https://gerrit.wikimedia.org/r/720817

Can someone give me an example of a curl command that exercises the /w/static.php codepath?

Change 720817 merged by Alexandros Kosiaris:

[operations/puppet@production] Add tests to exercise uses of the php symlink in operations/mediawiki-config

https://gerrit.wikimedia.org/r/720817

As far as I can tell, we're still missing some symlinks:

# docker run --rm -ti --user root --entrypoint /bin/bash docker-registry.discovery.wmnet/restricted/mediawiki-webserver:2021-08-04-134912-webserver
root@5256de3030c6:/srv/mediawiki# ls -la /srv/mediawiki/docroot/standard-docroot/static/current
total 0
drwxr-xr-x 2 somebody somebody 54 Feb 23  2021 .
drwxr-xr-x 6 somebody somebody 69 Feb 23  2021 ..
lrwxrwxrwx 1 somebody somebody 29 Feb 23  2021 extensions -> /srv/mediawiki/php/extensions
lrwxrwxrwx 1 somebody somebody 28 Feb 23  2021 resources -> /srv/mediawiki/php/resources
lrwxrwxrwx 1 somebody somebody 24 Feb 23  2021 skins -> /srv/mediawiki/php/skins
root@5256de3030c6:/srv/mediawiki# ls -la /srv/mediawiki/php/extensions
ls: cannot access '/srv/mediawiki/php/extensions': No such file or directory
root@5256de3030c6:/srv/mediawiki#

@Joe I propose decoupling the webserver and app images by eliminating all uses of the 'php' symlink in operations/mediawiki-config and replacing them with Apache rewrites that use static.php in the app container. However, I haven't yet figured out the right way to use static.php. Open to suggestions!

@Krinkle Timo, is what I propose above feasible?

Can someone give me an example of a curl command that exercises the /w/static.php codepath?

See also Grafana: MediaWiki Static.

Additional thing to note is that in Varnish, these URLs are routed such that they are all treated as if their hostname is en.wikipedia.org, even if the request is for a different wiki. This should be transparent to you, but might help explain certain behaviour. This is done because static.php is deterministic regardless of the current wiki, and thus improves CDN capacity/efficiency/hit-rate.

@dancy I like your idea, even if I generally don't like using rewrite rules much.

I'll try to bake a set of rewrite rules that prevent direct access to static resources that are currently under /php/, and route them via static.php (which is btw what the wikis themselves do).

Change 721258 had a related patch set uploaded (by Giuseppe Lavagetto; author: Giuseppe Lavagetto):

[operations/deployment-charts@master] mediawiki: allow rewriting static assets to multiversion

https://gerrit.wikimedia.org/r/721258

Change 721258 merged by jenkins-bot:

[operations/deployment-charts@master] mediawiki: allow rewriting static assets to multiversion

https://gerrit.wikimedia.org/r/721258

Change 721265 had a related patch set uploaded (by Giuseppe Lavagetto; author: Giuseppe Lavagetto):

[operations/puppet@production] mediawiki::web::sites: allow k8s-only parameters

https://gerrit.wikimedia.org/r/721265

Change 721265 merged by Giuseppe Lavagetto:

[operations/puppet@production] mediawiki::web::sites: allow k8s-only parameters

https://gerrit.wikimedia.org/r/721265

As it stands, my new configuration would mean that we're going to issue a permanent redirect for anything under /static/current/skins to /w/skins and so on for extensions and resources.

Of course, if we like this approach we should modify production to do the same; at the end of the day, it seems to me that we only have these symlinks for historical reasons and we don't really want people to use them anymore.
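For illustration, the permanent redirects described above could look roughly like this in the Apache vhost (the actual rules in the deployment-charts change may differ):

```apache
# Sketch of the proposed redirects; not the merged configuration.
RewriteEngine On
RewriteRule ^/static/current/skins/(.*)$      /w/skins/$1      [R=301,L]
RewriteRule ^/static/current/resources/(.*)$  /w/resources/$1  [R=301,L]
RewriteRule ^/static/current/extensions/(.*)$ /w/extensions/$1 [R=301,L]
```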

@Krinkle @dancy LMK if you agree with this approach.

at the end of the day it seems to me that we only have these symlinks for historical reasons and we don't really want people to use them anymore.

+1. I would love to see these symlinks disappear eventually.

As it stands, my new configuration would mean that we're going to issue a permanent redirect for anything under /static/current/skins to /w/skins and so on for extensions and resources.

Of course, if we like this approach we should modify production to do the same; at the end of the day, it seems to me that we only have these symlinks for historical reasons and we don't really want people to use them anymore.

Not exactly historical. We currently offer and expect three ways to serve static files from MediaWiki in production:

  1. /w/(extensions|skins|resources)/*?1234567

These are considered immutable, publicly cacheable for a year, and consolidated at the edge in a hostname-agnostic way. That is, all wiki hostnames share the same cache object. These URLs are the ones we use most commonly. E.g. anything internally referenced by MediaWiki that is aimed at the general audience and meant to perform well, is automatically formatted this way by ResourceLoader, CSSMin or OutputPage.

The reason we're able to serve this in a hostname-agnostic fashion with immutable/long-term caching is that resources are always referenced by their current version (which the server knows because we map the URL to a file, sha1sum the file, and insert the current checksum as known to the PHP code into the URL).

The reason this doesn't cause UX problems (e.g. different pages retain references to different URLs in their ParserCache or CDN), is that we generally never reference such assets directly in the HTML. We always propagate them either through a page-independent stylesheet, or through the ResourceLoader startup manifest (Docs).

The implementation detail is that these URLs are written to /w/static.php (source), which ensures that the hash matches the file currently on disk. (This avoids non-recovering cache poisoning around deployments, details in the file docs).
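As an illustration of the content-hash versioning described above (the five-character hash prefix here is an assumption for the example, not taken from the MediaWiki source):

```shell
# Derive a cache-busting version parameter from file contents, in the
# spirit of what static.php does server-side.
f=$(mktemp)
printf 'body { color: red; }\n' > "$f"   # stand-in for a skin stylesheet
ver=$(sha1sum "$f" | cut -c1-5)          # short prefix of the content hash
echo "example.css?${ver}"                # URL as it would be emitted
rm -f "$f"
```

Because the hash is recomputed from the file currently on disk, a stale URL simply stops matching after a deployment instead of poisoning long-lived caches.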

  2. /w/* without version query

This exists for lower stakes references with weaker guarantees or expectations. This generally shouldn't be used anywhere prominent since we can't cache it for long, and also can't know which wiki version is expected to be used. Example usage includes:

  • Gadgets and user scripts that augment core functionality and re-purpose some of our assets. For example, enwiki: Vector.css references some of the Vector svg icons, but it can't know its current file version as it has no access to the disk to determine that.
  • Debug mode from ResourceLoader where internal JS and CSS files are served without minification at the current version.
  • A tail of random things in core and extensions that reference static files that can't or don't yet use ResourceLoader. Such as Special:Version referencing the COPYING file.

These are currently best-effort cached with a shorter expiry, the CDN treats them as hostname-specific, but it doesn't mean that much since it's also cached for several hours, so it might as well be a random MW branch instead of the current one for the given hostname.

  3. /static/*

This endpoint also offers immutable, long-term caching, with hostname-agnostic CDN handling. The main difference is that it does this without any kind of version parameter, and exists specifically for cases where we have to refer to a file in a way that we can't propagate through a two-step system (like ResourceLoader CSS and JS would). An example would be the project logos and favicons and other such custom assets served from the operations/mediawiki-config repository which we serve from URLs that we pass on to external entities, expose through APIs, and may bake into the HTML through the ParserCache and CDN.

This is never used directly by the MediaWiki software out-of-the-box for two reasons. 1) It wouldn't be multiversion-aware and thus may often serve incompatible files that are too new or too old, and 2) MediaWiki wouldn't know this endpoint exists since it isn't part of the MediaWiki application directory. We only ever point here from wmf-config, either directly by path to a specific resource, or by path to a directory of assets given to a MW extension configuration variable.

In addition to serving custom assets like project logos through here, we also expose /static/current. The name "current" is a remnant of when we still had /static/php-1.XX-wmf.YY, which we used for point 1 above before we had the multiversion-aware static.php proxy. However the "current" directory already existed then and continues to exist to this day for the purpose of serving a static file with strong caching for cases where you can't have a version parameter.

This is for cases where an extension or something in wmf-config can't version its assets or can't use ResourceLoader, but needs to serve assets with strong client-side caching and isn't concerned about propagating updates immediately.

Usage included/includes:

  • ULS fonts. These are not served by ResourceLoader and didn't need to be updated immediately after a change. Looking at it now, it seems ULS was altered to append custom version parameters as a way to propagate updates quicker, and has since mimicked ResourceLoader-style URLs by using /w/ and the version hashes that /w/static.php expects (T135806).
  • Footer "Powered by" icons. These should not have to be redownloaded every day by browsers. These files exist in core, but we configure them in wmf-config to be served from /static/current for better caching performance (change 295184)

There are a small handful of other use cases that come and go. It's not a lot, but it's a powerful tool to have handy. Having said that, I do think we could implement a feature similar to /static/current/ inside /w/static.php. For example, we could say that if the URL ends with ?current we treat it in a special way and give it immutable caching semantics.

However, they should not be redirected, and should not be served from /w/ plainly, as either of those would be a performance regression.