
Apache on doc1001 does not see updated PHP files for hours/days after deployment
Closed, Resolved (Public)

Description

I merged https://gerrit.wikimedia.org/r/662785 (T273247) and deployed it with Scap, but https://doc.wikimedia.org/ keeps showing the previous content.

== DEFAULT ==
:* doc1001.eqiad.wmnet
:* contint2001.wikimedia.org
:* contint1001.wikimedia.org

I confirmed on doc1001 that /srv/deployment/integration/docroot contains the commit, and that org/wikimedia/doc/opensource.yaml contains the […] change.

But https://doc.wikimedia.org/ and https://doc.wikimedia.org/?random=1234 keep showing outdated content. I've also checked locally on doc1001:

krinkle@doc1001:~$ curl 'http://doc1001.eqiad.wmnet' -H 'Host: doc.wikimedia.org'

And strangely that also shows outdated content. This suggests one of two things to me:

  1. […]
  2. Or, there is some kind of additional caching local to this server that I'm not familiar with and that Scap does not know to reload or purge.

[…] But where was it cached? Does the local Apache have an HTTP cache proxy that serves stale responses if it gets 500 from PHP?

This is still consistently an issue. E.g. after deploying https://gerrit.wikimedia.org/r/666245, the server keeps producing fresh HTTP responses from Apache with old PHP code. Same as last time, it does not appear to correct itself; it just stays that way, presumably until a root does a hard Apache or php-fpm restart.

Maybe it has something to do with the symlinks used by /srv/deployment? Or maybe opcache is configured on this host to cache forever and never look at file paths?

Tagging SRE and RelEng. I don't know who the first respondent is for this service. It seems that since Dec 2020 (ref T149924) it has basically become impossible to deploy changes to integration.wm.o and doc.wm.o. The only way changes get applied is if a root SRE restarts Apache and/or php-fpm locally on the doc1001 backend.

Event Timeline

I quickly discussed this earlier this morning, pointing out that we have an issue with e.g. file_get_contents( __DIR__ . '/opensource.yaml' ); and with how changes to PHP files are not taken into account after a symlink swap.

Giuseppe pointed out that there is a file stat cache, which might explain the issue for file_get_contents. But most importantly there is the opcache: if php-fpm serves the old file, then __DIR__ surely resolves to the old directory and the old opensource.yaml file is processed.
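The mechanism suspected above can be modelled outside PHP. The following is only a toy sketch in Python, not how php-fpm or the opcache is implemented: it assumes a long-lived worker that resolves the deployment symlink once and remembers the result. The names `CachedWorker`, `build_release`, `swap_symlink`, `rev-old`, `rev-new` and `current` are hypothetical and chosen for illustration.

```python
import os
import tempfile


def build_release(root, name, yaml_text):
    """Create a release directory holding its own opensource.yaml."""
    release = os.path.join(root, name)
    os.makedirs(release)
    with open(os.path.join(release, "opensource.yaml"), "w") as f:
        f.write(yaml_text)
    return release


def swap_symlink(link, target):
    """Point `link` at `target` by renaming a temporary symlink over it."""
    tmp = link + ".tmp"
    os.symlink(target, tmp)
    os.replace(tmp, link)  # rename(2) replaces the old symlink itself


class CachedWorker:
    """Toy stand-in for a long-lived worker that resolves its directory once."""

    def __init__(self, link):
        self.link = link
        self._resolved_dir = None  # plays the role of a cached __DIR__

    def read_yaml(self):
        if self._resolved_dir is None:
            self._resolved_dir = os.path.realpath(self.link)
        with open(os.path.join(self._resolved_dir, "opensource.yaml")) as f:
            return f.read()

    def restart(self):
        """Drop the cached resolution, as a process restart would."""
        self._resolved_dir = None


def demo():
    with tempfile.TemporaryDirectory() as root:
        old = build_release(root, "rev-old", "version: old\n")
        new = build_release(root, "rev-new", "version: new\n")
        link = os.path.join(root, "current")
        os.symlink(old, link)

        worker = CachedWorker(link)
        before = worker.read_yaml()   # resolves and caches rev-old
        swap_symlink(link, new)       # the "deployment"
        stale = worker.read_yaml()    # still reads from rev-old
        worker.restart()
        fresh = worker.read_yaml()    # now reads from rev-new
        return before, stale, fresh


if __name__ == "__main__":
    print(demo())
```

In this model the worker keeps returning the old file after the swap until its cached resolution is dropped, which matches the symptom reported here: fresh HTTP responses built from old code and old data until a restart.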

To save us from figuring out opcache tuning or debugging PHP internals related to cache invalidation, the easiest by far is to have Scap restart php-fpm after deployment.

Change 666308 had a related patch set uploaded (by Hashar; owner: Hashar):
[integration/docroot@master] scap: restart php-fpm on promote

https://gerrit.wikimedia.org/r/666308

Change 666309 had a related patch set uploaded (by Hashar; owner: Hashar):
[operations/puppet@production] doc: script to restart php-fpm

https://gerrit.wikimedia.org/r/666309

Change 666309 merged by Jbond:
[operations/puppet@production] doc: script to restart php-fpm

https://gerrit.wikimedia.org/r/666309

Mentioned in SAL (#wikimedia-operations) [2021-03-16T15:03:24Z] <hashar@deploy1002> Started deploy [integration/docroot@44d5685]: Verify check can restart php-fpm # T275468

Mentioned in SAL (#wikimedia-operations) [2021-03-16T15:03:34Z] <hashar@deploy1002> Finished deploy [integration/docroot@44d5685]: Verify check can restart php-fpm # T275468 (duration: 00m 07s)

Change 666308 merged by jenkins-bot:
[integration/docroot@master] scap: restart php-fpm on promote

https://gerrit.wikimedia.org/r/666308

Change 672741 had a related patch set uploaded (by Hashar; owner: Hashar):
[integration/docroot@master] scap: only restart php-fpm on doc* hosts

https://gerrit.wikimedia.org/r/672741

Change 672741 merged by jenkins-bot:
[integration/docroot@master] scap: only restart php-fpm on doc* hosts

https://gerrit.wikimedia.org/r/672741

We are now restarting php-fpm on the doc hosts. That clears out the opcache and should address the stale content after a deployment.
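For readers who want the shape of such a post-deploy step, here is a minimal sketch in Python. It is not the script merged in the patches above. The `doc*` host pattern comes from the patch title; the unit name `php-fpm`, the use of `sudo systemctl`, and the function names are assumptions made for illustration.

```python
import fnmatch
import subprocess

DOC_HOST_PATTERN = "doc*"   # from the patch title "only restart php-fpm on doc* hosts"
PHP_FPM_UNIT = "php-fpm"    # assumption: the real unit name may differ per host


def should_restart(hostname, pattern=DOC_HOST_PATTERN):
    """Match the short host name (before the first dot) against the pattern."""
    short = hostname.split(".", 1)[0]
    return fnmatch.fnmatchcase(short, pattern)


def restart_command(unit=PHP_FPM_UNIT):
    """Build the restart command; sudo + systemctl is an assumption."""
    return ["sudo", "systemctl", "restart", unit]


def post_deploy(hostname, run=False):
    """Return the restart command for doc hosts, or None for other hosts.

    With run=False (the default) nothing is executed, so the logic can be
    checked on any machine.
    """
    if not should_restart(hostname):
        return None
    cmd = restart_command()
    if run:
        subprocess.run(cmd, check=True)
    return cmd


if __name__ == "__main__":
    for host in ("doc1001.eqiad.wmnet", "contint2001.wikimedia.org"):
        print(host, post_deploy(host))
```

Keeping the host check separate from the command means the contint hosts in the same deployment group are skipped without error, which is the behaviour the second patch title describes.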