Page MenuHomePhabricator

Parsoid-php doesn't get updated after a code deploy
Closed, ResolvedPublic

Description

Parsoid-php is in a peculiar situation, where code deployed via scap3 is included into MediaWiki.

We have noticed that after a parsoid code deploy, the application servers were still serving an old version of the code.

This happens not because of an opcache malfunction (although the cure, probably, is the same in this case), but an underlying problem with how we include files and how opcache stores them.

When we store files in opcache, symbolic links are resolved. So when we include /srv/deployment/parsoid/src/$path, that becomes a cache key like /srv/deployment/parsoid/deploy-cache/revs/451db1e60149b05781ec1e63c949c932c08d4a8b/.

Sadly, php-fpm has a realpath cache that interferes with opcache invalidation, creating a race condition. See for instance https://engineering.facile.it/blog/eng/realpath-cache-is-it-all-php-opcache-s-fault/

I could prove it by simply watching the contents of opcache during a deployment:

while true; do sleep 1; date; TO_FIND=$(readlink /srv/deployment/parsoid/deploy); echo "Parsoid code at $TO_FIND"; php7adm opcache-meta; jq . /tmp/opcache_dump_meta | fgrep $TO_FIND -  | wc -l; done

As I sadly expected, the realpath cache issue made the revalidation of opcache visit the old file that was never refreshed, with the exception of newly spawned children of php-fpm (as the realpath cache is process-local and not shared like opcache).

All concluded, I think we have the following options to solve the issue:

  1. Make scap3 restart php-fpm
  2. Disable the realpath cache on php-fpm
  3. Include parsoid-php in the mediawiki repository

I would opt for solution 2 temporarily, and move to 3 on the long term.

Event Timeline

Mentioned in SAL (#wikimedia-operations) [2019-10-23T13:57:16Z] <_joe_> manually changing the symlinked deployed version of parsoid on wtp1025 T236275

Thanks @Joe. What are the perf. implications of solution (2)? Is there a reason why you don't prefer (1) as the temporary solution?

Change 545600 had a related patch set uploaded (by Giuseppe Lavagetto; owner: Giuseppe Lavagetto):
[operations/puppet@production] parsoid: allow restarting safely php-fpm during deployments.

https://gerrit.wikimedia.org/r/545600

Change 545600 merged by Giuseppe Lavagetto:
[operations/puppet@production] parsoid: allow restarting safely php-fpm during deployments.

https://gerrit.wikimedia.org/r/545600

Change 545702 had a related patch set uploaded (by Giuseppe Lavagetto; owner: Giuseppe Lavagetto):
[operations/puppet@production] parsoid: actually support safe restarts at deploy time.

https://gerrit.wikimedia.org/r/545702

Change 545702 merged by Giuseppe Lavagetto:
[operations/puppet@production] parsoid: actually support safe restarts at deploy time.

https://gerrit.wikimedia.org/r/545702

Change 545703 had a related patch set uploaded (by Giuseppe Lavagetto; owner: Giuseppe Lavagetto):
[mediawiki/services/parsoid/deploy@master] Restart php-fpm upon deployment

https://gerrit.wikimedia.org/r/545703

Thanks @Joe. What are the perf. implications of solution (2)? Is there a reason why you don't prefer (1) as the temporary solution?

While perf implications are not large, that solution proved unreliable in fixing our problem properly. So I have worked to implement solution 1 - which should now be ready.

Change 545703 merged by jenkins-bot:
[mediawiki/services/parsoid/deploy@master] Restart php-fpm upon deployment

https://gerrit.wikimedia.org/r/545703

When trying to deploy to the beta cluster today, we hit,

Executing check 'restart_php'
Unhandled error:
deploy-local failed: <OSError> {u'child_traceback': u'Traceback (most recent call last):\n  File "/usr/lib/python2.7/subprocess.py", line 1292, in _execute_child\n    os.execvpe(executable, args, env)\n  File "/usr/lib/python2.7/os.py", line 355, in execvpe\n    _execvpe(file, args, env)\n  File "/usr/lib/python2.7/os.py", line 370, in _execvpe\n    func(file, *argrest)\nOSError: [Errno 2] No such file or directory\n'}

18:54:30 ['/usr/bin/scap', 'deploy-local', '-v', '--repo', 'parsoid/deploy', '-g', 'default', 'rollback', '--refresh-config'] on deployment-mediawiki-parsoid10.deployment-prep.eqiad.wmflabs returned [70]: Rolling back from revision d932d6ad1b2820432a8a674eac9706eaccada492 to 4c64c9c1f0a60f1f6ebc9767a8492d9631c8c8ee
Removing old revision /srv/deployment/parsoid/deploy-cache/revs/77630c59f081ecef658fea29861a6c2835b7add9
Check depool is empty and will not be run
Check repool is empty and will not be run

which presumably means that command: /usr/local/sbin/restart-php7.2-fpm is not on those servers.

This comment was removed by Arlolra.

On beta cluster, we worked around this by continuing the deployment and restarting the php7.2-fpm service manually.

When we proceeded to production and deployed a new version of parsoid, it failed for other reasons. See a snippet of the transcript below:

ssastry@deploy1001:/srv/deployment/parsoid/deploy$ scap deploy 'Update parsoid to 089bf28d'
19:38:25 Started deploy [parsoid/deploy@d932d6a]
19:38:25 Deploying Rev: HEAD = d932d6ad1b2820432a8a674eac9706eaccada492
19:38:26 Started deploy [parsoid/deploy@d932d6a]: Update parsoid to 089bf28d
19:38:26
== CANARY ==
:* wtp2001.codfw.wmnet
:* wtp2002.codfw.wmnet
:* wtp1025.eqiad.wmnet
:* wtp1026.eqiad.wmnet
parsoid/deploy: fetch stage(s): 100% (ok: 4; fail: 0; left: 0)
parsoid/deploy: config_deploy stage(s): 100% (ok: 4; fail: 0; left: 0)
19:39:14 ['/usr/bin/scap', 'deploy-local', '-v', '--repo', 'parsoid/deploy', '-g', 'canary', 'promote', '--refresh-config'] on wtp2001.codfw.wmnet returned [1]: Linking config files at: /srv/deployment/parsoid/deploy-cache/revs/d932d6ad1b2820432a8a674eac9706eaccada492/.git/config-files
Executing check 'depool'
Check 'depool' completed, output: 2019-10-28 19:38:56,325 [INFO] Depooling currently pooled services

Executing check 'symlink'
Check 'symlink' completed, output:
Restarting service 'parsoid'
Port 8000 not up. Waiting 3.00s
Port 8000 up in 3.00s
Executing check 'repool'
Check 'repool' completed, output: 2019-10-28 19:39:02,622 [INFO] Pooling currently depooled services
2019-10-28 19:39:03,123 [WARNING] LB lvs2003:9090 reports pool parsoid_8000 as enabled/down/not pooled, should be enabled/up/pooled
2019-10-28 19:39:08,135 [WARNING] LB lvs2003:9090 reports pool parsoid_8000 as enabled/down/not pooled, should be enabled/up/pooled

Executing check 'restart_php'
Check 'restart_php' failed: 2019-10-28 19:39:13,755 [INFO] Depooling currently pooled services
2019-10-28 19:39:14,074 [INFO] Restarting the service
Failed to restart php7.2-fpm.service: The name org.freedesktop.PolicyKit1 was not provided by any .service files
See system logs and 'systemctl status php7.2-fpm.service' for details.
2019-10-28 19:39:14,081 [ERROR] Executing command systemctl restart php7.2-fpm.service failed: Command '['systemctl', 'restart', 'php7.2-fpm.service']' returned non-zero exit status 1
2019-10-28 19:39:14,082 [WARNING] Service restart failed. NOT repooling

....


parsoid/deploy: rollback stage(s): 100% (ok: 0; fail: 4; left: 0)
19:41:08 4 targets had deploy errors
19:41:08 4 targets failed
19:41:08 Finished deploy [parsoid/deploy@d932d6a]: Update parsoid to 089bf28d (duration: 02m 42s)
19:41:08 Finished deploy [parsoid/deploy@d932d6a] (duration: 02m 42s)

The name org.freedesktop.PolicyKit1 was not provided by any .service files

The error really does not make it obvious but this should be a case of "just missing sudo".

see stackexchange

quote from the comments there "Possibly the most counter-intuitive error ever." :p

Change 546758 had a related patch set uploaded (by Dzahn; owner: Dzahn):
[operations/puppet@production] parsoid: fix path to systemctl for php-restart sudo command line

https://gerrit.wikimedia.org/r/546758

Change 546758 merged by Dzahn:
[operations/puppet@production] parsoid: fix path to systemctl for php-restart sudo command line

https://gerrit.wikimedia.org/r/546758

@ssastry The error in production should hopefully be gone now:

root@wtp1026:/etc/sudoers.d# sudo -u deploy-service sudo -u root /usr/local/sbin/restart-php7.2-fpm
2019-10-28 23:43:17,500 [INFO] The server is depooled from all services. Restarting the service directly
root@wtp1026:/etc/sudoers.d#

@ssastry The error in production should hopefully be gone now:

root@wtp1026:/etc/sudoers.d# sudo -u deploy-service sudo -u root /usr/local/sbin/restart-php7.2-fpm
2019-10-28 23:43:17,500 [INFO] The server is depooled from all services. Restarting the service directly
root@wtp1026:/etc/sudoers.d#

Thanks! Will retry deploy tomorrow again.

Prod problem is gone now - beta problem is still open.

Change 546985 had a related patch set uploaded (by Dzahn; owner: Dzahn):
[mediawiki/services/parsoid/deploy@master] add seperate restart_php command for beta

https://gerrit.wikimedia.org/r/546985

Change 546985 merged by jenkins-bot:
[mediawiki/services/parsoid/deploy@master] add seperate restart_php command for beta

https://gerrit.wikimedia.org/r/546985

Alas, we're not quite there for beta,

19:53:36 ['/usr/bin/scap', 'deploy-local', '-v', '--repo', 'parsoid/deploy', '-g', 'default', 'promote', '--refresh-config'] on deployment-parsoid09.deployment-prep.eqiad.wmflabs returned [1]: Linking config files at: /srv/deployment/parsoid/deploy-cache/revs/a69ec92e21cc4be117daaadef4a8fc5bf5813fcf/.git/config-files
Check depool is empty and will not be run
Check repool is empty and will not be run
Executing check 'symlink'
Check 'symlink' completed, output: 
Restarting service 'parsoid'
Port 8001 not up. Waiting 3.00s
Port 8001 up in 3.00s
Check depool is empty and will not be run
Check repool is empty and will not be run
Executing check 'restart_php'
Check 'restart_php' failed: 
We trust you have received the usual lecture from the local System
Administrator. It usually boils down to these three things:

    #1) Respect the privacy of others.
    #2) Think before you type.
    #3) With great power comes great responsibility.

sudo: no tty present and no askpass program specified

Change 547349 had a related patch set uploaded (by Dzahn; owner: Dzahn):
[operations/puppet@production] parsoid-php: on beta, add sudo privs for php-fpm restarts

https://gerrit.wikimedia.org/r/547349

Change 547349 merged by Dzahn:
[operations/puppet@production] parsoid-php: on beta, add sudo privs for php-fpm restarts

https://gerrit.wikimedia.org/r/547349

Here's the thing with the beta setup:

In production the use_php Hiera key is what turns a classic (non-MW) parsoid server into a MW-parsoid server. Setting that to true includes all these classes:

require ::profile::mediawiki::scap_proxy
require ::profile::mediawiki::common
require ::profile::mediawiki::nutcracker
require ::profile::mediawiki::mcrouter_wancache
require ::profile::prometheus::mcrouter_exporter
require ::profile::services_proxy
require ::profile::mediawiki::php
require ::profile::mediawiki::php::monitoring
include ::profile::mediawiki::php::restarts
require ::profile::mediawiki::webserver

and in addition it adds the sudo::user { 'scap3_restart_php': needed for the restarts.

In deployment-prep though use_php is not set in Hiera for the instance which means even though it uses role::parsoid -> profile::parsoid it does not get any of the MediaWiki profiles and was missing the sudo::user for the restarts.

Now you may wonder how does the instance have MediaWiki installed anyways. That is because, unlike prod where each node can only have exactly one role, it has several roles assigned to it. Instead of only role::parsoid with use_php it uses these different ones directly:

role::parsoid
role::mediawiki::common
role::mediawiki::appserver
role::beta::mediawiki
profile::mediawiki::php

The ideal fix would be to have an instance that is more like prod, only includes a single role and has $use_php in Hiera. Just doing that without removing other roles would most likely create duplication definition errors in Puppet though.

As a quicker fix and since the whole restart setup is temporary anyways i just used if $use_php or $::realm == 'labs' {to get the missing sudo::user.

https://gerrit.wikimedia.org/r/c/operations/puppet/+/547349/2/modules/profile/manifests/parsoid.pp#33

This added the following on deployment-mediawiki-parsoid10:

root@deployment-mediawiki-parsoid10:/etc/sudoers.d# cat scap3_restart_php 
# This file is managed by Puppet!

deploy-service ALL = (root) NOPASSWD: /usr/local/sbin/restart-php7.2-fpm
deploy-service ALL = (root) NOPASSWD: /bin/systemctl restart php7.2-fpm

And now i can sudo /bin/systemctl restart php7.2-fpm as deploy-service@deployment-mediawiki-parsoid10 and that matches the separate restart command for beta.

deploy-service@deployment-mediawiki-parsoid10:/etc/sudoers.d$ sudo /bin/systemctl restart php7.2-fpm
deploy-service@deployment-mediawiki-parsoid10:/etc/sudoers.d$

Should be resolved now, please let me know next time you deploy.

Some notes from today's deploy.

Here's the log, https://phabricator.wikimedia.org/P9543

restart_php runs after repool, despite the attempt to fix that in https://gerrit.wikimedia.org/r/c/mediawiki/services/parsoid/deploy/+/547311

Note sure why there're two depool / repool cycles, rather than just depool, perform actions, repool

restart_php failed, yet scap reported all good (ok: 2; fail: 0; left: 0)

Oh, and the above log is from the beta cluster deploy.

Production didn't seem to have the restart_php failure issue, however, it did have the ordering issue.

The result being,

"PHP Fatal Error: Class 'Parsoid\Config\SiteConfig' not found"

errors showing up on the dashboard,
https://logstash.wikimedia.org/app/kibana#/dashboard/AW4Y6bumP44edBvO7lRc

restart_php failed, yet scap reported all good (ok: 2; fail: 0; left: 0)

The reason why it failed to restart on beta changed. Previously it was missing sudo privileges which was fixed by https://gerrit.wikimedia.org/r/c/operations/puppet/+/547349

Now we have another issue, missing unit file:

Check 'restart_php' failed: Failed to restart php7.2-fpm.service: Unit php7.2-fpm.service failed to load: No such file or directory.

It is expected you see that failure as one of the two servers you're deploying to in beta, deployment-parsoid09.deployment-prep.eqiad.wmflabs, doesn't have MediaWiki or php7 installed. The other server, deployment-mediawiki-parsoid10.deployment-prep.eqiad.wmflabs, should restart php just fine.

In fact, php 7.2 has been restarted at Wed 2019-11-06 19:55:17 UTC which matches @Arlolra's paste.

Resolving this task as it's done.

All concluded, I think we have the following options to solve the issue:

  1. Make scap3 restart php-fpm
  2. Disable the realpath cache on php-fpm
  3. Include parsoid-php in the mediawiki repository

I would opt for solution 2 temporarily, and move to 3 on the long term.

Since we're now including parsoid in the main mediawiki vendor repository, does this task need to be re-opened and revisited?

The underlying issues here might resurface as we reconfigure beta due to T246833: Parsoid/RESTbase seems to be unavailable in Beta/T246854: Replace deployment-mediawiki-parsoid10 with a "purer" deployment-parsoid11 box.

(This comment is mostly just to mention these bugs for cross-linking purposes.)