Page MenuHomePhabricator

Scap pool counter error while backporting
Closed, ResolvedPublicBUG REPORT

Description

During UTC late backport window (6/16/22) we encountered the following error:

20:26:58 /usr/bin/sudo -u root -- /usr/local/sbin/check-and-restart-php php7.2-fpm 9223372036854775807 (ran as mwdeploy@mw1412.eqiad.wmnet) returned [2]: Restarting php7.2-fpm: free opcache 685 MB
2022-06-16 20:26:58,580 [ERROR] Error running command with poolcounter: timed out
Traceback (most recent call last):
  File "/usr/lib/python3/dist-packages/poolcounter/client.py", line 383, in run
    backend.get_lock(lock_type, key, **kwargs)
  File "/usr/lib/python3/dist-packages/poolcounter/client.py", line 210, in get_lock
    resp = self._command(req)
  File "/usr/lib/python3/dist-packages/poolcounter/client.py", line 256, in _command
    self._stream = self._connect()
  File "/usr/lib/python3/dist-packages/poolcounter/client.py", line 270, in _connect
    stream.connect((self.ipaddr, self.port))
socket.timeout: timed out
2022-06-16 20:26:58,581 [INFO] Error trying to release the lock: Trying to release a lock without a connection
2022-06-16 20:26:58,582 [INFO] Error trying to release the lock: Trying to release a lock without a connection
2022-06-16 20:26:58,582 [INFO] Error trying to release the lock: Trying to release a lock without a connection
2022-06-16 20:26:58,582 [INFO] Error trying to release the lock: Trying to release a lock without a connection

20:27:22 1 hosts had failures restarting php-fpm

What happens?:

It seemed to sync fine but the error msg popped into stream

What should have happened instead?:

Sync without error msg

Software version (if not a Wikimedia wiki), browser information, screenshots, other information, etc.:

[thcipriani@deploy1002 ~]$ scap version
4.9.4-1

Event Timeline

I did three deployment sessions in the past week, each time encountered this during at least one of the syncs.

I hope this can be prioritised as we do not want deployers to get desensitised or otherwise normalize warning output during a Scap command.

@dancy Can you confirm what the impact of this is today? As I understand it, based on the design conversations, the intention is that scap syncs "must" to the extent possible (e.g. short of being unreachable/offline/depooled) apply the new code to the relevant servers. When the graceful poolcounter times out, I believe there is mean to be a fallback and still restart php-fpm so that the new code is applied. Is that the case? From the output it's not obvious that the essential action was still taken despite there being a "time out" since it appears that the timeout bubbled up fairly high with there being a Scap-internal stack trace shown to the deployer.

00:42:20 Running '/usr/local/sbin/restart-php-fpm-all php7.2-fpm 9223372036854775807' on 307 host(s)
00:42:27 /usr/bin/sudo -u root -- /usr/local/sbin/restart-php-fpm-all php7.2-fpm 9223372036854775807 (ran as mwdeploy@mw1370.eqiad.wmnet) returned [2]: 2022-07-14 00:42:27,056 [ERROR] Error running command with poolcounter: timed out
Traceback (most recent call last):
  File "/usr/lib/python3/dist-packages/poolcounter/client.py", line 382, in run
    backend.get_lock(lock_type, key, **kwargs)
  File "/usr/lib/python3/dist-packages/poolcounter/client.py", line 210, in get_lock
    resp = self._command(req)
  File "/usr/lib/python3/dist-packages/poolcounter/client.py", line 255, in _command
    self._stream = self._connect()
  File "/usr/lib/python3/dist-packages/poolcounter/client.py", line 269, in _connect
    stream.connect((self.ipaddr, self.port))
socket.timeout: timed out
2022-07-14 00:42:27,057 [INFO] Error trying to release the lock: Trying to release a lock without a connection
2022-07-14 00:42:27,057 [INFO] Error trying to release the lock: Trying to release a lock without a connection
2022-07-14 00:42:27,057 [INFO] Error trying to release the lock: Trying to release a lock without a connection
2022-07-14 00:42:27,057 [INFO] Error trying to release the lock: Trying to release a lock without a connection

00:42:35 1 hosts had failures restarting php-fpm
00:44:15 Finished php-fpm-restarts (duration: 01m 55s)

In particular the log line "1 hosts had failures restarting php-fpm" sounds a lot like it might have in fact not restarted php-fpm.

If that is the case, then we need to announce to all deployers that if they see this, they have to basically run the deploy again. Is that right?

Reedy renamed this task from Scap pool counter error while backporting to Scap pool counter error while backporting.Thu, Jul 14, 1:01 AM
Reedy updated the task description. (Show Details)

I see that the python client uses a connection timeout of 1s without retries. Linux has a hard-coded SYN retry time of 1s, so if a SYN packet is lost, the 1s timeout will likely be exceeded. Using a default connect timeout of 2s would be an easy way to avoid this issue. MediaWiki uses a connect timeout of 10ms with an application-level retry after a sleep of 1ms.

Change 814851 had a related patch set uploaded (by Ahmon Dancy; author: Ahmon Dancy):

[operations/software/python-poolcounter@master] Avoid additional errors if connection to poolcounter server fails

https://gerrit.wikimedia.org/r/814851

I did three deployment sessions in the past week, each time encountered this during at least one of the syncs.

I hope this can be prioritised as we do not want deployers to get desensitised or otherwise normalize warning output during a Scap comment.

Agreed.

@dancy Can you confirm what the impact of this is today?
In particular the log line "1 hosts had failures restarting php-fpm" sounds a lot like it might have in fact not restarted php-fpm.

I looked through operations/puppet/modules/conftool/files/safe-service-restart.py and I don't see any code for falling back to an ungraceful php-fpm restart if using the poolcounter approach fails.

If that is the case, then we need to announce to all deployers that if they see this, they have to basically run the deploy again. Is that right?

That's right.

I looked through operations/puppet/modules/conftool/files/safe-service-restart.py and I don't see any code for falling back to an ungraceful php-fpm restart if using the poolcounter approach fails.

The reason for this is that if we can't connect to poolcounter, we can't guarantee that concurrency will be respected, and worse things could happen otherwise.

Change 815747 had a related patch set uploaded (by Giuseppe Lavagetto; author: Giuseppe Lavagetto):

[operations/software/python-poolcounter@master] Raise the default connection timeout to 2 seconds

https://gerrit.wikimedia.org/r/815747

Change 814851 merged by Giuseppe Lavagetto:

[operations/software/python-poolcounter@master] Avoid additional errors if connection to poolcounter server fails

https://gerrit.wikimedia.org/r/814851

Change 815747 merged by Giuseppe Lavagetto:

[operations/software/python-poolcounter@master] Raise the default connection timeout to 2 seconds

https://gerrit.wikimedia.org/r/815747

I looked through operations/puppet/modules/conftool/files/safe-service-restart.py and I don't see any code for falling back to an ungraceful php-fpm restart if using the poolcounter approach fails.

The reason for this is that if we can't connect to poolcounter, we can't guarantee that concurrency will be respected, and worse things could happen otherwise.

Joe changed the task status from Open to In Progress.Tue, Jul 26, 6:53 AM
Joe claimed this task.
Joe triaged this task as Medium priority.

Mentioned in SAL (#wikimedia-operations) [2022-07-26T06:58:11Z] <_joe_> upgrade all of codfw to python3-poolcounter 0.0.3 T310835

Mentioned in SAL (#wikimedia-operations) [2022-07-26T07:30:22Z] <_joe_> running a restart-all for php-fpm on appservers in codfw to test python-poolcounter 0.0.3 T310835

Mentioned in SAL (#wikimedia-operations) [2022-07-26T07:48:27Z] <_joe_> deploy python3-poolcounter everywhere T310835

I deployed the new version of python3-poolcounter everywhere, tentatively resolving for now.