
Socket Errors on PHP7
Closed, Resolved · Public

Description

When we temporarily reduced PHP7 traffic to 0%, we noticed that socket errors decreased. As we pushed more and more traffic to PHP 7.2, socket errors kept increasing, specifically the 'tcp/attemptfails' metric.

After researching, we found that the kernel increments this counter when:

a) a TCP packet has both the RST and SYN flags set (not the case here)
b) a connection attempt fails while the socket is still in the SYN_SENT or SYN_RECV state

In our case:

# netstat -s
    4224373 failed connection attempts
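
The "failed connection attempts" line in netstat -s corresponds to the AttemptFails field of the Tcp rows in /proc/net/snmp on Linux. A minimal sketch for reading that counter directly, for example to sample it before and after a test, is shown here. The function names and the SAMPLE text are illustrative assumptions; only the 4224373 value comes from the output above, and the other sample numbers are placeholders.

```python
def tcp_attempt_fails(snmp_text: str) -> int:
    """Return the Tcp AttemptFails counter from /proc/net/snmp-style text."""
    # /proc/net/snmp has two "Tcp:" lines: field names, then values.
    tcp_lines = [line.split() for line in snmp_text.splitlines()
                 if line.startswith("Tcp:")]
    if len(tcp_lines) < 2:
        raise ValueError("no Tcp header/value pair found")
    header, values = tcp_lines[0], tcp_lines[1]
    fields = dict(zip(header[1:], values[1:]))
    return int(fields["AttemptFails"])


def read_attempt_fails(path: str = "/proc/net/snmp") -> int:
    """Read the live counter on a Linux host."""
    with open(path) as f:
        return tcp_attempt_fails(f.read())


# Illustrative sample only: placeholder values except AttemptFails.
SAMPLE = """\
Tcp: RtoAlgorithm RtoMin RtoMax MaxConn ActiveOpens PassiveOpens AttemptFails EstabResets CurrEstab
Tcp: 1 200 120000 -1 100 50 4224373 7 12
"""

if __name__ == "__main__":
    print(tcp_attempt_fails(SAMPLE))
```

Sampling read_attempt_fails() twice around a single request would show whether that request alone bumps the counter.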

Event Timeline

jijiki created this task. May 29 2019, 7:19 AM
Restricted Application added a subscriber: Aklapper. May 29 2019, 7:19 AM
jijiki triaged this task as Normal priority. May 29 2019, 7:20 AM
jijiki added a subscriber: Joe.

Change 513033 had a related patch set uploaded (by Effie Mouzeli; owner: Effie Mouzeli):
[operations/mediawiki-config@master] Remove kafka1018 from ProductionServices.php

https://gerrit.wikimedia.org/r/513033

Joe assigned this task to jijiki. Jun 21 2019, 6:59 AM
Joe moved this task from PHP7 migration: backlog to Doing on the serviceops board.

Change 513033 merged by jenkins-bot:
[operations/mediawiki-config@master] Remove kafka1018 from ProductionServices

https://gerrit.wikimedia.org/r/513033

Mentioned in SAL (#wikimedia-operations) [2019-06-21T09:09:44Z] <jiji@deploy1001> Synchronized wmf-config/ProductionServices.php: Remove kafka1018 from ProductionServices - T224538 (duration: 00m 56s)

jijiki moved this task from Doing to Next up on the serviceops board. Jul 3 2019, 12:40 PM
jijiki moved this task from Next up to Doing on the serviceops board. Jul 11 2019, 7:42 AM

Mentioned in SAL (#wikimedia-operations) [2019-07-11T12:39:37Z] <jijiki> Disable puppet on mw1222, server will be depooled and pooled a few times for tests - T224538

Removing kafka1018 didn't fix the problem; still looking.

jijiki moved this task from Doing to Next up on the serviceops board. Jul 17 2019, 7:59 AM
jijiki updated the task description. Tue, Jul 30, 3:52 PM
jijiki added a comment (edited). Mon, Aug 5, 2:19 PM

For connection pooling purposes, when we want to access search.svc.eqiad.wmnet from php-fpm, we do so via nginx. This nginx is installed on each mw* server and listens on *:80. MediaWiki is configured to use it via https://github.com/wikimedia/operations-mediawiki-config/blob/master/wmf-config/ProductionServices.php#L147, which points at localhost. For every request, localhost first resolves to ::1, where nginx is not listening, so that connection attempt fails; the client then falls back to 127.0.0.1, which succeeds. Each failed first attempt increments the TCP attemptfails counter.
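
The fallback described in this comment can be reproduced in isolation. The sketch below is an illustration under assumptions, not the production setup: a plain Python socket bound to 127.0.0.1 on an ephemeral port stands in for nginx, and try_addresses and demo are hypothetical helper names. It tries ::1 first and 127.0.0.1 second, the same order a client would use after resolving localhost on these hosts.

```python
import socket


def try_addresses(addresses, port, timeout=1.0):
    """Try each address in order; stop at the first successful connect.

    Returns a list of (address, connected) tuples, one per attempt made.
    """
    attempts = []
    for address in addresses:
        family = socket.AF_INET6 if ":" in address else socket.AF_INET
        try:
            with socket.socket(family, socket.SOCK_STREAM) as sock:
                sock.settimeout(timeout)
                sock.connect((address, port))
            attempts.append((address, True))
            break
        except OSError:
            # Refused, unreachable, timed out, or IPv6 unavailable:
            # on Linux a refused connect is counted as an attempt failure.
            attempts.append((address, False))
    return attempts


def demo():
    # Stand-in for nginx: a listener reachable over IPv4 loopback only.
    with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as server:
        server.bind(("127.0.0.1", 0))  # port 0 = pick a free port
        server.listen(1)
        port = server.getsockname()[1]
        return try_addresses(["::1", "127.0.0.1"], port)


if __name__ == "__main__":
    for address, connected in demo():
        print(address, "connected" if connected else "failed")
```

With nothing listening on ::1, every call to demo() makes one failed attempt before the successful one, which matches the per-request counter increase seen here.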

Change 529401 had a related patch set uploaded (by Effie Mouzeli; owner: Effie Mouzeli):
[operations/puppet@production] profile:templates:services_proxy: Enable ipv6

https://gerrit.wikimedia.org/r/529401

Mentioned in SAL (#wikimedia-operations) [2019-08-09T19:46:29Z] <mutante> mwdebug1001 - temp stopped puppet, editing nginx config to test making it listen on IPv6 for upstream proxies (529401) (T224538)

Mentioned in SAL (#wikimedia-operations) [2019-08-12T10:28:25Z] <jijiki> Disable puppet on all servers running a services_proxy - T224538

Change 529401 merged by Effie Mouzeli:
[operations/puppet@production] profile:templates:services_proxy: Enable ipv6 and listen only locally

https://gerrit.wikimedia.org/r/529401

Mentioned in SAL (#wikimedia-operations) [2019-08-12T10:47:06Z] <jijiki> Enabling puppet and rolling restarting nginx across the fleet - T224538

jijiki closed this task as Resolved. Mon, Aug 12, 1:40 PM

Fixed!