Page MenuHomePhabricator

All debug hosts give (likely spurious) message: PHP Fatal error: The UdpSocket to 127.0.0.1:10514 has been closed (from Monolog/SyslogUdp)
Open, MediumPublicPRODUCTION ERROR

Description

This error show up in Logstash whenever I'm using PHP 7 via X-Wikimedia-Debug

PHP Fatal error:  Uncaught LogicException: The UdpSocket to 127.0.0.1:10514 has been closed and can not be written to anymore in /srv/mediawiki/php-1.33.0-wmf.14/vendor/monolog/monolog/src/Monolog/Handler/SyslogUdp/UdpSocket.php:45
Stack trace:
#0 /srv/mediawiki/php-1.33.0-wmf.14/vendor/monolog/monolog/src/Monolog/Handler/SyslogUdp/UdpSocket.php(31): Monolog\Handler\SyslogUdp\UdpSocket->send('<15>Jan 25 21:3...')
#1 /srv/mediawiki/php-1.33.0-wmf.14/vendor/monolog/monolog/src/Monolog/Handler/SyslogUdpHandler.php(47): Monolog\Handler\SyslogUdp\UdpSocket->write('@cee: {"@timest...', '<15>Jan 25 21:3...')
#2 /srv/mediawiki/php-1.33.0-wmf.14/vendor/monolog/monolog/src/Monolog/Handler/AbstractProcessingHandler.php(37): Monolog\Handler\SyslogUdpHandler->write(Array)
#3 /srv/mediawiki/php-1.33.0-wmf.14/vendor/monolog/monolog/src/Monolog/Logger.php(337): Monolog\Handler\AbstractProcessingHandler->handle(Array)
#4 /srv/mediawiki/php-1.33.0-wmf.14/vendor/monolog/monolog/src/Monolog/Logger.php(532): Monolog\Logger->addRecord(100, 'Wikimed in /srv/mediawiki/php-1.33.0-wmf.14/vendor/monolog/monolog/src/Monolog/Handler/SyslogUdp/UdpSocket.php on line 45
typesyslog
hostmwdebug1002
programphp7.2-fpm

Note that these are logged directly from php, and not via MediaWiki's fatal error handler. As such, they don't have any context about which wiki, or which url it happened with.

Also, I haven't seen my web requests aborted with this error. This suggests that maybe this happening post-send.

Event Timeline

Krinkle created this task.Jan 25 2019, 9:38 PM
Restricted Application added a subscriber: Aklapper. · View Herald TranscriptJan 25 2019, 9:38 PM
Joe triaged this task as High priority.Feb 7 2019, 12:47 PM

Is it sporadic or reproducible at will? My first thought was rsyslog not running (or restarting) and thus 10514 udp not open

fgiunchedi moved this task from Backlog to Radar on the User-fgiunchedi board.Apr 23 2019, 12:29 PM
Krinkle lowered the priority of this task from High to Low.May 7 2019, 4:21 PM
Krinkle added a subscriber: Joe.

Didn't realise it was limited to mwdebug1002. Never checked other servers but @Joe is right that it only happens there. Let's ignore for now.

Krinkle renamed this task from PHP Fatal error: The UdpSocket to 127.0.0.1:10514 has been closed (from Monolog/SyslogUdp) to PHP Fatal error: The UdpSocket to 127.0.0.1:10514 has been closed (from Monolog/SyslogUdp on mwdebug1002).May 7 2019, 4:21 PM
Krinkle moved this task from Untriaged to Debug-only on the Wikimedia-production-error board.
Krinkle removed a subscriber: Krinkle.May 7 2019, 9:56 PM
mmodell changed the subtype of this task from "Task" to "Production Error".Aug 28 2019, 11:07 PM
Krinkle added a comment.EditedSep 11 2019, 6:38 PM

Still seen regularly. It's somewhat concerning me because we rely on mwdebug1002 to find non-fatal warnings and errors when testing patches and expect that if Logstash/mwdebug1002 is clear, that means the patch is good to go.

If we instead see "The UdpSocket to 127.0.0.1:10514 has been closed and can not be written to" that is first of all confusing to anyone who doesn't know it's specific to mwdebug1002. But more importantly - does this not mean that some messages were dropped and not reported to Logstash?

That's more concerning because it means we're missing messages that could very well be errors.

@Joe If this is specific to mwdebug1002, perhaps re-imaging the server could help to rule out anything that might've gotten out of whack with this one?


EDIT: might be related to T151428: Error when provisioning mediawiki-vagrant: "The UdpSocket to 127.0.0.1:10514 has been closed and can not be written to anymore"

Krinkle moved this task from Limbo to Watching on the Performance-Team (Radar) board.EditedOct 12 2019, 11:19 PM

Tagging RelEng for awareness as this means when looking at Logstash/mwdebug1002 during deployments, there may be critical warnings not visible due to them not making it from MW to syslog/Kafka.

Krinkle raised the priority of this task from Low to Medium.Oct 12 2019, 11:19 PM
jijiki added a subscriber: jijiki.Oct 15 2019, 9:32 AM

Server will be reimaged, I will ping here when it is done :)

Mentioned in SAL (#wikimedia-operations) [2019-10-23T12:37:26Z] <effie> Depool mwdebug1002 - T214734

Mentioned in SAL (#wikimedia-operations) [2019-10-23T13:34:01Z] <effie> disable puppet on mwdebug1002 - T214734

ema added a subscriber: ema.EditedNov 8 2019, 5:01 PM

Note that debug servers aren't pooled in etcd like regular production ones, so mwdebug1002 is still serving debug traffic:

$ curl -v -H "X-Wikimedia-Debug: backend=mwdebug1002.eqiad.wmnet" https://en.wikipedia.org/wiki/Main_Page 2>&1 | grep server:
< server: mwdebug1002.eqiad.wmnet

We could temporarily comment it out from debug_proxy's configuration (modules/role/manifests/debug_proxy.pp), which would result in requests with mwdebug1002 in X-W-D getting a 400 error.

Mentioned in SAL (#wikimedia-operations) [2019-11-12T12:11:27Z] <effie> Reimage mwdebug1002 - T214734

jijiki added a subscriber: Krinkle.Nov 12 2019, 2:49 PM

@Krinkle Please let me know if the issue on mwdebug1002 persists, as the server has been reimaged :)

Change 551515 had a related patch set uploaded (by Effie Mouzeli; owner: Effie Mouzeli):
[operations/puppet@production] trafficserver: re-enable mwdebug1002

https://gerrit.wikimedia.org/r/551515

Change 551515 merged by Ema:
[operations/puppet@production] trafficserver: re-enable mwdebug1002

https://gerrit.wikimedia.org/r/551515

Looks like it is still happening:

https://logstash.wikimedia.org/app/kibana#/doc/logstash-*/logstash-deploy-2019.11.20/syslog?id=AW6KkyhearkxubcmwWlL&_g=h@44136fa

host: mwdebug1002
type: syslog
program:	php7.2-fpm
message:

PHP Fatal error:  Uncaught LogicException: The UdpSocket to 127.0.0.1:10514 has been closed and can not be written to anymore in /srv/mediawiki/php-1.35.0-wmf.5/vendor/monolog/monolog/src/Monolog/Handler/SyslogUdp/UdpSocket.php:45

Steps to reproduce the issue:

  1. Enable WikimediaDebug to mwdebug1002 with Logging enabled.
  2. Disable WikimediaDebug (because T237687#5679746).
  3. Go to https://logstash.wikimedia.org/app/kibana#/dashboard/mwdebug1002

Change 556302 had a related patch set uploaded (by Gergő Tisza; owner: Gergő Tisza):
[operations/puppet@production] Add MOTD to mwdebug1002 warning about T214734

https://gerrit.wikimedia.org/r/556302

Change 556489 had a related patch set uploaded (by CDanis; owner: CDanis):
[operations/puppet@production] mwdebug1002: add motd warning

https://gerrit.wikimedia.org/r/556489

Change 556489 merged by CDanis:
[operations/puppet@production] mwdebug1002: add motd warning

https://gerrit.wikimedia.org/r/556489

Krinkle added a comment.EditedDec 12 2019, 12:17 AM

@jijiki Can we depool mwdebug1002 again so that it can't be used via the ATS & debug_proxy endpoint?

We reversed this to verify that things recovered, but they haven't. That might be more efficient than MOTD, in particular because using mwdebug does not always involve a shell (e.g. Logstash/XHGui options).

Change 556302 abandoned by Gergő Tisza:
Add MOTD to mwdebug1002 warning about T214734

Reason:
If9ca8c (correctly) does this already

https://gerrit.wikimedia.org/r/556302

jijiki added a comment.EditedDec 12 2019, 4:39 PM

First of all, we found that this fatal error is present on mwdebug1001, not only mwdebug1002 like we have been assuming in this task.

I have been debugging this today, and after reading UdpSocket.php, my theory is that PHP closes the socket where it has been streaming logs to 127.0.0.1:10514, but then tries to use it again, only to find it has been closed, and thus throwing a fatal error.

new socket:

1576165211.867070 socket(AF_INET, SOCK_DGRAM, IPPROTO_UDP) = 6<socket:[36083338]>

streaming logs:

1576165211.878189 sendto(6<socket:[36083338]>, "<15>Dec 12 15:40:11 mwdebug1002 mediawiki: @cee: {\"@timestamp\":\"2019-12- 
12T15:40:11.877806+00:00\",\"@version\":1,\"host\":\"mwdebug1002\",\"message\":\"MainObjectStash using store 
RedisBagOStuff\",\"type\":\"mediawiki\",\"channel\":\"redis\",\"level\":\"DEBUG\",\"shard\":\"s1\",\"normalized_message\":\"MainObjectStash using store {class}\",\"phpversion\":\"7.2.24- 
1+0~20191026.31+debian9~1.gbpbbacde+wmf1\",\"url\":\"/wiki/Main_Page\",\"ip\":\"10.64.0.46\",\"http_method\":\"GET\",\"server\":\"en.wikipedia.org\",\"referrer\":null,\"unique_id\":\"XfJfWwpAAC4AAF4 - 
WLUAAAAG\",\"wiki\":\"enwiki\",\"mwversion\":\"1.35.0-wmf.8\",\"reqId\":\"XfJfWwpAAC4AAF4-WLUAAAAG\",\"class\":\"RedisBagOStuff\"}\n", 624, 0, {sa_family=AF_INET, sin_port=htons(10514), sin_addr=inet_addr("127.0.0.1")}, 16) = 624

<snip>

1576165211.884218 sendto(6<socket:[36083338]>, "<15>Dec 12 15:40:11 mwdebug1002 mediawiki: @cee: {\"@timestamp\":\"2019-12-12T15:40:11.884049+00:00\",\"@version\":1,\"host\":\"mwdebug1002\",\"message\":\"MainWANObjectCache using store MemcachedPeclBagOStuff\",\"type\":\"mediawiki\",\"channel\":\"objectcache\",\"level\":\"DEBUG\",\"shard\":\"s1\",\"normalized_message\":\"MainWANObjectCache using store {class}\",\"phpversion\":\"7.2.24-1+0~20191026.31+debian9~1.gbpbbacde+wmf1\",\"url\":\"/wiki/Main_Page\",\"ip\":\"10.64.0.46\",\"http_method\":\"GET\",\"server\":\"en.wikipedia.org\",\"referrer\":null,\"unique_id\":\"XfJfWwpAAC4AAF4-WLUAAAAG\",\"wiki\":\"enwiki\",\"mwversion\":\"1.35.0-wmf.8\",\"reqId\":\"XfJfWwpAAC4AAF4-WLUAAAAG\",\"class\":\"MemcachedPeclBagOStuff\"}\n", 652, 0, {sa_family=AF_INET, sin_port=htons(10514), sin_addr=inet_addr("127.0.0.1")}, 16) = 652

last log streamed using this socket :

1576165212.057219 sendto(6<socket:[36083338]>, "<15>Dec 12 15:40:12 mwdebug1002 mediawiki: @cee: {\"@timestamp\":\"2019-12-12T15:40:12.057093+00:00\",\"@version\":1,\"host\":\"mwdebug1002\",\"message\":\"Saving all sessions on shutdown\",\"type\":\"mediawiki\",\"channel\":\"session\",\"level\":\"DEBUG\",\"shard\":\"s1\",\"normalized_message\":\"Saving all sessions on shutdown\",\"phpversion\":\"7.2.24-1+0~20191026.31+debian9~1.gbpbbacde+wmf1\",\"url\":\"/wiki/Main_Page\",\"ip\":\"10.64.0.46\",\"http_method\":\"GET\",\"server\":\"en.wikipedia.org\",\"referrer\":null,\"unique_id\":\"XfJfWwpAAC4AAF4-WLUAAAAG\",\"wiki\":\"enwiki\",\"mwversion\":\"1.35.0-wmf.8\",\"reqId\":\"XfJfWwpAAC4AAF4-WLUAAAAG\"}\n", 586, 0, {sa_family=AF_INET, sin_port=htons(10514), sin_addr=inet_addr("127.0.0.1")}, 16) = 586

also streamed to mwlog1001:

 1576165212.058534 sendto(7<socket:[36083339]>, "XWikimediaDebug 2019-12-12 15:40:12 [XfJfWwpAAC4AAF4-WLUAAAAG] mwdebug1002 enwiki 1.35.0-wmf.8 session DEBUG: Saving all sessions on shutdown  \n", 144, 0, {
sa_family=AF_INET, sin_port=htons(8420), sin_addr=inet_addr("10.64.32.175")}, 16) = 144

closing both sockets:

1576165212.058882 close(7<socket:[36083339]>) = 0
1576165212.058941 close(6<socket:[36083338]>) = 0

and right after that, it raises a fatal error:

1576165212.059214 open("/etc/php/php7-fatal-error.php", O_RDONLY) = 4</etc/php/php7-fatal-error.php> 
1576165212.059278 fstat(4</etc/php/php7-fatal-error.php>, {st_mode=S_IFREG|0444, st_size=4775, ...}) = 0

which it logs

1576165212.060176 sendto(4<socket:[36083385]>, "MediaWiki.errors.fatal:1|c", 26, 0, {sa_family=AF_INET, sin_port=htons(8125), sin_addr=inet_addr("10.64.16.149")}, 16) = 26
1576165212.060351 sendto(5<socket:[35123055]>, "<27>Dec 12 15:40:12 php7.2-fpm: @cee: {\"exception\":{\"message\":\"Uncaught LogicException: The UdpSocket to 127.0.0.1:10514 has been closed and can not be written to anymore in /srv/mediawiki/php-1.35.0-wmf.8/vendor/monolog/monolog/src/Monolog"..., 1542, MSG_NOSIGNAL, NULL, 0) = 1542

The log that php tries to stream (on mwdebug1002) but can't is:

<15>Dec 12 15:40:12 mwdebug1002 mediawiki: @cee: {"@timestamp":"2019-12-12T15:40:12.059028+00:00","@version":1,"host":"mwdebug1002","message":"Wikimedia\\Rdbms\\LBFactory::destroy: closing connection to database '10.64.32.76'.","type":"mediawiki","channel":"DBConnection","level":"DEBUG","shard":"s1","normalized_message":"Wikimedia\\Rdbms\\LBFactory::destroy: closing connection to database '10.64.32.76'.","phpversion":"7.2.24-1+0~20191026.31+debian9~1.gbpbbacde+wmf1","url":"/wiki/Main_Page","ip":"10.64.0.46","http_method":"GET","server":"en.wikipedia.org","referrer":null,"unique_id":"XfJfWwpAAC4AAF4-WLUAAAAG","wiki":"enwiki","mwversion":"1.35.0-wmf.8","reqId":"XfJfWwpAAC4AAF4-WLUAAAAG"}

and on mwdebug1001:

<15>Dec 12 16:35:23 mwdebug1001 mediawiki: @cee: {\"@timestamp\":\"2019-12-12T16:35:23.662606+00:00\",\"@version\":1,\"host\":\"mwdebug1001\",\"message\":\"Wikimedia\\\\Rdbms\\\\LBFactory::destroy: closing connection to database '10.64.16.12'.\",\"type\":\"mediawiki\",\"channel\":\"DBConnection\",\"level\":\"DEBUG\",\"shard\":\"s1\",\"normalized_message\":\"Wikimedia\\\\Rdbms\\\\LBFactory::destroy: closing connection to database '10.64.16.12'.\",\"phpversion\":\"7.2.24-1+0~20191026.31+debian9~1.gbpbbacde+wmf1\",\"url\":\"/w/load.php?lang=en&modules=startup&only=scripts&raw=1&skin=vector\",\"ip\":\"10.64.32.123\",\"http_method\":\"GET\",\"server\":\"en.wikipedia.org\",\"referrer\":\"https://en.wikipedia.org/wiki/Main_Page\",\"unique_id\":\"XfJsSgpAIHsAAHVV0v4AAAAH\",\"wiki\":\"enwiki\",\"mwversion\":\"1.35.0-wmf.8\",\"reqId\":\"XfJsSgpAIHsAAHVV0v4AAAAH\"}

I am not sure how to continue from here. It appears that both hosts have similar behaviour. Just to be clear, I am quite positive that all logs right before the fatal error, were properly streamed to logstash.

Let me know what you think and how we should proceed

Joe added a comment.Dec 12 2019, 4:54 PM

It appears to me that we try to send something on that udp socket with send() after we called close(); This only happens on the debug servers? If so, various things are peculiar there:

  • we install the tideways extension
  • we allow profiling to be active

I would suggest you try to remove tideways from one of the servers to exclude that's what is causing this.

CDanis renamed this task from PHP Fatal error: The UdpSocket to 127.0.0.1:10514 has been closed (from Monolog/SyslogUdp on mwdebug1002) to All debug hosts give (likely spurious) message: PHP Fatal error: The UdpSocket to 127.0.0.1:10514 has been closed (from Monolog/SyslogUdp).Dec 16 2019, 10:00 PM

Change 559088 had a related patch set uploaded (by CDanis; owner: CDanis):
[operations/puppet@production] Revert "mwdebug1002: add motd warning"

https://gerrit.wikimedia.org/r/559088

Change 559088 merged by CDanis:
[operations/puppet@production] Revert "mwdebug1002: add motd warning"

https://gerrit.wikimedia.org/r/559088

I also see the UdpSocket warning consistently in the Beta Cluster when using Verbose log.

Debug-only, not a prod failure. Tagging as devprod given the noise on mwdebug is scaring people when staging commits before deployment.

jijiki moved this task from Incoming 🐫 to Unsorted on the serviceops board.Aug 17 2020, 11:47 PM