
Kubernetes WebGrid experiences intermittent 500 errors with increasing frequency until restart on iabot tool
Closed, Resolved · Public · BUG REPORT

Description

For some reason, when using Kubernetes as the backend with a PHP 7.4 image to run the web service for the iabot tool, users randomly encounter a 500 Internal Server Error, with increasing frequency until the web service is restarted.

Sometimes they receive a 503 Service Unavailable instead, but this is a rarer occurrence.

Event Timeline

Harej renamed this task from Kubernetes WebGrid behaving "strangely" to Kubernetes WebGrid experiences intermittent 500 errors with increasing frequency until restart on iabot tool. (May 6 2023, 9:38 PM)

On T319803: Migrate iabot from Toolforge GridEngine to Toolforge Kubernetes, @Harej reported switching on 2022-10-07 and @Cyberpower678 mentioned reverting to grid engine on 2022-10-12. The most frequent lines in the $HOME/error.log between those times are:

  • PHP Fatal error: Maximum execution time of 30 seconds exceeded in /data/project/iabot/master/app/src/Core/APII.php on line 438
  • PHP Fatal error: Maximum execution time of 30 seconds exceeded in /data/project/iabot/master/app/src/Core/APII.php on line 1507
  • sh: 1: /usr/sbin/sendmail: not found
  • establishing connection failed: socket: unix:/var/run/lighttpd/php.socket.iabot-0: Resource temporarily unavailable

There is a very large number of establishing connection failed: socket: unix:/var/run/lighttpd/php.socket.iabot-0: Resource temporarily unavailable errors on 2022-10-12. I have no idea what the root cause was, but it looks to me like the PHP fcgi process started by lighttpd died in some way that lighttpd did not recover from by restarting it. Once that happened, lighttpd had no means to process PHP files and simply logged the same series of error messages to that effect for each incoming request. The fact that this happened is unfortunate, but I don't currently see any sign that it was directly correlated with the Kubernetes backend.
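For context, lighttpd manages that PHP fcgi backend itself via mod_fastcgi, roughly like the sketch below. This is illustrative, not the actual container configuration; the socket path, worker counts, and binary path are assumptions, and the socket name gets a per-backend suffix such as .iabot-0 at runtime. If the spawned php-cgi backend dies and is not replaced, every later PHP request fails with exactly this Resource temporarily unavailable error.

fastcgi.server = ( ".php" =>
  ((
    # lighttpd spawns this backend and talks to it over the named unix socket
    "socket"   => "/var/run/lighttpd/php.socket.iabot",
    "bin-path" => "/usr/bin/php-cgi",
    "max-procs" => 1,
    "bin-environment" => (
      # illustrative values: worker count and requests served before recycling
      "PHP_FCGI_CHILDREN"     => "4",
      "PHP_FCGI_MAX_REQUESTS" => "10000"
    )
  ))
)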

The other three error messages (the two timeouts and the missing sendmail script) occur elsewhere in the $HOME/error.log data, both before and after the known Kubernetes runtime usage window. The sendmail issue would be expected inside any of our containers. As noted at https://wikitech.wikimedia.org/wiki/Help:Toolforge/Email#Mail_from_Tools, none of the Kubernetes containers has a local mailer installed. Instead, code should use its own SMTP client to send messages via the mail.tools.wmcloud.org host.
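As a rough illustration of that SMTP approach, here is a minimal sketch using PHPMailer, assuming the library is available via Composer; the sender and recipient addresses and the port/auth settings are placeholders to verify against the linked help page.

<?php
// Minimal SMTP sketch for Toolforge Kubernetes (no local sendmail available)
require 'vendor/autoload.php';

use PHPMailer\PHPMailer\PHPMailer;

$mail = new PHPMailer( true );
$mail->isSMTP();
$mail->Host     = 'mail.tools.wmcloud.org'; // Toolforge mail relay from the help page
$mail->Port     = 25;                       // assumed: plain SMTP, no authentication
$mail->SMTPAuth = false;
$mail->setFrom( 'tools.iabot@toolforge.org', 'InternetArchiveBot' ); // placeholder sender
$mail->addAddress( 'someone@example.org' );                          // placeholder recipient
$mail->Subject  = 'Test message from iabot';
$mail->Body     = 'Hello from Toolforge Kubernetes.';
$mail->send();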

The PHP timeouts and the sendmail issue are not what we are worried about, as those are fixable on our end. What we don't understand is the establishing connection failed: socket: unix:/var/run/lighttpd/php.socket.iabot-0: Resource temporarily unavailable error, or how to fix it, as that is what is causing the intermittent 500 errors, which occur with increasing frequency until we restart the web service.

For some reason, https://iabot.toolforge.org/api.php?action=statistics&format=flat most commonly triggers the 500 errors. It's triggering right now, so maybe you can use it to see what is going on?

$HOME/error.log
2023-05-15 01:52:50: http-header-glue.c.1499) read() 8 9: Connection reset by peer
2023-05-15 01:52:50: gw_backend.c.2275) response not received, request sent: 1420 on socket: unix:/var/run/lighttpd/php.socket.iabot-1 for /api.php?action=statistics&format=flat, closing connection

The 'Connection reset by peer' message indicates that the fcgi PHP process closed the socket with lighttpd before lighttpd expected it to. This is often a sign of a hard process crash (segfault or similar). The $HOME/public_html/core file I was able to find was not related to this particular failure as it was nearly a year old. I deleted it mostly to free up NFS storage.

If I comment out loadStats( $jsonOut ); on line 150 of api.php, a response is returned, so I would suggest you start by looking more deeply into the work that the loadStats method does.
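One way to narrow this down, as a hedged sketch rather than a fix: bracket the call with error_log() markers and register a shutdown handler near the top of api.php. A PHP-level fatal inside loadStats() will show up via error_get_last(), while a hard segfault kills the process before the shutdown handler runs, so an 'enter' line with no 'done' line and no shutdown entry would itself point at a native crash. The marker text is illustrative; loadStats() and $jsonOut come from the existing code.

<?php
// Debugging sketch around the call site reported above (api.php line 150)
register_shutdown_function( function () {
    $err = error_get_last();
    if ( $err !== null ) {
        // Catches PHP-level fatals; a segfault never reaches this handler
        error_log( 'api.php shutdown with error: ' . json_encode( $err ) );
    }
} );

error_log( 'loadStats: enter' );
loadStats( $jsonOut );
error_log( 'loadStats: done' ); // absent from error.log => process died inside loadStats()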

Well, segfaults are insanely difficult to track down, especially when they happen randomly on identical calls to the same endpoint delivering identical results. :/ Would there be any indication that it could be a crash outside of PHP? I can certainly see if I can find the segfault-inducing code on my end.

I cannot reproduce any process issue on my machine with ANY PHP version regarding this. I have tried and tried and tried, and it loads successfully in all cases. I don't believe I am inducing a segfault. This endpoint loads and works fine for a while after a web service restart and then randomly stops working. I'm more inclined to believe the issue lies outside of PHP. @bd808

Cyberpower678 claimed this task.