
mysqli (PHP 7.3 kubernetes container) fails to communicate with MariaDB v10.4.12 instance hosted on cyberbot-db-01.cyberbot.eqiad1.wikimedia.cloud
Closed, Invalid (Public)

Description

Symptom: When PHP uses the mysqli driver to connect to the DB VPS located at cyberbot-db-01, the connection fails and produces the following messages:

2021-04-15 02:14:58: (mod_fastcgi.c.421) FastCGI-stderr: PHP Warning:  Packets out of order. Expected 0 received 1. Packet size=107 in /data/project/iabot/master/app/src/html/Includes/session.php on line 128
2021-04-15 02:14:58: (mod_fastcgi.c.421) FastCGI-stderr: PHP Warning:  mysqli_connect(): MySQL server has gone away in /data/project/iabot/master/app/src/html/Includes/session.php on line 128
2021-04-15 02:14:58: (mod_fastcgi.c.421) FastCGI-stderr: PHP Warning:  mysqli_connect(): Error while reading greeting packet. PID=15 in /data/project/iabot/master/app/src/html/Includes/session.php on line 128
2021-04-15 02:14:58: (mod_fastcgi.c.421) FastCGI-stderr: PHP Warning:  mysqli_connect(): (HY000/2006): MySQL server has gone away in /data/project/iabot/master/app/src/html/Includes/session.php on line 128

However, when I access the DB via the mysql command from the shell, it connects and queries just fine. Furthermore, my local dev environments, which connect to the same DB through SSH tunnels, can connect and query as expected as well. As far as I can tell, this is only happening on Toolforge.

This is the connection test output from DataGrip to the SQL server.

image.png (110×424 px, 6 KB)

As can be seen, it is clearly able to identify the DB as MariaDB v10.4.12.

I've restarted the SQL server and restarted the webservice on Toolforge. I suspect an incompatibility introduced by some update I may have missed, but it would be appreciated if this could be looked into.

Event Timeline

Cyberpower678 reopened this task as Open.
Cyberpower678 claimed this task.
Cyberpower678 triaged this task as Low priority.

Oops, wrong ticket. This may need some more investigation.

[17:14]  <Cyberpower678> NGL, I'm getting more convinced that this isn't IABot, but something on the webservice.  Doing more testing.  I'm creating a test script.
[17:30]  <Cyberpower678> bd808: OK, I can DEFINITELY confirm that there is an issue with one of the webservices.
[17:30]  <Cyberpower678> I just wrote a test script and threw it into the public_html directory
[17:30]  <    harej> What if we just nuked the current webservice and created a new one?
[17:31]  <Cyberpower678> I tried that already, but I suspect I keep landing in the same one.  I don't know how to check this.
[17:31]  <    harej> https://usercontent.irccloud-cdn.com/file/wQienLtC/image.png
[17:31]  <    harej> That's... interesting
[17:31]  <    bd808> Cyberpower678: kubernetes or job gird?
[17:31]  <    bd808> *grid
[17:31]  <Cyberpower678> When I ran the test script from php-cli, it executed to completion, and successfully exited.
[17:31]  <Cyberpower678> On the webservice, it threw a fatal error.
[17:32]  <Cyberpower678> bd808: how do I check, there is no entry in qstat?
[17:32]  <    bd808> qstat is for grid jobs
[17:32]  <Cyberpower678> So kubernetes I assume
[17:33]  <    bd808> so you don't know how you are running your code?
[17:33]  <    harej> IABot is a webservice isn't it? There would be a webservice log
[17:35]  <Cyberpower678> bd808: I know it runs on the toolforge webservice as it always has.
[17:35]  <Cyberpower678> But the issue spawned on a webservice restart.
[17:35]  <Cyberpower678> Outside of the webservice the code runs fine and mysqli successfully connects to the DB
[17:36]  <Cyberpower678> But within, a strange packets out of order error crops up.
[17:36]  <    bd808> "webservice" is a python script that starts either grid engine jobs or kubernetes deployments. These two runtime environments have differences that are material to investigating
[17:37]  <Cyberpower678> Well, the exact command I used is "webservice start"
[17:37]  <Cyberpower678> It prints a bunch of dots and that's it.
[17:38]  <    bd808> and do you have a ~/service.template file or would that only use the default settings built into `webservice`?
[17:38]  <    bd808> what does `webservice status` tell you?
[17:39]  <    bd808> and what environment was your code running with before you shut it down last week or whenever?
[17:39]  <Cyberpower678> Your webservice of type php7.3 is running on backend kubernetes
[17:40]  <Cyberpower678> I've been running PHP 7.3 last time, and our last conversation I believe the bot was also running on kubernetes
[17:41]  <    bd808> which last conversation?
[17:41]  <Cyberpower678> From a few months ago.
[17:41]  <Cyberpower678> The webservice was untouched since then.
[17:41]  <    bd808> so vague and unactionable
[17:41]  <Cyberpower678> I'm sorry.
[17:42]  <Cyberpower678> From my vantage point it's not easy to inquire what current environment my bot is running on a webservice.
[17:42]  <    bd808> why is that?
[17:43]  <    bd808> `webservice status` has existed for ~4 years
[17:43]  <    bd808> and ~/service.manifest records the active state as well
[17:43]  <Cyberpower678> And it produces that one sentence I pasted above.
[17:43]  <Cyberpower678> I KNOW it was running with PHP 7.3 for a while now.
[17:44]  <Cyberpower678> I'm almost certain it was on Kubernetes
[17:44]  <Cyberpower678> But beyond that, I'm not sure what environment info you need.
[17:45]  <Cyberpower678> bd808: The manifest file says I'm on Debian right now.
[17:45]  <    bd808> Cyberpower678: I'm trying to help you figure out what changed. To do that we need before/after information. We now have some after information, but apparently before is lost to the sands of time
[17:47]  <Cyberpower678> bd808: TBH, I don't think anything has changed environment wise.  I can't say what distribution I was running on, but, that aside, IABot is mostly environment agnostic.
[17:48]  <    bd808> Cyberpower678: you stopped the webservice within the last ~2 weeks correct?
[17:48]  <Cyberpower678> bd808 correct
[17:48]  <    bd808> and then when you started it back up it was mysteriously broken?
[17:48]  <Cyberpower678> Yes
[17:49]  <    bd808> and you have no proof of what runtime (grid or kubernetes) it was on when it was not broken?
[17:50]  <Cyberpower678> If kubernetes never shows up in qstat, then it was kubernetes no doubt
[17:50]  <    bd808> I can see that there is a ~/service.log stating "2021-03-25T18:48:17.120388 No running webservice job found, attempting to start it"
[17:50]  <Cyberpower678> I haven't seen a webservice job in there for a long while now.
[17:50]  <    bd808> That log file is related to the grid and not kubernetes
[17:51]  <Cyberpower678> Is that a left over bigbrother thing?
[17:51]  <    bd808> so the grid watcher at least thought you were running the webservice on the grid a few weeks ago
[17:52]  <Cyberpower678> Huh.  I definitely don't recall seeing any webservice jobs listed in the qstat output.
[17:53]  <    bd808> I would suggest trying the webservice on the grid backend to see if it works differently. `webservice stop; webservice --backend=gridengine start`
[17:53]  <Cyberpower678> bd808: I'm sorry if I'm giving you a headache.
[17:53]  <    bd808> that may or may not make it better but it will give you some more data
[17:53]  <Cyberpower678> Let me switch
[17:54]  <Cyberpower678> bd808 it works
[17:55]  <    bd808> magic!
[17:55]  <Cyberpower678> It's executing successfully.
[17:55]  <    bd808> so your code works on php7.2 (grid and bastion) and not php7.3 (kubernetes)
[17:55]  <Cyberpower678> bd808: you're my best friend here. :-)
[17:55]  <Cyberpower678> But IABot does work with PHP 7.3
[17:56]  <    bd808> but not with the mysqli that is in our php7.3 apparently
[17:56]  <Cyberpower678> It's the version I use to actively develop IABot on my machine.
[17:57]  <Cyberpower678> I think it's still worth having a look into at some point.
[17:58]  <Cyberpower678> bd808: I wonder if it's some network issue perhaps from kubernetes to cyberbot-db-01
[17:59]  <Cyberpower678> If everyone else is working just fine, and I'm getting "Packets out of order" when trying to connect to it, maybe something funky is happening when routing and/or IOing to and from it.
[18:00]  <Cyberpower678> After all, my DB lives elsewhere than most Toolforge users.
[18:01]  <Cyberpower678> But in any event, thank you for that quick fix.

Some googling reveals that this error is generally a symptom of running into MySQL limits such as max_connections, max_allowed_packet (packet size), or memory usage.

None of those seem to apply as far as I can tell. It’s been discovered that this is only happening on the Kubernetes web service. The grid engine web service works fine.

bd808 renamed this task from "mysqli (PHP) appears to not behave as intended on Toolforge" to "mysqli (PHP 7.3 kubernetes container) fails to communicate with MariaDB v10.4.12 instance hosted on cyberbot-db-01.cyberbot.eqiad1.wikimedia.cloud". Apr 15 2021, 10:32 PM

How do I reproduce this? I can see if I can do a bit of debugging.

taavi subscribed.

The server refuses all connections from Kubernetes nodes, but it sends its normal greeting when I open a connection from a grid node:

[taavi@tools-k8s-worker-68 ~] $ nc cyberbot-db-01.cyberbot.eqiad1.wikimedia.cloud 3306
k�jHost 'tools-k8s-worker-68.tools.eqiad1.wikimedia.cloud' is not allowed to connect to this MariaDB server

[taavi@tools-sgeexec-0901 ~] $ nc cyberbot-db-01.cyberbot.eqiad1.wikimedia.cloud 3306
q
5.5.5-10.4.12-MariaDB-1:10.4.12+maria~stretchg�E;:MTA)F���dEjgHq!M@76Emysql_native_password^C
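
For reference, the unprintable bytes in both nc outputs are consistent with MySQL client/server packet framing: a 3-byte little-endian payload length, a 1-byte sequence id, then the payload. Below is a minimal Python sketch of decoding that first packet. The names decode_first_packet and frame are illustrative, the sample bytes are rebuilt from the printable text above rather than captured, the greeting is truncated to the version string, and the sequence id of 1 on the rejection is an assumption taken from the PHP warning ("Expected 0 received 1").

```python
ERR_MARKER = 0xFF               # first payload byte of an ERR packet
ER_HOST_NOT_PRIVILEGED = 1130   # "Host '...' is not allowed to connect ..."


def decode_first_packet(data: bytes) -> dict:
    """Decode the first packet a MySQL/MariaDB server sends on a new connection."""
    if len(data) < 5:
        raise ValueError("need a 4-byte header plus at least one payload byte")
    # Header: 3-byte little-endian payload length, then a 1-byte sequence id.
    length = int.from_bytes(data[:3], "little")
    sequence_id = data[3]
    payload = data[4:4 + length]
    if length == 0 or len(payload) != length:
        raise ValueError("empty or truncated packet")
    info = {"length": length, "sequence_id": sequence_id}
    if payload[0] == ERR_MARKER:
        # ERR packet during the handshake: marker, 2-byte error code, message.
        info["kind"] = "error"
        info["code"] = int.from_bytes(payload[1:3], "little")
        info["message"] = payload[3:].decode("utf-8", errors="replace")
    else:
        # Greeting: protocol version byte, then NUL-terminated server version.
        info["kind"] = "handshake"
        info["protocol_version"] = payload[0]
        version = payload[1:].split(b"\x00", 1)[0]
        info["server_version"] = version.decode("ascii", errors="replace")
    return info


def frame(payload: bytes, sequence_id: int) -> bytes:
    """Prepend the 4-byte packet header to a payload (helper for the samples)."""
    return len(payload).to_bytes(3, "little") + bytes([sequence_id]) + payload


# Sample bytes reconstructed from the printable parts of the nc output.
message = (b"Host 'tools-k8s-worker-68.tools.eqiad1.wikimedia.cloud' "
           b"is not allowed to connect to this MariaDB server")
rejected = frame(
    bytes([ERR_MARKER]) + ER_HOST_NOT_PRIVILEGED.to_bytes(2, "little") + message,
    1,  # assumed from "Expected 0 received 1" in the PHP warning
)
greeting = frame(
    b"\x0a" + b"5.5.5-10.4.12-MariaDB-1:10.4.12+maria~stretch" + b"\x00",
    0,  # truncated: a real greeting carries more fields after the version
)

err = decode_first_packet(rejected)
ok = decode_first_packet(greeting)
print(err["kind"], err["sequence_id"], err["length"], err["code"])
print(ok["kind"], ok["sequence_id"], ok["server_version"])
```

For the rejection sample this reports a payload length of 107, which matches "Packet size=107" in the PHP warning and the leading "k" (0x6b) in the nc output, and error code 1130 (ER_HOST_NOT_PRIVILEGED). If that reading is right, mysqli is not hitting a protocol bug; it is receiving an error packet where it expects the greeting.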

A quick search suggests that this might have something to do with missing database grants.
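
To illustrate how a grant could admit the grid nodes but not the Kubernetes workers: MariaDB matches the connecting host against the host part of each account, where % matches any run of characters and _ matches exactly one character. The Python sketch below mimics only that wildcard matching. The host patterns are hypothetical (the actual accounts on cyberbot-db-01 are not shown in this task), the grid node's full hostname is assumed by analogy with the worker's, and IP address and netmask forms are ignored.

```python
import re


def host_pattern_matches(pattern: str, client_host: str) -> bool:
    """Return True if an account host pattern would match client_host.

    Simplified model: '%' matches any run of characters, '_' matches exactly
    one character, everything else is literal; comparison is case-insensitive.
    """
    regex = "".join(
        ".*" if ch == "%" else "." if ch == "_" else re.escape(ch)
        for ch in pattern
    )
    return re.fullmatch(regex, client_host, flags=re.IGNORECASE) is not None


# Hostnames: the worker name comes from the error message above; the grid
# node's domain suffix is an assumption.
k8s_host = "tools-k8s-worker-68.tools.eqiad1.wikimedia.cloud"
grid_host = "tools-sgeexec-0901.tools.eqiad1.wikimedia.cloud"

# Hypothetical account host patterns, for illustration only.
grid_only = "tools-sgeexec-%"
whole_project = "%.tools.eqiad1.wikimedia.cloud"

for pattern in (grid_only, whole_project):
    print(pattern,
          host_pattern_matches(pattern, grid_host),
          host_pattern_matches(pattern, k8s_host))
```

Under a pattern like the first one, the grid node would be allowed and the Kubernetes worker rejected, which is the behaviour seen above; a broader pattern like the second would admit both.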

Closing due to inactivity.