Page MenuHomePhabricator

nodejs can't connect to mysqld via tcp/localhost any longer (was: mariadb failing on testreduce1001)
Open, LowPublic

Description

Looks like testreduce1001 has been converted to buster but it seems to have broken mariadb install there.

ssastry@testreduce1001:~$ sudo systemctl status mariadb.service
● mariadb.service - MariaDB 10.3.27 database server
   Loaded: loaded (/lib/systemd/system/mariadb.service; enabled; vendor preset: enabled)
   Active: failed (Result: exit-code) since Sat 2021-02-06 15:03:59 UTC; 48s ago
     Docs: man:mysqld(8)
           https://mariadb.com/kb/en/library/systemd/
  Process: 31504 ExecStartPre=/usr/bin/install -m 755 -o mysql -g root -d /var/run/mysqld (code=exited, status=0/SUCCESS)
  Process: 31505 ExecStartPre=/bin/sh -c systemctl unset-environment _WSREP_START_POSITION (code=exited, status=0/SUCCESS)
  Process: 31507 ExecStartPre=/bin/sh -c [ ! -e /usr/bin/galera_recovery ] && VAR= ||   VAR=`cd /usr/bin/..; /usr/bin/galera_recovery`; [ $? -eq 0 ]   && systemctl set-environment _WSREP_START_POSITION=$VAR || e
  Process: 31575 ExecStart=/usr/sbin/mysqld $MYSQLD_OPTS $_WSREP_NEW_CLUSTER $_WSREP_START_POSITION (code=exited, status=1/FAILURE)
 Main PID: 31575 (code=exited, status=1/FAILURE)
   Status: "Starting final batch to recover 13 pages from redo log"

Since we recently converted parsoid-rt round trip testing service to use a local db on testreduce1001, this is a blocker for running Parsoid's round trip tests.

Event Timeline

ssastry triaged this task as High priority.Feb 6 2021, 3:15 PM
ssastry added a project: Parsoid (Tracking).

Looks like testreduce1001 has been converted to buster but it seems to have broken mariadb install there.

It's been buster since July 2020, it was just restarted yesterday, see https://sal.toolforge.org/production?p=0&q=testreduce1001&d=

I tried to start mariadb and here's the failure log:

Feb 06 20:05:18 testreduce1001 mysqld[10139]: 2021-02-06 20:05:18 0 [Note] /usr/sbin/mysqld (mysqld 10.3.27-MariaDB-0+deb10u1-log) starting as process 10139 ...
Feb 06 20:05:18 testreduce1001 mysqld[10139]: 2021-02-06 20:05:18 0 [Warning] Could not increase number of max_open_files to more than 16384 (request: 32186)
Feb 06 20:05:18 testreduce1001 mysqld[10139]: 2021-02-06 20:05:18 0 [Warning] No argument was provided to --log-bin and neither --log-basename or --log-bin-index where used;  This may cause repliction to break when this server acts as a m
Feb 06 20:05:18 testreduce1001 mysqld[10139]: 2021-02-06 20:05:18 0 [Note] InnoDB: Using Linux native AIO
Feb 06 20:05:18 testreduce1001 mysqld[10139]: 2021-02-06 20:05:18 0 [Note] InnoDB: Mutexes and rw_locks use GCC atomic builtins
Feb 06 20:05:18 testreduce1001 mysqld[10139]: 2021-02-06 20:05:18 0 [Note] InnoDB: Uses event mutexes
Feb 06 20:05:18 testreduce1001 mysqld[10139]: 2021-02-06 20:05:18 0 [Note] InnoDB: Compressed tables use zlib 1.2.11
Feb 06 20:05:18 testreduce1001 mysqld[10139]: 2021-02-06 20:05:18 0 [Note] InnoDB: Number of pools: 1
Feb 06 20:05:18 testreduce1001 mysqld[10139]: 2021-02-06 20:05:18 0 [Note] InnoDB: Using SSE2 crc32 instructions
Feb 06 20:05:18 testreduce1001 mysqld[10139]: 2021-02-06 20:05:18 0 [Note] InnoDB: Initializing buffer pool, total size = 4G, instances = 8, chunk size = 128M
Feb 06 20:05:18 testreduce1001 mysqld[10139]: 2021-02-06 20:05:18 0 [Note] InnoDB: Completed initialization of buffer pool
Feb 06 20:05:18 testreduce1001 mysqld[10139]: 2021-02-06 20:05:18 0 [Note] InnoDB: If the mysqld execution user is authorized, page cleaner thread priority can be changed. See the man page of setpriority().
Feb 06 20:05:18 testreduce1001 mysqld[10139]: 2021-02-06 20:05:18 0 [Note] InnoDB: Starting crash recovery from checkpoint LSN=48579
Feb 06 20:05:18 testreduce1001 mysqld[10139]: 2021-02-06 20:05:18 0 [Note] InnoDB: Starting final batch to recover 13 pages from redo log.
Feb 06 20:05:19 testreduce1001 mysqld[10139]: 2021-02-06 20:05:19 0 [Note] InnoDB: 128 out of 128 rollback segments are active.
Feb 06 20:05:19 testreduce1001 mysqld[10139]: 2021-02-06 20:05:19 0 [Note] InnoDB: Removed temporary tablespace data file: "ibtmp1"
Feb 06 20:05:19 testreduce1001 mysqld[10139]: 2021-02-06 20:05:19 0 [Note] InnoDB: Creating shared tablespace for temporary tables
Feb 06 20:05:19 testreduce1001 mysqld[10139]: 2021-02-06 20:05:19 0 [Note] InnoDB: Setting file './ibtmp1' size to 12 MB. Physically writing the file full; Please wait ...
Feb 06 20:05:19 testreduce1001 mysqld[10139]: 2021-02-06 20:05:19 0 [Note] InnoDB: File './ibtmp1' size is now 12 MB.
Feb 06 20:05:19 testreduce1001 mysqld[10139]: 2021-02-06 20:05:19 0 [Note] InnoDB: Waiting for purge to start
Feb 06 20:05:19 testreduce1001 mysqld[10139]: 2021-02-06 20:05:19 0 [Note] InnoDB: 10.3.27 started; log sequence number 1601327; transaction id 8
Feb 06 20:05:19 testreduce1001 mysqld[10139]: 2021-02-06 20:05:19 0 [Note] InnoDB: Loading buffer pool(s) from /srv/sqldata/ib_buffer_pool
Feb 06 20:05:19 testreduce1001 mysqld[10139]: 2021-02-06 20:05:19 0 [Note] InnoDB: Cannot open '/srv/sqldata/ib_buffer_pool' for reading: No such file or directory
Feb 06 20:05:19 testreduce1001 mysqld[10139]: 2021-02-06 20:05:19 0 [Note] Plugin 'FEEDBACK' is disabled.
Feb 06 20:05:19 testreduce1001 mysqld[10139]: 2021-02-06 20:05:19 0 [ERROR] Could not open mysql.plugin table. Some plugins may be not loaded
Feb 06 20:05:19 testreduce1001 mysqld[10139]: 2021-02-06 20:05:19 0 [Note] Recovering after a crash using testreduce1001-bin
Feb 06 20:05:19 testreduce1001 mysqld[10139]: 2021-02-06 20:05:19 0 [Note] Starting crash recovery...
Feb 06 20:05:19 testreduce1001 mysqld[10139]: 2021-02-06 20:05:19 0 [Note] Crash recovery finished.
Feb 06 20:05:19 testreduce1001 mysqld[10139]: 2021-02-06 20:05:19 6 [Warning] Failed to load slave replication state from table mysql.gtid_slave_pos: 1017: Can't find file: './mysql/' (errno: 2 "No such file or directory")
Feb 06 20:05:19 testreduce1001 mysqld[10139]: 2021-02-06 20:05:19 0 [ERROR] Can't open and lock privilege tables: Table 'mysql.servers' doesn't exist
Feb 06 20:05:19 testreduce1001 mysqld[10139]: 2021-02-06 20:05:19 0 [Note] Server socket created on IP: '::'.
Feb 06 20:05:19 testreduce1001 mysqld[10139]: 2021-02-06 20:05:19 0 [ERROR] Fatal error: Can't open and lock privilege tables: Table 'mysql.user' doesn't exist

I have fixed this issue and mysql is now running.

root@testreduce1001:/var/lib# mysql testreduce -e "show tables"
+----------------------+
| Tables_in_testreduce |
+----------------------+
| commits              |
| pages                |
| perfstats            |
| results              |
| stats                |
+----------------------+

Also ran puppet and got: Notice: /Stage[main]/Profile::Parsoid::Rt_server/Testreduce::Server[parsoid-rt]/Service[parsoid-rt]/ensure: ensure changed 'stopped' to 'running' (corrective)

This is the first time I see this server, I don't think we own it.
I noticed some issues issues there:

  • /etc/my.cnf has: datadir = /srv/sqldata but there's nothing there, the real datadir (where the testreduce database lives is: /var/lib/mysql. /srv/sqldata is the directory we use in production. So puppet for this server needs changing.
  • This host runs mariadb 10.3 (we don't have that in production, my recommendation would be to go to 10.4).

Removing the DBA as we don't own this, the initial problem has been fixed. Mysql is now up - I will stay on the task though in case you need some more input.

I noticed some issues issues there:

  • /etc/my.cnf has: datadir = /srv/sqldata but there's nothing there, the real datadir (where the testreduce database lives is: /var/lib/mysql. /srv/sqldata is the directory we use in production. So puppet for this server needs changing.

/srv/sqldata is the default you get when just including profile::mariadb::generic_server. I was hoping that default to be ok and it could be used as is for a test setup like this.

I don't know how exactly the data got into /var/lib/mysql though. We can change the puppet parameter for $datadir now to match reality of course. Will upload a patch for that. But if this was a new server would there be a reason to not just use /srv/sqldata as puppetized and used in production?

  • This host runs mariadb 10.3 (we don't have that in production, my recommendation would be to go to 10.4).

That as well is simply what you get per default from class { '::mariadb::packages': } on buster. Production DBs are using custom wmf-mariadb101* packages but test setups like this or cloud VPS projects needing a simple local DB without having it managed by DBA, who still try to be good citizens and use Puppet... all get these defaults for version and data dir.

Change 662771 had a related patch set uploaded (by Dzahn; owner: Dzahn):
[operations/puppet@production] parsoid::testreduce: set mysql datadir to /var/lib/mysql

https://gerrit.wikimedia.org/r/662771

Change 662771 merged by Dzahn:
[operations/puppet@production] parsoid::testreduce: set mysql datadir to /var/lib/mysql

https://gerrit.wikimedia.org/r/662771

deployed this change on testreduce1001, ran puppet. This changed the contents of /etc/my.cnf] like this:

-datadir   = /srv/sqldata
+datadir   = /var/lib/mysql

scandium and prod parsoid hosts of course untouched

puppet did refresh the parsoid-rt service but please take a look as well

Please note there is now data in both /srv/sqldata and /var/lib/mysql, I applied the change to the config file as suggested by Manuel.

I confirmed mysql is running and the data dir is already set to /var/lib/mysql. I did not restart anything manually but config matches reality now:

MariaDB [(none)]> select @@datadir;
+-----------------+
| @@datadir       |
+-----------------+
| /var/lib/mysql/ |
+-----------------+

I am still seeing this error:

ssastry@testreduce1001:~$ sudo service parsoid-rt status
...
Feb 08 21:38:02 testreduce1001 nodejs[14116]: Unable to connect to database, error: Error: Packets out of order. Got: 1 Expected: 0

So, something else is still broken that is preventing the node.js mysql driver from talking with the database. Not sure if there is something other config that needs an update. Nothing changed even after a sudo service mysql restart.

systemctl status parsoid-rt does indeed show that nodejs[16869]: Unable to connect to database and that it is using credentials from /etc/testreduce/parsoid-rt.settings.js

Manually using them with mysql -h localhost -u testreduce -p testreduce and pasting the password... they work just fine though.

parsoid-rt service can't connect though. And it changed with the reboot apparently.

netstat shows mysqld is apparently listening on tcp6 but not on tcp:

 netstat -tulpen | grep mysql
tcp6       0      0 :::3306                 :::*                    LISTEN      112        18283544   15659/mysqld

I temp. added a bind-address = 0.0.0.0 to the [mysqld] section of /etc/my.cnf and restarted mysqld. This made it listen on IPv4-only instead of IPv6-only:

tcp        0      0 0.0.0.0:3306            0.0.0.0:*               LISTEN      112        18399110   20239/mysqld

totally expected this to be the issue and restarted parsoid-rt service.. but it STILL can't connect.

Manually using mysql client works before and after. Letting puppet revert my manual hack.

I wonder if nodejs could try to use the socket at /run/mysqld/mysqld.sock instead of TCP / localhost.

Change 662789 had a related patch set uploaded (by Subramanya Sastry; owner: Subramanya Sastry):
[operations/puppet@production] parsoid-rt / parsoid-vd: Pass in database socket path in server config

https://gerrit.wikimedia.org/r/662789

Change 662789 merged by Dzahn:
[operations/puppet@production] parsoid-rt / parsoid-vd: Pass in database socket path in server config

https://gerrit.wikimedia.org/r/662789

Things work once nodejs is using the socket instead of connecting to localhost via TCP. Deployed the change above that switches to that.

This does not explain yet why connecting via localhost stopped working but it wasn't a bad thing to do anyways and unbreaks the test setup.

Dzahn renamed this task from mariadb failing on testreduce1001 to nodejs can't connect to mysqld via tcp/localhost any longer (was: mariadb failing on testreduce1001).Feb 8 2021, 11:57 PM
ssastry lowered the priority of this task from High to Low.Feb 9 2021, 12:05 AM

Since we were able to unbreak this by using socket path for connecting to the mysql, i am lowering priority. You guys can investigate the localhost issue separately.

@ssastry do you think we could close this task?