
Revive Labtestwikitech (formerly: Abolish labtestwikitech)
Closed, ResolvedPublic

Description

labtestwikitech.wikimedia.org is, and always has been, a huge nuisance. We should figure out a way to live without it.

Originally it was useful for testing openstack/mediawiki integration, but these days it's good for exactly two things:

A: (rarely) create new test accounts for codfw1dev
B: (daily) provide 2fa verification for labtesthorizon.wikimedia.org logins

Case A we can work around today by creating accounts by hand in ldap. That doesn't give a true user experience but I don't much care.

Case B is harder. A few options:

  1. Realize the vision of a centralized org-wide developer-account management tool (at which point we'd run a testing install of that instead)
  2. Make a standalone tool that exists only to verify the second factor, decoupled from any wiki functionality
  3. Make a stateless standalone tool that just always answers 'yes' to any 2fa validation request
  4. Disable use of 2fa for labtesthorizon

UPDATE: after further discussion we've determined that we need to nurse this site along for a while yet. This task has been re-titled to reflect that we need to get labtestwikitech back up and running soon.

Event Timeline


I highly support this ticket. wikitech adds a bit of overhead to DB maintenance which is not nice but it's bearable. OTOH test wikitech adds an unimaginable amount of overhead to our daily work. It's not listed in zarcillo, it has its own section, etc.

The fourth option seems plausible to me. If it's a test system, it doesn't need to have the security of our production systems.

@bd808 I'm interested in your thoughts about this -- is that wiki good for anything that I'm forgetting? I would say 'striker 2fa' except in theory we have better test/dev options for striker these days.

@bd808 I'm interested in your thoughts about this -- is that wiki good for anything that I'm forgetting? I would say 'striker 2fa' except in theory we have better test/dev options for striker these days.

I will freely admit that I do not use labtestwiki or any of the other bits in the codfw1dev deployment often personally, but I do think that if we are going to continue to have a testing OpenStack environment it should include all of the parts that it reasonably can.

Originally it was useful for testing openstack/mediawiki integration, but these days it's good for exactly two things:

It is also the target of the OpenStack hooks which do page creation/deletion when new projects are created/deleted. These are exactly the things that wikitech does for the eqiad1 deployment, are they not?

OTOH test wikitech adds an unimaginable amount of overhead to our daily work. It's not listed in zarcillo, it has its own section, etc.

I suppose this is primarily because T167973: Move database for wikitech (labswiki) to a main cluster section was only done for wikitech and not also for labtestwiki?

OTOH test wikitech adds an unimaginable amount of overhead to our daily work. It's not listed in zarcillo, it has its own section, etc.

I suppose this is primarily because T167973: Move database for wikitech (labswiki) to a main cluster section was only done for wikitech and not also for labtestwiki?

That would be correct, yes.

Andrew renamed this task from Abolish labtestwikitech to Revive Labtestwikitech (formerly: Abolish labtestwikitech).Jul 27 2022, 2:38 PM

As per the discussion above, I've retitled this task. We need labtestwikitech, and we currently don't have it.

The site currently fails due to '(Cannot access the database: MySQL server has gone away (clouddb2001-dev))' -- presumably due to config changes that didn't account for it existing. @Ladsgroup, do you know offhand what the breaking change was?

It's probably because the grants for the new wikiuser have not been added there (and were not added during the incident, for obvious reasons).

I'm not saying we should definitely abolish labtestwikitech, but it should either be at least on par with labswiki (moved to production, or at least with its db upgraded off stretch) or be fully removed.

Change 822428 had a related patch set uploaded (by Majavah; author: Majavah):

[operations/mediawiki-config@master] Fix labtestwiki database name servers

https://gerrit.wikimedia.org/r/822428

Change 822428 merged by jenkins-bot:

[operations/mediawiki-config@master] Fix labtestwiki database name servers

https://gerrit.wikimedia.org/r/822428

Mentioned in SAL (#wikimedia-operations) [2022-08-11T17:58:37Z] <taavi@deploy1002> Synchronized wmf-config/CommonSettings.php: Config: [[gerrit:822428|Fix labtestwiki database name servers (T310795)]] (duration: 03m 39s)

I'm not saying we should definitely abolish labtestwikitech, but it should either be at least on par with labswiki (moved to production, or at least with its db upgraded off stretch) or be fully removed.

My recollection is that it doesn't work to have the DB in production because it's hosted in codfw1dev which means the prod databases are typically read-only there.

After checking a run with tcpdump:

[Screenshot: Screenshot_20220812_190525.png, tcpdump capture of the connection attempt]
You can see how the client closes the connection after the HELO from the server. This is not a grant issue: the client seems not to like the initial message from the server, so it must be a protocol or configuration issue. Either the client doesn't understand the mysql_native_password protocol (it is outdated) or something else makes the client abort (TLS negotiation? enforcing MySQL-only caching_sha2_password authentication on the client?).

Sadly, this is something that has to be discovered ad hoc, because the client and server setup is very unlike production (TCP6, debian upstream package, 10.5, different configuration), so I'm unsure which of those differences is causing the error. E.g. there are known mysql and mariadb incompatibilities and version number incompatibilities, but none that cannot be overcome by upgrading/downgrading or reconfiguring the client library/connector or the server.
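To make the "client hangs up right after the server's first packet" observation easier to reason about, here is a minimal Python sketch that decodes a server greeting captured with tcpdump. It is not something used in this task. It assumes the protocol-10 handshake layout as described in the MySQL client/server protocol documentation, and it assumes the server sets CLIENT_SECURE_CONNECTION (so the second auth-data block is present). `parse_greeting` and `build_sample_greeting` are hypothetical helper names, and the sample packet is synthetic: only the version string and connection id are borrowed from the CLI session in this thread.

```python
import struct

CLIENT_SSL = 0x00000800          # server advertises that it can switch to TLS
CLIENT_PLUGIN_AUTH = 0x00080000  # greeting ends with an auth plugin name


def parse_greeting(packet: bytes) -> dict:
    """Decode the fields of a protocol-10 server greeting that matter here."""
    length = int.from_bytes(packet[0:3], "little")  # 3-byte payload length
    payload = packet[4:4 + length]                  # skip the 1-byte sequence id
    if payload[0] != 10:
        raise ValueError("not a protocol-10 handshake")
    end = payload.index(b"\x00", 1)                 # version string is NUL-terminated
    version = payload[1:end].decode("ascii", "replace")
    pos = end + 1 + 4 + 8 + 1                       # connection id, auth data 1, filler
    (cap_low,) = struct.unpack_from("<H", payload, pos)
    pos += 2 + 1 + 2                                # caps (low), charset, status flags
    (cap_high,) = struct.unpack_from("<H", payload, pos)
    caps = cap_low | (cap_high << 16)
    auth_len = payload[pos + 2]                     # total length of auth data
    pos += 2 + 1 + 10                               # caps (high), auth length, reserved
    plugin = None
    if caps & CLIENT_PLUGIN_AUTH:
        pos += max(13, auth_len - 8)                # auth data part 2 (assumed present)
        nul = payload.find(b"\x00", pos)
        plugin = payload[pos:nul if nul != -1 else None].decode("ascii", "replace")
    return {
        "server_version": version,
        "tls_advertised": bool(caps & CLIENT_SSL),
        "auth_plugin": plugin,
    }


def build_sample_greeting(caps: int) -> bytes:
    """Build a synthetic greeting for illustration (not a real capture)."""
    payload = (
        b"\x0a" + b"10.4.25-MariaDB\x00"
        + struct.pack("<I", 87) + b"A" * 8 + b"\x00"
        + struct.pack("<H", caps & 0xFFFF) + b"\x21" + struct.pack("<H", 2)
        + struct.pack("<H", caps >> 16) + bytes([21]) + b"\x00" * 10
        + b"B" * 12 + b"\x00" + b"mysql_native_password\x00"
    )
    return len(payload).to_bytes(3, "little") + b"\x00" + payload


if __name__ == "__main__":
    # 0x8000 is CLIENT_SECURE_CONNECTION; CLIENT_SSL is deliberately left out.
    print(parse_greeting(build_sample_greeting(CLIENT_PLUGIN_AUTH | 0x8000)))
```

If a real greeting shows `tls_advertised` as false while the client insists on TLS, an abort right after the greeting is what one would expect to see in the capture.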

At least the ipv6 thing shouldn't be an issue, since as far as I know mediawiki accesses via ipv4:

CommonSettings.php:$wgLBFactoryConf['hostsByName']['clouddb2002-dev'] = '10.192.20.6';

It seems the mysql server is accepting connections on ipv6 (and mediawiki is trying to connect on ipv4). I didn't properly and fully check it but it looks like it. We can try changing it to 2620:0:860:118:10:192:20:6 and see if that fixes it.

It is also possible and likely that this is not the only issue.
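A quick way to rule the address family in or out before touching CommonSettings.php would be a plain TCP probe from the app server against both addresses. The following is a minimal Python sketch, not a tool used in this task. `address_family` and `can_connect` are hypothetical helper names, and port 3306 is an assumption (the MariaDB default; the thread does not state the port).

```python
import ipaddress
import socket


def address_family(addr: str) -> int:
    """Return 4 or 6 depending on the IP version of a literal address."""
    return ipaddress.ip_address(addr).version


def can_connect(addr: str, port: int, timeout: float = 3.0) -> bool:
    """Return True if a TCP connection to addr:port can be established."""
    try:
        with socket.create_connection((addr, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts and unreachable networks.
        return False


if __name__ == "__main__":
    # Both addresses are the ones quoted in this thread; 3306 is assumed.
    for addr in ("10.192.20.6", "2620:0:860:118:10:192:20:6"):
        print(f"IPv{address_family(addr)} {addr}: reachable={can_connect(addr, 3306)}")
```

A successful probe only shows that the TCP handshake completes. It says nothing about what happens once the MySQL protocol exchange starts.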

Change 822683 had a related patch set uploaded (by Andrew Bogott; author: Andrew Bogott):

[operations/puppet@production] profile::openstack::codfw1dev::db: install wmf-mariadb104

https://gerrit.wikimedia.org/r/822683

Change 822683 merged by Andrew Bogott:

[operations/puppet@production] profile::openstack::codfw1dev::db: install wmf-mariadb104

https://gerrit.wikimedia.org/r/822683

Change 822685 had a related patch set uploaded (by Andrew Bogott; author: Andrew Bogott):

[operations/puppet@production] profile::openstack::codfw1dev::db: update ref to wmf-mariadb104

https://gerrit.wikimedia.org/r/822685

Change 822685 merged by Andrew Bogott:

[operations/puppet@production] profile::openstack::codfw1dev::db: update ref to wmf-mariadb104

https://gerrit.wikimedia.org/r/822685

I have rebuilt clouddb2002-dev with the 'official' package, wmf-mariadb104. I can also confirm that the db works fine on ipv4:

root@cloudweb2002-dev:/srv/mediawiki/private# mysql -h 10.192.20.6 -u wikiuser202206 -p
Enter password: 
Welcome to the MariaDB monitor.  Commands end with ; or \g.
Your MariaDB connection id is 87
Server version: 10.4.25-MariaDB MariaDB Server

Copyright (c) 2000, 2018, Oracle, MariaDB Corporation Ab and others.

Type 'help;' or '\h' for help. Type '\c' to clear the current input statement.

mysql:wikiuser202206@10.192.20.6 [(none)]>

I hope that cuts down on the variables a bit.

I hope that cuts down on the variables a bit

Sadly, as I mentioned in the tcpdump analysis, the issue seems to be on the client side, in the php client connector (although it could maybe be mitigated on the server side, e.g. through configuration).

tagging @Reedy just in case you're in the mood for a challenge :)

@jcrespo I have long ago lost track of the proxy layer between MW and DB but isn't it likely that something is confused or misconfigured in between the two hosts? Or is there already special-purpose config that means MW is talking directly to the db with no middleware? This broke quite a while ago and IIRC @Ladsgroup had already proposed a possible cause for the breakage, having to do with some kind of update to the db config code.

...I should add that we're running into ever more work that's blocked for lack of this wiki :(

there already special-purpose config that means MW is talking directly to the db with no middleware

MW has its own internal load balancer/pooling logic- it shouldn't require haproxy. Of course it could be related to that- db config code (I said it looked client-related), but I am the wrong person to debug that - not a mw expert. I just sent the tcpdump to prove communication was happening and then the client dropped connection. Maybe you've discovered a mw bug? IDK

For anyone joining this ticket late, the app server is cloudweb2002-dev.wikimedia.org and the db server is clouddb2001-dev.codfw.wmnet

It fails in both web and CLI context from cloudweb2002-dev:

krinkle@cloudweb2002-dev:~$ mwscript eval.php --wiki labtestwiki
> $db = wfGetDB(DB_REPLICA);
> $db->query('SELECT 1;');
Caught exception Wikimedia\Rdbms\DBConnectionError: Cannot access the database: MySQL server has gone away (clouddb2002-dev)

It works when using mysql's own CLI however

krinkle@cloudweb2002-dev:~$ mysql -h 10.192.20.6 -u wikiuser202206 -p
# password from /srv/mediawiki/private/PrivateSettings.php
mysql:wikiadmin@10.192.20.6 [(none)]> use labtestwiki
SHOW TABLES

abuse_filter
page
revision
…

Confirm via eval.php that what failed before is indeed comparable:

$ host 10.192.20.6
clouddb2002-dev.codfw.wmnet

krinkle@cloudweb2002-dev:~$ mwscript eval.php --wiki labtestwiki
> return $wgLBFactoryConf;
# WARNING: private information

      ["clouddb2002-dev"]=>
      string(11) "10.192.20.6"

      ["s11"]=>
      array(1) {
        ["clouddb2002-dev"]=>
        int(1)
      }

      ["labtestwiki"]=>
      string(3) "s11"

I do note that it uses wikiadmin when using wfGetDB() in the MediaWiki CLI, not wikiuser202206.

However, using the mysql CLI with wikiadmin (and its, different, password) also works fine.

Trying to narrow down what's different between the two.

krinkle@cloudweb2002-dev:~$ mwscript mysql.php --wiki labtestwiki
MariaDB [labtestwiki]>

This works, but doesn't say much. This actually shells out to the same mysql CLI. But, it does run as the MW shell user (instead of me), and obtains host/user/credentials from MW. So at least that part is right.

krinkle@cloudweb2002-dev:~$ mwscript sql.php --wiki labtestwiki
Wikimedia\Rdbms\DBConnectionError from line 1475 of /srv/mediawiki/php-1.39.0-wmf.25/includes/libs/rdbms/loadbalancer/LoadBalancer.php: Cannot access the database: MySQL server has gone away (clouddb2002-dev)

This uses php-mysql and MW's own REPL, and there it fails again.
I confirmed again via eval.php that both DB_PRIMARY and DB_REPLICA fail. I imagine those resolve to the same test DB host, but trying both could have helped identify a semantic issue, since internally we do treat those connections differently and may set different kinds of session settings.

[15:56 UTC] krinkle at mwmaint1002.eqiad.wmnet in ~
$ mwscript eval.php --wiki labtestwiki
> wfGetDB(DB_PRIMARY)->query('SELECT 1;');
> return wfGetDB(DB_PRIMARY)->query('SELECT @@hostname;')->fetchRow();
array(2) {
  [0]=>
  string(15) "clouddb2002-dev"
  ["@@hostname"]=>
  string(15) "clouddb2002-dev"
}

Interestingly, from Eqiad this is working fine. If that had connected to a local equivalent instead, it might have meant something was off about the way clouddb2002-dev is configured in Codfw. But this wiki only exists in Codfw, and it's actually connecting to there.

krinkle@mwmaint2002:~$  mwscript eval.php --wiki labtestwiki
> wfGetDB(DB_PRIMARY)->query('SELECT 1;');
Caught exception Wikimedia\Rdbms\DBConnectionError: Cannot access the database: MySQL server has gone away (clouddb2002-dev)

And it's failing on mwmaint2002 the same as on cloudweb2002-dev. This suggests it isn't something specific to cloudweb2002-dev, but probably more broadly about how mw-related hosts are provisioned in Codfw, or something about the Codfw DB that is rejecting local Codfw connections.

Change 824764 had a related patch set uploaded (by Majavah; author: Majavah):

[operations/puppet@production] P:openstack::codfw1dev::db: support TLS

https://gerrit.wikimedia.org/r/824764

Change 824764 merged by Andrew Bogott:

[operations/puppet@production] P:openstack::codfw1dev::db: support TLS

https://gerrit.wikimedia.org/r/824764

Mentioned in SAL (#wikimedia-cloud) [2022-08-19T17:06:24Z] <taavi> [codfw1dev] restart mariadb on clouddb2002-dev to pick up certificate config changes T310795

Andrew claimed this task.

Quick summary -- the app server was using TLS to talk to the DB. Even though both were in Dallas, the global setting for primary DC was set to eqiad, which causes MediaWiki to think it is talking cross-DC even when it isn't.

clouddb2002-dev wasn't configured for TLS which was causing weird sudden disconnects.

Now clouddb2002-dev /is/ configured for TLS, thanks to Taavi's patch, above. So the situation is still weird (thinking that traffic is cross-dc when it isn't) but things work so I'm happy.
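The summary above can be modelled in a few lines. This is a minimal Python sketch of the reasoning only, not the actual mediawiki-config logic: `client_wants_tls` and `connection_succeeds` are hypothetical names, and the rule "use TLS whenever the local DC differs from the primary DC" is an assumption inferred from this thread.

```python
def client_wants_tls(client_dc: str, primary_dc: str) -> bool:
    """Assumed rule: traffic is treated as cross-DC when the DCs differ."""
    return client_dc != primary_dc


def connection_succeeds(client_dc: str, primary_dc: str, server_supports_tls: bool) -> bool:
    """A client that insists on TLS fails against a server without TLS."""
    if client_wants_tls(client_dc, primary_dc):
        return server_supports_tls
    return True


if __name__ == "__main__":
    primary_dc = "eqiad"
    # Hosts are the ones from this thread, paired with the DC they run in.
    hosts = [("mwmaint1002", "eqiad"), ("mwmaint2002", "codfw"), ("cloudweb2002-dev", "codfw")]
    for tls_on_server in (False, True):
        for host, dc in hosts:
            ok = connection_succeeds(dc, primary_dc, tls_on_server)
            print(f"server TLS={tls_on_server} {host} ({dc}): {'ok' if ok else 'fails'}")
```

Under these assumptions the model reproduces what was observed: before the TLS patch only the Eqiad host could connect, and after it every host can. The mysql CLI worked throughout, presumably because it was not asked to use TLS.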

Thank you all!