Page MenuHomePhabricator

Follow-up on wmf-config "ClusterConfig::isTest" method
Closed, ResolvedPublic

Description

Summary from chatting with @Joe today...

Background

the ClusterConfig utility class was recently added to wmf-config as part of change 749717 with the intention to ease configuration variance for MW-on-K8s, by avoiding duplicate logic and/or global variables that reflect $_SERVER['SERVERGROUP'].

To learn the possible values of this environment variable and what it is used for, refer to https://wikitech.wikimedia.org/wiki/MediaWiki_at_WMF#App_servers.

Change

this utility class abstracts the established "server group" concept through a list of one of more boolean "traits".

Examples of server group names:

  • appserver
  • kube-mw-web
  • api_appserver
  • kube-mw-api-ext
  • kube-mw-api-int
  • kube-mw-debug (servergroup)
  • mwdebug#### (hostname)

Examples of traits:

  • bool isK8s -> contains kube-
  • bool isApi -> contains api

Among the traits is:

  • bool isTest -> contains debug

Proposal

The "test" and "isTest" trait are imho confusingly named. Overall, given how we use these terms today, I'd say "This applies to test" feels to significantly different from (and much less scary/heavy weight) than "This is deployed to debug". Test feels expressly outside the production security realm. We tend to use "test" environment and "test" clusters in ways that is mutually exclusive with production traffic.

We instead tend to use the term "debug" for the above request contexts, for example:

We usually refer to WikimediaDebug backends as "debug" requests, not "test" requests. And within MediaWiki core, for example, we have various "debug" modes that can be used in production, however, "test" modes tend to refer to unit testing or CI/Jenkins context.

I suggest we rename isTest to isDebug to keep terminology more internally consistent, ease the learning curve, and to reduce chances of commits or changes being misinterpreted.

Behind the scenes, I also noticed that this trait actually returns true in kubernetes and not on the mwdebug hosts. This is because mwdebug currently provisions SERVERGROUP via cluster: appserver. I suggest we either provision this differently from Puppet (e.g. debug_appserver), or incorporate a hostname check in ClusterConfig::isDebug.

  • Rename isTest to isDebug.
  • Confirm it returns true via WikimediaDebug when selecting k8s.
  • Confirm it returns true via WikimediaDebug when selecting mwdebug1002 (eqiad)
  • Confirm it returns true via WikimediaDebug when selecting mwdebug2002 (codfw).

Event Timeline

I would suggest to actually "fix" mediawiki-config, as changing cluster in puppet is not as straightforward as we expected.

Change 976252 had a related patch set uploaded (by D3r1ck01; author: Derick Alangi):

[operations/mediawiki-config@master] ClusterConfig: Rename `isTest()` to `isDebug()` for consistency

https://gerrit.wikimedia.org/r/976252

DAlangi_WMF changed the task status from Open to In Progress.Nov 29 2023, 10:33 AM

Change 976252 merged by jenkins-bot:

[operations/mediawiki-config@master] ClusterConfig: Rename `isTest()` to `isDebug()` for consistency

https://gerrit.wikimedia.org/r/976252

Mentioned in SAL (#wikimedia-operations) [2023-12-11T08:21:20Z] <kharlan@deploy2002> Started scap: Backport for [[gerrit:976252|ClusterConfig: Rename isTest() to isDebug() for consistency (T347366)]], [[gerrit:981424|IPInfo: Add comment clarifying $wgIPInfoGeoIP2EnterprisePath (T304604)]]

Mentioned in SAL (#wikimedia-operations) [2023-12-11T08:22:42Z] <kharlan@deploy2002> kharlan and d3r1ck01: Backport for [[gerrit:976252|ClusterConfig: Rename isTest() to isDebug() for consistency (T347366)]], [[gerrit:981424|IPInfo: Add comment clarifying $wgIPInfoGeoIP2EnterprisePath (T304604)]] synced to the testservers (https://wikitech.wikimedia.org/wiki/Mwdebug)

@Krinkle, hostnames seem to be different from how they were before (mwdebug1001, mwdebug1002 etc). Now we just have deploy1001, deploy2002 (the host I used today - in codfw DC).

So since the patch tries to look for "debug" in the hostname (as in mwdebugX00X) and doesn't see it for the "deploy2002" case, $cc->isDebug() returns false.

However, I accessed the k8s REPL shell to test our theory that the patch is doing what it's expected to do. On that environment (hostname like: mw-debug.codfw....), running $cc->isDebug() returns true as expected.

So we may have to improve on the patch a little but to check for more things. Maybe cases where we have "deploy1001" and "deploy2002" (as those are AFAIK are our debug hosts?). Let me know what you think.

Mentioned in SAL (#wikimedia-operations) [2023-12-11T08:43:22Z] <kharlan@deploy2002> Finished scap: Backport for [[gerrit:976252|ClusterConfig: Rename isTest() to isDebug() for consistency (T347366)]], [[gerrit:981424|IPInfo: Add comment clarifying $wgIPInfoGeoIP2EnterprisePath (T304604)]] (duration: 22m 02s)

Change 981954 had a related patch set uploaded (by D3r1ck01; author: Derick Alangi):

[operations/mediawiki-config@master] ClusterConfig: Followup on I955168f072315e0064c69a66483e61dfc23

https://gerrit.wikimedia.org/r/981954

Change 981954 abandoned by D3r1ck01:

[operations/mediawiki-config@master] ClusterConfig: Followup on I955168f072315e0064c69a66483e61dfc23

Reason:

Ssh'ing into an actual debug host: mwdebug2002.codfw.wmnet to test the parent change works. But the alias deployment.codfw.wmnet takes me somewhere else.

https://gerrit.wikimedia.org/r/981954

@Krinkle, hostnames seem to be different from how they were before (mwdebug1001, mwdebug1002 etc). Now we just have deploy1001, deploy2002 (the host I used today - in codfw DC).

So since the patch tries to look for "debug" in the hostname (as in mwdebugX00X) and doesn't see it for the "deploy2002" case, $cc->isDebug() returns false.

However, I accessed the k8s REPL shell to test our theory that the patch is doing what it's expected to do. On that environment (hostname like: mw-debug.codfw....), running $cc->isDebug() returns true as expected.

So we may have to improve on the patch a little but to check for more things. Maybe cases where we have "deploy1001" and "deploy2002" (as those are AFAIK are our debug hosts?). Let me know what you think.

So after chatting with Krinkle on IRC, it seems nowaways, "debug hosts" and "deployment hosts" are different?

In the past, when I use the alias: deployment.eqiad/codfw.wmnet, it'll take me to the right "debug host", either "mwdebug1001/mwdebug2001" but nowadays, it takes me to "deployX00X" instead. So with this new knowledge I've gotten, I think the child patch is not needed and this task can be resolved.

But I'll like to see a place that documents the difference between the debug hosts and deployment hosts (where we login to deploy changes) because up until now, I thought they were the same things or maybe I'm just mixing everything up, sorry.

Per https://wikitech.wikimedia.org/wiki/Backport_windows/Deployers#Deploying_changes, I had to refresh my knowledge (again).

Notes:
Deployment host: deployment.codfw/eqiad.wmnet -> deploy2002/1002
Debug host: mwdebug2002/1002.codfw/eqiad.wmnet -> mwdebug2002/1002

So everything works correctly as expected.