Provide an mwdebug functionality on kubernetes (mw-experimental)
Closed, Resolved (Public)

Description

Most of us are used to (and rely on) testing MediaWiki itself, as well as the components surrounding it (e.g. memcached, envoy), using the mwdebug servers. We want to replicate this functionality when we move MediaWiki to kubernetes.

Update: this service is live and ready for testing on eqiad only: mw-experimental

What?

Create a separate kubernetes service or services

Requirements

  • have its own metrics and logging where engineers can look when testing changes there
    • probably by using its own servergroup
  • engineers can deploy experimental changes easily (e.g. by directly editing or copying files)
  • do not alert on errors etc
  • service needs to stay up to date with mediawiki images running on production
  • route traffic through XWD

Proposal

Host /srv/mediawiki on predefined kubernetes nodes, and have the mediawiki pods running on those nodes mount it directly via hostPath. Users interested in testing or editing files manually can do so by simply ssh-ing to those hosts and editing files as usual. Using XWD, they can test their code by selecting the host they are working on.

For the sake of simplicity, let's call this service mw-experimental.

Kubernetes parts

Deployment and Service

mw-experimental will be a new deployment and a new service. Differences from mw-debug (and other mw-*):

  • hostPath: /srv/mediawiki will be mounted via hostPath, overriding what is in the images
  • NodePort: This will be a NodePort service
  • internalTrafficPolicy: Setting it to Local will ensure that the mw-experimental pod running on a host will be the one to serve any requests reaching that host
  • Affinity and/or Tolerations: specific nodes will be allowed to host a mw-experimental pod
  • Code Updates
    • NoScap: Mediawiki freshness of the mw code will depend on the latest mediawiki-multiversion image found in the host
    • Systemd timer: a timer (or cron job) could check for a newer image and react (i.e. copy files from the image)
    • No need to make those hosts scap targets
  • One Pod: Each eligible host will run exactly 1 mw-experimental pod
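
The differences listed above can be sketched as manifest data. This is a minimal illustration in Python, not the actual chart: the label/taint key example.org/dedicated, the image name, the nodePort value and the replica count are all assumptions, and port 4888 is borrowed from the routing example further down in this description.

```python
def mw_experimental_manifests(node_role="mw-experimental", node_port=30888):
    """Return (deployment, service) dicts. All names and values are illustrative."""
    labels = {"app": "mw-experimental"}
    pod_spec = {
        # Affinity/tolerations: only nodes with this (hypothetical) label and taint qualify.
        "nodeSelector": {"example.org/dedicated": node_role},
        "tolerations": [{
            "key": "example.org/dedicated",
            "operator": "Equal",
            "value": node_role,
            "effect": "NoSchedule",
        }],
        "containers": [{
            "name": "mediawiki",
            "image": "mediawiki-multiversion:latest",  # placeholder image reference
            # The host directory shadows whatever the image ships at this path.
            "volumeMounts": [{"name": "srv-mediawiki", "mountPath": "/srv/mediawiki"}],
        }],
        "volumes": [{
            "name": "srv-mediawiki",
            "hostPath": {"path": "/srv/mediawiki", "type": "Directory"},
        }],
    }
    deployment = {
        "apiVersion": "apps/v1",
        "kind": "Deployment",
        "metadata": {"name": "mw-experimental"},
        "spec": {
            "replicas": 1,  # assumption; does not by itself enforce one pod per host
            "selector": {"matchLabels": labels},
            "template": {"metadata": {"labels": labels}, "spec": pod_spec},
        },
    }
    service = {
        "apiVersion": "v1",
        "kind": "Service",
        "metadata": {"name": "mw-experimental"},
        "spec": {
            "type": "NodePort",
            "internalTrafficPolicy": "Local",
            "selector": labels,
            # nodePort is hypothetical; the default allowed range is 30000-32767.
            "ports": [{"port": 4888, "targetPort": 4888, "nodePort": node_port}],
        },
    }
    return deployment, service
```

Note that a plain Deployment does not guarantee one pod per eligible host; that would need something extra, such as pod anti-affinity or a DaemonSet, which this sketch leaves out.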

Application Configuration

  • opcache.validate_timestamps: php-fpm should have it enabled
  • SERVERGROUP: distinct servergroup
  • TBA?

Puppet/Infra

  • VMs: Just like we have done already with kask, those nodes could be just 2 VMs per DC
  • Puppet Profile: We can have a puppet profile where we define the snowflakey stuff, and use a hiera on/off switch
  • Users/Deployers/Permissions: We could have a designated user group, and manage the permissions of /srv/mediawiki accordingly
  • ATS/XWD
    • x-wikimedia-debug-routing: will be edited accordingly, forwarding for example to wikikube-worker1001.eqiad.wmnet:4888
    • We can go as far as creating a CNAME pointing to the hostnames of the k8s workers hosting mw-experimental, eventually making this fully transparent to users
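
The routing idea can be sketched as a small lookup. This assumes the X-Wikimedia-Debug header carries semicolon-separated attributes such as backend=<host>; the alias mw-experimental.eqiad.wmnet and the ROUTES table are hypothetical stand-ins for the CNAME idea above, and the target is the example host and port from this description. It is not the actual ATS configuration.

```python
def parse_xwd(header):
    """Split an X-Wikimedia-Debug value into a dict of attributes.

    Attributes without a value (e.g. "profile") are stored as True.
    """
    attrs = {}
    for part in header.split(";"):
        part = part.strip()
        if not part:
            continue
        key, sep, value = part.partition("=")
        attrs[key.strip()] = value.strip() if sep else True
    return attrs


# Hypothetical routing table: requested backend name -> host:port to forward to.
ROUTES = {
    "mw-experimental.eqiad.wmnet": "wikikube-worker1001.eqiad.wmnet:4888",
}


def resolve_backend(header, routes=ROUTES):
    """Return the forwarding target for a header value, or None if unrouted."""
    backend = parse_xwd(header).get("backend")
    return routes.get(backend)
```

With such a table, users would select a stable alias while the worker hostname behind it can change without them noticing.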

Other info

Wow there, that sounds like another snowflake!

Well, this is no more of a snowflake than the mwdebug servers used to be, and it is also a snowflake where it is OK if it is down or not functional.


Event Timeline

I'm not sure where to put this, but perhaps here is a good place:

When using WikimediaDebug, I almost always use mwdebugX00X hosts instead of k8s-mwdebug. The reason is that when profiling or debugging, I generally want a host that has been warmed up by a previous request. For example, the profile should not be dominated by wmf-config and EtcdConfig, and should not include unrepresentative code paths and overhead from Autoloader class compiling and APCu misses across the code base. That overhead usually only shows up on a cold hit, when first navigating to a given wiki in a recent time period.

For the mwdebug hosts, I can do the action first without profiling (or even just view the Main_Page), and then profile the action on the same host. In practice, it seems k8s-mwdebug randomly distributes requests over multiple separate pods, so a request is unlikely to be routed to the same pod as a previous request a few seconds earlier. Thus, requests tend to hit cold each time.

I could file this as a bug report, but since the mwdebug hosts are still there, this isn't an issue for me (yet). With this task representing the general replacement of mwdebug hosts on k8s, I figured I'd add it here as something to keep in mind. The "select in ATS/XWD" criterion on this task will probably suffice (assuming that "mw-experimental" is either always available, or lazily awakened as needed). If it's something created ad hoc via deploy/mwmaint CLI only, then it might make sense to rewrite this need as something to improve in the mwdebug deployment instead (i.e. does it need to have more than 1 pod per DC?)

Thanks, @Krinkle - this is a good point.

Agreed that, as it exists today, we don't have the ability to target a specific - or at least stable - choice of pod externally via x-wikimedia-debug.

Given that there are only 2 pods per deployment, it should be possible to send some modest number of warmup requests over a short period and with fairly high probability have hit both pods. However, that's clearly more effort (and more awkward) than sending a single warmup request that is certain to hit the soon-to-be-profiled backend.
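
As a rough check of "fairly high probability", the chance of having hit every pod after a number of warmup requests can be computed by inclusion-exclusion. This is a sketch that assumes each request is routed uniformly and independently at random, which may not match the real load balancing; the function name is illustrative.

```python
from math import comb


def prob_all_pods_hit(pods, requests):
    """Probability that `requests` uniform random requests reach every one of `pods` pods."""
    # Inclusion-exclusion over the number i of pods that are never hit.
    return sum(
        (-1) ** i * comb(pods, i) * ((pods - i) / pods) ** requests
        for i in range(pods + 1)
    )


print(prob_all_pods_hit(2, 5))  # 0.9375
print(prob_all_pods_hit(2, 7))  # 0.984375
```

For 2 pods this reduces to 1 - 2^(1-n), so under these assumptions 5 warmup requests warm both pods about 94% of the time and 7 requests about 98% of the time.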

Also, while any sort of opaque performance testing (e.g., external measurement of throughput) should run within the WMF production network anyway and can thus always target a single pod IP, that does not help for this kind of use case.

I'm not seeing any obvious "quick wins" in ATS config or the like that would clearly improve the stability situation, but would need to look more closely.

In any case, it would indeed seem that the simplest and 100% effective solution is to reconsider the pod count on the mw-debug deployments - i.e., whether 1 is suitable, particularly since the experimental use case (T324003) is entirely separate.

jijiki renamed this task from Provide an mwdebug functionality on kubernetes to Provide an mwdebug functionality on kubernetes (mw-experimental).Jan 16 2025, 11:30 AM

This work will commence while we are in the process of ramping up traffic towards PHP8.1 (T383845), as mw-experimental will be useful to both devs and SREs for the PHP8.1 migration. However, having a mw-experimental service is not a blocker for completing the traffic migration to PHP8.1.

T328921 is the MW code deprecation task that's waiting on the Wikimedia production task, T319432, which this already blocks. I don't think this should be marked as blocking both its parent and grandparent.

you are right, my bad

NoScap: Mediawiki freshness of the mw code will depend on the latest mediawiki-multiversion image found in the host

This would be a new requirement in my understanding and possibly entail additional deployment complexities in the planned "single version container" future that we are actively working towards.

Change #1123048 had a related patch set uploaded (by Effie Mouzeli; author: Effie Mouzeli):

[operations/puppet@production] WIP: introduce mw-experimental functionality

https://gerrit.wikimedia.org/r/1123048

Change #1147782 had a related patch set uploaded (by Effie Mouzeli; author: Effie Mouzeli):

[operations/puppet@production] hieradata: add usernames for mw-expermental

https://gerrit.wikimedia.org/r/1147782

Change #1147787 had a related patch set uploaded (by Effie Mouzeli; author: Effie Mouzeli):

[operations/deployment-charts@master] admin_ng: add mw-experimental namespace

https://gerrit.wikimedia.org/r/1147787

Change #1148300 had a related patch set uploaded (by Effie Mouzeli; author: Effie Mouzeli):

[operations/puppet@production] kubernetes::deployment_server: add new mw-experimental release

https://gerrit.wikimedia.org/r/1148300

Change #1148905 had a related patch set uploaded (by Effie Mouzeli; author: Effie Mouzeli):

[operations/puppet@production] WIP: profile::kubernetes::node: Add script to pull and mount latest mw

https://gerrit.wikimedia.org/r/1148905

Change #1150760 had a related patch set uploaded (by Effie Mouzeli; author: Effie Mouzeli):

[operations/deployment-charts@master] mw-experimental: initial commit (vanilla)

https://gerrit.wikimedia.org/r/1150760

Change #1150762 had a related patch set uploaded (by Effie Mouzeli; author: Effie Mouzeli):

[operations/deployment-charts@master] mw-experimental: create new service

https://gerrit.wikimedia.org/r/1150762

Change #1151753 had a related patch set uploaded (by Effie Mouzeli; author: Effie Mouzeli):

[operations/puppet@production] deployment:fix-staging-perm: update fix-staging-perms

https://gerrit.wikimedia.org/r/1151753

Change #1152005 had a related patch set uploaded (by Effie Mouzeli; author: Effie Mouzeli):

[operations/puppet@production] hieradata: Make wikikube-worker2300 a mw-experimental worker

https://gerrit.wikimedia.org/r/1152005

Change #1151753 merged by Effie Mouzeli:

[operations/puppet@production] deployment:fix-staging-perm: update fix-staging-perms

https://gerrit.wikimedia.org/r/1151753

Change #1148905 abandoned by Effie Mouzeli:

[operations/puppet@production] WIP: profile::kubernetes::node: Add script to pull and mount latest mw

Reason:

will try again

https://gerrit.wikimedia.org/r/1148905

Change #1147782 merged by Effie Mouzeli:

[operations/puppet@production] profile::kubernetes::deployment_server: add usernames for mw-experimental #1

https://gerrit.wikimedia.org/r/1147782

Change #1148300 merged by Effie Mouzeli:

[operations/puppet@production] profile::kubernetes::deployment_server: add new mw-experimental release #2

https://gerrit.wikimedia.org/r/1148300

Change #1147787 merged by jenkins-bot:

[operations/deployment-charts@master] admin_ng: add mw-experimental namespace with hostPath support #3

https://gerrit.wikimedia.org/r/1147787

Change #1123048 merged by Effie Mouzeli:

[operations/puppet@production] kubernetes:mediawiki_runner: introduce mw-experimental #5

https://gerrit.wikimedia.org/r/1123048

Mentioned in SAL (#wikimedia-operations) [2025-06-03T16:20:33Z] <jiji@deploy1003> Started scap sync-world: T276994: We merged a number of noop patches, sparing deployers the scary diffs

Mentioned in SAL (#wikimedia-operations) [2025-06-03T16:23:32Z] <jiji@deploy1003> Finished scap sync-world: T276994: We merged a number of noop patches, sparing deployers the scary diffs (duration: 02m 58s)

Change #1152005 merged by Effie Mouzeli:

[operations/puppet@production] hieradata: Make wikikube-worker2100 a mw-experimental worker

https://gerrit.wikimedia.org/r/1152005

Change #1153594 had a related patch set uploaded (by Effie Mouzeli; author: Effie Mouzeli):

[operations/deployment-charts@master] mcrouter ds: allow mw-mcrouter ds to run on mw-experimental nodes

https://gerrit.wikimedia.org/r/1153594

Change #1153611 had a related patch set uploaded (by Effie Mouzeli; author: Effie Mouzeli):

[operations/deployment-charts@master] mediawiki: add tolerations

https://gerrit.wikimedia.org/r/1153611

Change #1153594 merged by jenkins-bot:

[operations/deployment-charts@master] mcrouter ds: allow mw-mcrouter ds to run on mw-experimental nodes

https://gerrit.wikimedia.org/r/1153594

Change #1153611 merged by jenkins-bot:

[operations/deployment-charts@master] mediawiki: add tolerations

https://gerrit.wikimedia.org/r/1153611

Mentioned in SAL (#wikimedia-operations) [2025-06-04T15:02:46Z] <jiji@deploy1003> Started scap sync-world: T276994: Chart bump, noop

Mentioned in SAL (#wikimedia-operations) [2025-06-04T15:05:39Z] <jiji@deploy1003> Finished scap sync-world: T276994: Chart bump, noop (duration: 02m 52s)

Change #1150760 merged by jenkins-bot:

[operations/deployment-charts@master] mw-experimental: initial commit (vanilla)

https://gerrit.wikimedia.org/r/1150760

Change #1150762 merged by jenkins-bot:

[operations/deployment-charts@master] mw-experimental: create new service #6

https://gerrit.wikimedia.org/r/1150762

Change #1154069 had a related patch set uploaded (by Effie Mouzeli; author: Effie Mouzeli):

[operations/puppet@production] x-wikimedia-debug-routing: add mw-experimental hosts

https://gerrit.wikimedia.org/r/1154069

Change #1154070 had a related patch set uploaded (by Effie Mouzeli; author: Effie Mouzeli):

[operations/mediawiki-config@master] debug.json: add mw-experimental hosts

https://gerrit.wikimedia.org/r/1154070

Current status:

  • POC working on wikikube-worker2100.codfw.wmnet, tested with httpbb
  • curl -v --connect-to en.wikipedia.org:443:wikikube-worker2100.codfw.wmnet:4456 https://en.wikipedia.org/wiki/Special:BlankPage
  • httpbb --https_port 4456 /srv/deployment/httpbb-tests/appserver/* --hosts=10.192.15.17

Change #1159502 had a related patch set uploaded (by Effie Mouzeli; author: Effie Mouzeli):

[operations/puppet@production] site.pp: add wikikube-worker-exp(1001|2001)

https://gerrit.wikimedia.org/r/1159502

Change #1159518 had a related patch set uploaded (by Effie Mouzeli; author: Effie Mouzeli):

[operations/puppet@production] site.pp: make wikikube-worker-exp* k8s workers

https://gerrit.wikimedia.org/r/1159518

Change #1159502 merged by Effie Mouzeli:

[operations/puppet@production] site.pp: add wikikube-worker-exp(1001|2001)

https://gerrit.wikimedia.org/r/1159502

Change #1154069 merged by Effie Mouzeli:

[operations/puppet@production] x-wikimedia-debug-routing: add mw-experimental hosts

https://gerrit.wikimedia.org/r/1154069

Change #1154070 merged by jenkins-bot:

[operations/mediawiki-config@master] debug.json: add mw-experimental hosts

https://gerrit.wikimedia.org/r/1154070

Mentioned in SAL (#wikimedia-operations) [2025-06-17T14:46:49Z] <lucaswerkmeister-wmde@deploy1003> Started scap sync-world: Backport for [[gerrit:1160127|stats: Add buckets based on wikitext size; fix increment bug (T393400)]], [[gerrit:1154070|debug.json: add mw-experimental hosts (T276994)]]

Mentioned in SAL (#wikimedia-operations) [2025-06-17T14:49:02Z] <lucaswerkmeister-wmde@deploy1003> lucaswerkmeister-wmde, jiji, cscott: Backport for [[gerrit:1160127|stats: Add buckets based on wikitext size; fix increment bug (T393400)]], [[gerrit:1154070|debug.json: add mw-experimental hosts (T276994)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there.

Mentioned in SAL (#wikimedia-operations) [2025-06-17T15:02:48Z] <lucaswerkmeister-wmde@deploy1003> Finished scap sync-world: Backport for [[gerrit:1160127|stats: Add buckets based on wikitext size; fix increment bug (T393400)]], [[gerrit:1154070|debug.json: add mw-experimental hosts (T276994)]] (duration: 15m 59s)

Change #1159518 merged by Effie Mouzeli:

[operations/puppet@production] site.pp: make wikikube-worker-exp1001 a k8s worker

https://gerrit.wikimedia.org/r/1159518

Change #1160238 had a related patch set uploaded (by Effie Mouzeli; author: Effie Mouzeli):

[operations/puppet@production] site.pp: make wikikube-worker-exp2001 a k8s worker

https://gerrit.wikimedia.org/r/1160238

Update: this service is live and ready for testing on eqiad only: mw-experimental, after implementing most of T396767

Change #1160238 merged by Effie Mouzeli:

[operations/puppet@production] site.pp: make wikikube-worker-exp2001 a k8s worker

https://gerrit.wikimedia.org/r/1160238

jijiki claimed this task.