
changeprop-jobqueue password leaked on phabricator
Closed, ResolvedPublicSecurity

Description

While troubleshooting a failed deployment for T331616: changeprop configuration for notificationGetStartedJob and notificationKeepGoingJob, I inadvertently included a field labeled "password":

changeprop-jobqueue, changeprop-production, Service (v1) has changed:
  # Source: changeprop/templates/service.yaml
  apiVersion: v1
  kind: Service
  metadata:
    name: changeprop-production
    labels:
      app: changeprop
-     chart: changeprop-0.10.21
+     chart: changeprop-0.10.22
      release: production
      heritage: Helm
  spec:
    type: NodePort
    selector:
  [...]
                    port: 6379
                    password: {redacted}

This was in a comment on T331616, so subscribers to that task will have received an email with those contents. I deleted the comment within a few seconds of scanning the output and seeing password.

I'm sorry about that; I did not realize there was the potential for sensitive information to appear in the diff of a helm deploy.
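
For what it's worth, a minimal sketch of how a diff like this could be shared without the secret; --suppress-secrets comes from the helm-diff plugin that helmfile's diff relies on, and the environment name below is a placeholder rather than the one actually used:

helmfile -e codfw diff --suppress-secrets

# Fallback if that flag isn't available in the local setup: scrub
# secret-looking keys before pasting the output anywhere.
helmfile -e codfw diff | sed -E 's/(password|redis_pass[a-z_]*): .*/\1: [REDACTED]/'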

Details

Risk Rating
Medium
Author Affiliation
WMF Product

Event Timeline

sbassett changed Author Affiliation from N/A to WMF Technology Dept.
sbassett changed Risk Rating from N/A to Medium.
sbassett changed Author Affiliation from WMF Technology Dept to WMF Product.
akosiaris triaged this task as High priority. Edited Mar 20 2023, 4:35 PM
akosiaris added a subscriber: elukey.

Hi @kostajh, I'll be handling this one.

This is an internal service; however, the firewall is pretty lax (effectively anything in production can connect to these redises), so I'll prioritize this as high.

Drafting the following plan to roll over that password:

  1. Figure out usages. Apparently those are:
    • Netbox
    • Docker registry
    • api-gateway - via nutcracker
    • changeprop - via nutcracker
    • changeprop-jobqueue - via nutcracker
    • ORES
    • mwdebug - via nutcracker
    • redisLockManager
  2. Disable puppet in the cluster for ORES, docker-registry, mwdebug and redis_misc
  3. Merge password change in puppet private repo.
  4. Enable puppet on Pair 2 nodes (primaries and secondaries).
  5. Deploy changes for ORES (@elukey, this is gonna cause an outage) and docker-registry.
  6. Deploy nutcracker-enabled apps and check that they correctly fall back to the nodes with the new password (a verification sketch follows this list).
  7. Deploy changes for Netbox, redisLockManager
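
For illustration, a quick way step 6's fallback could be verified from a client host, assuming the Pair 2 primaries listed later in this task and placeholder OLD_PASS/NEW_PASS variables (the .wmnet FQDNs are assumed):

# Expect PONG with the new password and an auth error with the old one.
for host in rdb1011.eqiad.wmnet rdb2009.codfw.wmnet; do
    redis-cli -h "$host" -p 6379 -a "$NEW_PASS" ping
    redis-cli -h "$host" -p 6379 -a "$OLD_PASS" ping
done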

Hi @kostajh, I'll be handling this one.

Ack. I'm very sorry about this extra work for you all. If I can help clean up the mess somehow, please let me know. For now, I'll review and think about documentation and other types of warnings to help prevent this type of thing in the future.

@akosiaris I would like to take some time to review what we can do for redisLockManager in mediawiki; maybe we can first change the password on the replicas, move redisLockManager to those when we change the password in mediawiki-config as well, then invert the replication direction and make those the masters?

It's too late to make better proposals; maybe we can take a 10-minute outage of file uploads (which is what I expect unless we do such a dance).
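
Roughly, the dance sketched above would look something like the following per pair, using the eqiad Pair 1 hosts named later in this task as placeholders and glossing over the masterauth/config-rewrite details (as it turns out below, this wasn't needed):

# Once the replica has the new password and clients have moved to it:
redis-cli -h rdb1010.eqiad.wmnet -p 6379 -a "$NEW_PASS" replicaof no one
# ...then point the old master at the promoted node.
redis-cli -h rdb1009.eqiad.wmnet -p 6379 -a "$OLD_PASS" replicaof rdb1010.eqiad.wmnet 6379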

@akosiaris I would like to take some time to review what we can do for redisLockManager in mediawiki; maybe we can first change the password on the replicas, move redisLockManager to those when we change the password in mediawiki-config as well, then invert the replication direction and make those the masters?

Not needed apparently. redisLockManager is split across the 2 pairs (2 instances on Pair 1, 1 on Pair 2). With @jijiki, on the last round of upgrades and rdb reboots we figured out that it's not so brittle after all, since we rebooted Pair 1 and saw no issues.

It's too late to make better proposals; maybe we can take a 10-minute outage of file uploads (which is what I expect unless we do such a dance).

Yeah, I think we can survive with that.

akosiaris added a subscriber: Krinkle.

I think I can successfully resolve this task. The aftermath of the emergency rollover was more or less:

  • A small outage for ORES in codfw (it was using the wrong pair of redis misc nodes)
  • A hiccup in changeprop; it took a while for the backlog to get served
  • A small window, while running scap sync-file, where redisLockManager wasn't able to authenticate to any redis (design issue; not easy to fix during an emergency rollover, so we had to bite the bullet)

Do we make this public? I'll answer that tomorrow.

Adding a few more people for their information, as services they own were impacted.

I am really sorry, but I made the mistake today when I was asked to provide a phaste of changeprop's helmfile diff; I didn't really think about the password :(

Draft of the procedure that I should follow to redo what Alexandros already did a couple of days ago:

  • Familiarize with the Redis nodes:
rdb10(09|11) ==> eqiad masters
rdb101[02]   ==> eqiad replicas
rdb200[79]   ==> codfw masters
rdb20(08|10) ==> codfw replicas

Pair 1: masters rdb1009 and rdb2007; replicas rdb1010 and rdb2008
Pair 2: masters rdb1011 and rdb2009; replicas rdb1012 and rdb2010
  • Disable puppet on target nodes (non-k8s deployments):
cumin 'A:netbox or A:docker-registry or A:ores or A:redis-misc-master or A:redis-misc-slave' 'disable-puppet "elukey - T332598"'
  • Stage the Redis password change in the puppet private repository, and ask for a review before merging.
  • Merge the code change in puppet private.
  • Run puppet on deploy2002 to make sure that the helmfile private files are updated, and check with a simple helmfile diff for changeprop that the new Redis password is indeed picked up.
  • Update the Pair 2 redis hosts (master and replicas):
# masters first
cumin -m async 'rdb1011* or rdb2009*' 'enable-puppet "elukey - T332598"' 'run-puppet-agent' 

# wait a couple of mins then
cumin -m async 'rdb1012* or rdb2010*' 'enable-puppet "elukey - T332598"' 'run-puppet-agent'
  • Run puppet on ORES nodes and Docker registry nodes:
cumin 'A:docker-registry or A:ores' 'enable-puppet "elukey - T332598"' 'run-puppet-agent'
  • Deploy api-gateway, changeprop and changeprop-jobqueue. At this point you should see the nodes updated above being used, and not the ones not yet updated (because of nutcracker).
  • Go to deploy2002, cd into /srv/mediawiki-staging/private and change the Redis password in PrivateSettings.php. Commit as with the puppet private repo, and then run scap sync-file private/PrivateSettings.php (see the sketch after this list).

From this moment, mediawiki may have some trouble with file uploads since we haven't updated the Pair 1 redises yet, so please go to the next step asap.

  • Now we move to Pair 1 Redises, where some clients do have nutcracker failover (on the client side):
# masters first
cumin -m async 'rdb1009* or rdb2007*' 'enable-puppet "elukey - T332598"' 'run-puppet-agent'

# then after a couple of mins
cumin -m async 'rdb1010* or rdb2008*' 'enable-puppet "elukey - T332598"' 'run-puppet-agent'
  • Run puppet on Netbox nodes:
cumin 'A:netbox' 'enable-puppet "elukey - T332598"' 'run-puppet-agent'
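
Sketch of the PrivateSettings.php step referenced above; the paths are the ones given in that step, while the editor, commit message and scap log message are illustrative (and the FQDN suffix is assumed):

ssh deploy2002.codfw.wmnet
cd /srv/mediawiki-staging
$EDITOR private/PrivateSettings.php            # update the Redis password value
git -C private add PrivateSettings.php
git -C private commit -m "Rotate misc Redis password (T332598)"
scap sync-file private/PrivateSettings.php 'Rotate misc Redis password - T332598'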

Overall, a pretty good write-up of my actions on Monday.

Minor notes:

cumin -m async 'rdb1011* or rdb2009* or rdb1012* or rdb2010*' 'enable-puppet "elukey - T332598"' 'run-puppet-agent'

Split this into 2 groups: masters first, replicas a little bit later. The reason is that replicas fetch the password via a puppetdb query and thus require a puppet run on the masters first (and yes, kinda ew). On the plus side, you don't even need the forced puppet run on the replicas; nothing uses them.

After this you'll probably need to roll restart the redis instances on them to pick up the new password (to be verified with Alex).

No need for this one; once puppet runs, the redises will be restarted (and the prom exporters too).

Run puppet on ORES nodes and Docker registry nodes:

Docker registry will only issue cache misses that will be slightly visible in a grafana graph. No other consequence.

ORES will suffer a small meltdown. To be completely sure of the restart I embedded 'run-puppet-agent ; systemctl restart ores-uwsgi ; systemctl restart celery-ores-worker'. And yes, the inconsistency of where "ores" is placed in the name of those 2 services is killing me.
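
For completeness, as a single cumin invocation that embedded command would look roughly like this (alias and log message per the draft above; any batching/sleep flags actually used are omitted):

cumin 'A:ores' 'run-puppet-agent ; systemctl restart ores-uwsgi ; systemctl restart celery-ores-worker'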

Now we move to Pair 1 Redises, where some clients do have nutcracker failover (on the client side):

No, we first deploy the things with nutcracker (api-gateway, changeprop, changeprop-jobqueue). The reason for that is that nutcracker will have already ditched the Pair 2 redises due to inability to auth; we don't want it to also ditch Pair 1 while it's running.

Now we can go for enabling puppet on Pair 1 redises.
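
If you want to watch nutcracker doing the ditching/re-adding during this window, twemproxy exposes JSON stats (by default on port 22222); the pool and field names depend on the local config, so treat this as a sketch:

# Per-backend counters such as server_err / server_ejected_at show which
# redises nutcracker has ejected after failing to authenticate.
nc localhost 22222 | jq '.'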

PrivateSettings.php - how should it be updated/deployed?

It's important to stage the change in deploy2002:/srv/mediawiki-staging/private and, once puppet has successfully run on Pair 1, then run scap sync-file private/PrivateSettings.php.

Thanks a lot for the feedback, I have updated my plan accordingly :)

Some questions/comments:

  1. Netbox: does it require a uwsgi restart, or does puppet take care of it already?
  2. As this has happened 2 times already, will T332766 prevent the password from showing up in the diff in the first place? I don't think that a warning like the one mentioned in T332817 would be enough; we should prevent the secret from showing up in the diff at all.
  3. Is there a way to do the same while decoupling the password change from the application deployments? For example, having a rotating user so that each application can rotate to the new one independently, or multiple passwords for the same user.

Some questions/comments:

  1. Netbox: does it require a uwsgi restart, or does puppet take care of it already?

It will be the last step, so we can check puppet's output and roll-restart if needed (I'll ping you).

  2. As this has happened 2 times already, will T332766 prevent the password from showing up in the diff in the first place? I don't think that a warning like the one mentioned in T332817 would be enough; we should prevent the secret from showing up in the diff at all.

In theory yes, but we should also verify whether we use it elsewhere, etc. We may need to prioritize that task, though. I am also planning to send an email to ops@ after the rollout completes, so more people are aware of the dangers of playing with helmfile.

  3. Is there a way to do the same while decoupling the password change from the application deployments? For example, having a rotating user so that each application can rotate to the new one independently, or multiple passwords for the same user.

Multiple passwords for the same user would be ideal, and would make for way less painful upgrades for sure, but it is not supported by the current version of Redis :( Rotating users may be a good idea, but it requires some planning/tasking/etc. to figure out how best to roll it out. Not sure if having different users would help us in this case; the killer feature would definitely be multiple passwords for the same user.
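
As a side note, Redis 6.0's ACL system does allow multiple passwords per user, which is exactly the two-phase rotation described above; the instances discussed here are on an older version (per the comment above), so this is only a sketch with placeholder secrets (redis-cli's own auth omitted):

redis-cli ACL SETUSER default '>new-secret'    # add the new password alongside the old
# ...switch every client over to new-secret...
redis-cli ACL SETUSER default '<old-secret'    # then drop the old password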

  2. As this has happened 2 times already, will T332766 prevent the password from showing up in the diff in the first place? I don't think that a warning like the one mentioned in T332817 would be enough; we should prevent the secret from showing up in the diff at all.

A complicating factor here is nutcracker - as far as I know it can't ingest the value from environment variables, so we might be stuck with it being exposed in nutcracker alone. T333019 has been filed as part of the process of getting away from it, but it will take time.

New Redis password rolled out today :)

sbassett changed the visibility from "Custom Policy" to "Public (No Login Required)". Mar 29 2023, 3:10 PM
sbassett changed the edit policy from "Custom Policy" to "All Users".