
changeprop-jobqueue password leaked on phabricator
Closed, ResolvedPublicSecurity

Description

While troubleshooting a failed deployment for T331616: changeprop configuration for notificationGetStartedJob and notificationKeepGoingJob, I inadvertently included a field labeled "password":

changeprop-jobqueue, changeprop-production, Service (v1) has changed:
  # Source: changeprop/templates/service.yaml
  apiVersion: v1
  kind: Service
  metadata:
    name: changeprop-production
    labels:
      app: changeprop
-     chart: changeprop-0.10.21
+     chart: changeprop-0.10.22
      release: production
      heritage: Helm
  spec:
    type: NodePort
    selector:
  [...]
                    port: 6379
                    password: {redacted}

This was in a comment on T331616, so subscribers to that task will have received an email with those contents. I deleted the comment within a few seconds of scanning the output and seeing password.

I'm sorry about that; I did not realize there was the potential for sensitive information to appear in the diff of a helm deploy.
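
For what it's worth, a minimal sketch of how a diff like this could be shared without the secret; --suppress-secrets comes from the helm-diff plugin that helmfile's diff relies on, and the environment name below is a placeholder rather than the one actually used:

helmfile -e codfw diff --suppress-secrets

# Fallback if that flag isn't available in the local setup: scrub
# secret-looking keys before pasting the output anywhere.
helmfile -e codfw diff | sed -E 's/(password|redis_pass[a-z_]*): .*/\1: [REDACTED]/'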

Details

Risk Rating
Medium
Author Affiliation
WMF Product

Event Timeline

sbassett changed Author Affiliation from N/A to WMF Technology Dept.
sbassett changed Risk Rating from N/A to Medium.
sbassett changed Author Affiliation from WMF Technology Dept to WMF Product.
akosiaris triaged this task as High priority. Edited Mar 20 2023, 4:35 PM
akosiaris added a subscriber: elukey.

Hi @kostajh, I'll be handling this one.

This is an internal service; however, the firewall is pretty lax (effectively anything in production can connect to these redises), so I'll prioritize this as high.

Drafting the following plan to roll over that password:

  1. Figure out usages. Apparently those are:
    • Netbox
    • Docker registry
    • api-gateway - via nutcracker
    • changeprop - via nutcracker
    • changeprop-jobqueue - via nutcracker
    • ORES
    • mwdebug - via nutcracker
    • redisLockManager
  2. Disable puppet in the cluster for ORES, docker-registry, mwdebug and redis_misc
  3. Merge password change in puppet private repo.
  4. Enable puppet on Pair 2 nodes (primaries and secondaries).
  5. Deploy changes for ORES (@elukey, this is gonna cause an outage) and docker-registry.
  6. Deploy nutcracker-enabled apps and check that they correctly fall back to the nodes with the new password (a verification sketch follows this list).
  7. Deploy changes for Netbox, redisLockManager
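
For illustration, a quick way step 6's fallback could be verified from a client host, assuming the Pair 2 primaries listed later in this task and placeholder OLD_PASS/NEW_PASS variables (the .wmnet FQDNs are assumed):

# Expect PONG with the new password and an auth error with the old one.
for host in rdb1011.eqiad.wmnet rdb2009.codfw.wmnet; do
    redis-cli -h "$host" -p 6379 -a "$NEW_PASS" ping
    redis-cli -h "$host" -p 6379 -a "$OLD_PASS" ping
done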

Hi @kostajh, I'll be handling this one.

Ack. I'm very sorry about this extra work for you all. If I can help clean up the mess somehow, please let me know. For now, I'll review and think about documentation and other types of warnings to help prevent this type of thing in the future.

@akosiaris I would like to take some time to review what we can do for redisLockManager in mediawiki; maybe we can first change the password on the replicas, move redisLockManager to those when we change the password in mediawiki-config as well, then invert the replication direction and make those the masters?

It's too late to make better proposals; maybe we can take a 10-minute outage of file uploads (which is what I expect unless we do such a dance).
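
Roughly, the dance sketched above would look something like the following per pair, using the eqiad Pair 1 hosts named later in this task as placeholders and glossing over the masterauth/config-rewrite details (as it turns out below, this wasn't needed):

# Once the replica has the new password and clients have moved to it:
redis-cli -h rdb1010.eqiad.wmnet -p 6379 -a "$NEW_PASS" replicaof no one
# ...then point the old master at the promoted node.
redis-cli -h rdb1009.eqiad.wmnet -p 6379 -a "$OLD_PASS" replicaof rdb1010.eqiad.wmnet 6379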

@akosiaris I would like to take some time to review what we can do for redisLockManager in mediawiki; maybe we can first change the password on the replicas, move redisLockManager to those when we change the password in mediawiki-config as well, then invert the replication direction and make those the masters?

Not needed apparently. redisLockManager is split across the 2 pairs (2 instances on Pair 1, 1 on Pair 2). With @jijiki, on the last round of upgrades and rdb reboots we figured out that it's not so brittle after all, since we rebooted Pair 1 and saw no issues.

It's too late to make better proposals; maybe we can take a 10-minute outage of file uploads (which is what I expect unless we do such a dance).

Yeah, I think we can survive with that.

akosiaris added a subscriber: Krinkle.

I think I can successfully resolve this task. The aftermath of the emergency rollover was more or less:

  • A small outage for ORES in codfw (it was using the wrong pair of redis misc nodes)
  • A hiccup in changeprop; it took a while for the backlog to get served
  • A small window, while running scap sync-file, where redisLockManager wasn't able to authenticate to any redis (design issue; not easy to fix during an emergency rollover, so we had to bite the bullet)

Do we make this public? I'll answer that tomorrow.

Adding a few more people for their information, as services they own were impacted.

I am really sorry, but I made the mistake today when I was asked to provide a phaste of changeprop's helmfile diff; I didn't really think about the password :(

Draft of the procedure that I should follow to redo what Alexandros already did a couple of days ago:

  • Familiarize with the Redis nodes:
rdb10(09|11) ==> eqiad masters
rdb101[02]   ==> eqiad replicas
rdb200[79]   ==> codfw masters
rdb20(08|10) ==> codfw replicas

Pair 1: masters rdb1009 and rdb2007; replicas rdb1010 and rdb2008
Pair 2: masters rdb1011 and rdb2009; replicas rdb1012 and rdb2010
  • Disable puppet on target nodes (non-k8s deployments):
cumin 'A:netbox or A:docker-registry or A:ores or A:redis-misc-master or A:redis-misc-slave' 'disable-puppet "elukey - T332598"'
  • Stage the Redis password change in the puppet private repository, and ask for a review before merging.
  • Merge the code change in puppet private.
  • Run puppet on deploy2002 to make sure that the helmfile private files are updated, and check with a simple helmfile diff for changeprop that the new Redis password is indeed picked up.
  • Update the Pair 2 redis hosts (master and replicas):
# masters first
cumin -m async 'rdb1011* or rdb2009*' 'enable-puppet "elukey - T332598"' 'run-puppet-agent' 

# wait a couple of mins then
cumin -m async 'rdb1012* or rdb2010*' 'enable-puppet "elukey - T332598"' 'run-puppet-agent'
  • Run puppet on ORES nodes and Docker registry nodes:
cumin 'A:docker-registry or A:ores' 'enable-puppet "elukey - T332598"' 'run-puppet-agent'
  • Deploy api-gateway, changeprop and changeprop-jobqueue. At this point you should see the nodes updated above being used, and not the ones not yet updated (because of nutcracker).
  • Go to deploy2002, cd into /srv/mediawiki-staging/private and change the Redis password in PrivateSettings.php. Commit as with the puppet private repo, and then run scap sync-file private/PrivateSettings.php (see the sketch after this list).

From this moment, mediawiki may have some trouble with file uploads since we haven't updated the Pair 1 redises yet, so please go to the next step asap.

  • Now we move to Pair 1 Redises, where some clients do have nutcracker failover (on the client side):
# masters first
cumin -m async 'rdb1009* or rdb2007*' 'enable-puppet "elukey - T332598"' 'run-puppet-agent'

# then after a couple of mins
cumin -m async 'rdb1010* or rdb2008*' 'enable-puppet "elukey - T332598"' 'run-puppet-agent'
  • Run puppet on Netbox nodes:
cumin 'A:netbox' 'enable-puppet "elukey - T332598"' 'run-puppet-agent'
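
Sketch of the PrivateSettings.php step referenced above; the paths are the ones given in that step, while the editor, commit message and scap log message are illustrative (and the FQDN suffix is assumed):

ssh deploy2002.codfw.wmnet
cd /srv/mediawiki-staging
$EDITOR private/PrivateSettings.php            # update the Redis password value
git -C private add PrivateSettings.php
git -C private commit -m "Rotate misc Redis password (T332598)"
scap sync-file private/PrivateSettings.php 'Rotate misc Redis password - T332598'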

Overall, a pretty good write-up of my actions on Monday.

Minor notes:

cumin -m async 'rdb1011* or rdb2009* or rdb1012* or rdb2010*' 'enable-puppet "elukey - T332598"' 'run-puppet-agent'

Split this into 2 groups: masters first, replicas a little bit later. The reason is that replicas fetch the password via a puppetdb query and thus require a puppet run on the masters first (and yes, kinda ew). On the plus side, you don't even need the forced puppet run on the replicas; nothing uses them.

After this you'll probably need to roll restart the redis instances on them to pick up the new password (to be verified with Alex).

No need for this one; once puppet runs, the redises will be restarted (and the prom exporters too).

Run puppet on ORES nodes and Docker registry nodes:

Docker registry will only issue cache misses that will be slightly visible in a grafana graph. No other consequence.

ORES will suffer a small meltdown. To be completely sure of the restart I embedded 'run-puppet-agent ; systemctl restart ores-uwsgi ; systemctl restart celery-ores-worker'. And yes, the inconsistency of where "ores" is placed in the name of those 2 services is killing me.
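
For completeness, as a single cumin invocation that embedded command would look roughly like this (alias and log message per the draft above; any batching/sleep flags actually used are omitted):

cumin 'A:ores' 'run-puppet-agent ; systemctl restart ores-uwsgi ; systemctl restart celery-ores-worker'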

Now we move to Pair 1 Redises, where some clients do have nutcracker failover (on the client side):

No, we first deploy the things with nutcracker (api-gateway, changeprop, changeprop-jobqueue). The reason for that is that nutcracker will have already ditched the Pair 2 redises due to inability to auth; we don't want it to also ditch Pair 1 while it's running.

Now we can go for enabling puppet on Pair 1 redises.
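
If you want to watch nutcracker doing the ditching/re-adding during this window, twemproxy exposes JSON stats (by default on port 22222); the pool and field names depend on the local config, so treat this as a sketch:

# Per-backend counters such as server_err / server_ejected_at show which
# redises nutcracker has ejected after failing to authenticate.
nc localhost 22222 | jq '.'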

PrivateSettings.php - how should it be updated/deployed?

It's important to stage the change in deploy2002:/srv/mediawiki-staging/private and, once puppet has successfully run on Pair 1, then run scap sync-file private/PrivateSettings.php.

Thanks a lot for the feedback, I have updated my plan accordingly :)

Some questions/comments:

  1. Netbox: does it require a uwsgi restart, or does puppet take care of it already?
  2. As this has happened 2 times already, will T332766 prevent the password from showing up in the diff in the first place? I don't think that a warning like the one mentioned in T332817 would be enough; we should prevent the secret from showing up in the diff at all.
  3. Is there a way to do the same while decoupling the password change from the application deployments? For example, having a rotating user so that each application can rotate to the new one independently, or multiple passwords for the same user.

Some questions/comments:

  1. Netbox: does it require a uwsgi restart, or does puppet take care of it already?

It will be the last step, so we can check puppet's output and roll-restart if needed (I'll ping you).

  2. As this has happened 2 times already, will T332766 prevent the password from showing up in the diff in the first place? I don't think that a warning like the one mentioned in T332817 would be enough; we should prevent the secret from showing up in the diff at all.

In theory yes, but we should also verify whether we use it elsewhere, etc. We may need to prioritize that task, though. I am also planning to send an email to ops@ after the rollout completes, so more people are aware of the dangers of playing with helmfile.

  3. Is there a way to do the same while decoupling the password change from the application deployments? For example, having a rotating user so that each application can rotate to the new one independently, or multiple passwords for the same user.

Multiple passwords for the same user would be ideal, and would make for way less painful upgrades for sure, but it is not supported by the current version of Redis :( Rotating users may be a good idea, but it requires some planning/tasking/etc. to figure out how best to roll it out. Not sure if having different users would help us in this case; the killer feature would definitely be multiple passwords for the same user.
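
As a side note, Redis 6.0's ACL system does allow multiple passwords per user, which is exactly the two-phase rotation described above; the instances discussed here are on an older version (per the comment above), so this is only a sketch with placeholder secrets (redis-cli's own auth omitted):

redis-cli ACL SETUSER default '>new-secret'    # add the new password alongside the old
# ...switch every client over to new-secret...
redis-cli ACL SETUSER default '<old-secret'    # then drop the old password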

  2. As this has happened 2 times already, will T332766 prevent the password from showing up in the diff in the first place? I don't think that a warning like the one mentioned in T332817 would be enough; we should prevent the secret from showing up in the diff at all.

A complicating factor here is nutcracker - as far as I know it can't ingest the value from environment variables, so we might be stuck with it being exposed in nutcracker alone. T333019 has been filed as part of the process of getting away from it, but it will take time.

New Redis password rolled out today :)

sbassett changed the visibility from "Custom Policy" to "Public (No Login Required)". Mar 29 2023, 3:10 PM
sbassett changed the edit policy from "Custom Policy" to "All Users".