Decision request - Who runs wikireplicas cookbooks
Open, In Progress, Medium, Public

Description

Problem

There are two maintenance tasks that are frequently required for Wiki Replicas:

  • running the sre.wikireplicas.add-wiki cookbook (when a new wiki is created, see docs)
  • running the sre.wikireplicas.update-views cookbook (when the view definitions are updated, see docs)

These tasks have no clear process around them, so they sometimes wait weeks or months before somebody notices they need doing. In December 2024, @Marostegui and @fnegri discussed this, and @fnegri volunteered to take responsibility for both tasks, but we should establish a process that does not rely on a single person.

An additional thing to consider is that both cookbooks at the moment apply changes to clouddb* hosts (managed by cloud-services-team) but also to the an-redacteddb* host (managed by Data-Platform-SRE).

Constraints and risks

  • running these tasks should not require any work from Data-Persistence
  • in the WMCS team, only @fnegri currently knows the details of how these cookbooks work, the issues that can occur while running them, and how to run the cookbook steps manually if required.
  • there is no clear "inbox" for requests to run the cookbooks, and running them is generally one step in a larger task. Creating such an "inbox" is not in scope for this decision request, but we should consider it after this task is resolved.

Decision record

Option 4 was selected. Implementation is pending (see sub-tasks).

https://wikitech.wikimedia.org/wiki/Wikimedia_Cloud_Services_team/EnhancementProposals/Decision_record_T382607_Who_runs_wikireplicas_cookbooks

Options

Option 1 (status quo)

@fnegri will run the cookbooks. When he's not around, someone from Data-Platform-SRE will have to step in.

Pros:

  • No additional effort required from the WMCS team

Cons:

  • Relies on a single person
  • No knowledge sharing
  • Could cause delays when @fnegri is not available

Option 2

The WMCS team member who is on clinic duty runs the cookbooks.

Only in case of issues, they reach out to @fnegri or, if he's not available, to Data-Platform-SRE.

Pros:

  • Follows an established team process

Cons:

  • Coordination needed with Data-Platform-SRE because the cookbook also updates the an-redacteddb1001 host.

Option 3

We ask Data-Platform-SRE to take full responsibility for running those cookbooks. Only in case of issues, they reach out to cloud-services-team.

Pros:

  • This somewhat matches what was proposed in this table, under "Applying view changes".

Cons:

  • Data Platform SREs own their dedicated wikireplica host (an-redacteddb*) but have little context about public-facing wikireplica hosts (clouddb*) and the users and tools relying on them.

Option 4

We add an option to the cookbooks to specify which hosts should be targeted, so that each team (cloud-services-team and Data-Platform-SRE) can run the cookbooks when it's most convenient, and target only the hosts they manage.

Pros:

  • More isolation between the teams: we don't have to worry about impacting another team.

Cons:

  • Potential lack of alignment between the views in clouddb* hosts and the views in an-redacteddb* hosts.
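
A minimal sketch of what the host-targeting option in Option 4 could look like. This is not the actual cookbook code: the `--target` flag, the group names, and the helper functions are hypothetical, and only the `clouddb*` and `an-redacteddb*` host patterns come from this task. The `all` choice is an assumption that keeps the current "update everything" behaviour available for coordinated deployments.

```python
import argparse
from typing import List, Optional, Sequence

# Hypothetical group names mapped to the host patterns mentioned in this task.
HOST_GROUPS = {
    "clouddb": "clouddb*",              # managed by cloud-services-team
    "an-redacteddb": "an-redacteddb*",  # managed by Data-Platform-SRE
}


def parse_args(argv: Optional[Sequence[str]] = None) -> argparse.Namespace:
    """Parse the (hypothetical) --target option."""
    parser = argparse.ArgumentParser(
        description="Sketch: choose which Wiki Replicas hosts to update"
    )
    parser.add_argument(
        "--target",
        choices=[*HOST_GROUPS, "all"],
        required=True,
        help="Host group to update; 'all' selects every group",
    )
    return parser.parse_args(argv)


def select_patterns(target: str) -> List[str]:
    """Return the host patterns that a run with this target would touch."""
    if target == "all":
        return list(HOST_GROUPS.values())
    return [HOST_GROUPS[target]]
```

Making the option required means nobody updates another team's hosts by accident: a WMCS run would pass `--target clouddb`, and a Data-Platform-SRE run would pass `--target an-redacteddb`.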

Event Timeline

fnegri updated the task description.

Option #1 and option #2 are probably not something we in Data-Persistence would be comfortable with. Per the shared responsibilities document, we own the previous step (preparing everything for it), so owning also this step would be a significant change in our workflow.

@Marostegui I removed mentions of Data-Persistence from options #1 and #2, and added a constraint that whichever option we choose, it should not require any work from Data-Persistence.

My preference is for option 4, but I think options 2 and 3 are also viable.

I like option #4 too, but ideally hosts and scripts should be idempotent and should be able to run all at once to avoid that misalignment. I think we need Data-Platform-SRE to chime in here too.

running these tasks should not require any work from Data-Persistence

I guess this restriction was agreed upon? If not I'd like to rethink it, as afaik the cookbook does not really do changes in the infra, but just the replicated DBs themselves (and that's what Data-Persistence knows best, as in what the change is, what's expected to look like, how to test it, debug it, ...)
If it was agreed then that's ok :)

Potential lack of alignment between the views in clouddb* hosts and the views in an-redacteddb* hosts.

Can you elaborate a bit more on the consequences of this?
If it's not very troublesome, then 4 is my preference (especially when the "inbox" issue is resolved).

Note that I'm not very familiar with the potential issues that can arise from the cookbook. If the issues are commonly forwarded to Data-Persistence, then I'm ok with 2 and 3 too (as whoever runs them will have to do the same thing no matter where the cookbook fails).

running these tasks should not require any work from Data-Persistence

I guess this restriction was agreed upon? If not I'd like to rethink it, as afaik the cookbook does not really do changes in the infra, but just the replicated DBs themselves (and that's what Data-Persistence knows best, as in what the change is, what's expected to look like, how to test it, debug it, ...)
If it was agreed then that's ok :)

Those cookbooks are for the views, which is something we've never touched, designed or maintained. We do all the stuff that is required to the point where those cookbooks can do their magic.

Ack, thanks!
I think I confused Data Engineering with Data Persistence (from the table https://wikitech.wikimedia.org/wiki/Portal:Data_Services/Admin/Wiki_Replicas#Who_admins_what). Are they involved in this task? (Should they be?)

They are involved because they own an-redacteddb hosts which have views and they are installed via the same cookbook.

Isn't that Data Platform SRE? Are those the same team? (too many team renamings going around xd)

I think I am also confused with all the renaming - I don't know anymore!

My understanding:

Data-Engineering are responsible for the views definition, which columns should be redacted etc. ("View creation / updates" in the table).

Data-Platform-SRE are a separate team, and they own the an-redacteddb host, so they need to make sure the add-wiki and update-views cookbooks are run periodically on that host (either by them, or by WMCS).

I'm not very familiar with the potential issues that can arise from the cookbook, if the issues are commonly forwarded to Data-Persistence

The last time we had some issues running the cookbook, I addressed them with help from Data-Platform-SRE (@BTullis). I don't think Data-Persistence was involved, excluding maybe some quick questions/suggestions on IRC.

Thanks! That's helpful, then yep, I think option 4 sounds best to me, with 2 and 3 being second.

fnegri changed the task status from Open to In Progress. Jan 15 2025, 3:05 PM

#3 feels like the right solution to me, but I could understand it if they don't have the bandwidth to take on the extra work.

Could we have some kind of hybrid between #3 and #4 where wmcs builds/maintains the cookbooks but the job of running the cookbooks (and knowing when to run them) is the responsibility of Data-Platform? (This proposal is not really different from what's written as #4 except for the clear assignment of awareness after the fact.)

aborrero changed the task status from In Progress to Stalled. Mar 20 2025, 3:09 PM
aborrero subscribed.

Stall: pending a meeting or chat with other potential owners.

fnegri changed the task status from Stalled to In Progress. Apr 16 2025, 4:49 PM
fnegri added a subscriber: taavi.

This ticket has been stalled for a while, I'll try to resurrect it as I would really like to get it to some form of resolution.

I'm also cc-ing @taavi who has just rejoined the WMCS team and has also worked on improving these cookbooks.

So far we have three votes for option #4 (myself, @dcaro and @Marostegui) and one vote for option #3 (@Andrew).

Before declaring #4 the preferred solution, I will try to answer a few open questions from the comments above.

Could we have some kind of hybrid between #3 and #4 where wmcs builds/maintains the cookbooks

I agree we should clarify who is the maintainer of the cookbooks (the actual Python code), and I'm fine with them being officially maintained by WMCS (others are always welcome to submit patches).

but the job of running the cookbooks (and knowing when to run them) is the responsibility of Data-Platform?

I still think that we should modify the cookbooks so that WMCS can run them for clouddbs, and Data Platform SRE can run them for an-redacteddb. In this way we enforce a proper separation of ownership for these servers: WMCS will run the cookbook for clouddbs whenever they like and will not risk breaking an-redacteddb servers at unexpected times, and vice versa Data Platform SRE will not risk breaking clouddb servers at unexpected times.

I like option #4 too, but ideally hosts and scripts should be idempotent and should be able to run all at once to avoid that misalignment.

To address the risk of misalignment, my preferred way would be to have a timer that runs a dry-run of the cookbook (or the underlying script) every few hours and triggers an alert or an email if there is any pending action, so that somebody can manually run the cookbooks. In the future, we might even automate this step if we're confident the cookbooks are working fine 99% of the time, but I think this is not going to happen too soon.
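
As a rough sketch of the dry-run check described above, the comparison could look something like the following. Everything here is an assumption for illustration: how the real cookbooks detect pending work is not described in this task, so the sketch simply compares two hypothetical mappings from wiki database name to a view-definition version, and the names `PendingAction`, `find_pending_actions`, and `format_alert` are invented. A timer would run the check every few hours and send the alert text only when it is non-empty.

```python
from dataclasses import dataclass
from typing import Dict, List


@dataclass(frozen=True)
class PendingAction:
    kind: str  # "add-wiki" or "update-views"
    wiki: str  # wiki database name


def find_pending_actions(
    expected: Dict[str, str], deployed: Dict[str, str]
) -> List[PendingAction]:
    """Compare expected and deployed view versions without changing anything.

    Both arguments map a wiki database name to a view-definition version
    (hypothetical data model, not the cookbooks' real one).
    """
    pending = []
    for wiki, version in sorted(expected.items()):
        if wiki not in deployed:
            # The wiki exists but has no views on this host yet.
            pending.append(PendingAction("add-wiki", wiki))
        elif deployed[wiki] != version:
            # Views exist but were built from an older definition.
            pending.append(PendingAction("update-views", wiki))
    return pending


def format_alert(host: str, pending: List[PendingAction]) -> str:
    """Return the alert text, or an empty string when nothing is pending."""
    if not pending:
        return ""
    lines = [f"{host}: {len(pending)} pending wikireplicas action(s)"]
    lines += [f"  {action.kind}: {action.wiki}" for action in pending]
    return "\n".join(lines)
```

Because the check only reads state and reports, it can run per host, so each team would be notified only about the hosts it manages.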

I think we need Data-Platform-SRE to chime in here too.

I would also like to get a +1 from Data-Platform-SRE to the approach described in option #4, because that would mean that WMCS will stop doing any kind of update to an-redacteddb. This should not be much extra work for them, and has the benefit they would have full control on when the cookbooks are run on their server.

Thanks @fnegri for all of your hard work on this issue and apologies for the delay in getting back to you.

I still think that we should modify the cookbooks so that WMCS can run them for clouddbs, and Data Platform SRE can run them for an-redacteddb. In this way we enforce a proper separation of ownership for these servers: WMCS will run the cookbook for clouddbs whenever they like and will not risk breaking an-redacteddb servers at unexpected times, and vice versa Data Platform SRE will not risk breaking clouddb servers at unexpected times.

I like option #4 too, but ideally hosts and scripts should be idempotent and should be able to run all at once to avoid that misalignment.

To address the risk of misalignment, my preferred way would be to have a timer that runs a dry-run of the cookbook (or the underlying script) every few hours and triggers an alert or an email if there is any pending action, so that somebody can manually run the cookbooks. In the future, we might even automate this step if we're confident the cookbooks are working fine 99% of the time, but I think this is not going to happen too soon.

Yes, I am happy with this approach, too. Let's implement the same dry-run script to notify Data-Platform-SRE as well, for any pending changes to an-redacteddb1001.

I think we need Data-Platform-SRE to chime in here too.

I would also like to get a +1 from Data-Platform-SRE to the approach described in option #4, because that would mean that WMCS will stop doing any kind of update to an-redacteddb. This should not be much extra work for them, and has the benefit they would have full control on when the cookbooks are run on their server.

That is a +1 from me to option #4.

We can still add a switch to the cookbook to apply to all servers if we wish. That way, we would not be prevented from keeping an-redacteddb1001 in sync with the clouddb* servers if there is a particularly big change and we are collaborating on deployment.

For a bit of background, I know that the Data-Engineering team has a potential project that would make an-redacteddb1001 obsolete. However, it has not been prioritized yet, which is why we had to refresh the old host (clouddb1021) with an-redacteddb1001. So while we still need that host for now, perhaps it won't be around to be a problem for all that long. I'll speak to @JAllemandou and others about how realistic this is in the next quarter or two.

Change #1206196 had a related patch set uploaded (by Btullis; author: Btullis):

[operations/puppet@production] Remove the Data Platform SRE team from the contactgroup for wikireplicas

https://gerrit.wikimedia.org/r/1206196

Change #1206196 merged by Btullis:

[operations/puppet@production] Remove the Data Platform SRE team from the contactgroup for wikireplicas

https://gerrit.wikimedia.org/r/1206196