Problem
There are two maintenance tasks that are frequently required for Wiki Replicas:
- running the sre.wikireplicas.add-wiki cookbook (when a new wiki is created, see docs)
- running the sre.wikireplicas.update-views cookbook (when the view definitions are updated, see docs)
These tasks have no clear process around them, so they sometimes wait for weeks or months before somebody notices they need doing. In December 2024, this was discussed between @Marostegui and @fnegri, and @fnegri volunteered to take responsibility for them, but we should establish a process that does not rely on a single person.
An additional consideration is that both cookbooks currently apply changes to the clouddb* hosts (managed by cloud-services-team) and also to the an-redacteddb* host (managed by Data-Platform-SRE).
Constraints and risks
- running these tasks should not require any work from Data-Persistence
- in the WMCS team, only @fnegri currently knows the details of how these cookbooks work, the issues that can occur while running them, and how to run the cookbook steps manually if required.
- there is no clear "inbox" for requests to run the cookbooks, and running them is generally one step in a larger task. Creating such an "inbox" is not in scope for this decision request, but we should consider it after this task is resolved.
Decision record
Option 4 was selected. Implementation is pending (see sub-tasks).
Options
Option 1 (status quo)
@fnegri will run the cookbooks. When he's not around, someone from Data-Platform-SRE will have to step in.
Pros:
- No additional effort required from the WMCS team
Cons:
- Relies on a single person
- No knowledge sharing
- Could cause delays when @fnegri is not available
Option 2
The WMCS team member who is on clinic duty runs the cookbooks.
Only in case of issues do they reach out to @fnegri or, if he's not available, to Data-Platform-SRE.
Pros:
- Follows an established team process
Cons:
- Coordination needed with Data-Platform-SRE because the cookbook also updates the an-redacteddb1001 host.
Option 3
We ask Data-Platform-SRE to take full responsibility for running those cookbooks. Only in case of issues, they reach out to cloud-services-team.
Pros:
- This somewhat matches what was proposed in this table, under "Applying view changes".
Cons:
- Data Platform SREs own their dedicated wikireplica host (an-redacteddb*) but have little context about public-facing wikireplica hosts (clouddb*) and the users and tools relying on them.
Option 4
We add an option to the cookbooks to specify which hosts should be targeted, so that each team (cloud-services-team and Data-Platform-SRE) can run the cookbooks when it's most convenient, and target only the hosts they manage.
Pros:
- More isolation between the teams: we don't have to worry about impacting another team.
Cons:
- Potential lack of alignment between the views in clouddb* hosts and the views in an-redacteddb* hosts.
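As a rough illustration of Option 4, the host-targeting option could look something like the sketch below. This is not the actual cookbook code: the `--target` flag, the `HOST_GROUPS` mapping, the `select_hosts` helper and the group names are all hypothetical, and the real cookbooks may discover and filter hosts differently. Only the `clouddb*` and `an-redacteddb*` patterns come from this task.

```python
import argparse
import fnmatch

# Hypothetical mapping from owning team to a host name pattern.
# The patterns mirror the host names mentioned in this task.
HOST_GROUPS = {
    "cloud-services-team": "clouddb*",
    "data-platform-sre": "an-redacteddb*",
}


def build_parser():
    """Build an argument parser with a hypothetical --target option."""
    parser = argparse.ArgumentParser(description="Host targeting sketch")
    parser.add_argument(
        "--target",
        choices=[*HOST_GROUPS, "all"],
        default="all",
        help="Only apply changes to hosts managed by this team.",
    )
    return parser


def select_hosts(all_hosts, target):
    """Return the sorted subset of all_hosts that matches the target."""
    if target == "all":
        return sorted(all_hosts)
    pattern = HOST_GROUPS[target]
    return sorted(h for h in all_hosts if fnmatch.fnmatchcase(h, pattern))


if __name__ == "__main__":
    args = build_parser().parse_args()
    # Example host list; the clouddb names here are made up for illustration.
    hosts = ["clouddb-example-a", "clouddb-example-b", "an-redacteddb1001"]
    for host in select_hosts(hosts, args.target):
        print(f"would apply changes to {host}")
```

With a default of `all`, existing invocations would keep their current behaviour, while each team could pass its own group to limit the run to the hosts it manages.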