
Create VictorOps config for new Data Platform SRE team
Closed, ResolvedPublic

Description

Hello O11y!

Per an IRC conversation with @herron, we're onboarding the new Data Platform SRE team into Alertmanager and we need help creating the VictorOps config.

Thanks for your time and please let us know if you need additional info.

Event Timeline

Hey @bking, thanks for the task! Could you please point me towards the current team roster in order to get the ball rolling for team/account creations?

I'm wondering whether it would be a good idea to re-use the existing Data Engineering team and analytics routing key, or make a new one. The reason I mention it is that, technically speaking, the Data Engineering team no longer has any embedded SREs (or VictorOps seats), since we moved to the Data-Platform-SRE team.

We had only integrated a couple of checks with this routing key and the recent team reorganisation has changed the landscape a little again.

I believe that @Gehel was also thinking about whether or not we should have separate routing keys (or even teams) for supporting Discovery-Search systems (e.g. Elasticsearch, WDQS, WCQS etc.) vs Data-Engineering systems (Hadoop, Hive, Presto, Superset, Druid etc.). It might even be worth discussing this more widely, with people such as @WDoranWMF too, as we're reviewing how the alerting and Ops Week will work in the new team structure.

Sorry for muddying the water. :-)

herron changed the task status from Open to Stalled.Aug 15 2023, 4:59 PM

Thanks for the info. With this in mind I'm going to stall this victorops setup task while the details of the desired alerting/paging/team layout are decided. Once that's sorted please update with the desired team name(s) and members and we'll work on setting that up. Thanks!

To help move this forward, I've created a DPE contact plan. Once the stakeholders are in agreement, we should be able to re-engage with this ticket. Sorry for the delay!

Gehel triaged this task as Low priority.Nov 22 2023, 9:50 AM
BTullis changed the task status from Stalled to Open.Feb 8 2024, 1:34 PM
BTullis claimed this task.
BTullis raised the priority of this task from Low to Medium.

Could someone from the observability team rename the analytics routing key in VictorOps to data-platform please?
I don't have the necessary rights to do so.
We will be changing the Icinga/Alertmanager configuration to match. Thanks.


Could someone from the observability team rename the analytics routing key in VictorOps to data-platform please?

{{done}}

Thanks @fgiunchedi.

The new routing key is in place in the Alertmanager configuration. I have now merged this change: https://gerrit.wikimedia.org/r/c/operations/puppet/+/989900/5/modules/alertmanager/templates/alertmanager.yml.erb#331

I realise that there is still a small change we should make to the Icinga configuration, to update the victorops-analytics contactgroup.
https://github.com/wikimedia/operations-puppet/blob/production/modules/nagios_common/files/contactgroups.cfg#L119-L122

I'll make that change, then we can call this ticket done.
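A rename like this is easy to leave half-finished, so a quick scan for leftover references to the old contactgroup name can help before calling it done. The following is only a minimal sketch, not part of the actual patch: the function name `find_stale_references` is hypothetical, and the only values taken from this task are the old and new contactgroup names.

```python
from pathlib import Path

# Old and new contactgroup names, as mentioned in this task.
OLD_NAME = "victorops-analytics"
NEW_NAME = "victorops-data-platform"


def find_stale_references(root, old_name=OLD_NAME):
    """Return (path, line_number, stripped_line) for each line under `root`
    that still contains `old_name`.

    Hypothetical helper for illustration; it does a plain substring match
    and skips files that cannot be read as UTF-8 text.
    """
    hits = []
    for path in sorted(Path(root).rglob("*")):
        if not path.is_file():
            continue
        try:
            text = path.read_text(encoding="utf-8")
        except (UnicodeDecodeError, OSError):
            # Binary or unreadable file: nothing to check.
            continue
        for number, line in enumerate(text.splitlines(), start=1):
            if old_name in line:
                hits.append((str(path), number, line.strip()))
    return hits
```

Run against a local checkout of the puppet repository (for example `find_stale_references("modules/nagios_common")`, assuming that is the current working directory layout), an empty result would suggest nothing still refers to the old name in that tree.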

Change 1006047 had a related patch set uploaded (by Btullis; author: Btullis):

[operations/puppet@production] Rename victorops-analytics to victorops-data-platform

https://gerrit.wikimedia.org/r/1006047

Change 1006047 merged by Btullis:

[operations/puppet@production] Rename victorops-analytics to victorops-data-platform

https://gerrit.wikimedia.org/r/1006047

This final change is now deployed, so our VictorOps notification channel is available in both Alertmanager and Icinga.
Now we have to decide which services should be hooked up to page us and what our schedule / rotations should be like.