Change the analytics-alerts email alias to a mailman distribution list
Closed, Resolved, Public

Assigned To
Authored By
xcollazo
Aug 17 2022, 8:05 PM
Referenced Files
F35641648: image.png
Oct 27 2022, 3:21 PM
F35623756: image.png
Oct 25 2022, 8:32 AM
F35623742: image.png
Oct 25 2022, 8:32 AM
F35595338: image.png
Oct 18 2022, 10:17 AM

Description

The SRE team would like to remove analytics-alerts from the Exim aliases.

After some discussion we have decided to use mailman for this.

We will also change the name of the list to data-engineering-alerts@lists.wikimedia.org

Event Timeline

To the SRE on clinic duty: this is configured in puppetmaster1001:/srv/private/modules/privateexim/files/wikimedia.org (but I'm not sure whether the Analytics SREs want to confirm this or would rather do it themselves).

In general, it would be nice if this group alias were also handed over to ITS and turned into a Google group. They can then make some of the group members group admins and it won't need SRE tickets anymore. (T122144 ff)

This is how to get a list of all existing members:

[mx1001:~] $ sudo exim4 -bt analytics-alerts@wikimedia.org

Asking ITS to make it a Google group should be just an email to techsupport@ if that is an option.

Migrating it to mailman3 would help if the volume is not too large. cc. @Ottomata

jbond triaged this task as Medium priority. Sep 6 2022, 2:44 PM
jbond removed a project: Infrastructure-Foundations.
EChetty raised the priority of this task from Medium to High. Sep 27 2022, 2:24 PM

Asking ITS to make it a Google group should be just an email to techsupport@ if that is an option.

Agreed that would be more efficient. Will discuss with team.

But right now I keep missing alerts that I could help on because I am not a member of this mail list.

Can we please move forward on this task?

CC @EChetty, @MNadrofsky

team membership confirmed per https://www.mediawiki.org/wiki/Platform_Engineering_Team/Data_Value_Stream


@xcollazo I added you just now. You can now expect to start receiving those emails. It would be preferred if we can keep the ticket open though to move this over to Google. That way we can decentralize control of the alias and future edits won't need SRE clinic duty anymore. Cheers

Thank you @Dzahn!

( Side note: I have confirmed that we can make the list public if we choose to move it to Google Groups. As an example, platform-eng-alerts, which currently only covers alerts from the platform-eng Airflow instance, has a public list at https://groups.google.com/a/wikimedia.org/g/platform-eng-alerts. )

CC @odimitrijevic

@xcollazo ITS can create the group and then give adminship to your team so that you can self-manage it.

@Dzahn: we discussed moving the list today and there was concern on whether we could make the content of the list public, thus my note above confirming that we can.

I'll update this task once we make a decision.

@xcollazo There are 2 possible routes you can go. Both result in your team being able to self-manage the list.

a) Google group (I am not actually sure about all the possible settings there regarding what is public and what is not, but ITS would be there to advise on the private vs public options)

b) Mailman (lists.wikimedia.org) - This is the recommended and traditional way if you actually want a public list with archives and where volunteers can subscribe. There are also many different settings regarding what is public vs private (subscribing, archives, viewing list of members etc) but you would also be the list admins and could configure it as you see fit.

I would recommend you pick either of these options, as long as we don't keep the current situation, which means having to ping a root user to make an edit each time.

Cheers

Ack @Dzahn, thank you for the context and options! Will discuss with team and get back to you.

We discussed this and figured that @BTullis, and a new SRE that is joining us soon, should own the list. They are both embedded SREs in our team and have access to add and remove members; I had just missed that.

I have also updated Data Engineering's Onboarding Wiki to reflect the proper procedure, so as not to put unnecessary workload on other SREs.

Closing. Thanks again @Dzahn !

Keeping the existing setup was the one possible outcome I had tried to prevent here :(

@BTullis Is there any way we could get this out of the exim aliases? ...pleaaasse....

Ok @Dzahn - I'm sorry, I didn't realise that moving this out of the Exim aliases was important to you.

I thought you were just trying to cut down on work for the core SRE team. I'm happy to reopen and coordinate with ITS to move this alias to Google Groups if you prefer.

Thanks @Ladsgroup - noted. So it seems that we have five topic or team based -alerts lists on mailman already:

  • betacluster-alerts
  • discovery-alerts
  • multimedia-alerts
  • qa-alerts (quality-assurance)
  • sd-alerts (structured-data)

...as well as three others that have moved to Google Groups. https://groups.google.com/all-groups?q=alerts

  • platform-eng-alerts
  • bacula-alerts
  • performance-team-alerts

image.png (333×685 px, 38 KB)

After checking, I can see that analytics-alerts is the only -alerts list still remaining in the Exim aliases. @Dzahn has expressed a strong desire for us to remove this alias from the Exim configuration.

So we could go with the majority and move it to mailman. I believe I can do this and set myself as an initial admin, then add more administrators to the list, despite not technically being a mailman administrator. I can use sudo as per: https://wikitech.wikimedia.org/wiki/Mailman#Create_a_mailing_list - @Ladsgroup - can you confirm whether this is correct please, or would I need to contact a mailman administrator first? We'll also need subsequent changes to the Icinga and Alertmanager configurations to make analytics-alerts@lists.wikimedia.org the target address instead of analytics-alerts@wikimedia.org - Both of these changes I can do if this is the way we go.

@odimitrijevic - are you happy for us to make this change? If you would rather that we use a Google Group for any reason, as opposed to mailman, or if you have a strong desire to keep the status quo, please let us know.

Mostly, don't worry about sudo; I can create it for you. Just note that it will be "whatever-name-alerts@lists.wikimedia.org" (note the "lists" part). While we are here, we can simply create it as data-engineering-alerts@ to reflect the new name and update tools to point to the new address.

Another thing: Mailman2 had many, many issues, but mailman3 (the current infra) is much easier to use and handle.

BTullis renamed this task from Add xcollazo@wikimedia.org to the analytics-alerts mailing list to Change the analytics-alerts email alias to a mailman distribution list. Oct 20 2022, 3:47 PM
BTullis updated the task description. (Show Details)

@Ladsgroup - could you go ahead with this please? As you suggest we would like it to be: data-engineering-alerts@lists.wikimedia.org
Please would you add me as initial admin? Am I able to promote other team members to admin, or would you like a full list of admins now?

Thank you very much for doing this, @BTullis

@BTullis Done! Please check your mail.

[lists1001:~] $ sudo mailman-wrapper create --owner btullis@wikimedia.org data-engineering-alerts@lists.wikimedia.org
Created mailing list: data-engineering-alerts@lists.wikimedia.org

Pretty sure you can add new admins now or later.

Great! Many thanks. I'll leave this ticket open until we've migrated Icinga and Alertmanager to this new address and then removed the old alias from the exim aliases.
I've encouraged all of the Data Engineering team to subscribe and I'll promote those who wish to owners or moderators.

The very last step would then be to remove the line from the puppetized exim aliases in the private repo.

Change 845030 had a related patch set uploaded (by Btullis; author: Btullis):

[operations/puppet@production] Update the email address for data-engineering alerts

https://gerrit.wikimedia.org/r/845030

For Icinga there is already a related CR changing the way that the timers work, so I have asked @fgiunchedi to update the email addresses used in that patch: https://gerrit.wikimedia.org/r/c/operations/puppet/+/843885

Change 845030 merged by Btullis:

[operations/puppet@production] Update the email address for data-engineering alerts

https://gerrit.wikimedia.org/r/845030

Change 848262 had a related patch set uploaded (by Btullis; author: Btullis):

[operations/puppet@production] Update remaining references to analytics-alerts email

https://gerrit.wikimedia.org/r/848262

Change 848262 merged by Btullis:

[operations/puppet@production] Update remaining references to analytics-alerts email

https://gerrit.wikimedia.org/r/848262

Change 848269 had a related patch set uploaded (by Btullis; author: Btullis):

[analytics/refinery@master] Update the email used for alerting the data engineering team

https://gerrit.wikimedia.org/r/848269

So far, all of the inbound messages to this mailing list have been held for moderation, requiring an admin to accept or reject them.
That's not at all suitable for an alerting system, so I have been trying to find the right settings to correct this behaviour.

I've added two regex matches to the 'Accept these non-members' list here: https://lists.wikimedia.org/postorius/lists/data-engineering-alerts.lists.wikimedia.org/settings/message_acceptance

image.png (352×786 px, 23 KB)
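
For illustration, here is a minimal Python sketch of how regex entries in such an accept list behave. The two patterns and the host name in the usage note are hypothetical (the actual entries are only visible in the screenshot above), and the matching shown is an approximation of Mailman's behaviour, not its real implementation.

```python
import re

# Hypothetical patterns, assumed for illustration only.
# In Mailman 3, list entries that begin with "^" are treated as regular
# expressions rather than literal addresses.
ACCEPT_PATTERNS = [
    r"^.+@.+\.eqiad\.wmnet$",
    r"^.+@.+\.codfw\.wmnet$",
]


def is_accepted_nonmember(sender, patterns=ACCEPT_PATTERNS):
    """Return True if the sender address matches any accept pattern."""
    return any(re.search(p, sender, re.IGNORECASE) for p in patterns)
```

With these assumed patterns, a sender such as root@somehost.eqiad.wmnet would be accepted without moderation, while an address on an unrelated domain would still be held.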

@Ladsgroup , @Dzahn - Is there any other specific guidance for making this list more suitable for use as an alerting system?

I've never been too comfortable with the fact that hosts send messages from a non-routable domain, i.e. .eqiad.wmnet

image.png (118×483 px, 11 KB)

...but I haven't yet spent any real effort in thinking how it could be made better.

@Dzahn - would it be acceptable for us to use the exim aliases file to forward the old alias to the new list temporarily? i.e. to have the following in puppetmaster1001:/srv/private/modules/privateexim/files/wikimedia.org

analytics-alerts: data-engineering-alerts@lists.wikimedia.org

This is the same way that the ops list is configured, it seems.

...for a while? There's some discussion on https://gerrit.wikimedia.org/r/848269 about the fact that updating all of oozie's email addresses might be time-consuming and we're already planning to deprecate that system.
Therefore, leaving the existing alias in place for oozie, whilst forwarding to the new list might be a good interim solution.
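
To make the proposed change concrete, here is a small Python sketch that parses alias lines of the "name: target, target" form and shows the difference between the old and new entries. The member addresses in BEFORE are placeholders, not the real membership, and the parser is a simplification that does not attempt to cover everything Exim accepts in an alias file.

```python
def parse_aliases(text):
    """Parse 'name: target1, target2' lines into a dict of lists."""
    aliases = {}
    for raw in text.splitlines():
        line = raw.strip()
        # Skip blank lines and comments.
        if not line or line.startswith("#"):
            continue
        name, _, targets = line.partition(":")
        aliases[name.strip()] = [t.strip() for t in targets.split(",") if t.strip()]
    return aliases


# Placeholder members (hypothetical), standing in for the current user list.
BEFORE = "analytics-alerts: alice@wikimedia.org, bob@wikimedia.org"
# The forwarding entry proposed above.
AFTER = "analytics-alerts: data-engineering-alerts@lists.wikimedia.org"
```

Here parse_aliases(AFTER) yields a single target, the mailman list, so membership would be managed in mailman and not in the private exim file.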

What you did for accepting non-members looks good to me. I haven't seen any held messages anymore. So it's working?

I'm happy to make that an alias for now. Don't know about Daniel.

@BTullis Yes, that is acceptable. It's still progress over managing the group members in the exim file directly. So thank you for that!

By the way, there is still the Google option. It would only be a one-time request to ITS. I want to point out it would NOT mean that you have to keep asking them for each edit. They would just delegate admin to you once and you would completely self-manage.

The advantage would be that in this case the email address does not change at all. So you wouldn't have to update it anywhere.

OK, thanks all. I'll make that change to the exim aliases file: analytics-alerts: data-engineering-alerts@lists.wikimedia.org instead of the user list.

When oozie has been decommissioned and there are no more systems using analytics-alerts@wikimedia.org then I'll remove it altogether.
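
One way to check whether any systems still reference the old address is to scan a local checkout of the relevant repositories. The following is a rough Python sketch, assuming the configuration lives in plain-text files under a directory tree; the function name is made up for this example.

```python
from pathlib import Path

OLD_ADDRESS = "analytics-alerts@wikimedia.org"


def find_references(root, needle=OLD_ADDRESS):
    """Return (path, line number, line) for every line containing needle."""
    hits = []
    for path in sorted(Path(root).rglob("*")):
        if not path.is_file():
            continue
        try:
            text = path.read_text(encoding="utf-8")
        except (UnicodeDecodeError, OSError):
            # Skip binary or unreadable files.
            continue
        for lineno, line in enumerate(text.splitlines(), start=1):
            if needle in line:
                hits.append((str(path), lineno, line.strip()))
    return hits
```

Calling find_references on a checkout directory returns an empty list once nothing refers to the old alias any more, which would be the signal that it can be removed.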

Change 848269 abandoned by Btullis:

[analytics/refinery@master] Update the email used for alerting the data engineering team

Reason:

We are keeping the alias for now, so we don't need to update oozie.

https://gerrit.wikimedia.org/r/848269

After modifying the alias I also needed to set the following option in mailman.

image.png (92×737 px, 7 KB)

Change 896137 had a related patch set uploaded (by Btullis; author: Btullis):

[operations/deployment-charts@master] Remove the install-crds parameter frlom spark-operator

https://gerrit.wikimedia.org/r/896137

Change 896137 merged by jenkins-bot:

[operations/deployment-charts@master] Remove the install-crds parameter frlom spark-operator

https://gerrit.wikimedia.org/r/896137

Sorry, these two patches are unrelated to this task. Added by mistake.