Update the From: addresses of all email from DPE pipelines so that they use routable addresses
Closed, Resolved, Public

Description

We have decided to move the Data Engineering Alerts email distribution list:

  • from: data-platform-engineering@lists.wikimedia.org which is running on mailman
  • to: data-platform-alerts@wikimedia.org which is a Google Group

In order to make this transition, we have to make sure that all email sent by our various pipeline components uses a routable From: address.

For example, we currently receive lots of email from refine@an-launcher1002.eqiad.wmnet

This email would not be accepted by the Google Group, whereas we have been able to tweak the mailman configuration to accept it.

This ticket is about identifying and updating all of the places where email is sent by the DPE servers and making sure that a valid From: address is used.
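As a rough illustration of what "routable" means here, the following sketch classifies a From: header by its domain. The suffix list is an assumption made for the example (the ticket only names an-launcher1002.eqiad.wmnet as an unroutable host), and the function name is hypothetical.

```python
from email.utils import parseaddr

# Assumption: domains ending in these suffixes are internal-only and would be
# rejected by an external list such as a Google Group.
INTERNAL_SUFFIXES = (".wmnet",)


def is_routable_sender(from_header: str) -> bool:
    """Return True if the From: header's domain is not an internal-only one."""
    _display_name, address = parseaddr(from_header)
    if "@" not in address:
        return False
    domain = address.rsplit("@", 1)[1].lower()
    return bool(domain) and not domain.endswith(INTERNAL_SUFFIXES)
```

A check like this could be run over a sample of received alert mail to find senders that still need updating.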

Event Timeline

Gehel triaged this task as High priority. Feb 29 2024, 9:28 AM

Change 1007576 had a related patch set uploaded (by Btullis; author: Btullis):

[operations/puppet@production] Allow the systemd-timer-mail-wrapper to override the sender

https://gerrit.wikimedia.org/r/1007576

Change 1007577 had a related patch set uploaded (by Btullis; author: Btullis):

[operations/puppet@production] Allow systemd::timer::job to send from a custom address

https://gerrit.wikimedia.org/r/1007577

Change 1007578 had a related patch set uploaded (by Btullis; author: Btullis):

[operations/puppet@production] Allow kerberos::systemd::timer to use a custom email sender

https://gerrit.wikimedia.org/r/1007578

I have started writing the patches to permit this overriding of the sender address in the systemd timer based email alerts.
I broke it down into three separate patches.

  • Update the systemd-timer-mail-wrapper.py to add support for a --mail-from argument, which sets the From: and Sender: headers to this value, rather than the existing behaviour.
  • Update the systemd::timer::job defined type to add this new argument, if the relevant parameter is set, to the ExecStart option in the systemd unit that is rendered on disk by puppet.
  • Update the kerberos::systemd_timer defined type, adding a matching parameter that allows us to set this sender address.
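The first of these patches can be pictured with a minimal sketch like the one below. It is not the actual systemd-timer-mail-wrapper.py code: the --mail-to option, the function names, and the fallback_sender parameter (standing in for the existing behaviour) are assumptions made for illustration. Only the --mail-from argument and the From:/Sender: headers come from the description above.

```python
import argparse
from email.message import EmailMessage


def parse_args(argv):
    """Parse a reduced, hypothetical version of the wrapper's command line."""
    parser = argparse.ArgumentParser(description="Sketch of a timer mail wrapper")
    parser.add_argument("--mail-to", required=True)
    # When omitted, the wrapper keeps whatever sender it used before.
    parser.add_argument("--mail-from", default=None,
                        help="Value for the From: and Sender: headers")
    return parser.parse_args(argv)


def build_message(args, fallback_sender, subject, body):
    """Build the alert email, using --mail-from for both headers when given."""
    sender = args.mail_from or fallback_sender
    msg = EmailMessage()
    msg["From"] = sender
    msg["Sender"] = sender
    msg["To"] = args.mail_to
    msg["Subject"] = subject
    msg.set_content(body)
    return msg
```

Leaving the argument optional is what keeps the change a noop for any timer that does not set it.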

These three patches should constitute a noop change, but I would like to make sure that the Infrastructure-Foundations team is happy with this approach before I proceed.
I have added @MoritzMuehlenhoff, @jhathaway, and @ayounsi as reviewers from that team.

The next step for me, if this approach is OK, will be to update the various places where the systemd::timer::job and kerberos::systemd_timer resources are instantiated with the send_mail parameter set to true, on all of the Data-Platform servers.

Defined types:

  • profile::analytics::refinery::job::gobblin_job
  • profile::analytics::refinery::job::spark_job
  • profile::analytics::refinery::job::import_wikibase_dumps_config
  • profile::analytics::refinery::job::java_job

Classes:

  • profile::analytics::refinery::job::sqoop_mediawiki
  • profile::analytics::refinery::job::data_check
  • profile::analytics::refinery::job::data_purge
  • profile::analytics::refinery::job::hdfs_cleaner

There are probably lots more that are related to the Dumps architecture as well, but I haven't gone through those yet.

Change 1007576 merged by Btullis:

[operations/puppet@production] Change the default systemd timer email source to noreply@wikimedia.org

https://gerrit.wikimedia.org/r/1007576

After some discussion on https://gerrit.wikimedia.org/r/1007576 we decided to change the default From: and Sender: addresses for all systemd timers to be SYSTEMDTIMER <noreply@wikimedia.org>

This can be overridden with a new --mail-from parameter, which we can set on the systemd::timer::job and kerberos::systemd_timer types, but we won't be required to do so in order to make the emails from timers be accepted by Google Groups.
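To make the default-plus-override behaviour concrete, here is a small sketch of how the wrapper invocation for a timer could be assembled. The wrapper_argv function, the --mail-to option, and the argument ordering are hypothetical; the real ExecStart line is rendered by Puppet and may look different. Only the default sender and the optional --mail-from argument are taken from the comment above.

```python
import shlex
from email.utils import formataddr

# Default sender agreed on in the review discussion.
DEFAULT_SENDER = formataddr(("SYSTEMDTIMER", "noreply@wikimedia.org"))


def effective_sender(mail_from=None):
    """Return the sender the wrapper would end up using."""
    return mail_from or DEFAULT_SENDER


def wrapper_argv(job_argv, mail_to, mail_from=None):
    """Assemble a hypothetical wrapper command line for a timer job."""
    argv = ["systemd-timer-mail-wrapper", "--mail-to", mail_to]
    # Only pass --mail-from when an override was requested, so that jobs
    # without one pick up the wrapper's default sender.
    if mail_from is not None:
        argv += ["--mail-from", mail_from]
    return argv + list(job_argv)


if __name__ == "__main__":
    # Print the command roughly as it might appear in an ExecStart= line.
    print(shlex.join(wrapper_argv(["/usr/local/bin/job"], "alerts@example.org")))
```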

Change 1011342 had a related patch set uploaded (by Btullis; author: Btullis):

[operations/puppet@production] Use a routable sender address for email from Airflow

https://gerrit.wikimedia.org/r/1011342

Change #1011342 merged by Btullis:

[operations/puppet@production] Use a routable sender address for email from Airflow

https://gerrit.wikimedia.org/r/1011342

Although the default email address for systemd timers has changed to noreply@wikimedia.org, we are still receiving some mail from the Refine and RefineMonitor jobs with unroutable domains.
I'll carry on investigating why this might be.

image.png (975×1 px, 158 KB)

Oh right, it seems that RefinerySource has its own built-in email sending code, so it's not using systemd-timer-mail-wrapper at all, despite currently being launched from a systemd timer.

The code for sending email is here.
The place where the default from address is set is here.

I'll look to see what's the lightest-weight change that we can make, but also bear in mind that we wish to migrate these refinery jobs from systemd to airflow anyway.
It might be that there are already plans to change the way that the output is captured and sent by email, as part of the Airflow migration.

Change #1014001 had a related patch set uploaded (by Btullis; author: Btullis):

[operations/puppet@production] Update the from address of all email from refinery jobs.

https://gerrit.wikimedia.org/r/1014001

Change #1014004 had a related patch set uploaded (by Btullis; author: Btullis):

[analytics/refinery/source@master] Update the from address of refine reports to be routable

https://gerrit.wikimedia.org/r/1014004

Change #1014001 merged by Btullis:

[operations/puppet@production] Update the from address of all email from refinery jobs.

https://gerrit.wikimedia.org/r/1014001

I have now deployed this patch to refinery, so that all refinery reports should come from Refinery <noreply@wikimedia.org> in future.

I'll keep an eye out for new refinery report emails to verify that it is working and I will also look through our code to see if there are any other places where we might be using the FQDN.
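One way to look for remaining FQDN-based senders is a simple text scan, sketched below. The regular expression only catches literal addresses under a .wmnet domain, which is an assumption for this example; senders built at runtime from the host's FQDN would not be found this way and still need a manual review.

```python
import re

# Assumption: unroutable senders look like user@<host>.<site>.wmnet.
UNROUTABLE_ADDRESS = re.compile(r"[\w.+-]+@(?:[\w-]+\.)+wmnet\b")


def find_unroutable_addresses(text):
    """Return (line_number, address) pairs for each internal address found."""
    hits = []
    for line_number, line in enumerate(text.splitlines(), start=1):
        for address in UNROUTABLE_ADDRESS.findall(line):
            hits.append((line_number, address))
    return hits
```

The same function could be applied to each file in a repository checkout, or to the headers of recently received alert mail.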

Change #1014491 had a related patch set uploaded (by Btullis; author: Btullis):

[operations/puppet@production] Update the from_address for burrow notification emails

https://gerrit.wikimedia.org/r/1014491

I can confirm receipt of a new refinemonitor report, using the new email address.

image.png (93×690 px, 14 KB)

Change #1014491 merged by Btullis:

[operations/puppet@production] Update the from_address for burrow notification emails

https://gerrit.wikimedia.org/r/1014491

I believe that this is all done, with the exception of this patch to refinery-source, which updates the default email address: https://gerrit.wikimedia.org/r/c/analytics/refinery/source/+/1014004

However, we have overridden it everywhere in puppet. I made a note to anyone working on the Airflow migration in T307505 that they might like to override the email address as well.

Change #1014004 merged by jenkins-bot:

[analytics/refinery/source@master] Update the from address of refine reports to be routable

https://gerrit.wikimedia.org/r/1014004

Change #1007577 merged by Btullis:

[operations/puppet@production] Allow systemd::timer::job to send from a custom address

https://gerrit.wikimedia.org/r/1007577

The changes here mean that when we are informed that a systemd timer has failed on Cloud VPS, we no longer know where the failure happened: the host name is not included in the mail body, and it is now missing from the sender address as well.

In production we still see the host name in the sender address, but on Cloud VPS that information is missing completely, so you can only guess which host runs the failing timer.

Can we please include the hostname in the mail body then by default?
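A minimal sketch of what this request could look like is shown below. It is only an illustration of the idea, not a change that exists in the wrapper; the function name and the "Host:" line format are assumptions.

```python
import socket


def with_host_banner(body, hostname=None):
    """Prefix the mail body with the host the timer ran on."""
    # socket.getfqdn() is used when no host name is supplied explicitly.
    host = hostname or socket.getfqdn()
    return f"Host: {host}\n\n{body}"
```

With something like this, the sender address can stay generic while the body still identifies the machine.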

Change #1007578 abandoned by Btullis:

[operations/puppet@production] Allow kerberos::systemd::timer to use a custom email sender

Reason:

We don't really need this change any more

https://gerrit.wikimedia.org/r/1007578