Page MenuHomePhabricator

Could not send confirmation email: Unknown error in PHP's mail() function.
Open, MediumPublicPRODUCTION ERROR

Description

Error
normalized_message
Could not send confirmation email: Unknown error in PHP's mail() function.
exception.trace
Impact

Causing warnings during authentication (account creation).

Notes

Seems like this is related to: T325131: Setup sendmail on k8s container

Details

Request URL
https://en.wikipedia.org/w/index.php?returnto=*&title=*
Related Changes in Gerrit:

Event Timeline

There are a very large number of changes, so older changes are hidden. Show Older Changes

Breakdown in Logstash: mediawiki dashboard for message:"Could not send confirmation email" over the lasty 90 days:

Screenshot 2025-01-06 at 15.40.51.png (1×1 px, 168 KB)

It's all from http_method:POST requests handled in Codfw data centre (not surprising). It's almost all in kube-mw-web, although a few API requests in mw-api-ext as well. Again, not terribly surprising as that's how account creations are distributed. Lastly, in terms of domain name, it's mostly en.wikipedia.org, ar.wikipedia.org, and es.wikipedia.org but with a long tail. Again, that seems probable as being representative of all account registrations (not verified).

It seems fairly evenly distributed and likely caused by issues in a shared service or external factor. @Tgr mentioned it might be that some email providers that receive our emails may bouncing or otherwise causing a failure.

What I've not yet looked at is the timing. Do the spikes correlate with spikes in sign ups, or are these spikes also spikes in proportion of sign ups? In other words, doe we fail to send a higher proportion of sign up emails during those spikes?

I'll keep looking at narrowing down factors from MediaWiki, but tagging SRE InfraFun meanwhile.

Breakdown in Logstash: mediawiki dashboard for message:"Could not send confirmation email" over the lasty 90 days:

Screenshot 2025-01-06 at 15.40.51.png (1×1 px, 168 KB)

[…]

Account creation rate (increase per 24h), same as the Logstash screenshot. Modified from https://grafana.wikimedia.org/d/000000004/authentication-metrics

Screenshot 2025-01-06 at 16.35.41.png (1×1 px, 253 KB)

If possible it would be helpful to obtain more information than the generic error message, for instance it is not clear if we receive an error from our mail server, or whether there is an initial connection error?

Change #1110871 had a related patch set uploaded (by CDanis; author: CDanis):

[mediawiki/core@master] UserMailer: hack: preserve multiple error callbacks

https://gerrit.wikimedia.org/r/1110871

Change #1110871 merged by jenkins-bot:

[mediawiki/core@master] UserMailer: hack: preserve multiple error callbacks

https://gerrit.wikimedia.org/r/1110871

jhathaway triaged this task as Medium priority.Jan 27 2025, 3:56 PM

If possible it would be helpful to obtain more information than the generic error message, for instance it is not clear if we receive an error from our mail server, or whether there is an initial connection error?

If I'm reading the code correctly, "unknown error" means the PHP mail() function returns false but doesn't trigger an error. The internet says mail() tends to do this for a variety of reasons. Not much can be done about it on the application side, other than switching to Symfony Mailer or something.

Btw there was a huge spike in these errors today, that's very likely to be infrastructure-related.

Screenshot Capture - 2025-02-03 - 17-16-10.png (552×1 px, 47 KB)

Task description
Log message
Could not send confirmation email: Unknown error in PHP's mail() function.

The log message we see is formulated by mediawiki/EmailNotificationSecondaryAuthenticationProvider.php as Could not send confirmation email: {status}, where {status} comes from mediawiki/UserMailer.php, which in turn gets it from the PHP engine as-is. For example, it might say PHP Warning: mail(): "From" header missing or PHP Warning: Could not execute mail delivery program '/usr/sbin/sendmail'. Or, if mail() returned false without emitting any such warning, we append Unknown error in PHP's mail() function. (from the php-mail-error-unknown message key).

The latter is what we see, so it's not that we're missing one of several warnings, but it seems there are none.

@CDanis' patch improves this logic to make sure warning strings are not lost:

  • .. when receiving more than one warning from the mail function.
  • .. when sending more than one email in a batch and perhaps the previous one failed differently than the last one.
  • .. when there are non-mail warnings that might overwrite the last mail-related warning.
Change by @CDanis:

[mediawiki/core@master] UserMailer: hack: preserve multiple error callbacks

https://gerrit.wikimedia.org/r/1110871

This patch has been live for a few weeks, but the log messages still look the same. Which means:

  • No batching. Indeed, confirmation emails (during account creation), and user-to-user emails from (via Special:EmailUser feature) are generally singular, the loop is over an array of 1 email, no chance to overwrite a previous iteration.
  • In theory overwrites from multiple warnings can happen even from a single mail. But, I'm not aware of any PHP built-in functions that can emit multiple warnings. Generally the way they are written in C is that just before they return false, they write one line to stderr. There may be other potential early branches in that C function that we would qualify for also, but as soon as first condition is met, we get one error message and the function returns.
  • If the MediaWiki code in mediawiki/UserMailer.php calls other functions that can emit warnings in the same try-catch block, then the old code could overwrite mail warnings if they happen last. The catch block is pretty narrow: set_error_handler(); foreach { mail(); } restore_error_handler(). When calling mail() we do use $recip->toString(), and self::quotedPrintable( $subject ). These don't seem to emit any warnings, but if they did that could be a concern. Having said that, they would be called before mail() not after so unless there is more than one email, that wouldn't affect us.

If possible it would be helpful to obtain more information than the generic error message, for instance it is not clear if we receive an error from our mail server, or whether there is an initial connection error?

The error message shown above comes directly from the PHP engine. I'm not aware of other messages or signals available to us in the application runtime. I suggest checking the next hop, which I believe is either the local sendmail/syslog on the container/k8s host, or the mail server itself. The timestamps should hopefully allow for some correlation. Ping me if you need help gathering MediaWiki samples from Logstash.

@jhathaway Is there more we can find out from the receiver end (i.e. local sendmail and/or the mail server)? Feel free to ping me here or on IRC if I can help with anything on MediaWiki/Logstash details to help correlate.

@Krinkle I spent some time trying to correlate between the mediawiki log timestamps and our email logs, but I couldn't get them to line up in a sensible way. We could try tcpdumping, but I'm not sure if mediawiki is even attempting to connect. My understanding is that we are using msmtp, T325131, perhaps it is returning an interesting error code? From my brief reading of PHP's source code, it appears to discard the error code, perhaps we could patch it to log the code, e.g.:

diff --git i/ext/standard/mail.c w/ext/standard/mail.c
index 003fad3ea18..acc73320b6a 100644
--- i/ext/standard/mail.c
+++ w/ext/standard/mail.c
@@ -538,6 +538,7 @@ PHPAPI int php_mail(const char *to, const char *subject, const char *message, co
 		{
+			php_error_docref(NULL, E_WARNING, "Sendmail program exited with non-zero exit status '%d'", ret);
 			MAIL_RET(0);
 		} else {
 			MAIL_RET(1);
 		}
 	} else {

Searching for Unknown error in PHP's mail() function yields about 1000 errors in the last 30 days:

Could not send confirmation email: Unknown error in PHP's mail() function.837
Could not email {user}351
Could not send account creation email: Unknown error in PHP's mail() function.5
Failed executing job: TranslationNotificationsEmailJob1
[Job: TranslationNotificationsEmailJob][...] Unknown error in PHP's mail() function.1

Could not email {user} is from EmailAuth so those are existing users locked out of their accounts. The rest are account creations / email confirmations. User impact depends on whether it would affect the same user repeatedly or is more random (in which case, annoying but not a huge deal).

I spent some time trying to correlate between the mediawiki log timestamps and our email logs, but I couldn't get them to line up in a sensible way.

What would help with that? We are not currently logging the target email address but it would be trivial to add. Here are the last few errors:
{P76958}

What would help with that? We are not currently logging the target email address but it would be trivial to add. Here are the last few errors:

If the sendmail program, msmtp in our case, is not successfully delivering the emails to our mail server then we won't have any logs to correlate. So I think getting the exit code into our logs should be the first step. I tried cutting a patch to PHP with that aim, https://github.com/php/php-src/pull/18765. If that looks good to folks, perhaps we could run a patched version in production, until it is accepted upstream?

I tried cutting a patch to PHP with that aim, https://github.com/php/php-src/pull/18765. If that looks good to folks, perhaps we could run a patched version in production, until it is accepted upstream?

Thanks for upstreaming the patch!

Not sure who would be able to review it if we don't want to wait for upstream. @tstarling maybe?

I see the patch got merged upstream, I assume that meets the "looks good" bar?

I see the patch got merged upstream, I assume that meets the "looks good" bar?

indeed, @MoritzMuehlenhoff added a reminder to add the patches to the next backport, T360995#10942948

I've backported @jhathaway's upstream fixes onto PHP 8.1.33 and the packages are ready to go (but not yet included in our apt repo). If there's interest, and folks are available to monitor the result, I can aim to deploy those this week.

Edit: So they exist outside of the build host, I've attached the resulting patches, corresponding to upstream commits c5e7490963a1c53ec12c09264d0dbe1654ec8cfc and 6a7561203de89031147c0de6fe09cad905f82984. These apply cleanly on top of the existing series for the Debian PHP team's debian/main/8.1 branch as of debian/8.1.33-1 and all related tests pass (after reenabling the test suite in debian/rules).

Thanks! We are definitely interested. By monitoring do you just mean checking if this error becomes less frequent / the error message becomes more useful (can do) or testing email sending after the deployment (can do as well if you ping me)?

Thanks, @Tgr!

Thanks! We are definitely interested. By monitoring do you just mean checking if this error becomes less frequent / the error message becomes more useful (can do)

Yes, just broadly anything you might think of to confirm that the patches are working as expected. I don't have a great mental model of what the observable effect of the error-handling correctness fix might look like up in MediaWiki side of things.

or testing email sending after the deployment (can do as well if you ping me)?

Ah, that's a good idea as well. Would this just be via Special:EmailUser, or something else?

In any case, I'd propose we aim for the 17:00 UTC MediaWiki infrastructure window if that works (this requires an update to the base PHP 8.1 image, which is preferable to do during a dedicated window).

If it works as intended, I think the only change is Logstash error messages (link) getting more informative.

All email sending goes through the same code, so I think testing Special:UserEmail (or EmailAuth, signup etc) should be enough. For less frequent errors, I can look at EmailAuth stats after a day and see if verification success rates changed. Maybe do the same for email confirmation, not sure how we log that.

In any case, I'd propose we aim for the 17:00 UTC MediaWiki infrastructure window if that works

The Wednesday one, right? Sounds good.

That sounds good - thanks, @Tgr. Also yes, apologies, I should mentioned explicitly I meant Wednesday 2025-08-06.

Change #1175951 had a related patch set uploaded (by Scott French; author: Scott French):

[operations/docker-images/production-images@master] php8.1: rebuild to pick up 8.1.33-1+wmf11u2

https://gerrit.wikimedia.org/r/1175951

Mentioned in SAL (#wikimedia-operations) [2025-08-06T17:03:54Z] <swfrench-wmf> reprepro include php8.1_8.1.33-1+wmf11u2 in component/php81 - T383047

Change #1175951 merged by Scott French:

[operations/docker-images/production-images@master] php8.1: rebuild to pick up 8.1.33-1+wmf11u2

https://gerrit.wikimedia.org/r/1175951

Mentioned in SAL (#wikimedia-operations) [2025-08-06T17:10:40Z] <swfrench-wmf> built and published php8.1 production image stack at 8.1.33-1-s3 - T383047

Mentioned in SAL (#wikimedia-operations) [2025-08-06T17:11:21Z] <swfrench@deploy1003> Started scap sync-world: Deployment to pick up new 8.1.33-1-s3 production images - T383047

Mentioned in SAL (#wikimedia-operations) [2025-08-06T17:32:52Z] <swfrench@deploy1003> swfrench: Deployment to pick up new 8.1.33-1-s3 production images - T383047 synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there.

Mentioned in SAL (#wikimedia-operations) [2025-08-06T17:56:05Z] <swfrench@deploy1003> Finished scap sync-world: Deployment to pick up new 8.1.33-1-s3 production images - T383047 (duration: 45m 10s)

The error-handling improvements should be live as of a bit before 18:00 UTC today. Many thanks to @Tgr for hanging around to verify the behavior of email-related workflows.

Seems to be working:
Could not send confirmation email: Sendmail exited with non-zero exit code 74
Thanks a lot @jhathaway and @Scott_French for fixing the error reporting issue!

(74 is EX_IOERR, which typically means some sort of disk error. We'll see if all errors are caused by this, or other error codes show up as well.)

So far:
exit code 74 (EX_IOERR): 72
exit code 65 (EX_DATAERR): 7
exit code 64 (EX_USAGE): 1

Do we want to keep using this task for debugging those, or close it (since the issue that caused "Unknown error" to be reported was fixed) and file separate tasks?

We get the correct exit codes now, which is fabulous. However, what is less fabulous as @Tgr notes is that the majority of the error codes are 74, which is used by msmtp to indicate either a disk I/O error or a network I/O error, both of which are broad categories. So, we really need more debug info, how do we obtain it?

Ideas:

  1. Modify msmtp to emit more specific exit codes
  2. Configure msmtp to log to a file, then slurp those files up via rsyslog
  3. Wrap the msmtp command with a script, which sends stderr and stdout to a logger process configured to ship logs to a remote rsyslog, something akin to https://gist.github.com/solidsnack/6090947

My preference is (3), which I think would be easier to implement than (2), while giving us the same data. One issue with (3), is avoiding logging users email addresses, perhaps we could redact them in the wrapper program?

I don't know enough about mailservers to have an opinion on the options (other than that (1) sounds very painful since all we'd get is an integer), but I wouldn't worry much about logging email addresses - they are logged in other parts of the call stack anyway.

I wouldn't worry much about logging email addresses

In some unrelated conversation someone pointed out that some of the emails we send are quite sensitive - donors, partners etc. So let me rephrase that, I wouldn't worry about logging email addresses for emails sent by MediaWiki. Not sure to what extent those are separate from various other kinds of emails.

So far:
exit code 74 (EX_IOERR): 72
exit code 65 (EX_DATAERR): 7
exit code 64 (EX_USAGE): 1

Now it's
exit code 74 (EX_IOERR): 549
exit code 65 (EX_DATAERR): 94
exit code 64 (EX_USAGE): 1

Just noting I just got Sendmail exited with non-zero exit code 74 twice in a row when attempting to email the Oversight user (!) on enwiki. (I then switched to sending via normal email.)

Edit: Actually, I was just notified both attempts /did/ get sent to the Oversight mailing list despite the error message.

Another user would have gotten the errors also. We have about 3-4 emails to the OS team per OS request because people think it's not getting through when it is.

Got the errors too today, three times in a row but the emails were sent.

Getting "Sendmail exited with non-zero exit code 74" using en-Wikimail to another established user.

Just adding that I'm getting this too, over the past three hours, using "email this user" from both meta and enwp, with mail going to two different users.

Update: got this error over a period of several hours on 10/19. When I tried again on 10/20, it had resolved.

I just observed this error at VRT wiki.

Would it make sense to retry once in MediaWiki on sendmail errors?

This is still happening, ~3.3k per in last 4 weeks:

image.png (550×2 px, 159 KB)

What would be the next steps? I assume this error is something that can happen - sendmail can fail due to various reasons. @Tgr - do we do any kind of retries here?

Last 24 hours - 127 instances of message: Could not send confirmation email: Sendmail exited with non-zero exit code 74 - https://logstash.wikimedia.org/goto/1425bc3e85340cb5965e3757623deb8c

Saw the issue in the enwiki UI (non-reproducible):

Screenshot 2025-12-17 at 12.50.25 PM.png (470×2 px, 93 KB)

It's level: WARNING but causes confirmation code to expire; a user need to start everything over.

Screenshot 2025-12-17 at 1.05.40 PM.png (682×2 px, 93 KB)

What would be the next steps?

Either do one of the debugging options @jhathaway mentioned in T383047#11075860, or add retry functionality.