Page MenuHomePhabricator

LegacyHandler.php: PHP Warning: Host lookup failed [-10002]: Unknown error -10002
Closed, ResolvedPublicPRODUCTION ERROR

Description

Error

We often get PHP Warning: socket_sendto(): Host lookup failed [-10002]: Host name lookup failure messages. Looks like the DNS resolution fails from time to time for some reason?

That error message is strangely useless but apparently HHVM returns 10000 + h_errno from gethostbyname and 1 is [[https://github.com/freebsd/freebsd/blob/master/include/netdb.h#L153|HOST_NOT_FOUND]]. The host is stored as an instance property, so this is probably the LegacyHandler equivalent of T151428.

-10002 would be:

# define TRY_AGAIN  2   /* Non-Authoritative Host not found,
                   or SERVERFAIL.  */

Request ID: XV609wpAAEoAAItTxNkAAADP
Request URL: zh.wikipedia.org/w/api.php...

message
[XV609wpAAEoAAItTxNkAAADP] /w/api.php?format=json&formatversion=2&errorformat=plaintext&action=query&redirects=&converttitles=&prop=extracts|pageimages|pageprops&exsentences=5&piprop=thumbnail|name&pilicense=any&explaintext=&pithumbsize=320&titles=%E8%A8%B1%E6%8C%AF%E5%BE%97&uselang=zh-tw   ErrorException from line 218 of /srv/mediawiki/php-1.34.0-wmf.19/includes/debug/logger/monolog/LegacyHandler.php: PHP Warning: Host lookup failed [-10002]: Unknown error -10002
trace
#0 [internal function]: MWExceptionHandler::handleError(integer, string, string, integer, array, array)
#1 /srv/mediawiki/php-1.34.0-wmf.19/includes/debug/logger/monolog/LegacyHandler.php(218): socket_sendto(resource, string, integer, integer, string, integer)
#2 /srv/mediawiki/php-1.34.0-wmf.19/vendor/monolog/monolog/src/Monolog/Handler/AbstractProcessingHandler.php(39): MediaWiki\Logger\Monolog\LegacyHandler->write(array)
#3 /srv/mediawiki/php-1.34.0-wmf.19/vendor/monolog/monolog/src/Monolog/Handler/WhatFailureGroupHandler.php(35): Monolog\Handler\AbstractProcessingHandler->handle(array)
#4 /srv/mediawiki/php-1.34.0-wmf.19/vendor/monolog/monolog/src/Monolog/Logger.php(344): Monolog\Handler\WhatFailureGroupHandler->handle(array)
#5 /srv/mediawiki/php-1.34.0-wmf.19/vendor/monolog/monolog/src/Monolog/Logger.php(637): Monolog\Logger->addRecord(integer, string, array)
#6 /srv/mediawiki/php-1.34.0-wmf.19/includes/GlobalFunctions.php(1013): Monolog\Logger->info(string, array)
#7 /srv/mediawiki/php-1.34.0-wmf.19/includes/api/ApiMain.php(1691): wfDebugLog(string, string, string)
#8 /srv/mediawiki/php-1.34.0-wmf.19/includes/api/ApiMain.php(535): ApiMain->logRequest(double)
#9 /srv/mediawiki/php-1.34.0-wmf.19/includes/api/ApiMain.php(504): ApiMain->executeActionWithErrorHandling()
#10 /srv/mediawiki/php-1.34.0-wmf.19/api.php(86): ApiMain->execute()
#11 /srv/mediawiki/w/api.php(3): include(string)
#12 {main}
Impact

6 hits in the last hour, 7 hits in the last 24 hours, not seen before that.

Notes

Since it's not happening frequently not reverting the train, but blocking next week's train.

Event Timeline

Looking up for "Host lookup failed" over 7 days, there are 11 hits

mw1277 2
mw1279 2
mw1261 1
mw1262 1
mw1263 1
mw1264 1
mw1265 1
mw1276 1
mw1278 1

All near 15:30 UTC.

Messages

6 [{exception_id}] {exception_url} ErrorException from line 218 of /srv/mediawiki/php-1.34.0-wmf.19/includes/debug/logger/monolog/LegacyHandler.php: PHP Warning: Host lookup failed [-10002]: Unknown error -10002

3 [{exception_id}] {exception_url} ErrorException from line 47 of /srv/mediawiki/php-1.34.0-wmf.19/vendor/liuggio/statsd-php-client/src/Liuggio/StatsdClient/Sender/SocketSender.php: PHP Warning: Host lookup failed [-10002]: Unknown error -10002

2 [{exception_id}] {exception_url} ErrorException from line 47 of /srv/mediawiki/php-1.34.0-wmf.19/vendor/liuggio/statsd-php-client/src/Liuggio/StatsdClient/Sender/SocketSender.php: PHP Warning: socket_sendto(): Host lookup failed [-10002]: Host name look

My only guess is that DNS resolution failed for some reason.

Jdforrester-WMF changed the task status from Resolved to Invalid.Aug 26 2019, 2:46 PM
Jdforrester-WMF added a subscriber: Jdforrester-WMF.

That'd be Invalid rather than Resolved then. ;-)

mmodell changed the subtype of this task from "Task" to "Production Error".Aug 28 2019, 11:05 PM

Updating projects tags, workboards and trace to match T255578:

exception.trace
#0 [internal function]: MWExceptionHandler::handleError(integer, string, string, integer, array)
#1 /srv/mediawiki/php-1.35.0-wmf.36/includes/debug/logger/monolog/LegacyHandler.php(217): socket_sendto(resource, string, integer, integer, string, integer)
#2 /srv/mediawiki/php-1.35.0-wmf.36/vendor/monolog/monolog/src/Monolog/Handler/AbstractProcessingHandler.php(39): MediaWiki\Logger\Monolog\LegacyHandler->write(array)
#3 /srv/mediawiki/php-1.35.0-wmf.36/vendor/monolog/monolog/src/Monolog/Handler/WhatFailureGroupHandler.php(35): Monolog\Handler\AbstractProcessingHandler->handle(array)
#4 /srv/mediawiki/php-1.35.0-wmf.36/vendor/monolog/monolog/src/Monolog/Logger.php(344): Monolog\Handler\WhatFailureGroupHandler->handle(array)
#5 /srv/mediawiki/php-1.35.0-wmf.36/vendor/monolog/monolog/src/Monolog/Logger.php(642): Monolog\Logger->addRecord(integer, string, array)
#6 /srv/mediawiki/php-1.35.0-wmf.36/includes/GlobalFunctions.php(998): Monolog\Logger->info(string, array)
#7 /srv/mediawiki/php-1.35.0-wmf.36/includes/api/ApiMain.php(1688): wfDebugLog(string, string, string)
#8 /srv/mediawiki/php-1.35.0-wmf.36/includes/api/ApiMain.php(527): ApiMain->logRequest(double)
#9 /srv/mediawiki/php-1.35.0-wmf.36/includes/api/ApiMain.php(496): ApiMain->executeActionWithErrorHandling()
#10 /srv/mediawiki/php-1.35.0-wmf.36/api.php(84): ApiMain->execute()
#11 /srv/mediawiki/w/api.php(3): require(string)
#12 {main}
thcipriani added a subscriber: thcipriani.

One helpful step might be to have a log of what hostname resolution is failure -- is this always limited to mwlog1001.eqiad.wmnet? And why would that fail so frequently/regularly? Tentatively tagging SRE -- this looks to be ramping up a bit -- here is the 90-day view of this message in the logs:

socket-sendto-fails.png (115×373 px, 4 KB)

-10002 would be:

# define TRY_AGAIN  2   /* Non-Authoritative Host not found,
                   or SERVERFAIL.  */

What's the current DNS query retry policy on mediawiki?

On MediaWiki side it uses socket_sendto() on a socket that already got previously opened:

includes/debug/logger/monolog/LegacyHandler.php
socket_sendto(
    $this->sink, $text, strlen( $text ), 0, $this->host, $this->port
);

In PHP source code (I looked at PHP 7.2.31), the host is converted by:

ext/sockets/sockets.c
PHP_FUNCTION(socket_sendto)
{
...
        case AF_INET:
...
            if (! php_set_inet_addr(&sin, addr, php_sock)) {
                RETURN_FALSE;
            }

php_set_inet_addr is to convert the hostname:

ext/sockets/sockaddr_conv.c
/* Sets addr by hostname, or by ip in string form (AF_INET)  */
int php_set_inet_addr(struct sockaddr_in *sin, char *string, php_socket *php_sock) /* {{{ */
{
    struct in_addr tmp;
    struct hostent *host_entry;

    if (inet_aton(string, &tmp)) {
        sin->sin_addr.s_addr = tmp.s_addr;
    } else {
        if (strlen(string) > MAXFQDNLEN || ! (host_entry = php_network_gethostbyname(string))) {
            /* Note: < -10000 indicates a host lookup error */
#ifdef PHP_WIN32
            PHP_SOCKET_ERROR(php_sock, "Host lookup failed", WSAGetLastError());
#else
            PHP_SOCKET_ERROR(php_sock, "Host lookup failed", (-10000 - h_errno));
#endif
            return 0;
        }
        if (host_entry->h_addrtype != AF_INET) {
            php_error_docref(NULL, E_WARNING, "Host lookup failed: Non AF_INET domain returned on AF_INET socket");
            return 0;
        }
        memcpy(&(sin->sin_addr.s_addr), host_entry->h_addr_list[0], host_entry->h_length);
    }

    return 1;
}
/* }}} */

Which is how we get our -10002 error raise:

PHP_SOCKET_ERROR(php_sock, "Host lookup failed", (-10000 - h_errno));

php_network_gethostbyname:

main/network.c
PHPAPI struct hostent*  php_network_gethostbyname(char *name) {
#if !defined(HAVE_GETHOSTBYNAME_R)
    return gethostbyname(name);
#else
...
}

I am just assuming we have HAVE_GETHOSTBYNAME_R and hence that PHP just invokes gethostbyname() and does not handle TRY_AGAIN. That is for the socket part.

PHP has a much more complicated function to find a dns record dns_get_record in ext/standard/dns.c which does handle all h_errno but really just bails out:

ext/standard/dns.c
PHP_FUNCTION(dns_get_record)
{

                    case TRY_AGAIN:
                        php_error_docref(NULL, E_WARNING, "A temporary server error occurred.");
                        break;
...
                RETURN_FALSE;

So hmm... Looks like a PHP issue.

DNS will fail from time to time, for any of a number of reasons. In fact, it's so central that I ask in SRE interviews "why does UDP make sense for DNS, and what are the drawbacks".

But besides the fact that:

  1. DNS failures will happen, no matter how hard we try
  2. Network will fail too
  3. Our failure rate for name resolution is well within what is reasonable, or even desirable

The solution to this problem is to stop resolving names in PHP as much as possible. PHP, for multiple reasons, is just incredibly bad at this job, as it will be unable to do any type of memoization between requests. As a consequence, there is no way to tolerate failure if not by retrying again.,

If as I suspect the problem we're seeing is an exposure of those inherent limitation, the solution is one of the following:

  • use IPs in mediawiki-config as we used to do pre-HHVM (see also the comment from Timo at T239862#5713786
  • allow a different, smarter part of the stack to manage dns resolution for us. For example, we demand DNS discovery resolution for calling other services mostly to envoyproxy

In the specific case, we're trying to resolve mwlog1001.eqiad.wmnet, so it's not even that we get any value from resolving the IP at runtime.

Change 661732 had a related patch set uploaded (by Giuseppe Lavagetto; owner: Giuseppe Lavagetto):
[operations/mediawiki-config@master] Remove a couple of useless DNS lookups from mediawiki-config

https://gerrit.wikimedia.org/r/661732

...
In the specific case, we're trying to resolve mwlog1001.eqiad.wmnet, so it's not even that we get any value from resolving the IP at runtime.

Definitely and that is straightforward. That is what I have used when WMCS DNS was unreliable a few years ago.

Change 661732 merged by jenkins-bot:
[operations/mediawiki-config@master] Remove a couple of useless DNS lookups from mediawiki-config

https://gerrit.wikimedia.org/r/661732

Mentioned in SAL (#wikimedia-operations) [2021-02-10T19:20:41Z] <thcipriani@deploy1001> Synchronized wmf-config/ProductionServices.php: [[gerrit:661732|Remove a couple of useless DNS lookups from mediawiki-config]] T231025 (duration: 01m 10s)

thcipriani assigned this task to Joe.

Specific error is no longer showing up. If there's a PHP-specific workaround needed here, feel free to reopen. Otherwise calling this resolved.