Page MenuHomePhabricator

wdqs-updater does not recover after an UnknownHostException
Closed, ResolvedPublic

Description

wdqs1004 has networking issues (T188045). During those issues, we can see name resolution exceptions (see below) after which all logs stop. The updater resumes its job only when restarted. While we should obviously fix the underlying root cause (name resolution not working), we should also ensure that wdqs-updater is robust to those kind of errors.

Feb 27 08:46:58 wdqs1004 bash[1664]: 08:46:58.592 [main] INFO  org.wikidata.query.rdf.tool.Updater - Polled up to 2018-02-22T21:33:42Z (next: 20180222213342|672293634) at (8.2, 2.5, 0.9) updates per secon
d and (913.9, 275.3, 98.6) milliseconds per second
Feb 27 08:46:58 wdqs1004 bash[1664]: 08:46:58.603 [main] ERROR org.wikidata.query.rdf.tool.Update - Error during updater run.
Feb 27 08:46:58 wdqs1004 bash[1664]: java.lang.RuntimeException: java.net.UnknownHostException: www.wikidata.org
Feb 27 08:46:58 wdqs1004 bash[1664]: at org.wikidata.query.rdf.tool.wikibase.WikibaseRepository.fetchRecentChanges(WikibaseRepository.java:267)
Feb 27 08:46:58 wdqs1004 bash[1664]: at org.wikidata.query.rdf.tool.change.RecentChangesPoller.fetchRecentChanges(RecentChangesPoller.java:301)
Feb 27 08:46:58 wdqs1004 bash[1664]: at org.wikidata.query.rdf.tool.change.RecentChangesPoller.batch(RecentChangesPoller.java:314)
Feb 27 08:46:58 wdqs1004 bash[1664]: at org.wikidata.query.rdf.tool.change.RecentChangesPoller.nextBatch(RecentChangesPoller.java:144)
Feb 27 08:46:58 wdqs1004 bash[1664]: at org.wikidata.query.rdf.tool.change.RecentChangesPoller.nextBatch(RecentChangesPoller.java:33)
Feb 27 08:46:58 wdqs1004 bash[1664]: at org.wikidata.query.rdf.tool.Updater.nextBatch(Updater.java:269)
Feb 27 08:46:58 wdqs1004 bash[1664]: at org.wikidata.query.rdf.tool.Updater.run(Updater.java:156)
Feb 27 08:46:58 wdqs1004 bash[1664]: at org.wikidata.query.rdf.tool.Update.main(Update.java:80)
Feb 27 08:46:58 wdqs1004 bash[1664]: Caused by: java.net.UnknownHostException: www.wikidata.org
Feb 27 08:46:58 wdqs1004 bash[1664]: at java.net.InetAddress.getAllByName0(InetAddress.java:1280)
Feb 27 08:46:58 wdqs1004 bash[1664]: at java.net.InetAddress.getAllByName(InetAddress.java:1192)
Feb 27 08:46:58 wdqs1004 bash[1664]: at java.net.InetAddress.getAllByName(InetAddress.java:1126)
Feb 27 08:46:58 wdqs1004 bash[1664]: at org.apache.http.impl.conn.SystemDefaultDnsResolver.resolve(SystemDefaultDnsResolver.java:45)
Feb 27 08:46:58 wdqs1004 bash[1664]: at org.apache.http.impl.conn.DefaultHttpClientConnectionOperator.connect(DefaultHttpClientConnectionOperator.java:111)
Feb 27 08:46:58 wdqs1004 bash[1664]: at org.apache.http.impl.conn.PoolingHttpClientConnectionManager.connect(PoolingHttpClientConnectionManager.java:353)
Feb 27 08:46:58 wdqs1004 bash[1664]: at org.apache.http.impl.execchain.MainClientExec.establishRoute(MainClientExec.java:380)
Feb 27 08:46:58 wdqs1004 bash[1664]: at org.apache.http.impl.execchain.MainClientExec.execute(MainClientExec.java:236)
Feb 27 08:46:58 wdqs1004 bash[1664]: at org.apache.http.impl.execchain.ProtocolExec.execute(ProtocolExec.java:184)
Feb 27 08:46:58 wdqs1004 bash[1664]: at org.apache.http.impl.execchain.RetryExec.execute(RetryExec.java:88)
Feb 27 08:46:58 wdqs1004 bash[1664]: at org.apache.http.impl.execchain.RedirectExec.execute(RedirectExec.java:110)
Feb 27 08:46:58 wdqs1004 bash[1664]: at org.apache.http.impl.execchain.ServiceUnavailableRetryExec.execute(ServiceUnavailableRetryExec.java:84)
Feb 27 08:46:58 wdqs1004 bash[1664]: at org.apache.http.impl.client.InternalHttpClient.doExecute(InternalHttpClient.java:184)
Feb 27 08:46:58 wdqs1004 bash[1664]: at org.apache.http.impl.client.CloseableHttpClient.execute(CloseableHttpClient.java:82)
Feb 27 08:46:58 wdqs1004 bash[1664]: at org.apache.http.impl.client.CloseableHttpClient.execute(CloseableHttpClient.java:107)
Feb 27 08:46:58 wdqs1004 bash[1664]: at org.wikidata.query.rdf.tool.wikibase.WikibaseRepository.getJson(WikibaseRepository.java:421)
Feb 27 08:46:58 wdqs1004 bash[1664]: at org.wikidata.query.rdf.tool.wikibase.WikibaseRepository.fetchRecentChanges(WikibaseRepository.java:264)
Feb 27 08:46:58 wdqs1004 bash[1664]: ... 7 common frames omitted
Feb 27 08:46:58 wdqs1004 bash[1664]: Exception in thread "main" java.lang.RuntimeException: java.net.UnknownHostException: www.wikidata.org
Feb 27 08:46:58 wdqs1004 bash[1664]: at org.wikidata.query.rdf.tool.wikibase.WikibaseRepository.fetchRecentChanges(WikibaseRepository.java:267)
Feb 27 08:46:58 wdqs1004 bash[1664]: at org.wikidata.query.rdf.tool.change.RecentChangesPoller.fetchRecentChanges(RecentChangesPoller.java:301)
Feb 27 08:46:58 wdqs1004 bash[1664]: at org.wikidata.query.rdf.tool.change.RecentChangesPoller.batch(RecentChangesPoller.java:314)
Feb 27 08:46:58 wdqs1004 bash[1664]: at org.wikidata.query.rdf.tool.change.RecentChangesPoller.nextBatch(RecentChangesPoller.java:144)
Feb 27 08:46:58 wdqs1004 bash[1664]: at org.wikidata.query.rdf.tool.change.RecentChangesPoller.nextBatch(RecentChangesPoller.java:33)
Feb 27 08:46:58 wdqs1004 bash[1664]: at org.wikidata.query.rdf.tool.Updater.nextBatch(Updater.java:269)
Feb 27 08:46:58 wdqs1004 bash[1664]: at org.wikidata.query.rdf.tool.Updater.run(Updater.java:156)
Feb 27 08:46:58 wdqs1004 bash[1664]: at org.wikidata.query.rdf.tool.Update.main(Update.java:80)
Feb 27 08:46:58 wdqs1004 bash[1664]: Caused by: java.net.UnknownHostException: www.wikidata.org
Feb 27 08:46:58 wdqs1004 bash[1664]: at java.net.InetAddress.getAllByName0(InetAddress.java:1280)
Feb 27 08:46:58 wdqs1004 bash[1664]: at java.net.InetAddress.getAllByName(InetAddress.java:1192)
Feb 27 08:46:58 wdqs1004 bash[1664]: at java.net.InetAddress.getAllByName(InetAddress.java:1126)
Feb 27 08:46:58 wdqs1004 bash[1664]: at org.apache.http.impl.conn.SystemDefaultDnsResolver.resolve(SystemDefaultDnsResolver.java:45)
Feb 27 08:46:58 wdqs1004 bash[1664]: at org.apache.http.impl.conn.DefaultHttpClientConnectionOperator.connect(DefaultHttpClientConnectionOperator.java:111)
Feb 27 08:46:58 wdqs1004 bash[1664]: at org.apache.http.impl.conn.PoolingHttpClientConnectionManager.connect(PoolingHttpClientConnectionManager.java:353)
Feb 27 08:46:58 wdqs1004 bash[1664]: at org.apache.http.impl.execchain.MainClientExec.establishRoute(MainClientExec.java:380)
Feb 27 08:46:58 wdqs1004 bash[1664]: at org.apache.http.impl.execchain.MainClientExec.execute(MainClientExec.java:236)
Feb 27 08:46:58 wdqs1004 bash[1664]: at org.apache.http.impl.execchain.ProtocolExec.execute(ProtocolExec.java:184)
Feb 27 08:46:58 wdqs1004 bash[1664]: at org.apache.http.impl.execchain.RetryExec.execute(RetryExec.java:88)
Feb 27 08:46:58 wdqs1004 bash[1664]: at org.apache.http.impl.execchain.RedirectExec.execute(RedirectExec.java:110)
Feb 27 08:46:58 wdqs1004 bash[1664]: at org.apache.http.impl.execchain.ServiceUnavailableRetryExec.execute(ServiceUnavailableRetryExec.java:84)
Feb 27 08:46:58 wdqs1004 bash[1664]: at org.apache.http.impl.client.InternalHttpClient.doExecute(InternalHttpClient.java:184)
Feb 27 08:46:58 wdqs1004 bash[1664]: at org.apache.http.impl.client.CloseableHttpClient.execute(CloseableHttpClient.java:82)
Feb 27 08:46:58 wdqs1004 bash[1664]: at org.apache.http.impl.client.CloseableHttpClient.execute(CloseableHttpClient.java:107)
Feb 27 08:46:58 wdqs1004 bash[1664]: at org.wikidata.query.rdf.tool.wikibase.WikibaseRepository.getJson(WikibaseRepository.java:421)
Feb 27 08:46:58 wdqs1004 bash[1664]: at org.wikidata.query.rdf.tool.wikibase.WikibaseRepository.fetchRecentChanges(WikibaseRepository.java:264)
Feb 27 08:46:58 wdqs1004 bash[1664]: ... 7 more

Details

Related Changes in Gerrit:

Event Timeline

Aklapper renamed this task from wdqs-updater does not recover after an UnkownHostException to wdqs-updater does not recover after an UnknownHostException.Feb 28 2018, 9:32 AM

I remember that we made UnknownHostException fatal because for some reason it was sticky in Java - I think it was T111576 - so that if the lookup failed once, it kept failing until Java process is restarted. So we wanted to make Updater exit and service launch it again of unknown host.

A thread dump of the stuck process: https://phabricator.wikimedia.org/P6771

Note that we should fix the sticky DNS resolution issue instead of exiting the app. I do remember a similar problem, but not the solution... I'll rack my brain and see if I can come up with something...

The failed DNS resolution being sticky is probably an issue with negative caching in the JVM being either too high or infinite. We should be able to configure it via https://docs.oracle.com/javase/8/docs/technotes/guides/net/properties.html#ncnt . But that definitely requires a good integration test. We could steal some idea from https://github.com/schuch/dnslookup-integration-test

Smalyshev triaged this task as Medium priority.Mar 2 2018, 10:08 PM

We could probably set networkaddress.cache.negative.ttl to 0 - I do not see where we would have any use for negative DNS caching at all.

We could probably set networkaddress.cache.negative.ttl to 0 - I do not see where we would have any use for negative DNS caching at all.

I'm not exactly sure which cases are covered by networkaddress.cache.negative.ttl. It probably covers both failed queries, and queries with negative response. In case of a negative response by DNS, caching the result is the correct behaviour, to reduce load on DNS. In both cases, caching will enable immediate failure which is in general a good thing (in our case, we probably don't care). So setting it to a small value (10 seconds?) is probably better.

All that DNS caching behaviour in Java is not well documented and I don't understand it as well as I should, so I'd prefer to have a test scenario before changing the current behaviour.

It probably covers both failed queries, and queries with negative response.

That's the thing - there's no case for Updater to look up URLs which would give negative response. It ever goes only to Wikidata (or whatever server is used to d/l updates) and if that fails, it's an error.

It probably covers both failed queries, and queries with negative response.

That's the thing - there's no case for Updater to look up URLs which would give negative response. It ever goes only to Wikidata (or whatever server is used to d/l updates) and if that fails, it's an error.

There is no nominal use case. There are error scenarios. Yes if the updater gets a negative response, it is because something outside of its control went wrong (broken DNS, broken configuration, broken network, ...). That does not mean that we should discount this scenario.

In the end, the difference between no negative cache and a small negative cache is small, so it does not matter much. The real issue is that it seems that we have an infinite negative cache and that's really wrong!

Change 416622 had a related patch set uploaded (by Smalyshev; owner: Smalyshev):
[wikidata/query/rdf@master] Set negative DNS cache TTL to 5s

https://gerrit.wikimedia.org/r/416622

Jolokia has been removed from the deployments, so now the updater should crash on DNS errors and be restarted by systemd. It is non trivial to test, but I might be able to reproduce the issue on wdqs-test by playing with iptables...

Smalyshev claimed this task.

Change 416622 merged by jenkins-bot:
[wikidata/query/rdf@master] Set negative DNS cache TTL to 5s

https://gerrit.wikimedia.org/r/416622