Page MenuHomePhabricator

WDQS tests unstable with some thread leak errors
Closed, DeclinedPublic

Description

Lately, we've got some stability issues with WDQS tests, I noticed several times CI tests failing, e.g.:
https://integration.wikimedia.org/ci/job/wikidata-query-rdf-maven/13/console

20:23:43 Started J0 PID(1751@ci-jessie-wikimedia-863527).
20:23:53 [20:23:43.569] Suite: org.wikidata.query.rdf.tool.WikibaseGeoExtensionIntegrationTest
20:23:53   1> #logback.classic pattern: %d{HH:mm:ss.SSS} [%thread] %-5level %logger{36} - %msg%n
20:23:53   2> Oct 20, 2017 8:23:49 PM com.carrotsearch.randomizedtesting.ThreadLeakControl checkThreadLeaks
20:23:53   2> SEVERE: 1 thread leaked from TEST scope at circleSearchWithDistance(org.wikidata.query.rdf.tool.WikibaseGeoExtensionIntegrationTest): 
20:23:53   2>    1) Thread[id=51, name=HttpClient@1941102000-51, state=RUNNABLE, group=TGRP-WikibaseGeoExtensionIntegrationTest]
20:23:53   2>         at (empty stack)
20:23:53   2> Oct 20, 2017 8:23:49 PM com.carrotsearch.randomizedtesting.ThreadLeakControl tryToInterruptAll
20:23:53   2> INFO: Starting to interrupt leaked threads:
20:23:53   2>    1) Thread[id=51, name=HttpClient@1941102000-51, state=TERMINATED, group={null group}]
20:23:53   2> Oct 20, 2017 8:23:49 PM com.carrotsearch.randomizedtesting.ThreadLeakControl tryToInterruptAll
20:23:53   2> INFO: All leaked threads terminated.
20:23:53 [20:23:48.814] ERROR   0.83s | WikibaseGeoExtensionIntegrationTest.circleSearchWithDistance <<<
20:23:53    > Throwable #1: com.carrotsearch.randomizedtesting.ThreadLeakError: 1 thread leaked from TEST scope at circleSearchWithDistance(org.wikidata.query.rdf.tool.WikibaseGeoExtensionIntegrationTest): 
20:23:53    >    1) Thread[id=51, name=HttpClient@1941102000-51, state=RUNNABLE, group=TGRP-WikibaseGeoExtensionIntegrationTest]
20:23:53    >         at (empty stack)
20:23:53    > 	at__randomizedtesting.SeedInfo.seed([8DC53E56CE368CDE:CC5F489B3434F99B]:0)
20:23:53 [20:23:53.068] Completed [1/8 (1!)] in 9.50s, 14 tests, 1 error <<< FAILURES!
20:23:53

We may want to figure out what is causing this, maybe out shutdown procedure leaves something not properly closed.

Event Timeline

Restricted Application added a subscriber: Aklapper. · View Herald Transcript

@Gehel, do you have any thoughts on this?

No idea what's going on here. I have seen the same error a few times as well. It looks like the leaked thread is an HttpClient, so my guess is that in some cases RdfRepository.httpClient isn't closed. But I can't understand in which code path this could happen. Looking deeper into HttpClient itself, it looks like HttpClient.executor is shutdown correctly (as a LifeCycle bean). The executor itself is a QueuedThreadPool, which has some non trivial logic to poison the threads and wait for completion. Increasing log level on org.eclipse.jetty.util.thread might give us some insight on what is happening there.

Side note, I'm not a big fan of Jetty's HTTP client. I think that Apache HC has a simpler interface (maybe it's just me being more used to it).

Side note 2: The test classes hierarchy is a bit more complex than I'd like it to be. Implementing RdfRepositoryForTesting as a Rule instead of a parent-defined property would IMHO make the code easier to read, and maybe easier to debug.

Blazegraph uses Jetty and I think I tried to use Apache with it and there were some problems. So I use Jetty to communicate to Blazegraph.

Also, the problem unfortunately is not very well reproducible, it comes and goes, though it shows up often enough to be annoying. But of course, once I start debugging it locally, it vanishes.

Smalyshev lowered the priority of this task from Medium to Low.Dec 7 2017, 10:25 PM

Change 396380 had a related patch set uploaded (by Gehel; owner: Gehel):
[wikidata/query/rdf@master] Extract RdfRepositoryForTesting to a JUnit @Rule

https://gerrit.wikimedia.org/r/396380

Change 396380 merged by jenkins-bot:
[wikidata/query/rdf@master] Extract RdfRepositoryForTesting to a JUnit @Rule

https://gerrit.wikimedia.org/r/396380

Smalyshev claimed this task.

I'll close it for now and reopen if it turns out this patch didn't help.

Smalyshev removed a project: Patch-For-Review.
Smalyshev removed a subscriber: gerritbot.

I did not really expect this patch to help in any way. But... digging into https://gerrit.wikimedia.org/r/#/c/399864/ I see that the HC client in WikibaseRepository isn't closed properly. This is fixed in that CR, but since it includes a bunch of other stuff as well, we might want to extract it. I'll have a look into it.

Change 401705 had a related patch set uploaded (by Gehel; owner: Guillaume Lederrey):
[wikidata/query/rdf@master] Properly close HC client in WikibaseRepository

https://gerrit.wikimedia.org/r/401705

Change 401705 merged by jenkins-bot:
[wikidata/query/rdf@master] Properly close HC client in WikibaseRepository

https://gerrit.wikimedia.org/r/401705

I'm not entirely confident that the above patch will solve all our issues, but it should at least solve some. We can close this task and re-open if we find other occurrences of failed integration tests.

I think the leaks come from combination of RandomizedTesting, which insists on policing threads (not sure why randomization should come with thread policing?), some classes we use in tests using thread pools (Metrics has its own pools, Http clients too, also Zookeeper) and even if we meticulously shut them down, some of them may be doing it asynchronously which means by the time RandomizedTesting gets to check threads, sometimes shutdown has not been completed yet.
I think ensuring we do not have thread leaks is important, but putting it into RandomizedTesting is probably a wrong place to check it. @Gehel, do you know any better way? Absent that, I would just turn off the thread police in RandomizedTesting (or migrate to some other solution that doesn't mix concerns) and ensure threads not leaking in some other way that works better with tests.

I think we can close it for now, it seems to me to be more of RandomizedTesting race conditions issue than anything else. We may still want to have some other way of tracking the leaks...