Page MenuHomePhabricator

Work around glibc timer aliasing in Excimer
Closed, ResolvedPublic

Description

T389734 noted emergency timeout exceptions occurring when the specified time had not elapsed. This was due to glibc delivering a timer event with the wrong sigevent (BZ#32833).

The workaround in MW was to disable Lua profiling. The bug will still happen at a lower rate. Excimer and LuaSandbox are usable outside of MediaWiki and should not have such bugs in them. It's not acceptable to wait for a fix to be merged in glibc and then to wait for the glibc fix to filter down to all our users.

The broad implementation options are:

  1. Delayed deletion. Don't call timer_delete() until we are pretty sure that there are no events in flight.
  2. Event validation. Discard bad events based on some property of the event.
  3. Reimplement the glibc timer handler using SIGEV_THREAD_ID.

Option 3 would be similar to this patch, although not so similar as to violate the author's copyright, since Excimer is not GPL licensed. Per my comment there, it would be convenient for us if the handler thread were joinable, because that guarantees that no events are delivered after timer_delete(), allowing us to simplify the calling code. It is not possible to use a joinable thread in glibc because POSIX specifies that the thread is not joinable.

SIGEV_THREAD_ID is specific to Linux, although most other UNIX-based operating systems can use the kqueue implementation. The documentation says "This flag is intended only for use by threading libraries" which would be fine if our threading library was doing a decent job. I suppose we can call Excimer a threading library.

Event Timeline

@tstarling wrote in the task description:

[…] The broad implementation options are:

  1. Delayed deletion.
  2. Event validation.
  3. Reimplement the glibc timer handler using SIGEV_THREAD_ID.

Do you have a preferred or recommended approach?

I personally lack the C-experience required to take a defensible position on a technical or idiomatic level. But, from where I'm standing option 2 (validation) seems most appealing as it sounds like a simple change that I believe I can review and understand to be "obviously correct" (C. A. R. Hoare quote), which I can't say about the others.

This isn't the right decision framework per-se, because my experience in C is so low that I'm sure this would exclude many reasonable and benefitial choices. Anyway, if you're asking me, that's what I think. With a bit more time, I'd be happy to review one of the others as well if you'd recommend that instead.

I'll list a few more thoughts on each option, but please read these as the thoughts of a novice exposing their known-poor intuition for the purpose of learning. Correct, adjust, and relativise at will.

  1. Delayed deletion. I'd worry about how much to wait or what future event to wait for, how to know that it is the right amount of time or that the future event indeed happens, memory leaks. And, if the amount of time is more than required, cases that require many short successive Lua-invocations might start to consume significants amounts of memory while we wait for them to garbage collect. Note that I'm not saying this is definitely the case, perhaps the amount of time can be kept short and the memory insignificant. Rather, these are the concerns I'd be looking for and might need help to allay.
  2. Event validation. The only thing that stands out here is the increase in memory for the added field. I think this would single-digit bytes, and we're not holding on to thousands or millions of these in a given request. In the case of MediaWiki, we'd be talking about one of each at a time from LuaSandbox, Request timeout, Rdbms critical section, and perhaps two for Arc Lamp. That's less than 10, making for perhaps a whopping 100 bytes more memory? Negligible for us, if true. Perhaps less so at the kernel level for general-purpose code, but we have a more narrow scope.
  3. Reimplement the glibc timer. This would involve an amount of code and references to (for me, new) concepts that I can't say I'd understand any time soon. I don't oppose it, but it's something to weigh I suppose. SIGEV_THREAD_ID seems well-documented and I'd be fine with us using it.

I went with option 3, it was in progress by the time you wrote your comment.

Maybe you're imagining that validation would be like

handle_event(void * data) {
    if (((mydata*)data)->magic != 0xdeadbeef) {
        return;
    }
    ....
}

But when the bug happens, the data we get from glibc is exactly the value we set up for the emergency timer. Any magic value you want to put in the object will be associated with the emergency timer. Nothing associated with that integer is independent information on where the event originated.

The simple way to validate is to store the expected expiration clock time in the structure. In the handler, call clock_gettime() and if we're still before the expiration time, ignore the event. But it's not a complete solution, there is still aliasing, especially if events are close together.

The complex way to validate is to use SIGEV_THREAD_ID so that we can get a siginfo instead of a sigevent. siginfo is a larger structure and it has the kernel timer ID in it, which is not reused. But this is basically option 3.

Change #1137092 had a related patch set uploaded (by Tim Starling; author: Tim Starling):

[mediawiki/php/excimer@master] [WIP] Rewrite timer backend

https://gerrit.wikimedia.org/r/1137092

Change #1137858 had a related patch set uploaded (by Tim Starling; author: Tim Starling):

[mediawiki/libs/RequestTimeout@master] Keep a pool of emergency timers

https://gerrit.wikimedia.org/r/1137858

Change #1137092 merged by jenkins-bot:

[mediawiki/php/excimer@master] Rewrite POSIX timer backend

https://gerrit.wikimedia.org/r/1137092

Change #1137858 merged by jenkins-bot:

[mediawiki/libs/RequestTimeout@master] Keep a pool of emergency timers

https://gerrit.wikimedia.org/r/1137858

Change #1155806 had a related patch set uploaded (by Tim Starling; author: Tim Starling):

[mediawiki/libs/RequestTimeout@master] Prepare for release 2.0.3

https://gerrit.wikimedia.org/r/1155806

Change #1155806 merged by jenkins-bot:

[mediawiki/libs/RequestTimeout@master] Prepare for release 3.0.0

https://gerrit.wikimedia.org/r/1155806