Page MenuHomePhabricator

Possible race condition in webservice HSET/HDEL
Open, HighPublic

Description

See T122509#1906294 for details. In short: I think it's possible the post-exit script of an old webservice to HDEL the server/port information of a newly started webservice.

Event Timeline

valhallasw raised the priority of this task from to Needs Triage.
valhallasw updated the task description. (Show Details)
valhallasw added a project: Toolforge.
valhallasw added subscribers: valhallasw, yuvipanda.
Restricted Application added a project: Cloud-Services. · View Herald TranscriptDec 28 2015, 4:59 PM
Restricted Application added subscribers: StudiesWorld, Aklapper. · View Herald Transcript
coren added a subscriber: coren.Dec 28 2015, 7:21 PM

The only way around this that I can think of is for the post-exit script to first check that the kv that is actually there really /is/ the one it expects to delete before going ahead and deleting it.

Because it's not atomic, it's not perfectly raceproof - but it protects against variable delay before the post-exit script starts (which may be considerably delayed for a number of reasons).

scfc added a subscriber: scfc.Dec 28 2015, 7:39 PM

We could use a different data structure in Redis. Currently:

127.0.0.1:6379> HGETALL prefix:admin
1) ".*"
2) "http://tools-webgrid-lighttpd-1411.tools.eqiad.wmflabs:47897"
127.0.0.1:6379>

.* is a regular expression that IIRC was intended to allow more advanced uses. If we instead use prefix:admin as a list instead of a hash, I think we could use LREM prefix:admin 1 http://tools-webgrid-lighttpd-1411.tools.eqiad.wmflabs:47897 to (only) remove the "old" entry.

scfc added a comment.Dec 28 2015, 7:53 PM

I'm not sure if my memory serves me correctly, but I think we previously did just that with portgranter, i. e. only remove the entry that was created, and we ran into problems when the removal part did not execute for whatever reason and we had two (or more) entries in Redis. And so – I think – the "remove all entries" solution was chosen so that there is a clean slate to build on.

I can confirm this is what is happening. When I try to reschedule crosswatch, the following happens:

valhallasw@tools-proxy-01:~$ tail -f /var/lib/redis/tools-proxy-01-6379.aof | grep -e 'crosswatch' -C 5
.*
*4
$4
HSET
$17
prefix:crosswatch
$2
.*
$59
http://tools-webgrid-generic-1403.tools.eqiad.wmflabs:58753
*3
$4
HDEL
$17
prefix:crosswatch
$2
.*

First the new location is set, then the old one is removed.

valhallasw triaged this task as High priority.May 27 2016, 1:14 PM
valhallasw moved this task from Triage to Backlog on the Toolforge board.