Page MenuHomePhabricator

Possible race condition in webservice HSET/HDEL
Closed, ResolvedPublic

Description

See T122509#1906294 for details. In short: I think it's possible the post-exit script of an old webservice to HDEL the server/port information of a newly started webservice.

Event Timeline

valhallasw raised the priority of this task from to Needs Triage.
valhallasw updated the task description. (Show Details)
valhallasw added a project: Toolforge.
valhallasw added subscribers: valhallasw, yuvipanda.
Restricted Application added subscribers: StudiesWorld, Aklapper. · View Herald Transcript

The only way around this that I can think of is for the post-exit script to first check that the kv that is actually there really /is/ the one it expects to delete before going ahead and deleting it.

Because it's not atomic, it's not perfectly raceproof - but it protects against variable delay before the post-exit script starts (which may be considerably delayed for a number of reasons).

We could use a different data structure in Redis. Currently:

127.0.0.1:6379> HGETALL prefix:admin
1) ".*"
2) "http://tools-webgrid-lighttpd-1411.tools.eqiad.wmflabs:47897"
127.0.0.1:6379>

.* is a regular expression that IIRC was intended to allow more advanced uses. If we instead use prefix:admin as a list instead of a hash, I think we could use LREM prefix:admin 1 http://tools-webgrid-lighttpd-1411.tools.eqiad.wmflabs:47897 to (only) remove the "old" entry.

I'm not sure if my memory serves me correctly, but I think we previously did just that with portgranter, i. e. only remove the entry that was created, and we ran into problems when the removal part did not execute for whatever reason and we had two (or more) entries in Redis. And so – I think – the "remove all entries" solution was chosen so that there is a clean slate to build on.

I can confirm this is what is happening. When I try to reschedule crosswatch, the following happens:

valhallasw@tools-proxy-01:~$ tail -f /var/lib/redis/tools-proxy-01-6379.aof | grep -e 'crosswatch' -C 5
.*
*4
$4
HSET
$17
prefix:crosswatch
$2
.*
$59
http://tools-webgrid-generic-1403.tools.eqiad.wmflabs:58753
*3
$4
HDEL
$17
prefix:crosswatch
$2
.*

First the new location is set, then the old one is removed.

dcaro claimed this task.
dcaro subscribed.

This seems resolved somehow, no more issues since then, will reopen if new issues arise.

The Cloud-Services project tag is not intended to have any tasks. Please check the list on https://phabricator.wikimedia.org/project/profile/832/ and replace it with a more specific project tag to this task. Thanks!