See T122509#1906294 for details. In short: I think it's possible the post-exit script of an old webservice to HDEL the server/port information of a newly started webservice.
Description
Related Objects
- Mentioned In
- T190696: [Gsoc 2018] Proposal for Toolforge webservice command Improvement
T190638: GSoC 2018 proposal for Improvements for the Toolforge 'webservice' command
T177603: Proposal: Improvements for the Toolforge 'webservice' command
T175768: Improvements for the Toolforge 'webservice' command
T122509: investigate webservices crash 28 dec 2015
T133090: tools.suggestbot web requests fail after a period of time - Mentioned Here
- T122509: investigate webservices crash 28 dec 2015
Event Timeline
The only way around this that I can think of is for the post-exit script to first check that the kv that is actually there really /is/ the one it expects to delete before going ahead and deleting it.
Because it's not atomic, it's not perfectly raceproof - but it protects against variable delay before the post-exit script starts (which may be considerably delayed for a number of reasons).
We could use a different data structure in Redis. Currently:
127.0.0.1:6379> HGETALL prefix:admin 1) ".*" 2) "http://tools-webgrid-lighttpd-1411.tools.eqiad.wmflabs:47897" 127.0.0.1:6379>
.* is a regular expression that IIRC was intended to allow more advanced uses. If we instead use prefix:admin as a list instead of a hash, I think we could use LREM prefix:admin 1 http://tools-webgrid-lighttpd-1411.tools.eqiad.wmflabs:47897 to (only) remove the "old" entry.
I'm not sure if my memory serves me correctly, but I think we previously did just that with portgranter, i. e. only remove the entry that was created, and we ran into problems when the removal part did not execute for whatever reason and we had two (or more) entries in Redis. And so – I think – the "remove all entries" solution was chosen so that there is a clean slate to build on.
I can confirm this is what is happening. When I try to reschedule crosswatch, the following happens:
valhallasw@tools-proxy-01:~$ tail -f /var/lib/redis/tools-proxy-01-6379.aof | grep -e 'crosswatch' -C 5 .* *4 $4 HSET $17 prefix:crosswatch $2 .* $59 http://tools-webgrid-generic-1403.tools.eqiad.wmflabs:58753 *3 $4 HDEL $17 prefix:crosswatch $2 .*
First the new location is set, then the old one is removed.