Page MenuHomePhabricator

investigate webservices crash 28 dec 2015
Closed, ResolvedPublic

Description

Around 13:50 UTC, a whole bunch of webservices decided to stop. Until 13:49:14 it's all wikiviewstats (T122506), but then we get:

0 2015-12-28 13:49:28 tools-webgrid-lighttpd-1206.eqiad.wmflabs tools.rxy lighttpd-rxy 261226
0 2015-12-28 13:49:32 tools-webgrid-lighttpd-1410.eqiad.wmflabs tools.expose-data lighttpd-expose-data 257452
0 2015-12-28 13:49:35 tools-webgrid-lighttpd-1407.eqiad.wmflabs tools.bibleversefinder lighttpd-bibleversefinder 257453
0 2015-12-28 13:49:38 tools-webgrid-lighttpd-1407.eqiad.wmflabs tools.raun lighttpd-raun 257455
0 2015-12-28 13:49:42 tools-webgrid-lighttpd-1406.eqiad.wmflabs tools.contributionsurveyor lighttpd-contributionsurveyor 257456
0 2015-12-28 13:49:46 tools-webgrid-lighttpd-1208.eqiad.wmflabs tools.jembot lighttpd-jembot 403097
0 2015-12-28 13:49:50 tools-webgrid-lighttpd-1205.eqiad.wmflabs tools.usergraph lighttpd-usergraph 261227
0 2015-12-28 13:49:54 tools-webgrid-lighttpd-1408.eqiad.wmflabs tools.whois lighttpd-whois 257467
0 2015-12-28 13:49:56 tools-webgrid-lighttpd-1201.eqiad.wmflabs tools.wle lighttpd-wle 257468
0 2015-12-28 13:49:59 tools-webgrid-lighttpd-1409.eqiad.wmflabs tools.twl lighttpd-twl 257469
0 2015-12-28 13:50:02 tools-webgrid-lighttpd-1205.eqiad.wmflabs tools.avicbot twitter.sh 484979
0 2015-12-28 13:50:03 tools-webgrid-lighttpd-1414.tools.eqiad.wmflabs tools.voiceintro lighttpd-voiceintro 257470
0 2015-12-28 13:50:07 tools-webgrid-lighttpd-1207.eqiad.wmflabs tools.supercount lighttpd-supercount 257471
0 2015-12-28 13:50:10 tools-webgrid-lighttpd-1207.eqiad.wmflabs tools.potd-feed lighttpd-potd-feed 261228
0 2015-12-28 13:50:13 tools-webgrid-lighttpd-1406.eqiad.wmflabs tools.pagecounts lighttpd-pagecounts 257473
0 2015-12-28 13:50:17 tools-webgrid-lighttpd-1410.eqiad.wmflabs tools.ios-crashes lighttpd-ios-crashes 257474
0 2015-12-28 13:50:21 tools-webgrid-lighttpd-1204.eqiad.wmflabs tools.awb lighttpd-awb 257475
0 2015-12-28 13:50:24 tools-webgrid-lighttpd-1401.eqiad.wmflabs tools.wikiradio lighttpd-wikiradio 425377
0 2015-12-28 13:50:27 tools-webgrid-lighttpd-1409.eqiad.wmflabs tools.dewkin lighttpd-dewkin 257476
0 2015-12-28 13:50:31 tools-webgrid-lighttpd-1404.eqiad.wmflabs tools.forrestbot lighttpd-forrestbot 257477
0 2015-12-28 13:50:35 tools-webgrid-lighttpd-1205.eqiad.wmflabs tools.authors lighttpd-authors 257478
0 2015-12-28 13:50:39 tools-webgrid-lighttpd-1415.tools.eqiad.wmflabs tools.fronds lighttpd-fronds 257479
0 2015-12-28 13:50:43 tools-webgrid-lighttpd-1405.eqiad.wmflabs tools.giraffe lighttpd-giraffe 257480
0 2015-12-28 13:50:48 tools-webgrid-lighttpd-1408.eqiad.wmflabs tools.gorlingor lighttpd-gorlingor 257483
0 2015-12-28 13:50:53 tools-webgrid-lighttpd-1408.eqiad.wmflabs tools.alkamidbot lighttpd-alkamidbot 257484
0 2015-12-28 13:50:56 tools-webgrid-lighttpd-1401.eqiad.wmflabs tools.articleplaceholderwiki lighttpd-articleplaceholderwiki 257485
0 2015-12-28 13:51:00 tools-webgrid-lighttpd-1206.eqiad.wmflabs tools.wdq2sparql lighttpd-wdq2sparql 257486
0 2015-12-28 13:51:05 tools-webgrid-lighttpd-1201.eqiad.wmflabs tools.phetools lighttpd-phetools 479826
(...)
0 2015-12-28 14:15:42 tools-webgrid-lighttpd-1401.eqiad.wmflabs tools.wmcounter lighttpd-wmcounter 258246

which is when @coren force-rescheduled all webservice jobs.

taking lighttpd-rxy as an example, we find:

2015-12-28T13:49:27.901690 No running webservice job found, attempting to start it

in service.log, and

2015-12-28 13:49:27: (server.c.1512) server stopped by UID = 0 PID = 32558
2015-12-28 13:49:30: (log.c.166) server started
2015-12-28 14:15:45: (log.c.166) server started

in error.log. PID 32558 of course doesn't exist anymore.

After restarting, the redis links were not updated correctly, although it's not immediately obvious from this why. Both tools-proxy-01's and tools-proxy-02's redis logs are empty (!) but that might be because things are working (the old log files only contain error situations)

Event Timeline

valhallasw raised the priority of this task from to Needs Triage.
valhallasw updated the task description. (Show Details)
valhallasw added a project: Toolforge.
valhallasw added subscribers: valhallasw, coren.
Restricted Application added subscribers: StudiesWorld, Sadads, Aklapper. · View Herald Transcript

I'm leaving the 'why webservers crashed', and moving to 'why was the proxy incorrectly forwarding' now, and will use magnus-toolserver for that (we got IRC pings this tool was offline).

SGE logs: (keep in mind these are job *endings* not starts)

valhallasw@tools-bastion-01:~/accountingtools$ cat dec15 | grep lighttpd | python merlijn_stdin.py | grep magnus-toolserver
0 2015-12-22 16:54:17 tools-webgrid-lighttpd-1411.tools.eqiad.wmflabs tools.magnus-toolserver lighttpd-magnus-toolserver 252747
25 1970-01-01 00:00:00 tools-webgrid-lighttpd-1406.eqiad.wmflabs tools.magnus-toolserver lighttpd-magnus-toolserver 257504
25 1970-01-01 00:00:00 tools-webgrid-lighttpd-1405.eqiad.wmflabs tools.magnus-toolserver lighttpd-magnus-toolserver 257504
0 2015-12-28 13:51:23 tools-webgrid-lighttpd-1413.tools.eqiad.wmflabs tools.magnus-toolserver lighttpd-magnus-toolserver 257504
25 1970-01-01 00:00:00 tools-webgrid-lighttpd-1409.eqiad.wmflabs tools.magnus-toolserver lighttpd-magnus-toolserver 485054

The job is currently running on

valhallasw@tools-bastion-01:~/accountingtools$ qstat -u 'tools.magnus-toolserver' -xml
(...)
      <queue_name>webgrid-lighttpd@tools-webgrid-lighttpd-1408.eqiad.wmflabs</queue_name>

Webserver logs are difficult due to the size (10GB...)

valhallasw@tools-proxy-02:/var/lib/redis$ grep tools-proxy-02-6379.aof -e 'prefix:magnus-toolserver' -B 2 -A 5 | less
(reformatted for readability)
HMSET prefix:magnus-toolserver .* http://tools-webgrid-lighttpd-1404.tools.eqiad.wmflabs:37122
HSET  prefix:magnus-toolserver .* http://tools-webgrid-lighttpd-1413.tools.eqiad.wmflabs:49805
HDEL  prefix:magnus-toolserver .*
HSET  prefix:magnus-toolserver .* http://tools-webgrid-lighttpd-1409.tools.eqiad.wmflabs:39426
HDEL  prefix:magnus-toolserver .*
HSET  prefix:magnus-toolserver .* http://tools-webgrid-lighttpd-1408.tools.eqiad.wmflabs:42235

The first one is off (the webservice has not been on -1404 since 2015-12-21) but this might be a replication issue. The later ones seem OK, but there's one weird thing: the order is set, set, delete, set, delete, and I think the last delete might match up with the ''second'' rather than the ''third'' delete. In other words, this is what I think happens:

  1. The webservice crashes on tools-webgrid-lighttpd-1413.tools.eqiad.wmflabs,
  2. The webservice is restarted on tools-webgrid-lighttpd-1409.eqiad.wmflabs,
  3. Redis is updated to refer to -1409,
  4. the post-exit script for the webservice on 1413 calls HDEL and removes the proxy information
valhallasw claimed this task.

I'm reasonably convinced T122515: Possible race condition in webservice HSET/HDEL is the culprit, and 5 months later there is little investigation that can still be done.