
Web service failing to start
Closed, Resolved · Public

Description

Over the past week I've noticed SuggestBot's web service failing with an error that the service is not defined. This has not previously been an issue. When I log in to check on lighttpd's status, I often find the job queued and waiting. Today I remembered to check a bit further and noticed that the job can't be scheduled, per the job status pasted below. I'm not sure what's going on here. I notice there are quite a few web server execute hosts that are not used because they don't offer enough memory, but at the same time I can't control the memory usage.

I would appreciate it if this could be looked into and fixed, or some instructions on how to alleviate the problem if there's something I can do.

tools.suggestbot@tools-bastion-02:~$ qstat -j 5421306
job_number: 5421306
exec_file: job_scripts/5421306
submission_time: Sun Apr 17 15:33:15 2016
owner: tools.suggestbot
uid: 51172
group: tools.suggestbot
gid: 51172
sge_o_home: /data/project/suggestbot
sge_o_log_name: tools.suggestbot
sge_o_path: /usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin:/sbin:/bin
sge_o_shell: /bin/bash
sge_o_workdir: /data/project/suggestbot
sge_o_host: tools-bastion-02
account: sge
stderr_path_list: NONE:NONE:/data/project/suggestbot/error.log
hard resource_list: h_vmem=4g,release=trusty
mail_list: tools.suggestbot@tools.wmflabs.org
notify: FALSE
job_name: lighttpd-suggestbot
stdout_path_list: NONE:NONE:/data/project/suggestbot/error.log
stdin_path_list: NONE:NONE:/dev/null
jobshare: 0
hard_queue_list: webgrid-lighttpd
env_list:
script_file: /usr/local/bin/tool-lighttpd
scheduling info: queue instance "continuous@tools-exec-1206.eqiad.wmflabs" dropped because it is overloaded: np_load_avg=2.485000 (= 2.485000 + 0.50 * 0.000000 with nproc=4) >= 1.75

queue instance "giftbot@tools-exec-gift.eqiad.wmflabs" dropped because it is overloaded: np_load_avg=8.650000 (= 8.650000 + 0.50 * 0.000000 with nproc=2) >= 2.00
queue instance "mailq@tools-exec-1206.eqiad.wmflabs" dropped because it is overloaded: np_load_avg=2.485000 (= 2.485000 + 0.50 * 0.000000 with nproc=4) >= 2.25
queue instance "task@tools-exec-1206.eqiad.wmflabs" dropped because it is overloaded: np_load_avg=2.485000 (= 2.485000 + 0.50 * 0.000000 with nproc=4) >= 1.75
queue instance "webgrid-lighttpd@tools-webgrid-lighttpd-1411.tools.eqiad.wmflabs" dropped because it is overloaded: np_load_avg=3.462500 (= 3.462500 + 0.50 * 0.000000 with nproc=4) >= 2.75
queue instance "webgrid-lighttpd@tools-webgrid-lighttpd-1412.tools.eqiad.wmflabs" dropped because it is overloaded: np_load_avg=3.032500 (= 3.032500 + 0.50 * 0.000000 with nproc=4) >= 2.75
queue instance "webgrid-lighttpd@tools-webgrid-lighttpd-1413.tools.eqiad.wmflabs" dropped because it is overloaded: np_load_avg=2.802500 (= 2.802500 + 0.50 * 0.000000 with nproc=4) >= 2.75
queue instance "webgrid-lighttpd@tools-webgrid-lighttpd-1414.tools.eqiad.wmflabs" dropped because it is overloaded: np_load_avg=2.770000 (= 2.700000 + 0.50 * 0.560000 with nproc=4) >= 2.75
queue instance "webgrid-lighttpd@tools-webgrid-lighttpd-1401.eqiad.wmflabs" dropped because it is overloaded: np_load_avg=6.300000 (= 6.300000 + 0.50 * 0.000000 with nproc=4) >= 2.75
queue instance "webgrid-lighttpd@tools-webgrid-lighttpd-1415.tools.eqiad.wmflabs" dropped because it is disabled
cannot run in queue "cyberbot" because it is not contained in its hard queue list (-q)
cannot run in queue "webgrid-generic" because it is not contained in its hard queue list (-q)
(-l h_vmem=4g,release=trusty) cannot run at host "tools-webgrid-lighttpd-1206.eqiad.wmflabs" because it offers only hf:release=precise
(-l h_vmem=4g,release=trusty) cannot run at host "tools-webgrid-lighttpd-1207.eqiad.wmflabs" because it offers only hf:release=precise
cannot run in queue "mailq" because it is not contained in its hard queue list (-q)
cannot run in queue "task" because it is not contained in its hard queue list (-q)
cannot run in queue "continuous" because it is not contained in its hard queue list (-q)
(-l h_vmem=4g,release=trusty) cannot run at host "tools-webgrid-lighttpd-1203.eqiad.wmflabs" because it offers only hf:release=precise
(-l h_vmem=4g,release=trusty) cannot run at host "tools-webgrid-lighttpd-1205.eqiad.wmflabs" because it offers only hf:release=precise
(-l h_vmem=4g,release=trusty) cannot run at host "tools-webgrid-lighttpd-1407.eqiad.wmflabs" because it offers only hc:h_vmem=3.142G
(-l h_vmem=4g,release=trusty) cannot run at host "tools-webgrid-lighttpd-1402.eqiad.wmflabs" because it offers only hc:h_vmem=3.142G
(-l h_vmem=4g,release=trusty) cannot run at host "tools-webgrid-lighttpd-1202.eqiad.wmflabs" because it offers only hf:release=precise
(-l h_vmem=4g,release=trusty) cannot run at host "tools-webgrid-lighttpd-1404.eqiad.wmflabs" because it offers only hc:h_vmem=3.142G
(-l h_vmem=4g,release=trusty) cannot run at host "tools-webgrid-lighttpd-1201.eqiad.wmflabs" because it offers only hf:release=precise
(-l h_vmem=4g,release=trusty) cannot run at host "tools-webgrid-lighttpd-1408.eqiad.wmflabs" because it offers only hc:h_vmem=3.142G
(-l h_vmem=4g,release=trusty) cannot run at host "tools-webgrid-lighttpd-1208.eqiad.wmflabs" because it offers only hf:release=precise
(-l h_vmem=4g,release=trusty) cannot run at host "tools-webgrid-lighttpd-1209.eqiad.wmflabs" because it offers only hf:release=precise
(-l h_vmem=4g,release=trusty) cannot run at host "tools-webgrid-lighttpd-1210.eqiad.wmflabs" because it offers only hf:release=precise
(-l h_vmem=4g,release=trusty) cannot run at host "tools-webgrid-lighttpd-1409.eqiad.wmflabs" because it offers only hc:h_vmem=3.142G
(-l h_vmem=4g,release=trusty) cannot run at host "tools-webgrid-lighttpd-1403.eqiad.wmflabs" because it offers only hc:h_vmem=3.142G
(-l h_vmem=4g,release=trusty) cannot run at host "tools-webgrid-lighttpd-1204.eqiad.wmflabs" because it offers only hf:release=precise
(-l h_vmem=4g,release=trusty) cannot run at host "tools-webgrid-lighttpd-1410.eqiad.wmflabs" because it offers only hc:h_vmem=3.142G
(-l h_vmem=4g,release=trusty) cannot run at host "tools-webgrid-lighttpd-1405.eqiad.wmflabs" because it offers only hc:h_vmem=3.142G
(-l h_vmem=4g,release=trusty) cannot run at host "tools-webgrid-lighttpd-1406.eqiad.wmflabs" because it offers only hc:h_vmem=3.142G
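
For anyone reading the "overloaded" lines above: each one reports a load value, the terms it was built from, and the threshold it was compared against. The following is a rough sketch that re-derives the value from a single line. The function name `recompute_load` is made up for illustration, and dividing the adjustment term by `nproc` is an assumption inferred from the numbers in this output (2.70 + 0.50 * 0.56 / 4 = 2.77, which matches the value reported for tools-webgrid-lighttpd-1414), not something confirmed from the scheduler documentation.

```python
import re

# Matches the numeric part of an "overloaded" line as printed by qstat -j above.
OVERLOAD_RE = re.compile(
    r'np_load_avg=(?P<reported>[\d.]+) '
    r'\(= (?P<base>[\d.]+) \+ (?P<factor>[\d.]+) \* (?P<adj>[\d.]+) '
    r'with nproc=(?P<nproc>\d+)\) >= (?P<threshold>[\d.]+)'
)


def recompute_load(line):
    """Re-derive the load value from one 'overloaded' line (hypothetical helper)."""
    match = OVERLOAD_RE.search(line)
    if match is None:
        return None  # not an "overloaded" line
    base = float(match.group("base"))
    factor = float(match.group("factor"))
    adj = float(match.group("adj"))
    nproc = int(match.group("nproc"))
    threshold = float(match.group("threshold"))
    # ASSUMPTION: the adjustment term is scaled by 1/nproc. This is inferred
    # from the pasted output only, where it reproduces the reported value.
    value = base + factor * adj / nproc
    return {
        "reported": float(match.group("reported")),
        "recomputed": round(value, 6),
        "threshold": threshold,
        "overloaded": value >= threshold,
    }


line = ('queue instance "webgrid-lighttpd@tools-webgrid-lighttpd-1414.tools.eqiad.wmflabs" '
        'dropped because it is overloaded: np_load_avg=2.770000 '
        '(= 2.700000 + 0.50 * 0.560000 with nproc=4) >= 2.75')
print(recompute_load(line))
```

If the assumption holds, 1414 was only marginally over the 2.75 threshold, while 1401 (6.30) was well above it.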

Event Timeline

Restricted Application added a subscriber: Aklapper.

So, to summarize,

  1. tools-webgrid-lighttpd-1401, -1411, -1412, -1413, -1414 are overloaded. This is still the case.
  2. tools-webgrid-lighttpd-1415 is disabled (T132878: tools-webgrid-lighttpd-1415 disabled).
  3. all other hosts reported a lack of free memory. This seems to have resolved itself (all hosts currently report 1-2GB memory usage).

1) and 2) are clear issues, but I'm not sure how to figure out what caused 3).
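
A grouping like the one above can also be produced mechanically from the `qstat -j` output. What follows is a minimal sketch, not a tool that exists on Tool Labs: the names `classify` and `summarize` are made up here, and the reason strings are matched exactly as they appear in the pasted output, so other grid versions may word them differently.

```python
import re
from collections import defaultdict

# Short host name (up to the first dot) from either line shape in the output:
#   queue instance "<queue>@<host>" ...   or   ... cannot run at host "<host>" ...
HOST_RE = re.compile(r'(?:queue instance "[^@"]+@|at host ")([^".]+)')


def classify(line):
    """Map one scheduling-info line to a short reason label, or None."""
    if "dropped because it is overloaded" in line:
        return "overloaded"
    if "dropped because it is disabled" in line:
        return "disabled"
    if "offers only hf:release=" in line:
        return "release mismatch"
    if "offers only hc:h_vmem=" in line:
        return "insufficient h_vmem"
    return None  # e.g. "not contained in its hard queue list" lines


def summarize(qstat_text):
    """Group short host names by the reason the scheduler skipped them."""
    groups = defaultdict(list)
    for line in qstat_text.splitlines():
        reason = classify(line)
        host = HOST_RE.search(line)
        if reason is not None and host is not None:
            groups[reason].append(host.group(1))
    return dict(groups)
```

Feeding it the saved output (for example `summarize(open("qstat.txt").read())`, where `qstat.txt` is a hypothetical file holding the paste) would list the hosts that offered only `release=precise` separately from the ones that offered only `h_vmem=3.142G`, which may help narrow down 3).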
chasemp claimed this task.
chasemp subscribed.

Things look OK at the moment. Based on the timing, I'm going to assume for now that T132879 was the cause, and the service appears to be running fine now.