
ToolForge Stretch grid: not able to restart wsexport webservice
Closed, Resolved (Public)

Description

On Stretch, I am not able to restart the wsexport lighttpd webservice on the grid. I get the following error:

$ webservice restart
Restarting webservice...............ERROR: Pod resisted shutdown

It seems to be related to task T217025 but, because I do not have root rights, I could not force delete the job.

Event Timeline

aborrero triaged this task as Medium priority.
aborrero moved this task from Inbox to Doing on the cloud-services-team (Kanban) board.

Mentioned in SAL (#wikimedia-cloud) [2019-03-20T10:03:13Z] <arturo> T218546 force job deletion 788036

Mentioned in SAL (#wikimedia-cloud) [2019-03-20T10:10:39Z] <arturo> manually killing zombie procs in tools-sgewebgrid-lightttpd-0920 (T218546)

So I tried to investigate a bit more what was happening here.

The tool job was running on the tools-sgewebgrid-lighttpd-0920 server (https://tools.wmflabs.org/sge-status/#host-tools-sgewebgrid-lighttpd-0920)
According to SGE, the job status was deleting, and qstat reported it in the dr state, probably because it was not responding to any standard deletion command.
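
For reference, jobs stuck in a deleting state can be picked out of qstat output by filtering on the state column. This is only a minimal sketch: it assumes the default qstat column layout, where the state is the fifth field, and the helper name jobs_being_deleted is made up for illustration.

```shell
# Hypothetical helper: keep only qstat rows whose state field contains "d"
# (for example "dr" = running job marked for deletion).
# Assumes the default qstat layout: job-ID prior name user state ...
jobs_being_deleted() {
    # Skip the header and separator rows by requiring a numeric job ID.
    awk '$1 ~ /^[0-9]+$/ && $5 ~ /d/'
}

# Typical use on a grid host (requires the SGE client tools):
#   qstat -u '*' | jobs_being_deleted
```

The filter reads from stdin, so it can also be run against saved qstat output.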

I ssh'd to tools-sgewebgrid-lighttpd-0920 to see what was going on. The server load was normal, syslog normal, dmesg normal, htop normal, etc.

@Tpt explained to me a bit about the tool (PDF rendering), so I looked at the tool's processes themselves. Specifically, @Tpt mentioned that the tool is heavy on processing and IO.

All procs were apparently running as expected:

           ├─sge_execd(758)─┬─sge_shepherd(7234)───lighttpd(7236)─┬─php-cgi(7246)─┬─php-cgi(7247)
           │                │                                     │               └─php-cgi(7248)
           │                │                                     └─php-cgi(7249)─┬─php-cgi(7250)
           │                │                                                     └─php-cgi(7251)
           │                ├─sge_shepherd(13117)───lighttpd(13120)─┬─php-cgi(13130)─┬─php-cgi(15714)
           │                │                                       │                └─php-cgi(19146)
           │                │                                       └─php-cgi(13133)─┬─php-cgi(10675)───sh(22281)───xvfb-run(22282)─┬─Xvfb(22294)
           │                │                                                        │                                              └─ebook-convert(22297)─┬─{QDBusConnection}(22391)
           │                │                                                        │                                                                     ├─{Qt bearer threa}(22390)
           │                │                                                        │                                                                     ├─{ebook-convert}(22389)
           │                │                                                        │                                                                     ├─{ebook-convert}(22392)
           │                │                                                        │                                                                     ├─{ebook-convert}(22393)
           │                │                                                        │                                                                     ├─{ebook-convert}(22394)
           │                │                                                        │                                                                     └─{ebook-convert}(22395)
           │                │                                                        └─php-cgi(23411)───sh(23702)───xvfb-run(23703)─┬─Xvfb(23715)
           │                │                                                                                                       └─ebook-convert(23718)
[...]

And looking at them for details:

root@tools-sgewebgrid-lighttpd-0920:~# ps aux | grep [w]sexport
tools.w+ 13120  0.0  0.0  64020  6084 ?        Ds   Mar13   1:05 /usr/sbin/lighttpd -f /var/run/lighttpd/wsexport -D
tools.w+ 22281  0.0  0.0   4276   692 ?        S    Mar15   0:00 sh -c xvfb-run -a ebook-convert '/mnt/nfs/labstore-secondary-tools-project/wsexport/tool/temp/ws-c0_____________-10675427817801.epub' '/mnt/nfs/labstore-secondary-tools-project/wsexport/tool/temp/ws-c0_____________-10675464820306.pdf' --page-breaks-before / --paper-size a5 --margin-bottom 32 --margin-top 40 --margin-left 24 --margin-right 24 --pdf-page-numbers --preserve-cover-aspect-ratio
tools.w+ 22282  0.0  0.0   4276  1444 ?        S    Mar15   0:00 /bin/sh /usr/bin/xvfb-run -a ebook-convert /mnt/nfs/labstore-secondary-tools-project/wsexport/tool/temp/ws-c0_____________-10675427817801.epub /mnt/nfs/labstore-secondary-tools-project/wsexport/tool/temp/ws-c0_____________-10675464820306.pdf --page-breaks-before / --paper-size a5 --margin-bottom 32 --margin-top 40 --margin-left 24 --margin-right 24 --pdf-page-numbers --preserve-cover-aspect-ratio
tools.w+ 22297  0.1  9.1 2533752 752240 ?      Dl   Mar15   9:20 /usr/bin/python2.7 /usr/bin/ebook-convert /mnt/nfs/labstore-secondary-tools-project/wsexport/tool/temp/ws-c0_____________-10675427817801.epub /mnt/nfs/labstore-secondary-tools-project/wsexport/tool/temp/ws-c0_____________-10675464820306.pdf --page-breaks-before / --paper-size a5 --margin-bottom 32 --margin-top 40 --margin-left 24 --margin-right 24 --pdf-page-numbers --preserve-cover-aspect-ratio
tools.w+ 23702  0.0  0.0   4276   696 ?        S    Mar15   0:00 sh -c xvfb-run -a ebook-convert '/mnt/nfs/labstore-secondary-tools-project/wsexport/tool/temp/ws-c0_Nobiliaire_et_armorial_de_Bretagne_Tome_I-234111605413169.epub' '/mnt/nfs/labstore-secondary-tools-project/wsexport/tool/temp/ws-c0_Nobiliaire_et_armorial_de_Bretagne_Tome_I-23411401103445.pdf' --page-breaks-before / --paper-size a5 --margin-bottom 32 --margin-top 40 --margin-left 24 --margin-right 24 --pdf-page-numbers --preserve-cover-aspect-ratio
tools.w+ 23703  0.0  0.0   4276  1552 ?        S    Mar15   0:00 /bin/sh /usr/bin/xvfb-run -a ebook-convert /mnt/nfs/labstore-secondary-tools-project/wsexport/tool/temp/ws-c0_Nobiliaire_et_armorial_de_Bretagne_Tome_I-234111605413169.epub /mnt/nfs/labstore-secondary-tools-project/wsexport/tool/temp/ws-c0_Nobiliaire_et_armorial_de_Bretagne_Tome_I-23411401103445.pdf --page-breaks-before / --paper-size a5 --margin-bottom 32 --margin-top 40 --margin-left 24 --margin-right 24 --pdf-page-numbers --preserve-cover-aspect-ratio
tools.w+ 23718  0.0  0.4 146732 39912 ?        D    Mar15   0:01 /usr/bin/python2.7 /usr/bin/ebook-convert /mnt/nfs/labstore-secondary-tools-project/wsexport/tool/temp/ws-c0_Nobiliaire_et_armorial_de_Bretagne_Tome_I-234111605413169.epub /mnt/nfs/labstore-secondary-tools-project/wsexport/tool/temp/ws-c0_Nobiliaire_et_armorial_de_Bretagne_Tome_I-23411401103445.pdf --page-breaks-before / --paper-size a5 --margin-bottom 32 --margin-top 40 --margin-left 24 --margin-right 24 --pdf-page-numbers --preserve-cover-aspect-ratio

Most of the procs are in D or S state, which means (from ps(1)):

  • D uninterruptible sleep (usually IO)
  • S interruptible sleep (waiting for an event to complete)
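
To spot only the processes in uninterruptible sleep, the ps output can be filtered on the STAT column. This is a rough sketch, not the exact command used during the investigation: it assumes STAT is the second column of the input, and the helper name only_uninterruptible is hypothetical.

```shell
# Hypothetical helper: keep the header plus rows whose STAT field starts
# with "D" (uninterruptible sleep). Expects "ps -eo pid,stat,..." style
# input on stdin, where STAT is the second column.
only_uninterruptible() {
    awk 'NR == 1 || $2 ~ /^D/'
}

# Typical use on the exec node; wchan shows the kernel function the
# process is sleeping in:
#   ps -eo pid,stat,wchan:32,user,args | only_uninterruptible
```

Processes that stay in D state across repeated runs are the ones most likely blocked on IO.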

Now, looking again at dmesg, I found this:

root@tools-sgewebgrid-lighttpd-0920:~# dmesg -T | tail
[...]
[Sat Mar 16 01:06:30 2019] nfs: server nfs-tools-project.svc.eqiad.wmnet not responding, still trying
[...]
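
The same check can be narrowed to NFS timeout messages only. This is a small sketch that matches the message text quoted above; the helper name nfs_timeouts is made up for illustration, and other kernel versions may word the message differently.

```shell
# Hypothetical helper: print kernel log lines that report an unresponsive
# NFS server, matching the wording of the dmesg line quoted above.
nfs_timeouts() {
    grep -E 'nfs: server .+ not responding'
}

# Typical use (reading the kernel log may need root):
#   dmesg -T | nfs_timeouts
```

Comparing the timestamps of these lines with the start times of the stuck processes helps confirm the ordering of events.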

I just manually deleted the job and killed the zombie procs on the server. The tool was then able to restart without issues.

So the timeline makes sense:

  • 2019-03-15: tool procs started
  • 2019-03-16: NFS issue, leaving all procs stuck in a sleep state indefinitely
  • 2019-03-20: (today) investigation and actions being taken

Conclusion: NFS connectivity issues or a lack of response from the NFS server can leave procs stuck, especially for tools like this one that do heavy IO on NFS. Stuck procs require manual intervention in both SGE and on the exec node itself.