
Problems with web proxy for Tool Labs
Closed, Resolved (Public)

Description

There are issues with the web proxy system:

For tool "persondata" I'm always getting a "No webservice" page, but a webservice job is running (see "qstat"; tried with "webservice start" and "webservice2 start").

For tool "wikihistory" I'm always getting a 404 without content ("HTTP/1.1 404 Not Found"; "Content-Length: 0"), no matter whether a webservice is running or not.

Event Timeline

APPER raised the priority of this task to Unbreak Now!.
APPER updated the task description.
APPER added a project: Toolforge.
APPER added subscribers: APPER, yuvipanda.

Regarding the first case (persondata), there is no entry in the proxy list on the proxies. I had compared that list on tools-webproxy just a few days ago, but didn't think about tools-webproxy-01 and tools-webproxy-02, the new scheme. I uploaded my (updated) script to check as F64569 (requires membership in the admin group to execute) and found that a number of webservices were not registered with the proxies. Restarting them showed that they probably hang on portgrabber trying to register with tools-webproxy which has been deleted since. I'll submit a patch to fix this in a jiffy.

Aaargh, looks like this is my fault for deleting tools-webproxy without patching portgrabber and associated things....

Change 195219 had a related patch set uploaded (by Tim Landscheidt):
Tools: Remove tools-webproxy from list of proxies

https://gerrit.wikimedia.org/r/195219

Change 195219 merged by Yuvipanda:
Tools: Remove tools-webproxy from list of proxies

https://gerrit.wikimedia.org/r/195219

Am attempting to run puppet on all the affected nodes via salt now.

Alright, so that's done. What now?

So after the quick merge & deploy by @yuvipanda, they no longer hang, but do not register on tools-webproxy-01/tools-webproxy-02:

scfc@tools-webproxy-02:~$ echo KEYS prefix:persondata | nc -C localhost 6379
scfc@tools-webproxy-02:~$

This is … odd.

Looking at /var/log/proxylistener, they both stop at around the same time with:

2015-03-09 05:30:25,515 Set redis key prefix:xtools-articleinfo with key/value .*:http://tools-webgrid-04:14102

Could this be one of these conntrack issues or something similar where we hit a networking limit? I don't know enough about that for debugging, but if there's someone knowledgeable around, I'd rather have them look at it before blindly restarting :-).

Looking at /var/log/syslog, at Mar 9 06:01:01 the resolv.conf patch was deployed.

scfc@tools-webproxy-02:~$ sudo lsof -p 1008 | fgrep TCP | wc -l
1019
scfc@tools-webproxy-02:~$

(PID 1008 is proxylistener.) Is it hitting a limit of 1024 open files?

Looks that way:

scfc@tools-webproxy-02:~$ sudo cat /proc/1008/limits
Limit                     Soft Limit           Hard Limit           Units     
Max cpu time              unlimited            unlimited            seconds   
Max file size             unlimited            unlimited            bytes     
Max data size             unlimited            unlimited            bytes     
Max stack size            8388608              unlimited            bytes     
Max core file size        0                    unlimited            bytes     
Max resident set          unlimited            unlimited            bytes     
Max processes             31554                31554                processes 
Max open files            1024                 4096                 files     
Max locked memory         65536                65536                bytes     
Max address space         unlimited            unlimited            bytes     
Max file locks            unlimited            unlimited            locks     
Max pending signals       31554                31554                signals   
Max msgqueue size         819200               819200               bytes     
Max nice priority         0                    0                    
Max realtime priority     0                    0                    
Max realtime timeout      unlimited            unlimited            us        
scfc@tools-webproxy-02:~$
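
For anyone who wants to script this check: a minimal Python sketch, not part of the original debugging session, that parses the text format of /proc/<pid>/limits shown above and compares the soft "Max open files" limit with the number of entries in /proc/<pid>/fd. The function names are made up for illustration. Note that it counts all open descriptors, not only the TCP sockets that the lsof pipeline counted, and that reading another user's /proc/<pid>/fd normally requires root.

```python
import os
import re


def parse_limits(text):
    """Parse the contents of /proc/<pid>/limits into {name: (soft, hard)} (values kept as strings)."""
    limits = {}
    for line in text.splitlines()[1:]:  # skip the header row
        # Columns are separated by runs of two or more spaces.
        fields = re.split(r"\s{2,}", line.strip())
        if len(fields) < 3:
            continue
        limits[fields[0]] = (fields[1], fields[2])
    return limits


def fd_headroom(pid):
    """Return (open_fds, soft_limit); soft_limit is None if the limit is 'unlimited'."""
    with open(f"/proc/{pid}/limits") as fh:
        soft, _hard = parse_limits(fh.read())["Max open files"]
    open_fds = len(os.listdir(f"/proc/{pid}/fd"))
    return open_fds, (None if soft == "unlimited" else int(soft))
```

Called as fd_headroom(1008) on the proxy host, this would return the descriptor count alongside the soft limit, which makes it easy to alert before the process runs out.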

"Max open files". Needs to be increased.

Change 195221 had a related patch set uploaded (by Yuvipanda):
tools: Increase proxylistener nofile limits

https://gerrit.wikimedia.org/r/195221

Change 195221 merged by Yuvipanda:
tools: Increase proxylistener nofile limits

https://gerrit.wikimedia.org/r/195221

@yuvipanda: Do you want to restart the proxylistener service or try to change the limits on the running process?

Changing the limits on the running process would be good, if possible. If not, we can do a qmod-based restart and then pick up stragglers from the redis key list?

Alright, looks like we can't really change the limits on the running process...

I just compiled prlimit. Should I give it a try?

As a side note, we should find a way to get rid of this persistent-connection architecture of proxylistener...

Wow, that was troublesome until I finally ended up with:

root@tools-webproxy-02:~# LD_PRELOAD=/home/scfc/usr/lib/libsmartcols.so /home/scfc/usr/bin/prlimit -p 1008 --nofile=8192 
root@tools-webproxy-02:~# cat /proc/1008/limits 
Limit                     Soft Limit           Hard Limit           Units     
Max cpu time              unlimited            unlimited            seconds   
Max file size             unlimited            unlimited            bytes     
Max data size             unlimited            unlimited            bytes     
Max stack size            8388608              unlimited            bytes     
Max core file size        0                    unlimited            bytes     
Max resident set          unlimited            unlimited            bytes     
Max processes             31554                31554                processes 
Max open files            8192                 8192                 files     
Max locked memory         65536                65536                bytes     
Max address space         unlimited            unlimited            bytes     
Max file locks            unlimited            unlimited            locks     
Max pending signals       31554                31554                signals   
Max msgqueue size         819200               819200               bytes     
Max nice priority         0                    0                    
Max realtime priority     0                    0                    
Max realtime timeout      unlimited            unlimited            us        
root@tools-webproxy-02:~#
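
As an aside, the same change can be made without compiling prlimit where a Python 3.4+ interpreter is available on a Linux host: the standard library's resource.prlimit() wraps the same system call. This is only a sketch of that alternative and not what was run here; raise_nofile is an illustrative name, and raising the hard limit of another process still requires root (CAP_SYS_RESOURCE).

```python
import resource


def raise_nofile(pid, soft, hard):
    """Set RLIMIT_NOFILE for a running process (pid 0 means the current process).

    Returns the previous (soft, hard) pair, as reported by resource.prlimit().
    """
    return resource.prlimit(pid, resource.RLIMIT_NOFILE, (soft, hard))


if __name__ == "__main__":
    # Harmless demonstration: re-apply the current limits to this process.
    current = resource.getrlimit(resource.RLIMIT_NOFILE)
    print("previous limits:", raise_nofile(0, current[0], current[1]))
```

The equivalent of the command above would be raise_nofile(1008, 8192, 8192), run as root.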

Restarted webservice for persondata, and it's up. Now upping the limit on tools-webproxy-01, and the rest of the missing webservices, and then the second part of this task.

tools-webproxy-01's proxylistener also now has 8192 (even has the same pid, had to look twice :-)).

wikihistory was in the lot that I had to restart because it wasn't registered on the proxy; now http://tools.wmflabs.org/wikihistory/ looks alright to me. @APPER, can you confirm?

So I'm guessing the root cause was the nofile limit, and not the tools-webproxy resolution failure?

You're probably right, but I didn't test that (I think the Perl code will not abort if the proxy is unknown or not reachable; don't know if the other scripts throw exceptions).

I wonder how we should monitor this... Perhaps have a 'heartbeat' port for the webproxy?

Filing follow-up tasks to prevent this from happening again.

We could actually kill two birds with one stone by having the monitoring test do this:
(a) register a status page
(b) check that the status page is reachable
(c) drop the registration.

This would check not only that the registration system works but also that the proxy... proxies.
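
A rough Python sketch of the three steps (a) to (c), to make the idea concrete. How registration and the HTTP probe would actually be done is left open here, so they are passed in as callables; all names are hypothetical, and nothing in this sketch reflects the real proxylistener protocol.

```python
from typing import Callable


def proxy_roundtrip_check(
    register: Callable[[], None],
    probe: Callable[[], bool],
    deregister: Callable[[], None],
) -> bool:
    """Register a status page, probe it through the proxy, then drop the registration.

    Returns True only if registration succeeded and the probe reported success.
    """
    try:
        register()  # (a) register a status page
    except Exception:
        return False  # the registration system itself is broken
    try:
        return bool(probe())  # (b) check that the status page is reachable
    except Exception:
        return False
    finally:
        deregister()  # (c) drop the registration, even if the probe failed


if __name__ == "__main__":
    # Stand-in for the proxy's routing table, for demonstration only.
    routes = {}
    ok = proxy_roundtrip_check(
        register=lambda: routes.__setitem__("status", "http://example.invalid/"),
        probe=lambda: "status" in routes,
        deregister=lambda: routes.pop("status", None),
    )
    print("proxy healthy:", ok)
```

Dropping the registration in a finally clause keeps a failed probe from leaving a stale entry behind, which would otherwise hide the next failure.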

Thanks everyone for fixing this. Everything works now.

Restricted Application added subscribers: Jay8g, TerraCodes.