
VE is not loading on Beta Cluster, getting 503s
Closed, Resolved · Public

Description

VE is not loading on Beta Cluster, getting 503s

Event Timeline

Ryasmeen renamed this task from "VE not loading on Beta Cluster, getting 503s" to "VE is not loading on Beta Cluster, getting 503s". (Jun 28 2018, 8:41 PM)
Ryasmeen updated the task description.

Looks like RB is timing out trying to connect to parsoid:

krenair@deployment-cache-text04:~$ curl http://deployment-restbase01.deployment-prep.eqiad.wmflabs:7231/en.wikipedia.beta.wmflabs.org/v1/page/html/14thjulyFF
{"type":"https://mediawiki.org/wiki/HyperSwitch/errors/internal_http_error","method":"get","detail":"Error: ESOCKETTIMEDOUT","uri":"http://deployment-parsoid09.deployment-prep.eqiad.wmflabs:8000/en.wikipedia.beta.wmflabs.org/v3/page/pagebundle/14thjulyFF/112456"}

Tried restarting the parsoid service on deployment-parsoid09 and then fetching the URI above.
Based on tail -f /srv/log/parsoid/main.log | grep -v ChangePropagation, it did try wt2html for that page. I haven't managed to get it to repeat that, or to do it for other pages.

Confirmed that Parsoid is on b068bb51d29e294a4f4a875ae829cca8cf314205 in both prod and beta.
beta:

deployment-tin$ curl http://deployment-parsoid09.deployment-prep.eqiad.wmflabs:8000/_version
{"name":"parsoid","version":"0.9.0","sha":"b068bb51d29e294a4f4a875ae829cca8cf314205"}

and prod:

deployment:~$ for wtp in `grep wtp /etc/dsh/group/parsoid`; do echo -n "Querying $wtp: "; curl "http://$wtp:8000/_version"; echo; done;
Querying wtp1025.eqiad.wmnet: {"name":"parsoid","version":"0.9.0","sha":"b068bb51d29e294a4f4a875ae829cca8cf314205"}
[...etc...]


That could be an indication that the task queues aren't being cleared fast enough by the workers, but I'm not sure. If the queues were completely full, requests would fail hard rather than hang until a socket timeout.
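The distinction above (a slowly draining queue causing timeouts vs. a full queue failing hard) can be sketched with a bounded queue. This is an illustrative model only, not Parsoid's actual worker implementation; the queue size and request names are invented:

```python
import queue

# Hypothetical worker queue with 2 slots.
q = queue.Queue(maxsize=2)

# Slow workers: the queue fills gradually; enqueues still succeed,
# but each task waits behind the backlog, so callers see timeouts
# (the ESOCKETTIMEDOUT symptom observed above).
q.put("req-1")
q.put("req-2")

# Full queue: a further non-blocking enqueue is rejected immediately
# ("fail hard") instead of hanging until a client-side timeout.
try:
    q.put_nowait("req-3")
    outcome = "accepted"
except queue.Full:
    outcome = "rejected immediately"

print(outcome)  # -> rejected immediately
```

Since the observed failures were timeouts rather than immediate errors, the full-queue scenario seems less likely, which matches the doubt expressed above.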

Requests handled directly by the server (i.e. 302s/404s) were always answered promptly. I restarted the service and, at least for the moment, it is restored.

Deskana claimed this task.

I'll take it.