unreachable (503 error) after migration to eqiad
Closed, Resolved (Public)

Description

currently gives the following response:

Request: GET, from via deployment-cache-mobile03 deployment-cache-mobile03 ([]:3128), Varnish XID 723150036
Forwarded for:,
Error: 503, Service Unavailable at Mon, 31 Mar 2014 18:09:57 GMT

Version: unspecified
Severity: blocker



Event Timeline

bzimport raised the priority of this task from to Needs Triage. (Nov 22 2014, 3:00 AM)
bzimport set Reference to bz63315.

Request: GET, from via deployment-cache-text02 deployment-cache-text02 ([]:3128), Varnish XID 105897343
Forwarded for:,
Error: 503, Service Unavailable at Mon, 31 Mar 2014 18:17:47 GMT

Request: GET, from via deployment-cache-bits01 deployment-cache-bits01 ([]:80), Varnish XID 90403984
Forwarded for:
Error: 503, Service Unavailable at Mon, 31 Mar 2014 18:40:55 GMT

Pages requested while logged out (no cookies) are mostly served, or are perhaps just hitting the cache(?), but bits doesn't work either.
Sometimes the connection also times out.
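
For comparing the error reports quoted above, here is a minimal Python sketch that pulls the cache host, port, Varnish XID, status and timestamp out of one report. The regular expression is an assumption based only on the three reports pasted in this task; `parse_report` is a hypothetical helper, not part of Varnish or any Wikimedia tooling.

```python
import re
from datetime import datetime, timezone

# Pattern for the three-line error text quoted in this task. Fields that were
# blank in the reports (client address, Forwarded-for) are not captured.
REPORT_RE = re.compile(
    r"via (?P<cache>\S+) .*?:(?P<port>\d+)\), Varnish XID (?P<xid>\d+)"
    r".*?Error: (?P<status>\d{3}), (?P<reason>.+?) at (?P<when>.+? GMT)",
    re.DOTALL,
)


def parse_report(text):
    """Return a dict of the fields in one error report, or None if no match."""
    m = REPORT_RE.search(text)
    if m is None:
        return None
    # The timestamp format is assumed from the samples (English day/month names).
    when = datetime.strptime(m["when"], "%a, %d %b %Y %H:%M:%S GMT")
    return {
        "cache": m["cache"],
        "port": int(m["port"]),
        "xid": int(m["xid"]),
        "status": int(m["status"]),
        "reason": m["reason"],
        "when": when.replace(tzinfo=timezone.utc),
    }
```

Sorting the parsed reports by `when` and grouping them by `cache` would show whether the mobile, text and bits caches were all failing over the same period.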

Change 122436 had a related patch set uploaded by Hashar:
beta: lower # of procs on jobrunner

The CirrusSearch update job kicked in and started parsing the whole of simplewiki, which is a bit large for the beta cluster. Because our jobrunner (deployment-jobrunner01) is configured like production (launching a lot of jobs), the jobs were starving the application servers by querying /w/api.php ...

I lowered the number of job runners with

There might be some other issue.
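
To illustrate why lowering the number of job runner processes helps, here is a small Python sketch of the general idea: the worker cap bounds how many requests can hit the backend at once. This is not the actual jobrunner code; `peak_concurrency` and the fake job are hypothetical stand-ins, and the sleep only simulates a call to /w/api.php.

```python
import threading
import time
from concurrent.futures import ThreadPoolExecutor


def peak_concurrency(num_jobs, max_workers, job_seconds=0.01):
    """Run num_jobs fake jobs and return the highest number seen in flight."""
    lock = threading.Lock()
    in_flight = 0
    peak = 0

    def fake_job(_):
        nonlocal in_flight, peak
        with lock:
            in_flight += 1
            peak = max(peak, in_flight)
        time.sleep(job_seconds)  # stands in for one request to the backend
        with lock:
            in_flight -= 1

    # max_workers plays the role of the "# of procs" setting on the jobrunner.
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        list(pool.map(fake_job, range(num_jobs)))
    return peak
```

However many jobs are queued, the backend never sees more than `max_workers` simultaneous requests, so a smaller cap leaves capacity for ordinary page views.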

I tried restarting both apaches, without much success. Eventually I killed the Parsoid daemon, which was spamming the application server as well.

The root cause is definitely Parsoid doing a lot of queries on the API service.

So Parsoid was attempting to parse all of simplewiki. I have stopped the daemon and restarted it. Monitoring /var/log/parsoid/parsoid.log shows it is all quiet on that front now, so the API application servers are no longer being hammered.
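
The "is the log quiet?" check can be scripted. Here is a minimal Python sketch that returns the lines appended to a file since the last check. The Parsoid log format is not assumed, so the helper only returns raw lines; `new_lines_since` is a hypothetical name, not an existing tool.

```python
import os


def new_lines_since(path, offset=0):
    """Return (complete lines appended after byte `offset`, new offset)."""
    size = os.path.getsize(path)
    if size < offset:
        offset = 0  # file was truncated or rotated; start from the beginning
    with open(path, "rb") as f:
        f.seek(offset)
        data = f.read()
    # Only consume complete lines; a partial trailing line is left for the
    # next call so it is never split in two.
    end = data.rfind(b"\n") + 1
    lines = data[:end].splitlines()
    return lines, offset + end
```

Calling it periodically on /var/log/parsoid/parsoid.log and watching `len(lines)` gives a rough measure of activity; a count that stays near zero matches the "all quiet" observation.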

Also bits might be fully loaded by now.

I think the issue is solved now. The root cause was Parsoid attempting to fetch a lot of page info from the API server for some reason. Restarting Parsoid apparently stopped the spam.

(In reply to Daniel Zahn from comment #9)

would that make obsolete or not

That one lowers the number of jobs run in parallel on the jobrunner01 instance. It is unrelated, but still a good thing to have, since the instance is less powerful than our prod servers.

Change 122436 merged by Alexandros Kosiaris:
beta: lower # of procs on jobrunner