Investigate job queue retries for Parsoid jobs
Closed, Duplicate
Public

Assigned To

None

Authored By

	• GWicke
	Jan 6 2015, 6:32 PM

Description

In https://wikitech.wikimedia.org/wiki/Incident_documentation/20150103-Parsoid, requests to a specific page seemed to be retried a large number of times. Since requests for this page locked up Parsoid workers, the Parsoid cluster quickly became overloaded.

We should check why such a large number of retries are happening. Things to look into:

Varnish backend retries on timeout (both frontend and backend)
Parsoid job retries
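
For reference while investigating, here is a minimal sketch of what a working retry cap should look like: a job that keeps failing is executed at most a fixed number of times and then abandoned. This is not the actual job queue or Varnish code; `Job`, `run_with_retry_limit`, and `MAX_ATTEMPTS` are hypothetical names and the limit of 3 is an assumed value for illustration.

```python
from dataclasses import dataclass

# Hypothetical limit for illustration; not taken from the real job queue config.
MAX_ATTEMPTS = 3


@dataclass
class Job:
    page: str
    attempts: int = 0  # number of times this job has been executed so far


def run_with_retry_limit(job, execute, max_attempts=MAX_ATTEMPTS):
    """Execute a job until it succeeds or max_attempts executions were made.

    Returns True on success, False if the job was abandoned.
    """
    while job.attempts < max_attempts:
        # Count the attempt before running, so a crash or timeout
        # inside execute() still counts against the limit.
        job.attempts += 1
        try:
            execute(job)
            return True
        except Exception:
            # Failed attempt: loop again only if the limit allows it.
            continue
    return False
```

If the attempt counter is lost or reset between executions (for example when a worker hangs and the job is reclaimed), a loop like this never reaches the limit, which would match the behaviour seen in the incident. That is a hypothesis to check, not a confirmed cause.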

Possibly related: T73853

Related Objects

Mentioned Here: T73853: Retry counts not working / jobs re-executed beyond retry limits

Event Timeline

• GWicke created this task. Jan 6 2015, 6:32 PM

• GWicke raised the priority of this task from to Needs Triage.

• GWicke updated the task description.

• GWicke added projects: Parsoid, Services, MediaWiki-Core-JobQueue.

• GWicke subscribed.

• GWicke updated the task description. Feb 7 2015, 5:21 AM

• GWicke set Security to None.

This happened again today, with another bug in Parsoid causing an infinite loop on 2 enwiki pages (and yesterday on 3 plwiki pages). The job queue relentlessly retried the two pages over the last 8 hours, which required 2 restarts and the deployment of a hotfix to fix the infinite loop.

ssastry triaged this task as Medium priority. Mar 24 2015, 4:47 PM

I thought this was looked into and there was nothing odd about job retries (actually, I recall finding and fixing a bug that made the retry count 1 less than it should be).

aaron closed this task as a duplicate of T73853: Retry counts not working / jobs re-executed beyond retry limits. Jun 3 2015, 9:59 PM
