
Provide resource for db access in grid
Closed, Declined (Public)

Description

Long maintenance operations are currently being done on meriadb10, and future updates will require more of them. An SQL resource would prevent SGE scripts from running while the database is down or broken, and would schedule them later, once the database is available again.

On Toolserver, many different SGE resources were defined. On Labs there will probably be only three database servers, so one resource for the replicated databases, one for tools-db, and one for PostgreSQL should be enough.

This resource should be set to 0 while maintenance is in progress or replication is broken. This could be done manually or by a load sensor script.
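A minimal sketch of what such a load sensor could look like, assuming the usual Grid Engine load sensor protocol (the execution daemon writes a line to the sensor's stdin to request a report, and the sensor answers with a `begin` ... `end` block of `host:resource:value` lines, exiting on `quit`). The resource names `sql_replica`, `sql_tools` and `sql_postgres` and the `check_databases` helper are hypothetical placeholders, not names defined anywhere in this task.

```python
import sys

# Hypothetical resource names; the task only proposes one resource per database server.
RESOURCES = ("sql_replica", "sql_tools", "sql_postgres")


def format_report(status, host="global"):
    """Build one load report: 'begin', one host:name:value line per resource, 'end'."""
    lines = ["begin"]
    for name in RESOURCES:
        # 1 means "available", 0 means "maintenance or broken replication".
        lines.append(f"{host}:{name}:{1 if status.get(name) else 0}")
    lines.append("end")
    return "\n".join(lines) + "\n"


def check_databases():
    """Placeholder health check; replace with real probes (connect, read replag, ...)."""
    return {name: True for name in RESOURCES}


def run(stdin=sys.stdin, stdout=sys.stdout, check=check_databases):
    # Each input line is a request for a fresh report; "quit" ends the sensor.
    for line in stdin:
        if line.strip() == "quit":
            break
        stdout.write(format_report(check()))
        stdout.flush()


if __name__ == "__main__":
    run()
```

With a sensor like this, a job would request the resource at submission time and simply stay queued while the value is 0. Setting the value manually during planned maintenance would remain possible alongside the sensor.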

For this purpose the resource only has to be a simple indicator; it does not need to be consumable. On Toolserver the resource was consumable, which limited the number of queries running at the same time and prevented heavy peak usage. I don't know whether this is needed on Labs, too.


Version: unspecified
Severity: normal

Details

Reference
bz68881

Event Timeline

bzimport raised the priority of this task to Needs Triage. (Nov 22 2014, 3:37 AM)
bzimport added a project: Toolforge.
bzimport set Reference to bz68881.
coren removed coren as the assignee of this task. (Mar 25 2015, 7:36 PM)
coren triaged this task as Low priority.
coren set Security to None.

Today dewiki has had high replag for about 12 hours (more than 3 hours of replag).

Many of my SGE jobs have been testing replag and rescheduling themselves (return code 99) for hours now. This is the behavior the db-admins recommended a long time ago. I still think there should be a resource for this.
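For readers unfamiliar with the pattern, here is a rough sketch of the check-and-reschedule approach described above: the job runs its checks first and exits with status 99 if any of them fails, so that the grid reschedules it. The 5-minute threshold, the `replag_ok` helper and the `get_replag_seconds` callable are assumptions for illustration; how the lag is actually measured is left to the job.

```python
import sys

RESCHEDULE = 99  # exit status the task mentions for "reschedule this job"
MAX_REPLAG_SECONDS = 300  # assumed threshold, not taken from the task


def replag_ok(get_replag_seconds, limit=MAX_REPLAG_SECONDS):
    """Build a check from a callable that returns the current lag in seconds."""
    return lambda: get_replag_seconds() <= limit


def exit_code_for(checks):
    """Return 0 if every check passes, otherwise the reschedule status."""
    return 0 if all(check() for check in checks) else RESCHEDULE


def main(checks, work):
    """Run the real work only when all checks pass; return the exit status."""
    code = exit_code_for(checks)
    if code == 0:
        work()
    return code


if __name__ == "__main__":
    # Stub lag source and stub work; a real job would query the replica here.
    sys.exit(main([replag_ok(lambda: 0)], lambda: print("running job")))
```

Because `checks` is just a list of callables, a job can combine a replag check with any other precondition it cares about.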

valhallasw added a subscriber: valhallasw.

Unfortunately, we don't have the in-house knowledge to implement and maintain such a custom resource. I think 'check-and-reschedule' is a sane workaround, with the added advantage that it can work with any combination of checks one wishes to use.