
Database replication problems - production and labs (tracking)
Closed, Resolved (Public)

Description

This is a tracking task to monitor replication problems in the WMF infrastructure, such as:

  • Replication broken or stopped to any server
  • Data or schema differences between a master and some or all of its slaves
  • Constant or intermittent replication lag degrading the service
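
As a rough illustration of how the first and third categories above might be told apart on a single replica, here is a minimal sketch. It assumes a status row shaped like the output of MySQL's SHOW SLAVE STATUS (the fields Slave_IO_Running, Slave_SQL_Running and Seconds_Behind_Master); the function name classify_replica and the 10-second threshold are hypothetical and are not taken from WMF tooling. Data or schema differences (the second category) cannot be detected this way; they require comparing table contents between master and replica.

```python
from typing import Any, Mapping

# Hypothetical threshold for illustration only; not a WMF value.
LAG_THRESHOLD_SECONDS = 10


def classify_replica(status: Mapping[str, Any],
                     lag_threshold: int = LAG_THRESHOLD_SECONDS) -> str:
    """Classify one replica from a row shaped like SHOW SLAVE STATUS.

    Returns "broken", "lagging" or "ok".
    """
    io_running = status.get("Slave_IO_Running") == "Yes"
    sql_running = status.get("Slave_SQL_Running") == "Yes"
    if not (io_running and sql_running):
        # One of the replication threads is stopped.
        return "broken"

    lag = status.get("Seconds_Behind_Master")
    if lag is None:
        # MySQL reports NULL here when the lag cannot be determined.
        return "broken"

    if int(lag) > lag_threshold:
        return "lagging"
    return "ok"


if __name__ == "__main__":
    sample = {
        "Slave_IO_Running": "Yes",
        "Slave_SQL_Running": "Yes",
        "Seconds_Behind_Master": 42,
    }
    print(classify_replica(sample))
```

A single reading only describes one moment; telling constant lag from intermittent lag would need repeated samples over time.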

These tasks are normally handled by the DBA team (part of SRE), often with assistance from Performance-Team, Analytics, Cloud-Services, and the various Product teams.

NOTE: If the problem you are experiencing concerns the Wiki Replica databases in Cloud-Services (*.{analytics,web}.db.svc.eqiad.wmflabs, *.labsdb), use the Data-Services tag instead. Wiki Replica hosts have their own set of issues, including sanitization and multiple user account handling, so even though they are a replica service, the issue may not be replication itself.

Details

Reference
bz48930

Related Objects

This task is connected to more than 200 other tasks. Only direct parents and subtasks are shown here.
Status      Assigned
Resolved    jcrespo
Resolved    Springle
Declined    coren
Resolved    coren
Declined    jcrespo
Duplicate   None
Resolved    coren
Resolved    RyanLane
Resolved    chasemp
Resolved    coren
Resolved    coren
Resolved    None
Resolved    coren
Resolved    coren
Invalid     coren
Resolved    coren
Resolved    Springle
Declined    coren
Resolved    coren
Declined    None
Stalled     None
Resolved    chasemp
Resolved    coren
Duplicate   None
Declined    coren
Resolved    None
Resolved    None
Resolved    coren
Resolved    Springle
Resolved    coren
Invalid     Springle
Resolved    Springle
Resolved    Springle
Declined    coren
Resolved    coren
Resolved    Springle
Resolved    Springle
Resolved    coren
Declined    coren
Resolved    coren
Resolved    None
Resolved    coren
Resolved    jcrespo
Resolved    jcrespo
Declined    jcrespo
Duplicate   None
Resolved    jcrespo
Declined    None
Resolved    Marostegui
Declined    None
Duplicate   None
Resolved    jcrespo
Resolved    jcrespo
Declined    None
Resolved    srodlund
Declined    None
Resolved    jcrespo
Resolved    chasemp
Declined    jcrespo
Invalid     None
Resolved    jcrespo
Resolved    jcrespo
Resolved    jcrespo
Resolved    jcrespo
Resolved    chasemp

Event Timeline

Danny_B renamed this task from (Tracking) Database replication services to Database replication services (tracking). May 27 2016, 6:01 PM
Danny_B removed a subscriber: wikibugs-l-list.
jcrespo renamed this task from Database replication services (tracking) to Database replication services - production and labs (tracking). Nov 15 2016, 4:38 PM
jcrespo removed a project: Wikimedia-Labs-General.
jcrespo updated the task description.
jcrespo renamed this task from Database replication services - production and labs (tracking) to Database replication problems - production and labs (tracking). Nov 15 2016, 4:56 PM
jcrespo claimed this task.

Resolving this meta-ticket. Since the introduction of ROW-based replication before filtering, no recurring issues have occurred. The few remaining issues are no longer replication problems but pending operational issues. Closing as fixed because, with the current architecture, recurring data drift is unlikely, and even if it did happen, a full data reload is now possible, which makes it fully solvable.