
Database replication problems - production and labs (tracking)
Closed, Resolved, Public

Description

This is a tracking task to monitor replication problems in the WMF infrastructure, such as:

  • Replication broken or stopped to any server
  • Data or schema differences between a master and some or all of its slaves
  • Constant or intermittent replication lag degrading the service
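As a rough illustration of the three categories above, the following sketch classifies a single replica from a dictionary shaped like one row of `SHOW SLAVE STATUS` output. The keys `Slave_IO_Running`, `Slave_SQL_Running`, and `Seconds_Behind_Master` are standard MySQL/MariaDB column names; the function name, the `drift_detected` flag, and the 30-second threshold are assumptions made for this example and do not reflect actual WMF monitoring.

```python
from typing import Any, Mapping

# Hypothetical threshold for this sketch; not a WMF production value.
LAG_THRESHOLD_SECONDS = 30


def classify_replica(status: Mapping[str, Any],
                     drift_detected: bool = False) -> str:
    """Map one SHOW SLAVE STATUS-like row to a problem category."""
    io_ok = status.get("Slave_IO_Running") == "Yes"
    sql_ok = status.get("Slave_SQL_Running") == "Yes"
    # Either replication thread being down means replication is not flowing.
    if not (io_ok and sql_ok):
        return "replication broken or stopped"
    # Drift must be established by a separate comparison; it is passed in here.
    if drift_detected:
        return "data or schema difference"
    lag = status.get("Seconds_Behind_Master")
    # The server reports NULL (None here) when lag cannot be computed.
    if lag is None:
        return "lag unknown"
    if int(lag) > LAG_THRESHOLD_SECONDS:
        return "replication lag"
    return "healthy"
```

A stopped SQL thread is reported as broken regardless of the lag value, because a lag figure is not meaningful while replication is not running.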

These tasks are normally handled by the DBA team (part of Operations), often with assistance from Performance, Analytics, Cloud-Services, and the many Product teams.

NOTE: If the problem you are experiencing is about Wiki Replica databases in Cloud-Services (*.{analytics,web}.db.svc.eqiad.wmflabs, *.labsdb), use the Data-Services tag instead. Wiki Replica hosts have their own set of issues, including sanitization and multiple user account handling, so even though it is a replica service, the issue may not be replication itself.

Details

Reference
bz48930

Related Objects

This task is connected to more than 200 other tasks. Only direct parents and subtasks are shown here.
Status     Assigned    Task
Resolved   jcrespo
Resolved   Springle
Declined   coren
Resolved   coren
Declined   jcrespo
Duplicate  None
Resolved   coren
Resolved   RyanLane
Resolved   chasemp
Resolved   coren
Resolved   coren
Resolved   None
Resolved   coren
Resolved   coren
Invalid    coren
Resolved   coren
Resolved   Springle
Declined   coren
Resolved   coren
Declined   None
Stalled    None
Resolved   chasemp
Resolved   coren
Duplicate  None
Declined   coren
Resolved   None
Resolved   None
Resolved   coren
Resolved   Springle
Resolved   coren
Invalid    Springle
Resolved   Springle
Resolved   Springle
Declined   coren
Resolved   coren
Resolved   Springle
Resolved   Springle
Resolved   coren
Declined   coren
Resolved   coren
Resolved   None
Resolved   coren
Resolved   jcrespo
Resolved   jcrespo
Declined   jcrespo
Duplicate  None
Resolved   jcrespo
Declined   None
Resolved   Marostegui
Declined   None
Duplicate  None
Resolved   jcrespo
Resolved   jcrespo
Declined   None
Open       srodlund
Declined   None
Resolved   jcrespo
Resolved   chasemp
Declined   jcrespo
Invalid    None
Resolved   jcrespo
Resolved   jcrespo
Resolved   jcrespo
Resolved   jcrespo
Resolved   chasemp

Event Timeline

Restricted Application added a subscriber: Matanya. · Jun 26 2015, 8:38 AM
jcrespo moved this task from Triage to Backlog on the DBA board. · Jul 7 2015, 5:27 PM
zhuyifei1999 moved this task from Triage to Tracking on the Cloud-Services board. · Jul 16 2015, 6:01 AM
Restricted Application added a subscriber: Aklapper. · Jul 16 2015, 9:57 PM
scfc changed the status of subtask T50628: Provide replication lag as a database function from Resolved to Declined. · Apr 24 2016, 3:42 AM
Danny_B renamed this task from (Tracking) Database replication services to Database replication services (tracking). · May 27 2016, 6:01 PM
Danny_B removed a subscriber: wikibugs-l-list.
jcrespo renamed this task from Database replication services (tracking) to Database replication services - production and labs (tracking). · Nov 15 2016, 4:38 PM
jcrespo removed a project: Wikimedia-Labs-General.
jcrespo updated the task description.
jcrespo updated the task description. · Nov 15 2016, 4:49 PM
jcrespo renamed this task from Database replication services - production and labs (tracking) to Database replication problems - production and labs (tracking). · Nov 15 2016, 4:56 PM
jcrespo added a subtask: Restricted Task. · Nov 15 2016, 5:43 PM
jcrespo moved this task from Backlog to Meta/Epic on the DBA board. · Nov 15 2016, 7:04 PM
Beta16 removed a subscriber: Beta16. · Mar 9 2017, 8:29 AM
bd808 updated the task description. · Oct 18 2017, 2:22 AM
jcrespo closed this task as Resolved. · Aug 13 2018, 2:30 PM
jcrespo claimed this task.

Resolving this meta-ticket. Since the introduction of ROW-based replication before filtering, no recurring issue has happened. The few remaining issues are no longer related to replication problems; they are pending operational issues. Closing as fixed because, with the current architecture, recurring data drift issues are unlikely, and even if they did happen, a full data reload is now possible, which makes them solvable.
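
For readers unfamiliar with how data drift between a master and a replica can be spotted, here is a minimal sketch that compares per-chunk checksums of two row sets. It is not the tooling used on this task: the function names, the `(primary key, payload)` row shape, and the chunk size are all hypothetical, and real tables would be read in chunks from each server rather than held in memory.

```python
import hashlib
from typing import Dict, Iterable, List, Tuple

Row = Tuple[int, str]  # (primary key, serialized row contents); assumed shape


def chunk_checksums(rows: Iterable[Row], chunk_size: int = 2) -> Dict[int, str]:
    """Hash rows in primary-key order, producing one digest per chunk."""
    ordered = sorted(rows)
    digests: Dict[int, str] = {}
    for start in range(0, len(ordered), chunk_size):
        h = hashlib.sha256()
        for pk, payload in ordered[start:start + chunk_size]:
            h.update(f"{pk}:{payload}\n".encode("utf-8"))
        digests[start // chunk_size] = h.hexdigest()
    return digests


def drifted_chunks(master_rows: Iterable[Row],
                   replica_rows: Iterable[Row],
                   chunk_size: int = 2) -> List[int]:
    """Return indexes of chunks whose digests differ or exist on one side only."""
    m = chunk_checksums(master_rows, chunk_size)
    r = chunk_checksums(replica_rows, chunk_size)
    return sorted(k for k in m.keys() | r.keys() if m.get(k) != r.get(k))
```

Comparing digests per chunk rather than per table narrows a mismatch to a small key range, so only that range needs to be inspected or reloaded.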