Page MenuHomePhabricator

Get notified whenever we have 2 or more db masters per rack
Open, MediumPublic

Description

https://fault-tolerance.toolforge.org/map?cluster=db-masters is a great tool to identify masters per rack.
Let's try to get an email or phab task whenever we have more than 2 master (for now) per rack so we can act on it.

Event Timeline

Marostegui triaged this task as Medium priority.
Marostegui moved this task from Triage to Ready on the DBA board.

I really want to solve this problem in a more holistic way (and include candidate masters and other things too).

Over the weekend brain dump: Making sure databases don't overlap in racks as much as possible is a variation of graph coloring problem. Every database is a node, if there is a conflict, it should be an edge between the nodes (all masters should have edges to each other, and all candidates too and the masters, plus all replicas of each section should be connected to each other replicas but not other sections' replicas). Then each rack would be a color. If the graph is properly colored, then there is not overlap. But obviously we have a lot of conflicts in production. This turns the problem into another variation of graph coloring called "color fixing" (or color fixing an improperly colored graph). For which I found these papers:

  • Garnero, Valentin; Junosza-Szaniawski, Konstanty; Liedloff, Mathieu; Montealegre, Pedro; Rzążewski, Paweł (2018-02-08). "Fixing improper colorings of graphs". Theoretical Computer Science. 711: 66–78. doi:10.1016/j.tcs.2017.11.013. ISSN 0304-3975.
  • De Biasi, Marzio; Lauri, Juho (2019-05-01). "On the complexity of restoring corrupted colorings". Journal of Combinatorial Optimization. 37 (4): 1150–1169. doi:10.1007/s10878-018-0342-2. ISSN 1573-2886.

The problem optimizes to the least number of changes needed (=keeping number of moving databases to another section or rack to minimum). As they proved in the first paper (or at least that's my understanding of it), this is a NP-complete problem which means either we can do an approximate solution (via greedy algorithms) or the solution might not be computationally feasible for the size of the graph (exhaustive search or backtracking). I also need to read the papers in more depth.

I'll play with writing some code that might or might not work. I hope given that the changes shouldn't be too many, backtracking algorithms might work.

@FCeratto-WMF Suggested we could look into z3 or other SMT. Writing it here so I don't forget.