
Physical location SPOF because of database server distribution on a single rack (D1)
Closed, Resolved (Public)

Description

The following database servers are in eqiad D1:
db1073
db1072
db1066
db1065

  • Together they receive 75% of the configured enwiki read traffic, and they are the only "high-end" servers, with 160 GB of RAM. They are also the only API servers. Please note that writes are negligible in comparison.

db1071
db1070

  • 100% of configured read s5 traffic (dewiki and datawiki). Please note that writes are, in comparison, negligible.

db1069

  • Sanitarium host, all writes from all shards go here

db1068
db1064

  • 83% of configured read s4 traffic (commons)

dbstore1002
dbstore1001

  • Primary and secondary go-to systems in case of failure, and the analytics host. They replicate all data from all shards and all misc servers.

Event Timeline

jcrespo created this task. (Sep 9 2015, 7:47 PM)
jcrespo claimed this task.
jcrespo raised the priority of this task to Low.
jcrespo updated the task description.
jcrespo added a subscriber: jcrespo.
Restricted Application added a subscriber: Aklapper. (Sep 9 2015, 7:47 PM)
jcrespo set Security to None.
Restricted Application added a subscriber: Matanya. (Sep 9 2015, 7:48 PM)
jcrespo moved this task from Triage to Backlog on the DBA board. (Sep 10 2015, 10:59 AM)
Volans claimed this task. (May 19 2016, 9:35 AM, edited)
Volans added a subscriber: Volans.

With the new coredb servers from T135253 we can redistribute the load. Here is my proposal for the assignment of servers:

# Current racking + proposed new servers distribution (dbNEW[1-3])

# S1
db1057: C2 # master
db1051: C2
db1052: C2
db1053: C2
db1055: C2
db1065: D1
db1066: D1
db1072: D1
db1073: D1
db1080: A2
db1083: B1
db1089: C3

# S2
db1018: B1 # master
db1021: B1
db1024: B1
db1036: B2
db1054: C2
db1060: C2
db1063: D1
db1067: D1
db1074: A2
db1076: B1
db1090: C3

# S3
db1075: A2 # master
db1015: A2
db1035: B2
db1038: B2
db1044: B2
db1077: B1
db1078: C3

# S4
db1042: B2 # master
db1019: B1
db1040: B2
db1056: C2
db1059: C2
db1064: D1
db1068: D1
db1081: A2
db1084: B1
db1091: D2

# S5
db1049: B1 # master
db1026: B1
db1045: B2
db1070: D1
db1071: D1
db1082: A2
db1087: C2
db1092: D2

# S6
db1050: B2 # master
db1022: B1
db1023: B1
db1030: B1
db1037: B2
db1061: D1
db1085: B3
db1088: C2
db1093: D2

# S7
db1041: B2 # master
db1028: B1
db1033: B2
db1034: B2
db1039: B2
db1062: D1
db1079: A2
db1086: B3
db1094: D2
jcrespo moved this task from Backlog to In progress on the DBA board. (May 19 2016, 9:45 AM)

Updated with the actual hostnames.

@Cmjohnson for the ones in row C I don't know yet which hostnames are in C2 and which in C3.

I'll re-check everything with racktables once updated.

Updated with all the new to-be-configured ones, re-checked everything with racktables.
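
The proposed racking above can be checked mechanically. Below is a minimal sketch that counts, per shard, how many hosts would remain if one rack or a whole row were lost. The rack assignments are copied from the listing; the names `PROPOSED`, `hosts_in`, and `surviving_share` are hypothetical helpers, not part of any existing tooling. It counts hosts only: the percentages quoted in this task refer to configured read traffic, which depends on per-server weights that are not in the listing, so the host share is only a rough proxy.

```python
# Rack assignments copied from the proposal above: shard -> rack -> hosts.
# The "# master" markers are omitted because only placement matters here.
PROPOSED = {
    "s1": {"C2": ["db1057", "db1051", "db1052", "db1053", "db1055"],
           "D1": ["db1065", "db1066", "db1072", "db1073"],
           "A2": ["db1080"], "B1": ["db1083"], "C3": ["db1089"]},
    "s2": {"B1": ["db1018", "db1021", "db1024", "db1076"],
           "B2": ["db1036"], "C2": ["db1054", "db1060"],
           "D1": ["db1063", "db1067"], "A2": ["db1074"], "C3": ["db1090"]},
    "s3": {"A2": ["db1075", "db1015"],
           "B2": ["db1035", "db1038", "db1044"],
           "B1": ["db1077"], "C3": ["db1078"]},
    "s4": {"B2": ["db1042", "db1040"], "B1": ["db1019", "db1084"],
           "C2": ["db1056", "db1059"], "D1": ["db1064", "db1068"],
           "A2": ["db1081"], "D2": ["db1091"]},
    "s5": {"B1": ["db1049", "db1026"], "B2": ["db1045"],
           "D1": ["db1070", "db1071"], "A2": ["db1082"],
           "C2": ["db1087"], "D2": ["db1092"]},
    "s6": {"B2": ["db1050", "db1037"],
           "B1": ["db1022", "db1023", "db1030"],
           "D1": ["db1061"], "B3": ["db1085"],
           "C2": ["db1088"], "D2": ["db1093"]},
    "s7": {"B2": ["db1041", "db1033", "db1034", "db1039"],
           "B1": ["db1028"], "D1": ["db1062"], "A2": ["db1079"],
           "B3": ["db1086"], "D2": ["db1094"]},
}


def hosts_in(racks, prefix):
    """Hosts whose rack name starts with prefix ('D1' = one rack, 'D' = whole row)."""
    return [host
            for rack, hosts in racks.items() if rack.startswith(prefix)
            for host in hosts]


def surviving_share(racks, prefix):
    """Fraction of a shard's hosts left if every rack matching prefix fails."""
    total = sum(len(hosts) for hosts in racks.values())
    return 1 - len(hosts_in(racks, prefix)) / total


for shard, racks in PROPOSED.items():
    print(f"{shard}: {surviving_share(racks, 'D1'):.0%} of hosts outside D1, "
          f"{surviving_share(racks, 'D'):.0%} outside row D")
```

Matching on a rack-name prefix keeps the single-rack and whole-row cases in one function. Replacing the host lists with per-host read weights would turn the same calculation into a traffic-based estimate, assuming those weights are available from the load balancer configuration.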

Volans removed Volans as the assignee of this task. (May 27 2016, 10:53 AM)
Volans moved this task from In progress to Next on the DBA board.

The distribution scheme above for the new servers will resolve this issue. It is pending on the blocking task(s).

jcrespo closed this task as Resolved. (Jun 16 2016, 8:50 AM)
jcrespo claimed this task.

This is now fixed; D1 is no longer a SPOF. Although the impact would be somewhat heavy, if D1 or the whole D row went down, we would still have more than 75% of the capacity of all services on other rows.

dbstore100[12] is not an issue because we have dbstore200[12].