
Physical location SPOF because of database server distribution on a single rack (D1)
Closed, ResolvedPublic

Description

The following database servers are in eqiad D1:
db1073
db1072
db1066
db1065

  • Together they receive 75% of the configured enwiki read traffic and are the only "high-end" servers, with 160 GB of RAM. They are also the only API servers. Please note that writes are negligible in comparison.

db1071
db1070

  • 100% of configured read s5 traffic (dewiki and datawiki). Please note that writes are, in comparison, negligible.

db1069

  • Sanitarium host, all writes from all shards go here

db1068
db1064

  • 83% of configured read s4 traffic (commons)

dbstore1002
dbstore1001

  • Primary and secondary go-to systems in case of failure, and the analytics host. They replicate all data from all shards and all misc servers.

Event Timeline

jcrespo claimed this task.
jcrespo raised the priority of this task from to Low.
jcrespo updated the task description.
jcrespo added a subscriber: jcrespo.
Volans added a subscriber: Volans.

With the new coredb servers from T135253 we can redistribute the load. Here is my proposal for the assignment of servers:

# Current racking + proposed new servers distribution (dbNEW[1-3])

# S1
db1057: C2 # master
db1051: C2
db1052: C2
db1053: C2
db1055: C2
db1065: D1
db1066: D1
db1072: D1
db1073: D1
db1080: A2
db1083: B1
db1089: C3

# S2
db1018: B1 # master
db1021: B1
db1024: B1
db1036: B2
db1054: C2
db1060: C2
db1063: D1
db1067: D1
db1074: A2
db1076: B1
db1090: C3

# S3
db1075: A2 # master
db1015: A2
db1035: B2
db1038: B2
db1044: B2
db1077: B1
db1078: C3

# S4
db1042: B2 # master
db1019: B1
db1040: B2
db1056: C2
db1059: C2
db1064: D1
db1068: D1
db1081: A2
db1084: B1
db1091: D2

# S5
db1049: B1 # master
db1026: B1
db1045: B2
db1070: D1
db1071: D1
db1082: A2
db1087: C2
db1092: D2

# S6
db1050: B2 # master
db1022: B1
db1023: B1
db1030: B1
db1037: B2
db1061: D1
db1085: B3
db1088: C2
db1093: D2

# S7
db1041: B2 # master
db1028: B1
db1033: B2
db1034: B2
db1039: B2
db1062: D1
db1079: A2
db1086: B3
db1094: D2
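A listing in this format can be checked for rack concentration mechanically. The following is only a minimal sketch: the function names (`parse_racking`, `rack_counts`, `hosts_left_without`) are hypothetical, the sample input is the S5 section copied from the listing above, and it counts hosts rather than configured traffic weights, so it is a rough proxy for capacity at best.

```python
from collections import Counter, defaultdict

# S5 section copied from the listing above; other shards use the same format.
S5_LISTING = """
# Current racking + proposed new servers distribution (dbNEW[1-3])

# S5
db1049: B1 # master
db1026: B1
db1045: B2
db1070: D1
db1071: D1
db1082: A2
db1087: C2
db1092: D2
"""


def parse_racking(text):
    """Parse '# S5' headers and 'host: RACK' lines into {shard: {host: rack}}."""
    shards = defaultdict(dict)
    current = None
    for raw in text.splitlines():
        line = raw.strip()
        if not line:
            continue
        if line.startswith("#"):
            label = line.lstrip("#").strip()
            # Only single-word labels such as "S5" open a new section;
            # longer comment lines are skipped.
            if len(label.split()) == 1:
                current = label
            continue
        entry = line.split("#", 1)[0]  # drop trailing notes such as "# master"
        if current is None or ":" not in entry:
            continue
        host, rack = (part.strip() for part in entry.split(":", 1))
        shards[current][host] = rack
    return dict(shards)


def rack_counts(hosts):
    """Number of hosts per rack for one shard."""
    return Counter(hosts.values())


def hosts_left_without(hosts, prefix):
    """Hosts that survive losing a rack ("D1") or a whole row ("D")."""
    return sorted(h for h, r in hosts.items() if not r.startswith(prefix))


if __name__ == "__main__":
    s5 = parse_racking(S5_LISTING)["S5"]
    print(dict(rack_counts(s5)))
    print("left without D1:", hosts_left_without(s5, "D1"))
    print("left without row D:", hosts_left_without(s5, "D"))
```

Matching on a rack-name prefix lets the same helper answer both the single-rack and the whole-row question. A real check would need the per-host load weights from the MediaWiki database configuration, which are not part of this listing.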

Updated with the actual hostnames.

@Cmjohnson for the ones in row C I don't know yet which hostnames are in C2 and which in C3.

I'll re-check everything with racktables once updated.

Updated with all the new to-be-configured ones, re-checked everything with racktables.

Volans moved this task from In progress to Pending comment on the DBA board.

The above schema for the distribution of the new servers will resolve this issue. It is pending on the blocking task(s).

jcrespo claimed this task.

This is now fixed; D1 is no longer a SPOF. Although the impact would be somewhat heavy, if D1 or the whole D row went down, we would still have more than 75% of the capacity of all services on other rows.

dbstore100[12] is not an issue because we have dbstore200[12].