Physical location SPOF because of database server distribution on a single rack (D1)
The following database servers are in eqiad D1:

  • Together they receive 75% of the configured enwiki read traffic, and they are the only "high-end servers", with 160GB of RAM. They are also the only API servers. Please note that writes are, in comparison, negligible.


  • 100% of configured read s5 traffic (dewiki and datawiki). Please note that writes are, in comparison, negligible.


  • Sanitarium host; all writes from all shards go here


  • 83% of configured read s4 traffic (commons)


  • Primary and secondary go-to systems in case of failure, and the analytics host. They replicate all data from all shards and all misc servers.

With the new coredb servers from T135253 we can redistribute the load. Here is my proposal for the assignment of servers:

# Current racking + proposed new servers distribution (dbNEW[1-3])

# S1
db1057: C2 # master
db1051: C2
db1052: C2
db1053: C2
db1055: C2
db1065: D1
db1066: D1
db1072: D1
db1073: D1
db1080: A2
db1083: B1
db1089: C3

# S2
db1018: B1 # master
db1021: B1
db1024: B1
db1036: B2
db1054: C2
db1060: C2
db1063: D1
db1067: D1
db1074: A2
db1076: B1
db1090: C3

# S3
db1075: A2 # master
db1015: A2
db1035: B2
db1038: B2
db1044: B2
db1077: B1
db1078: C3

# S4
db1042: B2 # master
db1019: B1
db1040: B2
db1056: C2
db1059: C2
db1064: D1
db1068: D1
db1081: A2
db1084: B1
db1091: D2

# S5
db1049: B1 # master
db1026: B1
db1045: B2
db1070: D1
db1071: D1
db1082: A2
db1087: C2
db1092: D2

# S6
db1050: B2 # master
db1022: B1
db1023: B1
db1030: B1
db1037: B2
db1061: D1
db1085: B3
db1088: C2
db1093: D2

# S7
db1041: B2 # master
db1028: B1
db1033: B2
db1034: B2
db1039: B2
db1062: D1
db1079: A2
db1086: B3
db1094: D2
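
To check how concentrated a shard is in one rack or row, the listing above can be parsed mechanically. The following is only a minimal sketch: the function names (`parse_racking`, `rack_share`, `hosts_surviving`) are made up for illustration, only the S5 block is pasted in, and it counts hosts rather than configured traffic weights, so it approximates capacity under the assumption that all hosts are equal.

```python
from collections import Counter

# Excerpt of the listing above (shard S5 only); paste in other shards as needed.
RACKING = """
# S5
db1049: B1 # master
db1026: B1
db1045: B2
db1070: D1
db1071: D1
db1082: A2
db1087: C2
db1092: D2
"""


def parse_racking(text):
    """Return {shard: {host: rack}} from 'host: rack' lines grouped under '# Sx' headers."""
    shards = {}
    current = None
    for raw in text.splitlines():
        line = raw.strip()
        if not line:
            continue
        if line.startswith("#"):
            # A header line such as "# S5" starts a new shard.
            current = line.lstrip("# ").strip()
            shards[current] = {}
            continue
        # Drop trailing comments such as "# master" before splitting.
        entry = line.split("#", 1)[0]
        host, rack = (part.strip() for part in entry.split(":", 1))
        shards[current][host] = rack
    return shards


def rack_share(hosts):
    """Fraction of a shard's hosts in each rack (host count, not traffic weight)."""
    counts = Counter(hosts.values())
    return {rack: n / len(hosts) for rack, n in counts.items()}


def hosts_surviving(hosts, lost_prefix):
    """Fraction of hosts left if every rack whose name starts with lost_prefix is lost."""
    left = [h for h, rack in hosts.items() if not rack.startswith(lost_prefix)]
    return len(left) / len(hosts)


shards = parse_racking(RACKING)
s5 = shards["S5"]
print(rack_share(s5)["D1"])        # share of S5 hosts in rack D1
print(hosts_surviving(s5, "D1"))   # share left if rack D1 is lost
print(hosts_surviving(s5, "D"))    # share left if the whole D row is lost
```

For the S5 excerpt this reports 0.25 of hosts in D1, 0.75 of hosts remaining if D1 is lost, and 0.625 remaining if the whole D row is lost. A real capacity check would need the per-host read weights from the load balancer configuration, which are not part of this listing.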

Updated with the actual hostnames.

@Cmjohnson for the ones in row C I don't know yet which hostnames are in C2 and which in C3.

I'll re-check everything with racktables once updated.

Updated with all the new to-be-configured ones, re-checked everything with racktables.

The above scheme for distributing the new servers will resolve this issue. It is pending the blocking task(s).

This is now fixed; D1 is no longer a SPOF. Although the load would be somewhat heavy, if D1 or the whole D row went down, we would still have more than 75% of the capacity of all services on other rows.

dbstore100[12] is not an issue because we have dbstore200[12].