eqiad 10G ports needs
Closed, Resolved, Public


The asw-a/b/c row refresh in eqiad will add 1 more 10G switch (48 ports) per row, bringing the total number of 10G ports in eqiad from ~432 to ~528. The switch cycle is ~5 years, and row D will most likely be retrofitted with an extra 10G switch in the future.

This task tracks the 1G->10G upgrades of servers, so that 1/ teams that need 10G are aware that we have the capacity, and 2/ we don't exceed that capacity.

Last August, when planning the switch upgrade, I talked with the various teams in charge of data-heavy servers and noted the potential evolutions below.
As this is from 7 months ago, it's now mostly a baseline and should be updated with more accurate numbers.

| Service | Evolution (5y) | Note |
| --- | --- | --- |
| Databases backups/provisioning service | +10 | |
| Media storage backend | +6/8 per year (40 in 5 years) | |
| Cloud | +8/10 (over 2 years) | |
| Memcache | no planned changes | |
| Hadoop | +10/12 nodes/y (no need for 10G) | |
| Kafka | max doubling (probably not 10G) | |
| Kubernetes | pretty sure won't need 10G | |
| Ganeti | | Peaks close to 1G |
| Elasticsearch | | Peaks close to 1G |
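
As a rough sanity check of the baseline above, the requested ports can be summed and compared with the ports the refresh adds. This is only a sketch under stated assumptions: it takes the upper bound of each range, counts only the three rows with an explicit 10G request, and compares them against the difference between the approximate totals quoted in the description (~528 minus ~432). The names `ADDED_PORTS`, `baseline_demand` and `headroom` are illustrative, not part of any existing tooling.

```python
# Approximate totals quoted in the task description (assumption: the
# new demand is compared only against the ports added by the refresh).
PORTS_BEFORE = 432
PORTS_AFTER = 528
ADDED_PORTS = PORTS_AFTER - PORTS_BEFORE

# Upper bounds from the baseline table, in 10G ports over ~5 years.
# Rows marked as not needing 10G are left out.
baseline_demand = {
    "Databases backups/provisioning service": 10,
    "Media storage backend": 40,
    "Cloud": 10,
}


def headroom(added_ports, demand):
    """Return the ports left over once every request is satisfied."""
    return added_ports - sum(demand.values())


print("added ports:", ADDED_PORTS)
print("requested ports:", sum(baseline_demand.values()))
print("headroom:", headroom(ADDED_PORTS, baseline_demand))
```

Any service that later moves to 10G (for example on its refresh schedule) can be added to the dictionary to see how quickly the headroom shrinks.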

Event Timeline

Restricted Application added a subscriber: Aklapper.

For elasticsearch we are expecting to switch all servers to 10G on their standard server refresh schedule. I'm not sure what exactly that schedule is, though. Soon there will be 43 elasticsearch servers in eqiad. If I had to make a 5 year guess, 60 might be reasonable?

ayounsi mentioned this in Unknown Object (Task). Mar 21 2018, 11:33 PM

I have only hunches and no data to back any of this, but I think ElasticSearch, Hadoop, WMCS, Backups, plus probably Ganeti and Kafka would be good candidates to go 10G-only. Kubernetes I could see it going either way, depending on the density we'll come up with.

(I presume the decrease in Traffic is in number of ports, not in 10G->1G :)

I'd love to see stream processing from Kafka running in Kubernetes one day (pipe dream!), and that could be highish traffic.

  • Backup servers (heze/helium in the current incarnation) will definitely have 10G (we've already budgeted for it).
  • Ganeti hosts are not so clear. Per grafana eqiad [1] and grafana codfw [2] we still don't need 10G there. codfw's traffic is the actual representative one since the latest large spikes/plateaus in eqiad are probably due to me doing many very heavy IO tests for T181121. Since this is long term planning and T181121 has probably been resolved, we should wait a few weeks and see if that is true. Of course we can only do simple projections and can't really predict the future, so it's difficult to say for sure. My hunch is that for now we don't need 10G and we probably won't need 10G on ganeti hosts for another 1-2 years. After that, I don't know.
  • Kubernetes hosts have just gone into production and are handling very minimal traffic, and the entire idea of that infrastructure is to scale out, not scale up. So even if we end up running kafka stream processing (or anything else, for that matter) in kubernetes, 10G seems to me like a waste of money, and I agree with the "pretty sure we won't need 10G".


I've clarified the database backups/provisioning service entry: the goal is for it to comfortably recover multiple databases at the same time after a catastrophic failure, reducing both TTR and setup time.

Regular databases do not need 10G at all: all core databases together use 200MB/s per datacenter.

The number of ports/hosts should be initially 3 per datacenter, but could be more as growth happens.
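
To put the 200MB/s figure in link-speed terms, a quick unit conversion helps. This is a minimal sketch that assumes decimal units (1 MB = 10^6 bytes, 1 Gbit = 10^9 bits) and that the figure is the aggregate across all core databases in one datacenter, as stated above; the function name is illustrative.

```python
def mbytes_to_gbits(mbytes_per_second):
    """Convert MB/s to Gbit/s using decimal units (assumption)."""
    return mbytes_per_second * 8 / 1000


# Aggregate throughput of all core databases in one datacenter,
# as quoted in the comment above.
aggregate_gbits = mbytes_to_gbits(200)

print(f"aggregate: {aggregate_gbits:.1f} Gbit/s")
print(f"share of one 10G link: {aggregate_gbits / 10:.0%}")
```

Since that aggregate is spread across many hosts, each individual database host stays well within a 1G link, which is consistent with regular databases not needing 10G.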

RobH triaged this task as Medium priority. Mar 26 2018, 7:02 PM
RobH added a subscriber: RobH.

I'm simply trying to reduce our number of 'needs triage' tasks in SRE. This seems to be an issue that is either normal, or higher priority. Due to the timeline of the 10G upgrade in eqiad, I'll flag this as normal. (High seems more for items that need to be done in advance of other tasks.)

I have better information on the Databases backups/provisioning service:

In the end, only 2 hosts per datacenter (T216137, T216138) over the next 3-5 years.

A summary of the logical architecture can be seen at: as a starting point for a better understanding towards netops

ayounsi claimed this task.

I don't think there is much value in this task anymore (it was for last year). The spreadsheet for next FY capex has a 10G column.