
eqiad 10G ports needs
Open, Normal, Public

Description

The asw-a/b/c rows refresh in eqiad will add 1 more 10G switch (48 ports) per row, bringing the total number of 10G ports in eqiad from ~432 to ~528. The switch cycle is ~5 years, and row D will most likely be retrofitted with an extra 10G switch in the future.

This task is to track the 1G->10G upgrades of servers, so that 1/ teams that need 10G are aware that we have the capacity, and 2/ we don't exceed that capacity.

Last August, when planning the switch upgrade, I talked with the various teams in charge of servers that are usually data-heavy and noted these potential evolutions.
As this is from 7 months ago, it is now mostly a baseline and should be updated with more accurate numbers.

| Service | Evolution (5y) | Note |
| --- | --- | --- |
| Databases backups/provisioning service | +10 | |
| Media storage backend | +6/8 per year (40 in 5 years) | |
| Cloud | +8/10 (over 2 years) | |
| Traffic | decrease | |
| Memcache | no planned changes | |
| Hadoop | +10/12 nodes/y (no need for 10G) | |
| Kafka | max doubling (probably not 10G) | |
| Kubernetes | pretty sure won't need 10G | |
| Ganeti | Peaks close to 1G | |
| Backups | | |
| Elasticsearch | Peaks close to 1G | |
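
As a rough way to check point 2/ ("we don't exceed that capacity"), the numeric estimates in the table can be summed and compared with the ports added by the refresh. This is only a sketch under stated assumptions: ranges such as "+8/10" are read as "8 to 10" and the upper bound is used, services without a numeric 10G estimate are left out, and any ports still free among the existing ~432 are ignored. The names below are illustrative, not from any existing tool.

```python
# Approximate totals quoted in the task description.
PORTS_BEFORE = 432
PORTS_AFTER = 528

# Upper bound of each numeric 5-year estimate from the table.
# Assumption: "+8/10" means "8 to 10"; "+6/8 per year" is taken as the
# 40-in-5-years figure given next to it.
requested_10g_ports = {
    "Databases backups/provisioning service": 10,
    "Media storage backend": 40,
    "Cloud": 10,
}


def headroom(ports_before, ports_after, requests):
    """Ports added by the refresh minus the sum of requested ports."""
    added = ports_after - ports_before
    return added - sum(requests.values())


added = PORTS_AFTER - PORTS_BEFORE
requested = sum(requested_10g_ports.values())
remaining = headroom(PORTS_BEFORE, PORTS_AFTER, requested_10g_ports)
print(f"added: {added}, requested: {requested}, headroom: {remaining}")
```

Under these assumptions the refresh adds about 96 ports against about 60 requested. Any service that later moves to 10G would need to be added to the dictionary before this is used for planning.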

Event Timeline

ayounsi created this task. Mar 21 2018, 11:22 PM
Restricted Application added a project: Operations. Mar 21 2018, 11:22 PM
Restricted Application added a subscriber: Aklapper.
EBernhardson added a subscriber: EBernhardson. Edited Mar 21 2018, 11:24 PM

For elasticsearch we are expecting to switch all servers to 10G on their standard server refresh schedule, though I'm not sure exactly what that schedule is. Soon there will be 43 elasticsearch servers in eqiad. If I had to make a 5 year guess, 60 might be reasonable?

ayounsi updated the task description. Mar 21 2018, 11:26 PM
ayounsi mentioned this in Unknown Object (Task). Mar 21 2018, 11:33 PM

I have only hunches and no data to back any of this, but I think ElasticSearch, Hadoop, WMCS, Backups, plus probably Ganeti and Kafka would be good candidates to go 10G-only. Kubernetes I could see it going either way, depending on the density we'll come up with.

(I presume the decrease in Traffic is in number of ports, not in 10G->1G :)

I'd love to see stream processing from Kafka running in Kubernetes one day (pipe dream!), and that could be highish traffic.

elukey added a subscriber: elukey. Mar 23 2018, 4:38 PM
  • Backup servers (heze/helium in the current incarnation) will definitely have 10G (we've already budgeted for it).
  • Ganeti hosts are not so clear. Per grafana eqiad [1] and grafana codfw [2] we still don't need 10G there. codfw's traffic is the actual representative one since the latest large spikes/plateaus in eqiad are probably due to me doing many very heavy IO tests for T181121. Since this is long term planning and T181121 has probably been resolved, we should wait a few weeks and see if that is true. Of course we can only do simple projections and can't really predict the future, so it's difficult to say for sure. My hunch is that for now we don't need 10G and we probably won't need 10G on ganeti hosts for another 1-2 years. After that, I don't know.
  • Kubernetes hosts have only just gone into production and are handling very minimal traffic. The entire idea of that infrastructure is to scale out, not scale up, so even if we end up running Kafka stream processing (or anything else, for that matter) in Kubernetes, 10G seems to me like a waste of money. So I agree with "pretty sure we won't need 10G".

[1] https://grafana.wikimedia.org/dashboard/db/prometheus-cluster-breakdown?panelId=84&fullscreen&orgId=1&var-datasource=eqiad%20prometheus%2Fops&var-cluster=ganeti&var-instance=All&from=now-6M&to=now
[2] https://grafana.wikimedia.org/dashboard/db/prometheus-cluster-breakdown?panelId=84&fullscreen&orgId=1&from=now-6M&to=now&var-datasource=codfw%20prometheus%2Fops&var-cluster=ganeti&var-instance=All

jcrespo updated the task description. Mar 26 2018, 4:41 PM
jcrespo added a subscriber: jcrespo. Edited Mar 26 2018, 4:45 PM

I've clarified the database backups/provisioning service entry: the goal is for it to comfortably recover multiple databases at the same time in an emergency (catastrophic failure), reducing TTR as well as setup time.

Regular databases do not need 10G at all. All core databases together use 200MB/s per datacenter: https://grafana-admin.wikimedia.org/dashboard/db/mysql-aggregated?panelId=2&fullscreen&orgId=1&from=1514306666008&to=1522082666008&var-dc=eqiad%20prometheus%2Fops&var-group=core&var-group=dbstore&var-group=misc&var-group=parsercache&var-shard=All&var-role=All
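
To put the 200MB/s figure in link-speed terms, here is a small conversion sketch. It assumes decimal units (1 MB = 10^6 bytes, 1 Gbit = 10^9 bits). The host count is a purely hypothetical placeholder, since the number of core database hosts is not given here, and an average says nothing about per-host peaks.

```python
def mbytes_to_gbits(mbytes_per_s):
    """Convert MB/s to Gbit/s using decimal units (8 bits per byte)."""
    return mbytes_per_s * 8 / 1000


def average_per_host(aggregate_gbits, n_hosts):
    """Average share of the aggregate per host; ignores peaks and skew."""
    return aggregate_gbits / n_hosts


aggregate = mbytes_to_gbits(200)  # aggregate across all core databases
HYPOTHETICAL_HOSTS = 100          # placeholder, not a real inventory count
print(f"aggregate: {aggregate:.1f} Gbit/s")
print(f"average per host: {average_per_host(aggregate, HYPOTHETICAL_HOSTS):.3f} Gbit/s")
```

The aggregate comes to roughly 1.6 Gbit/s across the whole datacenter, which is spread over many 1G hosts and not concentrated on one link.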

The number of ports/hosts should initially be 3 per datacenter, but could increase as we grow.

RobH triaged this task as Normal priority. Mar 26 2018, 7:02 PM
RobH added a subscriber: RobH.

I'm simply trying to reduce our number of 'needs triage' tasks in Operations. This seems to be an issue that is either normal, or higher priority. Due to the timeline of the 10G upgrade in eqiad, I'll flag this as normal. (High seems more for items that need to be done in advance of other tasks.)

I have better information on the Databases backups/provisioning service:

In the end, only 2 hosts per datacenter (T216137, T216138) over the next 3-5 years.

A summary of the logical architecture can be seen at https://wikitech.wikimedia.org/wiki/MariaDB/Backups#/media/File:Database_backups_overview.svg as a starting point to help netops get a better understanding.