
Upgrade eqiad-misc varnish cluster from 2 to 4 systems.
Closed, Resolved (Public)



I'm working on replacing some of our older SHA1 certificates with SHA256, and in poking at misc-web I noticed we really do have a lot of services on it these days: annual, dev, doc, git, gdash, graphite, grafana, parsoid-tests, performance, integration, phabricator, people, releases, bugs, bugzilla, bug-attachment, contacts, datasets, iegreview, ishmael, legalpad, logstash, metrics, noc, old-bugzilla, planet. (Pulled by grepping DNS, since the misc-web config has items that it's not currently serving.)
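For anyone wanting to reproduce that list, here is a rough sketch of the kind of DNS grep described above. It assumes zone-file-style text in which each service is a CNAME pointing at a `misc-web-lb` name; the sample records, the TTLs, the `text-lb` target, and the `services_behind` helper are all hypothetical and only illustrate the approach, not the actual zone data.

```python
# Hypothetical zone-file excerpt. The service names are taken from the list
# above; the record layout and targets are assumptions for illustration only.
ZONE_TEXT = """\
doc         1H  IN CNAME    misc-web-lb.eqiad
git         1H  IN CNAME    misc-web-lb.eqiad
www         1H  IN CNAME    text-lb.eqiad
; planet    1H  IN CNAME    misc-web-lb.eqiad
noc         1H  IN CNAME    misc-web-lb.eqiad
"""


def services_behind(zone_text, target_prefix="misc-web-lb"):
    """Return the sorted record names whose CNAME target starts with target_prefix."""
    names = []
    for raw_line in zone_text.splitlines():
        # Drop zone-file comments (everything after ';') and blank lines.
        line = raw_line.split(";", 1)[0].strip()
        if not line:
            continue
        fields = line.split()
        # Expect "<name> [ttl] [class] CNAME <target>": type is second to last.
        if (
            len(fields) >= 3
            and fields[-2].upper() == "CNAME"
            and fields[-1].startswith(target_prefix)
        ):
            names.append(fields[0])
    return sorted(names)


if __name__ == "__main__":
    for name in services_behind(ZONE_TEXT):
        print(name)
```

Matching on the DNS target rather than on the misc-web config is the point here: it only counts names that actually resolve to the cluster, and skips config entries that are no longer served.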

Right now we have two cp servers assigned to this role, and they have next business day parts delivery. I imagine this is possibly being overly paranoid, but I wanted to ask if we should perhaps increase this pool by one in eqiad, for a total of three systems. I fear a mainboard/controller break may leave us with a single cp server, which could be a disaster waiting to happen over a long weekend. Since we tend to scale most other clustered services across more than two systems (not always, but usually), is it time to do that here as well?

Please advise,

Event Timeline

RobH assigned this task to mark.
RobH raised the priority of this task from to Needs Triage.
RobH updated the task description.
RobH added a project: ops-core.
RobH changed the edit policy from "All Users" to "WMF-NDA (Project)".
RobH subscribed.

@RobH: So those cp servers are the old Squids that we had many of. How many unused ones of those do we still have left?

If we add 1-2 boxes to this (4 is our usual minimum for a prod cache cluster, e.g. 2 per row/rack if possible), I'll need to merge it into the current ongoing work on netboot/disk-setup/etc and we'll want to install them as Jessie initially.

fgiunchedi subscribed.
faidon removed mark as the assignee of this task. Apr 24 2015, 12:29 PM
faidon added a project: Traffic.
faidon set Security to None.

The plan being bandied about at this point is to block this on the dissolution of bits-cluster (T95448), and reuse the 4x bits machines at eqiad, ulsfo, and esams as a global misc-web cluster (ups us from 2->4 boxes at eqiad, adds termination/cache benefits at remote sites too).

BBlack lowered the priority of this task from High to Medium. Apr 24 2015, 3:52 PM
BBlack moved this task from Blocked on Internal to Backlog on the Traffic board.

Blocking this on ipsec, so we don't reduce the security of any critical HTTPS sessions behind misc-web by moving them out to remote TLS termination and then backhauling to eqiad insecurely.

BBlack renamed this task from "increase misc-web-lb cp pool from 2 to 3 systems?" to "Upgrade eqiad-misc varnish cluster from 2 to 4 systems.". Jun 4 2015, 12:01 AM

Splitting this up. This task is about the expansion of the eqiad-only cluster (by replacing it with the former bits machines when they're available). Setting it up at other DCs will be separate.

Change 229943 had a related patch set uploaded (by BBlack):
decom more bits-cluster stuff (hieradata)

Change 229943 merged by BBlack:
decom more bits-cluster stuff (hieradata)

Change 229944 had a related patch set uploaded (by BBlack):
old eqiad bits hosts -> cache::misc role

Change 229944 merged by BBlack:
old eqiad bits hosts -> cache::misc role

Change 230033 had a related patch set uploaded (by BBlack):
add conftool/hiera data for new eqiad misc boxes

Change 230033 merged by BBlack:
add conftool/hiera data for new eqiad misc boxes

The 4x previous eqiad-bits machines (cp1056, cp1057, cp1069, cp1070) are functioning in the eqiad misc cluster now, alongside the old ones. Will give them a little time for validation/sanity/cache-fill (what little they truly cache...), then decom the two older ones (cp1043, cp1044).

BBlack claimed this task.

Old hosts removed. eqiad now has 4 hosts (previous bits cluster hosts).