Upgrade eqiad-misc varnish cluster from 2 to 4 systems.
Closed, ResolvedPublic

Description

Mark,

I'm working on replacing some of our older SHA1 certificates with SHA256 ones. While poking at misc-web, I noticed that we really do have a lot of services on it these days under wikimedia.org: annual, dev, doc, git, gdash, graphite, grafana, parsoid-tests, performance, integration, phabricator, people, releases, bugs, bugzilla, bug-attachment, contacts, datasets, iegreview, ishmael, legalpad, logstash, metrics, noc, old-bugzilla, planet. (Pulled by grepping DNS, as the misc-web config has items that it's not currently serving.)

Right now we have two cp servers assigned to this role, and they have next business day parts delivery. I imagine this is possibly being overly paranoid, but I wanted to ask if we should perhaps increase this pool by one in eqiad, for a total of three systems. I fear a mainboard/controller break may leave us with a single cp server, which could be a disaster waiting to happen over a long weekend. Since we tend to scale most other clustered services across more than two systems (not always, but usually), is it time to do that here as well?

Please advise,

Event Timeline

RobH created this task.Jan 13 2015, 11:17 PM
RobH assigned this task to mark.
RobH raised the priority of this task from to Needs Triage.
RobH updated the task description.
RobH added a project: ops-core.
RobH changed the edit policy from "All Users" to "WMF-NDA (Project)".
RobH added a subscriber: RobH.
mark added a comment.Mar 11 2015, 10:15 AM

@RobH: So those cp servers are the old Squids that we had many of. How many unused ones of those do we still have left?

BBlack added a subscriber: BBlack.Mar 11 2015, 2:30 PM

If we add 1-2 boxes to this (4 is our usual minimum for a prod cache cluster, e.g. 2 per row/rack if possible), I'll need to merge it into the current ongoing work on netboot/disk-setup/etc and we'll want to install them as Jessie initially.

fgiunchedi triaged this task as High priority.Apr 1 2015, 11:27 AM
fgiunchedi added a subscriber: fgiunchedi.
faidon removed mark as the assignee of this task.Apr 24 2015, 12:29 PM
faidon added a project: Traffic.
faidon set Security to None.

The plan being bandied about at this point is to block this on the dissolution of bits-cluster ( T95448 ), and reuse the 4x bits machines at eqiad, ulsfo, and esams as a global misc-web cluster (this ups us from 2->4 boxes at eqiad and adds termination/cache benefits at remote sites too).

BBlack lowered the priority of this task from High to Normal.Apr 24 2015, 3:52 PM
BBlack moved this task from Blocked on Internal to Triage on the Traffic board.

Blocking this on ipsec, so we don't reduce the security of any critical HTTPS sessions behind misc-web by moving them out to remote TLS termination and then backhauling to eqiad insecurely.

BBlack renamed this task from "increase misc-web-lb cp pool from 2 to 3 systems?" to "Upgrade eqiad-misc varnish cluster from 2 to 4 systems.".Jun 4 2015, 12:01 AM

Splitting this up. This task is about the expansion of the eqiad-only cluster (by replacing it with the former bits machines when they're available). Setting it up at other DCs will be separate.

Restricted Application added a subscriber: Matanya.Aug 6 2015, 8:03 PM

Change 229943 had a related patch set uploaded (by BBlack):
decom more bits-cluster stuff (hieradata)

https://gerrit.wikimedia.org/r/229943

Change 229943 merged by BBlack:
decom more bits-cluster stuff (hieradata)

https://gerrit.wikimedia.org/r/229943

Change 229944 had a related patch set uploaded (by BBlack):
old eqiad bits hosts -> cache::misc role

https://gerrit.wikimedia.org/r/229944

Change 229944 merged by BBlack:
old eqiad bits hosts -> cache::misc role

https://gerrit.wikimedia.org/r/229944

Change 230033 had a related patch set uploaded (by BBlack):
add conftool/hiera data for new eqiad misc boxes

https://gerrit.wikimedia.org/r/230033

Change 230033 merged by BBlack:
add conftool/hiera data for new eqiad misc boxes

https://gerrit.wikimedia.org/r/230033

BBlack added a comment.Aug 7 2015, 2:26 AM

The 4x previous eqiad-bits machines (cp1056, cp1057, cp1069, cp1070) are functioning in the eqiad misc cluster now, alongside the old ones. Will give them a little time for validation/sanity/cache-fill (what little they truly cache...), then decom the two older ones (cp1043, cp1044).

BBlack closed this task as Resolved.Aug 7 2015, 4:38 AM
BBlack claimed this task.

Old hosts removed. eqiad now has 4 hosts (previous bits cluster hosts).

BBlack moved this task from Triage to Done on the Traffic board.Aug 7 2015, 4:41 AM