codfw/eqiad/esams/ulsfo: (4) servers for maps caching cluster
Closed, Resolved, Public

Description

initial request

We need to set up a proper caching infrastructure for Maps (see T109162 and T125126 for details). This requires 4 servers in each of 4 datacenters (eqiad, codfw, ulsfo, esams), so 16 Varnish servers in total. The specs of Varnish servers seem to be standardized, but I'm unsure what they actually are...

Note: we might be able to repurpose existing hardware for this as we decommissioned the mobile cache cluster.

allocation notes

The request is to allocate 4 machines in each of codfw, eqiad, esams, and ulsfo to the maps Varnish service cluster. These hosts are already in site.pp with 'ex-moble' in the comment on their entries. Brandon pointed out on IRC that these are the hosts for this; he stated that they were reclaimed from active service some time ago and held back for this potential project. They are:

eqiad: cp1046, cp1047, cp1059, cp1060
codfw: cp2003, cp2009, cp2015, cp2021
esams: cp3003, cp3004, cp3005, cp3006
ulsfo: cp4011, cp4012, cp4019, cp4020
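
For the hosts listed above, the first digit after "cp" lines up with the site (cp1xxx in eqiad, cp2xxx in codfw, cp3xxx in esams, cp4xxx in ulsfo). A minimal Python sketch, assuming that convention holds, that checks the allocation is 4 hosts per site and 16 in total. The names `SITE_BY_DIGIT`, `site_of`, and `check_allocation` are illustrative and not part of any Wikimedia tooling.

```python
from collections import Counter

# Allocation exactly as listed in this task.
ALLOCATION = {
    "eqiad": ["cp1046", "cp1047", "cp1059", "cp1060"],
    "codfw": ["cp2003", "cp2009", "cp2015", "cp2021"],
    "esams": ["cp3003", "cp3004", "cp3005", "cp3006"],
    "ulsfo": ["cp4011", "cp4012", "cp4019", "cp4020"],
}

# Assumption: the first digit after "cp" identifies the site.
SITE_BY_DIGIT = {"1": "eqiad", "2": "codfw", "3": "esams", "4": "ulsfo"}


def site_of(host: str) -> str:
    """Return the site implied by a hostname such as 'cp1046'."""
    if not host.startswith("cp") or len(host) != 6 or not host[2:].isdigit():
        raise ValueError(f"unexpected hostname format: {host!r}")
    return SITE_BY_DIGIT[host[2]]


def check_allocation(allocation: dict, per_site: int = 4) -> int:
    """Verify each host sits under the right site and each site has
    `per_site` hosts; return the total number of hosts."""
    counts = Counter()
    for site, hosts in allocation.items():
        for host in hosts:
            if site_of(host) != site:
                raise ValueError(f"{host} is listed under {site}")
            counts[site] += 1
    wrong = {site: n for site, n in counts.items() if n != per_site}
    if wrong:
        raise ValueError(f"unexpected host counts: {wrong}")
    return sum(counts.values())


total_hosts = check_allocation(ALLOCATION)
```

Here `total_hosts` is the sum over all four sites, which should match the 16 servers in the initial request.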

Event Timeline

RobH subscribed.

I'll create and link in procurement tasks for pricing shortly.

According to @BBlack on IRC:

  • Specs are "standard, SSD-based varnish cluster machine configurations".
  • We should not need to buy any new hardware; we should be able to repurpose existing machines.

Well, to be completely clear: we should not need to buy any new hardware this quarter. All of them need replacing on standard lifetimes, with the first batch of four coming up in FY16-17, IIRC.

As we already have the hardware, this just needs @mark's approval.

Excellent, I was too quick to claim for processing!

@mark: we just need to ensure that you approve of this allocation. Please note any questions/comments/concerns/approvals and assign back to either myself or @Gehel. (I'm happy to assist in getting these reclaimed and reinstalled for service, or walk @Gehel through the entire process so he knows it in the future.)

RobH renamed this task from Hardware for cache cluster for Maps to codfw/eqiad/esams/ulsfo: (4) servers for maps caching cluster. Apr 5 2016, 10:12 PM
RobH moved this task from Backlog to Pending Approval on the hardware-requests board.
RobH updated the task description.

@RobH I'd really appreciate if you could let me do the reclaim / reinstall so that I learn something in the process (this is likely going to take you more time and energy than doing it yourself, but it should pay in the long run...).

@Gehel: Absolutely, these are perfect conditions for testing the documentation! We've outlined all the steps at https://wikitech.wikimedia.org/wiki/Server_Lifecycle#Reclaim_or_Decommission, so that page should have everything you need to reclaim. Let me know when you start working on these (once Mark approves) and I can shadow and double-check them (plus work with you to update the docs if they are missing steps).

Additionally, I'll need to modify the lifecycle section on wiping to note that the wipe isn't actually supported on SSDs because of how they handle trim. My understanding is that we need to find and use the drive vendor's trim/wipe utilities to accomplish that.

However, as these were caching machines, and are becoming new caching machines, the wipe is likely unneeded for these anyhow.

There's no real need to reinstall them. I have patches pending to put them into their proper roles, etc.

The patch series starts at: https://gerrit.wikimedia.org/r/#/c/268236/ , but needs manual rebases at this point.

(It's better to look at T109162, which had all the patch links.)

@Gehel: Since this isn't going to end up being a reinstall, I'll ping you to do a reinstall on one of the many I do every week!

Let's move forward with repurposing the existing (ex mobile) Varnish servers for maps. :)

With post-switchover work, a weekend coming, and other misc constraints, @Gehel and I are planning to actually do the work on Monday.

@Gehel and I made partial progress on this today, to resume tomorrow. Current situation:

  1. all 16x new cache_maps machines are puppetized in their new roles, but the non-eqiad ones aren't yet handling frontend (user-facing) traffic. They're also all upgraded to varnish4 (as the old cache_maps boxes were).
  2. in eqiad, the 2x old/beta cache_maps machines are out of service/decommed at all layers, and the 4x new ones there are in service for all frontend (user-facing) traffic.
  3. the new eqiad caches are routing requests via cache_maps:eqiad -> cache_maps:codfw -> kartotherian.svc.codfw.wmnet (whereas the old cache_maps:eqiad contacted kartotherian.codfw directly).
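
The routing in item 3 can be pictured as a chain of next hops. A minimal Python sketch, assuming each tier has exactly one next hop and that "kartotherian.codfw" refers to the same kartotherian.svc.codfw.wmnet endpoint. The `NEW_ROUTING` and `OLD_ROUTING` tables and the `request_path` function are hypothetical names for illustration, not Varnish or Puppet configuration.

```python
# Next hop per tier after this change (from item 3 above).
NEW_ROUTING = {
    "cache_maps:eqiad": "cache_maps:codfw",
    "cache_maps:codfw": "kartotherian.svc.codfw.wmnet",
}

# Previous behaviour: the old eqiad caches went straight to the service.
OLD_ROUTING = {
    "cache_maps:eqiad": "kartotherian.svc.codfw.wmnet",
}


def request_path(entry: str, routing: dict) -> list:
    """Follow next hops from `entry` until a node with no next hop
    (the backend service) is reached. Raises on a routing loop."""
    path = [entry]
    seen = {entry}
    while path[-1] in routing:
        next_hop = routing[path[-1]]
        if next_hop in seen:
            raise ValueError(f"routing loop at {next_hop}")
        path.append(next_hop)
        seen.add(next_hop)
    return path


new_path = request_path("cache_maps:eqiad", NEW_ROUTING)
old_path = request_path("cache_maps:eqiad", OLD_ROUTING)
```

Comparing `new_path` with `old_path` shows the one extra cache tier (cache_maps:codfw) that a request entering at eqiad now passes through before reaching the backend.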

What's saved for tomorrow is to go through the steps to pool up geographically-routed frontend requests into the other 3x DCs (codfw, ulsfo, esams).

Mentioned in SAL [2016-04-26T19:54:37Z] <gehel> restarting pybal on lvs4004 to enable new cache configuration for maps (T131880)

Mentioned in SAL [2016-04-26T20:02:31Z] <bblack> restarting pybal on lvs4002 to enable new cache configuration for maps (T131880)

Mentioned in SAL [2016-04-26T20:30:12Z] <bblack> restarting pybal on lvs2005 to enable new cache configuration for maps (T131880)

Mentioned in SAL [2016-04-26T20:32:51Z] <bblack> restarting pybal on lvs2002 to enable new cache configuration for maps (T131880)

Mentioned in SAL [2016-04-26T20:35:56Z] <bblack> restarting pybal on lvs3004 to enable new cache configuration for maps (T131880)

Mentioned in SAL [2016-04-26T20:38:25Z] <bblack> restarting pybal on lvs3002 to enable new cache configuration for maps (T131880)

Mentioned in SAL [2016-04-26T21:06:43Z] <gehel> activating geodns for new varnish maps servers (T131880)