
requesting WMF7426 as phabricator system in eqiad
Closed, Resolved · Public

Description

This task is to track the request to allocate spare pool system WMF7426 (purchased on T195418) as the secondary phabricator system in eqiad.

Please note: an mw-class system was recently requested as phab1002, but it is now being returned to mw use, since it only has 32GB of RAM and phabricator requires 64GB on its host.

https://netbox.wikimedia.org/dcim/devices/248/

WMF7426 system specs:

  • Dual Intel Xeon Silver 4110, 2.10GHz/8C
  • 64GB RAM
  • Dual 240GB SSD
  • 1Gb NIC

This will be the last dual-CPU spare pool host in eqiad, so once approved, @RobH will also file a task to order 1-3 more dual-CPU spare pool systems for eqiad.

Event Timeline

RobH triaged this task as Normal priority. · Feb 5 2019, 7:40 PM
RobH created this task.
Restricted Application removed a project: Patch-For-Review. · View Herald Transcript · Feb 5 2019, 7:40 PM
RobH updated the task description. (Show Details) · Feb 5 2019, 7:41 PM
RobH renamed this task from requesting wmf7622 as phabricator system in eqiad to requesting WMF7426 as phabricator system in eqiad. · Feb 5 2019, 7:45 PM
RobH reassigned this task from RobH to faidon.
RobH moved this task from Backlog to Pending Approval on the hardware-requests board.
RobH removed faidon as the assignee of this task. · Feb 5 2019, 7:54 PM
RobH added a subscriber: faidon.
RobH assigned this task to Dzahn. · Feb 5 2019, 7:58 PM
RobH moved this task from Pending Approval to In Discussion / Review on the hardware-requests board.

So I filed this on behalf of a conversation with @Dzahn regarding parent task T195623.

We still need to have direct confirmation on this task that the use of the following specifications for phabricator server use will be acceptable:

WMF7426 system specs:

  • Dual Intel Xeon Silver 4110, 2.10GHz/8C
  • 64GB RAM
  • Dual 240GB SSD
  • 1Gb NIC
mmodell added a subscriber: mmodell. · Feb 5 2019, 9:13 PM
Dzahn added a subscriber: 20after4. · Feb 5 2019, 9:47 PM

Talked about it on IRC; the direct comparison between the two CPUs is:

https://ark.intel.com/compare/123550,83359

The amount of RAM is what we needed, and the disk space is enough.

Also, @20after4 confirmed it's more than enough space.

Yes, we would like to use it, please.

Dzahn reassigned this task from Dzahn to RobH. · Feb 5 2019, 10:11 PM
RobH reassigned this task from RobH to faidon. · Feb 5 2019, 10:18 PM
RobH moved this task from In Discussion / Review to Pending Approval on the hardware-requests board.

@faidon,

Please approve the allocation of our last dual-CPU spare pool system in eqiad for use as the secondary phabricator system in eqiad.

On a related note, I'll also file a task to order more spare pool systems.

Is there a task describing the plans for a secondary Phabricator system? How did we come up with those specs?

RobH added a comment. · Feb 6 2019, 12:55 AM

So the original phab1002 was requested on T195623, but then @Dzahn advised (via discussion with @20after4) that it needed 64GB of RAM, not the 32GB it has.

That leaves us with allocating a spare system. Further details on hardware requirements will need justification by @Dzahn or @20after4.

I'm unaware of any task discussing plans beyond that, and I'm unsure why we would need 2 phabricator hosts in each datacenter.

RobH reassigned this task from faidon to Dzahn. · Feb 6 2019, 12:56 AM
Dzahn added a comment. (Edited) · Feb 8 2019, 2:05 AM

Hi @faidon, let me explain. It was never a request for running 2 phabricator hosts in each datacenter. That's a misunderstanding.

It's just that we want to reinstall phab1001 with stretch, but we want to be able to fail over to something during the migration.

Since we are changing a couple of things at once (distro version, PHP version, httpd version and worker config), we would like to be able to set that up as phab1002, switch traffic to it, confirm it's OK, and after a short period of time also reinstall phab1001 with stretch, switch traffic back to 1001, and finally give this temporary host back to the pool and be done (with one production phab server running stretch).

This was our common understanding after having a meeting about how to move forward.

Then the additional part is that this was already requested in exactly the same way back in May 2018, and Moritz had suggested taking one of the appservers because it was currently not in use anyway, so we did that and called it phab1002. Then things stalled for other reasons and the upgrade plan never materialized. Now that we have come back to it, we noticed the existing replacement server (already called phab1002) only has 32GB while the prod server has 64GB, and Mukunda raised concerns that 32GB is not enough.

The core of this request is that we would like the replacement server to have the same amount of RAM. So as the next step I asked whether it's feasible to put an additional 32GB into this server, and RobH told me we don't do upgrades and I should request a new server instead; the replacement for the temporary replacement, so to speak.

When Rob checked, it just so happened that there was one single misc server left in the pool, and it happened to have the requested 64GB, and the other specs work out. The specs required were always just "not worse than the current prod server", and the intention was always to only use something that is currently not in use anyway and give it back to the pool. So I think this should not have any budget implications.

cc: @20after4

Dzahn reassigned this task from Dzahn to faidon. (Edited) · Feb 8 2019, 2:07 AM

P.S. Running it in codfw is blocked on unrelated things (lack of dbproxy), and the host currently called phab1002 with 32GB would immediately go back to the pool.

Related tickets: T190568 (phab1002), T196019 (reimage to stretch), T182832 (something we hope to fix with the stretch upgrade) and T192457 (where these currently unused hosts will go back to the pool and serve their original role as mw appservers).

The end result of all of this would be that, once done, we are still using the same phab server as before that was designated for it, and the servers originally bought as mw appservers would also be doing their job again.

faidon reassigned this task from faidon to RobH. · Feb 15 2019, 4:10 PM

@Dzahn that's all fine, but we should have that documented in a separate Phabricator task tracking this work, if one doesn't exist already :) Separately, I'd also really love having a permanent non-SPOF setup in each data center as well, whether that's multiple bare metal servers, multiple VMs or running Phabricator on k8s. This is too important of a service to run in one misc-type server per site.

Anyway! Yes, this request is approved, but T215837 takes priority, so it may need to wait a little bit. I'll leave the discussion of that in that other task.

Change 496119 had a related patch set uploaded (by Dzahn; owner: Dzahn):
[operations/puppet@production] replace phab1002 with phab1003

https://gerrit.wikimedia.org/r/496119

Dzahn changed the task status from Open to Stalled. · Mar 13 2019, 11:50 AM

Thanks! Setting this to stalled to reflect that we are waiting another week or so to decide whether T215837 is still requested (per IRC with volans). If it turns out not to be needed, then we can use the host for this as requested.

Dzahn changed the task status from Stalled to Open. · Mar 22 2019, 9:43 AM

Re-opening. Icinga has been switched back from 2001 to 1001, T215837 has been declined, and per the ops mailing list / cdanis it's OK to use this host for phab again.

Dzahn edited projects, added serviceops; removed Patch-For-Review. · Mar 22 2019, 9:43 AM

@RobH Now that you are back, can I please have this server assigned to me? That would unblock T190568, which has been waiting for quite a bit. It has already been approved in the past (T215335#4957432), but was then stalled again because using it for Icinga had preference; that is no longer the case, and cdanis has confirmed they don't need it for that anymore. And finally you will get another server back into the pool from that (T215332), in addition to this one here.

RobH closed this task as Resolved. · Apr 18 2019, 4:28 PM

Granted and resolving; setup is on T221389.

Change 496119 abandoned by Dzahn:
replace phab1002 with phab1003

Reason:
duplicate, already done elsewhere today

https://gerrit.wikimedia.org/r/496119