
eqiad: (2) Relevance forge servers
Closed, ResolvedPublic

Description

Requesting 2 servers for implementing the discovery relevance forge project.

  • 2x 6-core processors
  • 128 GB RAM
  • 4x 2TB or 3TB disks

These will need to be labs accessible. These are to be purchased with the discovery FY15-16 budget and as such need to be delivered before July.

See also T128433 for the estimates behind this hardware request.

Event Timeline

Restricted Application added a subscriber: Aklapper.

Systems are labs, so eqiad DC.

RobH renamed this task from Relevance forge hardware to eqiad: (2) Relevance forge servers.Apr 5 2016, 6:37 PM
RobH mentioned this in Unknown Object (Task).
RobH added a subtask: Unknown Object (Task).
RobH moved this task from Backlog to In Discussion / Review on the hardware-requests board.

If labs instances need to route to them, it'll need to be in a labs support vlan.

As for the hardware itself, we don't have any spare systems that match this specification.

The use of 3TB disks restricts your options to LFF (3.5") disks. We don't have any spare pool systems with that much memory (or that core count) currently available in eqiad. As such, this will have to generate an order. Also those sizes tend to be in SATA, so they won't be as fast as SSD or SAS.

I've created procurement task T131871 to track the pricing and ordering.

SATA is fine (well, not great, but the disk requirements here make SSDs a bit untenable). 2x6 isn't a strict requirement; we figured the 2x8 we use for normal search servers was more than necessary.

We could probably work with 1x8 if that leads to the availability of servers to upgrade rather than buying completely new ones (and that is desirable from an ops perspective). We've been testing on nobelium, which is 1x8 (@2.5GHz) and can sustain reasonable enough throughput as long as the query data fits in memory.

Well, the only potential spare systems would be our recently reclaimed restbase1001-1006, but they would need a memory upgrade, plus the purchase of all the disks. I'll list detailed pricing for that option in the linked procurement task, along with the pricing for purchasing new systems.

RobH changed the task status from Open to Stalled.Apr 21 2016, 7:09 PM

This is presently pending mgmt purchase/allocations approvals on procurement task T131871. I'm setting this to stalled, until that order is approved or we have feedback.

@EBernhardson Is the Labs team aware of this project at all? :)

I think we were aware of it in a cursory manner. What is the general outcome for Labs here? Is this a service provided to a) all of labs, b) tools, or c) only certain projects within labs (which ones?), and d) who manages the servers?

There is a decent "what", but I'm missing some of the "who".

a/b/c) The goal is to have a cluster to test improvements in search configuration, allowing us to run tests against different optimizations and validate that we actually get improvements in result quality. So these servers will provide service mainly to the Discovery team.

d) My understanding is that @Gehel (me) will manage those servers.

If someone wanted to launch a bunch of queries against this cluster from Tools to analyze our search results would this be an appropriate target?

The project will need to be accessible (read/write) from the search project in labs. These will have indices (not updated in real time) for all of the larger wikis and as such might be useful to the rest of labs, but we will have to find a way to allow writes from the search project and only reads from everything else to make that usable. We have a method today for limiting write access (via an nginx proxy); it just needs some way to be aware of how IPs are handed out in labs. I'm not too concerned about how that will be done, it just hasn't been figured out yet.
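Purely as an illustration of that gating idea (the actual mechanism would be the nginx proxy), here is a minimal Python sketch, with hypothetical CIDR ranges standing in for however labs hands out instance IPs:

```python
import ipaddress

# Hypothetical CIDR ranges for illustration only; the real ranges depend on
# how labs hands out instance IPs, which is exactly the open question above.
SEARCH_PROJECT_NET = ipaddress.ip_network("10.68.16.0/24")  # read/write
LABS_INSTANCES_NET = ipaddress.ip_network("10.68.0.0/16")   # read-only

READ_METHODS = {"GET", "HEAD", "OPTIONS"}


def is_allowed(client_ip: str, method: str) -> bool:
    """Mirror the intended proxy rule: writes only from the search project,
    reads from any labs instance, everything else rejected."""
    addr = ipaddress.ip_address(client_ip)
    if addr in SEARCH_PROJECT_NET:
        return True
    if addr in LABS_INSTANCES_NET:
        return method in READ_METHODS
    return False


if __name__ == "__main__":
    print(is_allowed("10.68.16.5", "POST"))  # True: search project may write
    print(is_allowed("10.68.23.9", "GET"))   # True: other labs project, read-only
    print(is_allowed("10.68.23.9", "PUT"))   # False: write from outside the search project
```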

@chasemp yes, I think that would be reasonable. Since we are getting spinning disks, they will only see good performance for things that fit in memory (2x 128G will fit one of the large wikis' content indices in memory), and not for recreating things like mwgrep that query all wikis at once. The caveat, as mentioned above, is that we won't be updating them in real time. We will instead be importing the publicly available CirrusSearch dumps on a semi-regular basis.
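For illustration, a rough sketch of what such a semi-regular import could look like in Python, assuming the gzipped dump files published under https://dumps.wikimedia.org/other/cirrussearch/ are already in Elasticsearch bulk format (alternating action and document lines); the file name, index name, and target host are placeholders, and the `requests` library is assumed to be available:

```python
import gzip

import requests

# Placeholders for illustration; a real run would point at an actual dump
# file and at the relforge Elasticsearch cluster.
DUMP_FILE = "enwiki-cirrussearch-content.json.gz"
ES_BULK_URL = "http://localhost:9200/enwiki_content/_bulk"
BATCH_LINES = 2000  # even, so action/document line pairs stay together


def send_batch(lines):
    """POST one chunk of bulk-format lines to Elasticsearch."""
    resp = requests.post(
        ES_BULK_URL,
        data="".join(lines).encode("utf-8"),
        headers={"Content-Type": "application/x-ndjson"},
    )
    resp.raise_for_status()


def import_dump(path):
    """Stream a gzipped CirrusSearch dump (assumed to be in bulk format)
    into the target index in fixed-size batches."""
    batch = []
    with gzip.open(path, "rt", encoding="utf-8") as dump:
        for line in dump:
            batch.append(line)
            if len(batch) >= BATCH_LINES:
                send_batch(batch)
                batch = []
    if batch:
        send_batch(batch)


if __name__ == "__main__":
    import_dump(DUMP_FILE)
```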

Does this usurp nobelium so we can decommission it in its labs support role? I assumed yes but want to make sure.

Just a note on language (we have a tradition of being bad at it)

Systems are labs, so eqiad DC.

The way I understand it, these systems are production as labs infrastructure. They will provide a service to particular labs project(s), managed by Discovery for write/read as an outcome of dumps from prod things. They will live in a labs-support VLAN and will be similar to nodepool1001 in terms of ops/other-team demarcation of responsibility. Labs instances in general are welcome to query the data and experiment with search results and outcomes for research or design, with the understanding that results are stale to a degree. Not dissimilar from labsdb, although probably more stale.

I swear there is a reason for the pedagogic / pedantic commentary :D

We do have "hardware" in labs (metal that is treated as closely to an instance as possible) as a thing, but we don't currently support it broadly, and saying these systems are 'labs' or 'in labs' confuses the outcome here between these two distinct cases. We also have https://wikitech.wikimedia.org/wiki/Labs_labs_labs and https://wikitech.wikimedia.org/wiki/Labs_labs_labs/future which should help outline that 'labs' is a broadly descriptive term for things or where they live in most cases. It's overloaded. In Phab terms I would probably tag issues concerning these with Discovery-ARCHIVED Cloud-VPS Cloud-Services going forward.

but yeah seems cool guys

Damn, this labs thing is confusing... @chasemp thanks for the clarifications. I'm not entirely sure I understand what you mean by "these systems are production as labs infrastructure". Aren't production and labs exclusive? Or is there also a Prod, prod, PROD confusion?

This project does replace nobelium. Nobelium will be decommissioned and returned to the pool.

I think the difference is that labs infrastructure is production. The machines have IP addresses in the prod network and can access/be accessed by other prod hosts. They are maintained by production puppet and can't have custom puppetmasters like VMs can. Labs VMs can only access other labs VMs.

Is that basically the distinction? If so that all sounds reasonable to me.

So my understanding of the commitment from the labs team is:

  1. We poke some holes in the labs-instance/labs-support firewall to allow labs instances to talk to this group of servers.

and nothing else at all. Specifically here are the things that this does *not* need:

  1. This isn't a 'labs on real hardware' situation, which would live inside the labs instances VLAN + use the labs puppetmaster + labs proxy, etc.
  2. Ongoing support from labs admin wrt anything other than the firewall hole :D

Does that sound accurate?

This project does replace nobelium. Nobelium will be decommissioned and returned to the pool.

I think the difference is that labs infrastructure is production. The machines have IP addresses in the prod network and can access/be accessed by other prod hosts. They are maintained by production puppet and can't have custom puppetmasters like VMs can. Labs VMs can only access other labs VMs.

Is that basically the distinction? If so that all sounds reasonable to me.

re: nobelium understood
re: distinctions defined, yes this is a reasonable pov

Damn, this labs thing is confusing... @chasemp thanks for the clarifications. I'm not entirely sure I understand what you mean by "these systems are production as labs infrastructure". Aren't production and labs exclusive? Or is there also a Prod, prod, PROD confusion?

https://en.wikipedia.org/wiki/Blind_men_and_an_elephant

Well, 'production' includes the infrastructure that 'labs VMs' run on. When you use a signifier like $realm == 'labs' it would be true in the context of the VM itself only.

i.e. labservices1001 is production and is part of the infrastructure that the environment we generally refer to as 'labs' is built on. tools-bastion-03 is a VM running 'in labs'. Nobelium is a production host in the labs-support VLAN providing a service to 'labs VMs'.

So my understanding of the commitment from the labs team is:

  1. We poke some holes in the labs-instance/labs-support firewall to allow labs instances to talk to this group of servers.

and nothing else at all. Specifically here are the things that this does *not* need:

  1. This isn't a 'labs on real hardware' situation, which would live inside the labs instances VLAN + use the labs puppetmaster + labs proxy, etc.
  2. Ongoing support from labs admin wrt anything other than the firewall hole :D

Does that sound accurate?

Seems like everybody is on-board.

Please note that this allocation, via procurement task T131871, has been approved. Two of our spare systems, wmf4657 & wmf4658, will be allocated to this, and they will receive memory and storage upgrades.

I hesitate to ask, but is there a hostname standard picked out for these? Will they be in a service cluster and have a service cluster defined name, or misc. naming?

If misc, I'll just pick out two free elements and name them. If they are to use a service cluster name, please detail it here and also update https://wikitech.wikimedia.org/wiki/Infrastructure_naming_conventions

Thanks!

I'm thinking it will be simpler to give them a service cluster name; it makes things easier to remember. If there are no objections, I will update the wiki page with these example names:

  • relforge1001.eqiad.wmnet
  • relforge1002.eqiad.wmnet

I'm thinking it will be simpler to give them a service cluster name; it makes things easier to remember. If there are no objections, I will update the wiki page with these example names:

  • relforge1001.eqiad.wmnet
  • relforge1002.eqiad.wmnet

These seem sensible enough to me.

RobH claimed this task.

The two machines have been allocated and are now in the OS installation stage. I'm resolving this task.

mark closed subtask Unknown Object (Task) as Resolved.Jun 14 2016, 9:58 AM