eqiad: (1) hardware request for ElasticSearch replication to Labs - 4 weeks use
Closed, Resolved · Public

Description

Labs Project Tested: 'search' project (estest01 and 02)
Site/Location: eqiad
Number of systems: 1
Service: ElasticSearch labs replicas
Networking Requirements: Labs VLAN, same setup as labsdb*** boxes
Suggested spare: Dell PowerEdge R420, dual Intel Xeon E5-2450 v2 2.50GHz, 64GB Memory, (4) 3TB Disks

Requesting this for a limited period of time to be able to test T109715. We don't fully know what levels of performance we'll get or what our hardware requirements will be. The suggested server was picked based on disk space requirements: we know the prod search cluster holds 2.5TB, so we'll need at least that.
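A quick back-of-the-envelope capacity check behind that choice, as a sketch; the RAID layout is an assumption, not stated anywhere in this request:

```python
# Back-of-the-envelope capacity check for the suggested spare.
# Assumption (not stated in the request): the four 3TB disks run in
# RAID10, so roughly half the raw capacity is usable.

PROD_INDEX_TB = 2.5   # size of the prod search cluster, per the request
DISKS = 4
DISK_TB = 3.0
RAID10_USABLE = 0.5   # mirrored pairs -> half of raw capacity

usable_tb = DISKS * DISK_TB * RAID10_USABLE
print(f"usable: {usable_tb:.1f}TB, headroom: {usable_tb / PROD_INDEX_TB:.1f}x")
# -> usable: 6.0TB, headroom: 2.4x
```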

Event Timeline

yuvipanda raised the priority of this task from to Needs Triage.
yuvipanda updated the task description. (Show Details)
yuvipanda added a subscriber: yuvipanda.
Restricted Application added subscribers: Matanya, Aklapper.Sep 10 2015, 8:15 PM
yuvipanda renamed this task from Site: (eqiad) hardware access request for ElasticSearch replication to Labs to Site: (eqiad) hardware request for ElasticSearch replication to Labs.Sep 10 2015, 8:17 PM
yuvipanda set Security to None.
RobH claimed this task.Sep 10 2015, 8:21 PM
RobH added a subscriber: RobH.

Chatted in IRC; yuvi says he'd need these for 3-4 weeks. (Stealing for review and escalation for approvals later this week.)

We'll need the production jobrunners (mw* hosts) to be able to hit port 9200 on this box, and labs instances to be able to hit port 80. Both will just speak HTTP.
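A minimal reachability check for those two paths, as a sketch: it only verifies that a TCP connection on the given port succeeds. The hostname here is a placeholder, not the real allocation:

```python
#!/usr/bin/env python3
"""Sketch of a reachability check for the two required paths:
prod jobrunners -> port 9200, labs instances -> port 80."""
import socket
import sys

HOST = "estest.eqiad.wmnet"  # placeholder hostname; substitute the real box

def tcp_reachable(host: str, port: int, timeout: float = 3.0) -> bool:
    """Return True if a TCP connection to host:port succeeds within timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False

if __name__ == "__main__":
    # Run with "9200" from a prod jobrunner, "80" from a labs instance.
    port = int(sys.argv[1]) if len(sys.argv) > 1 else 9200
    ok = tcp_reachable(HOST, port)
    print(f"{HOST}:{port} {'reachable' if ok else 'NOT reachable'}")
    sys.exit(0 if ok else 1)
```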

RobH renamed this task from Site: (eqiad) hardware request for ElasticSearch replication to Labs to eqiad: (1) hardware request for ElasticSearch replication to Labs - 4 weeks use.Sep 11 2015, 5:25 PM
RobH reassigned this task from RobH to mark.Sep 11 2015, 5:36 PM
RobH added subscribers: Cmjohnson, mark.

I'll allocate wmf4543 for this, pending @mark's approval of using it for ElasticSearch replication to labs (just a rubber stamp that he is fine with limited misc hardware being used for this).

As it is a 4-week use case, I don't expect any issues, but I don't want to assume things for @mark.

If approved, we'll need to pull wmf4543 off the spares page (wikitech) and then I'll put in followup tasks. We'll need to have it relocated from row C to row B. (I've chatted about this with @Cmjohnson already in passing, so he is expecting it.) This means two tasks for followup:

  • setup of wmf4543: task for the overall OS and software deployment
    • sub-task for the relocation of wmf4543 to row B

Assigning this to @mark for his approval. Once approved, please assign back to me and I'll handle the above.

Thanks!

mark added a comment.Sep 15 2015, 4:07 PM

So, this request seems to serve T109715, but that ticket doesn't seem to be about a temporary test. So what exactly is the goal of this temporary hardware, and what happens after that?

And who is actually the requester of this hardware, assuming it's not Yuvi?

It is temporary because we don't know what the actual hardware requirements in terms of memory will be with a fully replicated index, and this is the way to find out.

I guess the requesters are me and @EBernhardson.

EBernhardson added a comment.EditedSep 15 2015, 4:36 PM

Yes, the initial question to answer here is whether we can actually serve 2.5TB of indices out of a machine with 64GB of memory. In prod, the replicas bump the total data size to 8.3TB, served off a cluster with 3.5TB of memory.

We tried serving a single 150GB index out of a single 16GB labs instance and it failed miserably (10 to 30s per query). Adding a second 16GB labs instance brought that down to about 0.5-1s per query. Basically we are unsure which end of the spectrum this will end up at.
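The same uncertainty expressed as rough data-to-memory ratios; all figures come from the comments above, and the binary TB-to-GB conversion is an approximation:

```python
# Rough data-to-memory ratios behind the uncertainty above.
# All figures come from this thread; 1TB = 1024GB is an approximation.

prod_data_tb, prod_mem_tb = 8.3, 3.5         # prod cluster, incl. replicas
test_data_gb, test_mem_gb = 2.5 * 1024, 64   # proposed single test box

prod_ratio = prod_data_tb / prod_mem_tb      # ~2.4x data per unit of memory
test_ratio = test_data_gb / test_mem_gb      # ~40x data per unit of memory

print(f"prod: {prod_ratio:.1f}:1 data-to-memory")
print(f"test: {test_ratio:.0f}:1 data-to-memory")
```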

mark added a comment.Sep 15 2015, 5:24 PM

Alright, so it sounds like this will likely end up in a future procurement request for hardware then, after this trial period is up? :)

Depends on the result of this test - assuming we are actually able to get this to work with reasonable performance levels without having to spend a *lot* of money...

There is also the possibility of using the old lsearchd cluster (but they are 1.5yrs out of warranty), but I'm not super thrilled about the idea of having to use so many machines (only 300G of disk per server, so 12-ish).

Not sure if we can fit 12 machines in row B (requirement to be in the labs subnet), and pretty sure we don't want to :)

mark added a comment.Sep 16 2015, 2:01 PM

There is also the possibility of using the old lsearchd cluster (but they are 1.5yrs out of warranty), but I'm not super thrilled about the idea of having to use so many machines (only 300G of disk per server, so 12-ish).

No, that's... not a possibility. :) Those servers are too old, and should not be repurposed.

@yuvipanda: 12 servers in row B is not an option due to lack of space and the needs of labs.

mark added a comment.Sep 18 2015, 5:17 PM

The labs-support VLAN (unlike labs instances etc.) is not limited to row B, I think...

mark added a comment.Sep 18 2015, 5:18 PM

I approve a test with one spare server for a maximum of 6 weeks, after which it will need to be given back under any circumstances. If we want to move forward after that, we'll need to procure servers for it.

RobH claimed this task.Sep 18 2015, 5:19 PM
RobH closed this task as Resolved.Sep 21 2015, 7:31 PM

mendelevium / wmf4543 is allocated for this task. The related tasks for the setup and for updating the labels have been set to have this task as a blocker.

mendelevium / wmf4543 is allocated for this task. The related tasks for the setup and for updating the labels have been set to have this task as a blocker.

Wait, what? I thought that mendelevium was for OTRS?

RobH added a comment.Sep 21 2015, 8:04 PM

It was; I misallocated since it was in ganeti.

Now it's allocated as system name nobelium for wmf4543 (associated tasks already updated).