
Request three servers for Pageview API
Closed, ResolvedPublic

Description

Labs Project Tested: Tested (cassandra+restbase on cassandra-dev.analytics.eqiad.wmflabs but not puppetized)
Site/Location: EQIAD
Number of systems: 3
Service: Pageview API
Networking Requirements: internal
Processor Requirements: #REQUIRED
Memory: #REQUIRED
Disks: Cassandra wants SSDs
NIC(s): 1x Gb
Partitioning scheme: TBD
Other Requirements:

We are asking for three servers to store pageview data and serve it to the broader community. These servers are going to run RESTBase with Cassandra as storage. The reason for 3 is so we can have redundancy in Cassandra. @Ottomata (cc-ed) said he has some hardware from the recent Kafka Broker decommission that we can use.
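As a rough illustration of why three nodes give redundancy, here is a minimal sketch of the standard Cassandra quorum arithmetic. The replication factor of 3 (one replica per server) is an assumption; the task does not state the keyspace settings.

```python
def quorum(replication_factor: int) -> int:
    """Replicas that must respond for a QUORUM read or write."""
    return replication_factor // 2 + 1


def tolerated_failures(replication_factor: int, required_acks: int) -> int:
    """Replicas that can be down while required_acks can still be met."""
    return replication_factor - required_acks


# Assumption: replication factor 3, i.e. one replica on each requested server.
RF = 3

if __name__ == "__main__":
    q = quorum(RF)
    print("QUORUM needs", q, "of", RF, "replicas")
    print("nodes that can be down at QUORUM:", tolerated_failures(RF, q))
    print("nodes that can be down at ONE:", tolerated_failures(RF, 1))
```

Under that assumption, quorum operations survive one node being down, and only consistency level ONE survives two.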

Event Timeline

Milimetric raised the priority of this task from to Needs Triage.
Milimetric updated the task description.
Milimetric added a project: hardware-requests.
Milimetric added subscribers: Milimetric, Ottomata, mark.
Restricted Application added a project: acl*sre-team. Sep 1 2015, 2:40 PM
Restricted Application added a subscriber: Aklapper.

This was proposed after the meeting ops had with analytics, where the requirements were clarified. I support it: 3 different spare boxes on 3 rack rows (if possible) in eqiad should do it. Networking-wise, they should be outside the analytics VLAN, as there will be no private data in the cluster. I'll update the task with the format required in https://phabricator.wikimedia.org/project/profile/1014/

Restricted Application added a subscriber: Matanya. Sep 4 2015, 2:26 PM
akosiaris updated the task description. Sep 4 2015, 2:29 PM
akosiaris set Security to None.
Ottomata added a comment. Edited Sep 4 2015, 3:00 PM

The boxes that can be slated for this currently are:

  • analytics1011
  • analytics1016
  • analytics1017
  • analytics1019
  • analytics1015
  • analytics1021

These are all Dell PowerEdge R720s: 12 cores (E5-2620 @ 2.00GHz), 48G RAM, 12 x 2T HDDs.

The first 4 in this list are still live Hadoop workers. analytics1015 has already been removed from Hadoop, as I had planned to use it for a new Hive server. analytics1021 was previously a Kafka broker. We'll have to check to see which rows these are in to choose correctly.

JAllemandou updated the task description. Sep 4 2015, 4:41 PM
JAllemandou added a subscriber: JAllemandou.

The machines @Ottomata describes have no SSDs --> @akosiaris: Is that a no-go?

SSD vs. no SSD depends on the workload, I guess. If we are bulk writing and have enough RAM to cache reads, spinning disks might also do it (for comparison, restbase sees ~200 IOPS per SSD). Perhaps worth a quick test if we have the machines already good to go.

What @fgiunchedi says. Datastax officially recommends SSDs for cassandra

http://docs.datastax.com/en/cassandra/2.0/cassandra/architecture/architecturePlanningHardware_c.html

but it obviously depends on the workload. That being said, I have no estimate of what the workload will be. A 10GB per day write workload is not much though. We might end up wanting to rate limit it, but that should be doable in the code. Read-wise, IIRC the estimates talk about a minimal number of req/s, so we might very well be OK with spinning disks.
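To put the 10GB per day figure in perspective, a back-of-the-envelope sketch follows. It assumes decimal gigabytes and a perfectly even write rate, and it ignores replication, compaction and bulk-load bursts, so treat it as a lower bound on what the disks actually see.

```python
SECONDS_PER_DAY = 24 * 60 * 60


def average_write_rate(bytes_per_day: float) -> float:
    """Average bytes per second if writes were spread evenly over a day."""
    return bytes_per_day / SECONDS_PER_DAY


# 10GB/day comes from the comment above; decimal GB (10**9 bytes) is an assumption.
DAILY_WRITE_BYTES = 10 * 10**9

if __name__ == "__main__":
    rate = average_write_rate(DAILY_WRITE_BYTES)
    print(f"average ingest rate: {rate / 1e3:.0f} kB/s")
```

That works out to roughly 116 kB/s on average, which is small compared with the sequential throughput of a spinning disk. A bulk load compressed into a short window would of course look very different.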

jcrespo triaged this task as Normal priority. Sep 8 2015, 9:01 AM
jcrespo added a subscriber: jcrespo.

Normal, as per a conversation with them: "not an emergency".

@akosiaris / @fgiunchedi: I believe you can run any tests you like on those machines, they're not doing anything at the moment. But I agree with Alex's optimism, 10GB / day is not a lot of writing and we only expect light load for the short to medium future.

Those machines are in 3 different rack rows indeed, so they might very well be good.

analytics1011 => A rack row
analytics1015 => C rack row
analytics1016 => C rack row
analytics1017 => C rack row
analytics1019 => D rack row
analytics1021 => A rack row

So it seems analytics1019 is a given, need to look at the specs of the other ones.
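The reasoning above (one box per rack row, so the only box in row D is a given) can be sketched as follows. The host-to-row data is copied from the list above; the function names are made up for illustration.

```python
from collections import defaultdict

# Host -> rack row, as listed in the comment above.
ROW_OF = {
    "analytics1011": "A",
    "analytics1015": "C",
    "analytics1016": "C",
    "analytics1017": "C",
    "analytics1019": "D",
    "analytics1021": "A",
}


def hosts_by_row(row_of):
    """Group candidate hosts by rack row."""
    rows = defaultdict(list)
    for host, row in sorted(row_of.items()):
        rows[row].append(host)
    return dict(rows)


def forced_choices(row_of):
    """Hosts that are the only candidate in their row."""
    return [hosts[0] for hosts in hosts_by_row(row_of).values() if len(hosts) == 1]


if __name__ == "__main__":
    for row, hosts in hosts_by_row(ROW_OF).items():
        print(row, hosts)
    print("given:", forced_choices(ROW_OF))
```

Rows A and C each offer more than one candidate, so the choice there is free as long as the specs match.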

Right, we'll need to update our puppet accordingly @Milimetric :)

The specs of those are all the same.

We'll use

  • analytics1011
  • analytics1016
  • analytics1019

These will be reinstalled with Jessie and renamed. The current node names we are going with are aqs10xx (Analytics Query Service).

ObjectionS!?!?!

mark added a comment. Sep 15 2015, 1:49 PM

Be aware that these machines are running out of warranty in a month, and are slated to be replaced with new hardware in ~1 year.

@akosiaris we need to know:

  • Is aqs100x ok for a name
  • What VLAN should I put these in (and how?)

If someone is putting machines into vlans can I watch too?

@akosiaris we need to know:

  • Is aqs100x ok for a name

I got nothing better, so yes.

  • What VLAN should I put these in (and how?)

3 different ones, one for every row. Those should be:

  • analytics1011 => private1-a-eqiad
  • analytics1016 => private1-c-eqiad
  • analytics1019 => private1-d-eqiad
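The mapping above follows a per-row pattern, which can be sketched like this. The `private1-<row>-<site>` naming rule is inferred from the three names listed here and is an assumption, not a statement about how the network config is actually generated; `private_vlan` is a hypothetical helper.

```python
# Chosen hosts and their rack rows, from earlier in this task.
CHOSEN = {
    "analytics1011": "A",
    "analytics1016": "C",
    "analytics1019": "D",
}


def private_vlan(row: str, site: str = "eqiad") -> str:
    """Build a VLAN name using the pattern inferred from the list above."""
    return f"private1-{row.lower()}-{site}"


if __name__ == "__main__":
    for host, row in CHOSEN.items():
        print(f"{host} => {private_vlan(row)}")
```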

DNS changes for assigning IPs (add me as a reviewer) are needed, and DHCP/install_server changes as well.
I'll do the changes on the switches as soon as I get a thumbs up, that is, confirmation that those boxes can safely be wiped (effectively). That means:

  • depooled from any service
  • removed from puppet/salt
  • and of course icinga, which is automagically done 30 mins after the step above.

@Ottomata, IIRC you already stated the first part is done, no? Please confirm, the rest is easy enough.

Be aware that these machines are running out of warranty in a month, and are slated to be replaced with new hardware in ~1 year.

This is disconcerting though. We got 3 so we should be able to bear the problems associated with 1 (or even 2) going down in the meantime but I am not loving this.

If someone is putting machines into vlans can I watch too?

Yes. Hangout with screen sharing (I can't think of a way screen -x will work with agent forwarding disabled right now).

Change 239175 had a related patch set uploaded (by Ottomata):
Rename analytics nodes to aqs (analytics query service), put them in private1 vlans

https://gerrit.wikimedia.org/r/239175

Change 239177 had a related patch set uploaded (by Ottomata):
Rename analytics1011, 1016, and 1019 to aqs1001, 1002, 1003

https://gerrit.wikimedia.org/r/239177

Ottomata added a comment. Edited Sep 17 2015, 7:27 PM

Ok!

https://gerrit.wikimedia.org/r/#/c/239175/
https://gerrit.wikimedia.org/r/#/c/239177/

analytics1011, 1016 and 1019 are removed from Hadoop and ready for reinstall at any time. I'm not sure what partition layout these need, so I just removed them from netboot.cfg. We'll need to add aqs* entries there and pick a partman recipe.

This is disconcerting though. We got 3 so we should be able to bear the problems associated with 1 (or even 2) going down in the meantime but I am not loving this.

Yeah, I mean, all 6 of our Kafka brokers are these same models with the same warranty :(

LGTM

analytics1011, 1016 and 1019 are removed from Hadoop and ready for reinstall at any time. I'm not sure what partition layout these need, so I just removed them from netboot.cfg. We'll need to add aqs* entries there and pick a partman recipe.

I suppose let's go with a simple LVM?

This is disconcerting though. We got 3 so we should be able to bear the problems associated with 1 (or even 2) going down in the meantime but I am not loving this.

Yeah, I mean, all 6 of our Kafka brokers are these same models with the same warranty :(

Indeed :-(

Change 239371 had a related patch set uploaded (by Ottomata):
Add cassandrahosts-12hdd.cfg partman recipe and test on d-i-test

https://gerrit.wikimedia.org/r/239371

Change 239371 merged by Ottomata:
Add cassandrahosts-12hdd.cfg partman recipe and test on d-i-test

https://gerrit.wikimedia.org/r/239371

Change 239376 had a related patch set uploaded (by Ottomata):
Try again with cassandrahosts-12hdd.cfg

https://gerrit.wikimedia.org/r/239376

Change 239376 merged by Ottomata:
Try again with cassandrahosts-12hdd.cfg

https://gerrit.wikimedia.org/r/239376

Change 239175 merged by Ottomata:
Rename analytics nodes to aqs (analytics query service), put them in private1 vlans

https://gerrit.wikimedia.org/r/239175

Change 239177 merged by Ottomata:
Rename analytics1011, 1016, and 1019 to aqs1001, 1002, 1003

https://gerrit.wikimedia.org/r/239177

Ok, I will have to manually partition these, partman is too dumb.

Alex, proceed with VLAN changes! Then we can reinstall.

Alex, proceed with VLAN changes! Then we can reinstall.

Done. All 3 boxes have had their VLAN and interface descriptions changed.

Ottomata moved this task from In Progress to Done on the Analytics-Kanban board. Sep 21 2015, 3:30 PM
kevinator closed this task as Resolved. Sep 21 2015, 3:40 PM
kevinator added a subscriber: kevinator.