
eqiad: (3) AQS replacement nodes
Closed, Resolved · Public

Description

Ok, this request is for nodes to replace the out-of-warranty (OOW) aqs100[123].

We'd like 32-64GB RAM, ~8 × 1TB SSDs, and ~12 cores.

This should come out of Analytics hardware budget for this FY.

@JAllemandou can expand if needed.

Event Timeline

Ottomata assigned this task to JAllemandou.
Ottomata raised the priority of this task from to Medium.
Ottomata updated the task description.
Restricted Application added a subscriber: Aklapper.
Ottomata renamed this task from "8 x 3 SSDs per for AQS nodes" to "8 x 3 SSDs for AQS nodes". · Jan 27 2016, 7:13 PM
Ottomata set Security to None.
Ottomata updated the task description.

When discussing Cassandra response-time issues with @GWicke, he told me the Services team had used SSDs to mitigate that issue.
They use Samsung 850 Pro 1TB SSDs, costing about $430 each on Amazon (see here).
As for storage, our current capacity planning gives us two years with 8TB per machine.

In the short term, a few tweaks could also help to reduce latency (see the sketch after this list):

  • Reduce replication factor for the per-article keyspace from 3 to 2.
  • Read with CL_ONE by default.
  • Possibly, reduce Cassandra heap size slightly to use less than 50% of total available RAM, leaving more space for page cache.
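
A minimal sketch of what the first two tweaks could look like against Cassandra, using the Python driver; the contact point, keyspace, table, and datacenter names here are illustrative assumptions, not the real AQS ones:

```python
from cassandra import ConsistencyLevel
from cassandra.cluster import Cluster
from cassandra.query import SimpleStatement

cluster = Cluster(['aqs-host.example'])  # placeholder contact point
session = cluster.connect()

# Tweak 1: drop the per-article keyspace from RF=3 to RF=2
# (keyspace and datacenter names are assumptions).
session.execute(
    "ALTER KEYSPACE per_article WITH replication = "
    "{'class': 'NetworkTopologyStrategy', 'eqiad': 2}")

# Tweak 2: read at CL_ONE (LOCAL_ONE) instead of a quorum, so a read
# waits for a single local replica rather than two out of three.
stmt = SimpleStatement(
    "SELECT views FROM per_article.data WHERE article = %s",
    consistency_level=ConsistencyLevel.LOCAL_ONE)
row = session.execute(stmt, ('Main_Page',)).one()
```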

@GWicke:

  • The replication factor for the per-article table had been changed to 2 a few months ago. I think RESTBase config management for Cassandra has changed it back ...
  • The read with CL_ONE is a great idea
  • About reducing Cassandra heap size, why not try :)

Change 267924 had a related patch set uploaded (by Eevans):
Change default consistency to localOne

https://gerrit.wikimedia.org/r/267924

@GWicke:

  • The replication factor for the per-article table had been changed to 2 a few months ago. I think RESTBase config management for Cassandra has changed it back ...

Yeah. :( RESTBase only has two notions of durability, low and not-low (replication factors 1 and 3, respectively). And if, during startup, a replication factor other than what is called for is detected, the keyspaces are altered accordingly.

TL;DR I think this would require code changes.
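
For illustration, the reconciliation behaviour described above might look like the sketch below. RESTBase itself is Node.js, so this is not its actual code; the keyspace, datacenter, and tier names are assumptions. It shows why a manual ALTER to RF=2 would get reverted on the next startup:

```python
from cassandra.cluster import Cluster

DURABILITY_RF = {'low': 1, 'default': 3}  # the two tiers described above

def reconcile_replication(session, keyspace, dc='eqiad',
                          durability='default'):
    wanted = DURABILITY_RF[durability]
    # Cassandra 3.x schema table; 2.x used system.schema_keyspaces.
    row = session.execute(
        "SELECT replication FROM system_schema.keyspaces "
        "WHERE keyspace_name = %s", (keyspace,)).one()
    current = int(row.replication.get(dc, 0))
    if current != wanted:
        # Any out-of-band change (e.g. a manual RF=2) is reverted here.
        session.execute(
            "ALTER KEYSPACE %s WITH replication = "
            "{'class': 'NetworkTopologyStrategy', '%s': %d}"
            % (keyspace, dc, wanted))
```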

  • The read with CL_ONE is a great idea

See: https://gerrit.wikimedia.org/r/267924

Hold off on this: it seems we will be replacing the aqs1xxx nodes, since they are out of warranty.

I +1ed the CR for changing the RESTBase read consistency to one (it'll be good even with SSDs).
The code change for setting the replication factor to 2 can wait; storage is not yet an issue.
Thanks @Eevans for the patch !

Hold off on this: it seems we will be replacing the aqs1xxx nodes, since they are out of warranty.

Moar memory please.

Change 267924 merged by Ottomata:
Change default consistency to localOne

https://gerrit.wikimedia.org/r/267924

JAllemandou renamed this task from "8 x 3 SSDs for AQS nodes" to "New Hardware for AQS has SSDs instead of HDD and more RAM than current nodes". · Feb 5 2016, 10:56 AM
Ottomata renamed this task from "New Hardware for AQS has SSDs instead of HDD and more RAM than current nodes" to "AQS replacement nodes". · Mar 3 2016, 10:59 PM
Ottomata reassigned this task from JAllemandou to RobH.
Ottomata updated the task description.
RobH subscribed.

Please note we don't want to use non-supported SSDs in production. As such, we don't want to purchase more of the Samsung SSDs. We order Intel S3610 SSDs with the systems, so that they are covered under the manufacturer's system warranty.

I have some additional questions before I can obtain quotes.

aqs100[1-3] have the following:

  • Dual Intel® Xeon® E5-2620 processors (2GHz, 6 cores each)
  • 48GB RAM
  • lots of disks installed

The future specification seems to be the following:

  • Dual Intel Xeon CPUs with 6 cores per CPU
  • Scale up to 64GB memory.
  • The hard disk space requirements are unclear.
    • Do you need 8TB after raid10 or before?
    • As I've stated, we order Intel S3610 SSDs direct from HP or Dell as part of the system build. This allows them to be covered under the system's warranty. These come in 1.2TB, which is close to the requested capacity. (They come in 480GB/800GB/1.2TB)

The raid/disk space question seems to be for @JAllemandou to answer, since he mentioned the 1TB disk use anyhow. Please note we don't want to run raid0 for disks in production use. We prefer raid1 (for two disks) or raid10. As such, this may require twice the disks/SSDs initially requested. How much storage space on SSD do you need?

I've assigned this back to @JAllemandou for his feedback. Please provide and assign back to me. Thanks!

RobH renamed this task from "AQS replacement nodes" to "eqiad: (3)AQS replacement nodes". · Mar 4 2016, 12:04 AM
RobH renamed this task from "eqiad: (3)AQS replacement nodes" to "eqiad: (3) AQS replacement nodes".
RobH moved this task from Backlog to In Discussion / Review on the hardware-requests board.

@Eevans: is 64GB memory good (for 2 × 6-core CPUs), or is it better to ask for 128?

@RobH: 8TB usable (after RAID 10) per machine gives us two years at least. If we have 8TB raw (before RAID 10), that makes 4TB usable, giving us about 1 year, if not less. IIRC RAID is not set up on aqs100[1-3]; that's why we didn't consider it.
Since those machines will be new, it's better IMO to assume 2 years of stability, therefore 8TB usable.
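
To make the arithmetic explicit: RAID 10 stripes across mirrored pairs, so usable capacity is half the raw total. A quick sketch (the drive counts and sizes are just the options discussed on this task):

```python
def raid10_usable_tb(drive_count, drive_tb):
    """RAID 10 mirrors each pair of drives: usable = raw / 2."""
    assert drive_count % 2 == 0, "RAID 10 needs an even number of drives"
    return drive_count * drive_tb / 2

print(raid10_usable_tb(8, 1.2))  # 8 x 1.2TB S3610s -> 4.8TB usable (short)
print(raid10_usable_tb(8, 2.0))  # 8 x 2TB SSDs     -> 8.0TB usable (target)
```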

Agreed; we expect systems to typically last for three years. I'll move ahead on getting quotes for at least 8TB of usable space after raid10 on SSD, in both 64GB and 128GB memory configurations.

This is actually very, very close to our potentially updated spare specification on T128910, except it has a much larger SSD space requirement. So we won't be able to allocate any kind of spares for this.

RobH mentioned this in Unknown Object (Task). · Mar 15 2016, 10:11 PM
RobH added a subscriber: mark.

So we have a specification for this: it is actually our new spare pool specification on T128910. I've added this as dependent on that specification and order. (This specific one will need that spec, but with the 8 × 2TB SSD option.) Pricing is on that task, and should not be copied to this task, which is outside the S4 space.

I should be following up with @mark about the spare pool order tomorrow.

The systems that can be used for this were ordered today on T130738.

I'm now assigning this task to @mark.
@mark: Please review the above request. Please attach relevant approvals for allocation, or add questions/comments for followup, and assign back to me. This request was noted on the spare pool order on T128910.

Thanks!

RobH reassigned this task from RobH to mark. (Edited) · Mar 28 2016, 4:29 PM
RobH removed a subscriber: Mark.Otaris.

I accidentally assigned another person instead, and didn't assign the task to mark. Not sure how I did that, but it didn't reveal any private info. Even when someone is directly subscribed to a task linked to S4, it won't let them actually view it unless they are also in the right ACL groups.

I've corrected and assigned this to @mark for his review.

Approved from the pool of new spare systems.

@RobH, are the SSDs for this already ordered too?

RobH mentioned this in Unknown Object (Task). · Mar 31 2016, 4:08 PM

I'm not fully following the SSD quote ticket, but just looking for a status update. How soon do you think we will have these?

We are having problems with the AQS nodes related to the lack of SSDs: our iowait is getting very high, to the point that we think it might be affecting the loading of new data; some of our loading jobs are failing.

Can we get an ETA on those SSDs?

RobH added a subtask: Unknown Object (Task). · Apr 5 2016, 6:14 PM
RobH edited subtasks, added: Unknown Object (Task); removed: Unknown Object (Task).
RobH mentioned this in Unknown Object (Task). · Apr 7 2016, 9:58 PM
RobH edited subtasks, added: Unknown Object (Task); removed: Unknown Object (Task).

@RobH, over in T132067 it looks like these nodes were ordered; is this correct? If so, any idea on the ETA?

They've arrived onsite and are in the queue for Chris to rack. I'm marking this as resolved by the purchase task T132067.

@Cmjohnson if you could prioritize this one a little, we'd appreciate it. We've been waiting for a while and the current OOW nodes that are hosting this service are starting to fall over. Thank you!

We can call these aqs100[456]. If you can just get these into DNS and ready for install, we will handle the actual partman layout and installation. The disk layout is still under discussion here: https://etherpad.wikimedia.org/p/analytics-aqs-cassandra

mark closed subtask Unknown Object (Task) as Resolved. · May 6 2016, 3:42 PM