Ok, this request is for nodes to replace the OOW aqs100[123].
We'd like 32-64G RAM, ~8 1T SSDs, ~12 cores.
This should come out of Analytics hardware budget for this FY.
@JAllemandou can expand if needed.
Subject | Repo | Branch | Lines +/-
---|---|---|---
Change default consistency to localOne | operations/puppet | production | +1 -1
Status | Assigned | Task
---|---|---
Resolved | Ottomata | T134275 rack/setup/deploy 3 eqiad druid nodes
Resolved | RobH | T128807 eqiad: (3) nodes for Druid / analytics
Unknown Object (Task) | |
Duplicate | • mobrovac | T125345 Many error 500 from pageviews API "Error in Cassandra table storage backend"
Resolved | JAllemandou | T124314 Better response times on AQS (Pageview API mostly) {melc}
Declined | Ottomata | T124951 Hadoop Node expansion for end of FY
Duplicate | elukey | T132938 Provision new SSD-able machines on AQS
Resolved | RobH | T124947 eqiad: (3) AQS replacement nodes
Unknown Object (Task) | |
Unknown Object (Task) | |
Unknown Object (Task) | |
In the short term, a few tweaks could also help to reduce latency:
Change 267924 had a related patch set uploaded (by Eevans):
Change default consistency to localOne
Yeah. :( RESTBase only has two notions of durability, low and not low (replication factor 1 and 3, respectively). And if, during startup, a replication factor other than what is called for is detected, the keyspaces are altered accordingly.
TL;DR I think this would require code changes.
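To make the behavior described above concrete, here is a minimal Python sketch of that startup logic (the function names, keyspace, and datacenter names are hypothetical; RESTBase itself is not written in Python, so this is an illustration, not its actual code):

```python
def desired_replication(durable: bool) -> int:
    """RESTBase only distinguishes low vs. not-low durability:
    replication factor 1 vs. 3."""
    return 3 if durable else 1

def alter_statement(keyspace: str, dc: str, current_rf: int, durable: bool):
    """Return the ALTER KEYSPACE CQL to run if the observed replication
    factor differs from what the durability setting calls for, else None."""
    rf = desired_replication(durable)
    if current_rf == rf:
        return None  # keyspace already matches; nothing to alter
    return (
        f"ALTER KEYSPACE \"{keyspace}\" WITH replication = "
        f"{{'class': 'NetworkTopologyStrategy', '{dc}': {rf}}}"
    )
```

This is why simply setting the factor to 2 by hand would not stick: on the next startup the keyspace would be detected as mismatched and altered back, hence the need for code changes.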
- The read with CL_ONE is a great idea
Hold on this; it seems these will be replacing the aqs1xxx nodes since they are out of warranty.
I +1ed the CR for changing restbase read consistency to one (it'll be good even with SSDs).
The code change to set the replication factor to 2 can wait; storage is not yet an issue.
Thanks @Eevans for the patch!
Please note we don't want to use non-supported SSDs in production. As such, we don't want to purchase more of the Samsung SSDs. We order Intel S3610 SSDs with the systems, as then they are covered under the system warranty with the manufacturer.
I have some additional questions before I can obtain quotes.
aqs100[1-3] have the following:
The future specification seems to be the following:
The raid/disk space question seems to be for @JAllemandou to answer, since he mentioned the 1TB disk use anyhow. Please note we don't want to run raid0 for disks in production use. We prefer raid1 (for two disks) or raid10. As such, this may require twice the disks/SSDs initially requested. How much storage space on SSD do you need?
I've assigned this back to @JAllemandou for his feedback. Please provide and assign back to me. Thanks!
@Eevans: is 64GB of memory good (for 2x 6-core CPUs), or is it better to ask for 128?
@RobH: 8T usable (after RAID 10) per machine gives us two years at least. If we have 8TB raw (before RAID 10), it makes 4TB usable, giving us about 1 year if not less. IIRC on aqs100[1-3] RAID is not set up, which is why we didn't consider it.
Since those machines will be new, it's better IMO to assume 2 years of stability, therefore 8T usable.
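The capacity arithmetic above can be sketched as follows (the disk counts and sizes are the figures under discussion in this thread, not a final spec):

```python
def raid10_usable_tb(disk_count: int, disk_size_tb: float) -> float:
    """RAID 10 mirrors disks in pairs, so usable space is half the raw total.

    disk_count must be even, since disks are consumed in mirrored pairs.
    """
    assert disk_count % 2 == 0, "RAID 10 needs an even number of disks"
    return disk_count * disk_size_tb / 2

# 8 x 1TB SSDs in RAID 10 -> 4TB usable: roughly one year of headroom.
print(raid10_usable_tb(8, 1.0))  # 4.0
# 8 x 2TB SSDs in RAID 10 -> 8TB usable: the two-year target.
print(raid10_usable_tb(8, 2.0))  # 8.0
```

This is why doubling the disks (or their size) relative to the original request is needed once RAID 10 is in the picture.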
Agreed, we expect systems to typically last for three years. I'll move ahead on getting a quote to give you at least 8TB usable space after raid10 on SSD. I'll have the quotes for both 64GB and 128GB.
This is actually very, very close to our potentially updated spare specification on T128910, except it has a LOT of SSD space requirement. So we won't be able to allocate any kind of spares for this.
So we have a specification for this, it is actually our new spare pool specification on T128910. I've added this as dependent on that specification and order. (This specific one will need that spec, but with 8 * 2TB SSD option.) Pricing is on that task, and should not be copied to this non S4 space task.
I should be following up with @mark about the spare pool order tomorrow.
The systems that can be used for this were ordered today on T130738.
I'm now assigning this task to @mark.
@mark: Please review the above request. Please attach relevant approvals for allocation, or add questions/comments for followup, and assign back to me. This request was noted on the spare pool order on T128910.
Thanks!
I accidentally assigned this to another person, and didn't assign the task to mark. Not sure how I did that, but it didn't reveal any private info. Even when someone is directly subscribed to a task linked to S4, they can't actually view it unless they are also in the right ACL groups.
I've corrected and assigned this to @mark for his review.
I'm not fully following the SSD quote ticket, but just looking for a status update. How soon do you think we will have these?
We are having problems with the AQS nodes related to the lack of SSDs: our iowait is going really high, to the point that we think it might be affecting the loading of new data; some of our loading jobs are failing.
Can we get an ETA on those SSDs?
@RobH, over in T132067 it looks like these nodes were ordered, is this correct? If so, any idea on ETA?
They've arrived on site, and are in the queue for Chris to rack. I'm marking this as resolved by the purchase task T132067.
@Cmjohnson if you could prioritize this one a little, we'd appreciate it. We've been waiting for a while and the current OOW nodes that are hosting this service are starting to fall over. Thank you!
We can call these aqs100[456]. If you can just get these to DNS and ready for install, we will handle the actual partman layout and install. The disk layout is still under discussion here: https://etherpad.wikimedia.org/p/analytics-aqs-cassandra