
Adjust balance of WDQS nodes to allow continued operation if eqiad went offline.
Closed, ResolvedPublic

Description

Currently both WDQS boxes are mostly idle, but strictly necessary for HA purposes.

I am left wondering, though, whether it would be reasonable, and perhaps prudent, to have the second machine in a separate datacenter. If we wanted to go that route, can the second machine in eqiad be returned to ops and an available server in codfw used, or would a machine for codfw need to be purchased?

Event Timeline

EBernhardson raised the priority of this task from to Needs Triage.
EBernhardson updated the task description.
Restricted Application added projects: Wikidata, Discovery. Jan 25 2016, 5:11 AM
Restricted Application added subscribers: StudiesWorld, Aklapper.

One machine per DC seems dangerous though, if we want full redundancy. I mean, if a DC goes down it is usually something serious, which may take significant time to fix. Then we're left with one machine, and we can't do any maintenance on it since it's all we have. So ideally I'd prefer 2 per DC, but with the current load I'm not sure 4 machines are justified... Maybe just add 1 server in codfw for now? Would be glad to hear comments on that.

EBernhardson set Security to None.
Deskana triaged this task as Low priority. Jan 25 2016, 11:51 PM
Deskana moved this task from Needs triage to Ops on the Discovery board.
Deskana added a subscriber: Deskana.

If we have lost a DC then we should not be doing maintenance on a node. Stability is key then. I'm used to running in an n+1 hardware layout where any of your DCs can fall over and the other(s) can take total traffic while running slightly hot. Then it doesn't really matter which one gets the traffic and your users are happy.
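To make the n+1 idea concrete, here is a minimal sketch of the check it implies: after losing any single DC, the remaining ones must still cover peak traffic. The function name and all capacity and load numbers are hypothetical, chosen only for illustration; they are not measurements from the WDQS hosts.

```python
def survives_single_dc_loss(capacity_by_dc, peak_load):
    """Return True if peak_load can still be served after losing any one DC.

    capacity_by_dc maps a datacenter name to the load it can serve
    (same unit as peak_load, e.g. queries per second).
    """
    total = sum(capacity_by_dc.values())
    # Remove each DC in turn and check what is left against peak load.
    return all(total - cap >= peak_load for cap in capacity_by_dc.values())


# Hypothetical figures, not real WDQS numbers.
both_in_eqiad = {"eqiad": 200, "codfw": 0}
one_per_dc = {"eqiad": 100, "codfw": 100}

print(survives_single_dc_loss(both_in_eqiad, peak_load=80))  # False
print(survives_single_dc_loss(one_per_dc, peak_load=80))     # True
```

With both machines in eqiad, losing eqiad leaves nothing; with one per DC, either side can carry the assumed peak on its own.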

OK then, I would suggest imaging a server in codfw, and once that is complete we can proceed to decommission one of the eqiad servers. Note also T120714 with regard to disk space. I was holding off on reimaging the servers because we'd need to reload the database anyway after the Blazegraph 2.0 upgrade and implementing geospatial search, but if this happens before then, we need to set up the codfw server the same way as we plan to do for the eqiad ones in the future.

EBernhardson added a subscriber: RobH. Edited Jan 28 2016, 6:49 AM

@RobH Would you be able to provide a ballpark estimate of per-server costs for WDQS nodes that we can include in next year's budget? These don't need to be perfect, just close enough for yearly planning purposes. We would be looking to add one or two of these servers to codfw in Q1 of next FY.

The existing nodes to be replicated are:

Dell PowerEdge R420s
24 cores
96G memory
4x Intel 320 Series SSDSA2CW300G3 2.5" 300GB

I'm not clear on disk space; @Smalyshev could perhaps chime in. It looks like they had 2x Intel SSDs, and I see the ticket about adding 2 more Intel SSDs, but Ganglia still reports total disk space as 225G, rather than the 400-500G I would expect from RAID 1.

I am not sure what is required by the RAID configuration, so I'll talk in terms of disk space. Right now the data partition on the WDQ servers has 184G and is at 62% capacity. That leaves too little headroom, so we'd need to add some more capacity; another ~200G would be ideal. So if it's one more 300G disk, that's fine.

Now, I'm not sure what the preferred RAID config is. If we do RAID 1 or RAID 10 (better) then we need 4 disks, unless we use bigger ones than 300G of course. The target is to have ~300-400G of disk space available for the data partition (i.e. not counting OS, deployments, etc.).
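The disk figures discussed above can be sanity-checked with some quick arithmetic. This is a rough sketch only: it uses nominal capacities (no filesystem, partitioning, or OS overhead), assumes identical disks and two-way mirrors for RAID 10, and the helper name `usable_gb` is made up for illustration.

```python
def usable_gb(level, disks, disk_gb):
    """Nominal usable capacity in GB for an array of identical disks."""
    if level == "raid1":
        if disks < 2:
            raise ValueError("RAID 1 needs at least 2 disks")
        # Every disk holds a full copy, so capacity equals one disk.
        return disk_gb
    if level == "raid10":
        if disks < 4 or disks % 2:
            raise ValueError("RAID 10 needs an even number of disks, at least 4")
        # Striping across two-way mirrored pairs: half the disks hold data.
        return disks // 2 * disk_gb
    raise ValueError(f"unsupported RAID level: {level}")


print(usable_gb("raid1", 2, 300))   # 300
print(usable_gb("raid10", 4, 300))  # 600

# Data partition figures quoted in the thread: 184G at 62% full.
partition_gb = 184
used_gb = partition_gb * 0.62
# Effect of the suggested extra ~200G on the fill level.
fill_after = used_gb / (partition_gb + 200)
print(f"used ~{used_gb:.0f}G, fill level after adding 200G: {fill_after:.0%}")
```

Under these assumptions, four 300G disks in RAID 10 give about 600G nominal, which would cover the ~300-400G data partition target with room left for the OS and deployments.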

The disks on wdq1/wdq2 are installed but not enabled, see T120714. So that's why you don't see much space there.

hoo added a subscriber: hoo. May 11 2016, 7:38 AM
Restricted Application added a subscriber: Southparkfan. May 13 2016, 10:26 PM
greg added a subscriber: greg. Sep 29 2016, 7:41 PM

This follow-up task from an incident report has not been updated recently. If it is no longer valid, please add a comment explaining why. If it is still valid, please prioritize it appropriately relative to your other work. If you have any questions, feel free to ask me (Greg Grossmeier).

I think it's still valid and we have some patches in review. I imagine when @Gehel comes back from admiring Sagrada Familia he can provide more details :)

Work is happening in sub tasks, but it is happening. This should be done by the end of next week.

Gehel closed this task as Resolved. Feb 1 2017, 4:17 PM
Gehel claimed this task.

This has actually been done for some time: we have 3 nodes in codfw, matching the 3 nodes in eqiad.