
Consolidate, remove, and/or downsize Beta Cluster instances to help with [[wikitech:Purge_2016]]
Closed, Resolved (Public)

Description

See https://wikitech.wikimedia.org/wiki/Purge_2016#In_use_deployment-prep

Priorities:

  • Replace -ms-be0[12] (@fgiunchedi) with smaller (maybe medium) instances? Both XLARGE! (note: these are both using only about 5G of their 100G of space for Swift, though they do seem to have quite a bit of activity)
  • Replace -logstash2 (@bd808) with a large instance? XLARGE!
  • Replace -tin with a c8.m8.s60 instance? XLARGE!
  • Replace -mediawiki with small flavours? We have three, all large!
  • Replace -fluorine with a custom small/medium flavour with a large disk size? It's large.
  • Delete one or two -elastic (@demon, @EBernhardson, @Gehel)? There are four of these, all large!
  • Delete one or two -parsoid (@Catrope, @ssastry, @mobrovac)? There are four of these, all medium.
  • Delete -pdf02 (@jeremyb, @cscott)? Only -pdf01 seems to be used according to MW config (medium).

Replacement instance proposed sizes:

Existing instance     | VCPUs | RAM     | Disk  | Notes                                                                                                                                | Status
deployment-ms-be0[12] | 8?    | 6-16GB? | 20GB  | xlarge VCPUs, medium-xlarge RAM, small disk; need to figure out what these servers are up to                                         | todo
deployment-logstash2  | 4     | 16GB?   | 80GB? | Large with xlarge RAM; could be smaller right now but needs space for spikes in log size and for active queries bursting CPU and RAM | todo
deployment-tin        | 4     | 4GB     | 80GB  | c8.m8.s60                                                                                                                            | Done
deployment-mediawiki* | 1     | 2GB     | 20GB  | Small?                                                                                                                               | todo
deployment-fluorine   | 1     | 2GB     | 80GB  | Small with large disk                                                                                                                | Done
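
For context, applying one of these proposed sizes with the stock OpenStack client would look roughly like the sketch below. The flavour and instance names come from the table above; whether an instance is resized in place or replaced with a freshly built one (as the task title suggests) is an operational choice.

  # Sketch only: resize deployment-tin to the proposed c8.m8.s60 flavour.
  # The task actually talks about replacing instances, so building a new
  # instance and decommissioning the old one may be the preferred route.
  openstack server resize --flavor c8.m8.s60 deployment-tin
  # Once the resized instance looks healthy, the resize must be confirmed:
  openstack server resize --confirm deployment-tin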

Small instances:

  • What about merging -changeprop (@mobrovac) into -sca0[1-3]?
  • What about merging -ores-redis (@Ladsgroup) into -redis0[12]?
  • Delete -conftool (@Joe)?
  • Delete one or two -kafka (@Ottomata)? There are four of these
  • @mmodell, are you using phab-beta?

Event Timeline


I did some digging through that page as well as instance lists, config files, htop and df -h.

(EDIT: List now moved to ticket description)

In most cases we should try to get input from at least the people listed before doing anything big.
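
For reference, the per-instance digging boils down to checks like the sketch below; the host list and the *.deployment-prep.eqiad.wmflabs naming are illustrative assumptions rather than the exact set that was inspected.

  # Rough audit of disk, RAM and load on a few Beta Cluster instances.
  for host in deployment-fluorine deployment-logstash2 deployment-tin; do
    echo "== $host =="
    ssh "$host.deployment-prep.eqiad.wmflabs" '
      df -h / /srv 2>/dev/null   # disk usage on the root and data mounts
      free -m                    # RAM and swap usage in MB
      uptime                     # load averages as a rough CPU indicator
    '
  done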

mediawiki-03 is a security only host. Are you all still using that for scanning, @dpatrick ?

Replace -logstash2 (@bd808) with a large instance? XLARGE!

The xlarge was used for disk space more than cpu needs. Currently we are using 27G on the /srv mount. It looks like we could probably do just fine with 4 CPUs and 50G of extra storage. CPU/RAM usage on this node is really bursty depending on the number of folks who are using Kibana to dig through the logs.

The deployment-fluorine instance seems to be about right for disk space, but it would probably do just fine with 1-2 CPU instead of 4 if we had a custom image that would allow us to keep the large disk allocation.

-ores-redis (@Ladsgroup) into -redis0[12]?

We decided to keep this as similar as possible to production. The ORES redis DB is dedicated, with its own setup, in production, and that's why I chose a dedicated instance. Also, ORES precaching would put huge I/O and memory pressure on it, which is already putting pressure on our labs setup (T141946). I highly advise against merging these instances.

Delete one or two -parsoid (@Catrope, @ssastry, @mobrovac)? There are four of these, all medium.

Done. Only deployment-parsoid09 survived.

What about merging -changeprop (@mobrovac) into -sca0[1-3]

That wouldn't be wise. On the one hand, we experiment heavily with the service there to minimise damage in production. On the other, given the limited resources in Beta, the service can hit resource-utilisation peaks high enough to crash other co-located services. I'm voting to keep it separate.

This is an awesome task, thanks @AlexMonk-WMF and @greg and other folks too!

mediawiki-03 is a security only host. Are you all still using that for scanning, @dpatrick ?

Yes. (Scanning is kind of broken at this moment, but yes.)

Note: I'm crossing things off my list above as each item is either dealt with or it is made clear by someone that the current allocation is appropriate, e.g. @Ladsgroup says not to merge -ores-redis, and @mobrovac says not to merge -changeprop, so those have been crossed off. Thanks for your input!

mediawiki-03 is a security only host. Are you all still using that for scanning, @dpatrick ?

Yes. (Scanning is kind of broken at this moment, but yes.)

I wonder if it should be replaced with a small one? Or does it become busy when in active use?

The deployment-fluorine instance seems to be about right for disk space, but it would probably do just fine with 1-2 CPU instead of 4 if we had a custom image that would allow us to keep the large disk allocation.

@yuvipanda has told me we should be able to get a s80.small flavour, which is basically m1.small but with m1.large disk space. Thanks @yuvipanda!
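
A minimal sketch of what such a flavour amounts to, assuming it were defined with the stock OpenStack CLI; the numbers follow the replacement table in the description, and the real flavour was presumably created by the Labs admins with their own tooling.

  # Hypothetical s80.small: m1.small-style CPU/RAM paired with an 80GB disk.
  openstack flavor create --vcpus 1 --ram 2048 --disk 80 s80.small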

Note: I'm crossing things off my list above as each item is either dealt with or it is made clear by someone that the current allocation is appropriate, e.g. @Ladsgroup says not to merge -ores-redis, and @mobrovac says not to merge -changeprop, so those have been crossed off. Thanks for your input!

Moving to the description (so I can find it easily; I hate having to click on "show older changes") ;)

mediawiki-03 is a security only host. Are you all still using that for scanning, @dpatrick ?

Yes. (Scanning is kind of broken at this moment, but yes.)

Good enough :)

I've added a table with some proposed replacement instance sizes. @demon, @EBernhardson, @Gehel: any thoughts about elastic? Do we need 4, and whether we do or not, do they need to be large (4 VCPUs, 8GB RAM, 80GB space) instead of more like small with large RAM (1-2 VCPUs, 6-8GB RAM, 20GB space)?

The estest instances do need that much space; honestly, they need much more disk space than that to meet our needs. Because of that we have acquired two dedicated servers with a combined 256GB of memory and 12TB of usable disk space. Once those are fully operational (soonish), these will be deleted.

You've got two dedicated servers and you're going to put them in the deployment-prep labs project?

Nikita13311331 subscribed.

Replace -logstash2 (@bd808) with a large instance? XLARGE!

Nikita13311331 raised the priority of this task from Medium to Unbreak Now!. Aug 9 2016, 4:48 AM
AlexMonk-WMF lowered the priority of this task from Unbreak Now! to Medium. Aug 9 2016, 4:49 AM

T138778: Update mariadb in deployment-prep from precise/mariadb 5.5 to jessie/mariadb 10
T142289: Set up deployment-imagescaler server(s) in the beta cluster

Replace -ms-be0[12] (@fgiunchedi) with smaller (maybe medium) instances? Both XLARGE! (note: these are using about 5G of 100G of space for Swift, though with some activity)
Replace -logstash2 (@bd808) with a large instance? XLARGE!
Replace -tin with a large instance? XLARGE! (note: mira is large)
Replace -mediawiki with small flavours? We have three, all large!
Replace -fluorine with a custom small/medium flavour with a large disk size? It's large.
Delete one or two -elastic (@demon, @EBernhardson, @Gehel)? There are four of these, all large!

Nikita13311331 changed the visibility from "Public (No Login Required)" to "Custom Policy".
Krenair removed a project: acl*security.
Krenair changed the visibility from "Custom Policy" to "Public (No Login Required)".
Krenair changed the edit policy from "All Users" to "Custom Policy".
Krenair changed Security from Software security bug to None.
Krenair added a subscriber: Nikita13311331.

After a quick look at the deployment-elastic0[5-8] instances I think we can remove one of them: the indices there are pretty small (except simplewiki) and most of the time they are configured with 2 replicas.
Removing one node and reducing the number of replicas to 1 might be doable. @Gehel, what do you think?
@Krenair can we wait til next week so that Erik comes back and we can have a discussion to actually decide what to do?

@Krenair can we wait til next week so that Erik comes back and we can have a discussion to actually decide what to do?

I'm using @AlexMonk-WMF on this ticket. Yes, of course.

Edit: I realised several days after writing this comment that I had touched the ticket under that other account to deal with a vandal... I decided to set a custom edit policy to prevent them from messing around with it too much again, but since my WMF account is unprivileged, I had to use the other one. So please ignore my other account on this ticket :)

Phabricator_maintenance renamed this task from Consolidate, remove, and/or downsize Beta Cluster instances to help with [[wikitech:Purge_2016]] (tracking) to Consolidate, remove, and/or downsize Beta Cluster instances to help with [[wikitech:Purge_2016]]. Aug 14 2016, 12:23 AM
Krenair changed the edit policy from "Custom Policy" to "All Users". Aug 14 2016, 12:27 AM

We now have a quota allowing 39 more instances, 28 VCPUs, about 71GB of RAM, a couple of floating IPs, and 4 security groups. So we currently have some room to grow where necessary, but T145611 is only temporary, and T145636 asked for 8GB more RAM than the resulting instance actually uses. So I would expect this to go down to 12 VCPUs and 31GB of RAM spare.

@demon, @dcausse, @Gehel, what do you think about the elastic instances?

@demon, @dcausse, @Gehel, what do you think about the elastic instances?

I haven't touched them since setting them up ages ago. If we can drop one or downsize them that's fine by me.

@AlexMonk-WMF I think we can remove one, I'll start to update elastic config for this cluster (reduce replica count).

Going down to two nodes is doable, but I know that @Gehel has some concerns with a 2-node setup.
Downsizing sounds difficult: these machines are already running with 8GB of RAM, and 4GB might not be enough.
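
As a sketch, dropping from 2 replicas to 1 across the beta indices is a single Elasticsearch settings call along these lines; the host/port are placeholders, and in practice the change would presumably be made through the cluster's configuration templates rather than by hand.

  # Reduce every index to 1 replica so the cluster can stay green with one
  # node fewer, then check the result.
  curl -XPUT 'http://localhost:9200/_all/_settings' -d '
  {
    "index": { "number_of_replicas": 1 }
  }'
  curl -s 'http://localhost:9200/_cat/indices?v'   # verify replica counts and index sizes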

Created T147777 to track the work on decommissioning deployment-elastic08.

As @dcausse guessed, I'm not too keen on going below 3 nodes in an elasticsearch cluster. But I'll follow up on disabling deployment-elastic08 in the meantime.
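
Decommissioning a node cleanly usually means draining its shards first; a minimal sketch using the standard allocation-exclusion setting (placeholder host again, and assuming the node name matches the instance name):

  # Ask Elasticsearch to move all shards off deployment-elastic08, then wait
  # until nothing is left on it before shutting the instance down.
  curl -XPUT 'http://localhost:9200/_cluster/settings' -d '
  {
    "transient": {
      "cluster.routing.allocation.exclude._name": "deployment-elastic08"
    }
  }'
  curl -s 'http://localhost:9200/_cat/shards' | grep deployment-elastic08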

deployment-elastic08 is now dead (see T147777).

Mentioned in SAL (#wikimedia-labs) [2016-10-24T14:51:26Z] <Krenair> T142288: Shut off -pdf02 and -conftool

So the 2016 purge came and went. Shall we close this now?

bd808 assigned this task to Krenair.
bd808 added a subscriber: Krenair.

So the 2016 purge came and went. Shall we close this now?

Seems reasonable to me.