
Consolidate, remove, and/or downsize Beta Cluster instances to help with [[wikitech:Purge_2016]]
Closed, Resolved (Public)

Description

See https://wikitech.wikimedia.org/wiki/Purge_2016#In_use_deployment-prep

Priorities:

  • Replace -ms-be0[12] (@fgiunchedi) with smaller (maybe medium) instances? Both XLARGE! (note: these are both using only about 5G of their 100G of space for Swift, though they do seem to have quite a bit of activity)
  • Replace -logstash2 (@bd808) with a large instance? XLARGE!
  • Replace -tin with a c8.m8.s60 instance? XLARGE!
  • Replace -mediawiki with small flavours? We have three, all large!
  • Replace -fluorine with a custom small/medium flavour with a large disk size? It's large.
  • Delete one or two -elastic (@demon, @EBernhardson, @Gehel)? There are four of these, all large!
  • Delete one or two -parsoid (@Catrope, @ssastry, @mobrovac)? There are four of these, all medium.
  • Delete -pdf02 (@jeremyb, @cscott)? Only -pdf01 seems to be used according to MW config (medium).

Replacement instance proposed sizes:

Existing instance     | VCPUs | RAM     | Disk  | Notes                                                                                                                                | Status
deployment-ms-be0[12] | 8?    | 6-16GB? | 20GB  | xlarge VCPUs, medium-xlarge RAM, small disk; need to figure out what these servers are up to                                         | todo
deployment-logstash2  | 4     | 16GB?   | 80GB? | Large with xlarge RAM; could be smaller right now but needs space for spikes in log size and for active queries bursting CPU and RAM | todo
deployment-tin        | 4     | 4GB     | 80GB  | c8.m8.s60                                                                                                                            | Done
deployment-mediawiki* | 1     | 2GB     | 20GB  | Small?                                                                                                                               | todo
deployment-fluorine   | 1     | 2GB     | 80GB  | Small with large disk                                                                                                                | Done
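
For context, applying one of these proposed sizes with the stock OpenStack client would look roughly like the sketch below. The flavour and instance names come from the table above; whether an instance is resized in place or replaced with a freshly built one (as the task title suggests) is an operational choice.

  # Sketch only: resize deployment-tin to the proposed c8.m8.s60 flavour.
  # The task actually talks about replacing instances, so building a new
  # instance and decommissioning the old one may be the preferred route.
  openstack server resize --flavor c8.m8.s60 deployment-tin
  # Once the resized instance looks healthy, the resize must be confirmed:
  openstack server resize --confirm deployment-tin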

Small instances:

  • What about merging -changeprop (@mobrovac) into -sca0[1-3]?
  • What about merging -ores-redis (@Ladsgroup) into -redis0[12]?
  • Delete -conftool (@Joe)?
  • Delete one or two -kafka (@Ottomata)? There are four of these
  • @mmodell, are you using phab-beta?

Event Timeline


I did some digging through that page as well as instance lists, config files, htop and df -h.

(EDIT: List now moved to ticket description)

In most cases we should try to get input from at least the people listed before doing anything big.
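
For reference, the per-instance digging boils down to checks like the sketch below; the host list and the *.deployment-prep.eqiad.wmflabs naming are illustrative assumptions rather than the exact set that was inspected.

  # Rough audit of disk, RAM and load on a few Beta Cluster instances.
  for host in deployment-fluorine deployment-logstash2 deployment-tin; do
    echo "== $host =="
    ssh "$host.deployment-prep.eqiad.wmflabs" '
      df -h / /srv 2>/dev/null   # disk usage on the root and data mounts
      free -m                    # RAM and swap usage in MB
      uptime                     # load averages as a rough CPU indicator
    '
  done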

mediawiki-03 is a security only host. Are you all still using that for scanning, @dpatrick ?

Replace -logstash2 (@bd808) with a large instance? XLARGE!

The xlarge was used for disk space more than cpu needs. Currently we are using 27G on the /srv mount. It looks like we could probably do just fine with 4 CPUs and 50G of extra storage. CPU/RAM usage on this node is really bursty depending on the number of folks who are using Kibana to dig through the logs.

The deployment-fluorine instance seems to be about right for disk space, but it would probably do just fine with 1-2 CPU instead of 4 if we had a custom image that would allow us to keep the large disk allocation.

-ores-redis (@Ladsgroup) into -redis0[12]?

We decided to keep this as similar as possible to production. The ORES redis DB is dedicated, with its own setup, in production, and that's why I chose a dedicated instance. Also, ORES precaching would put huge I/O and memory pressure on it, which is already putting pressure on our labs setup (T141946). I highly advise against merging these instances.

Delete one or two -parsoid (@Catrope, @ssastry, @mobrovac)? There are four of these, all medium.

Done. Only deployment-parsoid09 survived.

What about merging -changeprop (@mobrovac) into -sca0[1-3]

That wouldn't be wise. On the one hand, we experiment heavily with the service there to minimise damage in production. On the other, given the limited resources in Beta, the service can hit resource-utilisation peaks high enough to crash other co-located services. I'm voting to keep it separate.

This is an awesome task, thanks @AlexMonk-WMF and @greg and other folks too!

mediawiki-03 is a security only host. Are you all still using that for scanning, @dpatrick ?

Yes. (Scanning is kind of broken at this moment, but yes.)

Note: I'm crossing things off my list above as each item is either dealt with or it is made clear by someone that the current allocation is appropriate, e.g. @Ladsgroup says not to merge -ores-redis, and @mobrovac says not to merge -changeprop, so those have been crossed off. Thanks for your input!

mediawiki-03 is a security only host. Are you all still using that for scanning, @dpatrick ?

Yes. (Scanning is kind of broken at this moment, but yes.)

I wonder if it should be replaced with a small one? Or does it become busy when in active use?

The deployment-fluorine instance seems to be about right for disk space, but it would probably do just fine with 1-2 CPU instead of 4 if we had a custom image that would allow us to keep the large disk allocation.

@yuvipanda has told me we should be able to get a s80.small flavour, which is basically m1.small but with m1.large disk space. Thanks @yuvipanda!
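
A minimal sketch of what such a flavour amounts to, assuming it were defined with the stock OpenStack CLI; the numbers follow the replacement table in the description, and the real flavour was presumably created by the Labs admins with their own tooling.

  # Hypothetical s80.small: m1.small-style CPU/RAM paired with an 80GB disk.
  openstack flavor create --vcpus 1 --ram 2048 --disk 80 s80.small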

Note: I'm crossing things off my list above as each item is either dealt with or it is made clear by someone that the current allocation is appropriate, e.g. @Ladsgroup says not to merge -ores-redis, and @mobrovac says not to merge -changeprop, so those have been crossed off. Thanks for your input!

Moving to the description (so I can find it easily; I hate having to click on "show older changes") ;)

mediawiki-03 is a security only host. Are you all still using that for scanning, @dpatrick ?

Yes. (Scanning is kind of broken at this moment, but yes.)

Good enough :)

I've added a table with some proposed replacement instance sizes. @demon, @EBernhardson, @Gehel: any thoughts about elastic? Do we need 4, and whether we do or not, do they need to be large (4 VCPUs, 8GB RAM, 80GB space) instead of more like small with large RAM (1-2 VCPUs, 6-8GB RAM, 20GB space)?

The estest instances do need that much space; honestly, they need much more disk space than that to meet our needs. Because of that we have acquired two dedicated servers with a combined 256GB of memory and 12TB of usable disk space. Once those are fully operational (soonish), these will be deleted.

You've got two dedicated servers and you're going to put them in the deployment-prep labs project?

Nikita13311331 subscribed.

Replace -logstash2 (@bd808) with a large instance? XLARGE!

Nikita13311331 raised the priority of this task from Medium to Unbreak Now!. Aug 9 2016, 4:48 AM
AlexMonk-WMF lowered the priority of this task from Unbreak Now! to Medium. Aug 9 2016, 4:49 AM

T138778: Update mariadb in deployment-prep from precise/mariadb 5.5 to jessie/mariadb 10
T142289: Set up deployment-imagescaler server(s) in the beta cluster

Replace -ms-be0[12] (@fgiunchedi) with smaller (maybe medium) instances? Both XLARGE! (note: these are using about 5G of 100G of space for Swift, though with some activity)
Replace -logstash2 (@bd808) with a large instance? XLARGE!
Replace -tin with a large instance? XLARGE! (note: mira is large)
Replace -mediawiki with small flavours? We have three, all large!
Replace -fluorine with a custom small/medium flavour with a large disk size? It's large.
Delete one or two -elastic (@demon, @EBernhardson, @Gehel)? There are four of these, all large!

Nikita13311331 changed the visibility from "Public (No Login Required)" to "Custom Policy".
Krenair removed a project: acl*security.
Krenair changed the visibility from "Custom Policy" to "Public (No Login Required)".
Krenair changed the edit policy from "All Users" to "Custom Policy".
Krenair changed Security from Software security bug to None.
Krenair added a subscriber: Nikita13311331.

After a quick look at the deployment-elastic0[5-8] instances I think we can remove one of them: the indices there are pretty small (except simplewiki) and most of the time they are configured with 2 replicas.
Removing one node and reducing the number of replicas to 1 might be doable. @Gehel, what do you think?
@Krenair can we wait til next week so that Erik comes back and we can have a discussion to actually decide what to do?

@Krenair can we wait til next week so that Erik comes back and we can have a discussion to actually decide what to do?

I'm using @AlexMonk-WMF on this ticket. Yes, of course.

Edit: I realised several days after writing this comment that I had touched the ticket under that other account to deal with a vandal... I decided to set a custom edit policy to prevent them from messing around with it too much again, but since my WMF account is unprivileged, I had to use the other one. So please ignore my other account on this ticket :)

Phabricator_maintenance renamed this task from Consolidate, remove, and/or downsize Beta Cluster instances to help with [[wikitech:Purge_2016]] (tracking) to Consolidate, remove, and/or downsize Beta Cluster instances to help with [[wikitech:Purge_2016]]. Aug 14 2016, 12:23 AM
Krenair changed the edit policy from "Custom Policy" to "All Users". Aug 14 2016, 12:27 AM

We now have a quota allowing 39 more instances, 28 VCPUs, about 71GB of RAM, a couple of floating IPs, and 4 security groups. So we currently have some room to grow where necessary, but T145611 is only temporary, and T145636 asked for 8GB more RAM than the resulting instance actually uses. So I would expect this to go down to 12 VCPUs and 31GB of RAM spare.

@demon, @dcausse, @Gehel, what do you think about the elastic instances?

@demon, @dcausse, @Gehel, what do you think about the elastic instances?

I haven't touched them since setting them up ages ago. If we can drop one or downsize them that's fine by me.

@AlexMonk-WMF I think we can remove one, I'll start to update elastic config for this cluster (reduce replica count).

Going down to two nodes is doable, but I know that @Gehel has some concerns with a 2-node setup.
Downsizing sounds difficult: these machines are already running with 8GB of RAM, and 4GB might not be enough.
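
As a sketch, dropping from 2 replicas to 1 across the beta indices is a single Elasticsearch settings call along these lines; the host/port are placeholders, and in practice the change would presumably be made through the cluster's configuration templates rather than by hand.

  # Reduce every index to 1 replica so the cluster can stay green with one
  # node fewer, then check the result.
  curl -XPUT 'http://localhost:9200/_all/_settings' -d '
  {
    "index": { "number_of_replicas": 1 }
  }'
  curl -s 'http://localhost:9200/_cat/indices?v'   # verify replica counts and index sizes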

Created T147777 to track the work on decommissioning deployment-elastic08.

As @dcausse guessed, I'm not too keen on going below 3 nodes in an elasticsearch cluster. But I'll follow up on disabling deployment-elastic08 in the meantime.
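
Decommissioning a node cleanly usually means draining its shards first; a minimal sketch using the standard allocation-exclusion setting (placeholder host again, and assuming the node name matches the instance name):

  # Ask Elasticsearch to move all shards off deployment-elastic08, then wait
  # until nothing is left on it before shutting the instance down.
  curl -XPUT 'http://localhost:9200/_cluster/settings' -d '
  {
    "transient": {
      "cluster.routing.allocation.exclude._name": "deployment-elastic08"
    }
  }'
  curl -s 'http://localhost:9200/_cat/shards' | grep deployment-elastic08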

deployment-elastic08 is now dead (see T147777).

Mentioned in SAL (#wikimedia-labs) [2016-10-24T14:51:26Z] <Krenair> T142288: Shut off -pdf02 and -conftool

So the 2016 purge came and went. Shall we close this now?

bd808 assigned this task to Krenair.
bd808 added a subscriber: Krenair.

So the 2016 purge came and went. Shall we close this now?

Seems reasonable to me.