Revert: Request increased quota for Phlogiston labs project
Closed, ResolvedPublic

Description

Project Name: Phlogiston
Type of quota increase requested: 4 CPU/8G RAM (permanent), 1 new floating IP
Reason: Need to replace phlogiston-1, which is under-powered.

Plan:

  1. Create phlogiston-3 as m1.large (4 CPU/8G RAM), phlogiston-3.wmflabs.org.
  2. Set up phlogiston-3 as the production system and run reports.
  3. Reassign phlogiston.wmflabs.org from phlogiston-1 to phlogiston-3.
  4. Discard phlogiston-1; its 1 CPU/2G RAM/1 IP will no longer be needed.
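
The quota effect of this plan can be sketched as simple arithmetic. The snippet below is only an illustration: the `Flavor` type and `quota_delta` helper are hypothetical names, and the sizes are the ones stated in steps 1 and 4. It shows that the full 4 CPU/8G is only needed while both instances exist; once phlogiston-1 is discarded the net increase is smaller.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Flavor:
    """CPU and RAM footprint of one instance (hypothetical helper type)."""
    cpus: int
    ram_gb: int


# Sizes taken from the plan above; nothing here queries the real quota system.
OLD = Flavor(cpus=1, ram_gb=2)  # phlogiston-1, per step 4
NEW = Flavor(cpus=4, ram_gb=8)  # phlogiston-3 as m1.large, per step 1


def quota_delta(old: Flavor, new: Flavor) -> tuple[Flavor, Flavor]:
    """Return (extra quota while both instances exist, net extra after old is deleted)."""
    during_swap = Flavor(new.cpus, new.ram_gb)
    after_cleanup = Flavor(new.cpus - old.cpus, new.ram_gb - old.ram_gb)
    return during_swap, after_cleanup


during_swap, after_cleanup = quota_delta(OLD, NEW)
print(f"during swap:   +{during_swap.cpus} CPU / +{during_swap.ram_gb}G")
print(f"after cleanup: +{after_cleanup.cpus} CPU / +{after_cleanup.ram_gb}G")
```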

Event Timeline

Restricted Application added a subscriber: Aklapper. Aug 15 2016, 6:08 PM
JAufrecht renamed this task from Request increased quota for <Replace Me> labs project to Request increased quota for Phlogiston labs project. Aug 15 2016, 6:08 PM
JAufrecht updated the task description.

There are currently no floating IPs assigned to this project. What is this one going to be used for?

We can do a temp bump for the shuffle for sure, but none of the metrics I can see explain why a bigger instance is needed.

https://tools.wmflabs.org/nagf/?project=phlogiston

CPU is predominantly idle

https://graphite-labs.wikimedia.org/render/?width=978&height=475&_salt=1471465355.643&target=cactiStyle(phlogiston.phlogiston-1.cpu.total.*)

https://graphite-labs.wikimedia.org/render/?width=978&height=475&_salt=1471465525.781&target=phlogiston.phlogiston-1.cpu.total.system&target=phlogiston.phlogiston-1.cpu.total.user&target=phlogiston.phlogiston-1.cpu.total.guest&target=phlogiston.phlogiston-1.cpu.total.iowait&target=phlogiston.phlogiston-1.cpu.total.irq&target=phlogiston.phlogiston-1.cpu.total.nice&from=00%3A00_20160812&until=23%3A59_20160817&areaMode=stacked&lineMode=connected&lineWidth=2

Memory

https://graphite-labs.wikimedia.org/render/?width=978&height=475&_salt=1471465931.738&from=00%3A00_20160812&until=23%3A59_20160817&lineWidth=2&areaMode=stacked&target=phlogiston.phlogiston-1.memory.VmallocUsed&target=phlogiston.phlogiston-1.memory.Cached&target=phlogiston.phlogiston-1.memory.Buffers&target=phlogiston.phlogiston-1.memory.MemFree

Disk is ok

https://graphite-labs.wikimedia.org/render/?width=978&height=475&_salt=1471465986.439&from=00%3A00_20160812&until=23%3A59_20160817&lineWidth=2&areaMode=stacked&target=phlogiston.phlogiston-1.diskspace.root.byte_avail&target=phlogiston.phlogiston-1.diskspace.root.byte_free

Can you help me understand the contention issue with the current instance?

There are two current instances: phlog-01, which is phlogiston.wmflabs.org, and phlog-02, which is phlogiston-dev.wmflabs.org. Phlog-02 is fine, and could be scaled back from an xlarge to a large instance. Phlog-01 worked for a number of months, but got very unstable in the last few months, and since the m3 move it has crashed hard (to the point where it can't be rebooted from the wikitech web page) each time I've started a processing run. So, reason 1: I suspect that exhausting RAM may be a factor.

Reason 2, the data generation runs do go substantially faster on phlog-02 than on phlog-01. The charts suggest phlog-01 isn't being taxed, but that's probably because I disabled production in phlog-01 weeks ago so it would stop crashing.

So I'd like to replace the xsmall phlogiston.wmflabs.org with a large one. I don't think I need much more disk, and I guess I don't need a floating IP address; I just need to be able to point phlogiston.wmflabs.org to phlog-3.

So the request is for enough quota for one large instance while keeping phlog-02 around, but as a temp bump in some fashion as the resources from phlog-01 will be recycled once the new large phlog-03 is going.

OK, we can do that. But if you look at:

https://graphite-labs.wikimedia.org/render/?width=978&height=475&from=00%3A00_20160812&until=23%3A59_20160817&lineWidth=2&target=phlogiston.phlogiston-1.memory.MemFree&from=-10d

This looks very much like an issue w/ resource usage (memory leak) and not much like an issue with ongoing resource starvation.

A large instance will give a bit more breathing room, but I imagine it isn't a solution to this problem.
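
One rough way to put a number on "MemFree keeps going down" is to fit a straight line through the samples and look at the slope. The sketch below is only an illustration, not something that was run for this task: `memfree_slope` is a hypothetical helper, and it assumes the samples come as `[value, timestamp]` pairs, which is the shape Graphite's render API returns when asked for `format=json`. A steadily negative slope fits a leak, but it does not prove one, since a growing page cache lowers MemFree too.

```python
def memfree_slope(datapoints):
    """Least-squares slope of value over time, in value units per second.

    datapoints: iterable of [value, timestamp] pairs (the Graphite JSON shape).
    Samples whose value is None (gaps in the series) are skipped.
    """
    points = [(t, v) for v, t in datapoints if v is not None]
    if len(points) < 2:
        raise ValueError("need at least two non-null samples")

    n = len(points)
    mean_t = sum(t for t, _ in points) / n
    mean_v = sum(v for _, v in points) / n

    numerator = sum((t - mean_t) * (v - mean_v) for t, v in points)
    denominator = sum((t - mean_t) ** 2 for t, _ in points)
    if denominator == 0:
        raise ValueError("all samples share the same timestamp")
    return numerator / denominator


# Made-up samples for illustration: MemFree falling by 100 units per minute.
samples = [[1000, 0], [900, 60], [None, 120], [700, 180]]
print(memfree_slope(samples))
```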

Phlog-01 is not currently running Phlogiston code and hasn't in weeks, so the leak in the charts shouldn't be related to Phlogiston. Phlog-01 was stable until early June (T137736), and the beginning of instability seems to correspond more to problems in Labs than in changes to the Phlogiston scripts, so it wouldn't seem like Phlogiston would be the first place to look for the source of its instabilities.

Phlog-02 does have memory dropping after each Phlogiston run, so I'll start keeping an eye on that. I'm not sure if a Python script can permanently lose memory after it's done, or if this is related to PostgreSQL memory usage and isn't an actual leak (in the sense of RAM becoming practically unusable over time), or what.
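
A quick way to check the "not an actual leak" possibility is to look at how much of the missing memory is sitting in Buffers and Cached, which Linux can mostly hand back when something needs it. The sketch below is a rough illustration under that assumption; `parse_meminfo` and `reclaimable_kb` are hypothetical helper names, and the sample numbers are made up. On the instance itself the text would come from `/proc/meminfo`.

```python
def parse_meminfo(text):
    """Parse /proc/meminfo-style text into a {field name: integer value} dict.

    Values are in kB for the lines that carry a "kB" unit.
    """
    fields = {}
    for line in text.splitlines():
        name, sep, rest = line.partition(":")
        if not sep:
            continue  # not a "Name: value" line
        parts = rest.split()
        if parts and parts[0].isdigit():
            fields[name.strip()] = int(parts[0])
    return fields


def reclaimable_kb(fields):
    """Rough estimate of memory the kernel could free: Buffers plus Cached."""
    return fields.get("Buffers", 0) + fields.get("Cached", 0)


# Made-up sample; on a real host use: text = open("/proc/meminfo").read()
sample = """MemTotal:        2048000 kB
MemFree:          100000 kB
Buffers:           50000 kB
Cached:           900000 kB
"""
info = parse_meminfo(sample)
print("MemFree kB:", info["MemFree"])
print("Buffers + Cached kB:", reclaimable_kb(info))
```

If MemFree is low but Buffers plus Cached is large, the drop after a run is more likely file cache (which PostgreSQL leans on heavily) than memory that is really gone.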

What is the current status of this request? Do you need more information from me to decide what to grant re: phlog-03? I would still like to replace phlog-01 rather than diagnose it because a) if there's even a 50% chance that the problem is Labs-related rather than Phlog-related, it would be much simpler to replace it than diagnose it. b) I'd like to make it a bit bigger than it was because it was pretty slow even when it worked.

Side questions: is there documentation describing either a) what account to use to log in to graphite-labs, or b) how to create those bookmarks to Graphite graphs?

chasemp changed the task status from Open to Stalled. Aug 22 2016, 9:12 PM

You should be gtg on making a new large. I'll convert this to a temp task w/ stalled status so we can circle back and clean up the phlogiston-01 quota. Hope it helps.

Re: the side questions: the graphite-labs login is your wikitech account, the same account you use in labs. I don't understand the second part of the question; I used the graphite console to put the graphs together. There is also https://grafana-labs.wikimedia.org/ for building dashboards.
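
In case the question was about how links like the ones above are put together: they are plain render URLs with one `target` per metric plus display options. Here is a minimal sketch of building one by hand. The host, metric name and option names are copied from the links earlier in this task; `render_url` is a hypothetical helper, not part of Graphite.

```python
from urllib.parse import urlencode

# Render endpoint as it appears in the links earlier in this task.
GRAPHITE_RENDER = "https://graphite-labs.wikimedia.org/render/"


def render_url(targets, options=None):
    """Build a Graphite render URL: one target= per metric, then the options.

    options is a dict because names like "from" are Python keywords and
    cannot be passed as keyword arguments.
    """
    query = [("target", target) for target in targets]
    query.extend((options or {}).items())
    return GRAPHITE_RENDER + "?" + urlencode(query)


url = render_url(
    ["phlogiston.phlogiston-1.memory.MemFree"],
    {"from": "00:00_20160812", "until": "23:59_20160817", "width": 978, "height": 475},
)
print(url)
```

The resulting URL can be bookmarked as-is.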

chasemp renamed this task from Request increased quota for Phlogiston labs project to Revert: Request increased quota for Phlogiston labs project. Aug 22 2016, 9:12 PM

I've created phlogiston-03, moved the DNS, and terminated phlogiston-01. Thanks.

chasemp closed this task as Resolved. Aug 23 2016, 9:50 PM

Let me know how it works out!