atop on stretch overloading a host
Closed, ResolvedPublic


db1114 (stretch s1 API slave) has been suffering mysterious connection errors, observed in logstash and grafana (T191996). The investigation that followed established the following:

After a lot of investigation on both the HW side and the MySQL side, atop was found to be causing the issue: the times at which the CPUs were overloaded matched the times of the errors (T191996#4139494).
Stopping atop made the errors disappear. Starting it again brought them back, and stopping it made them go away again, so the cause was clear.

db1114 is an s1 API slave along with db1060 and db1080. Errors were _only_ happening on db1114.

db1114: stretch - atop version 2.2.6
db1060: jessie - atop version 1.26
db1080: jessie - atop version 1.26

This is how atop runs (T191996#4139494):

/usr/bin/atop -a -R -w /var/log/atop/atop_20180418 600

We don't have any other slaves running stretch with the same amount of load as db1114, so we have not observed elsewhere the connection errors that triggered the investigation on db1114.
This might be affecting other stretch hosts, so it is worth a proper, more general investigation across the fleet.
For now, atop on db1114 has been stopped.
All the investigation done so far is on T191996.

Proposal: Either drop the package from all production hosts or reconfigure it without -R

Event Timeline

If I had to guess, I would say it is the combination of the stretch version + high load (whether it is network, CPU or IO, I cannot say). I think the enwiki API hosts have lots of ongoing connections/traffic. We should ask Traffic if they have any large-traffic servers on stretch.

When I look at our LVS hosts (which are mixed jessie+stretch currently), the jessie ones show atop processes like:

root     26337     1  0 00:00 ?        00:00:04 /usr/bin/atop -a -w /var/log/atop/atop_20180419 600

But the stretches show:

root       737     1  0 08:18 ?        00:00:01 /usr/sbin/atopacctd
root       836     1  0 08:18 ?        00:00:02 /usr/bin/atop -a -R -w /var/log/atop/atop_20180419 600

Note that the atop manpage mentions the -R flag in this chunk of text:

PSIZE    The proportional memory size of this process (or user).
         Every  process  shares  resident  memory with other processes. E.g. when a particular program is started several times, the code pages (text) are only loaded once in memory and shared by all incarnations. Also  the  code  of  shared  libraries  is shared  by all processes using that shared library, as well as shared memory and memory-mapped files.  For the PSIZE calculation of a process, the resident memory of a process that is shared with other processes is divided by the number of sharers.   This means, that every process is accounted for a proportional part of that memory. Accumulating the PSIZE values of all processes in the system gives a reliable impression of the total resident memory consumed by all processes.
         Since gathering of all values that are needed to calculate the PSIZE is a relatively time-consuming task, the 'R'  key  (or '-R' flag) should be active. Gathering these values also requires superuser privileges (otherwise '?K' is shown in the output).
         If a process has finished during the last interval, no value is shown since the proportional memory size is not part of the standard process accounting record.

So it sounds like -R is a known performance-killer, maybe worse in some scenarios than others?
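
To make the cost described in the manpage excerpt concrete, here is a minimal sketch of a per-process proportional-size calculation. It assumes the values are taken from the kernel's /proc/<pid>/smaps file, which reports one Pss figure per memory mapping; the manpage text above does not say where atop reads them from, so treat that as an assumption. The function names and the sample text are hypothetical.

```python
import re

# Matches lines such as "Pss:                  11 kB" in smaps-formatted text.
# Anchoring at line start skips related fields such as "SwapPss:" or "Pss_Anon:".
PSS_LINE = re.compile(r"^Pss:\s+(\d+)\s+kB$", re.MULTILINE)


def total_pss_kb(smaps_text: str) -> int:
    """Sum the per-mapping Pss values (in kB) found in smaps-formatted text."""
    return sum(int(kb) for kb in PSS_LINE.findall(smaps_text))


def process_pss_kb(pid: int) -> int:
    """Read /proc/<pid>/smaps on Linux; other users' processes usually need root."""
    with open(f"/proc/{pid}/smaps") as handle:
        return total_pss_kb(handle.read())


# Hypothetical two-mapping excerpt, trimmed to the relevant fields.
SAMPLE = """\
00400000-0040b000 r-xp 00000000 08:01 1234 /bin/example
Rss:                  44 kB
Pss:                  11 kB
7f0000000000-7f0000021000 rw-p 00000000 00:00 0
Rss:                 132 kB
Pss:                 132 kB
"""

print(total_pss_kb(SAMPLE))  # 143
```

Under that assumption, every mapping of every process has to be visited on each sample, which would fit the observation that the cost grows on hosts with many connections and large memory footprints.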

@BBlack in the case of db1114, atop was normally running without causing any issues, but every 10 minutes it would spike for 2-3 seconds, driving many of the cores to 100% (T191996#4139494)
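
One way to check this kind of correlation is to measure how many error timestamps fall within a few seconds of an atop sample. The sketch below assumes samples happen every 600 seconds counted from a known first sample time; the function names and the example timestamps are made up for illustration and are not data from db1114.

```python
def seconds_since_sample(event_ts: int, first_sample_ts: int, interval: int = 600) -> int:
    """Seconds elapsed between the most recent sample and the event."""
    return (event_ts - first_sample_ts) % interval


def fraction_near_samples(event_times, first_sample_ts: int,
                          interval: int = 600, window: int = 3) -> float:
    """Share of events that occur within `window` seconds after a sample."""
    if not event_times:
        return 0.0
    hits = sum(
        1 for ts in event_times
        if seconds_since_sample(ts, first_sample_ts, interval) <= window
    )
    return hits / len(event_times)


# Illustrative timestamps in seconds: three land right after a sample, one does not.
errors = [1, 601, 1202, 1500]
print(fraction_near_samples(errors, first_sample_ts=0))  # 0.75
```

A fraction close to 1.0 would support the idea that the errors follow the sampling interval; a value near window/interval would suggest no relationship.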

Mentioned in SAL (#wikimedia-operations) [2018-04-19T14:30:09Z] <marostegui> Star atop on db1114 without "-R" - T192551

So atop is now running on db1114 like:

root     30566  0.0  0.0  24712  7780 ?        S<Ls 14:29   0:00 /usr/bin/atop -a -w /var/log/atop/atop_20180419 600

I will report back with the results after a few hours.

Mentioned in SAL (#wikimedia-operations) [2018-04-20T05:31:32Z] <marostegui> Start atop on db1114 with "-R" option enabled - T192551

No errors running atop without "-R".
I have just started it with "-R" to see if errors start showing up.

As soon as it was started there was a spike of errors. So it looks like -R is the offender here.

I have left atop started without -R and will leave it like that for the weekend.

root      1151  0.0  0.0  24328  7400 ?        S<Ls 06:11   0:00 /usr/bin/atop -a -w /var/log/atop/atop_20180420 600

I guess we need to either:

  • Decide whether we want to keep atop as part of standard_packages, or whether we don't really use it.
  • If we want to keep it, I guess we'd need to either modify the package ourselves and remove "-R" from /usr/share/atop/atop.daily, or manage that file with puppet and make sure "-R" is removed and atop is started without it.
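
As a rough illustration of the second option, the sketch below drops a standalone "-R" token from an atop command line. It is only a sketch: drop_flag is a hypothetical helper, the input is the command line quoted earlier in this task rather than the actual contents of /usr/share/atop/atop.daily, and combined flags such as "-aR" are not handled.

```python
import shlex


def drop_flag(command: str, flag: str = "-R") -> str:
    """Return the command line with every standalone occurrence of `flag` removed."""
    tokens = shlex.split(command)
    return shlex.join([tok for tok in tokens if tok != flag])


before = "/usr/bin/atop -a -R -w /var/log/atop/atop_20180418 600"
print(drop_flag(before))
# /usr/bin/atop -a -w /var/log/atop/atop_20180418 600
```

Splitting on shell tokens rather than doing a plain string replace avoids touching paths or arguments that merely contain "-R".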

Marostegui triaged this task as Medium priority. Apr 20 2018, 6:17 AM

atop has been running without -R for the whole weekend and this has caused no errors: the last error shown there was before I killed atop and started it again without -R.
So it is now confirmed that -R caused this. Running atop without -R causes no issues.

We should decide whether to drop the package from all production hosts (because it is obsolete thanks to the more granular prometheus) or configure it to run without -R. Can people say when they last used atop and what for, and give reasons to do one or the other?

Me, personally, never used it.
+1 to drop it from my side

I use it from time to time, though much more often I inspect the current atop logs. Never use -R though.

Personally never used, +1 to drop it.

Is there any data that atop provides that is not already available in grafana?

Alternatively, since only the cron.d/daemon is problematic, we could keep the package (*if* it is useful interactively for some) and drop the crontab file/systemd unit.

+1, probably the conservative approach for now would be to have puppet disable the systemd unit and remove the cron.daily file (on jessie as well? seems a waste if we expect it's not used).

+1 to remove atop as a daemon/cron, possibly the package altogether too

I haven't used atop at all so far, +1 to either removing it entirely or dropping service/cron instead (but let's still report this to Debian)

+1 to remove the daemon/cron, keeping the package itself.

+1 to remove the daemon/cron, keep the package iff strictly necessary for somebody

Change 428571 had a related patch set uploaded (by Jcrespo; owner: Jcrespo):
[operations/puppet@production] base: Add disable atop functionality and test it on dbstore hosts

Change 428571 merged by Jcrespo:
[operations/puppet@production] base: Add disable atop functionality and test it on dbstore hosts

Change 428574 had a related patch set uploaded (by Jcrespo; owner: Jcrespo):
[operations/puppet@production] atop: Disable atop on core&dbstore roles to test jessie/trusty

Change 428574 merged by Jcrespo:
[operations/puppet@production] atop: Disable atop on core&dbstore roles to test jessie/trusty

Change 428579 had a related patch set uploaded (by Jcrespo; owner: Jcrespo):
[operations/puppet@production] base: Disable atop daemon everywhere

Change 428579 merged by Jcrespo:
[operations/puppet@production] base: Disable atop daemon everywhere

With this deployed, we could close this as resolved, and re-evaluate later whether to drop the package entirely or to re-enable it if the upstream bug gets resolved.

Marostegui assigned this task to jcrespo.

For easy access:

Bug submitted to Debian:
Bug submitted to upstream:

My two cents:

  • I don't see this hiera knob used anywhere in the tree right now; has anyone expressed interest in using it in its current state, especially when the stability of the system is potentially at risk? I personally doubt it'll be very useful and it's yet another thing that we'll have parameterized (in the humongous base class with dozens of parameters no less). As a general rule, I think we should be avoiding adding hiera knobs unless there's a very good reason for it (including at least an existing user in the tree!) and rely on sane defaults and/or other properties of the systems via facter.
  • Right now setting profile::base::atop_enabled will still result in different results in jessie and stretch hosts given the -R difference upstream, so this still comes with the potential minefield that resulted in this task. I can e.g. imagine a new hire that isn't aware of this task enabling this knob in a year on a jessie host, then a month later reimaging the host as stretch and scratching their heads :)
  • Installing the atop package while disabling the cron job isn't going to be particularly useful: atop's value proposition is its recording function; tools like top and htop are at least equally good or superior in the runtime/realtime stuff.

I propose instead:

  1. keep atop installed but override the cron job on debian >= stretch (and not with a hiera knob) to remove the -R that is new and problematic (no one has come to rely on it anyway!);
  2. remove the atop package entirely from the codebase;
  3. fold base::disable_atop (the Service/Cron disable) to base and do it unconditionally.

These are in descending order of preference to me, but I'd be OK with any of them if consensus is different. I volunteer to submit the patches for it, as soon as we have consensus on which way to go. Thoughts?

I would prefer option #2 (remove atop).
My reasoning is that we now have to remove "-R" from it, but what could happen in the future? Maybe that option will change, or maybe a new one could cause similar outages.
My point is that we'd need to keep an eye on atop's evolution and adapt our puppet code to it, and it looks like, at the end of the day, no one really uses atop.

I tried to make the change with the least impact regarding atop, and to offer a way to enable it for anyone who might complain. If I was the one to decide, I would personally remove it from everywhere, too, but I cannot rule out someone coming a few weeks later to say that his/her workflow is broken because atop is no longer working.

In other words, your proposal is even easier to do than my patch, but I didn't have the weight/consensus/will/determination to do it/decide for everybody. Note the issue was not a full breakage, and the suspicion is that it will only impact a subset of large servers.

In particular, I predict people with cloud instances complaining (using our puppet code). That was my main fear.

If I was the one to decide, I would personally remove it from everywhere, too.

FWIW, this has my +1.

Change 428930 had a related patch set uploaded (by Jcrespo; owner: Jcrespo):
[operations/puppet@production] standard_packages: Remove atop for every WMF machine

Created T192551, because as I said, the problem was not technical.

+1 to remove the daemon/cron, keeping the package itself.

I only said to keep the package out of a motivation similar to the one Jaime described above: to give somebody the option to use it if they really want (maybe in cloud).

That doesn't mean that I use it personally; I am totally fine with removing it entirely.

Change 428930 merged by Faidon Liambotis:
[operations/puppet@production] standard_packages: Remove atop from every WMF machine