
atop on stretch overloading a host
Closed, ResolvedPublic

Description

db1114 (stretch s1 API slave) has been suffering mysterious connection errors, observed on logstash and grafana (T191996). This triggered an investigation, which identified the following facts:

After a lot of investigation on the HW side and on the MySQL side, it was found that atop was causing this issue, as the times of CPU overload matched the times of the errors: T191996#4139494
Stopping atop resulted in no more errors. Starting it again brought the errors back; stopping it made them go away. So it was clear.

db1114 is an s1 API slave along with db1060 and db1080. Errors were _only_ happening on db1114.

db1114: stretch - atop version 2.2.6
db1060: jessie - atop version 1.26
db1080: jessie - atop version 1.26

This is how atop runs (T191996#4139494):

/usr/bin/atop -a -R -w /var/log/atop/atop_20180418 600

We don't have any other slaves running stretch with the same amount of load as db1114, so we have not observed elsewhere the connection errors that triggered the investigation on db1114.
This might be affecting other stretch hosts, so it is worth a proper, more general fleet-wide investigation.
So for now, atop on db1114 has been stopped.
All the investigation done is on T191996

Proposal: Either drop the package from all production hosts or reconfigure it without -R

Event Timeline

If I had to guess, I would say it is the combination of the stretch version + high load (whether it is network, CPU or IO, I cannot say). I think the enwiki API hosts have lots of ongoing connections/traffic. We should ask Traffic if they have any large-traffic server with stretch.

When I look at our LVS hosts (which are mixed jessie+stretch currently), the jessie ones show atop processes like:

root     26337     1  0 00:00 ?        00:00:04 /usr/bin/atop -a -w /var/log/atop/atop_20180419 600

But the stretches show:

root       737     1  0 08:18 ?        00:00:01 /usr/sbin/atopacctd
root       836     1  0 08:18 ?        00:00:02 /usr/bin/atop -a -R -w /var/log/atop/atop_20180419 600

Note that the atop manpage mentions the -R flag in this chunk of text:

PSIZE    The proportional memory size of this process (or user).
         Every process shares resident memory with other processes. E.g. when a particular program is started several times, the code pages (text) are only loaded once in memory and shared by all incarnations. Also the code of shared libraries is shared by all processes using that shared library, as well as shared memory and memory-mapped files. For the PSIZE calculation of a process, the resident memory of a process that is shared with other processes is divided by the number of sharers. This means, that every process is accounted for a proportional part of that memory. Accumulating the PSIZE values of all processes in the system gives a reliable impression of the total resident memory consumed by all processes.
         Since gathering of all values that are needed to calculate the PSIZE is a relatively time-consuming task, the 'R' key (or '-R' flag) should be active. Gathering these values also requires superuser privileges (otherwise '?K' is shown in the output).
         If a process has finished during the last interval, no value is shown since the proportional memory size is not part of the standard process accounting record.

So it sounds like -R is a known performance-killer, maybe worse in some scenarios than others?
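To check other hosts for the same difference, something like the following could flag ps lines where atop runs with -R. This is only a sketch: the function name is made up for illustration, and it only catches -R passed as a separate argument (not combined forms such as -aR).

```python
import shlex


def atop_uses_flag(ps_line: str, flag: str = "-R") -> bool:
    """Return True if a ps output line shows atop running with `flag`."""
    tokens = shlex.split(ps_line)
    for i, tok in enumerate(tokens):
        # Match the atop binary itself, not helpers such as atopacctd.
        if tok == "atop" or tok.endswith("/atop"):
            # Only look at the arguments that follow the executable.
            return flag in tokens[i + 1:]
    return False


# The two ps lines quoted above from the LVS hosts.
jessie = "root     26337     1  0 00:00 ?        00:00:04 /usr/bin/atop -a -w /var/log/atop/atop_20180419 600"
stretch = "root       836     1  0 08:18 ?        00:00:02 /usr/bin/atop -a -R -w /var/log/atop/atop_20180419 600"

for name, line in (("jessie", jessie), ("stretch", stretch)):
    print(name, atop_uses_flag(line))
```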

@BBlack in the case of db1114, atop was normally running without causing any issues, but every 10 minutes it would spike for about 2-3 seconds, pushing many of the cores to 100% (T191996#4139494)
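A rough way to test this kind of correlation is to measure how many error timestamps fall within a few seconds of an atop sample. The sketch below assumes samples happen every 600 seconds counted from a known start time; the function name, the 3-second window and the timestamps are all made up for illustration and are not data from db1114.

```python
def fraction_near_samples(error_times, interval=600, start=0, window=3):
    """Fraction of error timestamps (in seconds) that fall within
    `window` seconds after an atop sample taken every `interval`
    seconds, counting from `start`."""
    if not error_times:
        return 0.0
    hits = sum(1 for t in error_times if (t - start) % interval <= window)
    return hits / len(error_times)


# Hypothetical error timestamps, in seconds since atop was started.
errors = [600, 601, 1202, 1800, 2500]
print(fraction_near_samples(errors))
```

A fraction close to 1 would point at the periodic sampling; errors spread evenly over time would give a value near (window + 1) / interval instead.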

Mentioned in SAL (#wikimedia-operations) [2018-04-19T14:30:09Z] <marostegui> Star atop on db1114 without "-R" - T192551

So atop is now running on db1114 like:

root     30566  0.0  0.0  24712  7780 ?        S<Ls 14:29   0:00 /usr/bin/atop -a -w /var/log/atop/atop_20180419 600

I will report back with the results after a few hours.

Mentioned in SAL (#wikimedia-operations) [2018-04-20T05:31:32Z] <marostegui> Start atop on db1114 with "-R" option enabled - T192551

No errors running atop without "-R".
I have just started it with "-R" to see if errors start showing up.

As soon as it was started there was a spike of errors. So looks like -R is the offender here.

screenshot-logstash.wikimedia.org-2018.04.20-07-54-34.png (169×1 px, 22 KB)

I have left atop started without -R and will leave it like that for the weekend.

root      1151  0.0  0.0  24328  7400 ?        S<Ls 06:11   0:00 /usr/bin/atop -a -w /var/log/atop/atop_20180420 600

I guess we need to either:

  • Decide whether we want to keep atop as part of standard_packages or we don't really use it.
  • If we want to keep it, I guess we'd need to modify the package ourselves and remove "-R" from /usr/share/atop/atop.daily or manage that file with puppet and make sure it is removed and started without -R
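As an illustration of the second option, the rewrite itself amounts to dropping one argument from the command line. This is a sketch only: it operates on a plain command string like the one seen in ps, the function name is hypothetical, and the real /usr/share/atop/atop.daily is a shell script whose exact contents are not shown in this task, so it would need its own handling.

```python
import shlex


def drop_flag(command: str, flag: str = "-R") -> str:
    """Return `command` with every standalone occurrence of `flag` removed."""
    tokens = shlex.split(command)
    kept = [tok for tok in tokens if tok != flag]
    # shlex.join (Python 3.8+) re-quotes arguments only where needed.
    return shlex.join(kept)


# Command line as reported on db1114.
original = "/usr/bin/atop -a -R -w /var/log/atop/atop_20180418 600"
print(drop_flag(original))
```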
Marostegui triaged this task as Medium priority. Apr 20 2018, 6:17 AM

atop has been running without -R for the whole weekend and this has caused no errors: https://logstash.wikimedia.org/goto/6c9cfe4615f0538d8e633d299609e7e0 (the last error shown there was before I killed atop and started it again without -R).
So it is now confirmed that -R caused this. Running atop without -R causes no issues.

We should decide whether to drop the package from all production hosts (because it is obsolete thanks to the more granular prometheus) or to configure it to run without -R. Can people say when they last used atop and what for, and give reasons to do one or the other?

Me, personally, never used it.
+1 to drop it from my side

I use it from time to time, though much more often I inspect the current atop logs. Never use -R though.

Personally never used, +1 to drop it.

Is there any data that atop provides that is not already available in grafana?

Alternatively, only the cron.d/daemon is problematic- we could keep the package (*if* it is useful interactively for some) and drop the crontab file/systemd unit.

+1, probably the conservative approach for now would be to have puppet disable the systemd unit and remove the cron.daily file (on jessie as well? seems a waste if we expect it's not used).

+1 to remove atop as a daemon/cron, possibly the package altogether too

I haven't used atop at all so far, +1 to either removing it entirely or dropping service/cron instead (but let's still report this to Debian)

+1 to remove the daemon/cron, keeping the package itself.

+1 to remove the daemon/cron, keep the package iff strictly necessary for somebody

Change 428571 had a related patch set uploaded (by Jcrespo; owner: Jcrespo):
[operations/puppet@production] base: Add disable atop functionality and test it on dbtore hosts

https://gerrit.wikimedia.org/r/428571

Change 428571 merged by Jcrespo:
[operations/puppet@production] base: Add disable atop functionality and test it on dbstore hosts

https://gerrit.wikimedia.org/r/428571

Change 428574 had a related patch set uploaded (by Jcrespo; owner: Jcrespo):
[operations/puppet@production] atop: Disable atop on core&dbstore roles to test jessie/trusty

https://gerrit.wikimedia.org/r/428574

Change 428574 merged by Jcrespo:
[operations/puppet@production] atop: Disable atop on core&dbstore roles to test jessie/trusty

https://gerrit.wikimedia.org/r/428574

Change 428579 had a related patch set uploaded (by Jcrespo; owner: Jcrespo):
[operations/puppet@production] base: Disable atop daemon everywhere

https://gerrit.wikimedia.org/r/428579

Change 428579 merged by Jcrespo:
[operations/puppet@production] base: Disable atop daemon everywhere

https://gerrit.wikimedia.org/r/428579

With https://gerrit.wikimedia.org/r/428579 deployed, we could close this as resolved, and reevaluate later whether to drop the package entirely or to reenable it if the upstream bug gets resolved.

Marostegui assigned this task to jcrespo.

For easy access:

Bug submitted to Debian: https://bugs.debian.org/cgi-bin/bugreport.cgi?bug=896767
Bug submitted to upstream: https://github.com/Atoptool/atop/issues/27

My two cents:

  • I don't see this hiera knob used anywhere in the tree right now; has anyone expressed interest in using it in its current state, especially when the stability of the system is potentially at risk? I personally doubt it'll be very useful and it's yet another thing that we'll have parameterized (in the humongous base class with dozens of parameters no less). As a general rule, I think we should be avoiding adding hiera knobs unless there's a very good reason for it (including at least an existing user in the tree!) and rely on sane defaults and/or other properties of the systems via facter.
  • Right now setting profile::base::atop_enabled will still result in different results in jessie and stretch hosts given the -R difference upstream, so this still comes with the potential minefield that resulted in this task. I can e.g. imagine a new hire that isn't aware of this task enabling this knob in a year on a jessie host, then a month later reimaging the host as stretch and scratching their heads :)
  • Installing the atop package while disabling the cron job isn't going to be particularly useful: atop's value proposition is its recording function; tools like top and htop are at least equally good or superior in the runtime/realtime stuff.

I propose instead:

  1. keep atop installed but override the cron job on debian >= stretch (and not with a hiera knob) to remove the -R that is new and problematic (no one has come to rely on it anyway!);
  2. remove the atop package entirely from the codebase;
  3. fold base::disable_atop (the Service/Cron disable) to base and do it unconditionally.

These are in descending order of preference to me, but I'd be OK with any of those if consensus is different. I volunteer to submit the patches for it, as soon as we have consensus on which way to go. Thoughts?

I would prefer option #2 (remove atop).
My reasoning for it is that we now have to remove "-R" from it, but what could happen in the future? Maybe that option will change, maybe another new one could potentially cause similar outages.
My point is that we'd need to keep an eye on atop's evolution and adapt our puppet code to it, and it looks like, at the end of the day, no one really uses atop.

I tried to have the least possible impact regarding atop, and to offer a way to enable it for anyone who might complain about it. If I was the one to decide, I would personally remove it from everywhere, too, but I cannot rule out someone coming a few weeks later saying that their workflow is broken because atop is no longer working.

In other words, your proposal is even easier to do than my patch, but I didn't have the weight/consensus/will/determination to do it/decide for everybody. Note the issue was not a full breakage, and the suspicion is that it will only impact a subset of large servers.

In particular, I predict people with cloud instances complaining (using our puppet code). That was my main fear.

If I was the one to decide, I would personally remove it from everywhere, too.

FWIW, this has my +1.

Change 428930 had a related patch set uploaded (by Jcrespo; owner: Jcrespo):
[operations/puppet@production] standard_packages: Remove atop for every WMF machine

https://gerrit.wikimedia.org/r/428930

Created T192551, because as I said, the problem was not technical.

+1 to remove the daemon/cron, keeping the package itself.

I only said to keep the package out of a similar motivation to the one Jaime described above, to give somebody the option to use it if they really want (maybe in cloud).

That doesn't mean that I use it personally; I am totally fine with removing it entirely.

Change 428930 merged by Faidon Liambotis:
[operations/puppet@production] standard_packages: Remove atop from every WMF machine

https://gerrit.wikimedia.org/r/428930

Five years on, I'm looking forward to suggesting that we add atop back to the fleet (or at least parts of it), now that the -R option has been removed by default. :-)
The option has been removed by default in both the upstream version and then subsequently by the Debian package maintainer.

Ref: https://bugs.debian.org/cgi-bin/bugreport.cgi?bug=896767#75 and https://github.com/Atoptool/atop/issues/27#issuecomment-1094328919
These changes should arrive with bookworm, all being well. https://tracker.debian.org/pkg/atop

Personally, I really like the default 10 minute snapshots of the process table for 30 days that atop creates and I've found them to be extremely useful in the past for tracking down the causes of lock-ups etc.
For something like the analytics Hadoop cluster, which runs ad-hoc user workload as well as production data pipelines with complex schedules, I would argue that this retrospective process state logging feature is a great feature and well worth having.

@BTullis While your suggestions seem reasonable, please note that the main reason why that was removed was the unwillingness of upstream to support our use cases. I personally have no reservations, but I think you will understand it would have to be well justified to enable it back in.

In particular, atop looks to me like an "old school" tool, and I would like to see observability weigh in on a tool better tuned to our needs and stack (e.g. prometheus-based monitoring or something else with standardized logging), and they could suggest a more modern (and well supported) alternative.

In any case I would open a new ticket with your specific needs, rather than a specific solution so the need can be solved for everyone, and link to this one, leaving this important bug/outage as is for future reference.

@BTullis, @jcrespo coming in late to this thread to update you that we've scheduled to tackle T108027: Collect per-cgroup cpu/mem and other system level metrics next quarter (q4), which I think would address this issue. Feel free to reach out if you'd like more info.