
Puppet failures on trusty due to libmonitoring-plugin-perl
Closed, Resolved · Public

Description

https://gerrit.wikimedia.org/r/#/c/operations/puppet/+/467011/ introduces a dependency on the 'libmonitoring-plugin-perl' package which is not available on Trusty. That change needs some distro switches or something.

Info: Applying configuration version '1539869005'
Error: Execution of '/usr/bin/apt-get -q -y -o DPkg::Options::=--force-confold install libmonitoring-plugin-perl' returned 100: Reading package lists...
Building dependency tree...
Reading state information...
E: Unable to locate package libmonitoring-plugin-perl
Error: /Stage[main]/Packages::Libmonitoring_plugin_perl/Package[libmonitoring-plugin-perl]/ensure: change from purged to present failed: Execution of '/usr/bin/apt-get -q -y -o DPkg::Options::=--force-confold install libmonitoring-plugin-perl' returned 100: Reading package lists...
Building dependency tree...
Reading state information...
E: Unable to locate package libmonitoring-plugin-perl
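
One way to confirm this failure mode on an affected host is to ask apt whether it has any install candidate for the package. The following is a minimal sketch and not part of the original report: the helper name `candidate_version` is hypothetical, and it only parses the `Candidate:` line printed by `apt-cache policy`.

```shell
#!/bin/sh
# Hypothetical helper: read `apt-cache policy <pkg>` output on stdin and
# print the candidate version, or nothing if apt has no candidate.
candidate_version() {
    awk '$1 == "Candidate:" && $2 != "(none)" { print $2 }'
}

# Usage on a real host (assumes apt-cache is installed):
#   apt-cache policy libmonitoring-plugin-perl | candidate_version
# Empty output means apt cannot install the package from the configured sources.
```

On a Trusty host without a backport, the helper would be expected to print nothing, which matches the "Unable to locate package" error above.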

Event Timeline

Andrew created this task. · Oct 18 2018, 1:30 PM
Restricted Application added a subscriber: Aklapper. · Oct 18 2018, 1:30 PM
Dzahn added a subscriber: faidon. · Oct 18 2018, 1:43 PM
Dzahn added a comment. · Oct 18 2018, 1:55 PM

Implemented as suggested on https://gerrit.wikimedia.org/r/#/c/operations/puppet/+/466951/ and https://gerrit.wikimedia.org/r/#/c/operations/puppet/+/467011/

What host is the error on? I guess shinken-01? In that case it should also be resolved by T204562. And it seems a new shinken host has already been created.

Of course I can add a trusty check nevertheless.

shinken-01 is still active and needs to work for now. We're hoping to rebuild it but that's a work in progress.

Modifying our checks to support both Nagios::Plugin and Monitoring::Plugin is very messy, and we elected not to do so for the transition in prod. Adding these conditionals for what is basically now a 4½-year-old distro used in a VPS is not something we should do.

Example alternatives are:

  • The Shinken maintainers should backport libmonitoring-plugin-perl to trusty. It's a Perl library and should be fairly straightforward.
  • The Shinken VMs should be upgraded to a more modern distribution.
  • The Shinken VMs can be pointed at a puppetmaster running an older version of the puppet tree before those changes were made.

@faidon, I'm not sure I understand your response here. We have an agreed-upon date for the removal of Trusty dependencies, and we are working as fast as we can to hit that date. You seem to have mentally moved that date into the past, which doesn't seem very realistic.

Obviously when I have to patch breakages like this it takes me away from my primary work, all of which is in service of upgrading our stack.

That said, if you want to pitch in on https://phabricator.wikimedia.org/T204562, please do!

I think the alternatives are:

  • SRE holds off the upgrade of Icinga from jessie to stretch in production until Shinken maintainers get the chance to keep up. (I don't even know if said maintainers are you or other Foundation staff or volunteers.)
  • SRE introduces backwards compatibility for trusty in our code despite having no use for it ourselves or a way to test it, and thus pays the cost for doing so, and for maintaining it for the next ~6 months.
  • SRE backports packages and/or upgrades VMs that someone else maintains, and that we know little about (and which in this case use software we have never used and know little about).

Am I missing an alternative here? What is your suggestion/ask?

Shinken provides the only icinga-like monitoring process for Cloud VPS hosted instances. I know that it is used actively by the Toolforge and Beta Cluster projects to identify operational failures.

@GTirloni (staff) and @Krenair (volunteer) are actively attempting to migrate Shinken to Jessie in T204562. Currently they are trying to understand and resolve a problem with the poller. They also need the work from T41785 to be complete so that email alerts will work from the new host. I think that piece may be down to testing that service at this point.

I would be mostly pleased if we can find a short term solution that:

  • keeps shinken-01 (trusty) operational until shinken-02 (jessie) is fully ready to replace it
  • requires a minimum amount of work for the combined WMCS, volunteer, and SRE teams who are involved in the resolution
  • introduces the least amount of long-term technical debt
  • blocks the fewest tangentially related projects from moving forward (SRE upgrades, Trusty deprecation, etc.)

> • SRE holds off the upgrade of Icinga from jessie to stretch in production until Shinken maintainers get the chance to keep up. (I don't even know if said maintainers are you or other Foundation staff or volunteers.)
> • SRE introduces backwards compatibility for trusty in our code despite having no use for it ourselves or a way to test it, and thus pays the cost for doing so, and for maintaining it for the next ~6 months.

With both @GTirloni and @Krenair working on the blockers here I think we should have a workable solution in far less than 6 months. I hesitate to put down a timeline for the work as a promise, but maybe they can chime in with an educated guess?

> • SRE backports packages and/or upgrades VMs that someone else maintains, and that we know little about (and which in this case use software we have never used and know little about).

The use of Shinken here is for "hysterical raisins" and something that the cloud-services-team wants to get rid of in the future. This is currently lower priority than Trusty deprecation and Neutron adoption, and thus not resourced until we are certain that those projects are ready to free someone up to work on a next-gen monitoring replacement. We had also been putting it off to see what, if any, major change happens in the production monitoring space, in the hope that we could do one replacement rather than two and become more aligned with production.

Proposal: Would a 30 day window of backwards compatibility for trusty in the Puppet manifests be heavily impactful for the SRE team? If we can't fix shinken-02 by then I think that signals that there are deeper issues with shinken as a product and we should accelerate our long term desire to abandon it and also unblock SRE's desired changes.

Fully agreed on all of your points and desirables here! That 30-day window is possible, but it also means that we'll lose steam in a project that's well underway :/ Could you go with a cherry-picked reverted patch or just with sticking with an older puppet tree during that 30-day period?

The other (and better!) alternative as I see it is... just backporting libmonitoring-plugin-perl for trusty and installing it on the Shinken hosts. My estimate is that this is probably a 10-15min task. We may be at the point where we've collectively spent more time discussing it than it actually requires for a fix :)

bd808 added a comment. · Oct 18 2018, 5:58 PM

I don't think that the Shinken project has its own puppetmaster, but I suppose we could cherry-pick something to the Cloud shared puppetmaster? If the backport is that easy, however, it sounds like it's worth a shot. Is that something that SRE can help with/do, or do I need to ask Giovanni or Arturo to take care of it?

So OK, I gave it a shot so that we can move things forward and not waste everyone's time. Took me 10 minutes to spawn a trusty chroot, and... 3 minutes to do the backport (echo 9 > debian/compat; dch --bpo; dpkg-buildpackage -uc -us). I spent another... 2 minutes to copy files around and reprepro include the backport in trusty-wikimedia, which I think should unbreak the setup and address the issue mentioned in the task description.
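
The backport steps described above can be sketched roughly as follows. This is an illustration under stated assumptions, not the exact procedure that was run: the function name `backport_to_trusty`, the `run` wrapper, the `DRY_RUN` switch, and both path arguments are hypothetical, and how the trusty chroot and the package source were obtained is not covered. Only the `echo 9 > debian/compat`, `dch --bpo`, `dpkg-buildpackage -uc -us`, and `reprepro include` steps and the `trusty-wikimedia` distribution name come from the comment.

```shell
#!/bin/sh
# Hypothetical wrapper: print each command, and execute it unless DRY_RUN=1.
run() {
    echo "+ $*"
    if [ "${DRY_RUN:-0}" != "1" ]; then
        "$@"
    fi
}

# Sketch of the backport. Arguments (both hypothetical paths):
#   $1  unpacked libmonitoring-plugin-perl source tree, inside a trusty chroot
#   $2  reprepro base directory of the target repository
# Paths containing single quotes are not handled; this is only a sketch.
backport_to_trusty() {
    src_dir=$1
    repo_base=$2

    # Set the debhelper compat level to 9, as in the comment above.
    run sh -c "echo 9 > '$src_dir/debian/compat'" || return 1

    # Add a backport changelog entry (dch may open an editor), then build
    # unsigned source and binary packages.
    run sh -c "cd '$src_dir' && dch --bpo && dpkg-buildpackage -uc -us" || return 1

    # dpkg-buildpackage writes the .changes file to the parent directory;
    # include it in the trusty-wikimedia distribution.
    run sh -c "reprepro -b '$repo_base' include trusty-wikimedia '$src_dir'/../*.changes"
}
```

Running it with `DRY_RUN=1 backport_to_trusty /path/to/source /path/to/repo` only prints the three commands, which is a cheap way to review them before touching a real repository.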

Then I went to test it... only to realize that check_ssl doesn't really work on trusty, and hasn't for a while, because of other issues introduced back in October 2016 (specifically: lack of OCSP support in trusty's IO::Socket::SSL). The other checks that were converted were check_jnx_alarms, check_bgp and check_pybal, none of which sound applicable to WMCS? Why is Shinken even provisioning those checks?

faidon closed this task as Resolved. · Oct 19 2018, 5:09 PM

@Andrew reports that this is fixed indeed, resolving.