Page MenuHomePhabricator

Icinga check defined from LVS configuration for cloudelastic are borked
Open, NormalPublic0 Story Points

Description

There are several "cloudelastic" LVS service defs starting around:

https://gerrit.wikimedia.org/r/plugins/gitiles/operations/puppet/+/production/hieradata/common/lvs/configuration.yaml#646

They all share a hostname but use different ports.

The icinga check part ends up creating just 1/N checks (I think because they share a hostname), and the host definition for it uses icinga1001's IP address instead of the service IP (which luckily fails because of the odd port number, otherwise we might've assumed it was ok if it was polling icinga1001's 443):

https://icinga.wikimedia.org/cgi-bin/icinga/status.cgi?host=cloudelastic.wikimedia.org

I have no idea what's going on with this and I'm giving up for the day, but maybe someone else can figure it out!

Event Timeline

BBlack created this task.Thu, Aug 1, 8:57 PM
Restricted Application added projects: Operations, Discovery-Search. · View Herald TranscriptThu, Aug 1, 8:57 PM
Restricted Application added a subscriber: Aklapper. · View Herald Transcript
CDanis added a subscriber: CDanis.Thu, Aug 1, 9:03 PM
ema moved this task from Triage to LoadBalancer on the Traffic board.Fri, Aug 2, 12:39 PM

I've seen the same behaviour configuring the ncredir LVS service as it's using two ports (80/443). Same happens with text and upload LVS services.

In those scenarios what we do is to provide check configuration only for the service listening on port 80 (http) and we let lvs::monitor_service_http_https to handle that automagically for us and create the HTTP and HTTPS checks:

ncredir:
    description: "Non canonical redirect service"
    class: high-traffic1
    sites:
    - eqiad
    - codfw
    ip: *ip_block054
    depool-threshold: '.5'
    monitors:
      ProxyFetch:
        url:
        - http://en.wikipedia.com/_status
    conftool:
      cluster: ncredir
      service: nginx
    icinga:
      uri: 'en.wikipedia.com!/_status'
      critical: true
      sites:
        eqiad:
          hostname: ncredir-lb.eqiad.wikimedia.org
        codfw:
          hostname: ncredir-lb.codfw.wikimedia.org
  ncredir-https:
    description: "Non canonical redirect service"
    class: high-traffic1
    sites:
    - eqiad
    - codfw
    ip: *ip_block054
    port: 443
    depool-threshold: '.5'
    monitors:
      ProxyFetch:
        url:
        - https://en.wikipedia.com/_status
    conftool:
      cluster: ncredir
      service: nginx

Using ncredir/ncredir-https as an example, you can see that ncredir-https lacks the icinga section in the configuration, because both checks (http/https) are being handled in the ncredir service configuration

About cloudelastic resolving to icinga1001, I had jbond help me do see where it cloudelastic.wikimedia.org resolves to and it seems to be resolving to the correct IP.
@Vgutierrez we could remove the icinga part of the configuration in configuration.yaml file and define the checks in lvs::monitor_services instead. I think that should work.

BBlack added a comment.EditedMon, Aug 5, 12:53 PM

So, yes, cloudelastic is correct in DNS for normal lookups. The issue is that the icinga check defines the virtual host entry for cloudelastic monitoring an explicit IP in its configuration, and that IP ends up being the IP of icinga1001, not of cloudelastic. This probably has to do with the puppet host context in which the resource is evaluated.

Regardless of whatever's causing that, most of the rest of the issue, I believe, lies in modules/lvs/templates/monitor_lvs.erb, which creates a hash of resource definitions which is keyed on the hostname, meaning it can only define one check per hostname (where we have several here sharing a hostname/IP, but differing in ports). That script/template/whatever has lots of other baggage going on as well, as noted by its own commentary (e.g. the hardcoding of the lb and lb6 suffixes, etc).

In the short term, I think @Mathew.onipe 's answer makes sense: we'll just get rid of the icinga stanzas here and create (several) explicit definitions in lvs::monitor_services. In the medium-to-long-term though: this was just the whipping tail end of the long train wreck of confusion and chaos that was "figure out how to define the puppetization of a complex LVS service in 2019", all of which I think needs some refactoring towards simplicity...

@BBlack yea yea.. I've missed your musing on complex system. Thanks. I will make a patch

Mathew.onipe triaged this task as Normal priority.Mon, Aug 5, 12:56 PM

Sadly, I don't think this will work as the host param will not be unique and icinga does not seem to handle that well. Another option might be to create more CNAMEs or more A-records like we have for git and git-ssh here: https://gerrit.wikimedia.org/r/plugins/gitiles/operations/dns/+/master/templates/wikimedia.org#336

This is dirty (I'm not sure) Maybe CNAMEs are more cleaner.
So we can have:
cloudelastic-chi > cloudelastic.wikimedia.org
cloudelastic-chi-pub -> cloudelastic.wikimedia.org

@BBlack @Vgutierrez what do you think?

Change 528491 had a related patch set uploaded (by Mathew.onipe; owner: Mathew.onipe):
[operations/puppet@production] lvs: isolate cloudelastic icinga check

https://gerrit.wikimedia.org/r/528491

Change 528491 merged by Gehel:
[operations/puppet@production] lvs: isolate cloudelastic icinga check

https://gerrit.wikimedia.org/r/528491

Change 528885 had a related patch set uploaded (by Mathew.onipe; owner: Mathew.onipe):
[operations/puppet@production] lvs: isolate cloudelastic icinga check

https://gerrit.wikimedia.org/r/528885

Change 529362 had a related patch set uploaded (by Jbond; owner: John Bond):
[operations/puppet@production] cloudelastic: fix monitored ip addresses

https://gerrit.wikimedia.org/r/529362

jbond added a subscriber: jbond.Fri, Aug 9, 2:00 PM

When this alerted today and ended up going right down the rabbit hole. Anyway i think i have found out some information.

So first Class['lvs::monitor'] calls create_resources(lvs::monitor_service_http_https, $monitors). The monitoring variable in this call is created by using the template command to parse modules/lvs/template/monitor_lvs.erb. This template uses the lvs::configuration::lvs_services: variable to generate $monitors used above.

If we resolve the yaml pointers for cloudelastic-chi-https and only concentrating on the keys used in monitor_lvs.erb [icinga, ip] we get the following

cloudelastic-chi-https:
  ip:
    eqiad:
      cloudelasticlb: 208.80.154.241
      cloudelasticlb6: 2620:0:861:ed1a::3:241
  icinga:
    check_command: check_https_lvs_on_port!cloudelastic.wikimedia.org!9243!/
    sites:
      eqiad:
        hostname: cloudelastic.wikimedia.org

once we run this through monitor_lvs.erb we end up with a hash like the following

tmp['cloudelastic.wikimedia.org'] = {
  'ip_address'  => nil,
  'ip_address6'  => nil,
}

The reason we get nil here instead of the ip address is because the erb script ettempts to find the key service['ip']['eqiad']['cloudelastic-chi-httpslb'] and service['ip']['eqiad']['cloudelastic-chi-httpslb6']

On first look i thought this should cause a failure as the ip_address paramater inlvs::monitor_service_http_https is mandatory. however on further digging the above ruby hash after first passing thrugh a yaml phase we end up with the following puppet variable

$monitors = {
  ...
  'cloudelastic.wikimedia.org' => { 'ip_address' => undef, ip6_address => undef } 
  ...
}

And apparently puppet treats a value of undef as a value and therefore does not fail with the expected message Lvs::Monitor_service_http_https['cloudelastic.wikimedia.org']: expects a value for parameter ip_address· The undef value is then passed as the ip_address to monitoring::host. when we look at monitoring::host we see that the ip_address has a default value of $facts['ip_address']. in this case puppet decides undef is not a value and therefore uses the default. lvs::monitor is resolved on the icinga host which is why this service ultimately ends up with the ip address of icinga

how to fix all this? i have just uploaded a CR which hopefully fixes this

@jbond Thank you!
You fix is way better than mine. I will look at the patch now

Mentioned in SAL (#wikimedia-operations) [2019-08-13T15:32:50Z] <bblack> disabled pupped on lvs1014, lvs1016, icinga1001 ahead of deploying https://gerrit.wikimedia.org/r/#/c/operations/puppet/+/528885/ - T229621

Change 528885 merged by BBlack:
[operations/puppet@production] lvs: isolate cloudelastic icinga check

https://gerrit.wikimedia.org/r/528885

This issue is solved for now and cloudelastic checks for all ports have been generated on icinga. However, only IPv4 checks were generated and this is Ok for now. If there's need to generate IPv6 checks, we can always reopen this task

Change 529362 abandoned by Jbond:
cloudelastic: fix monitored ip addresses

Reason:
This would have broken other parts of LVS fixed in I7a071e8acf8bb6b9e874de14047ee409a231615a

https://gerrit.wikimedia.org/r/529362

This comment was removed by jbond.