scap-o-scap: Bootstrapping a new host fails
Closed, Resolved · Public

Description

@elukey reported via IRC:

hey folks, I am bootstrapping a Cassandra cluster and getting
Error: /Stage[main]/Cassandra::Twcs/Scap::Target[cassandra/twcs]/Package[cassandra/twcs]: Provider scap3 is not functional on this host

I suspect that this is because /usr/bin/scap was not present on the host at the point this executed (Puppet marks a provider as not functional when a command it declares is missing), possibly because puppet had not recently run on deploy1002.

modules/cassandra/manifests/twcs.pp
class cassandra::twcs(
) {
    require ::cassandra

    scap::target { 'cassandra/twcs':
        deploy_user => 'deploy-service',
        manage_user => true,
    }
}
modules/scap/manifests/target.pp
...
    package { $package_name:
        ensure          => $ensure,
        install_options => [{
            owner => $deploy_user
        }],
        provider        => 'scap3',
        require         => $require_package,
    }
...
modules/scap/spec/types/package/scap3_spec.rb
...
provider_class = Puppet::Type.type(:package).provider(:scap3)
describe provider_class do
  before do
    @resource = Puppet::Type.type(:package).new(name: 'foo/deploy')
    @provider = provider_class.new(@resource)
    @resource.provider = @provider
    @resource[:install_options] = [{ 'owner' => 'mwdeploy' }]
    # Stub all filesystem operations
    allow(FileUtils).to receive(:chown_R)
    allow(FileUtils).to receive(:makedirs)
    allow(FileUtils).to receive(:rm_rf)
    # Stub our mwdeploy user
    allow(Etc).to receive(:getpwnam).with('mwdeploy').and_return(OpenStruct.new(uid: 666))
    # Stub the existence of our deploy-local command
    allow(@provider.class).to receive(:command)
      .with(:scap)
      .and_return('/usr/bin/scap')
  end
...

Event Timeline

To keep the archives happy: even after running puppet on deploy1002 (to update the dsh target lists), I wasn't able to run puppet on the ml-cache nodes. We had to run scap install-world --batch on deploy1002.

@dancy was asking on IRC so I had a quick look. From what I can tell:

  • ml-cache1002 was added to /etc/dsh/group/scap_targets by a puppet run at 09:45:11 UTC
  • this triggered a run of bootstrap-scap-targets, which was successful:

Jun 16 09:45:11 deploy1002 puppet-agent[31807]: (/Stage[main]/Scap::Dsh/Scap::Dsh::Group[scap_targets]/File[/etc/dsh/group/scap_targets]) Scheduling refresh of Exec[bootstrap-scap-targets]
Jun 16 09:45:47 deploy1002 puppet-agent[31807]: (/Stage[main]/Profile::Mediawiki::Deployment::Server/Exec[bootstrap-scap-targets]) Triggered 'refresh' from 1 event

  • subsequent puppet runs on ml-cache1002 kept failing at 09:55 and 09:58, i.e. after the change had already been applied on deploy1002

I think there are two problems here:

  • one is the immediate issue: puppet didn't work on the target host even after the successful run on deploy1002
  • the other is the design of the workflow itself

Can I ask where this approach was decided?
Because AFAICT, unless I misunderstood something, this is broken by design for the bootstrapping of new hosts, as it depends on an additional puppet run on deploy1002 before any puppet run on a target scap host can succeed.
It basically breaks the current provisioning automation for any host that is also a scap target.

If this approach is kept, at the very minimum the current reimage cookbook would need to be adapted to force a puppet run on deploy1002 in a quite hacky way, similar to what it already does with the Icinga host for different reasons. If possible, we should avoid making provisioning more brittle by adding more live dependencies to it.

@Volans pretty sure T303559 is the relevant task and change (disclaimer: the CR is from me, but don't take that to mean I understand the underlying scap implications :))

Change 806397 had a related patch set uploaded (by Jaime Nuche; author: Jaime Nuche):

[operations/puppet@production] scap bootstrap: refactor

https://gerrit.wikimedia.org/r/806397

@Volans, I created https://gerrit.wikimedia.org/r/806397 to refactor the workflow. It performs the bootstrapping on the targets and thus avoids the problem with the async process on the deployment server. However, I don't know enough Puppet and I've run into a wall.

That patch currently applies the new bootstrapping config in the scap class, but it needs access to at least one of the different hieradata values declaring deployment servers/masters. Probably, that value should be deployment_hosts. The problem is that we are supposed to inject hiera values only into the signatures of profiles: https://wikitech.wikimedia.org/wiki/Puppet_coding#Hiera. But if we did that, we would need to pass the value all the way down from a profile (or profiles) to the relevant classes. That's dozens of call sites, many of them for scap::target.

Can you think of a better way of making the deployment server value(s) available to the classes?

@dancy Having the bootstrapping in the scap class means the masters will also run the code and sync with themselves. That works in isolation, but it could create a race condition with scap install-world. There doesn't seem to be a clear place to add this so that it applies only to targets and not to masters; scap::target would work for scap3 targets, but adding the config to the class for MW targets (mediawiki::scap) means the change would end up on the masters.

As far as I can see, the solution would be an if condition in the scap class so that we only apply the bootstrapping if the current host is not in deployment_hosts (see above about injecting that).
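
For illustration only, a minimal sketch of that condition, assuming the scap class gained a deployment_hosts parameter (the parameter name and the bootstrap command path are assumptions, not the actual implementation):

modules/scap/manifests/init.pp (sketch)
class scap (
    Array[Stdlib::Host] $deployment_hosts = [],
) {
    # Only bootstrap scap on targets; the masters are kept in sync by
    # scap install-world, so running this there could race with it.
    unless $facts['networking']['fqdn'] in $deployment_hosts {
        exec { 'bootstrap-scap-target':
            command => '/usr/local/sbin/bootstrap-scap-target',  # assumed path
        }
    }
}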

@Volans Curiously, we seem to be implicitly injecting hiera values into the scap class that's declared in scap::target. And it seems to work: the right hostname is populated for scap::deployment_server :D See https://puppet-compiler.wmflabs.org/pcc-worker1001/35910/netbox1002.eqiad.wmnet/fulldiff.html

Since the class is already non-compliant with our current rules, maybe it's OK to also inject deployment_hosts in the same way? We would then need to add that argument to the explicit call to the scap class in profile::mediawiki::scap_client.
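
For illustration (the parameter plumbing here is assumed, not the actual change), the profile side would look something like this, with hiera injected only in the profile signature per the coding guide:

modules/profile/manifests/mediawiki/scap_client.pp (sketch)
class profile::mediawiki::scap_client (
    # Hiera is looked up only in the profile signature:
    Array[Stdlib::Host] $deployment_hosts = lookup('deployment_hosts'),
) {
    # Pass the value down explicitly instead of having the scap class
    # look it up itself.
    class { '::scap':
        deployment_hosts => $deployment_hosts,
    }
}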

@jnuche for those kinds of exceptions I turn the question over to @jbond ;)

@jnuche I landed on making the bootstrap-scap-target exec resource only run if ${scap_home}/scap/bin/scap does not exist. That will prevent it from running on already-bootstrapped nodes, which is what we want anyway.
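
A minimal sketch of that constraint, using the exec type's built-in creates parameter (the command path is an assumption):

exec { 'bootstrap-scap-target':
    command => '/usr/local/sbin/bootstrap-scap-target',  # assumed path
    # Puppet skips the exec when this file already exists, i.e. on
    # nodes that have already been bootstrapped:
    creates => "${scap_home}/scap/bin/scap",
}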

Change 807510 had a related patch set uploaded (by Jbond; author: jbond):

[operations/puppet@production] P:mediawiki::scap_client: add parameter to indicate scap master

https://gerrit.wikimedia.org/r/807510

Sorry for the delay in commenting.

That patch currently applies the new bootstrapping config in the scap class, but it needs access to at least one of the different hieradata values declaring deployment servers/masters.

Is this still required? Or is it resolved with the exec constraint mentioned by dancy? If it's still required, could someone point me to how/where it is to be used?

And it seems to work, the right hostname is populated for scap::deployment_server

The scap classes are quite old and I think they never got fully converted to the new profile/role/module way of doing things. Ideally this would either get moved to some profile, or in this case it may make more sense to have it as a hiera global, e.g. scap_deployment_server. However, there is no need to go off on that tangent now.

Since the class is already non-compliant with our current rules, maybe it's ok to also inject deployment_hosts in the same way

I'd prefer not to; however, I suspect that there are better ways to achieve what's needed. I'm just not 100% clear on what that is. If I have followed along correctly, I think this CR may achieve the desired goal.

Sorry for the delay in commenting.

That patch currently applies the new bootstrapping config in the scap class, but it needs access to at least one of the different hieradata values declaring deployment servers/masters.

Is this still required? Or is it resolved with the exec constraint mentioned by dancy? If it's still required, could someone point me to how/where it is to be used?

It is not required as far as I'm concerned.

Change 808058 had a related patch set uploaded (by Dzahn; author: Dzahn):

[labs/private@master] add missing fake keys for keyholder/trainbranchbot

https://gerrit.wikimedia.org/r/808058

Change 808058 merged by Dzahn:

[labs/private@master] add missing fake keys for keyholder/trainbranchbot

https://gerrit.wikimedia.org/r/808058

Change 806397 merged by Dzahn:

[operations/puppet@production] scap bootstrap: refactor

https://gerrit.wikimedia.org/r/806397

Change 808061 had a related patch set uploaded (by Dzahn; author: Dzahn):

[operations/puppet@production] scap: remove 'files' from file path

https://gerrit.wikimedia.org/r/808061

Change 808061 merged by Dzahn:

[operations/puppet@production] scap: remove 'files' from file path

https://gerrit.wikimedia.org/r/808061

Change 808062 had a related patch set uploaded (by Dzahn; author: Zabe):

[operations/puppet@production] scap: remove 'files' from puppet file url

https://gerrit.wikimedia.org/r/808062

Change 808062 merged by Dzahn:

[operations/puppet@production] scap: remove 'files' from puppet file url

https://gerrit.wikimedia.org/r/808062

dancy changed the task status from Open to Stalled. Jun 23 2022, 9:25 PM
dancy triaged this task as Low priority.

Lowering the priority of this ticket and keeping it open for a bit until we have verification that new nodes are bootstrapped properly.

This is what I get from new Cassandra nodes (ml-cache200[1-3]) upon first/second puppet run:

Notice: /Stage[main]/Cassandra::Logging/Scap::Target[cassandra/logstash-logback-encoder]/Group[deploy-service]/ensure: created
Notice: /Stage[main]/Cassandra::Logging/Scap::Target[cassandra/logstash-logback-encoder]/User[deploy-service]/ensure: created
Notice: /Stage[main]/Cassandra::Logging/Scap::Target[cassandra/logstash-logback-encoder]/File[/var/lib/deploy-service]/ensure: created
Error: Execution of '/usr/bin/scap deploy-local --repo cassandra/logstash-logback-encoder -D log_json:False' returned 70: 
Error: /Stage[main]/Cassandra::Logging/Scap::Target[cassandra/logstash-logback-encoder]/Package[cassandra/logstash-logback-encoder]/ensure: change from 'absent' to 'present' failed: Execution of '/usr/bin/scap deploy-local --repo cassandra/logstash-logback-encoder -D log_json:False' returned 70: 
Error: Execution of '/usr/bin/scap deploy-local --repo cassandra/twcs -D log_json:False' returned 70: 
Error: /Stage[main]/Cassandra::Twcs/Scap::Target[cassandra/twcs]/Package[cassandra/twcs]/ensure: change from 'absent' to 'present' failed: Execution of '/usr/bin/scap deploy-local --repo cassandra/twcs -D log_json:False' returned 70:

I ran scap install-world --batch on deploy1002 to unblock my use case, lemme know if you need more info from me :)

Edit: The above command didn't fix it; I had forgotten that puppet needed to run on deploy1002 first to update the scap targets. Once that was done, everything got deployed to the ml-cache nodes correctly.

dancy changed the task status from Stalled to In Progress. Jun 28 2022, 2:22 PM
dancy raised the priority of this task from Low to Medium.

We need some way to ensure that the File['/usr/bin/scap'] resource (defined in operations/puppet/modules/scap/manifests/init.pp) is created before the scap3 provider (defined in operations/puppet/modules/scap/lib/puppet/provider/package/scap3.rb) is used.
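
One way to express that dependency, sketched here against the package resource quoted in the description (illustrative only, and not necessarily the final form of the CR below):

package { $package_name:
    ensure          => $ensure,
    install_options => [{
        owner => $deploy_user
    }],
    provider        => 'scap3',
    # Make sure the scap CLI is in place before the provider invokes it:
    require         => [$require_package, File['/usr/bin/scap']],
}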

Change 809270 had a related patch set uploaded (by Ahmon Dancy; author: Ahmon Dancy):

[operations/puppet@production] scap: Make scap3 provider packages depend on /usr/bin/scap

https://gerrit.wikimedia.org/r/809270

We need some way to ensure that the File['/usr/bin/scap'] resource (defined in operations/puppet/modules/scap/manifests/init.pp) is created before the scap3 provider (defined in operations/puppet/modules/scap/lib/puppet/provider/package/scap3.rb) is used.

@dancy That's strange: according to the Puppet docs, that ordering should already be honored, since the target manifest includes the scap class defined in init.pp. See https://puppet.com/docs/puppet/5.5/lang_relationships.html

@elukey would you have a full Puppet execution log available for ml-cache200[1-3]?

It turns out the opening paragraph of the doc I pasted in my previous comment refers to lexical scope only. For the relationship to actually exist, you need to declare it explicitly, for example by using require instead of include in scap::target.
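
A minimal sketch of that change: require both declares a class (like include) and makes the surrounding scope depend on it, so every resource in scap::target is applied after the scap class and its File['/usr/bin/scap'] resource:

modules/scap/manifests/target.pp (sketch)
define scap::target (
    # ... existing parameters ...
) {
    require ::scap    # was: include ::scap

    # ... package { $package_name: provider => 'scap3', ... } ...
}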

I've updated the latest patch: https://gerrit.wikimedia.org/r/c/operations/puppet/+/809270/

Change 809270 merged by Alexandros Kosiaris:

[operations/puppet@production] scap: make scap::target require the scap class

https://gerrit.wikimedia.org/r/809270

jnuche claimed this task.