Convert makevm to spicerack cookbook
Closed, Resolved (Public)

Description

makevm[1] is a small shell script that, while very useful, requires one to know of its existence, log in to a ganeti cluster master, execute it, answer questions and obtain the required VM. It would be beneficial, at the very least for conformity's sake, to create an alternative spicerack cookbook that does at least that but also

[1] https://github.com/wikimedia/puppet/blob/production/modules/profile/files/ganeti/makevm.sh

Event Timeline

akosiaris renamed this task from Convert makevm το spicerack cookbook to Convert makevm to spicerack cookbook.Sep 10 2018, 3:54 PM
MoritzMuehlenhoff triaged this task as Medium priority.Sep 25 2018, 9:46 AM
akosiaris updated the task description.Dec 17 2018, 3:04 PM
crusnov moved this task from Backlog to Up next on the SRE-tools board.Feb 14 2019, 5:37 PM

Is it okay to use rapi for this or is there a compelling reason to use cumin+ganeti-* commands?

It is probably easier to do it via cumin+ganeti-* commands. Some security aspects don't have to be considered in that case because they are implicit (they depend on being able to ssh to the ganeti master), such as populating the rapi username+password on the spicerack host and opening firewalls. A lot of scaffolding also already exists in the spicerack cookbooks. Aside from it being (potentially) less work, I can't think of a compelling reason to force cumin usage.

elukey added a subscriber: elukey.Mar 5 2019, 12:18 PM
crusnov claimed this task.Mar 14 2019, 8:34 PM
crusnov added a project: User-crusnov.

Change 496527 had a related patch set uploaded (by CRusnov; owner: CRusnov):
[operations/cookbooks@master] Port MakeVM to cookbook.

https://gerrit.wikimedia.org/r/496527

crusnov moved this task from Backlog to Pending on the User-crusnov board.Mar 14 2019, 8:36 PM
crusnov moved this task from Pending to In Progress on the User-crusnov board.
crusnov moved this task from Up next to In Progress on the SRE-tools board.Mar 16 2019, 12:25 AM

Change 496527 had a related patch set uploaded (by CRusnov; owner: CRusnov):
[operations/cookbooks@master] Port MakeVM to cookbook.

https://gerrit.wikimedia.org/r/496527

crusnov moved this task from In Progress to Pending on the User-crusnov board.Mar 19 2019, 5:44 PM
crusnov moved this task from In Progress to In Code Review on the SRE-tools board.Mar 19 2019, 6:11 PM

Change 496527 merged by CRusnov:
[operations/cookbooks@master] Port MakeVM to a cookbook

https://gerrit.wikimedia.org/r/496527

crusnov moved this task from Pending to Complete on the User-crusnov board.May 1 2019, 6:47 PM
fsero moved this task from Backlog to Incoming on the serviceops board.Jun 20 2019, 2:21 PM

Should we close this? Is there anything left to be done?

Volans added a subscriber: Volans.Jul 1 2019, 1:15 PM

Not yet as the script has clearly not been tested:

$ sudo cookbook sre.ganeti.makevm -h
Exception raised while parsing arguments for cookbook sre.ganeti.makevm:
Traceback (most recent call last):
  File "/usr/lib/python3/dist-packages/spicerack/cookbook.py", line 460, in _parse_args
    args = self.module.argument_parser().parse_args(self.args)
  File "/srv/deployment/spicerack/cookbooks/sre/ganeti/makevm.py", line 37, in argument_parser
    clusters_and_rows = [cluster + '_' + row for cluster, row in CLUSTERS_AND_ROWS.items()]
  File "/srv/deployment/spicerack/cookbooks/sre/ganeti/makevm.py", line 37, in <listcomp>
    clusters_and_rows = [cluster + '_' + row for cluster, row in CLUSTERS_AND_ROWS.items()]
TypeError: Can't convert 'tuple' object to str implicitly

In spicerack that is defined as:

CLUSTERS_AND_ROWS = {'eqiad': ('A', 'C'), 'codfw': ('A', 'B')}

So the current code doesn't actually work.
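
To illustrate the failure: each value in the dict is a tuple of rows, so concatenating it to a string raises the TypeError above. A minimal sketch of one way to build the choices, assuming one choice per (cluster, row) pair is wanted; the helper name cluster_row_choices is hypothetical and this is not necessarily how the cookbook was fixed:

```python
# Value copied from the comment above: cluster name -> tuple of rows.
CLUSTERS_AND_ROWS = {'eqiad': ('A', 'C'), 'codfw': ('A', 'B')}


def cluster_row_choices(clusters_and_rows):
    """Return sorted 'cluster_row' strings, one per (cluster, row) pair.

    Hypothetical helper: iterates over each row of the tuple instead of
    concatenating the whole tuple to a string.
    """
    return sorted(
        cluster + '_' + row
        for cluster, rows in clusters_and_rows.items()
        for row in rows
    )


print(cluster_row_choices(CLUSTERS_AND_ROWS))
# ['codfw_A', 'codfw_B', 'eqiad_A', 'eqiad_C']
```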

Change 520011 had a related patch set uploaded (by Elukey; owner: Elukey):
[operations/cookbooks@master] sre.ganeti.makevm: add the possibility to choose link analytics

https://gerrit.wikimedia.org/r/520011

Change 520011 merged by jenkins-bot:
[operations/cookbooks@master] sre.ganeti.makevm: add the possibility to choose link analytics

https://gerrit.wikimedia.org/r/520011

elukey added a comment.EditedJul 1 2019, 3:01 PM

Almost!

elukey@cumin1001:~$ sudo cookbook sre.ganeti.makevm eqiad_A an-tool1006.eqiad.wmnet --vcpus 2 --memory 4g --disk 150g --link analytics
usage: cookbook [-h] [--vcpus VCPUS] [--memory MEMORY] [--disk DISK]
                [--link {public,private,analytics}]
                {codfw_A,codfw_B,eqiad_A,eqiad_C} fqdn
cookbook: error: argument cluster_and_row: invalid choice: ['eqiad', 'A'] (choose from 'codfw_A', 'codfw_B', 'eqiad_A', 'eqiad_C')
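
The error is consistent with how argparse works: a type= converter is applied before the value is checked against choices, so a converter that splits 'eqiad_A' yields ['eqiad', 'A'], which is not one of the choices. A minimal sketch of doing the split after parsing instead, assuming the argument names shown in the usage message; build_parser and parse are hypothetical names, not the cookbook's actual functions:

```python
import argparse

CHOICES = ['codfw_A', 'codfw_B', 'eqiad_A', 'eqiad_C']


def build_parser():
    """Build a parser that validates the raw 'cluster_row' string."""
    parser = argparse.ArgumentParser(prog='cookbook')
    # No type= converter here: argparse would run it before the
    # choices check and compare the converted value to the choices.
    parser.add_argument('cluster_and_row', choices=CHOICES)
    parser.add_argument('fqdn')
    return parser


def parse(argv):
    """Parse argv, then split the validated value outside of argparse."""
    args = build_parser().parse_args(argv)
    cluster, row = args.cluster_and_row.split('_')
    return cluster, row, args.fqdn


print(parse(['eqiad_A', 'an-tool1006.eqiad.wmnet']))
# ('eqiad', 'A', 'an-tool1006.eqiad.wmnet')
```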

Change 520028 had a related patch set uploaded (by Elukey; owner: Elukey):
[operations/cookbooks@master] sre.ganeti.makevm: move split away from argparse

https://gerrit.wikimedia.org/r/520028

Change 520028 merged by jenkins-bot:
[operations/cookbooks@master] sre.ganeti.makevm: move split away from argparse

https://gerrit.wikimedia.org/r/520028

Change 520033 had a related patch set uploaded (by Elukey; owner: Elukey):
[operations/cookbooks@master] sre.ganet.makevm: add info about chosen link before creation

https://gerrit.wikimedia.org/r/520033

Change 520033 merged by jenkins-bot:
[operations/cookbooks@master] sre.ganet.makevm: add info about chosen link before creation

https://gerrit.wikimedia.org/r/520033

elukey added a comment.Jul 1 2019, 3:41 PM
2019-07-01 15:36:51,623 [INFO] START - Cookbook sre.ganeti.makevm
2019-07-01 15:36:51,749 [INFO] Creating new VM named an-tool1006.eqiad.wmnet in eqiad with row=A vcpu=2 memory=4 gigabytes disk=150 gigabytes link=analytics
2019-07-01 15:36:57,480 [INFO] Executing commands [cumin.transports.Command('gnt-instance add -t drbd -I hail --net 0:link=analytics --hypervisor-parameters=kvm:boot_order=network -o bootstrap+default --no-install -g row_A -B vcpus=2,memory=4g --disk 0:size=150g an-tool1006.eqiad.wmnet')] on '1' hosts: ganeti1001.eqiad.wmnet
2019-07-01 15:37:03,921 [INFO] Completed command 'gnt-instance add -t drbd -I hail --net 0:link=analytics --hypervisor-parameters=kvm:boot_order=network -o bootstrap+default --no-install -g row_A -B vcpus=2,memory=4g --disk 0:size=150g an-tool1006.eqiad.wmnet'
2019-07-01 15:37:03,923 [ERROR] 100.0% (1/1) of nodes failed to execute command 'gnt-instance add...1006.eqiad.wmnet': ganeti1001.eqiad.wmnet
2019-07-01 15:37:03,923 [CRITICAL] 0.0% (0/1) success ratio (< 100.0% threshold) for command: 'gnt-instance add...1006.eqiad.wmnet'. Aborting.
2019-07-01 15:37:03,924 [CRITICAL] 0.0% (0/1) success ratio (< 100.0% threshold) of nodes successfully executed all commands. Aborting.
2019-07-01 15:37:03,924 [ERROR] Exception raised while executing cookbook sre.ganeti.makevm:
Traceback (most recent call last):
  File "/usr/lib/python3/dist-packages/spicerack/cookbook.py", line 407, in _run
    ret = self.module.run(args, self.spicerack)
  File "/srv/deployment/spicerack/cookbooks/sre/ganeti/makevm.py", line 94, in run
    results = ganeti_host.run_sync(command)
  File "/usr/lib/python3/dist-packages/spicerack/remote.py", line 214, in run_sync
    batch_sleep=batch_sleep, is_safe=is_safe)
  File "/usr/lib/python3/dist-packages/spicerack/remote.py", line 383, in _execute
    raise RemoteExecutionError(ret, 'Cumin execution failed')
spicerack.remote.RemoteExecutionError: Cumin execution failed (exit_code=2)
2019-07-01 15:37:03,926 [INFO] END (FAIL) - Cookbook sre.ganeti.makevm (exit_code=99)
elukey added a comment.EditedJul 1 2019, 3:44 PM

From an old task to create an-tool1005 (https://phabricator.wikimedia.org/T217738):

gnt-instance add -t drbd -I hail --net 0:link=analytics --hypervisor-parameters=kvm:boot_order=network -o debootstrap+default --no-install -g row_C -B vcpus=4,memory=6g --disk 0:size=20g an-tool1005.eqiad.wmnet

vs the one that failed

gnt-instance add -t drbd -I hail --net 0:link=analytics --hypervisor-parameters=kvm:boot_order=network -o bootstrap+default --no-install -g row_A -B vcpus=2,memory=4g --disk 0:size=150g an-tool1006.eqiad.wmnet

That looks good, so probably the creation of the VM itself on ganeti1001 failed..

elukey added a comment.Jul 1 2019, 4:45 PM

Direct execution leads to:

elukey@ganeti1001:~$ sudo gnt-instance add -t drbd -I hail --net 0:link=analytics --hypervisor-parameters=kvm:boot_order=network -o bootstrap+default --no-install -g row_A -B vcpus=2,memory=4g --disk 0:size=150g an-tool1006.eqiad.wmnet
Mon Jul  1 16:44:37 2019  - INFO: No-installation mode selected, disabling startup
Mon Jul  1 16:44:41 2019  - INFO: Selected nodes for instance an-tool1006.eqiad.wmnet via iallocator hail: ganeti1007.eqiad.wmnet, ganeti1006.eqiad.wmnet
Failure: command execution error:
OS Parameters validation failed on node ganeti1007.eqiad.wmnet: Directory for OS bootstrap not found in search path

I just found one bit that is different: -o debootstrap+default vs -o bootstrap+default

Change 520044 had a related patch set uploaded (by Elukey; owner: Elukey):
[operations/cookbooks@master] sre.ganeti.makevm: fix create vm command

https://gerrit.wikimedia.org/r/520044

elukey added a comment.Jul 1 2019, 5:01 PM

@akosiaris I know that today I asked you 1000 questions about ganeti, but if you could review the diff between debootstrap+default and bootstrap+default it would be super great (maybe they are not related to the error that I reported..)

bootstrap+default does not exist. There is no such ganeti OS definition. So, that error, "Directory for OS bootstrap not found in search path" is fully expected given that.

If you are interested in what a Ganeti OS definition is, the TL;DR is in https://github.com/ganeti/ganeti/wiki/OS-Definitions

The +default part means that this OS definition allows creating variants of it, default being, well, the default, and the only one that we currently have.

A dpkg -L ganeti-instance-debootstrap on a ganeti host would provide you with the actual technical details (there aren't many in reality)
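
For reference, a minimal sketch of assembling the create command with the correct OS definition. The flags are copied from the working an-tool1005 command quoted earlier in this task; the function name and its parameters are hypothetical and do not claim to match the cookbook's code:

```python
def build_create_command(fqdn, row, vcpus, memory_gb, disk_gb, link):
    """Return the gnt-instance command line as a single string.

    Hypothetical helper. The OS definition is 'debootstrap+default'
    (name + variant); 'bootstrap+default' does not exist.
    """
    return (
        'gnt-instance add -t drbd -I hail --net 0:link={link} '
        '--hypervisor-parameters=kvm:boot_order=network '
        '-o debootstrap+default --no-install -g row_{row} '
        '-B vcpus={vcpus},memory={memory}g --disk 0:size={disk}g {fqdn}'
    ).format(link=link, row=row, vcpus=vcpus, memory=memory_gb,
             disk=disk_gb, fqdn=fqdn)


print(build_create_command('an-tool1005.eqiad.wmnet', 'C', 4, 6, 20, 'analytics'))
```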

Change 520044 merged by jenkins-bot:
[operations/cookbooks@master] sre.ganeti.makevm: fix create vm command

https://gerrit.wikimedia.org/r/520044

elukey added a comment.EditedJul 1 2019, 5:52 PM

After launching the cookbook, it got stuck:

elukey@cumin1001:~$ sudo cookbook sre.ganeti.makevm eqiad_A an-tool1006.eqiad.wmnet --vcpus 2 --memory 4 --disk 150 --link analytics
START - Cookbook sre.ganeti.makevm
Creating new VM named an-tool1006.eqiad.wmnet in eqiad with row=A vcpu=2 memory=4 gigabytes disk=150 gigabytes link=analytics
Is this correct?
Type "done" to proceed
> done

On ganeti1001:

(gdb) py-bt
Traceback (most recent call first):
  File "/usr/share/ganeti/2.15/ganeti/rpc/transport.py", line 183, in Recv
    data = self.socket.recv(4096)
  File "/usr/share/ganeti/2.15/ganeti/rpc/transport.py", line 205, in Call
    return self.Recv()
  File "/usr/share/ganeti/2.15/ganeti/rpc/client.py", line 225, in send
    return self.transport.Call(data)
  File "/usr/share/ganeti/2.15/ganeti/rpc/transport.py", line 225, in RetryOnNetworkError
    return fn(try_no)
  File "/usr/share/ganeti/2.15/ganeti/rpc/client.py", line 227, in _SendMethodCall
    lambda _: self._CloseTransport())
  File "/usr/share/ganeti/2.15/ganeti/rpc/client.py", line 144, in CallRPCMethod
    response_msg = transport_cb(request_msg)
  File "/usr/share/ganeti/2.15/ganeti/rpc/client.py", line 249, in CallMethod
    version=self.version)
  File "/usr/share/ganeti/2.15/ganeti/luxi.py", line 180, in WaitForJobChangeOnce
    min(WFJC_TIMEOUT, timeout)))
  File "/usr/share/ganeti/2.15/ganeti/cli.py", line 858, in WaitForJobChangeOnce
    prev_job_info, prev_log_serial)
  File "/usr/share/ganeti/2.15/ganeti/cli.py", line 729, in GenericPollJob
    prev_logmsg_serial)
  File "/usr/share/ganeti/2.15/ganeti/cli.py", line 955, in PollJob
    return GenericPollJob(job_id, _LuxiJobPollCb(cl), reporter)
  File "/usr/share/ganeti/2.15/ganeti/cli.py", line 976, in SubmitOpCode
    reporter=reporter)
  File "/usr/share/ganeti/2.15/ganeti/cli.py", line 1011, in SubmitOrSend
    return SubmitOpCode(op, cl=cl, feedback_fn=feedback_fn, opts=opts)
  File "/usr/share/ganeti/2.15/ganeti/cli.py", line 1459, in GenericInstanceCreate
    SubmitOrSend(op, opts)
  File "/usr/share/ganeti/2.15/ganeti/client/gnt_instance.py", line 263, in AddInstance
    return GenericInstanceCreate(constants.INSTANCE_CREATE, opts, args)
  File "/usr/share/ganeti/2.15/ganeti/cli.py", line 1221, in GenericMain
    result = func(options, args)
  File "/usr/share/ganeti/2.15/ganeti/client/gnt_instance.py", line 1741, in Main
    env_override=_ENV_OVERRIDE)
  File "/usr/sbin/gnt-instance", line 21, in <module>
    sys.exit(main.Main())

The instance has been created, but:

elukey@ganeti1001:~$ sudo gnt-instance list  | grep an-tool
an-tool1005.eqiad.wmnet            kvm        debootstrap+default ganeti1004.eqiad.wmnet running      6.0G
an-tool1006.eqiad.wmnet            kvm        debootstrap+default ganeti1007.eqiad.wmnet ADMIN_down      -
elukey added a comment.Jul 1 2019, 6:05 PM

Nevermind, found the following in /var/log/ganeti on ganeti1001:

2019-07-01 18:03:52,655: job-646757 pid=7552 INFO - device disk/0: 52.70% done, 32m 33s remaining (estimated)

So it is doing something, we'll just need to wait.

Interesting, it sure does take a while for the disk to build, and the tool will wait.
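
In case it helps anyone watching a long disk sync, here is a minimal sketch that extracts the progress and the estimated remaining time from a log line in the format shown above. The function name and the regular expression are my own assumptions based on that single log format:

```python
import re

# Matches e.g. "device disk/0: 52.70% done, 32m 33s remaining (estimated)"
PROGRESS_RE = re.compile(
    r'device (?P<device>\S+):\s+(?P<percent>\d+(?:\.\d+)?)% done, '
    r'(?P<eta>.+?) remaining'
)
UNIT_SECONDS = {'h': 3600, 'm': 60, 's': 1}


def parse_progress(line):
    """Return (device, percent, eta_seconds), or None if no progress info."""
    match = PROGRESS_RE.search(line)
    if match is None:
        return None
    # The ETA looks like "5h 12m 3s", "32m 33s" or "31s".
    eta_seconds = sum(
        int(value) * UNIT_SECONDS[unit]
        for value, unit in re.findall(r'(\d+)([hms])', match.group('eta'))
    )
    return match.group('device'), float(match.group('percent')), eta_seconds


line = ('2019-07-01 18:03:52,655: job-646757 pid=7552 INFO - device disk/0: '
        '52.70% done, 32m 33s remaining (estimated)')
print(parse_progress(line))
# ('disk/0', 52.7, 1953)
```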

elukey added a comment.Jul 1 2019, 6:46 PM

All logs (they were emitted only at the end):

Mon Jul  1 17:27:36 2019  - INFO: No-installation mode selected, disabling startup
Mon Jul  1 17:27:40 2019  - INFO: Selected nodes for instance an-tool1006.eqiad.wmnet via iallocator hail: ganeti1007.eqiad.wmnet, ganeti1006.eqiad.wmnet
Mon Jul  1 17:27:41 2019 * creating instance disks...
Mon Jul  1 17:27:44 2019 adding instance an-tool1006.eqiad.wmnet to cluster config
Mon Jul  1 17:27:44 2019 adding disks to cluster config
Mon Jul  1 17:27:45 2019  - INFO: Waiting for instance an-tool1006.eqiad.wmnet to sync disks
Mon Jul  1 17:27:45 2019  - INFO: - device disk/0:  0.10% done, 5h 12m 3s remaining (estimated)
Mon Jul  1 17:28:45 2019  - INFO: - device disk/0:  1.50% done, 1h 5m 22s remaining (estimated)
Mon Jul  1 17:29:45 2019  - INFO: - device disk/0:  2.90% done, 1h 5m 47s remaining (estimated)
Mon Jul  1 17:30:45 2019  - INFO: - device disk/0:  4.40% done, 1h 6m 11s remaining (estimated)
Mon Jul  1 17:31:46 2019  - INFO: - device disk/0:  5.80% done, 1h 2m 58s remaining (estimated)
Mon Jul  1 17:32:46 2019  - INFO: - device disk/0:  7.20% done, 1h 3m 47s remaining (estimated)
Mon Jul  1 17:33:46 2019  - INFO: - device disk/0:  8.60% done, 1h 4m 23s remaining (estimated)
Mon Jul  1 17:34:46 2019  - INFO: - device disk/0: 10.10% done, 1h 0m 33s remaining (estimated)
Mon Jul  1 17:35:46 2019  - INFO: - device disk/0: 11.50% done, 1h 2m 29s remaining (estimated)
Mon Jul  1 17:36:47 2019  - INFO: - device disk/0: 13.00% done, 58m 19s remaining (estimated)
Mon Jul  1 17:37:47 2019  - INFO: - device disk/0: 14.40% done, 1h 3m 31s remaining (estimated)
Mon Jul  1 17:38:47 2019  - INFO: - device disk/0: 15.80% done, 54m 43s remaining (estimated)
Mon Jul  1 17:39:47 2019  - INFO: - device disk/0: 17.20% done, 54m 24s remaining (estimated)
Mon Jul  1 17:40:47 2019  - INFO: - device disk/0: 18.70% done, 53m 56s remaining (estimated)
Mon Jul  1 17:41:48 2019  - INFO: - device disk/0: 20.20% done, 52m 5s remaining (estimated)
Mon Jul  1 17:42:48 2019  - INFO: - device disk/0: 21.70% done, 51m 27s remaining (estimated)
Mon Jul  1 17:43:48 2019  - INFO: - device disk/0: 23.20% done, 50m 50s remaining (estimated)
Mon Jul  1 17:44:48 2019  - INFO: - device disk/0: 24.70% done, 49m 47s remaining (estimated)
Mon Jul  1 17:45:48 2019  - INFO: - device disk/0: 26.20% done, 46m 43s remaining (estimated)
Mon Jul  1 17:46:49 2019  - INFO: - device disk/0: 27.70% done, 46m 32s remaining (estimated)
Mon Jul  1 17:47:49 2019  - INFO: - device disk/0: 29.20% done, 46m 12s remaining (estimated)
Mon Jul  1 17:48:49 2019  - INFO: - device disk/0: 30.70% done, 45m 39s remaining (estimated)
Mon Jul  1 17:49:49 2019  - INFO: - device disk/0: 32.20% done, 46m 15s remaining (estimated)
Mon Jul  1 17:50:49 2019  - INFO: - device disk/0: 33.70% done, 42m 39s remaining (estimated)
Mon Jul  1 17:51:50 2019  - INFO: - device disk/0: 35.20% done, 42m 7s remaining (estimated)
Mon Jul  1 17:52:50 2019  - INFO: - device disk/0: 36.70% done, 41m 47s remaining (estimated)
Mon Jul  1 17:53:50 2019  - INFO: - device disk/0: 38.20% done, 39m 9s remaining (estimated)
Mon Jul  1 17:54:50 2019  - INFO: - device disk/0: 39.70% done, 39m 46s remaining (estimated)
Mon Jul  1 17:55:50 2019  - INFO: - device disk/0: 41.20% done, 38m 32s remaining (estimated)
Mon Jul  1 17:56:51 2019  - INFO: - device disk/0: 42.70% done, 38m 41s remaining (estimated)
Mon Jul  1 17:57:51 2019  - INFO: - device disk/0: 44.20% done, 37m 6s remaining (estimated)
Mon Jul  1 17:58:51 2019  - INFO: - device disk/0: 45.70% done, 36m 7s remaining (estimated)
Mon Jul  1 17:59:51 2019  - INFO: - device disk/0: 47.20% done, 34m 9s remaining (estimated)
Mon Jul  1 18:00:52 2019  - INFO: - device disk/0: 48.70% done, 33m 30s remaining (estimated)
Mon Jul  1 18:01:52 2019  - INFO: - device disk/0: 50.20% done, 33m 1s remaining (estimated)
Mon Jul  1 18:02:52 2019  - INFO: - device disk/0: 51.40% done, 55m 29s remaining (estimated)
Mon Jul  1 18:03:52 2019  - INFO: - device disk/0: 52.70% done, 32m 33s remaining (estimated)
Mon Jul  1 18:04:52 2019  - INFO: - device disk/0: 54.20% done, 29m 54s remaining (estimated)
Mon Jul  1 18:05:53 2019  - INFO: - device disk/0: 55.70% done, 29m 13s remaining (estimated)
Mon Jul  1 18:06:53 2019  - INFO: - device disk/0: 57.20% done, 28m 28s remaining (estimated)
Mon Jul  1 18:07:53 2019  - INFO: - device disk/0: 58.70% done, 26m 20s remaining (estimated)
Mon Jul  1 18:08:53 2019  - INFO: - device disk/0: 60.20% done, 25m 36s remaining (estimated)
Mon Jul  1 18:09:54 2019  - INFO: - device disk/0: 61.70% done, 25m 17s remaining (estimated)
Mon Jul  1 18:10:54 2019  - INFO: - device disk/0: 63.20% done, 24m 14s remaining (estimated)
Mon Jul  1 18:11:54 2019  - INFO: - device disk/0: 64.70% done, 22m 31s remaining (estimated)
Mon Jul  1 18:12:54 2019  - INFO: - device disk/0: 66.20% done, 21m 51s remaining (estimated)
Mon Jul  1 18:13:54 2019  - INFO: - device disk/0: 67.70% done, 21m 0s remaining (estimated)
Mon Jul  1 18:14:55 2019  - INFO: - device disk/0: 69.20% done, 20m 24s remaining (estimated)
Mon Jul  1 18:15:55 2019  - INFO: - device disk/0: 70.70% done, 18m 39s remaining (estimated)
Mon Jul  1 18:16:55 2019  - INFO: - device disk/0: 72.20% done, 17m 50s remaining (estimated)
Mon Jul  1 18:17:55 2019  - INFO: - device disk/0: 73.70% done, 17m 3s remaining (estimated)
Mon Jul  1 18:18:55 2019  - INFO: - device disk/0: 75.20% done, 16m 20s remaining (estimated)
Mon Jul  1 18:19:56 2019  - INFO: - device disk/0: 76.70% done, 15m 29s remaining (estimated)
Mon Jul  1 18:20:56 2019  - INFO: - device disk/0: 78.20% done, 14m 0s remaining (estimated)
Mon Jul  1 18:21:56 2019  - INFO: - device disk/0: 79.70% done, 13m 12s remaining (estimated)
Mon Jul  1 18:22:56 2019  - INFO: - device disk/0: 81.20% done, 12m 22s remaining (estimated)
Mon Jul  1 18:23:57 2019  - INFO: - device disk/0: 82.70% done, 11m 30s remaining (estimated)
Mon Jul  1 18:24:57 2019  - INFO: - device disk/0: 84.20% done, 10m 7s remaining (estimated)
Mon Jul  1 18:25:57 2019  - INFO: - device disk/0: 85.70% done, 9m 13s remaining (estimated)
Mon Jul  1 18:26:57 2019  - INFO: - device disk/0: 87.20% done, 8m 21s remaining (estimated)
Mon Jul  1 18:27:57 2019  - INFO: - device disk/0: 88.70% done, 7m 26s remaining (estimated)
Mon Jul  1 18:28:58 2019  - INFO: - device disk/0: 90.20% done, 6m 14s remaining (estimated)
Mon Jul  1 18:29:58 2019  - INFO: - device disk/0: 91.70% done, 5m 21s remaining (estimated)
Mon Jul  1 18:30:58 2019  - INFO: - device disk/0: 93.20% done, 4m 25s remaining (estimated)
Mon Jul  1 18:31:58 2019  - INFO: - device disk/0: 94.70% done, 3m 29s remaining (estimated)
Mon Jul  1 18:32:58 2019  - INFO: - device disk/0: 96.20% done, 2m 33s remaining (estimated)
Mon Jul  1 18:33:59 2019  - INFO: - device disk/0: 97.70% done, 1m 28s remaining (estimated)
Mon Jul  1 18:34:59 2019  - INFO: - device disk/0: 99.20% done, 31s remaining (estimated)
Mon Jul  1 18:35:30 2019  - INFO: - device disk/0: 100.00% done, 1s remaining (estimated)
Mon Jul  1 18:35:31 2019  - INFO: - device disk/0: 100.00% done, 0s remaining (estimated)
Mon Jul  1 18:35:31 2019  - INFO: - device disk/0: 100.00% done, 0s remaining (estimated)
Mon Jul  1 18:35:32 2019  - INFO: - device disk/0: 100.00% done, 0s remaining (estimated)
Mon Jul  1 18:35:32 2019  - INFO: Instance an-tool1006.eqiad.wmnet's disks are in sync
Mon Jul  1 18:35:32 2019  - INFO: Waiting for instance an-tool1006.eqiad.wmnet to sync disks
Mon Jul  1 18:35:32 2019  - INFO: Instance an-tool1006.eqiad.wmnet's disks are in sync
instance an-tool1006.eqiad.wmnet created with MAC aa:00:00:97:e8:24
Dzahn added a subscriber: Dzahn.Jul 1 2019, 7:30 PM

Regarding the very last line, where it outputs the MAC address: in the original script I was thinking "maybe we can make this even more automatic and have it create the needed puppet change in the install_server module to add the MAC address to DHCP". But of course that would run locally and not on a cumin server, and it would also probably be non-trivial to insert code in the right place rather than just appending it. On the other hand, maybe appending and ordering by date instead of alphabetically would also be just fine.

Volans added a comment.Jul 1 2019, 8:05 PM

@elukey: yes because of the temporary suppression of cumin's default output, to allow each cookbook to decide what to do with it, this specific one is printing its output at the end. This will soon-ish be more flexible on cumin side and should allow spicerack to expose it in a way that each cookbook can choose what to do with it. Sorry for the non-optimal experience for now.

@Dzahn the hardcoded MAC addresses will soon not be needed anymore, so IMHO it's not worth investing time in automating the code change while we'll be investing time to avoid that requirement in the first place ;)

elukey added a comment.Jul 2 2019, 8:23 AM

@Volans @crusnov I created https://wikitech.wikimedia.org/wiki/Ganeti#Create_the_VM_(using_the_cookbook_sre.ganeti.makevm) trying to summarize the current status, feel free to amend/modify it if wrong :)

Volans added a comment.Jul 2 2019, 9:44 AM

Thanks a lot @elukey. I've just added a small detail and formatted hosts and paths.
For the status of the task, the port of makevm script is done, the remaining additional parts are still pending.

@Dzahn the hardcoded MAC addresses will soon not be needed anymore <snip>

What does this mean exactly?

@akosiaris the "plan" was partially explained as part of the bare metal/host provisioning breakout session at the SRE Summit. You can find more details in the notes of the summit but basically the TL;DR is that as part of the effort to automate host provisioning we're aiming to have a system in which we don't need to hardcode MAC addresses anymore.
The details of the plan are evolving with the plan itself but the gist is that it will involve DHCP option 82 (or IPv6 autoconf alternatively) and iPXE (or equivalent) to dynamically map a physical host to data available in Netbox and from there drive the whole installation process with the required parameters.
Ping me offline if you want more details.

Thanks, I got the gist of it and enough pointers to keep me busy. Nice plan btw!

I used the makevm cookbook to create a pool counter VM and it worked great for me! One thing I'd suggest improving: initially I had missed that I need to create the DNS entries beforehand, which resulted in a fairly obscure error:

Exception raised while executing cookbook sre.ganeti.makevm:
Traceback (most recent call last):
  File "/usr/lib/python3/dist-packages/spicerack/cookbook.py", line 407, in _run
    ret = self.module.run(args, self.spicerack)
  File "/srv/deployment/spicerack/cookbooks/sre/ganeti/makevm.py", line 94, in run
    results = ganeti_host.run_sync(command)
  File "/usr/lib/python3/dist-packages/spicerack/remote.py", line 214, in run_sync
    batch_sleep=batch_sleep, is_safe=is_safe)
  File "/usr/lib/python3/dist-packages/spicerack/remote.py", line 383, in _execute
    raise RemoteExecutionError(ret, 'Cumin execution failed')
spicerack.remote.RemoteExecutionError: Cumin execution failed (exit_code=2)
END (FAIL) - Cookbook sre.ganeti.makevm (exit_code=99)

After digging in /var/log/spicerack/sre/ganeti/makevm.log I found the root cause. Maybe the cookbook should simply try to resolve the FQDN and bail out if that fails, advising the user to create the DNS entry?

elukey added a comment.Jul 5 2019, 2:42 PM

+1!

Ah, and one more thing: After typing "done" for confirmation, output stalls for about ten minutes while the Ganeti instance is created (but eventually all progress is printed), let's maybe add a short message like "Creating your instance, this will take some time", as initially I was worried it was failing.

And one more thought/idea: Our reimage script requires to be run in screen/tmux, maybe that's a good idea here too?
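
A minimal sketch of what such a check could look like, assuming it is enough to look at the environment: tmux sets the TMUX variable and GNU screen sets STY inside their sessions. The function name is hypothetical and this is not taken from the reimage script:

```python
import os


def in_terminal_multiplexer(environ=None):
    """Return True if the process appears to run inside tmux or screen.

    Hypothetical helper: relies only on the TMUX (tmux) and STY
    (GNU screen) environment variables being set and non-empty.
    """
    env = os.environ if environ is None else environ
    return bool(env.get('TMUX') or env.get('STY'))


if not in_terminal_multiplexer():
    print('Consider running this inside screen or tmux.')
```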

Change 520897 had a related patch set uploaded (by Elukey; owner: Elukey):
[operations/cookbooks@master] sre.ganeti.makevm: add dns check before creating the vm

https://gerrit.wikimedia.org/r/520897

Change 520897 merged by jenkins-bot:
[operations/cookbooks@master] sre.ganeti.makevm: add dns check before creating the vm

https://gerrit.wikimedia.org/r/520897

elukey added a comment.Jul 8 2019, 9:55 AM
elukey@cumin1001:~$ sudo cookbook sre.ganeti.makevm eqiad_A test_not_existing.eqiad.wmnet --vcpus 2 --memory 4 --disk 150 --link analytics
START - Cookbook sre.ganeti.makevm
Exception raised while executing cookbook sre.ganeti.makevm:
Traceback (most recent call last):
  File "/usr/lib/python3/dist-packages/spicerack/dns.py", line 137, in resolve
    response = self._resolver.query(qname, record_type)
  File "/usr/lib/python3/dist-packages/dns/resolver.py", line 1051, in query
    raise NXDOMAIN(qnames=qnames_to_try, responses=nxdomain_responses)
dns.resolver.NXDOMAIN: None of DNS query names exist: test_not_existing.eqiad.wmnet., test_not_existing.eqiad.wmnet.eqiad.wmnet.

The above exception was the direct cause of the following exception:

Traceback (most recent call last):
  File "/srv/deployment/spicerack/cookbooks/sre/ganeti/makevm.py", line 81, in run
    spicerack.dns().resolve_ipv4(args.fqdn)
  File "/usr/lib/python3/dist-packages/spicerack/dns.py", line 52, in resolve_ipv4
    return self._resolve_addresses(name, 'A')
  File "/usr/lib/python3/dist-packages/spicerack/dns.py", line 159, in _resolve_addresses
    return [rdata.address for rdata in self.resolve(name, record_type).rrset]  # type: ignore
  File "/usr/lib/python3/dist-packages/spicerack/dns.py", line 141, in resolve
    record_type=record_type, qname=qname)) from e
spicerack.dns.DnsNotFound: Record A not found for test_not_existing.eqiad.wmnet

The above exception was the direct cause of the following exception:

Traceback (most recent call last):
  File "/usr/lib/python3/dist-packages/spicerack/cookbook.py", line 407, in _run
    ret = self.module.run(args, self.spicerack)
  File "/srv/deployment/spicerack/cookbooks/sre/ganeti/makevm.py", line 85, in run
    .format(args.fqdn)) from e
RuntimeError: The DNS A record for test_not_existing.eqiad.wmnet seems not available. Have you added A/PTR records before starting?
END (FAIL) - Cookbook sre.ganeti.makevm (exit_code=99)
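
For readers without access to the patch, a minimal sketch of the pre-flight check pattern visible in the traceback above: resolve the name first and re-raise a friendlier error if that fails. The traceback shows the cookbook calling spicerack.dns().resolve_ipv4; this sketch swaps in the standard library's socket.gethostbyname as a stand-in, and the function name ensure_resolvable is hypothetical:

```python
import socket


def ensure_resolvable(fqdn, resolver=socket.gethostbyname):
    """Resolve fqdn or raise a RuntimeError that tells the user what to do.

    The resolver is injectable so the check can be exercised without
    network access; socket.gethostbyname is only a stand-in here.
    """
    try:
        return resolver(fqdn)
    except OSError as e:  # socket.gaierror is a subclass of OSError
        raise RuntimeError(
            'The DNS A record for {} seems not available. '
            'Have you added A/PTR records before starting?'.format(fqdn)
        ) from e


# Usage with a fake resolver, so nothing is actually looked up:
print(ensure_resolvable('an-tool1006.eqiad.wmnet', resolver=lambda name: '192.0.2.10'))
```

Chaining with "from e" keeps the original resolver error in the traceback, which matches the "direct cause" sections in the output above.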

Ah, and one more thing: After typing "done" for confirmation, output stalls for about ten minutes while the Ganeti instance is created (but eventually all progress is printed), let's maybe add a short message like "Creating your instance, this will take some time", as initially I was worried it was failing.

Riccardo wrote above that cumin's logging is temporarily disabled, since eventually every cookbook will be able to tune the logging it wants, so in the long term there will be no doubt about stalling due to Cumin's output. In the short/medium term, we could also add a note that cumin's output is suppressed and that the status of the ganeti instance can be found in /var/log/ganeti on the ganeti master host (so anybody who wants to verify can do so easily without having to figure out how).

Ack, let's just wait for that flexible logging.

Can we close this?

From my PoV yes, I've used this multiple times successfully to create Ganeti instances; all further enhancements can be done via separate patches/tasks.

elukey closed this task as Resolved.Aug 6 2019, 12:46 PM

Same for me, please re-open if necessary!