Create a replacement for kraz.wikimedia.org
Closed, Resolved (Public)

Description

The parent task explains how the Analytics and SRE teams think irc.wikimedia.org should evolve. It is currently served by kraz.wikimedia.org, running Jessie, and we'd like to create a VM with Buster to test traffic and eventually let it take over irc.wikimedia.org's traffic entirely.

The new VM will run a Python daemon that Faidon wrote (see parent task), which could eventually become a good candidate for Kubernetes.

High level details of the VM:

datacenter: codfw
IP: public
RAM: 8G
Cores: 4
Disk space: 40G (should be more than enough; the daemon is stateless and doesn't need disk space IIUC, but adding some room in case it is needed in the future)

@faidon Let me know if the above requirements are ok or if we need something more powerful.

Event Timeline

jijiki triaged this task as Medium priority. Feb 10 2020, 11:51 AM
jijiki added a project: serviceops.

Just to confirm: it should still have a public IP in wikimedia.org?

Yes it should! :)

I've never understood why we do this. Wouldn't it be better to create the VM in the private network and use frontends + LVS to route to the services?

For this particular service I don't think that we'd need the overhead of an LVS config, so let's proceed with the creation of the VM since time is running out. For the discussion of the more general topic (very interesting), I'll leave SRE to comment :)

Don't we want to make the service HA?

Change 571563 had a related patch set uploaded (by Dzahn; owner: Dzahn):
[operations/dns@master] add irc2001.wikimedia.org

https://gerrit.wikimedia.org/r/571563

I don't think it is a goal for now; we'd need a replacement for kraz to test what Faidon wrote and possibly do the switch soonish. After that, not sure: Kubernetes, HA, etc. This is my understanding :)

I thought one of the issues with the current IRC daemon was that it can't ever be taken offline. Without HA or at least some kind of auto-failover setup, we lose that!

But ok! If this is just for testing now I guess so!

The way this works now is that the entire MW fleet sends UDP packets to a specific IP (kraz) using the so-called "echo" protocol (= #channel<tab>message). We could theoretically switch this to a multicast address in order to gain the ability to have multiple listeners (all connecting to separate IRC servers, each on each listener's localhost perhaps?), but no one has invested the time to do this and set up those multiple frontends.
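
For illustration only, a minimal Python sketch of a sender for that UDP "echo" format; the relay host, port, channel and message below are made-up placeholders, not the production values:

import socket

# One UDP datagram per message; the payload is "#channel<TAB>message" as
# described above. Host and port are hypothetical placeholders.
RELAY_HOST = "irc-relay.example.org"
RELAY_PORT = 9390

def send_rc_message(channel, message):
    payload = ("%s\t%s" % (channel, message)).encode("utf-8")
    with socket.socket(socket.AF_INET, socket.SOCK_DGRAM) as sock:
        sock.sendto(payload, (RELAY_HOST, RELAY_PORT))

send_rc_message("#en.wikipedia", "[[Example]] edited by SomeUser")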

At this point we should switch message propagation to something else anyway, like, say, Kafka or SSE ;) After that's done, then yeah, this probably belongs somewhere behind LVS, across DCs etc. (either multiple VMs, or k8s).

In terms of this migration... we haven't really explicitly made this call together, but I would argue that we should probably replace the software while we keep UDP echo in place, and move to Kafka/SSE as a second step. I don't feel strongly though.

MW can just be configured to send to multiple different IPs if we don't want to do network-level magic.
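
(As a sketch under the same assumptions as the sender above, the "multiple different IPs" option just means sending the same datagram once per configured relay; hostnames and ports here are placeholders.)

import socket

# Fan the same "echo" datagram out to several relays; hosts/ports are placeholders.
RELAYS = [("irc-relay-a.example.org", 9390), ("irc-relay-b.example.org", 9390)]

def send_rc_message(channel, message):
    payload = ("%s\t%s" % (channel, message)).encode("utf-8")
    with socket.socket(socket.AF_INET, socket.SOCK_DGRAM) as sock:
        for host, port in RELAYS:
            sock.sendto(payload, (host, port))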

I would vote for moving on with UDP echo for the moment, since the Debian Jessie deadline is not that far. Let's keep going with the creation of the VM.

Change 571563 merged by Elukey:
[operations/dns@master] add irc2001.wikimedia.org

https://gerrit.wikimedia.org/r/571563

elukey@ganeti2001:~$  sudo gnt-group list
Group Nodes Instances AllocPolicy NDParams
row_A     4        34 preferred   ovs=False, ssh_port=22, ovs_link=, spindle_count=1, exclusive_storage=False, cpu_speed=1, ovs_name=switch1, oob_program=
row_B     4        33 preferred   ovs=False, ssh_port=22, ovs_link=, spindle_count=1, exclusive_storage=False, cpu_speed=1, ovs_name=switch1, oob_program=

Does this really need 8 GB RAM and 8 CPUs? The machine that this will replace (kraz) uses a single CPU (and hardly uses it) and has an average memory usage of 0.25G. I'm all for adding some headroom, but that seems a little excessive :-)

The new kraz will run a Python daemon, not what is currently running on it, so I went for 4 CPUs and 8G of RAM. I know that seems conservative, but I'd like to be sure :) I promise that I'll re-create the VM if the requirements turn out to be too much :)

elukey@cumin1001:~$ sudo cookbook sre.ganeti.makevm codfw_B --link public --memory 8 --disk 40 --vcpus 4 irc2001.wikimedia.org
START - Cookbook sre.ganeti.makevm
Creating new VM named irc2001.wikimedia.org in codfw with row=B vcpu=4 memory=8 gigabytes disk=40 gigabytes link=public
Is this correct?
Type "done" to proceed
> done
Wed Feb 12 08:55:33 2020  - INFO: No-installation mode selected, disabling startup
Wed Feb 12 08:55:38 2020  - INFO: Selected nodes for instance irc2001.wikimedia.org via iallocator hail: ganeti2003.codfw.wmnet, ganeti2004.codfw.wmnet
Wed Feb 12 08:55:40 2020 * creating instance disks...
Wed Feb 12 08:55:43 2020 adding instance irc2001.wikimedia.org to cluster config
Wed Feb 12 08:55:43 2020 adding disks to cluster config
Wed Feb 12 08:55:44 2020  - INFO: Waiting for instance irc2001.wikimedia.org to sync disks
Wed Feb 12 08:55:44 2020  - INFO: - device disk/0:  0.10% done, 42m 36s remaining (estimated)
Wed Feb 12 08:56:44 2020  - INFO: - device disk/0:  5.50% done, 17m 24s remaining (estimated)
Wed Feb 12 08:57:44 2020  - INFO: - device disk/0: 10.90% done, 16m 33s remaining (estimated)
Wed Feb 12 08:58:44 2020  - INFO: - device disk/0: 16.30% done, 14m 58s remaining (estimated)
Wed Feb 12 08:59:45 2020  - INFO: - device disk/0: 21.70% done, 13m 34s remaining (estimated)
Wed Feb 12 09:00:45 2020  - INFO: - device disk/0: 27.30% done, 12m 43s remaining (estimated)
Wed Feb 12 09:01:45 2020  - INFO: - device disk/0: 33.00% done, 11m 46s remaining (estimated)
Wed Feb 12 09:02:45 2020  - INFO: - device disk/0: 38.60% done, 10m 50s remaining (estimated)
Wed Feb 12 09:03:45 2020  - INFO: - device disk/0: 44.20% done, 9m 30s remaining (estimated)
Wed Feb 12 09:04:46 2020  - INFO: - device disk/0: 49.80% done, 8m 29s remaining (estimated)
Wed Feb 12 09:05:46 2020  - INFO: - device disk/0: 55.50% done, 7m 35s remaining (estimated)
Wed Feb 12 09:06:46 2020  - INFO: - device disk/0: 61.20% done, 6m 38s remaining (estimated)
Wed Feb 12 09:07:47 2020  - INFO: - device disk/0: 66.80% done, 5m 44s remaining (estimated)
Wed Feb 12 09:08:47 2020  - INFO: - device disk/0: 72.40% done, 4m 44s remaining (estimated)
Wed Feb 12 09:09:47 2020  - INFO: - device disk/0: 78.00% done, 3m 49s remaining (estimated)
Wed Feb 12 09:10:47 2020  - INFO: - device disk/0: 83.60% done, 2m 50s remaining (estimated)
Wed Feb 12 09:11:48 2020  - INFO: - device disk/0: 89.30% done, 1m 51s remaining (estimated)
Wed Feb 12 09:12:48 2020  - INFO: - device disk/0: 94.90% done, 54s remaining (estimated)
Wed Feb 12 09:13:42 2020  - INFO: - device disk/0: 100.00% done, 0s remaining (estimated)
Wed Feb 12 09:13:42 2020  - INFO: - device disk/0: 100.00% done, 0s remaining (estimated)
Wed Feb 12 09:13:42 2020  - INFO: - device disk/0: 100.00% done, 0s remaining (estimated)
Wed Feb 12 09:13:43 2020  - INFO: - device disk/0: 100.00% done, 0s remaining (estimated)
Wed Feb 12 09:13:43 2020  - INFO: Instance irc2001.wikimedia.org's disks are in sync
Wed Feb 12 09:13:43 2020  - INFO: Waiting for instance irc2001.wikimedia.org to sync disks
Wed Feb 12 09:13:43 2020  - INFO: Instance irc2001.wikimedia.org's disks are in sync
instance irc2001.wikimedia.org created with MAC aa:00:00:53:f8:01

Change 571680 had a related patch set uploaded (by Elukey; owner: Elukey):
[operations/puppet@production] Introduce irc2001.wikimedia.org

https://gerrit.wikimedia.org/r/571680

Change 571680 merged by Elukey:
[operations/puppet@production] Introduce irc2001.wikimedia.org

https://gerrit.wikimedia.org/r/571680

Change 571683 had a related patch set uploaded (by Elukey; owner: Elukey):
[operations/puppet@production] install_server: fix dhcp config for irc2001

https://gerrit.wikimedia.org/r/571683

Change 571683 merged by Elukey:
[operations/puppet@production] install_server: fix dhcp config for irc2001

https://gerrit.wikimedia.org/r/571683

Change 571685 had a related patch set uploaded (by Elukey; owner: Elukey):
[operations/puppet@production] install_server: add partition config for irc2001

https://gerrit.wikimedia.org/r/571685

Change 571685 merged by Elukey:
[operations/puppet@production] install_server: add partition config for irc2001

https://gerrit.wikimedia.org/r/571685

Ok current status:

  • irc2001.wikimedia.org is running
  • puppet is set to role::system::spare, waiting for a new role/cluster combination
  • puppet fails with Error: /Stage[main]/Profile::Standard/Interface::Add_ip6_mapped[main]/Augeas[ens5_v6_token]: Could not evaluate: Saving failed, see debug; need to investigate

A weird thing happened: when I tried to run puppet for the first time via install-console, I got:

Error: Could not request certificate: Failed to open TCP connection to puppet:8140 (getaddrinfo: Name or service not known)

I had to manually add "search wikimedia.org" to /etc/resolv.conf to make it work.
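
(For context: the failure above is an unqualified-name lookup; the agent tries to resolve the bare name "puppet", which only works once a search domain is configured. A sketch of the relevant /etc/resolv.conf content after the manual fix; the nameserver address is a placeholder.)

nameserver 10.3.0.1      # placeholder resolver address
search wikimedia.org     # added by hand so the bare name "puppet" can be expanded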

Mentioned in SAL (#wikimedia-operations) [2020-02-12T18:31:39Z] <mutante> irc2001 - manually run the "${v6_token_cmd} && ${v6_flush_dyn_cmd}" commands from interface::add_ip6_mapped to debug 'Interface::Add_ip6_mapped[main]/Augeas[ens5_v6_token]: Could not evaluate: Saving failed' but it does not reproduce the puppet error ... (T244719)

Debug: Augeas[ens5_v6_token](provider=augeas): sending command 'set' with params ["/files/etc/network/interfaces/iface[. = 'ens5']/pre-up", "/sbin/ip token set ::208:80:153:62 dev ens5"]
Debug: Augeas[ens5_v6_token](provider=augeas): Put failed on one or more files, output from /augeas//error:
Debug: Augeas[ens5_v6_token](provider=augeas): /augeas/files/etc/network/interfaces/error = put_failed
Debug: Augeas[ens5_v6_token](provider=augeas): /augeas/files/etc/network/interfaces/error/path = /files/etc/network/interfaces/
Debug: Augeas[ens5_v6_token](provider=augeas): /augeas/files/etc/network/interfaces/error/lens = /usr/share/augeas/lenses/dist/interfaces.aug:125.13-.63:
Debug: Augeas[ens5_v6_token](provider=augeas): /augeas/files/etc/network/interfaces/error/message = Failed to match tree under /
Debug: Augeas[ens5_v6_token](provider=augeas): Closed the augeas connection
Error: /Stage[main]/Profile::Standard/Interface::Add_ip6_mapped[main]/Augeas[ens5_v6_token]: Could not evaluate: Saving failed, see debug

The primary network interface is missing from /etc/network/interfaces; there is only loopback in there. Why that is so is another question.

During the puppet run, Augeas tries to add the "pre-up" line to the 'iface ens5 ...' block but fails to do so. That's how I interpret "Failed to match tree".
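
(In other words: the Augeas resource wants to append a pre-up line to an existing 'iface ens5 ...' stanza, but the installer-generated file only contained the loopback block, so the lens had nothing to match. A rough sketch, with addresses omitted except for the token taken from the debug output above:)

# What was actually in /etc/network/interfaces (loopback only):
auto lo
iface lo inet loopback

# What interface::add_ip6_mapped expects to find and extend:
auto ens5
iface ens5 inet static
    # ... address/gateway lines for the primary interface ...
    pre-up /sbin/ip token set ::208:80:153:62 dev ens5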

Given that Luca also had an error during initial setup related to name resolution, this sounds like some error related to the DNS records for the new host?

My bad: today I reviewed irc2001 with Moritz and it turned out that a GNOME desktop had been installed. This is because a temporary network glitch happened while d-i was running and installing packages, so I re-ran "select and install packages" assuming that it would install only the minimal base system, but I was wrong.

I just re-installed everything and all issues are gone.

Remaining steps are to create the cluster and the new role!

Change 572191 had a related patch set uploaded (by Elukey; owner: Elukey):
[operations/puppet@production] Add new role to irc2001.wikimedia.org

https://gerrit.wikimedia.org/r/572191

Change 572191 merged by Elukey:
[operations/puppet@production] Add new role to irc2001.wikimedia.org

https://gerrit.wikimedia.org/r/572191

elukey claimed this task.

@MoritzMuehlenhoff wrote:
Given that Luca also had an error during initial setup related to name resolution, this sounds like some error related to the DNS records for the new host?

Yeah, I thought so too. But I had already looked at DNS and could not see a mistake.

@elukey wrote:
a temporary network glitch happened .. all issues are gone.

This explains it, nice!

I'm making a subtask to decom kraz.