
cloud network: improve automated testing & monitoring
Closed, Resolved; Public

Description

Per parent task, our cloud network resilience could be improved by adding automated tests & more monitoring.

Idea here: P15659

Details

Project               Branch      Lines +/-
operations/puppet     production  +1 -1
operations/puppet     production  +1 -1
operations/puppet     production  +1 -1
operations/puppet     production  +45 -0
operations/puppet     production  +1 -1
operations/puppet     production  +38 -15
operations/puppet     production  +19 -0
operations/cookbooks  wmcs        +145 -0
operations/puppet     production  +13 -2
operations/dns        master      +2 -1
operations/puppet     production  +23 -11
operations/puppet     production  +10 -7
operations/puppet     production  +1 -1
operations/puppet     production  +0 -0
operations/puppet     production  +1 -1
operations/puppet     production  +1 -1
operations/puppet     production  +22 -21
operations/puppet     production  +522 -0
labs/private          master      +9 -0

Event Timeline

aborrero renamed this task from cloud network: improve monitoring to cloud network: improve automated testing & monitoring. Nov 3 2021, 5:17 PM
aborrero created this task.
aborrero updated the task description.
aborrero triaged this task as Medium priority. Nov 3 2021, 6:41 PM

Change 736819 had a related patch set uploaded (by Arturo Borrero Gonzalez; author: Arturo Borrero Gonzalez):

[operations/puppet@production] cloud: introduce network tests

https://gerrit.wikimedia.org/r/736819

Change 737346 had a related patch set uploaded (by Arturo Borrero Gonzalez; author: Arturo Borrero Gonzalez):

[labs/private@master] secret: add openstack networktests sshkeys placeholders

https://gerrit.wikimedia.org/r/737346

Change 737346 merged by Arturo Borrero Gonzalez:

[labs/private@master] secret: add openstack networktests sshkeys placeholders

https://gerrit.wikimedia.org/r/737346

Change 736819 merged by Arturo Borrero Gonzalez:

[operations/puppet@production] cloud: introduce network tests

https://gerrit.wikimedia.org/r/736819

Change 737392 had a related patch set uploaded (by Arturo Borrero Gonzalez; author: Arturo Borrero Gonzalez):

[operations/puppet@production] openstack: networktests: correct some problems

https://gerrit.wikimedia.org/r/737392

Change 737392 merged by Arturo Borrero Gonzalez:

[operations/puppet@production] openstack: networktests: correct some problems

https://gerrit.wikimedia.org/r/737392

Change 737418 had a related patch set uploaded (by Arturo Borrero Gonzalez; author: Arturo Borrero Gonzalez):

[operations/puppet@production] cloud: networktests: add missing ROUTING_SOURCE_IP envvar

https://gerrit.wikimedia.org/r/737418

Change 737418 merged by Arturo Borrero Gonzalez:

[operations/puppet@production] cloud: networktests: add missing ROUTING_SOURCE_IP envvar

https://gerrit.wikimedia.org/r/737418

Change 737613 had a related patch set uploaded (by Arturo Borrero Gonzalez; author: Arturo Borrero Gonzalez):

[operations/puppet@production] cloud: networktests: runner: use expanded_cmd

https://gerrit.wikimedia.org/r/737613

Change 737614 had a related patch set uploaded (by Arturo Borrero Gonzalez; author: Arturo Borrero Gonzalez):

[operations/puppet@production] cloud: networktests: ssh: use -q

https://gerrit.wikimedia.org/r/737614

Change 737613 merged by Arturo Borrero Gonzalez:

[operations/puppet@production] cloud: networktests: runner: use expanded_cmd

https://gerrit.wikimedia.org/r/737613

Change 737614 merged by Arturo Borrero Gonzalez:

[operations/puppet@production] cloud: networktests: ssh: use -q

https://gerrit.wikimedia.org/r/737614

Change 737615 had a related patch set uploaded (by Arturo Borrero Gonzalez; author: Arturo Borrero Gonzalez):

[operations/puppet@production] cloud: networktests: use -q

https://gerrit.wikimedia.org/r/737615

Change 737615 merged by Arturo Borrero Gonzalez:

[operations/puppet@production] cloud: networktests: use -q

https://gerrit.wikimedia.org/r/737615

Change 737620 had a related patch set uploaded (by Arturo Borrero Gonzalez; author: Arturo Borrero Gonzalez):

[operations/puppet@production] cloud: networktests: fix some testcases

https://gerrit.wikimedia.org/r/737620

Change 737620 merged by Arturo Borrero Gonzalez:

[operations/puppet@production] cloud: networktests: fix some testcases

https://gerrit.wikimedia.org/r/737620

Change 737642 had a related patch set uploaded (by Arturo Borrero Gonzalez; author: Arturo Borrero Gonzalez):

[operations/puppet@production] cloud: networktests: rework some of the raw icmp checks

https://gerrit.wikimedia.org/r/737642

Change 737642 merged by Arturo Borrero Gonzalez:

[operations/puppet@production] cloud: networktests: rework some of the raw icmp checks

https://gerrit.wikimedia.org/r/737642

Change 737648 had a related patch set uploaded (by Arturo Borrero Gonzalez; author: Arturo Borrero Gonzalez):

[operations/dns@master] wikimediacloud.org: add A records for cloudgw2001-dev/2002-dev

https://gerrit.wikimedia.org/r/737648

Change 737648 merged by Arturo Borrero Gonzalez:

[operations/dns@master] wikimediacloud.org: add A records for cloudgw2001-dev/2002-dev

https://gerrit.wikimedia.org/r/737648

Change 737667 had a related patch set uploaded (by Arturo Borrero Gonzalez; author: Arturo Borrero Gonzalez):

[operations/cookbooks@wmcs] wmcs: add openstack network tests cookbook

https://gerrit.wikimedia.org/r/737667

Change 737741 had a related patch set uploaded (by Arturo Borrero Gonzalez; author: Arturo Borrero Gonzalez):

[operations/puppet@production] openstack: monitor: cmd-checklist-runner: exit with a different return code

https://gerrit.wikimedia.org/r/737741

Change 737741 merged by Arturo Borrero Gonzalez:

[operations/puppet@production] openstack: monitor: cmd-checklist-runner: exit with a different return code

https://gerrit.wikimedia.org/r/737741

Change 737667 merged by jenkins-bot:

[operations/cookbooks@wmcs] wmcs: add openstack network tests cookbook

https://gerrit.wikimedia.org/r/737667

Change 738068 had a related patch set uploaded (by Arturo Borrero Gonzalez; author: Arturo Borrero Gonzalez):

[operations/puppet@production] openstack: networktests: add systemd timer job to run the test suite

https://gerrit.wikimedia.org/r/738068

Change 738070 had a related patch set uploaded (by Arturo Borrero Gonzalez; author: Arturo Borrero Gonzalez):

[operations/puppet@production] openstack: networktests: run as dedicated user

https://gerrit.wikimedia.org/r/738070

Change 738070 merged by Arturo Borrero Gonzalez:

[operations/puppet@production] openstack: networktests: run as dedicated user

https://gerrit.wikimedia.org/r/738070

Change 738068 merged by Arturo Borrero Gonzalez:

[operations/puppet@production] openstack: networktests: add systemd timer job to run the test suite

https://gerrit.wikimedia.org/r/738068

Change 738180 had a related patch set uploaded (by Arturo Borrero Gonzalez; author: Arturo Borrero Gonzalez):

[operations/puppet@production] openstack: networktests: fix timer job interval specification

https://gerrit.wikimedia.org/r/738180

Change 738180 merged by Arturo Borrero Gonzalez:

[operations/puppet@production] openstack: networktests: fix timer job interval specification

https://gerrit.wikimedia.org/r/738180

Change 738191 had a related patch set uploaded (by Arturo Borrero Gonzalez; author: Arturo Borrero Gonzalez):

[operations/puppet@production] openstack: networktests: introduce in eqiad1

https://gerrit.wikimedia.org/r/738191

Change 738191 merged by Arturo Borrero Gonzalez:

[operations/puppet@production] openstack: networktests: introduce in eqiad1

https://gerrit.wikimedia.org/r/738191

Mentioned in SAL (#wikimedia-cloud) [2021-11-11T10:47:09Z] <arturo> add user srv-networktests as project user (T294955)

Mentioned in SAL (#wikimedia-cloud) [2021-11-11T10:50:28Z] <arturo> add user srv-networktests as project user (T294955)

Mentioned in SAL (#wikimedia-cloud) [2021-11-11T10:50:53Z] <arturo> add user srv-networktests as project user (T294955)

Change 738211 had a related patch set uploaded (by Arturo Borrero Gonzalez; author: Arturo Borrero Gonzalez):

[operations/puppet@production] openstack: networktests: update eqiad1 bastion

https://gerrit.wikimedia.org/r/738211

Change 738211 merged by Arturo Borrero Gonzalez:

[operations/puppet@production] openstack: networktests: update eqiad1 bastion

https://gerrit.wikimedia.org/r/738211

Change 738212 had a related patch set uploaded (by Arturo Borrero Gonzalez; author: Arturo Borrero Gonzalez):

[operations/puppet@production] openstack: networktests: replace curl silent argument

https://gerrit.wikimedia.org/r/738212

Change 738212 merged by Arturo Borrero Gonzalez:

[operations/puppet@production] openstack: networktests: replace curl silent argument

https://gerrit.wikimedia.org/r/738212

Change 738214 had a related patch set uploaded (by Arturo Borrero Gonzalez; author: Arturo Borrero Gonzalez):

[operations/puppet@production] openstack: networktests: fix toolforge.org IP address

https://gerrit.wikimedia.org/r/738214

Change 738214 merged by Arturo Borrero Gonzalez:

[operations/puppet@production] openstack: networktests: fix toolforge.org IP address

https://gerrit.wikimedia.org/r/738214

Got to a nice stopping point.

  • tests have been deployed to both codfw1dev and eqiad1
  • a spicerack cookbook has been created to help with automated usage
  • a periodic job has been set up to help us monitor the health of the network
  • however, the periodic job depends on icinga monitoring of systemd services, which at the time of writing is disabled for eqiad1
  • in any case, I'm not sure yet whether we want to be paged for errors reported by the test suite (not sure yet how stable it will be)
  • either way, I'm leaving the systemd timer job enabled so that we can at least see the logs and know if the network wasn't stable at some point
  • some docs have been created: https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Network/Tests
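
For readers who have not looked at the patches, here is a rough sketch of the general idea behind a command-checklist runner: run a list of shell commands, compare each return code with the expected one, and report a non-zero status if anything failed. This is not the code merged in operations/puppet. The names Check, run_check, run_checklist, exit_code and EXIT_FAILED are hypothetical, the failure return code is an assumption (the task only says the runner exits "with a different return code"), and 192.0.2.1 is a documentation-only address. Only the expanded_cmd and ROUTING_SOURCE_IP names are taken from the patch subjects above.

```python
import string
import subprocess
from dataclasses import dataclass

EXIT_OK = 0
EXIT_FAILED = 1  # assumption: the real runner's return codes are not stated in this task


@dataclass
class Check:
    """One test case (hypothetical structure, not the real checklist format)."""
    name: str             # human-readable label for the test case
    cmd: str              # shell command; may reference $VARIABLES
    expected_rc: int = 0  # return code that counts as a pass


def run_check(check: Check, env: dict[str, str], timeout: float = 10.0) -> bool:
    """Expand variables in the command, run it, and compare the return code."""
    # safe_substitute leaves unknown $VARIABLES untouched instead of raising
    expanded_cmd = string.Template(check.cmd).safe_substitute(env)
    try:
        result = subprocess.run(
            expanded_cmd, shell=True, capture_output=True, timeout=timeout
        )
    except subprocess.TimeoutExpired:
        return False  # a hung command counts as a failed check
    return result.returncode == check.expected_rc


def run_checklist(checks: list[Check], env: dict[str, str]) -> int:
    """Run every check, print one line per result, and return the failure count."""
    failures = 0
    for check in checks:
        passed = run_check(check, env)
        print(f"[{'OK' if passed else 'FAIL'}] {check.name}")
        if not passed:
            failures += 1
    return failures


def exit_code(failures: int) -> int:
    """Map the failure count to a process exit status for systemd or a cookbook."""
    return EXIT_OK if failures == 0 else EXIT_FAILED
```

A wrapper script could call sys.exit(exit_code(run_checklist(checks, env))), so that a systemd timer unit (and whatever monitors it) sees a failed unit whenever at least one check fails.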

TODO:

  • decide if we want to be paged by this
  • extend with more checks!