Yesterday

Volans added a comment to T245871: Cookbook sre.hosts.downtime displayed on tools.wmflabs.org.

@Etonkovidova what would be the issue? As far as I know that page is just listing the last few items of the SAL.

Fri, Feb 21, 9:27 PM · SRE-tools, Operations

Wed, Feb 19

Volans added a comment to T245512: Move service::uwsgi logs to logging pipeline.

I suggest that we implement the logic in service::uwsgi, allowing an easy opt-in for all the users of that define, which I bet is used on WMCS too.
As for which option to implement: if we have time, it's probably nicer to implement the buster solution directly and apply it while migrating to buster.
But if we want/need to push for a quicker migration, then let's do the interim local UDP one, which is compatible with stretch too.

Wed, Feb 19, 10:03 AM · Patch-For-Review, SRE-tools, observability, Wikimedia-Logstash
Volans triaged T229710: read-only user netbox permissions regression as Medium priority.
Wed, Feb 19, 9:53 AM · netbox
Volans added a comment to T229710: read-only user netbox permissions regression.

@bd808 sorry for the trouble, Netbox was updated yesterday and I guess that's a by-product of the update.
I went ahead and granted permissions in the admin console to the wmf group to view almost everything.
Those are the ones I've left out:

Wed, Feb 19, 9:53 AM · netbox

Tue, Feb 18

Volans added a comment to T242715: Webproxies are a SPOF.

+1 for improving their HA, and I agree that the LVS approach seems the saner one.
If we don't plan to do this anytime soon though, maybe we could take an intermediate step with geodns.
We could have webproxy.discovery.wmnet resolve to the local proxy by default, but in case of maintenance or similar we could depool one and have the clients use the proxy in another DC.

Tue, Feb 18, 10:41 PM · Operations
Volans added a comment to T245512: Move service::uwsgi logs to logging pipeline.

@fgiunchedi what's the current best practice here? Debmonitor is just using service::uwsgi, which automatically logs to /srv/log/debmonitor/main.log and AFAIK doesn't log normal connections to the local syslog, only restarts of the daemon.

Tue, Feb 18, 2:15 PM · Patch-For-Review, SRE-tools, observability, Wikimedia-Logstash

Sun, Feb 16

Volans triaged T245361: prometheus1003/prometheus1004 /srv/prometheus/ops disk space warning as High priority.

It seems directly related to the bump in retention https://gerrit.wikimedia.org/r/c/operations/puppet/+/564680 as can be clearly seen in the graph below.

Sun, Feb 16, 11:08 AM · observability, Operations

Fri, Feb 14

Volans added a comment to T245288: improve host select for puppet compiler.

@jbond I'm not sure, depends on the current status of the puppet compiler. To my understanding it doesn't have a fully populated PuppetDB but we import only the facts, so I'm not sure to which queries we would be limited, and the difference might be weird as only some queries will work and others not.

Fri, Feb 14, 6:38 PM · puppet-compiler, User-jbond

Wed, Feb 12

Volans reopened T244986: cloudvirt1009: Device not healthy -SMART- as "Open".

Re-opening as it's currently alerting.

Wed, Feb 12, 2:29 PM · ops-eqiad, cloud-services-team (Hardware), Operations
Volans closed T244972: Degraded RAID on db1095 as Invalid.

This was a test, sorry for the noise.

Wed, Feb 12, 9:47 AM · ops-eqiad, Operations
Volans updated the task description for T224549: Track remaining jessie systems in production.
Wed, Feb 12, 9:39 AM · Operations
Volans added a comment to T242705: ORES uwsgi consumes a large amount of memory and CPU when shutting down (as part of a restart).

Today this caused quite a lot of spam on the #wikimedia-operations channel because on multiple hosts the nagios NRPE daemon failed too:

nrpe[21365]: fork() failed with error 12, bailing out...
systemd[1]: nagios-nrpe-server.service: Main process exited, code=exited, status=2/INVALIDARGUMENT
systemd[1]: nagios-nrpe-server.service: Failed to fork: Cannot allocate memory
Wed, Feb 12, 7:01 AM · Scoring-platform-team (Current), Operations, ORES
Volans added a comment to T242705: ORES uwsgi consumes a large amount of memory and CPU when shutting down (as part of a restart).

FWIW all hosts are presenting OOMs, I've renamed the task accordingly:

$ sudo cumin 'ores*' 'zgrep -c "Out of memory: Kill process" /var/log/syslog*'
26 hosts will be targeted:
ores[2001-2009].codfw.wmnet,ores[1001-1009].eqiad.wmnet,orespoolcounter[2003-2004].codfw.wmnet,orespoolcounter[1003-1004].eqiad.wmnet,oresrdb[2001-2002].codfw.wmnet,oresrdb[1001-1002].eqiad.wmnet
Confirm to continue [y/n]? y
===== NODE GROUP =====
(1) ores1006.eqiad.wmnet
----- OUTPUT of 'zgrep -c "Out of.../var/log/syslog*' -----
/var/log/syslog:2
/var/log/syslog.1:0
/var/log/syslog.2.gz:0
/var/log/syslog.3.gz:0
/var/log/syslog.4.gz:0
/var/log/syslog.5.gz:0
/var/log/syslog.6.gz:0
/var/log/syslog.7.gz:94
===== NODE GROUP =====
(1) ores1005.eqiad.wmnet
----- OUTPUT of 'zgrep -c "Out of.../var/log/syslog*' -----
/var/log/syslog:5
/var/log/syslog.1:0
/var/log/syslog.2.gz:0
/var/log/syslog.3.gz:0
/var/log/syslog.4.gz:145
/var/log/syslog.5.gz:0
/var/log/syslog.6.gz:0
/var/log/syslog.7.gz:0
===== NODE GROUP =====
(1) ores1008.eqiad.wmnet
----- OUTPUT of 'zgrep -c "Out of.../var/log/syslog*' -----
/var/log/syslog:5
/var/log/syslog.1:0
/var/log/syslog.2.gz:0
/var/log/syslog.3.gz:0
/var/log/syslog.4.gz:0
/var/log/syslog.5.gz:4
/var/log/syslog.6.gz:5
/var/log/syslog.7.gz:40
===== NODE GROUP =====
(1) ores2008.codfw.wmnet
----- OUTPUT of 'zgrep -c "Out of.../var/log/syslog*' -----
/var/log/syslog:4
/var/log/syslog.1:0
/var/log/syslog.2.gz:0
/var/log/syslog.3.gz:0
/var/log/syslog.4.gz:0
/var/log/syslog.5.gz:0
/var/log/syslog.6.gz:1
/var/log/syslog.7.gz:29
===== NODE GROUP =====
(1) ores1007.eqiad.wmnet
----- OUTPUT of 'zgrep -c "Out of.../var/log/syslog*' -----
/var/log/syslog:4
/var/log/syslog.1:5
/var/log/syslog.2.gz:0
/var/log/syslog.3.gz:0
/var/log/syslog.4.gz:0
/var/log/syslog.5.gz:2
/var/log/syslog.6.gz:45
/var/log/syslog.7.gz:39
===== NODE GROUP =====
(1) ores2005.codfw.wmnet
----- OUTPUT of 'zgrep -c "Out of.../var/log/syslog*' -----
/var/log/syslog:0
/var/log/syslog.1:0
/var/log/syslog.2.gz:0
/var/log/syslog.3.gz:0
/var/log/syslog.4.gz:104
/var/log/syslog.5.gz:0
/var/log/syslog.6.gz:0
/var/log/syslog.7.gz:0
===== NODE GROUP =====
(1) ores2002.codfw.wmnet
----- OUTPUT of 'zgrep -c "Out of.../var/log/syslog*' -----
/var/log/syslog:0
/var/log/syslog.1:0
/var/log/syslog.2.gz:0
/var/log/syslog.3.gz:0
/var/log/syslog.4.gz:0
/var/log/syslog.5.gz:15
/var/log/syslog.6.gz:1
/var/log/syslog.7.gz:70
===== NODE GROUP =====
(1) ores2007.codfw.wmnet
----- OUTPUT of 'zgrep -c "Out of.../var/log/syslog*' -----
/var/log/syslog:0
/var/log/syslog.1:0
/var/log/syslog.2.gz:0
/var/log/syslog.3.gz:0
/var/log/syslog.4.gz:0
/var/log/syslog.5.gz:5
/var/log/syslog.6.gz:2
/var/log/syslog.7.gz:27
===== NODE GROUP =====
(1) ores1004.eqiad.wmnet
----- OUTPUT of 'zgrep -c "Out of.../var/log/syslog*' -----
/var/log/syslog:0
/var/log/syslog.1:0
/var/log/syslog.2.gz:0
/var/log/syslog.3.gz:0
/var/log/syslog.4.gz:0
/var/log/syslog.5.gz:5
/var/log/syslog.6.gz:31
/var/log/syslog.7.gz:41
===== NODE GROUP =====
(1) ores2003.codfw.wmnet
----- OUTPUT of 'zgrep -c "Out of.../var/log/syslog*' -----
/var/log/syslog:0
/var/log/syslog.1:0
/var/log/syslog.2.gz:0
/var/log/syslog.3.gz:0
/var/log/syslog.4.gz:0
/var/log/syslog.5.gz:0
/var/log/syslog.6.gz:9
/var/log/syslog.7.gz:38
===== NODE GROUP =====
(1) ores1009.eqiad.wmnet
----- OUTPUT of 'zgrep -c "Out of.../var/log/syslog*' -----
/var/log/syslog:0
/var/log/syslog.1:0
/var/log/syslog.2.gz:0
/var/log/syslog.3.gz:0
/var/log/syslog.4.gz:0
/var/log/syslog.5.gz:0
/var/log/syslog.6.gz:0
/var/log/syslog.7.gz:36
===== NODE GROUP =====
(8) orespoolcounter[2003-2004].codfw.wmnet,orespoolcounter[1003-1004].eqiad.wmnet,oresrdb[2001-2002].codfw.wmnet,oresrdb[1001-1002].eqiad.wmnet
----- OUTPUT of 'zgrep -c "Out of.../var/log/syslog*' -----
/var/log/syslog:0
/var/log/syslog.1:0
/var/log/syslog.2.gz:0
/var/log/syslog.3.gz:0
/var/log/syslog.4.gz:0
/var/log/syslog.5.gz:0
/var/log/syslog.6.gz:0
/var/log/syslog.7.gz:0
===== NODE GROUP =====
(1) ores1001.eqiad.wmnet
----- OUTPUT of 'zgrep -c "Out of.../var/log/syslog*' -----
/var/log/syslog:0
/var/log/syslog.1:0
/var/log/syslog.2.gz:0
/var/log/syslog.3.gz:0
/var/log/syslog.4.gz:0
/var/log/syslog.5.gz:0
/var/log/syslog.6.gz:0
/var/log/syslog.7.gz:35
===== NODE GROUP =====
(1) ores1003.eqiad.wmnet
----- OUTPUT of 'zgrep -c "Out of.../var/log/syslog*' -----
/var/log/syslog:0
/var/log/syslog.1:0
/var/log/syslog.2.gz:0
/var/log/syslog.3.gz:0
/var/log/syslog.4.gz:0
/var/log/syslog.5.gz:0
/var/log/syslog.6.gz:2
/var/log/syslog.7.gz:72
===== NODE GROUP =====
(1) ores2006.codfw.wmnet
----- OUTPUT of 'zgrep -c "Out of.../var/log/syslog*' -----
/var/log/syslog:0
/var/log/syslog.1:0
/var/log/syslog.2.gz:0
/var/log/syslog.3.gz:0
/var/log/syslog.4.gz:0
/var/log/syslog.5.gz:0
/var/log/syslog.6.gz:1
/var/log/syslog.7.gz:30
===== NODE GROUP =====
(1) ores2009.codfw.wmnet
----- OUTPUT of 'zgrep -c "Out of.../var/log/syslog*' -----
/var/log/syslog:0
/var/log/syslog.1:0
/var/log/syslog.2.gz:0
/var/log/syslog.3.gz:0
/var/log/syslog.4.gz:0
/var/log/syslog.5.gz:0
/var/log/syslog.6.gz:1
/var/log/syslog.7.gz:36
===== NODE GROUP =====
(1) ores2004.codfw.wmnet
----- OUTPUT of 'zgrep -c "Out of.../var/log/syslog*' -----
/var/log/syslog:0
/var/log/syslog.1:0
/var/log/syslog.2.gz:0
/var/log/syslog.3.gz:0
/var/log/syslog.4.gz:0
/var/log/syslog.5.gz:0
/var/log/syslog.6.gz:3
/var/log/syslog.7.gz:47
===== NODE GROUP =====
(1) ores2001.codfw.wmnet
----- OUTPUT of 'zgrep -c "Out of.../var/log/syslog*' -----
/var/log/syslog:0
/var/log/syslog.1:0
/var/log/syslog.2.gz:0
/var/log/syslog.3.gz:0
/var/log/syslog.4.gz:0
/var/log/syslog.5.gz:0
/var/log/syslog.6.gz:30
/var/log/syslog.7.gz:29
===== NODE GROUP =====
(1) ores1002.eqiad.wmnet
----- OUTPUT of 'zgrep -c "Out of.../var/log/syslog*' -----
/var/log/syslog:0
/var/log/syslog.1:7
/var/log/syslog.2.gz:0
/var/log/syslog.3.gz:0
/var/log/syslog.4.gz:0
/var/log/syslog.5.gz:0
/var/log/syslog.6.gz:0
/var/log/syslog.7.gz:153
/var/log/syslog.8.gz:0
================
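To compare hosts at a glance, the per-file counts above can be summed per node group. The following is a minimal sketch, assuming the output has been saved as text in the layout shown above; the function name is made up for illustration and is not part of cumin.

```python
def counts_per_group(output):
    """Sum the `zgrep -c` counts of each node group in saved cumin output.

    Assumes the layout shown above: a "(N) hosts" line opens a group and
    each count line looks like "/path/to/file:COUNT".
    """
    totals = {}
    group = None
    for line in output.splitlines():
        if line.startswith("(") and ") " in line:
            # "(1) ores1006.eqiad.wmnet" -> "ores1006.eqiad.wmnet"
            group = line.split(") ", 1)[1]
            totals[group] = 0
        elif group is not None and line.startswith("/"):
            count = line.rpartition(":")[2]
            if count.isdigit():
                totals[group] += int(count)
    return totals
```

Groups whose total is zero (here the orespoolcounter and oresrdb hosts) stand out immediately in the resulting dictionary.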
Wed, Feb 12, 6:56 AM · Scoring-platform-team (Current), Operations, ORES
Volans renamed T242705: ORES uwsgi consumes a large amount of memory and CPU when shutting down (as part of a restart) from Ores celery OOM event in codfw to Ores celery OOM events in all hosts.
Wed, Feb 12, 6:55 AM · Scoring-platform-team (Current), Operations, ORES

Tue, Feb 11

Volans added a comment to T244849: Add SSO support to netbox.

Our current LDAP setup for Netbox is [1], see AUTH_LDAP_USER_FLAGS_BY_GROUP for the current mapping.

Tue, Feb 11, 2:34 PM · netbox, Operations
Volans updated the task description for T244849: Add SSO support to netbox.
Tue, Feb 11, 1:55 PM · netbox, Operations
Volans closed T244362: Homer: commit> no causes stacktrace as Resolved.

Fix merged into master, will be part of the next release.

Tue, Feb 11, 11:11 AM · Operations, SRE-tools
Volans added a comment to T244761: Script to point SRE local machine traffic to another LB.

If we go the /etc/hosts route it seems to me that a quick script should do it. It's sufficient to pass to it a parameter with the DC name and then have the script resolve text-lb.$DC.wikimedia.org and use it as the IP for a predefined static list of services that are behind CDN that we're interested in during an outage. The same parameter could have a special value like reset to clear/comment the same records.
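A rough sketch of what such a script could look like, under stated assumptions: the marker comment, the function names and the service list are all hypothetical placeholders, and the real list of CDN-backed services would need to be agreed on separately. The sketch only transforms the text of a hosts file and leaves reading and writing /etc/hosts to the caller.

```python
import socket

# Hypothetical marker used to recognise (and later remove) the managed lines.
MARKER = "# managed-by-lb-override"

# Illustrative list only, not an agreed set of services.
SERVICES = ["en.wikipedia.org", "phabricator.wikimedia.org"]


def build_entries(dc, services=SERVICES, resolve=socket.gethostbyname):
    """Return hosts-file lines pointing each service at text-lb.<dc>.wikimedia.org."""
    ip = resolve("text-lb.{}.wikimedia.org".format(dc))
    return ["{} {} {}".format(ip, name, MARKER) for name in services]


def update_hosts(content, dc, resolve=socket.gethostbyname):
    """Drop previously managed lines; unless dc is 'reset', append fresh ones."""
    kept = [line for line in content.splitlines() if not line.endswith(MARKER)]
    if dc != "reset":
        kept.extend(build_entries(dc, resolve=resolve))
    return "\n".join(kept) + "\n"
```

Tagging every managed line with a marker keeps the reset case trivial: the script removes only what it added and never touches the rest of the file.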

Tue, Feb 11, 10:09 AM · Operations
Volans moved T244840: Evaluate options for non-root operations with cumin and spicerack cookbooks from Backlog to In Progress on the SRE-tools board.
Tue, Feb 11, 10:01 AM · SRE-tools, Operations
Volans added a comment to T244840: Evaluate options for non-root operations with cumin and spicerack cookbooks.
Possible improvements
  • Netbox
    • Have 2 config files in /etc/spicerack/netbox/, one RW and one RO with different permissions.
    • Change Spicerack.netbox() to accept a write=False param and load the appropriate file based on that.
    • Update the cookbooks that need to write to Netbox to use write=True.
  • Various spicerack modules that use Remote (cumin) and require root might self-detect if they are running with fewer privileges and bail out at instantiation time instead of failing in the middle of the execution.
  • conftool/etcd: implement T97972 and adapt spicerack accordingly, possibly having multiple config files with different permissions like the above proposal for Netbox.
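The Netbox and privilege-check ideas above could be sketched roughly as follows. This is not Spicerack's actual API: the file names and both helper functions are hypothetical, and only the /etc/spicerack/netbox/ directory and the write flag come from the proposal.

```python
import os

# Directory from the proposal; the file names below are made up for illustration.
CONFIG_DIR = "/etc/spicerack/netbox"


def netbox_config_path(write=False, config_dir=CONFIG_DIR):
    """Pick the read-write or the read-only config file based on the write flag."""
    name = "config-rw.yaml" if write else "config-ro.yaml"
    return "{}/{}".format(config_dir, name)


def require_root(euid=None):
    """Bail out at instantiation time instead of failing in the middle of a run."""
    if euid is None:
        euid = os.geteuid()
    if euid != 0:
        raise PermissionError("this module requires root privileges")
```

With different file permissions on the two config files, a non-root user could still load the read-only one, while a cookbook asking for write=True would fail early when it cannot read the read-write file.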
Tue, Feb 11, 9:45 AM · SRE-tools, Operations
Volans triaged T244840: Evaluate options for non-root operations with cumin and spicerack cookbooks as Medium priority.
Tue, Feb 11, 9:43 AM · SRE-tools, Operations

Mon, Feb 10

Volans added a comment to T243715: elastic2043 has hardware errors that trigger reboots.

It looks like the same error from racadm lclog view

Mon, Feb 10, 10:11 PM · Discovery, Operations, ops-codfw
Volans reopened T243715: elastic2043 has hardware errors that trigger reboots as "Open".

It just got rebooted

Mon, Feb 10, 10:07 PM · Discovery, Operations, ops-codfw
Volans triaged T244690: Homer: setup CI for static data repositories as Medium priority.
Mon, Feb 10, 7:53 AM · Release-Engineering-Team-TODO (2020-01 to 2020-03 (Q3)), Continuous-Integration-Config, homer
Volans added a watcher for homer: Volans.
Mon, Feb 10, 7:50 AM
Volans added a member for homer: Volans.
Mon, Feb 10, 7:50 AM

Fri, Feb 7

Volans closed T170740: PuppetDB misbehaving on 2017-07-15 as Resolved.

Since the last update a lot of things have changed in our PuppetDB installation, including PuppetDB and OS versions. As we haven't had any recurrence of this issue I'm resolving it.

Fri, Feb 7, 11:11 PM · Patch-For-Review, Puppet, Operations

Thu, Feb 6

Volans claimed T244363: Homer: commit timeout on MX104 and SRXs.
Thu, Feb 6, 12:27 AM · Operations, SRE-tools
Volans moved T244363: Homer: commit timeout on MX104 and SRXs from Backlog to In Progress on the SRE-tools board.
Thu, Feb 6, 12:27 AM · Operations, SRE-tools
Volans moved T244362: Homer: commit> no causes stacktrace from In Progress to In Code Review on the SRE-tools board.
Thu, Feb 6, 12:11 AM · Operations, SRE-tools
Volans claimed T244362: Homer: commit> no causes stacktrace.
Thu, Feb 6, 12:07 AM · Operations, SRE-tools
Volans moved T244362: Homer: commit> no causes stacktrace from Backlog to In Progress on the SRE-tools board.
Thu, Feb 6, 12:07 AM · Operations, SRE-tools

Wed, Feb 5

Volans triaged T244315: decommission cookbook: add support for decom spreadsheet as Medium priority.
Wed, Feb 5, 12:53 AM · SRE-tools
Volans added a comment to T244314: Figure out how to ideally configure mypy for Python projects.

One option for now would be to run py3{5,6,7,8}-mypy, as mypy defaults --python-version to the version of the interpreter running it, given that both 3.4 and 2.7 are out of support.

Wed, Feb 5, 12:44 AM · tox-wikimedia

Tue, Feb 4

Volans moved T243935: Audit all cumin queries in switchdc scripts from Backlog to In Progress on the SRE-tools board.
Tue, Feb 4, 11:09 PM · Patch-For-Review, SRE-tools, Operations
Volans triaged T243935: Audit all cumin queries in switchdc scripts as Medium priority.
Tue, Feb 4, 11:08 PM · Patch-For-Review, SRE-tools, Operations

Wed, Jan 29

Volans added a comment to T223934: Add annotations from ops vendor maintenance calendar to Grafana.

Maybe we could converge this into T222826

Wed, Jan 29, 3:00 PM · Operations

Sun, Jan 26

Volans added a comment to T243715: elastic2043 has hardware errors that trigger reboots.

Current status:

/admin1-> racadm serveraction powerstatus
Server power status: OFF
Sun, Jan 26, 6:16 PM · Discovery, Operations, ops-codfw
Volans added a comment to T243715: elastic2043 has hardware errors that trigger reboots.

As per https://wikitech.wikimedia.org/wiki/Search#Hardware_Failures I'm depooling + downtiming + shutting down the host to prevent it from continuing to reboot and leave/rejoin the ES clusters.

Sun, Jan 26, 6:10 PM · Discovery, Operations, ops-codfw
Volans added a comment to T243715: elastic2043 has hardware errors that trigger reboots.

Downtimed host until 2020-02-06 17:04:23 (no onsite dcops this week)

Sun, Jan 26, 6:07 PM · Discovery, Operations, ops-codfw
Volans triaged T243715: elastic2043 has hardware errors that trigger reboots as Medium priority.
Sun, Jan 26, 5:57 PM · Discovery, Operations, ops-codfw

Sat, Jan 25

Volans added a comment to T231068: Spicerack: improve support for Ganeti VMs.

I think that the shortest path to solve this is:

  • add simple support for Ganeti gnt-* commands to spicerack, at least supporting remove with --force and --shutdown-timeout=0 for now.
  • add to the decommission cookbook the removal of the Ganeti VM directly, probably right before the puppetdb removal
  • add to the decommission cookbook a forced run of the Ganeti VM sync for the affected cluster to ensure Netbox is up to date
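As an illustration of the first point, the command to run on the Ganeti master could be built along these lines. This is only a sketch: it assumes the removal goes through gnt-instance remove, the function name is hypothetical, and nothing is executed here.

```python
import shlex


def gnt_remove_command(instance, shutdown_timeout=0):
    """Build the command line to remove a Ganeti VM (built only, not executed)."""
    args = [
        "gnt-instance",
        "remove",
        "--force",
        "--shutdown-timeout={}".format(shutdown_timeout),
        instance,
    ]
    # Quote each argument so the string is safe to hand to a remote shell.
    return " ".join(shlex.quote(arg) for arg in args)
```

The resulting string could then be passed to whatever remote execution layer the cookbook already uses to reach the cluster master.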
Sat, Jan 25, 12:52 AM · Patch-For-Review, SRE-tools

Fri, Jan 24

Volans added a comment to T243634: ulsfo varnish-fe vcache processes overflow on FDs.

I've noticed that on Icinga the 3 checks that match *_fifo_* were in unknown state with a timeout message.
@CDanis found that the lsof commands in those checks were consuming a lot of resources and taking a very long time. Moreover the varnishd process (the child one, the one for the vcache user) had a very high number of file descriptors, very close to 500k.

Fri, Jan 24, 10:09 PM · Operations, Traffic
Volans triaged T243634: ulsfo varnish-fe vcache processes overflow on FDs as Medium priority.
Fri, Jan 24, 10:03 PM · Operations, Traffic

Thu, Jan 23

Volans added a comment to T243550: volans test phab api call.

test comment

Thu, Jan 23, 9:40 PM · SRE-tools
Volans updated the task description for T159045: Update Puppet repo code that uses deprecated maniphest.update/.createtask/.query Conduit API.
Thu, Jan 23, 9:11 PM · Patch-For-Review, Phabricator, Operations, Technical-Debt, SRE-tools
Volans closed T243550: volans test phab api call as Invalid.
Thu, Jan 23, 9:06 PM · SRE-tools
Volans added a comment to T211750: Introduce Python code formatters usage.

As I've seen some efforts to use black in other parts of the org too, ideally it would be nice to have a single way to set it up:

  • black configuration (line length, quotes)
  • tox environment to ensure the code is black'ed in CI
  • an easy way, as automatic as possible, to integrate it into the development workflow (IDE/editor configs, git hook, etc.)
Thu, Jan 23, 6:07 PM · tox-wikimedia, Patch-For-Review, Operations, SRE-tools

Jan 20 2020

Volans raised the priority of T231068: Spicerack: improve support for Ganeti VMs from Low to Medium.
Jan 20 2020, 11:14 AM · Patch-For-Review, SRE-tools
Volans added a comment to T231068: Spicerack: improve support for Ganeti VMs.

Currently we have a bunch of race conditions in the decommission path of VMs. The current actions can be summarized in:

Jan 20 2020, 10:56 AM · Patch-For-Review, SRE-tools
Volans closed T239123: Netbox: Synchronize ganeti VMs from new clusters as Resolved.

All DCs are properly tracked.

Jan 20 2020, 10:46 AM · User-crusnov, SRE-tools, netbox
Volans updated the task description for T239123: Netbox: Synchronize ganeti VMs from new clusters.
Jan 20 2020, 10:46 AM · User-crusnov, SRE-tools, netbox

Jan 15 2020

Volans created T242910: Add check for changes applied at all runs.
Jan 15 2020, 7:44 PM · Patch-For-Review, User-jbond, Puppet, Operations
Volans closed T239597: Hardware asset tag Netbox/DNS mgmt inconsistencies as Resolved.
Jan 15 2020, 4:38 PM · ops-eqiad, Operations, DC-Ops
Reedy defrocked Volans.
Jan 15 2020, 3:57 PM
Volans closed T242412: ulsfo doesn't have any rack group set in Netbox as Resolved.

As per IRC chat it's ok as is, resolving.

Jan 15 2020, 3:39 PM · DC-Ops, netbox
Reedy empowered Volans as an administrator.
Jan 15 2020, 1:29 PM

Jan 13 2020

Volans added a comment to T238900: add TLS support for smokeping.wikimedia.org.

I was made aware that the two comments above are contradictory. I don't recall the reason for my comment above, nor any limitation of the 2 certs approach. I agree they are separate services and should not depend on each other.

Jan 13 2020, 10:42 AM · netops, Operations, Traffic
Volans added a comment to T242412: ulsfo doesn't have any rack group set in Netbox.

@faidon: I mainly opened this because it was the only DC without a rack group, even the network PoPs have one and use the name of the DC raw, not just 1, see https://netbox.wikimedia.org/dcim/rack-groups/

Jan 13 2020, 9:25 AM · DC-Ops, netbox

Jan 10 2020

Volans triaged T242412: ulsfo doesn't have any rack group set in Netbox as Medium priority.
Jan 10 2020, 9:47 AM · DC-Ops, netbox

Jan 8 2020

Volans triaged T242261: wikibugs.wb2-phab: Could not retrieve anchor as Medium priority.
Jan 8 2020, 6:48 PM · Wikibugs

Jan 2 2020

Volans added a comment to T239597: Hardware asset tag Netbox/DNS mgmt inconsistencies.

@Jclark-ctr by any chance do you have an ETA for this task? Just so I know and can plan something related accordingly.

Jan 2 2020, 12:20 PM · ops-eqiad, Operations, DC-Ops
Volans closed T239386: memory leak on keyholder-proxy on buster/python 3.7 as Resolved.

Indeed, done :)

Jan 2 2020, 11:43 AM · Acme-chief, Traffic, Operations
Volans committed rOSHO9bd9c7fccb7e: netbox: skip virtual chassis without domain (authored by Volans).
netbox: skip virtual chassis without domain
Jan 2 2020, 10:44 AM
Volans added a comment to T241593: cp1083: ats-tls and varnish-fe crashed due to insufficient memory.

@ema maybe it could be related to NUMA utilization? Having a quick look at numastat (both -n and -m) there is a general imbalance between the two nodes (that I think is mostly on purpose due to our custom config), and the varnish process seems to be the one mostly responsible for it. But there was no spike in the graph either.

Jan 2 2020, 9:25 AM · Wikimedia-Incident, observability, Traffic, Operations

Dec 24 2019

Volans added a comment to T241206: Report image metadata to debmonitor.

The issue for the DELETE has been fixed, I've successfully deleted the image docker-registry.wikimedia.org/python3-build-stretch:0.0.2 that was failing during the tests.
Please ensure that the /upload endpoint still works as expected too.

Dec 24 2019, 12:23 PM · docker-pkg, Operations, SRE-tools, serviceops

Dec 23 2019

Volans added a comment to T228387: Bare metal cloud: management interfaces.

Thanks, LGTM, feel free to proceed.

Dec 23 2019, 6:25 PM · User-crusnov, Goal, SRE-tools
Volans added a comment to T228387: Bare metal cloud: management interfaces.

@crusnov thanks for the dry-run run, here are my comments:

Dec 23 2019, 10:25 AM · User-crusnov, Goal, SRE-tools
Volans added a comment to T239821: decommission elastic10[18-31].eqiad.wmnet.

Interesting: given that the new cookbook kills the hosts, that was unexpected, but the cookbook is very quick so I get why it happens.
My suggestion is to add a small sleep (with a log line to tell the user) before this line https://gerrit.wikimedia.org/r/plugins/gitiles/operations/cookbooks/+/master/cookbooks/sre/hosts/decommission.py#167
Probably 10~30s should be enough to run the other actions after any in-flight action.
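Something along these lines is what I have in mind, as a minimal sketch only: the function name and the 20 second default are arbitrary choices within the suggested range, not what the cookbook currently does.

```python
import logging
import time

logger = logging.getLogger(__name__)


def wait_for_inflight(seconds=20, sleep=time.sleep):
    """Tell the user why we pause, then sleep so in-flight actions can finish."""
    logger.info("Sleeping %d seconds to let in-flight actions complete", seconds)
    sleep(seconds)
```

The sleep function is injectable only so that the helper can be exercised in tests without actually waiting.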

Dec 23 2019, 9:49 AM · Discovery-Search (Current work), Operations, DC-Ops, decommission
Volans added a comment to T239821: decommission elastic10[18-31].eqiad.wmnet.

@MoritzMuehlenhoff mmmh, according to T239821#5747654 it all worked fine. LMK if I should investigate.

Dec 23 2019, 9:37 AM · Discovery-Search (Current work), Operations, DC-Ops, decommission

Dec 21 2019

Volans updated the task description for T238305: servers freeze across the caching cluster.
Dec 21 2019, 11:27 PM · Operations, Traffic
Volans triaged T241306: cp3051 crashed as Medium priority.
Dec 21 2019, 11:27 PM · Traffic, Operations
Volans added a comment to T240425: cp3055 crashed.

Nothing on the host logs either. For the record it crashed 7 minutes after cp3051 (see T241306) and both are part of the upload esams cluster.

Dec 21 2019, 11:24 PM · Traffic, Operations
Volans added a comment to T240425: cp3055 crashed.

The host crashed again today, nothing in racadm, checked both getsel and lclog view.

Dec 21 2019, 11:12 PM · Traffic, Operations
Volans added a comment to T241306: cp3051 crashed.

Nothing in racadm, checked both getsel and lclog view. Nothing in syslog & co.

Dec 21 2019, 11:04 PM · Traffic, Operations
Volans created T241306: cp3051 crashed.
Dec 21 2019, 10:44 PM · Traffic, Operations

Dec 20 2019

Volans added a comment to T238956: switch prod Phabricator from phab1003 to phab1001.

@Aklapper yes, as the host got reimaged I think the page was not updated, but I cannot edit it unfortunately.

Dec 20 2019, 10:44 PM · serviceops, Release-Engineering-Team

Dec 17 2019

Volans committed rOSHP996f7be39285: Release v0.1.0 (authored by Volans).
Release v0.1.0
Dec 17 2019, 11:30 AM
Volans updated the task description for T228388: Configuration management for network operations.
Dec 17 2019, 10:17 AM · Patch-For-Review, Wikimedia-Incident, Operations, Goal, netops, SRE-tools

Dec 12 2019

Volans updated subscribers of T194031: Setup a new PKI software as an alternative to the puppet CA for managing services certificates.
Dec 12 2019, 10:44 AM · User-jbond, Traffic, Operations

Dec 11 2019

Volans triaged T240457: Debmonitor: backend-changeable settings are stored in the browser's session storage as Medium priority.
Dec 11 2019, 2:39 PM · SRE-tools

Dec 10 2019

Volans added a comment to T167422: Monitoring: add link to graph for Icinga timeseries alarms.

That's great. The idea of the task was to link the specific dashboard that has the same data, while sometimes we use data that is not shown on Grafana at all or we link a generic dashboard and not a specific graph.
I don't know the current state of all those links though, so I'll leave it to your best judgement.

Dec 10 2019, 2:45 PM · observability, Operations

Dec 9 2019

Volans added a comment to T239386: memory leak on keyholder-proxy on buster/python 3.7.

So far so good, leaving it open for another week or two to ensure the issue is totally fixed.

Dec 9 2019, 2:34 PM · Acme-chief, Traffic, Operations
Volans added a comment to T238350: Merge all netbox extras into one repository.

Currently open CRs towards the netbox-reports repo should be checked to see if they need to be resent towards the new repo:
https://gerrit.wikimedia.org/r/q/project:operations%252Fsoftware%252Fnetbox-reports+status:open
https://gerrit.wikimedia.org/r/q/project:operations%252Fsoftware%252Fnetbox-deploy+status:open

Dec 9 2019, 2:22 PM · SRE-tools, netbox
Volans closed T238974: Icinga meta-monitoring: don't send recovery if the alert failed to be sent as Resolved.

The OOM issue has been fixed and for now memory, disk and CPU seem to be under control.
Resolving it for now, we can re-open if this turns out to be required anyway.

Dec 9 2019, 1:17 PM · observability, SRE-tools
Volans closed T240193: debmonitor: show OS release name in the host view as Invalid.

I understand that this might seem confusing, but it was decided from the start that debmonitor should not keep track of those, because the idea of a specific release of Debian is quite arbitrary, depending on which APT repositories you set up on the host and the packages you install.
The other way of looking at it is that a package version in a Debian repository is not tied to a specific release: a specific release uses that version, but the versions are independent of it.
CC @MoritzMuehlenhoff FYI

Dec 9 2019, 10:59 AM · SRE-tools

Dec 8 2019

Volans reopened T239957: Degraded RAID on cloudelastic1002 as "Open".

Re-opening as this has not yet been solved at the md software RAID layer, Icinga is still critical and /proc/mdstat still reports the above degraded status.

Dec 8 2019, 1:23 AM · Discovery-Search (Current work), Discovery, ops-eqiad, Operations

Dec 6 2019

Volans reopened T238956: switch prod Phabricator from phab1003 to phab1001 as "Open".

I've noticed that Phabricator emails are failing the SPF check, re-opening to add details, feel free to move it to a separate task if needed.

Dec 6 2019, 9:15 PM · serviceops, Release-Engineering-Team

Dec 5 2019

Volans committed rOSNE93cd57940e5a: Revert "coherence: Check device names for correct formatting" (authored by Volans).
Revert "coherence: Check device names for correct formatting"
Dec 5 2019, 11:20 PM
Volans added a reverting change for rOSNE70a6dfbf8646: coherence: Check device names for correct formatting: rOSNE93cd57940e5a: Revert "coherence: Check device names for correct formatting".
Dec 5 2019, 11:20 PM
Volans committed rOSNE093fa589ba9c: PuppetDB: fix handle of FAILED status (authored by Volans).
PuppetDB: fix handle of FAILED status
Dec 5 2019, 11:20 PM
Volans committed rOSNE31cdf093f3f0: Add decommissioning status support to reports (authored by crusnov).
Add decommissioning status support to reports
Dec 5 2019, 11:20 PM
Volans committed rOSNEe62a7db29246: Puppetdb: use the is_virtual fact (authored by Volans).
Puppetdb: use the is_virtual fact
Dec 5 2019, 11:19 PM
Volans committed rOSNEbffc03cfa499: PuppetDB: fix typos (authored by Volans).
PuppetDB: fix typos
Dec 5 2019, 11:19 PM
Volans committed rOSNE9e2b7e7d724a: PuppetDB report improvements (authored by Volans).
PuppetDB report improvements
Dec 5 2019, 11:19 PM
Volans updated subscribers of T239901: Disallow 'weight: 0' for MW db config in dbctl.
Dec 5 2019, 11:46 AM · Operations, DBA, Wikimedia-Incident
Volans updated subscribers of T239897: wmf-auto-reimage errors: failure to downtime (w/ no rename), pytho gc whine.

For the first one the downtime cookbook failed to run puppet on the Icinga active host to get the definitions of the reimaged hosts to downtime. Given how slow puppet is on the Icinga host, if there are multiple runs at the same time we can hit the timeout even with --attempts 30.
My suggestion for running parallel reimages is to open 2~3 tmux sessions, run sequential reimages in each of them, and start them a few minutes apart from each other.

Dec 5 2019, 11:45 AM · SRE-tools, Operations

Dec 4 2019

Volans committed rOSNE91ec71539035: Initial setup of repo (authored by Volans).
Initial setup of repo
Dec 4 2019, 5:20 PM
Volans updated subscribers of T239807: Clean up old images on wikitech-static.
Dec 4 2019, 1:37 PM · wikitech.wikimedia.org

Dec 3 2019

Volans added a comment to T237604: Record per-server power usage.

Is there any bug report about this? Are you sure it affects the components we would be using? I understand ipmi-oem does not use the network stack.

Dec 3 2019, 10:35 AM · observability