Volans (Riccardo Coccioli)
SRE

User Details

User Since
Feb 10 2016, 11:25 AM (205 w, 2 d)
Availability
Available
IRC Nick
volans
LDAP User
Volans
MediaWiki User
RCoccioli (WMF)

Recent Activity

Wed, Jan 15

Volans created T242910: Add check for changes applied at all runs.
Wed, Jan 15, 7:44 PM · Puppet, Operations
Volans closed T239597: Hardware asset tag Netbox/DNS mgmt inconsistencies as Resolved.
Wed, Jan 15, 4:38 PM · ops-eqiad, Operations, DC-Ops
Reedy defrocked Volans.
Wed, Jan 15, 3:57 PM
Volans closed T242412: ulsfo doesn't have any rack group set in Netbox as Resolved.

As per IRC chat it's ok as is, resolving.

Wed, Jan 15, 3:39 PM · DC-Ops, netbox
Reedy empowered Volans as an administrator.
Wed, Jan 15, 1:29 PM

Mon, Jan 13

Volans added a comment to T238900: add TLS support for smokeping.wikimedia.org.

I was made aware that the two above comments are contradictory. I don't recall the why of my above comment or any limitation on the 2 certs approach. I agree they are separate services and should not depend on each other.

Mon, Jan 13, 10:42 AM · netops, Operations, Traffic
Volans added a comment to T242412: ulsfo doesn't have any rack group set in Netbox.

@faidon: I mainly opened this because it was the only DC without a rack group; even the network PoPs have one and use the name of the DC raw, not just 1, see https://netbox.wikimedia.org/dcim/rack-groups/

Mon, Jan 13, 9:25 AM · DC-Ops, netbox

Fri, Jan 10

Volans triaged T242412: ulsfo doesn't have any rack group set in Netbox as Medium priority.
Fri, Jan 10, 9:47 AM · DC-Ops, netbox

Wed, Jan 8

Volans triaged T242261: wikibugs.wb2-phab: Could not retrieve anchor as Medium priority.
Wed, Jan 8, 6:48 PM · Wikibugs

Thu, Jan 2

Volans added a comment to T239597: Hardware asset tag Netbox/DNS mgmt inconsistencies.

@Jclark-ctr by any chance do you have an ETA for this task? Just to know, so that I can plan something related accordingly.

Thu, Jan 2, 12:20 PM · ops-eqiad, Operations, DC-Ops
Volans closed T239386: memory leak on keyholder-proxy on buster/python 3.7 as Resolved.

Indeed, done :)

Thu, Jan 2, 11:43 AM · Acme-chief, Traffic, Operations
Volans committed rOSHO9bd9c7fccb7e: netbox: skip virtual chassis without domain (authored by Volans).
Thu, Jan 2, 10:44 AM
Volans added a comment to T241593: cp1083: ats-tls and varnish-fe crashed due to insufficient memory.

@ema maybe could be related to NUMA utilization? Having a quick look at numastat (both -n and -m) there is a general imbalance between the two nodes (that I think is mostly on purpose due to our custom config), and the varnish process seems the one mostly responsible for it. But there was no spike in the graph either.

Thu, Jan 2, 9:25 AM · observability, Traffic, Operations

Tue, Dec 24

Volans added a comment to T241206: Report image metadata to debmonitor.

The issue for the DELETE has been fixed, I've successfully deleted the image docker-registry.wikimedia.org/python3-build-stretch:0.0.2 that was failing during the tests.
Please ensure that the /upload endpoint still works as expected too.

Tue, Dec 24, 12:23 PM · docker-pkg, Operations, SRE-tools, serviceops

Mon, Dec 23

Volans added a comment to T228387: Bare metal cloud: management interfaces.

Thanks, LGTM, feel free to proceed.

Mon, Dec 23, 6:25 PM · Patch-For-Review, User-crusnov, Goal, SRE-tools
Volans added a comment to T228387: Bare metal cloud: management interfaces.

@crusnov thanks for the dry-run, here are my comments:

Mon, Dec 23, 10:25 AM · Patch-For-Review, User-crusnov, Goal, SRE-tools
Volans added a comment to T239821: decommission elastic10[18-31].eqiad.wmnet.

Interesting. Given that the new cookbook kills the hosts, that was unexpected, but the cookbook is very quick so I get why it happens.
My suggestion is to add a small sleep (with a log line to tell the user) before this line https://gerrit.wikimedia.org/r/plugins/gitiles/operations/cookbooks/+/master/cookbooks/sre/hosts/decommission.py#167
Probably 10~30s should be enough to run the other actions after any in-flight action.
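A minimal sketch of the suggested pause, not the actual cookbook code: the function name `pause_before_next_step`, the logger name, and the 20-second value are all hypothetical, picked from inside the 10~30s range mentioned above.

```python
import logging
import time

# Hypothetical logger name, used only for this illustration.
logger = logging.getLogger("decommission-sketch")


def pause_before_next_step(seconds, sleep=time.sleep):
    """Tell the user why we wait, then sleep so in-flight actions can finish.

    The `sleep` parameter is injectable so the function can be exercised
    without actually waiting.
    """
    logger.info("Sleeping %d seconds to let any in-flight action complete", seconds)
    sleep(seconds)
    return seconds


if __name__ == "__main__":
    logging.basicConfig(level=logging.INFO)
    # Use a no-op sleep here so the demo returns immediately.
    pause_before_next_step(20, sleep=lambda s: None)
```

The log line matters as much as the sleep: without it, a cookbook that is normally very quick would look like it hung.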

Mon, Dec 23, 9:49 AM · Discovery-Search (Current work), Operations, DC-Ops, decommission
Volans added a comment to T239821: decommission elastic10[18-31].eqiad.wmnet.

@MoritzMuehlenhoff mmmh, according to T239821#5747654 it all worked fine. LMK if I should investigate.

Mon, Dec 23, 9:37 AM · Discovery-Search (Current work), Operations, DC-Ops, decommission

Sat, Dec 21

Volans updated the task description for T238305: servers freeze across the caching cluster.
Sat, Dec 21, 11:27 PM · Traffic, Operations
Volans triaged T241306: cp3051 crashed as Medium priority.
Sat, Dec 21, 11:27 PM · Traffic, Operations
Volans added a comment to T240425: cp3055 crashed.

Nothing on the host logs either. For the record it crashed 7 minutes after cp3051 (see T241306) and both are part of the upload esams cluster.

Sat, Dec 21, 11:24 PM · Traffic, Operations
Volans added a comment to T240425: cp3055 crashed.

The host crashed again today, nothing in racadm, checked both getsel and lclog view.

Sat, Dec 21, 11:12 PM · Traffic, Operations
Volans added a comment to T241306: cp3051 crashed.

Nothing in racadm, checked both getsel and lclog view. Nothing in syslog & co.

Sat, Dec 21, 11:04 PM · Traffic, Operations
Volans created T241306: cp3051 crashed.
Sat, Dec 21, 10:44 PM · Traffic, Operations

Fri, Dec 20

Volans added a comment to T238956: switch prod Phabricator from phab1003 to phab1001.

@Aklapper yes, as the host got reimaged I think the page was not updated, but I cannot edit it unfortunately.

Fri, Dec 20, 10:44 PM · serviceops, Release-Engineering-Team

Dec 17 2019

Volans committed rOSHP996f7be39285: Release v0.1.0 (authored by Volans).
Dec 17 2019, 11:30 AM
Volans updated the task description for T228388: Configuration management for network operations.
Dec 17 2019, 10:17 AM · Patch-For-Review, Wikimedia-Incident, Operations, Goal, netops, SRE-tools

Dec 12 2019

Volans updated subscribers of T194031: Setup a new PKI software as an alternative to the puppet CA for managing services certificates.
Dec 12 2019, 10:44 AM · User-jbond, Traffic, Operations

Dec 11 2019

Volans triaged T240457: Debmonitor: backend-changeable settings are stored in the browser's session storage as Medium priority.
Dec 11 2019, 2:39 PM · SRE-tools

Dec 10 2019

Volans added a comment to T167422: Monitoring: add link to graph for Icinga timeseries alarms.

That's great. The idea of the task was to link the specific dashboard that has the same data, while sometimes we use data that is not shown on Grafana at all, or we link a generic dashboard and not a specific graph.
I don't know the current state of all those links though, so I'll leave it to your best judgement.

Dec 10 2019, 2:45 PM · observability, Operations

Dec 9 2019

Volans added a comment to T239386: memory leak on keyholder-proxy on buster/python 3.7.

So far so good, leaving it open for another week or two to ensure the issue is totally fixed.

Dec 9 2019, 2:34 PM · Acme-chief, Traffic, Operations
Volans added a comment to T238350: Merge all netbox extras into one repository.

Currently open CRs towards the netbox-reports repo should be checked to see if they need to be resent towards the new repo:
https://gerrit.wikimedia.org/r/q/project:operations%252Fsoftware%252Fnetbox-reports+status:open
https://gerrit.wikimedia.org/r/q/project:operations%252Fsoftware%252Fnetbox-deploy+status:open

Dec 9 2019, 2:22 PM · SRE-tools, netbox
Volans closed T238974: Icinga meta-monitoring: don't send recovery if the alert failed to be sent as Resolved.

The OOM issue has been fixed and for now memory, disk and CPU seem to be under control.
Resolving it for now, we can re-open if this turns out to be required anyway.

Dec 9 2019, 1:17 PM · observability, SRE-tools
Volans closed T240193: debmonitor: show OS release name in the host view as Invalid.

I understand that this might seem confusing, but it was decided from the start that debmonitor should not keep track of those, because the idea of a specific release of Debian is very aleatory, depending on which APT repositories you set up on the host and the packages you install.
The other way of looking at it is that a package version in a Debian repository is not for a specific release, a specific release uses that version but the versions are independent of that.
CC @MoritzMuehlenhoff FYI

Dec 9 2019, 10:59 AM · SRE-tools

Dec 8 2019

Volans reopened T239957: Degraded RAID on cloudelastic1002 as "Open".

Re-opening as this has not yet been solved at the md software RAID layer, Icinga is still critical and /proc/mdstat still reports the above degraded status.

Dec 8 2019, 1:23 AM · Discovery-Search (Current work), Discovery, ops-eqiad, Operations

Dec 6 2019

Volans reopened T238956: switch prod Phabricator from phab1003 to phab1001 as "Open".

I've noticed that Phabricator emails are failing the SPF check, re-opening to add details, feel free to move it to a separate task if needed.

Dec 6 2019, 9:15 PM · serviceops, Release-Engineering-Team

Dec 5 2019

Volans committed rOSNE93cd57940e5a: Revert "coherence: Check device names for correct formatting" (authored by Volans).
Dec 5 2019, 11:20 PM
Volans added a reverting change for rOSNE70a6dfbf8646: coherence: Check device names for correct formatting: rOSNE93cd57940e5a: Revert "coherence: Check device names for correct formatting".
Dec 5 2019, 11:20 PM
Volans committed rOSNE093fa589ba9c: PuppetDB: fix handle of FAILED status (authored by Volans).
Dec 5 2019, 11:20 PM
Volans committed rOSNE31cdf093f3f0: Add decommissioning status support to reports (authored by crusnov).
Dec 5 2019, 11:20 PM
Volans committed rOSNEe62a7db29246: Puppetdb: use the is_virtual fact (authored by Volans).
Dec 5 2019, 11:19 PM
Volans committed rOSNEbffc03cfa499: PuppetDB: fix typos (authored by Volans).
Dec 5 2019, 11:19 PM
Volans committed rOSNE9e2b7e7d724a: PuppetDB report improvements (authored by Volans).
Dec 5 2019, 11:19 PM
Volans updated subscribers of T239901: Disallow 'weight: 0' for MW db config in dbctl.
Dec 5 2019, 11:46 AM · Operations, DBA, Wikimedia-Incident
Volans updated subscribers of T239897: wmf-auto-reimage errors: failure to downtime (w/ no rename), pytho gc whine.

For the first one, the downtime cookbook failed to run puppet on the Icinga active host to get the definitions of the reimaged hosts to downtime. Given how slow puppet is on the Icinga host, if there are multiple runs at the same time we can hit the timeout even with --attempts 30.
My suggestion for running parallel reimages is to open 2~3 tmux sessions, run sequential reimages in each, and let them start a few minutes apart from each other.

Dec 5 2019, 11:45 AM · SRE-tools, Operations

Dec 4 2019

Volans committed rOSNE91ec71539035: Initial setup of repo (authored by Volans).
Dec 4 2019, 5:20 PM
Volans updated subscribers of T239807: Clean up old images on wikitech-static.
Dec 4 2019, 1:37 PM · wikitech.wikimedia.org

Dec 3 2019

Volans added a comment to T237604: Record per-server power usage.

Is there any bug report about this? Are you sure it affects the components we would be using? I understand ipmi-oem does not use the network stack.

Dec 3 2019, 10:35 AM · observability

Dec 2 2019

Volans triaged T239597: Hardware asset tag Netbox/DNS mgmt inconsistencies as Medium priority.
Dec 2 2019, 11:58 AM · ops-eqiad, Operations, DC-Ops
Volans created T239597: Hardware asset tag Netbox/DNS mgmt inconsistencies.
Dec 2 2019, 11:58 AM · ops-eqiad, Operations, DC-Ops
Volans added a comment to T224564: Reimage wezen to Buster (and rename to centrallog2001).

I've updated the mgmt DNS name record in Netbox that was still reporting wezen. I also have a patch to clean up the wezen record from DNS, will push it later today.

Dec 2 2019, 11:20 AM · User-fgiunchedi, observability, Operations
Volans added a comment to T225128: Move cloudvirtan* hardware out of CloudVPS back into production Analytics VLAN..

I've updated the mgmt interface's DNS names on Netbox that were still reporting the old names cloudvirtan*.

Dec 2 2019, 11:11 AM · Analytics-Kanban, ops-eqiad, Operations, netops, Analytics
Volans added a comment to T237464: Netbox Coherence Report enhancements .

Not sure if it can be considered in scope for this task as the title is pretty generic.
Another check we need is to ensure that the hostname part of some DNS names matches the device name, in particular:

  • primary_ip4
  • primary_ip6
  • mgmt ip address
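The check described above could look roughly like the sketch below. It is an illustration only, not the Netbox report code: the function names and the field labels are hypothetical, and comparing case-insensitively is an assumption, since no naming policy is stated here. The sample values come from the kafka2001/logstash2020 and mw2231 cases mentioned elsewhere in this feed.

```python
def dns_name_matches_device(device_name, dns_name):
    """Return True when the first label of dns_name equals device_name.

    Assumption: the comparison is case-insensitive, as DNS names are.
    """
    if not dns_name:
        return False
    hostname = dns_name.split(".", 1)[0]
    return hostname.lower() == device_name.lower()


def find_mismatches(device_name, dns_names):
    """Return the subset of {field: dns_name} whose hostname part differs.

    Fields with an empty DNS name are skipped rather than reported.
    """
    return {
        field: name
        for field, name in dns_names.items()
        if name and not dns_name_matches_device(device_name, name)
    }


if __name__ == "__main__":
    # Hypothetical field label; the DNS name is taken from the feed.
    print(find_mismatches("logstash2020", {"mgmt": "kafka2001.mgmt.codfw.wmnet"}))
```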
Dec 2 2019, 11:03 AM · Patch-For-Review, netbox
Volans added a comment to T232126: Decommission old mw2231/WMF6435 replaced with WMF6403.

Forgot to mention that https://netbox.wikimedia.org/ipam/ip-addresses/687/ still had the old name, I've updated it.

Dec 2 2019, 10:12 AM · Operations, ops-codfw
Volans added a comment to T232126: Decommission old mw2231/WMF6435 replaced with WMF6403.

@Papaul thanks, just a small detail, I've also deleted the 'mgmt' interface from 'mw2231 old' ( https://netbox.wikimedia.org/dcim/devices/1185/ ) given that it's offline (unracked).

Dec 2 2019, 9:45 AM · Operations, ops-codfw

Dec 1 2019

Volans added a comment to T167035: stretch acct monthly cron will spam when /var/log/wtmp.1 doesn't exist.

FYI This still happens in buster too, the Debian bug is still open.
We've 88 hosts that don't have /var/log/wtmp.1 and they spammed cronspam today.

Dec 1 2019, 10:02 AM · Operations

Nov 30 2019

Volans added a comment to T232126: Decommission old mw2231/WMF6435 replaced with WMF6403.

@Papaul given we're setting the DNS name of the ip address in Netbox, that one too needs to be updated, see the links above:

IP: 10.193.2.251/16
Assignment:	mw2231 (mgmt)
DNS Name:	graphite2002.mgmt.codfw.wmnet

and

IP: 10.193.1.118/16
Assignment:	mw2231 old (mgmt)
DNS Name:	mw2231.mgmt.codfw.wmnet

My understanding is that 10.193.1.118 is the mgmt IP assigned to the new mw2231 (but please double check it). In that case we should attach Netbox IP 10.193.1.118/16 to mw2231's mgmt interface and delete 10.193.2.251/16 if it is no longer used.

Nov 30 2019, 6:58 PM · Operations, ops-codfw
Volans reopened T232126: Decommission old mw2231/WMF6435 replaced with WMF6403 as "Open".

It seems that Netbox's ip address has not been updated and still reports graphite2002 in the DNS name, see
https://netbox.wikimedia.org/ipam/ip-addresses/687/

Nov 30 2019, 6:32 PM · Operations, ops-codfw
Volans reopened T238526: Decommission db2061.codfw.wmnet, a subtask of T228258: Decommission db2043-db2070, as Open.
Nov 30 2019, 6:30 PM · Operations, DBA
Volans reopened T238526: Decommission db2061.codfw.wmnet as "Open".

Netbox status is currently Decommissioning, if the host has been unracked it should be Offline.

Nov 30 2019, 6:29 PM · Operations, ops-codfw, decommission
Volans reopened T221068: decom ms-be201[345] as "Open".

ms-be2013 and ms-be2014 are marked as Decommissioning in Netbox, if they were unracked their status should be changed to Offline.

Nov 30 2019, 6:03 PM · decommission, ops-codfw, SRE-swift-storage, User-fgiunchedi, Operations
Volans reopened T235125: Move kafka200[123] to logstash202[012] as "Open".

Re-opening as the DNS names of the interfaces attached to those hosts have not been modified in Netbox.
Things like:

IP address: 10.193.1.23/16	
Parent: logstash2020
DNS name: kafka2001.mgmt.codfw.wmnet
Nov 30 2019, 5:57 PM · DC-Ops, Operations, ops-codfw

Nov 28 2019

Volans added a comment to T239449: cp1087 reboot.

It might be another occurrence of T238305 (model matches)

Nov 28 2019, 11:18 PM · Operations, Traffic
Volans added a comment to T239334: Python3 style guide.

@jbond on the CI instances you have 3.4, 3.5, 3.6 and 3.7 available although the system one is 3.5. Faidon did the packaging a while ago and if you see the CI jobs of many repos they run all environments from tox.

Nov 28 2019, 3:25 PM · Patch-For-Review, User-ArielGlenn, User-jbond, Operations, Puppet
Volans triaged T239412: Librenms sessions are stored inside the deployment directory as Medium priority.
Nov 28 2019, 1:36 PM · netops, Operations
Volans added a comment to T238919: Cleanup Netbox stuff from netmon hosts.

It looks like not all the puppet code was made ensure=>'absent'. We might have many more small things still lying around as a result.

Nov 28 2019, 1:29 PM · netbox
Volans added a comment to T238919: Cleanup Netbox stuff from netmon hosts.

I've also removed the crontab entries for wmf_auto_restart_uwsgi-netbox and prometheus-postgres-exporter.

Nov 28 2019, 1:27 PM · netbox
Volans added a comment to T238919: Cleanup Netbox stuff from netmon hosts.

Postgres user and related crontab are still present on the hosts and triggered a failure in the backup because there is no more DB to backup.
I've just removed the crontab for now.

Nov 28 2019, 1:22 PM · netbox
Volans added a comment to T239386: memory leak on keyholder-proxy on buster/python 3.7.

I was able to debug the issue using tracemalloc:
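For readers unfamiliar with the module, a generic illustration of the snapshot-diff technique that `tracemalloc` offers is sketched below. It is not the debugging session from the task: the `leaky` workload and the `top_growth` helper are hypothetical stand-ins.

```python
import tracemalloc

# Hypothetical leak: a module-level list that keeps growing.
leaked = []


def leaky():
    """Stand-in workload that retains about 1 MiB on every call."""
    leaked.extend(bytearray(1024) for _ in range(1000))


def top_growth(workload, limit=5):
    """Run workload between two snapshots and return the largest differences."""
    tracemalloc.start()
    before = tracemalloc.take_snapshot()
    workload()
    after = tracemalloc.take_snapshot()
    tracemalloc.stop()
    # Group allocations by source line; entries are sorted by the
    # absolute size difference, largest first.
    return after.compare_to(before, "lineno")[:limit]


if __name__ == "__main__":
    for stat in top_growth(leaky):
        print(stat)
```

Each printed entry names a file and line together with how much memory it gained between the two snapshots, which is usually enough to locate the allocation that keeps growing.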

Nov 28 2019, 11:02 AM · Acme-chief, Traffic, Operations
Volans added a comment to T239386: memory leak on keyholder-proxy on buster/python 3.7.

I'm doing a quick debug attempt on acmechief-test2001

Nov 28 2019, 9:37 AM · Acme-chief, Traffic, Operations

Nov 27 2019

Volans added a comment to T239334: Python3 style guide.

I think the best way is to have it easily integrated in some form in the local workflow in our dev envs, so that when you run the tests locally they pass, and when you commit you already commit the formatted version. Then CI just ensures that the code is already formatted according to the tool.
Otherwise the resulting workflow would be annoying: make a patch, send it to Gerrit, get V-1 from Jenkins, check the output, either run the tool manually locally or fix manually the format issues, send a second PS.

Nov 27 2019, 6:03 PM · Patch-For-Review, User-ArielGlenn, User-jbond, Operations, Puppet

Nov 26 2019

Volans added a comment to T238919: Cleanup Netbox stuff from netmon hosts.

@Volans what do you mean by "any remaining puppet code" ?

Nov 26 2019, 6:22 PM · netbox

Nov 25 2019

Volans added a project to T238974: Icinga meta-monitoring: don't send recovery if the alert failed to be sent: observability.
Nov 25 2019, 4:08 PM · observability, SRE-tools
Volans added a comment to T238305: servers freeze across the caching cluster.

If needed, full list of R440 available here: https://puppetboard.wikimedia.org/fact/productname/PowerEdge+R440 (intentionally not mentioning their count here)

Nov 25 2019, 1:52 PM · Traffic, Operations
Volans added a comment to T238900: add TLS support for smokeping.wikimedia.org.

No problem for me for 1 cert, it seems a reasonable approach.

Nov 25 2019, 1:34 PM · netops, Operations, Traffic

Nov 23 2019

Volans updated the task description for T238974: Icinga meta-monitoring: don't send recovery if the alert failed to be sent.
Nov 23 2019, 10:31 AM · observability, SRE-tools
Volans triaged T238974: Icinga meta-monitoring: don't send recovery if the alert failed to be sent as Medium priority.
Nov 23 2019, 10:26 AM · observability, SRE-tools

Nov 22 2019

Volans added a comment to T238833: Create NRPE check to alert when cergen certificates are due to expire.

I'll try to find some time soon to make cergen chmod after creating files.

Nov 22 2019, 6:51 PM · Patch-For-Review, User-jbond, Puppet, Operations
Volans added a comment to T238900: add TLS support for smokeping.wikimedia.org.

@Volans @crusnov @ayounsi we need some clarification regarding TLS material on netmon boxes, right now they get access to librenms and netbox acme-chief managed certificates. Is netbox still needed there?

Nov 22 2019, 12:29 PM · netops, Operations, Traffic
Volans triaged T238919: Cleanup Netbox stuff from netmon hosts as Medium priority.
Nov 22 2019, 12:29 PM · netbox
Volans created T238919: Cleanup Netbox stuff from netmon hosts.
Nov 22 2019, 12:29 PM · netbox

Nov 21 2019

Volans added a comment to T238833: Create NRPE check to alert when cergen certificates are due to expire.

It's hard to reply from the description, there is no quote task description button AFAIK.

Nov 21 2019, 3:49 PM · Patch-For-Review, User-jbond, Puppet, Operations
Volans added a comment to T223292: Netbox: generate CSV backups.

The above patch [1] has not yet been merged.

Nov 21 2019, 2:31 PM · netbox

Nov 20 2019

Volans added a comment to T238727: Include zone+subnet checks for DNS validation.

@fgiunchedi I think it's a fair request, but given we're in the process of auto-generating all mgmt and then server DNS records, this might have less benefit than in the current situation. Would it be ok to treat it as lower priority?

Nov 20 2019, 5:51 PM · Traffic, Operations, DNS, SRE-tools
Volans added a comment to T237469: Netbox: Fix hostname case ambiguity.

I don't mind the additional check, but again, I'm not sure how much is in scope for this task. If we do the extensive check then we should define a policy first, that is not strictly defined yet AFAIK.

Nov 20 2019, 11:04 AM · netbox, DC-Ops

Nov 19 2019

Volans added a comment to T237587: Determine & implement near-term method for escalating network alerts.

@herron @fgiunchedi I don't think that much, I guess you have to do the triggering part, I'm not super clear what you have in mind, a script to run from somewhere or what. I'll be careful with an email alias as it could be easily abused.

Nov 19 2019, 9:49 PM · Operations, netops, observability
Volans added a comment to T237469: Netbox: Fix hostname case ambiguity.

I actually think that's not enough if we want to enforce a policy, although it's not clear if that's the scope of this task.

Nov 19 2019, 5:01 PM · netbox, DC-Ops
Volans added a comment to T237469: Netbox: Fix hostname case ambiguity.

I've reverted the above patch as it was reporting most servers as false positive for the new name coherence report. The regex was wrong and not able to match our current hostnames.
Also, the goal of this check is not totally clear to me, as it was opened for the asset tag names specifically but the check was expanded to all hosts.
For asset tag hostnames for example we should check that the hostname matches the asset tag of the same device, and if those are already lowercase, that would be already enough.
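The narrower asset-tag check could be as small as the sketch below. It rests on assumptions that are not confirmed here: that asset tags are stored in a form such as WMF6435 (the form seen in other task titles in this feed), that the expected hostname is the lowercase tag, and the function name itself is hypothetical.

```python
def asset_tag_hostname_ok(hostname, asset_tag):
    """Return True when hostname is exactly the lowercase form of asset_tag.

    Assumption: devices named after their asset tag use the lowercase tag
    as hostname; this is not a stated policy.
    """
    return hostname == asset_tag.lower()


if __name__ == "__main__":
    print(asset_tag_hostname_ok("wmf6435", "WMF6435"))
    print(asset_tag_hostname_ok("WMF6435", "WMF6435"))
```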

Nov 19 2019, 1:20 PM · netbox, DC-Ops
Volans added a comment to T236277: Extend Puppet CA Expiry date .

The debmonitor test didn't test much as the debmonitor client sends the puppet client cert (not the CA) and it's the server that validates it with the CA.

Nov 19 2019, 11:56 AM · DBA, Patch-For-Review, User-jbond, Puppet, Operations
Volans added a comment to T236277: Extend Puppet CA Expiry date .

@jbond just to be on the safe side and to verify the theory, if possible make a quick test that the new cert in the CR is able to verify existing puppet node certs and cergen certs.

Nov 19 2019, 10:26 AM · DBA, Patch-For-Review, User-jbond, Puppet, Operations

Nov 15 2019

Volans added a comment to T233183: Automate generation of Management DNS records from Netbox.

A simplified version could be to use a cookbook to couple stuff:

Nov 15 2019, 7:42 PM · User-jbond, Traffic, Operations, User-crusnov, Goal, SRE-tools
Volans added a comment to T233183: Automate generation of Management DNS records from Netbox.

Seems sane! The only thing I'm a little iffy about is from the "SHA1 written to etcd" onwards. I'm not sure it's a bad approach, but I'm not sure I've thought through all the implications, either. I think the key things to think about in that part of the flow that might be missing are emergency updates to the netbox-defined data. e.g. if all the things are borked and we need to manually edit a DNS entry in the netbox-derived zonefile fragments.... is there a way we can do that from the authdns servers with authdns-update? (e.g. a local commit and an override of the SHA1 argument?).

Nov 15 2019, 2:42 PM · User-jbond, Traffic, Operations, User-crusnov, Goal, SRE-tools

Nov 14 2019

Volans added a member for Security: Tgr.
Nov 14 2019, 4:57 AM
Volans added a member for WMF-NDA: Tgr.
Nov 14 2019, 4:56 AM

Nov 13 2019

Volans added a comment to T224946: Netbox Alert Cleanups.

I am not sure this is related, but we get many alerts of

  • PROBLEM - Check the Netbox report puppetdb for fail status. on netbox1001 is CRITICAL: puppetdb.PuppetDB CRITICAL
  • PROBLEM - Check systemd state on netbox1001 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed

If those are actionable from dc-ops or if we are getting more false positives than we should, we must fix it.

Nov 13 2019, 2:34 PM · Operations, observability, User-crusnov, netbox, SRE-tools
Volans added a comment to T238200: debmonitor TLS termination.

As we discussed a while ago about this, the easiest solution is to pick another port for the public TLS server on the debmonitor servers as the 443 is already taken for the internal clients to report the package list to it and it's used to perform authz/n with the client certificate.

Nov 13 2019, 12:16 PM · Operations, Traffic

Nov 9 2019

Volans triaged T237803: Netbox reports Icinga checks timeout as High priority.
Nov 9 2019, 12:19 PM · Operations, SRE-tools, netbox
Volans created T237803: Netbox reports Icinga checks timeout.
Nov 9 2019, 12:19 PM · Operations, SRE-tools, netbox

Nov 8 2019

Dzahn awarded T179816: Cumin: create external backend for WMCS Puppet API a Love token.
Nov 8 2019, 5:45 PM · cloud-services-team (Kanban), SRE-tools
Volans added a comment to T237691: cloud-cumin-01: HTTPSConnectionPool - Max retries exceeded with url.

@Krenair the default ones HTTPSConnectionPool(host='localhost', port=443)
@Dzahn that would be T179816, but it seems that there hasn't been much interest in that backend.

Nov 8 2019, 5:23 PM · SRE-tools, Cloud-VPS