
Volans (Riccardo Coccioli)
SRE

Projects (13)

User Details

User Since
Feb 10 2016, 11:25 AM (428 w, 3 d)
Availability
Available
IRC Nick
volans
LDAP User
Volans
MediaWiki User
RCoccioli (WMF) [ Global Accounts ]

Recent Activity

Thu, Apr 18

Volans added a comment to T351418: Upgrade from ISC-DHCP Server to KEA-DHCP Server.

Some Juniper equipment relies on DHCP for ZTP as well, and maybe there are other uses of DHCP. Any idea if anything else relies on DHCP too?

Thu, Apr 18, 10:17 AM · Infrastructure-Foundations
Volans added a comment to T360029: Integrate dbctl IP changes as part of VLAN changes. .

@Marostegui As it turns out, plain old confctl can be used to do this already.

Thu, Apr 18, 9:50 AM · conftool, Data-Persistence, SRE, Infrastructure-Foundations
Volans added a comment to T362786: Enable dbctl for parsercache.

I think that treating them as x2 with "omit_replicas_in_mwconfig": true might just work. The spares could be set either as candidate masters for each section or just as simple replicas for each section, given the above config. Upon promotion of a spare to master in one section, it should be removed from all the other sections and replaced with the new spare.

Thu, Apr 18, 9:32 AM · Infrastructure-Foundations, Data-Persistence, conftool

Tue, Apr 16

Volans triaged T362629: Allow interacting with Toolforge PuppetDB from wmcs-cookbooks as Low priority.
Tue, Apr 16, 2:01 PM · Cumin, cloud-services-team, Toolforge, Infrastructure-Foundations
Volans added a comment to T362629: Allow interacting with Toolforge PuppetDB from wmcs-cookbooks.

The change would not be very small: to make it general we would need to make cumin support multiple instances of each backend, each with its own settings, plus a way to select them via the query language. It would definitely be a breaking change for the existing config and query language.

Tue, Apr 16, 2:00 PM · Cumin, cloud-services-team, Toolforge, Infrastructure-Foundations
Volans closed T187709: Cumin feature idea: Prometheus backend as Declined.

Given the lack of interest in the last few years, closing it as declined. It can be re-opened if there is renewed interest in working on this.

Tue, Apr 16, 11:39 AM · Cumin, Infrastructure-Foundations
Volans set the point value for T207898: Cumin PuppetDB backend: allow to filter by last run metadata to 2.5.
Tue, Apr 16, 11:37 AM · Cumin, Infrastructure-Foundations
Volans set the point value for T197458: Cumin: add option when --batch=1 to skip deduplication to 2.
Tue, Apr 16, 11:36 AM · Cumin, Infrastructure-Foundations
Volans set the point value for T355811: Feature request: When cumin is running with -b (and -s), it should display the current host being affected to 2.5.

I see only one case where the implementation is straightforward and clean on the UI side, the one with --batch 1.

Tue, Apr 16, 10:24 AM · SRE, Cumin, Infrastructure-Foundations
Volans set the point value for T244840: Evaluate options for non-root operations with cumin and spicerack cookbooks to 5.

Cumin is currently working with the running user from the cuminunpriv1001 host (after a kinit) towards kerberized hosts, like for example the install hosts.

Tue, Apr 16, 9:43 AM · SRE-tools, Spicerack, Infrastructure-Foundations, SRE
Volans set the point value for T212783: cumin: Make output path sane and flexible (was: allow to suppress output and progress bars) to 4.
Tue, Apr 16, 9:36 AM · Cumin, Infrastructure-Foundations
Volans set the point value for T213296: Cumin: batch_sleep is waited after last execution in some cases to 1.
Tue, Apr 16, 9:33 AM · Cumin, Infrastructure-Foundations
Volans set the point value for T179816: Cumin: create external backend for WMCS Puppet API to 3.
Tue, Apr 16, 9:29 AM · Cumin, cloud-services-team, Infrastructure-Foundations
Volans moved T179816: Cumin: create external backend for WMCS Puppet API from Backlog to On hold on the Cumin board.
Tue, Apr 16, 9:29 AM · Cumin, cloud-services-team, Infrastructure-Foundations
Volans changed the point value for T205900: Cumin: add backend for Netbox from 0.1 to 3.
Tue, Apr 16, 9:29 AM · Cumin, Infrastructure-Foundations, Patch-For-Review, netbox, SRE
Volans set the point value for T205900: Cumin: add backend for Netbox to 0.1.
Tue, Apr 16, 9:20 AM · Cumin, Infrastructure-Foundations, Patch-For-Review, netbox, SRE
Volans moved T205900: Cumin: add backend for Netbox from Backlog to Nice to have on the Cumin board.
Tue, Apr 16, 9:18 AM · Cumin, Infrastructure-Foundations, Patch-For-Review, netbox, SRE
Volans closed T164587: cumin could use randomization/splay options as Declined.

See also T224097 for a similar use case.

Tue, Apr 16, 9:05 AM · Cumin, Infrastructure-Foundations, SRE
Volans merged T224097: Make spicerack / cumin cluster aware into T164587: cumin could use randomization/splay options.
Tue, Apr 16, 9:03 AM · Cumin, Infrastructure-Foundations, SRE
Volans merged task T224097: Make spicerack / cumin cluster aware into T164587: cumin could use randomization/splay options.
Tue, Apr 16, 9:03 AM · Cumin, Infrastructure-Foundations
Volans added a comment to T224097: Make spicerack / cumin cluster aware.

The underlying logic is very similar to T164587, I'm merging this into that one.

Tue, Apr 16, 9:02 AM · Cumin, Infrastructure-Foundations
Volans closed T222480: Cumin leading zeros in host grouping alter hostname as Declined.

The issue has been solved upstream in v1.9.0 and the fix is included in the version in Debian bookworm. Nothing to do on our side specifically; it will automatically get fixed once the underlying hosts are upgraded.

Tue, Apr 16, 9:01 AM · Cumin, Infrastructure-Foundations, Upstream
Volans merged task T325773: Cumin/Openstack: multi-project commands are extremely slow into T346453: [cumin] [openstack] Openstack backend fails when project is not set.
Tue, Apr 16, 8:55 AM · Cumin, Cloud-VPS, cloud-services-team, Patch-For-Review, Infrastructure-Foundations
Volans added a comment to T325773: Cumin/Openstack: multi-project commands are extremely slow.

Merging this with T346453 as the testing plan outlined in T346453#9713036 will also cover this use case.

Tue, Apr 16, 8:55 AM · Cumin, Cloud-VPS, cloud-services-team, Patch-For-Review, Infrastructure-Foundations
Volans merged T325773: Cumin/Openstack: multi-project commands are extremely slow into T346453: [cumin] [openstack] Openstack backend fails when project is not set.
Tue, Apr 16, 8:53 AM · cloud-services-team (FY2023/2024-Q3-Q4), Patch-For-Review, Infrastructure-Foundations, Cloud-VPS, Cumin

Mon, Apr 15

Volans closed T342130: Unrelated DNS diffs shown if decommission and makevm cookbooks run at the same time as Resolved.

With exclusive locking now in place for the sre.dns.netbox cookbook I think we can consider this resolved.

Mon, Apr 15, 3:50 PM · Spicerack, SRE-tools, Infrastructure-Foundations
Volans closed T342130: Unrelated DNS diffs shown if decommission and makevm cookbooks run at the same time, a subtask of T341973: Spicerack: add distributed locking support, as Resolved.
Mon, Apr 15, 3:50 PM · Patch-For-Review, Infrastructure-Foundations, SRE-tools, Spicerack
Volans removed projects from T225694: Create cookbook to do `nodetool repair` across cassandra cluster: Spicerack, Infrastructure-Foundations.
Mon, Apr 15, 3:43 PM · Cassandra, SRE-tools, User-Joe, SRE
Volans removed projects from T203943: Spicerack cookbooks TODO list: Spicerack, Infrastructure-Foundations.
Mon, Apr 15, 3:42 PM · SRE-tools, User-jijiki, User-Joe
Volans added a comment to T315560: spicerack.dnsdisc.Discovery should not allow pooling active/passive services in both datacenters.

@JMeybohm Is this something still needed?

Mon, Apr 15, 3:42 PM · Infrastructure-Foundations, SRE-tools, Spicerack
Volans removed projects from T203948: Covert deploy_apache_change.sh to a spicerack cookbook: Spicerack, Infrastructure-Foundations.
Mon, Apr 15, 3:40 PM · SRE-tools, User-Joe
Volans removed projects from T203944: Create a spicerack cookbook for restoring an etcd cluster from backups: Spicerack, Infrastructure-Foundations.
Mon, Apr 15, 3:40 PM · SRE-tools, User-jijiki, User-Joe
Volans removed projects from T282775: Revert workaround for cumin output verbosity on RemoteExecution (CuminExecution) abstraction: Cumin, Infrastructure-Foundations.

Removing the cumin and I/F tags as there isn't anything pending from this side.

Mon, Apr 15, 3:31 PM · User-Kormat, Data-Persistence-Backup, DBA
Volans closed T345337: spicerack: tox fails to install PyYAML using python 3.11 on bookworm as Resolved.

Resolving then, thanks to all who contributed to the fix! Feel free to re-open if there is still any related issue for 3.11. For 3.12 we have a different one tracked in T354410.

Mon, Apr 15, 2:09 PM · cloud-services-team (FY2023/2024-Q3-Q4), Patch-For-Review, Infrastructure-Foundations, SRE-tools, Spicerack
Volans closed T345337: spicerack: tox fails to install PyYAML using python 3.11 on bookworm, a subtask of T348726: [wmcs-cookbooks] tox is failing, as Resolved.
Mon, Apr 15, 2:08 PM · User-dcaro, cloud-services-team (FY2023/2024-Q1-Q2), Cloud-VPS
Volans added a comment to T345337: spicerack: tox fails to install PyYAML using python 3.11 on bookworm.

With the above patch I think the issue should be solved and we can resolve the task. Could anyone try to repro it again?

Mon, Apr 15, 1:29 PM · cloud-services-team (FY2023/2024-Q3-Q4), Patch-For-Review, Infrastructure-Foundations, SRE-tools, Spicerack
Volans renamed T346722: Sao Paulo, Brazil, South America POP tracking task from Sao Paulo, Brazil, South America POP tracking tack to Sao Paulo, Brazil, South America POP tracking task.
Mon, Apr 15, 10:33 AM · ops-magru, Patch-For-Review

Wed, Apr 10

Volans claimed T361306: Decommission cookbook: stop when user inputs "abort".
Wed, Apr 10, 1:32 PM · Patch-For-Review, Infrastructure-Foundations, Data-Platform-SRE
Volans placed T361647: Remove elasticsearch-curator dependency from Spicerack/Elastic cookbooks up for grabs.

De-assigning it from me as Brian is working on this.

Wed, Apr 10, 1:28 PM · Data-Platform-SRE (2024.04.15 - 2024.05.05), Patch-For-Review, cloud-services-team (FY2023/2024-Q3-Q4), Infrastructure-Foundations, SRE-tools, Spicerack
Volans added a comment to T351418: Upgrade from ISC-DHCP Server to KEA-DHCP Server.

Thanks for the summary, well outlined. I've spoken a bit with Arzhel and I think that the general idea of using Netbox data in a more streamlined way for the DHCP is sound. There are some comments/concerns/caveats that I would like to highlight, but nothing is a hard blocker:

Wed, Apr 10, 9:31 AM · Infrastructure-Foundations

Tue, Apr 9

Volans closed T282019: sre.hosts.decommission: don't FAIL when unable to set icinga downtime as Declined.

This specific failure is due to the special nature of the secondary Icinga host, which is not monitored by Icinga. The downtime is already performed best-effort by the cookbook. The issue should instead be solved in the Icinga puppetization, in T362137. Closing as declined.

Tue, Apr 9, 7:46 AM · Infrastructure-Foundations, SRE-tools
Volans created T362137: Icinga secondary host is not monitored.
Tue, Apr 9, 7:44 AM · SRE Observability, Icinga, observability
Volans added a comment to T358636: etcdmirror does not recover from a cleared waitIndex.

Thanks a lot for the detailed plan outline. The plan looks sane to me, I agree that the in-place migration is probably the less risky path.
Just one nit, we need to give plenty of advance notice to avoid long-running scripts that might touch conftool such as long-running cookbooks and long-running DBAs scripts that call dbctl at random times.

Tue, Apr 9, 7:27 AM · Patch-For-Review, serviceops

Mon, Apr 8

Volans closed T201317: wmf-auto-reimage: 'execution expired' on first puppet run as Declined.

Too much time has passed since then and it doesn't seem to happen anymore.

Mon, Apr 8, 3:08 PM · User-ema, Infrastructure-Foundations, SRE, SRE-tools
Volans closed T260077: netbox dumps: fix permissions and timestamp as Resolved.

Since the last update we've removed the Netbox CSV dumps altogether. Resolving.

Mon, Apr 8, 2:48 PM · Infrastructure-Foundations, netbox, SRE-tools
Volans claimed T361647: Remove elasticsearch-curator dependency from Spicerack/Elastic cookbooks.
Mon, Apr 8, 2:36 PM · Data-Platform-SRE (2024.04.15 - 2024.05.05), Patch-For-Review, cloud-services-team (FY2023/2024-Q3-Q4), Infrastructure-Foundations, SRE-tools, Spicerack
Volans claimed T361218: spicerack.puppet.PuppetHostsError: Unable to find CSR fingerprints for all hosts, detected errors are: Another puppet instance is already running and the waitforlock setting is set to 0; exiting.
Mon, Apr 8, 2:32 PM · SRE-tools, Infrastructure-Foundations, Spicerack, Cloud-VPS
Volans added a comment to T361525: Degraded RAID on elastic2088.

The host is alerting in Icinga, should it be downtimed?

Mon, Apr 8, 7:56 AM · ops-codfw, Data-Platform-SRE (2024.04.15 - 2024.05.05)

Thu, Apr 4

Volans added a comment to T361762: Improve etcdmirror shutdown behavior.

nice finding!

Thu, Apr 4, 8:19 AM · Patch-For-Review, serviceops

Wed, Apr 3

Volans added a comment to T358636: etcdmirror does not recover from a cleared waitIndex.

Wow, that was quite an investigation for a /test key, thanks for the thorough analysis. As for the test2 value that could have been me when deploying the spicerack locks. I have in my bash history from the now defunct cumin1001 this for example:

sudo etcdctl -C https://conf1008.eqiad.wmnet:4001 set /test/volans '{"test": "value"}'

that, although not the same, might have failed if test was a key and not a directory (as it looks like) and I might have retried it with different values.
This is just to say that I think it's safe to remove /test.

Wed, Apr 3, 9:38 AM · Patch-For-Review, serviceops

Tue, Apr 2

Volans claimed T360293: Spicerack puppetserver.destroy() raises an exception when certificate does not exist.
Tue, Apr 2, 8:56 PM · Patch-For-Review, Infrastructure-Foundations, Puppet (Puppet 7.0), SRE-tools, Spicerack
Volans added a comment to T360293: Spicerack puppetserver.destroy() raises an exception when certificate does not exist.

I've sent a proposed implementation in the patch above.

Tue, Apr 2, 8:56 PM · Patch-For-Review, Infrastructure-Foundations, Puppet (Puppet 7.0), SRE-tools, Spicerack
Volans added a comment to T345337: spicerack: tox fails to install PyYAML using python 3.11 on bookworm.

@bking I'm not sure what you mean. As mentioned earlier in T345337#9658807 Debian has v5.8.1 for python3-elasticsearch-curator and that's the current version used in production. The dependency in setup.py is defined as elasticsearch-curator~=5.0.

Tue, Apr 2, 8:17 PM · cloud-services-team (FY2023/2024-Q3-Q4), Patch-For-Review, Infrastructure-Foundations, SRE-tools, Spicerack
Volans added a comment to T361604: "FAIL: debmonitor-client" Email Alerts for db2202.codfw.wmnet.

This is a duplicate of T355422#9664626

Tue, Apr 2, 3:19 PM · SecTeam-Processed, DBA

Fri, Mar 29

Volans closed T353558: Re-images sometimes fail as the cert request goes to the wrong puppet master as Resolved.

Many things have changed since December on the Puppet7 migration and I don't think we're seeing the same issue anymore. Tentatively resolving it, feel free to re-open if it happens again.

Fri, Mar 29, 9:11 AM · SRE-tools, Puppet-Infrastructure, Puppet (Puppet 7.0), Infrastructure-Foundations, SRE
Volans closed T353558: Re-images sometimes fail as the cert request goes to the wrong puppet master, a subtask of T348319: Update reimage cookbooks to work with puppet7, as Resolved.
Fri, Mar 29, 9:11 AM · Patch-For-Review, SRE-tools, Puppet-Infrastructure, Puppet (Puppet 7.0), Infrastructure-Foundations, SRE
Volans added a comment to T361306: Decommission cookbook: stop when user inputs "abort".

I've looked at the logs and the code, some clarification/questions/comments:

  1. because the cookbook was prompting the user, it means it was already stopped, waiting for user input. If no answer had been entered, the cookbook would have stayed there doing nothing, allowing the operator to investigate the situation.
  2. the abort there is to interrupt the execution of that command by raising an exception; what a cookbook does after that is outside the scope of the confirmation prompt. In particular, that confirmation was about whether or not to commit the interface changes on the switch.
    • Was the decommission cookbook re-run on the aborted host (elastic2050) after the incident was resolved to ensure all the decom steps were performed?
  3. the current implementation of the decommission cookbook executes the decom on all selected hosts, catching any exception from the single host run and reporting them at the end. It could of course be changed, maybe to prompt the user for confirmation on whether to continue on error.
    • The current implementation doesn't catch a Ctrl+c, so that would have interrupted the cookbook execution altogether.
  4. the decommission cookbook performs destructive actions and as such already has various warnings and prompts to the user to make sure it is not run on the wrong hosts. The incident report (as of now) doesn't clarify why the cookbook was run on the wrong hosts and what could have prevented it.
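
As a rough illustration of point 3, here is a minimal sketch of the "run on every host, collect the failures, report at the end" pattern. The names decommission_all and fake_decommission and the host names are hypothetical and are not taken from the actual cookbook.

```python
def decommission_all(hosts, decommission_host):
    """Run decommission_host on each host, collecting failures instead of stopping."""
    failures = {}
    for host in hosts:
        try:
            decommission_host(host)
        # KeyboardInterrupt derives from BaseException, not Exception,
        # so a Ctrl+C is not caught here and still stops the whole run.
        except Exception as error:
            failures[host] = error
    return failures


def fake_decommission(host):
    """Hypothetical stand-in for the per-host decommission steps."""
    if host == "host2":
        raise RuntimeError("aborted by user")


failures = decommission_all(["host1", "host2", "host3"], fake_decommission)
for host, error in failures.items():
    print(f"{host}: {error}")
```

Prompting the operator on error would replace the plain `failures[host] = error` bookkeeping with a confirmation step before moving on to the next host.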
Fri, Mar 29, 8:57 AM · Patch-For-Review, Infrastructure-Foundations, Data-Platform-SRE
Volans removed a project from T361218: spicerack.puppet.PuppetHostsError: Unable to find CSR fingerprints for all hosts, detected errors are: Another puppet instance is already running and the waitforlock setting is set to 0; exiting: SRE-tools.
Fri, Mar 29, 8:34 AM · SRE-tools, Infrastructure-Foundations, Spicerack, Cloud-VPS
Restricted Application added a project to T361218: spicerack.puppet.PuppetHostsError: Unable to find CSR fingerprints for all hosts, detected errors are: Another puppet instance is already running and the waitforlock setting is set to 0; exiting: Infrastructure-Foundations.

Yeah, it's clearly a race condition that could be solved in both places (cookbook or spicerack), no strong opinion.
The problem is that from the current code it seems that puppet doesn't clearly return a proper exit code that could help understand the problem and parsing the output is brittle.
We could add an @retry with a few attempts or check the puppet lock file on error.
For context, the certificate regeneration is seldom run in production; is it run more frequently in WMCS?
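
To make the @retry idea concrete, here is a minimal, generic sketch of a retry decorator. It is not Spicerack's own decorator; PuppetLockError, regenerate_certificate, and the parameter names are assumptions made for illustration only.

```python
import functools
import time


class PuppetLockError(Exception):
    """Hypothetical error for 'another puppet instance is already running'."""


def retry(tries=3, delay=0.0, exceptions=(Exception,)):
    """Retry the decorated function up to `tries` times on the given exceptions."""
    def decorator(func):
        @functools.wraps(func)
        def wrapper(*args, **kwargs):
            for attempt in range(1, tries + 1):
                try:
                    return func(*args, **kwargs)
                except exceptions:
                    if attempt == tries:
                        raise  # Out of attempts: surface the last error.
                    time.sleep(delay)
        return wrapper
    return decorator


calls = {"count": 0}


@retry(tries=3, delay=0.0, exceptions=(PuppetLockError,))
def regenerate_certificate():
    """Stand-in for the real operation: fails twice, then succeeds."""
    calls["count"] += 1
    if calls["count"] < 3:
        raise PuppetLockError("Another puppet instance is already running")
    return "fingerprint-placeholder"


result = regenerate_certificate()
```

Retrying only on a dedicated exception type keeps unrelated failures visible, but it still depends on being able to tell the lock case apart from other errors.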

Fri, Mar 29, 8:34 AM · SRE-tools, Infrastructure-Foundations, Spicerack, Cloud-VPS

Mar 28 2024

Volans added a comment to T358506: Fix the Alert hosts Puppet catalogue to be compatible with Puppet 7.

Sorry, ignore my previous comments, there was some misunderstanding:

Mar 28 2024, 9:41 AM · Infrastructure-Foundations, Patch-For-Review, SRE Observability (FY2023/2024-Q3), SRE
Volans added a comment to T358506: Fix the Alert hosts Puppet catalogue to be compatible with Puppet 7.

Yes, but to which endpoint is it trying to connect? Please try to use puppetdb-api.discovery.wmnet:8090 and let us know if that works or not (that's a proxy that allows only some queries and not others, so it might need tweaking based on which queries naggen does).

Mar 28 2024, 8:54 AM · Infrastructure-Foundations, Patch-For-Review, SRE Observability (FY2023/2024-Q3), SRE

Mar 27 2024

Volans added a comment to T358506: Fix the Alert hosts Puppet catalogue to be compatible with Puppet 7.

The premise seems to mix different things. PuppetDB is a totally separate service from the PuppetMaster/PuppetServer ones and runs on its own hosts. Are you saying that naggen fails to connect to PuppetDB?

Mar 27 2024, 5:08 PM · Infrastructure-Foundations, Patch-For-Review, SRE Observability (FY2023/2024-Q3), SRE
Volans added a comment to T353878: Service implementation for elastic2087-2109.

elastic2088 is unreachable and reported as missing from PuppetDB by the Netbox report. No host should be powered on with puppet disabled or not working for a long period of time. Please either reimage it or shut it down now and reimage it at a later stage (before powering it on).

Mar 27 2024, 10:18 AM · Data-Platform-SRE (2024.03.25 - 2024.04.14), Patch-For-Review
Volans added a comment to T358882: Decommission elastic2037-2054.

elastic2037 is reported by Netbox as no longer being in puppetdb, please either decommission it or shut it down. No host should be powered on without puppet running for an extended period of time.

Mar 27 2024, 10:13 AM · Data-Platform-SRE (2024.03.25 - 2024.04.14)
Volans added a comment to T355422: Productionize db2196-db2220.

What's the status of db2202? It has had puppet disabled for 22 days! Puppet should never be disabled for long periods, and now it's gone from puppetdb/monitoring/everything; it is a ghost host only reported by a Netbox report and is also spamming root@ daily due to an expired cert for the debmonitor client.

Mar 27 2024, 9:25 AM · database-backups, Patch-For-Review, DBA

Mar 26 2024

Volans added a comment to T348036: sre.hardware.upgrade-firmware cookbook: product slug parsing.

@BTullis indeed, that's another new device type created with the wrong slug. I've updated the slug in Netbox to fix it.

Mar 26 2024, 1:10 PM · netbox, DC-Ops, SRE, Infrastructure-Foundations

Mar 25 2024

Volans added a comment to T345337: spicerack: tox fails to install PyYAML using python 3.11 on bookworm.

Will you take care also of debian packaging it and any required dependencies?
Because spicerack is deployed with debian packages and upstream debian has 5.8.1 as the most recent release (not sure if it has anything to do with the licencing) and 7.17.6 for python3-elasticsearch.

Mar 25 2024, 6:04 PM · cloud-services-team (FY2023/2024-Q3-Q4), Patch-For-Review, Infrastructure-Foundations, SRE-tools, Spicerack
Volans added a comment to T360029: Integrate dbctl IP changes as part of VLAN changes. .

Yes, correct.

Mar 25 2024, 3:54 PM · conftool, Data-Persistence, SRE, Infrastructure-Foundations

Mar 21 2024

Volans added a comment to T345337: spicerack: tox fails to install PyYAML using python 3.11 on bookworm.

We're going to upgrade curator (as well as its library) soon, as it's causing other problems (see T345337 ).

Mar 21 2024, 10:00 PM · cloud-services-team (FY2023/2024-Q3-Q4), Patch-For-Review, Infrastructure-Foundations, SRE-tools, Spicerack

Mar 18 2024

Volans added a comment to T360293: Spicerack puppetserver.destroy() raises an exception when certificate does not exist.

We do have get_certificate_metadata() that raises spicerack.puppet.PuppetServerCheckError if the cert is not found (as opposed to other errors).
What I was suggesting is that we could do that check directly in destroy() in the puppetserver class so that it behaves the same as the old puppetmaster one.
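
A minimal sketch of that suggestion, assuming a simplified stand-in class: FakePuppetServer, its constructor, and the method signatures are hypothetical and do not reproduce the real Spicerack code. Only the idea of calling get_certificate_metadata() first and treating the "not found" error as a no-op comes from the comment above.

```python
class PuppetServerCheckError(Exception):
    """Stand-in for spicerack.puppet.PuppetServerCheckError (cert not found)."""


class FakePuppetServer:
    """Hypothetical in-memory model of a Puppet server CA."""

    def __init__(self, certs):
        self._certs = set(certs)
        self.cleaned = []

    def get_certificate_metadata(self, hostname):
        """Return metadata for the cert, raising if it does not exist."""
        if hostname not in self._certs:
            raise PuppetServerCheckError(f"No certificate found for {hostname}")
        return {"name": hostname}

    def destroy(self, hostname):
        """Clean the cert for hostname; do nothing if there is no cert."""
        try:
            self.get_certificate_metadata(hostname)
        except PuppetServerCheckError:
            return  # Nothing to clean, so this is not an error.
        # Any other failure from here on is a real error and propagates.
        self._certs.discard(hostname)
        self.cleaned.append(hostname)


server = FakePuppetServer(["host1.example.org"])
server.destroy("host1.example.org")  # Existing cert: cleaned.
server.destroy("host2.example.org")  # Missing cert: silently skipped.
```

Checking for existence up front avoids having to interpret the exit code or output of the clean command to distinguish the two cases.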

Mar 18 2024, 3:22 PM · Patch-For-Review, Infrastructure-Foundations, Puppet (Puppet 7.0), SRE-tools, Spicerack
Volans created T360297: Take advantage of 10Gb NICs in the new network stack.
Mar 18 2024, 11:45 AM · Infrastructure-Foundations, DC-Ops, netops
Volans triaged T360293: Spicerack puppetserver.destroy() raises an exception when certificate does not exist as Medium priority.

That's indeed the current behaviour and clearly an error, thanks for reporting it!
The exit codes of the puppetserver ca clean command are not documented in Puppet, or at least I couldn't find them in the public docs/manpage/help messages/source code.
Ideally puppetserver should report two different sets of errors, the ones in which there is a certificate but it failed to perform some cleaning operations and the ones where the certificate does not exist at all, but that doesn't seem to be the case.
Given that it doesn't, I think we shouldn't rely on specific output messages of the CLI and exit codes, as that could hide other errors now or in the future.

Mar 18 2024, 11:17 AM · Patch-For-Review, Infrastructure-Foundations, Puppet (Puppet 7.0), SRE-tools, Spicerack

Mar 7 2024

Volans added a comment to T358506: Fix the Alert hosts Puppet catalogue to be compatible with Puppet 7.

But is this task still valid? The alert hosts were migrated to bookworm this week and puppet is running fine there.

Mar 7 2024, 11:18 AM · Infrastructure-Foundations, Patch-For-Review, SRE Observability (FY2023/2024-Q3), SRE

Mar 6 2024

Volans committed rOSNEa50a99d2f510: validators: fix existing bugs.
validators: fix existing bugs
Mar 6 2024, 12:31 PM
Volans committed rOSNEd8ec116f3eea: validators: improve IPs DNS name validation.
validators: improve IPs DNS name validation
Mar 6 2024, 12:26 PM
Volans committed rOSNEfdf49f4d8dd2: validators: add field name to fail messages.
validators: add field name to fail messages
Mar 6 2024, 12:26 PM
Volans created T359326: Inconsistent data in Netbox for some msw device.
Mar 6 2024, 12:21 PM · SRE, ops-eqiad, DC-Ops
Volans triaged T359320: Set MTU on mr1 interfaces as Low priority.
Mar 6 2024, 12:13 PM · Infrastructure-Foundations, netops

Mar 5 2024

Volans closed T355343: Q3:rack/setup/install es[2035-2040] as Resolved.

Got the list of affected hosts with nodeset -S '","' -e "es20[35-40]" on a cumin host, then I ran the following code on Netbox:

>>> import uuid
>>> request_id = uuid.uuid4()
>>> user = User.objects.get(username='volans')
>>> def update(d):
...     ip = d.primary_ip6
...     ip.dns_name = ""
...     ip.save()
...     log = ip.to_objectchange('update')
...     log.request_id = request_id
...     log.user = user
...     log.save()
...
>>> devices = Device.objects.filter(name__in=["es2035","es2036","es2037","es2038","es2039","es2040"])
>>> len(devices)
6
>>> [d.name for d in devices]
['es2035', 'es2036', 'es2037', 'es2038', 'es2039', 'es2040']
>>> for device in devices:
...     update(device)
...
Mar 5 2024, 8:58 PM · SRE, Data-Persistence, ops-codfw, DC-Ops
Volans reopened T355343: Q3:rack/setup/install es[2035-2040] as "Open".

Re-opening as AAAA records were erroneously added to the hosts (AAAA records:N). I'll remove them programmatically.

Mar 5 2024, 8:43 PM · SRE, Data-Persistence, ops-codfw, DC-Ops
Volans added a comment to T358636: etcdmirror does not recover from a cleared waitIndex.

This got me thinking: if we're not really interested in what is in /spicerack, we could add to etcdmirror the ability to watch a keyspace but ignore some sub-keyspaces in replication.
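
A minimal sketch of what such a filter could look like, assuming keys are plain slash-separated paths. The function should_replicate and its parameters are hypothetical and are not part of etcdmirror.

```python
def should_replicate(key, watched_prefix, ignored_prefixes):
    """Return True if key is under watched_prefix but not under any ignored prefix."""
    def under(prefix):
        # Match on whole path components, so /spicerack does not match /spicerackish.
        prefix = prefix.rstrip("/")
        return key == prefix or key.startswith(prefix + "/")

    if not under(watched_prefix):
        return False
    return not any(under(ignored) for ignored in ignored_prefixes)


# Watch the whole keyspace but skip everything under /spicerack.
replicated = should_replicate("/conftool/v1/example", "/", ["/spicerack"])
skipped = should_replicate("/spicerack/locks/example", "/", ["/spicerack"])
```

In this sketch the ignored keys are simply never written to the destination; how the mirror would track its position in the source when it skips events is left out.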

Mar 5 2024, 10:15 AM · Patch-For-Review, serviceops

Mar 4 2024

Volans added a comment to T358542: Netbox errors caused by system board replacement .

Sounds good to me, let me know once done so that I can make the related changes to the report to include those too.

Mar 4 2024, 11:13 PM · SRE, ops-codfw
Volans triaged T358581: Routinator: CVE-2024-1622 as Medium priority.
Mar 4 2024, 3:33 PM · Vuln-VulnComponent, SecTeam-Processed, netops, Infrastructure-Foundations, Infrastructure Security, Security
Volans added a comment to T358542: Netbox errors caused by system board replacement .

@wiki_willy yes, if we go that way then I guess a separate tab on the accounting sheet with both asset tags (chassis and motherboard), compiled only for the hosts that have had the motherboard replaced but the asset tag not reset, should be enough information for the report to be adapted to include that information.

Mar 4 2024, 2:18 PM · SRE, ops-codfw

Mar 2 2024

Volans added a comment to T358809: Netbox:Report:PhysicalHosts: mistmach model issue.

FYI I've re-renamed PowerEdge R450 - Restbase-1G to PowerEdge R450 - ConfigRestbase-1G or we'd have issues in the firmware upgrades as outlined in T348036.

Mar 2 2024, 12:57 PM · Infrastructure-Foundations, DC-Ops, netbox
Volans added a comment to T345337: spicerack: tox fails to install PyYAML using python 3.11 on bookworm.

@Volans I sent you an invite next week to pair on this. Hopefully we should be able to figure this out before our offsite (starting 11 March).

Mar 2 2024, 12:44 PM · cloud-services-team (FY2023/2024-Q3-Q4), Patch-For-Review, Infrastructure-Foundations, SRE-tools, Spicerack

Mar 1 2024

Volans added a comment to T358825: Fix requestctl naming collision on "sites".

@RLazarus please update also requestctl-generator when you do it.
I've added an item to the task description. At the moment, because superset is in the process of being migrated to k8s, the requestctl-generator file lives in two different places and needs to be modified in both places; the current prod one will disappear soon though.

Mar 1 2024, 8:31 AM · Traffic, conftool
Volans updated the task description for T358825: Fix requestctl naming collision on "sites".
Mar 1 2024, 8:27 AM · Traffic, conftool

Feb 29 2024

Volans added a comment to T358542: Netbox errors caused by system board replacement .

I don't think there is a clean solution if the iDrac doesn't allow overriding the value on the motherboard when done outside of warranty.
We could check if there is a way on the host to get both values and decide which one we want to export.
But if there isn't, then we'll need to keep both old and new values in at least one place to make sure we can cross-check them. That place IMHO could be either the accounting spreadsheet or Netbox, and then the report will be modified accordingly.

Feb 29 2024, 11:58 PM · SRE, ops-codfw
Volans placed T358809: Netbox:Report:PhysicalHosts: mistmach model issue up for grabs.

There is no mapping, the reported device types are just not following the correct naming scheme, as you can see here comparing with the others: https://netbox.wikimedia.org/dcim/device-types/?q=PowerEdge and as previously discussed in T348036

Feb 29 2024, 9:53 PM · Infrastructure-Foundations, DC-Ops, netbox
Volans added a comment to T358636: etcdmirror does not recover from a cleared waitIndex.

Of course I don't want to add additional lag to any live-traffic data (pybal, mwconfig, dbctl) and if we deem that adding spicerack locks to the replication might cause that, let's find another solution. For example, we could have a failover etcd cookbook that, when run, will read the active locks from the primary cluster and explicitly replicate them manually on the secondary one. Or any other viable option.

Feb 29 2024, 8:19 AM · Patch-For-Review, serviceops

Feb 28 2024

Volans added a comment to T347054: Simplify maintenance of DNS/NTP hosts to reduce toil around reboots, reimages, and other work.

Even from itself? As in what happens if an operator runs authdns-update on a depooled host?

The NAMESERVERS list that is populated by confd affects only the hosts to which we SSH and run authdns-update, not the host itself. So if you depool dns1004 and run authdns-update from there, nothing changes. If you run authdns-update from dns1005 (or anywhere else), it won't touch dns1004. On my end, I think this behaviour makes sense. But is it fine from your perspective of automation and cookbooks?

Feb 28 2024, 6:33 PM · Traffic
Volans added a comment to T347054: Simplify maintenance of DNS/NTP hosts to reduce toil around reboots, reimages, and other work.

Do we need to change anything on the sre.dns.netbox cookbook?
It currently runs:

cd {git} && utils/deploy-check.py -g {netbox} --deploy
Feb 28 2024, 6:27 PM · Traffic
Volans added a comment to T347054: Simplify maintenance of DNS/NTP hosts to reduce toil around reboots, reimages, and other work.

Even from itself? As in what happens if an operator runs authdns-update on a depooled host?

Feb 28 2024, 6:21 PM · Traffic
Volans added a comment to T358636: etcdmirror does not recover from a cleared waitIndex.

But a second instance wouldn't prevent the current issue, right?

Feb 28 2024, 3:46 PM · Patch-For-Review, serviceops
Volans created T358648: SystemdUnitFailed alert aggregation issues.
Feb 28 2024, 10:12 AM · SRE, Observability-Alerting
Volans added a comment to T358636: etcdmirror does not recover from a cleared waitIndex.

That etcdmirror is mirroring only the /conftool keys is totally news to me; I assumed it was replicating the whole content of etcd. But indeed it does not:

Feb 28 2024, 8:51 AM · Patch-For-Review, serviceops

Feb 27 2024

Volans added a comment to T358542: Netbox errors caused by system board replacement .

If updating the Accounting sheet is acceptable, I can do that. I will also update the servers with journal notes to keep track of what has been changed with which device.

Feb 27 2024, 5:56 PM · SRE, ops-codfw
Volans added a comment to T358594: Remove IPV6 dns records from new database hosts.

I've checked all the devices with names starting in db and es and the only ones with IPv6 AAAA records are: dbprov1004 and dbprov2004

Feb 27 2024, 3:35 PM · Data-Persistence, DC-Ops
Volans added a comment to T358594: Remove IPV6 dns records from new database hosts.

Cleanup completed, leaving the task open for DCOps to prevent this from happening.

Feb 27 2024, 3:21 PM · Data-Persistence, DC-Ops