
Volans (Riccardo Coccioli)
SRE

Projects (13)

User Details

User Since
Feb 10 2016, 11:25 AM (407 w, 3 d)
Availability
Available
IRC Nick
volans
LDAP User
Volans
MediaWiki User
RCoccioli (WMF) [ Global Accounts ]

Recent Activity

Fri, Dec 1

Volans added a comment to T253173: Some clusters do not have DNS for IPv6 addresses (TRACKING TASK).

@MoritzMuehlenhoff I see that ganeti[2009-2024] and ganeti[1009-1022] are lacking AAAA records while the rest have it. Can we add them to the rest of the cluster?
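
A minimal sketch of how such a gap could be checked from a host with working DNS. The helper names are hypothetical, the domain is passed in by the caller, and the range expansion assumes plain numeric suffixes without zero padding:

```python
import re
import socket


def expand_range(expr):
    """Expand a simple 'name[start-end]' expression into a list of names."""
    match = re.fullmatch(r"([a-z-]+)\[(\d+)-(\d+)\]", expr)
    if match is None:
        return [expr]  # Not a range: treat it as a single name
    prefix, start, end = match.group(1), int(match.group(2)), int(match.group(3))
    return [f"{prefix}{number}" for number in range(start, end + 1)]


def has_ipv6(fqdn):
    """Return True if the name resolves to at least one IPv6 address."""
    try:
        return bool(socket.getaddrinfo(fqdn, None, family=socket.AF_INET6))
    except socket.gaierror:
        return False


def missing_ipv6(expr, domain):
    """List the FQDNs in the range that do not resolve over IPv6."""
    fqdns = [f"{name}.{domain}" for name in expand_range(expr)]
    return [fqdn for fqdn in fqdns if not has_ipv6(fqdn)]
```

Calling `missing_ipv6("ganeti[2009-2024]", domain)` with the appropriate domain would return the hosts in that range that still lack an IPv6 resolution.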

Fri, Dec 1, 5:15 PM · Infrastructure-Foundations, IPv6, User-jbond, netbox
Volans added a comment to T271140: Some Data Persistence clusters apparently do not support IPv6.

Any update for the ms-be cluster that is still mixed? Can it be migrated to all have IPv6?

Fri, Dec 1, 5:13 PM · Data-Persistence, IPv6
Volans added a project to T312555: Some Search clusters have inconsistent AAAA DNS records for the primary IPv6 of the hosts: Data-Engineering.
Fri, Dec 1, 5:08 PM · Data-Engineering, Discovery-Search, IPv6
Volans added a comment to T312555: Some Search clusters have inconsistent AAAA DNS records for the primary IPv6 of the hosts.

Any update on this? The cluster is still mixed with some hosts having AAAA records and some without.

Fri, Dec 1, 5:03 PM · Data-Engineering, Discovery-Search, IPv6
Volans added a comment to T271142: Some Service Operations clusters apparently do not support IPv6.

@akosiaris I see that:

  • mw[1349-1413]
  • mw[2259-2376]
  • mc[2042-2055]
  • parse[2001-2020]
Fri, Dec 1, 4:34 PM · Patch-For-Review, Infrastructure-Foundations, Dumps-Generation, IPv6, serviceops, SRE-tools

Thu, Nov 30

Volans added a comment to T349273: Puppet execution sometimes interrupted when running from PoPs.

@jhathaway ack, if we're not seeing any more failures in puppetboard let's close it and re-open in case they happen again.

Thu, Nov 30, 7:20 PM · Infrastructure-Foundations, Puppet-Infrastructure
Volans added a comment to T350615: Support ipv6 address or to bind to all (::).

I've commented on https://gerrit.wikimedia.org/r/c/mediawiki/core/+/972724/4/includes/poolcounter/PoolCounterConnectionManager.php#84 about what looks like the possible issue with the patch.

Thu, Nov 30, 4:24 PM · Patch-For-Review, MW-1.40-notes, MW-1.41-notes, MW-1.42-notes (1.42.0-wmf.7; 2023-11-28), PoolCounter, MediaWiki-Platform-Team
Volans renamed T352438: Unable to log in to SUL account from No one is able to log in to SUL account to Unable to log in to SUL account.
Thu, Nov 30, 3:33 PM · MediaWiki-Platform-Team, MediaWiki-extensions-CentralAuth
Volans lowered the priority of T352438: Unable to log in to SUL account from Unbreak Now! to High.

This doesn't seem to be a widespread login problem at this time. (lowering the priority)
All indications so far point to a rate-limiting issue with multiple people sharing the same public IP.

Thu, Nov 30, 3:32 PM · MediaWiki-Platform-Team, MediaWiki-extensions-CentralAuth

Wed, Nov 29

Volans added a comment to T347054: Simplify maintenance of DNS/NTP hosts to reduce toil around reboots, reimages, and other work.

@ssingh what's your timeline to switch to use this new method to get what DNS hosts are pooled? As you know we need to adjust spicerack/cookbooks accordingly.

Wed, Nov 29, 4:57 PM · Patch-For-Review, Traffic
Volans added a project to T348036: sre.hardware.upgrade-firmware cookbook: product slug parsing: netbox.
Wed, Nov 29, 11:20 AM · netbox, DC-Ops, SRE, Infrastructure-Foundations
Volans triaged T348036: sre.hardware.upgrade-firmware cookbook: product slug parsing as Low priority.

Given no objections I went ahead and fixed ALL names and slugs to adhere to the standard. Triaging as low and leaving the task open to add a validator later.

Wed, Nov 29, 11:20 AM · netbox, DC-Ops, SRE, Infrastructure-Foundations

Tue, Nov 28

Volans triaged T352163: cr2-esams Transit Tele2 down as High priority.
Tue, Nov 28, 11:34 AM · Infrastructure-Foundations, netops
Volans created T352163: cr2-esams Transit Tele2 down.
Tue, Nov 28, 11:33 AM · Infrastructure-Foundations, netops
Volans added a comment to T351891: Abstract a bit more the server provisioning process.

I had a quick thought about the ENC++ problem, as you have named it, and I think in the end, given a netbox device object (hostname + location + possibly other data) + hardware specs (auto-detection via Redfish?), we will need something to map this to:

  • Puppet role (currently in site.pp)
  • Hardware profile [BIOS virtualization + hardware RAID configuration] (currently manually set via cookbook argument and manually set up)
  • Network profile [VLAN, skip IPv6, cassandra IPs, etc...] (currently manually set via Netbox provision script arguments)
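
One way to picture such a mapping is a small lookup table. This is only an illustrative sketch: the class names, the role strings and the idea of keying on the hostname prefix are all assumptions, not an existing design:

```python
from dataclasses import dataclass, field


@dataclass(frozen=True)
class HardwareProfile:
    bios_virtualization: bool = False
    hardware_raid: bool = False


@dataclass(frozen=True)
class NetworkProfile:
    vlan: str = "private"
    skip_ipv6: bool = False


@dataclass(frozen=True)
class HostSpec:
    puppet_role: str
    hardware: HardwareProfile = field(default_factory=HardwareProfile)
    network: NetworkProfile = field(default_factory=NetworkProfile)


# Hypothetical table keyed by hostname prefix; the role names are made up
SPECS = {
    "db": HostSpec("example::database", HardwareProfile(hardware_raid=True)),
    "ganeti": HostSpec("example::virtualization", HardwareProfile(bios_virtualization=True)),
}


def spec_for(hostname):
    """Look up the spec for a hostname by stripping its numeric suffix."""
    prefix = hostname.rstrip("0123456789")
    return SPECS[prefix]  # Raises KeyError for unknown prefixes
```

With a table like this the provisioning tooling would only need the hostname to derive the role and both profiles, instead of taking them as manual arguments.
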
Tue, Nov 28, 11:18 AM · Infrastructure-Foundations, SRE-tools
Volans merged T318787: Investigate converting LBRemoteCluster cookbooks to SRELBBatchRunnerBase into T317855: Migrate existing cookbooks related to rolling restarts/reboots to SREBatchBase.
Tue, Nov 28, 10:11 AM · Spicerack, Infrastructure-Foundations, SRE-tools, SRE
Volans merged task T318787: Investigate converting LBRemoteCluster cookbooks to SRELBBatchRunnerBase into T317855: Migrate existing cookbooks related to rolling restarts/reboots to SREBatchBase.
Tue, Nov 28, 10:11 AM · Spicerack, Infrastructure-Foundations, SRE-tools, SRE
Volans added a comment to T318787: Investigate converting LBRemoteCluster cookbooks to SRELBBatchRunnerBase.

As all the above cookbooks are already listed in T317855 I'm resolving this as duplicate.

Tue, Nov 28, 10:11 AM · Spicerack, Infrastructure-Foundations, SRE-tools, SRE
Volans moved T319277: wait_for_optimal() should ignore acked alerts from Backlog to Easy Wins on the Spicerack board.
Tue, Nov 28, 10:09 AM · Infrastructure-Foundations, Spicerack, SRE-tools
Volans moved T328911: Expose hosts from MysqlLegacyRemoteHosts in spicerack from Backlog to Easy Wins on the Spicerack board.
Tue, Nov 28, 10:08 AM · Infrastructure-Foundations, SRE-tools, Spicerack, serviceops, Datacenter-Switchover, SRE
Volans moved T335879: spicerack.phabricator: Don't fail when logging to a restricted task from Backlog to Easy Wins on the Spicerack board.
Tue, Nov 28, 10:06 AM · Spicerack, Infrastructure-Foundations, SRE-tools
Volans triaged T311050: Allow to dry_run RemoteHosts.wait_reboot_since() and PuppetHosts.wait_since() as Medium priority.

Perfect, thanks for the update.

Tue, Nov 28, 10:05 AM · SRE-tools, Infrastructure-Foundations, Spicerack
Volans triaged T335879: spicerack.phabricator: Don't fail when logging to a restricted task as Low priority.

As the main blocker was resolved by giving more permissions to the bot in T314917, setting the priority lower for a general solution in the future.

Tue, Nov 28, 9:53 AM · Spicerack, Infrastructure-Foundations, SRE-tools
Volans closed T347093: [spicerack] Add remote command output to log file as Declined.

As there is already a workaround to do that in the cookbooks on demand and it will be even simpler with the cumin work mentioned, I'm declining this for now as it didn't get much traction. Happy to reopen it in the future if we feel it's necessary.

Tue, Nov 28, 9:41 AM · Infrastructure-Foundations, SRE-tools, cloud-services-team, Spicerack
Volans removed projects from T350565: Switch conftool to use the version 3 etcd datastore: Spicerack, SRE-tools.

Untagged sre-tools and spicerack as I've created the dedicated sub-tasks for them.

Tue, Nov 28, 9:33 AM · conftool, Infrastructure-Foundations, Data-Persistence, Traffic, serviceops
Volans created T352155: Spicerack: migrate distributed locking to etcd v3.
Tue, Nov 28, 9:31 AM · Infrastructure-Foundations, Spicerack, SRE-tools
Volans created T352153: Spicerack: adapt conftool module for etcd v3.
Tue, Nov 28, 9:30 AM · Infrastructure-Foundations, Spicerack, SRE-tools
Volans added a comment to T339243: ServiceLVS without monitor breaks spicerack.

We had only a couple of changes in the service.yaml schema in the last months and both were sent to Spicerack before hitting production on the Puppet side, so nothing broke in those cases.
What we were thinking is that, instead of refactoring the whole thing in spicerack, it might be simpler to have a CI check in puppet that verifies that all the fields are there.
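
A minimal sketch of what such a check could look like, assuming the catalog has already been loaded into a dictionary of service name to definition. The set of required field names below is hypothetical and would have to match what Spicerack actually reads:

```python
# Hypothetical list of fields; the real one depends on the schema in use
REQUIRED_FIELDS = frozenset({"description", "ip", "port", "state"})


def missing_fields(catalog, required=REQUIRED_FIELDS):
    """Map each service name to the sorted list of required fields it lacks."""
    problems = {}
    for name, definition in catalog.items():
        missing = sorted(set(required) - set(definition))
        if missing:
            problems[name] = missing
    return problems
```

A CI job could fail whenever `missing_fields()` returns a non-empty result, so an incomplete service definition is caught before it reaches production.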

Tue, Nov 28, 9:26 AM · Patch-For-Review, Infrastructure-Foundations, SRE-tools, Spicerack
Volans added a comment to T352128: No on-call page notification when shift override was set on November 27.

As we got an email from VO about unassigned overrides I think that the issue here is that only one rotation was assigned and not the one that actually pages:

Tue, Nov 28, 8:11 AM · Incident Tooling

Mon, Nov 27

Volans changed hashtags for SRE-tools, added #httpbb; removed #httpb.
Mon, Nov 27, 3:45 PM
Volans added a comment to T311050: Allow to dry_run RemoteHosts.wait_reboot_since() and PuppetHosts.wait_since() .

@JMeybohm could you confirm the above or give me more context?

Mon, Nov 27, 2:56 PM · SRE-tools, Infrastructure-Foundations, Spicerack
Volans placed T351950: taavi's netbox-next account is stuck up for grabs.
Mon, Nov 27, 8:44 AM · Patch-For-Review, Infrastructure-Foundations, netbox
Volans updated subscribers of T351950: taavi's netbox-next account is stuck.

Interesting, I can confirm that on netbox-next admin the user taavi doesn't have any groups associated and as such doesn't have the additional privileges.
But looking at the ops group in the same DB, taavi is listed in the Available users but not in the Chosen users, see https://netbox-next.wikimedia.org/admin/auth/group/8/change/

Mon, Nov 27, 8:44 AM · Patch-For-Review, Infrastructure-Foundations, netbox
Volans added a comment to T350694: Infrastructure Foundation Alerts to migrate.

We got this today in the I/F IRC channel:

Mon, Nov 27, 8:13 AM · Patch-For-Review, Infrastructure-Foundations, SRE Observability (FY2023/2024-Q2), Observability-Alerting

Sat, Nov 25

Volans claimed T351950: taavi's netbox-next account is stuck.

I see that on netbox-next you have 2 accounts, one with taavi and a personal email address and one with your wmf email and the username you're reporting.
Given that next is for experimentation and the DB is cloned from production on demand from time to time, I took the liberty to delete both users.
Could you try to re-login and see if this time it works?

Sat, Nov 25, 3:14 PM · Patch-For-Review, Infrastructure-Foundations, netbox

Thu, Nov 23

Volans closed T327408: wmflib: improve interactive.ask_input to support free-form responses as Resolved.

This was fixed in wmflib v1.2.1 released on Feb. 2nd.

Thu, Nov 23, 9:03 PM · Infrastructure-Foundations, SRE-tools
Volans closed T346134: spicerack.decorators.retry: dynamic_params_callbacks=(set_tries,) dosn't seem to work as expected as Resolved.

The change has been merged and released with Spicerack v7.3.0 on Oct. 4th. Resolving.

Thu, Nov 23, 9:01 PM · Infrastructure-Foundations, Spicerack, SRE-tools
Volans added a comment to T295774: WMCS VIPs: Netbox netmask inconsistencies.

Trying to run the import puppetdb script on cloudgw1002 is now a noop, but for cloudgw2002-dev it fails with this exception:

Thu, Nov 23, 9:00 PM · Patch-For-Review, SRE, Infrastructure-Foundations, SRE-tools
Volans closed T336547: Error creating device in netbox as Resolved.

Thanks for reporting this. The issue was caused by a bug in one of the new custom validators that was hit only during the creation of a new device but not while editing an existing one.
The fix has been deployed to production. As an example this is a new test device created on netbox-next: https://netbox-next.wikimedia.org/dcim/devices/4642/

Thu, Nov 23, 8:56 PM · Infrastructure-Foundations, netbox, SRE
Volans added a comment to T350152: Automation to change a server's vlan.

Some random additions:

Thu, Nov 23, 5:30 PM · Patch-For-Review, SRE-tools, Infrastructure-Foundations
Volans added a comment to T351891: Abstract a bit more the server provisioning process.

As Arzhel defined it, there would be one table, and the script (be that the existing Netbox ProvisionServerNetwork or a replacement cookbook) would select the entry based on the name of the host it was operating on?

Thu, Nov 23, 4:24 PM · Infrastructure-Foundations, SRE-tools
Volans added a comment to T351891: Abstract a bit more the server provisioning process.

In addition I think that we need to solve first another problem, that is a pre-requisite for this and other similar requests of automation: an authoritative mapping between hostnames and what you called specs table.

I fully understand the first problem, which is not easy to fix. But I'm wondering why that would be a pre-requisite to the "specs table" mapping, for instance just updating the existing Netbox ProvisionServerNetwork script to allocate IPs/vlan/dns names based on such a table? That doesn't seem related to me at first glance but perhaps I'm missing something.

Thu, Nov 23, 3:30 PM · Infrastructure-Foundations, SRE-tools
Volans added a comment to T351891: Abstract a bit more the server provisioning process.

When we introduced the sre.hosts.provision cookbook we envision
Piling many changes together simplifies the user interaction but leaves a lot of open questions to be answered before automating the process regarding what to do in case of errors:

Thu, Nov 23, 3:16 PM · Infrastructure-Foundations, SRE-tools

Mon, Nov 20

Volans renamed T329297: puppetmasters: investigate if the puppetmasters still need a checkout of operations/software from pupetmastrs: investigate if the puppetmasteres still need a checkout of operations/software to puppetmasters: investigate if the puppetmasters still need a checkout of operations/software.
Mon, Nov 20, 3:49 PM · Puppet-Infrastructure, Infrastructure-Foundations
Volans added a comment to T351643: Warning re: excessive directory entries on prometheus with puppet7.

I think that the problem is that the directory is defined in puppet with recurse=true in modules/prometheus/manifests/init.pp. Is that necessary? Could puppet just manage some subdirectories?

Mon, Nov 20, 1:49 PM · Infrastructure-Foundations, Puppet-Core, Puppet (Puppet 7.0), SRE
Volans added a comment to T349925: Q2:rack/setup/install ganeti103[5-8].

The hosts were set up in Netbox with a public VLAN and FQDN (wikimedia.org) while they should have been set up with the private one (eqiad.wmnet FQDNs).
The changes were not committed to the DNS (by running the sre.dns.netbox cookbook); as a result, Icinga has been alerting for Uncommitted DNS changes in Netbox since Friday.
I've noticed that the provision cookbook was run for all the hosts, and failed for all of them. That's because the connection to the Redfish API of the iDRAC is via IP address but then the check that remote IPMI works uses the DNS and the management DNS records were not committed.

Mon, Nov 20, 10:58 AM · SRE, Infrastructure-Foundations, ops-eqiad, DC-Ops
Volans added a comment to T350656: dbconfig bug - "2 instances found for query ...".

Great, thanks. Then I think T350656#9312531 should explain everything :)

Mon, Nov 20, 9:57 AM · Data-Persistence, Patch-For-Review, conftool

Wed, Nov 15

Volans closed T351333: build python-phabricator package for bullseye (and bookworm?), a subtask of T327068: Bullseye upgrade for remaining Collab hosts, as Invalid.
Wed, Nov 15, 6:17 PM · collaboration-services
Volans closed T351333: build python-phabricator package for bullseye (and bookworm?) as Invalid.

There is no python2 in our setup of bullseye or bookworm. python3-phabricator is on Debian (see https://packages.debian.org/bookworm/python3-phabricator )

Wed, Nov 15, 6:17 PM · Packaging, Infrastructure-Foundations, Phabricator, collaboration-services

Tue, Nov 14

Volans closed T348319: Update reimage cookbooks to work with puppet7 as Resolved.

This is now done.

Tue, Nov 14, 1:22 PM · Patch-For-Review, SRE-tools, Puppet-Infrastructure, Puppet (Puppet 7.0), Infrastructure-Foundations, SRE
Volans closed T348319: Update reimage cookbooks to work with puppet7, a subtask of T330490: Next steps for Puppet 7, as Resolved.
Tue, Nov 14, 1:21 PM · Puppet-Infrastructure, Puppet (Puppet 7.0), Patch-For-Review, Infrastructure-Foundations, SRE
jcrespo awarded T212783: cumin: Make output path sane and flexible (was: allow to suppress output and progress bars) a Grey Medal token.
Tue, Nov 14, 12:43 PM · Cumin, Infrastructure-Foundations
Volans added a comment to T212783: cumin: Make output path sane and flexible (was: allow to suppress output and progress bars).

yes, that's correct

Tue, Nov 14, 11:59 AM · Cumin, Infrastructure-Foundations

Mon, Nov 13

Volans claimed T348319: Update reimage cookbooks to work with puppet7.
Mon, Nov 13, 5:00 PM · Patch-For-Review, SRE-tools, Puppet-Infrastructure, Puppet (Puppet 7.0), Infrastructure-Foundations, SRE
Volans added a comment to T341496: spicerack: update spicerack to work with the newer puppet infrastructure.

Update: for the production side of things this is completed. Leaving open for now as the https://doc.wikimedia.org/spicerack/master/api/spicerack.puppet.html#spicerack.puppet.PuppetHosts.get_ca_servers method doesn't yet support SRV records but is currently used only in WMCS.

Mon, Nov 13, 4:06 PM · Patch-For-Review, Puppet-Infrastructure, Puppet (Puppet 7.0), Infrastructure-Foundations, SRE

Thu, Nov 9

Volans added a comment to T349244: Q1:Install cp11[00-15] and rotate into production.

cp1108 completed: see T350179#9321006

Thu, Nov 9, 8:40 PM · ops-eqiad, DC-Ops, Traffic, SRE
Volans added a comment to T350179: Reimage cookbook on new eqiad hosts stuck at PXE booting.

I got cp1108 from Traffic to try. I ran a tcpdump in parallel on the install host (following https://wikitech.wikimedia.org/wiki/Server_Lifecycle/Reimage#DHCP_issues ) and there was NO REQUEST incoming matching ANY of the MAC addresses of the host:

  • eno12399np0: 04:32:01:14:b5:80 the active one
  • eno12409np1: 04:32:01:14:b5:81
  • eno8303: b4:45:06:f6:5a:be
  • eno8403: b4:45:06:f6:5a:bf
Thu, Nov 9, 8:07 PM · SRE-swift-storage, ops-codfw, DC-Ops, ops-eqiad
Volans added a comment to T350479: Netbox PuppetDB Import Script Failing for cloudnet1006.

For now I'll update customscripts/_common.py so that it fails cleanly if this should occur. Not sure what else to do without knowing exactly how it happened.

Thu, Nov 9, 3:08 PM · netops, Infrastructure-Foundations, SRE

Wed, Nov 8

Volans added a comment to T350694: Infrastructure Foundation Alerts to migrate.

Another thing that is strictly related to icinga at the moment is the raid_handler that is triggered by any raid alert and creates a task with the output of a script run on the fly via nrpe. See for example T316565

Wed, Nov 8, 3:21 PM · Patch-For-Review, Infrastructure-Foundations, SRE Observability (FY2023/2024-Q2), Observability-Alerting
Volans added a comment to T330882: transferpy should not log cumin subcomands as ERRORs on a normal, succesful run.

If you're not interested in the report of the results of the commands you can set the worker.reporter property to the NullReporter (from cumin.transports.clustershell import NullReporter).
If you're not interested in the progress bars being printed for some commands you can set worker.progress_bars = False.
Here is a full example:

Wed, Nov 8, 2:30 PM · Patch-For-Review, database-backups, Data-Persistence-Backup
Volans added a comment to T350656: dbconfig bug - "2 instances found for query ...".

Indeed, do you have a dbctl host that is already out of production or not used for any reason? Otherwise we could pick a replica with low weight in the secondary DC.

Wed, Nov 8, 10:18 AM · Data-Persistence, Patch-For-Review, conftool
Volans added a comment to T350656: dbconfig bug - "2 instances found for query ...".

But your history has a little bit later in the file:

Wed, Nov 8, 10:09 AM · Data-Persistence, Patch-For-Review, conftool

Tue, Nov 7

Volans added a comment to T350694: Infrastructure Foundation Alerts to migrate.

Does it make sense to migrate to alertmanager those fairly complex alerts that report a lot of information in the alert itself?
How many metrics would a raid_megaraid for example need to generate to have the same level of information? (per host and per disk)

Tue, Nov 7, 3:38 PM · Patch-For-Review, Infrastructure-Foundations, SRE Observability (FY2023/2024-Q2), Observability-Alerting
Volans added a comment to T350656: dbconfig bug - "2 instances found for query ...".

Ok, I think I found the problem: write_callback(self, callback, id, **args) doesn't pass any datacenter selection when calling obj = self.get(*id), although the signature of get allows for it: get(self, name, dc=None). (get in turn calls get_all(), but the dc is always propagated there.)
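
The effect can be reproduced with a toy model. This is not the conftool code: the class, its storage and the scoped variant are hypothetical, and only the method names and the `get(self, name, dc=None)` signature come from the description above:

```python
class Store:
    """Toy stand-in for a store holding objects in several datacenters."""

    def __init__(self, objects):
        self._objects = objects  # List of (name, dc) tuples

    def get_all(self, name, dc=None):
        # A dc of None means "search every datacenter"
        return [obj for obj in self._objects if obj[0] == name and (dc is None or obj[1] == dc)]

    def get(self, name, dc=None):
        found = self.get_all(name, dc=dc)
        if len(found) > 1:
            raise ValueError(f"{len(found)} instances found for query {name}")
        return found[0] if found else None

    def write_callback_unscoped(self, callback, obj_id):
        # Mirrors the behavior described above: only the name reaches get()
        return callback(self.get(*obj_id))

    def write_callback_scoped(self, callback, obj_id, dc=None):
        # One possible change: forward the datacenter selection to get()
        return callback(self.get(*obj_id, dc=dc))
```

With the same name present in two datacenters, the unscoped variant raises on the ambiguous lookup while the scoped one returns the single object for the requested datacenter.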

Tue, Nov 7, 2:10 PM · Data-Persistence, Patch-For-Review, conftool
Volans added a comment to T350656: dbconfig bug - "2 instances found for query ...".

There is no automatic expiration on any key written/edited by dbctl AFAIK.
As for the history content it's a bit complicated but from a quick look at the mirror logs I see that there was a db2103 key in eqiad at some point this morning:

Tue, Nov 7, 1:58 PM · Data-Persistence, Patch-For-Review, conftool
Volans added a comment to T350479: Netbox PuppetDB Import Script Failing for cloudnet1006.

The code is not checking if the autoselection of the parent is None or not. That said, re-running the script now works fine. What was changed in the Netbox data to fix the issue?

Tue, Nov 7, 1:29 PM · netops, Infrastructure-Foundations, SRE
Volans added a comment to T350656: dbconfig bug - "2 instances found for query ...".

I've also run while read line; do sudo dbctl instance "${line}" get; sleep 1; done < dblist getting the dblist from etcd and I couldn't repro the error.

Tue, Nov 7, 1:22 PM · Data-Persistence, Patch-For-Review, conftool
Volans added a comment to T344164: VMs requested for stewards.

@Dzahn once the above patch is merged you can proceed directly with running the reimage cookbook on the host, as the VM was correctly created and the last step was calling the reimage cookbook.

Tue, Nov 7, 8:43 AM · Stewards-Onboarding-Tool, collaboration-services, Infrastructure-Foundations, Stewards-and-global-tools, SRE, vm-requests
Volans added a comment to T344164: VMs requested for stewards.
Exception raised while parsing arguments for cookbook sre.hosts.reimage:
Traceback (most recent call last):
  File "/usr/lib/python3/dist-packages/spicerack/_menu.py", line 334, in _safe_call
    ret_value = func(*args, **kwargs)
  File "/usr/lib/python3.9/argparse.py", line 1830, in parse_args
    args, argv = self.parse_known_args(args, namespace)
  File "/usr/lib/python3.9/argparse.py", line 1863, in parse_known_args
    namespace, args = self._parse_known_args(args, namespace)
  File "/usr/lib/python3.9/argparse.py", line 1907, in _parse_known_args
    option_tuple = self._parse_optional(arg_string)
  File "/usr/lib/python3.9/argparse.py", line 2194, in _parse_optional
    if not arg_string[0] in self.prefix_chars:
TypeError: 'int' object is not subscriptable
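
The last frames are consistent with a non-string element ending up in the argument list handed to argparse. A small sketch that reproduces the same TypeError with a hypothetical parser (the option and positional names are illustrative, not the cookbook's real arguments):

```python
import argparse


def parse(args):
    """Parse a list of arguments with a small example parser."""
    parser = argparse.ArgumentParser(prog="example")
    parser.add_argument("--os")
    parser.add_argument("host")
    return parser.parse_args(args)


def parse_safely(args):
    """Convert every element to str before parsing."""
    return parse([str(arg) for arg in args])


if __name__ == "__main__":
    try:
        parse(["--os", "bullseye", 1001])  # The int triggers arg_string[0]
    except TypeError as error:
        print(f"TypeError: {error}")
    print(parse_safely(["--os", "bullseye", 1001]).host)
```

argparse indexes into each argument to check for a prefix character, so any caller that builds the list programmatically needs to pass strings only.
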
Tue, Nov 7, 8:22 AM · Stewards-Onboarding-Tool, collaboration-services, Infrastructure-Foundations, Stewards-and-global-tools, SRE, vm-requests
Volans added a comment to T350656: dbconfig bug - "2 instances found for query ...".

I was having a look, I checked etcd and I didn't find two records that could match the name db2103. I also can't repro it, both on cumin1001 and cumin2002:

Tue, Nov 7, 8:14 AM · Data-Persistence, Patch-For-Review, conftool

Mon, Nov 6

Volans added a project to T148494: Add shell scripts CI validations: Puppet CI.
Mon, Nov 6, 4:32 PM · Puppet CI, Infrastructure-Foundations, SRE, Continuous-Integration-Config, SRE-tools
Volans added a comment to T330490: Next steps for Puppet 7.

@jbond I think that the decommission cookbook needs some adjustment too, both because it checks some git checkout on the puppetmaster's CA and also because it does remove the certificate.

Mon, Nov 6, 2:33 PM · Puppet-Infrastructure, Puppet (Puppet 7.0), Patch-For-Review, Infrastructure-Foundations, SRE

Nov 2 2023

Volans added a comment to T350179: Reimage cookbook on new eqiad hosts stuck at PXE booting.

My understanding is that all those hosts have already been reimaged into their related insetup::* role. I'm wondering why you need to re-image them again instead of just switching role in site.pp and running puppet. The insetup role just installs the same base system that any other role would (if the appropriate insetup role has been chosen).

Nov 2 2023, 2:44 PM · SRE-swift-storage, ops-codfw, DC-Ops, ops-eqiad

Oct 31 2023

Volans added a comment to T348129: Create automation to move servers in Netbox from old to new switch.

this way there is no check to ensure that reality corresponds to what we do with the automation

Not sure I understand this.

Oct 31 2023, 4:07 PM · Infrastructure-Foundations, netops, SRE
Volans added a comment to T348129: Create automation to move servers in Netbox from old to new switch.

@cmooney thanks for the summary, a couple of questions:

Oct 31 2023, 2:11 PM · Infrastructure-Foundations, netops, SRE

Oct 27 2023

Volans closed T276749: Flapping Prometheus metrics for netbox_device_statistics as Resolved.

This hasn't happened in a long time. Resolving.

Oct 27 2023, 3:47 PM · Infrastructure-Foundations, observability, netbox
Volans closed T342345: sre.hosts.reimage: fails to get uptime in debian installer as Resolved.

Resolving for now, feel free to re-open in case it happens again.

Oct 27 2023, 3:27 PM · DC-Ops, Infrastructure-Foundations, SRE-tools

Oct 19 2023

Volans created T349273: Puppet execution sometimes interrupted when running from PoPs.
Oct 19 2023, 7:03 AM · Infrastructure-Foundations, Puppet-Infrastructure
Volans added a comment to T341973: Spicerack: add distributed locking support.

Distributed locking is now live in Spicerack and used by the Cookbooks.
For a general overview see https://doc.wikimedia.org/spicerack/master/introduction.html#distributed-locking

Oct 19 2023, 6:43 AM · Patch-For-Review, Infrastructure-Foundations, SRE-tools, Spicerack

Oct 18 2023

Volans added a comment to T342176: Q1:rack/setup/install db12[26-33].

@Jclark-ctr:

Error: Could not retrieve catalog from remote server: Error 500 on SERVER: Server Error: Evaluation Error: Error while evaluating a Function Call, No puppet role has been assigned to this node. (file: /etc/puppet/manifests/site.pp, line: 2939, column: 9) on node db1229.eqiad.wmnet
Oct 18 2023, 7:41 PM · SRE, Data-Persistence, ops-eqiad, DC-Ops
Volans added a comment to T349176: Route systemd unit alerts to the correct team.

Thanks for the task! I think another potential use case is the docker-reporter* units on the build host.

Oct 18 2023, 9:39 AM · SRE Observability (FY2023/2024-Q2), Observability-Alerting

Oct 16 2023

Volans added a comment to T348876: Container image reports in debmonitor are broken.

The original idea for the report of images to debmonitor was that they should be reported at creation time and, given their immutability, there should be no need to report them again until deletion. Given the lack of a way to properly clean them up, the current implementation, as you know, is different.

Oct 16 2023, 10:09 AM · collaboration-services, GitLab, Patch-For-Review, serviceops, Infrastructure-Foundations

Oct 12 2023

Volans added a comment to T348734: Port defs_from_etcd logic to nftables.

Mentioning T348525 too to avoid duplicate work.

Oct 12 2023, 12:15 PM · Patch-For-Review, Infrastructure-Foundations, SRE

Oct 11 2023

Volans added a project to T270071: SVC DNS zonefiles and source of truth: DNS.

I really think that we need to find a solution for this. It has been pending for too long.

Oct 11 2023, 12:21 PM · Traffic, DNS, Infrastructure-Foundations, serviceops-radar, SRE-tools, SRE
Volans added a comment to T348632: k8s-ingress-aux.svc.codfw.wmnet marked as Active in Netbox.

I noticed also that aux-k8s-ctrl.svc.eqiad.wmnet is missing the PTR record in the operations/dns repository.

Oct 11 2023, 12:01 PM · Infrastructure-Foundations
Volans reopened T299700: Remove legacy ELK LVS entries, a subtask of T281266: Decommission old ELK5 Logstash cluster, as Open.
Oct 11 2023, 11:52 AM · SRE Observability (FY2021/2022-Q3), SRE
Volans reopened T299700: Remove legacy ELK LVS entries as "Open".

FYI the service IPs are still allocated in Netbox:
https://netbox.wikimedia.org/ipam/ip-addresses/?q=kibana.svc
https://netbox.wikimedia.org/ipam/ip-addresses/?q=logstash.svc

Oct 11 2023, 11:52 AM · SRE Observability (FY2023/2024-Q2), Traffic, Patch-For-Review, SRE
Volans triaged T348632: k8s-ingress-aux.svc.codfw.wmnet marked as Active in Netbox as Medium priority.
Oct 11 2023, 11:42 AM · Infrastructure-Foundations
Volans reopened T316296: Sunset search.wikimedia.org service as "Open".

FYI the SVC addresses are still allocated in Netbox: https://netbox.wikimedia.org/ipam/ip-addresses/?q=apple-search
I guess they should be removed. When doing so remember to run the sre.dns.netbox cookbook too.

Oct 11 2023, 11:37 AM · Patch-For-Review, Discovery-Search, serviceops, collaboration-services, Technical-Debt, SRE
Volans triaged T348631: tegola-vector-tiles SVC records missing reverse PTRs as Medium priority.
Oct 11 2023, 11:32 AM · serviceops
Volans reopened T242855: Undeploy graphoid as "Open".

FYI the service IPs in Netbox are still allocated to the service and probably need cleanup:
https://netbox.wikimedia.org/ipam/ip-addresses/?q=graphoid

Oct 11 2023, 11:26 AM · MW-1.38-notes (1.38.0-wmf.19; 2022-01-24), Patch-For-Review, MW-1.35-notes (1.35.0-wmf.34; 2020-05-26), Platform Engineering (Icebox), serviceops, SRE, Graphoid
Volans reopened T242855: Undeploy graphoid, a subtask of T274738: Archive the graphoid service and deploy repos, as Open.
Oct 11 2023, 11:25 AM · Graphoid, translatewiki.net, Wikimedia-GitHub, Diffusion-Repository-Administrators, Projects-Cleanup
Volans added a comment to T347278: Decommission ORES configurations and servers.

@klausman the DNS step is marked as done, but I see the ORES SVC records still existing in Netbox ( https://netbox.wikimedia.org/ipam/ip-addresses/?q=ores ) is that a leftover or pending some other step? (when removed a run of the sre.dns.netbox cookbook is needed)

Oct 11 2023, 11:21 AM · Patch-For-Review, Machine-Learning-Team

Oct 10 2023

Volans triaged T348525: etcd increased QGET traffic since January 2023 as Medium priority.
Oct 10 2023, 11:32 AM · serviceops, SRE, Infrastructure-Foundations

Oct 9 2023

Volans updated subscribers of T341973: Spicerack: add distributed locking support.

For the record, as Giuseppe is out, I had a chat with @CDanis going over the plan and numbers and we didn't find anything worrisome or any blockers. I'll proceed with the current implementation. In any case it will be off by default and switched on only with a puppet change to update the config file, which will also make it easy to stop using the locks in case there is any issue.

Oct 9 2023, 8:10 AM · Patch-For-Review, Infrastructure-Foundations, SRE-tools, Spicerack
Volans added a comment to T327938: Plan codfw row A/B top-of-rack switch refresh.

@cmooney adding a note here so we don't forget. We'll need to check how it will work for Ganeti VMs. In particular the makevm cookbook knows which DCs have per-rack subnets and treats them differently, but it will need to be aware of rows too, so it needs some refactoring and possibly to get the information live instead of having it hardcoded.

Oct 9 2023, 8:08 AM · netops, Infrastructure-Foundations, SRE

Oct 4 2023

fgiunchedi awarded T347954: Reimage cookbook with --conftool should suggest to repool only the reimaged host(s) a Like token.
Oct 4 2023, 8:31 AM · Cumin, Infrastructure-Foundations
Volans closed T347954: Reimage cookbook with --conftool should suggest to repool only the reimaged host(s) as Resolved.

This should be resolved, feel free to re-open in case you have any issue.

Oct 4 2023, 8:24 AM · Cumin, Infrastructure-Foundations
Volans added a project to T348036: sre.hardware.upgrade-firmware cookbook: product slug parsing: DC-Ops.

IMHO we should stick to the agreed format in T284614#7214588 and T284614#7222919 and rename (and re-slug) the 3 non-matching ones into the format PowerEdge R440 - ConfigFundraising 202107 and so on. @wiki_willy what do you think?

Oct 4 2023, 7:50 AM · netbox, DC-Ops, SRE, Infrastructure-Foundations

Oct 3 2023

Volans claimed T347954: Reimage cookbook with --conftool should suggest to repool only the reimaged host(s).
Oct 3 2023, 1:32 PM · Cumin, Infrastructure-Foundations