Volans (Riccardo Coccioli)
SRE

Projects (8)

User Details

User Since
Feb 10 2016, 11:25 AM (196 w, 6 d)
Availability
Available
IRC Nick
volans
LDAP User
Volans
MediaWiki User
RCoccioli (WMF) [ Global Accounts ]

Recent Activity

Today

Volans added a comment to T237469: Netbox: Fix hostname case ambiguity.

I've reverted the above patch as it was reporting most servers as false positives in the new name coherence report. The regex was wrong and could not match our current hostnames.
Also, the goal of this check is not totally clear to me, as the task was opened for the asset tag names specifically but the check was expanded to all hosts.
For asset tag hostnames, for example, we should check that the hostname matches the asset tag of the same device; if those are already lowercase, that would be enough.
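
A minimal sketch of the asset tag check described above. The function name and the plain-string inputs are hypothetical and not taken from the actual Netbox report code; it assumes the hostname should be exactly the lowercase form of the device's asset tag.

```python
def asset_tag_hostname_ok(hostname: str, asset_tag: str) -> bool:
    """Return True if the hostname is the lowercase form of the asset tag.

    Hypothetical helper: assumes an asset-tag-named device must have a
    hostname equal to its own asset tag, written in lowercase.
    """
    return hostname == asset_tag.lower()
```

With this shape a single comparison covers both conditions: a hostname taken from a different device's asset tag fails, and so does a hostname that still contains uppercase characters.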

Tue, Nov 19, 1:20 PM · netbox, DC-Ops
Volans added a comment to T236277: Extend Puppet CA Expiry date.

The debmonitor test didn't test much as the debmonitor client sends the puppet client cert (not the CA) and it's the server that validates it with the CA.

Tue, Nov 19, 11:56 AM · DBA, Patch-For-Review, User-jbond, Puppet, Operations
Volans added a comment to T236277: Extend Puppet CA Expiry date.

@jbond just to be on the safe side and to verify the theory, if possible make a quick test that the new cert in the CR is able to verify existing puppet node certs and cergen certs.

Tue, Nov 19, 10:26 AM · DBA, Patch-For-Review, User-jbond, Puppet, Operations

Fri, Nov 15

Volans added a comment to T233183: Automate generation of Management DNS records from Netbox.

A simplified version could be to use a cookbook to couple stuff:

Fri, Nov 15, 7:42 PM · User-jbond, Operations, Traffic, Patch-For-Review, User-crusnov, Goal, SRE-tools
Volans added a comment to T233183: Automate generation of Management DNS records from Netbox.

Seems sane! The only thing I'm a little iffy about is from the "SHA1 written to etcd" onwards. I'm not sure it's a bad approach, but I'm not sure I've thought through all the implications, either. I think the key thing to think about in that part of the flow that might be missing is emergency updates to the netbox-defined data, e.g. if all the things are borked and we need to manually edit a DNS entry in the netbox-derived zonefile fragments: is there a way we can do that from the authdns servers with authdns-update (e.g. a local commit and an override of the SHA1 argument)?

Fri, Nov 15, 2:42 PM · User-jbond, Operations, Traffic, Patch-For-Review, User-crusnov, Goal, SRE-tools

Thu, Nov 14

Volans added a member for Security: Tgr.
Thu, Nov 14, 4:57 AM
Volans added a member for WMF-NDA: Tgr.
Thu, Nov 14, 4:56 AM

Wed, Nov 13

Volans added a comment to T224946: Netbox Alert Cleanups.

I am not sure this is related, but we get many alerts of

  • PROBLEM - Check the Netbox report puppetdb for fail status. on netbox1001 is CRITICAL: puppetdb.PuppetDB CRITICAL
  • PROBLEM - Check systemd state on netbox1001 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed

If those are actionable from dc-ops or if we are getting more false positives than we should, we must fix it.

Wed, Nov 13, 2:34 PM · Operations, observability, User-crusnov, netbox, SRE-tools
Volans added a comment to T238200: debmonitor TLS termination.

As we discussed a while ago, the easiest solution is to pick another port for the public TLS server on the debmonitor servers, as port 443 is already used by the internal clients to report their package list and to perform authz/n with the client certificate.

Wed, Nov 13, 12:16 PM · Operations, Traffic

Sat, Nov 9

Volans triaged T237803: Netbox reports Icinga checks timeout as High priority.
Sat, Nov 9, 12:19 PM · Operations, SRE-tools, netbox
Volans created T237803: Netbox reports Icinga checks timeout.
Sat, Nov 9, 12:19 PM · Operations, SRE-tools, netbox

Fri, Nov 8

Dzahn awarded T179816: Cumin: create external backend for WMCS Puppet API a Love token.
Fri, Nov 8, 5:45 PM · cloud-services-team (Kanban), SRE-tools
Volans added a comment to T237691: cloud-cumin-01: HTTPSConnectionPool - Max retries exceeded with url.

@Krenair the default ones HTTPSConnectionPool(host='localhost', port=443)
@Dzahn that would be T179816, but it seems that there hasn't been much interest in that backend.

Fri, Nov 8, 5:23 PM · SRE-tools, Cloud-VPS
Volans closed T157002: Puppet compiler: re-add the concurrency option NUM_THREADS as Resolved.

I don't recall the details, too much time has passed, but indeed it seems it's still supported by the puppet compiler code, so I'll resolve it.

Fri, Nov 8, 4:45 PM · User-jbond, puppet-compiler, Operations, SRE-tools
Volans updated the task description for T229586: decommission cp1008, cp1071, cp1072, cp1073, cp1074, cp1099.
Fri, Nov 8, 3:21 PM · ops-eqiad, DC-Ops, decommission, Operations
Volans added a comment to T237587: Determine & implement near-term method for escalating network alerts.

I'd rather not do (3), seems a step back (not respecting awake hours and such).

Fri, Nov 8, 9:44 AM · Operations, netops, observability
Volans updated subscribers of T237604: Record per-server power usage.

I have some concerns about proceeding with this. In our experience the BMCs are not that stable, and excessive interaction with them seems to aggravate the situation, statistically causing more BMCs to become unresponsive and require a reset.
For this reason we've kept our checks of BMCs to a minimum and I'd rather not add something that queries the BMC so often.

Fri, Nov 8, 9:22 AM · observability

Wed, Nov 6

Volans triaged T234358: wmf-auto-reimage-host on HP gen10 WARNING: unable to verify that BIOS boot parameters are back to normal, got: as Low priority.

No, it's ok to keep it open and look at it at some point; no priority though.

Wed, Nov 6, 3:19 PM · SRE-tools
Volans moved T233183: Automate generation of Management DNS records from Netbox from Backlog to In Progress on the SRE-tools board.
Wed, Nov 6, 3:18 PM · User-jbond, Operations, Traffic, Patch-For-Review, User-crusnov, Goal, SRE-tools
Volans closed T236684: sre.hosts.downtime fails with "No hosts provided" as Resolved.

We've migrated to the new puppetdb hosts with the newer version. The queue size is under control for now. Resolving.

Wed, Nov 6, 3:17 PM · User-jbond, Patch-For-Review, SRE-tools, Operations
Volans updated subscribers of T155705: confctl: log to SAL even if the selection doesn't match any host.
Wed, Nov 6, 3:02 PM · Operations, SRE-tools
Volans added a comment to T149589: Puppet tab in Horizon unusably slow.

It's great to see some movement in this space! I've tried it and indeed it is much better.

Wed, Nov 6, 11:22 AM · cloud-services-team (Kanban), Patch-For-Review, Operations, Puppet, Cloud-Services

Tue, Nov 5

Volans triaged T233183: Automate generation of Management DNS records from Netbox as Normal priority.
Tue, Nov 5, 7:43 PM · User-jbond, Operations, Traffic, Patch-For-Review, User-crusnov, Goal, SRE-tools
Volans added a comment to T233183: Automate generation of Management DNS records from Netbox.

@BBlack the current proposal is:

Tue, Nov 5, 7:43 PM · User-jbond, Operations, Traffic, Patch-For-Review, User-crusnov, Goal, SRE-tools

Mon, Nov 4

Volans added a comment to T236277: Extend Puppet CA Expiry date.

@CDanis the problem is that all of those identify clients, while for the CA validation we're mostly interested in the server side. So while that surely would help, it's a 1:1 mapping. Also there might be places that have hardcoded the path to the CA cert for validation, either in the puppet repo or, potentially, in other repos too (as a default for example, dunno).
I don't know if this CA is also used in the k8s world for example.

Mon, Nov 4, 3:14 PM · DBA, Patch-For-Review, User-jbond, Puppet, Operations
Volans added a comment to T205885: Cookbooks: convert wmf-auto-reimage scripts to Cookbooks.

I've not had a chance to work on this in a while, I hope to get back to it soon. Leaving it open in the meanwhile. Most of the reimage single host script has been migrated, but some bits here and there were missing.

Mon, Nov 4, 2:05 PM · SRE-tools
Volans added a comment to T205867: Expand Spicerack library and SRE Cookbooks - Q2 2018-19 Goal.

All the wmf-* scripts but the reimage ones were migrated to cookbooks. Most of the modules and functionalities in the library have been added to spicerack. I've not had a chance to work on this in a while, I hope to get back to it soon. Leaving it open in the meanwhile.

Mon, Nov 4, 2:04 PM · SRE-tools, Operations, Goal
Volans added a comment to T205884: Spicerack: split wmf-auto-reimage-lib into Spicerack modules.

Most of the modules and functionalities have been added. I've not had a chance to work on this in a while, I hope to get back to it soon. Leaving it open in the meanwhile as some minor bits were still missing to include all the current functionalities of the reimage library.

Mon, Nov 4, 2:01 PM · SRE-tools
Volans updated the task description for T213114: Q3 2018/19 Goal: TEC6: Build automated workflows for server provisioning (Tracking Task).
Mon, Nov 4, 1:55 PM · User-crusnov, SRE-tools
Volans moved T144169: Flake8 for python files without extension in puppet repo from In Code Review to Backlog on the SRE-tools board.
Mon, Nov 4, 1:45 PM · User-jbond, cloud-services-team (Kanban), Patch-For-Review, Operations, SRE-tools
Volans closed T169304: Cumin masters: simplify usage in case of emergency as Resolved.

Resolving, as the known hosts backend in Cumin allows using the already existing host list file present on the cumin hosts as part of the known hosts file.
In case we want to protect ourselves from a combined failure of PuppetDB and the disappearance of the known hosts file, we could resume the above patch and adapt it (as it's quite old and surely cannot be merged as is).

Mon, Nov 4, 1:44 PM · SRE-tools
Volans added a comment to T222837: Discussion about synchronizing Ganeti VM network interfaces to Netbox.

@crusnov what release/deployment is this task pending for?

Mon, Nov 4, 1:35 PM · SRE-tools
Volans added a project to T236277: Extend Puppet CA Expiry date: DBA.

One thing to take into account: we're using certificates signed by the Puppet CA in many places:

  • the puppet client certificate exposed via puppet code, see base::expose_puppet_certs
  • certificates in the private puppet repo generated via the utils/create_ecdsa_cert script and cergen
Mon, Nov 4, 1:11 PM · DBA, Patch-For-Review, User-jbond, Puppet, Operations
Volans added a comment to T237016: Update router ACLs for newer bacula hosts.

While you wait for @ayounsi I can maybe fill some gaps. Homer is already a thing and Arzhel is using and testing it, but it doesn't yet have proper documentation for wider usage (it will soon though). In the meanwhile, if you make manual changes to network devices, it is good in any case to have a patch for Homer's templates (when applicable) to keep things in sync.
So thanks for the patch, it's great to have it!

Mon, Nov 4, 12:33 PM · Operations, netops

Sun, Nov 3

Volans created T237198: Kubernetes workers frequent oom-killer in action.
Sun, Nov 3, 3:35 PM · Operations, serviceops
Volans created T237197: Kubernetes hosts raid check make facter fail.
Sun, Nov 3, 3:26 PM · Patch-For-Review, serviceops, Operations

Fri, Nov 1

Volans added a comment to T229710: read-only user netbox permissions regression.

The only problem is how to keep adding those permissions every time a new Netbox release introduces some new model...

Fri, Nov 1, 9:53 PM · netbox

Thu, Oct 31

Volans closed T217072: Spicerack module for Netbox, a subtask of T213114: Q3 2018/19 Goal: TEC6: Build automated workflows for server provisioning (Tracking Task), as Resolved.
Thu, Oct 31, 11:36 PM · User-crusnov, SRE-tools
Volans closed T217072: Spicerack module for Netbox as Resolved.
Thu, Oct 31, 11:36 PM · netbox, Patch-For-Review, User-crusnov, SRE-tools
Volans added a comment to T222629: Netbox: Set up deploy groups for scap to ensure primary is deployed before secondary.

@crusnov should this be resolved?

Thu, Oct 31, 11:34 PM · User-crusnov, netbox, SRE-tools
Volans added a comment to T223292: Netbox: generate CSV backups.

As I've not yet fully understood the use case of those files, given that AFAIK most of them cannot be re-imported as is into Netbox, it's hard for me to give feedback on the frequency of the backups and their retention.
If I have to ballpark it while keeping it simple then the standard hourly for a week, daily for the rest of the retention period might be a good compromise.
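
A minimal sketch of the "hourly for a week, daily for the rest" policy suggested above. The function name, the 30-day total retention and the choice of the midnight backup as the daily one are assumptions made for illustration; it also assumes backups are taken once per hour, on the hour.

```python
from datetime import datetime, timedelta


def should_keep(taken: datetime, now: datetime,
                retention: timedelta = timedelta(days=30)) -> bool:
    """Decide whether a backup taken at `taken` is kept at time `now`.

    Assumed policy: keep every hourly backup for the first 7 days, then
    keep only the 00:00 backup of each day until `retention` expires.
    """
    age = now - taken
    if age < timedelta(0) or age > retention:
        return False  # from the future, or past the retention period
    if age <= timedelta(days=7):
        return True  # first week: keep all hourly backups
    return taken.hour == 0  # afterwards: keep one backup per day
```

A cleanup job could call this for each existing backup file and delete those for which it returns False.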

Thu, Oct 31, 11:31 PM · netbox
Volans updated subscribers of T229710: read-only user netbox permissions regression.

As discussed with Bryan earlier today, yes, it was caused by the 2.6 upgrade, which included the view permission linked above.

Thu, Oct 31, 11:23 PM · netbox
Volans assigned T233728: Netbox: netbox_dump_run service failed to crusnov.

@crusnov based on the recent change in the settings for the proxy wrt the URL scheme, do you think this issue no longer exists?

Thu, Oct 31, 3:08 PM · netbox
Volans updated subscribers of T233774: Netbox: tracking of hardware errors / grouping servers in order/batches.

In Netbox we can already filter devices by purchase date, support expiry date and procurement ticket. That should be enough to pinpoint the batch as far as I can tell.
In this specific case for example picking the procurement ticket and the purchase date from Netbox for db1075 you could:

Thu, Oct 31, 3:06 PM · Operations, netbox
Volans updated subscribers of T237007: Add a Netbox check for duplicate cable IDs.
Thu, Oct 31, 2:52 PM · Patch-For-Review, DC-Ops, SRE-tools, netbox
Volans added a comment to T228388: Configuration management for network operations.

Basic integration with Netbox has been developed and is now merged, pending the next release. Some improvements are already WIP and should be ready for CR later today.

Thu, Oct 31, 2:47 PM · Patch-For-Review, Wikimedia-Incident, Operations, Goal, netops, SRE-tools
Volans updated the task description for T228388: Configuration management for network operations.
Thu, Oct 31, 2:34 PM · Patch-For-Review, Wikimedia-Incident, Operations, Goal, netops, SRE-tools

Wed, Oct 30

Volans reassigned T236684: sre.hosts.downtime fails with "No hosts provided" from Volans to jbond.

Confirmed it's a puppetdb slowness:

2019-10-30 12:22:57,863 [DEBUG puppetdb.py:256 in _execute] Queried puppetdb for '["or", ["=", "certname", "cp5008.eqsin.wmnet"]]', got '0' results.

vs

2019-10-30 12:23:02,941 INFO  [p.p.command] [206393-1572438182779] [153ms] 'replace facts' command processed for cp5008.eqsin.wmnet
Wed, Oct 30, 2:36 PM · User-jbond, Patch-For-Review, SRE-tools, Operations

Mon, Oct 28

Volans added a comment to T236684: sre.hosts.downtime fails with "No hosts provided".

In this case the query to puppetdb returned no matching host. After a first look I think that it might be related to the queue size in Puppetdb that apparently has grown quite a lot in the last month, see:
https://grafana.wikimedia.org/d/000000477/puppetdb?panelId=19&fullscreen&orgId=1&from=1568392819818&to=1572273876961

Mon, Oct 28, 3:49 PM · User-jbond, Patch-For-Review, SRE-tools, Operations
Volans closed T222074: Icinga meta-monitoring: automatically sync contact list as Resolved.

This is all done, resolving. Feel free to re-open if any issue is found.

Mon, Oct 28, 3:21 PM · observability, Operations
Volans closed T198784: Degraded RAID on cp3048 as Declined.

Closing as the host has been decommissioned as part of T236454

Mon, Oct 28, 7:46 AM · Traffic, ops-esams, Operations

Fri, Oct 25

Volans added a comment to T175691: Geoip lookup - Misidentifying country due to travelling.

I can confirm this as it happened to me today. I'm seeing the fundraising banner on enwiki with "Hi, reader in France," while I'm in Italy. I've been in France recently but I left ~20 days ago. I have two GeoIP cookies, both set with Session as expiration, one for .wikimedia.org and the other for .wikipedia.org.
We could force a refresh of the cookie if the IP no longer matches the one in the cookie, if it's not too heavy an operation, given that clients will surely change IP much more frequently than country.

Fri, Oct 25, 9:31 PM · Operations, Traffic, FR-Q2-FY2019-20-cleanup-list, Fundraising-Backlog, MediaWiki-extensions-CentralNotice
Volans added a comment to T229998: decom cookbook: dry-run mode not working / PuppetDB and Debmonitor removals can fail.

@MoritzMuehlenhoff it failed the power off, as reported by the script, see https://phabricator.wikimedia.org/T208585#5599005

Fri, Oct 25, 5:41 PM · Operations
Volans added a comment to T236478: update failed puppet checks so that they go critical 24 hours.

@jbond I had already opened T236345 for this. I guess that can probably be merged into this at this point.

Fri, Oct 25, 5:34 PM · User-jbond, Puppet, Operations, observability

Thu, Oct 24

Volans triaged T236345: Icinga last puppet run check: re-enable relaxed per-host check as Normal priority.
Thu, Oct 24, 8:18 AM · Operations, observability
Volans created T236345: Icinga last puppet run check: re-enable relaxed per-host check.
Thu, Oct 24, 8:17 AM · Operations, observability

Wed, Oct 23

Volans triaged T234452: Puppet breakage in automation-framework VMs as High priority.
Wed, Oct 23, 10:51 AM · Operations
Volans added a comment to T234452: Puppet breakage in automation-framework VMs.

There are also local modifications in the private repo fwiw.

Wed, Oct 23, 10:51 AM · Operations

Tue, Oct 22

Volans added a comment to T236152: wmf-auto-reimage, decommission & Server_lifecycle documentation for virtual machines reimage confusing.

@jcrespo the decommissioning cookbook supports VMs and can be used with them. The output will tell if there is any manual step to perform because it is not yet supported.
As for the wmf-auto-reimage scripts, those are not yet fully migrated to cookbooks and don't support VMs, because:

  • A VM's first installation is a bit different and basically self-done, so given the low number of manual steps it was not a priority to automate
  • VMs usually don't get reimaged; a new one is created and the old one decommissioned. At least those were the assumptions when the scripts were developed. If that has changed let me know.
Tue, Oct 22, 4:04 PM · SRE-tools, Documentation
Volans added a comment to T234452: Puppet breakage in automation-framework VMs.

This should be fixed now.

Tue, Oct 22, 7:40 AM · Operations

Oct 17 2019

Volans committed rCUMIN13086365f386: doc: update requests doc link (authored by Volans).
Oct 17 2019, 9:12 PM

Oct 16 2019

Volans renamed T234452: Puppet breakage in automation-framework VMs from Puppet breakage in automation-feedback VMs to Puppet breakage in automation-framework VMs.
Oct 16 2019, 2:43 PM · Operations

Oct 15 2019

Volans created T235488: Jobrunners: allow to check that they are in sync with the etcd data.
Oct 15 2019, 10:31 AM · Operations, serviceops

Oct 9 2019

Volans added a comment to T234653: Wikimedia Technical Conference 2019 Session: Standardizing QA best practices.

@kaldari thanks for the offer, but I think this deserves some preparatory work that I cannot commit to at the moment, I'm sorry.

Oct 9 2019, 1:27 PM · International-Developer-Events, Wikimedia-Technical-Conference-2019
Volans added a comment to T187709: Cumin feature idea: Prometheus backend.

Why double? Curly braces don't need escape in the shell.

No? Here {foo,bar} means foo bar, and is equivalent to * if the content of the current directory is foo bar.

Oct 9 2019, 9:26 AM · SRE-tools
Volans closed T231066: Host decommission improvements as Resolved.

I'm marking this as resolved as the cookbook has been used many times at this point and both the Phabricator templates and the wikitech documentation have been updated accordingly. I'll send an email to ops today as an FYI to spread the word a bit more.

Oct 9 2019, 9:09 AM · Operations, DC-Ops, SRE-tools

Oct 7 2019

Volans added a comment to T187709: Cumin feature idea: Prometheus backend.

On a more detailed level: there will be at least one conflict in syntax I can think of right now, { and } can appear in both Cumin and Prometheus. I'm assuming the latter will need escaping (?)

Ugh. Yeah. And double-escaping too, because we need to escape the shell first. That's rather inconvenient.

Oct 7 2019, 7:51 AM · SRE-tools
Volans updated subscribers of T234785: Degraded RAID on analytics1049.
Oct 7 2019, 6:43 AM · Patch-For-Review, ops-eqiad, Operations
Volans added a comment to T229686: #dbctl: manage 'externalLoads' data.

@CDanis my 2 cents are with Manuel to use es[123] on the dbctl side and have the mapping es->cluster in mediawiki-config code as it is right now (with comments in the db-$dc.php files).
One thing that I don't remember if it was mentioned is that the es1 cluster is read-only and doesn't have any replication setup between them. Some additional code/knob is probably needed to allow this setup IIRC.

Oct 7 2019, 6:39 AM · Performance-Team, DBA, conftool

Oct 6 2019

Volans added a comment to T233183: Automate generation of Management DNS records from Netbox.

Thanks @BBlack for the very detailed and precise summary.

Oct 6 2019, 5:02 PM · User-jbond, Operations, Traffic, Patch-For-Review, User-crusnov, Goal, SRE-tools

Oct 3 2019

Volans committed rOHPU80317dfecc9f: Add .gitreview (authored by QChris).
Oct 3 2019, 11:11 AM
Volans updated subscribers of T233183: Automate generation of Management DNS records from Netbox.

Moving discussion from https://gerrit.wikimedia.org/r/c/operations/software/netbox-deploy/+/539013 here (+ brandon)

Oct 3 2019, 10:44 AM · User-jbond, Operations, Traffic, Patch-For-Review, User-crusnov, Goal, SRE-tools
Volans added a comment to T231066: Host decommission improvements.

Updated Lifecycle page accordingly: https://wikitech.wikimedia.org/w/index.php?title=Server_Lifecycle&type=revision&diff=1839914&oldid=1837183

Oct 3 2019, 9:46 AM · Operations, DC-Ops, SRE-tools

Oct 2 2019

Volans added a comment to T230449: Automate selection of IP address for interface.

This script has been released and appears to work correctly!

Oct 2 2019, 2:41 PM · User-crusnov, Goal, SRE-tools
Volans updated subscribers of T187709: Cumin feature idea: Prometheus backend.

hi! i've been looking into this again and I think i might start looking at making a backend for Prometheus myself. It seems the first step would be to design a grammar for the Prometheus queries that wouldn't conflict with the existing selectors.

Oct 2 2019, 11:16 AM · SRE-tools
Volans added a comment to T234358: wmf-auto-reimage-host on HP gen10 WARNING: unable to verify that BIOS boot parameters are back to normal, got:.

Yes, that's kinda normal for many systems. I don't have stats on them to be able to see whether it depends on generation or brand, etc. I could probably relax the check, as the important parameter we want to check is just the force PXE boot.
With a bit of time we could probably grab some stats from the logs of the script on both hosts on which hosts do it, and find some pattern.

Oct 2 2019, 10:31 AM · SRE-tools

Sep 29 2019

Volans committed rCUMINeefdd9202c6c: tests: update requests_mock URI registration (authored by Volans).
Sep 29 2019, 10:47 AM

Sep 27 2019

Volans updated the task description for T228388: Configuration management for network operations.
Sep 27 2019, 6:22 AM · Patch-For-Review, Wikimedia-Incident, Operations, Goal, netops, SRE-tools

Sep 25 2019

Volans added a comment to T231066: Host decommission improvements.

I tested the cookbook on ms-be1027 in T233289: the host is powered down and not coming back (faulty hw), and the cookbook stopped when trying to reach the host, whereas IMHO it should have continued (and/or prompted) with the remaining steps.

Sep 25 2019, 2:49 PM · Operations, DC-Ops, SRE-tools

Sep 24 2019

Volans triaged T233728: Netbox: netbox_dump_run service failed as Normal priority.
Sep 24 2019, 2:56 PM · netbox
Volans created T233728: Netbox: netbox_dump_run service failed.
Sep 24 2019, 2:56 PM · netbox
Volans closed T233189: Requesting access to Ops Group for papaul@ as Resolved.

This is approved.

Sep 24 2019, 7:47 AM · Operations, SRE-Access-Requests
Volans created T233685: Tracking task for DCOps privileged commands.
Sep 24 2019, 7:45 AM · SRE-tools, DC-Ops

Sep 19 2019

Volans reassigned T233189: Requesting access to Ops Group for papaul@ from Volans to faidon.

Patch ready, pending approval.

Sep 19 2019, 3:36 PM · Operations, SRE-Access-Requests

Sep 18 2019

Volans added a comment to T232767: Netbox API Occasionally 500s and Netbox2001 dumpcsv fails.

From your description it seems that Netbox doesn't preserve the URL schema on pagination.
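
As an illustration of the symptom only, here is a client-side sketch that forces the scheme of a pagination link to match the one the client originally used. The function name and the example URLs are hypothetical, and this is not presented as the fix adopted for the task.

```python
from urllib.parse import urlsplit, urlunsplit


def force_scheme(next_url: str, base_url: str) -> str:
    """Rewrite `next_url` so that it uses the same scheme as `base_url`.

    Hypothetical workaround for an API whose pagination links come back
    with a different scheme (e.g. http) than the one used by the client.
    """
    scheme = urlsplit(base_url).scheme
    return urlunsplit(urlsplit(next_url)._replace(scheme=scheme))
```

Only the scheme is replaced; host, path and query string of the pagination link are left untouched.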

Sep 18 2019, 9:09 AM · SRE-tools

Sep 6 2019

Volans added a comment to T231066: Host decommission improvements.

@wiki_willy the related patch above should already help a lot, but as you know I'm off those days and I cannot give it the necessary testing for merging it. If anyone else wants to volunteer to merge+test it, they are welcome ;) Otherwise I'll take care of it as soon as I'm back.
As a workaround it is clearly possible to add more permissions to dcops; it's a trivial change in puppet that anyone can do, but it's not up to me to decide. That should be considered an access request to be decided by the owners of the group (SRE, usually discussed in the weekly meeting).

Sep 6 2019, 4:34 PM · Operations, DC-Ops, SRE-tools

Aug 31 2019

Volans added a comment to T223291: Netbox: move it to dedicated Ganeti VMs.

@crusnov in case you missed my IRC ping yesterday, please re-install the two public ones before proceeding with the installation as they had no firewall (see above hotfix patch).

Aug 31 2019, 9:30 AM · netbox

Aug 30 2019

Volans added a comment to T229686: #dbctl: manage 'externalLoads' data.

Here's a mini-design proposal for the dbctl feature itself (@Volans and @Joe please review):

  • Add a flavor enum to the section schema. The default will be regular, which will cause the section to be output in the config in sectionLoads, as usual.
  • The flavor externalstore will cause the section to be output in externalLoads.
    • Any groups set in the instances of a section with flavor=externalstore will be ignored.
Aug 30 2019, 8:47 PM · Performance-Team, DBA, conftool

Aug 28 2019

Volans added a comment to T209182: Setup Swift Storage for Netbox image (was: netbox won't allow me to upload photos of the rack).

Thanks for the update.

Aug 28 2019, 3:38 PM · Patch-For-Review, netbox, Operations

Aug 27 2019

Volans added a comment to T229134: Degraded RAID on sulfur.

@wiki_willy The data gathering failed because of host unreachable, but is this still a commissioned host? I cannot see the records in the DNS repo, just the management ones are there. Also see its decom task: T224475

Aug 27 2019, 8:16 PM · ops-eqiad, Operations
Volans closed T231278: NTT Transit link flapping, now BGP session down as Resolved.

It was a maintenance, tracked with GIN-1-2116159603, that was not present in the calendar because it was sent to noc@ and not to the maint announce ML. We need to update their settings.

Aug 27 2019, 9:45 AM · netops, Operations
Volans added a comment to T231278: NTT Transit link flapping, now BGP session down.

It seems that the session is misconfigured on their side:

Aug 27 09:10:38  cr2-eqord rpd[13953]: bgp_process_open:4072: NOTIFICATION sent to 2001:418:0:5000::a34 (External AS 2914): code 2 (Open Message Error) subcode 2 (bad peer AS number), Reason: peer 2001:418:0:5000::a34 (External AS 2914) claims 65000, 2914 configured
Aug 27 09:10:42  cr2-eqord rpd[13953]: bgp_process_open:4072: NOTIFICATION sent to 128.241.2.53 (External AS 2914): code 2 (Open Message Error) subcode 2 (bad peer AS number), Reason: peer 128.241.2.53 (External AS 2914) claims 65000, 2914 configured
Aug 27 2019, 9:17 AM · netops, Operations
Volans triaged T231278: NTT Transit link flapping, now BGP session down as High priority.
Aug 27 2019, 9:02 AM · netops, Operations

Aug 23 2019

Volans updated the task description for T231068: Spicerack: improve support for Ganeti VMs.
Aug 23 2019, 10:46 AM · SRE-tools
Volans triaged T231068: Spicerack: improve support for Ganeti VMs as Normal priority.
Aug 23 2019, 10:39 AM · SRE-tools
Volans created T231068: Spicerack: improve support for Ganeti VMs.
Aug 23 2019, 10:39 AM · SRE-tools
Volans moved T231066: Host decommission improvements from Backlog to In Progress on the SRE-tools board.
Aug 23 2019, 10:31 AM · Operations, DC-Ops, SRE-tools
Volans triaged T231066: Host decommission improvements as Normal priority.
Aug 23 2019, 9:18 AM · Operations, DC-Ops, SRE-tools

Aug 22 2019

Volans added a comment to T229657: Switchover m5 primary master: db1073 to db1133: Tuesday 3rd Sept at 13:00 UTC.

@CDanis @Volans can you confirm this command will set wikitech (db1073 is its master) on read-only?:

# set read-only
dbctl --scope eqiad section wikitech ro "Maintenance on wikitech T229657 " && dbctl config commit -m "Set wikitech as read-only for maintenance T229657"

Thanks!

Aug 22 2019, 1:04 PM · cloud-services-team (Kanban), wikitech.wikimedia.org, Operations, DBA

Aug 19 2019

Volans updated subscribers of T230712: sre.ganeti.makevm cook book only allows specifying RAM size in full gigabytes.

That's because we pass memory={memory}g to the gnt-instance add command. We could instead accept a float in the cookbook, convert it to MB and use m in the command.
I'm fine either way. CCing @elukey, who used it a lot, and @crusnov
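
A minimal sketch of the conversion proposed above. The function name is hypothetical and not from the sre.ganeti.makevm cookbook; it assumes 1 GB is treated as 1024 MB and that the result is rounded to a whole number of megabytes.

```python
def memory_arg(gigabytes: float) -> str:
    """Format a possibly fractional RAM size in GB as a memory=<N>m argument.

    Hypothetical helper: converts GB to MB (assuming 1 GB = 1024 MB),
    rounds to an integer and uses the 'm' suffix instead of 'g'.
    """
    megabytes = int(round(gigabytes * 1024))
    if megabytes <= 0:
        raise ValueError("memory size must be positive")
    return "memory={}m".format(megabytes)
```

With this shape, whole gigabyte values keep working (4 becomes memory=4096m) while fractional ones such as 1.5 become memory=1536m.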

Aug 19 2019, 11:23 AM · SRE-tools