
Volans (Riccardo Coccioli)
Operations Software Engineer

User Details

User Since
Feb 10 2016, 11:25 AM (166 w, 3 d)
Availability
Available
IRC Nick
volans
LDAP User
Volans
MediaWiki User
RCoccioli (WMF)

Recent Activity

Thu, Apr 18

Volans added a subtask for T220726: 1.34.0-wmf.1 deployment blockers: T221365: MassMessage not delivering.
Thu, Apr 18, 12:56 PM · Patch-For-Review, Release-Engineering-Team (Kanban), Release, Train Deployments
Volans added a parent task for T221365: MassMessage not delivering: T220726: 1.34.0-wmf.1 deployment blockers.
Thu, Apr 18, 12:56 PM · Patch-For-Review, MassMessage, Operations
Volans added a comment to T221265: Discussion: Explore push notifications options.

I agree that it would be nice to have push notifications for iOS and Android available as an option in addition to the paging system, but I have some doubts/questions about the specific choice of Prowl.

Thu, Apr 18, 11:08 AM · Operations

Wed, Apr 17

Volans added a comment to T198592: Debmonitor: add search capability.

I'm sure that others would prefer the opposite ;), going to the source package instead, to check the binaries generated by the source package that needs to be rebuilt and to find the affected hosts.

Wed, Apr 17, 5:56 PM · Patch-For-Review, Operations-Software-Development
Volans closed T198592: Debmonitor: add search capability as Resolved.

Search capability deployed!

Wed, Apr 17, 12:01 PM · Patch-For-Review, Operations-Software-Development
Volans triaged T221212: spicerack/cookbook: add additional arguments IRC/SAL logging as Normal priority.

Yes indeed, that's already in the plan for improvements. The problem here is that the logging is done by the framework, and the framework doesn't have (as of now) a way to know, for each cookbook, which parameters are safe to log and which are not.

Wed, Apr 17, 10:13 AM · Patch-For-Review, Operations-Software-Development, Operations

Tue, Apr 16

Dzahn awarded T220783: labtestcontrol2003 - UNKNOWN power supply status a Like token.
Tue, Apr 16, 11:07 PM · monitoring, Operations, ops-codfw
Volans updated subscribers of T221125: cumin aliases not matching any hosts.
Tue, Apr 16, 6:04 PM · cloud-services-team, Operations, Operations-Software-Development
Volans created T221115: labpuppetmaster logs 'cannot collect exported resources without storeconfigs being set'.
Tue, Apr 16, 5:20 PM · cloud-services-team, Operations
Volans added a comment to T221038: Cumin: shell expression requires variables to be escaped.

@hashar this has nothing to do with Cumin but with the local bash on the Cumin master.
If you use double quotes, bash interprets what's inside and replaces each variable with its value. If a variable is not defined, it is replaced with an empty string, so $foo will be replaced by an empty string and Cumin will receive echo '' as its parameter.
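A minimal sketch of the quoting behaviour described above, written in Python for illustration. It assumes a POSIX `sh` is available and uses `printf` as a stand-in for the program that receives the argument (it is not a real Cumin invocation); the helper name `run_in_shell` is hypothetical.

```python
import os
import subprocess


def run_in_shell(command_line):
    """Run a command line through sh and return what the inner program received."""
    # Make sure $foo is unset in the child shell, as in the reported case.
    env = {k: v for k, v in os.environ.items() if k != "foo"}
    result = subprocess.run(
        ["sh", "-c", command_line],
        stdout=subprocess.PIPE,
        universal_newlines=True,
        env=env,
        check=True,
    )
    return result.stdout.strip()


# Double quotes: the local shell expands $foo (unset, hence empty) first.
double_quoted = run_in_shell('printf "%s" "echo \'$foo\'"')

# Single quotes: the local shell passes the text through untouched.
single_quoted = run_in_shell("printf '%s' 'echo $foo'")

# Double quotes with an escaped dollar sign: the variable survives as well.
escaped = run_in_shell('printf "%s" "echo \\$foo"')

print(double_quoted)
print(single_quoted)
print(escaped)
```

Only the first form loses the variable before the receiving program sees it; the other two hand over the literal text `echo $foo`.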

Tue, Apr 16, 10:27 AM · Operations-Software-Development

Mon, Apr 15

Volans changed the status of T221038: Cumin: shell expression requires variables to be escaped from Resolved to Invalid.
Mon, Apr 15, 8:18 PM · Operations-Software-Development
Volans closed T221038: Cumin: shell expression requires variables to be escaped as Resolved.

Use "df | awk '{ print \$6}'"

Mon, Apr 15, 8:09 PM · Operations-Software-Development
hashar awarded T198592: Debmonitor: add search capability a Love token.
Mon, Apr 15, 1:05 PM · Patch-For-Review, Operations-Software-Development

Fri, Apr 12

Volans closed T220783: labtestcontrol2003 - UNKNOWN power supply status as Resolved.

I reset the mgmt card (see https://wikitech.wikimedia.org/wiki/Management_Interfaces#Reset_the_management_card ), waited for it to reboot, ran ipmi-sensors, which told me that the cache was outdated and needed a flush, ran ipmi-sensors -f, and we were good to go.
Sensors are working again, and Icinga is happy, resolving.

Fri, Apr 12, 9:33 AM · monitoring, Operations, ops-codfw
Volans closed T220783: labtestcontrol2003 - UNKNOWN power supply status, a subtask of T218403: Degraded RAID on labtestcontrol2003, as Resolved.
Fri, Apr 12, 9:33 AM · Operations, ops-codfw
Volans updated subscribers of T220787: Fix RAID handler alert and puppet facter to work with Gen10 hosts and ssacli tool.

In addition to T220787#5106275, off the top of my head I think we also need to:

  • check if the DSA script we're using to alarm on HP RAID ( modules/raid/files/dsa-check-hpssacli ) has been updated upstream (Debian) and update it, or patch it and send the patch upstream (cc @faidon )
  • adapt modules/raid/files/get-raid-status-hpssacli.sh to detect which executable is available and act accordingly, assuming they have the same options. If not, we need to adapt the script to handle the two different executables.
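The "detect which executable is available" step could look roughly like the following. This is a Python sketch for illustration only, not the shell script named above; the tool names come from the task title and file names, while the preference order and the function name `find_raid_cli` are assumptions.

```python
import shutil

# Candidate tool names; trying the newer name first is an assumption.
CANDIDATES = ("ssacli", "hpssacli")


def find_raid_cli(candidates=CANDIDATES, which=shutil.which):
    """Return the path of the first candidate found in PATH, or None."""
    for name in candidates:
        path = which(name)
        if path is not None:
            return path
    return None


if __name__ == "__main__":
    cli = find_raid_cli()
    print(cli if cli else "no HP RAID command line tool found")
```

The `which` parameter is injectable so the lookup can be exercised on a machine that has neither tool installed.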
Fri, Apr 12, 9:17 AM · Patch-For-Review, Operations, Icinga, monitoring

Wed, Apr 10

Volans added a comment to T219908: Build an API for generating boot options for iPXE from Netbox et al. based on Serial Number.

Okay the only question that seems open in my mind is how does the service map serial to fqdn?

Wed, Apr 10, 9:43 PM · User-crusnov, Operations-Software-Development
Volans added a comment to T219908: Build an API for generating boot options for iPXE from Netbox et al. based on Serial Number.
  1. spicerack/cumin calls an end-point, say PUT https://<deployment>/ipxe/<serial> with DH URL and parameters
Wed, Apr 10, 2:39 PM · User-crusnov, Operations-Software-Development

Mon, Apr 8

Volans added a comment to T196336: Icinga passive checks go awol and downtime stops working.

The log from today makes me think that there is some sort of race condition between the Icinga reload (usually triggered by puppet) and the passive checks coming in from NSCA.

Mon, Apr 8, 9:31 AM · Patch-For-Review, Operations, Icinga, monitoring

Sun, Apr 7

Volans triaged T220297: Icinga process too many open files as Normal priority.
Sun, Apr 7, 11:32 AM · monitoring, Operations
Volans created T220297: Icinga process too many open files.
Sun, Apr 7, 11:32 AM · monitoring, Operations
Volans closed T163286: Tegmen: process spawn loop + failed icinga + failing puppet as Resolved.

Since this task's last update we've migrated Icinga to new hosts (jessie -> stretch) and to a slightly different version of Icinga. Resolving.

Sun, Apr 7, 11:30 AM · Patch-For-Review, Operations, monitoring

Fri, Apr 5

Volans committed rOSACc905ce82f69c: acme-chief-api: Add support for puppet HTTP API search operation (authored by Vgutierrez).
Fri, Apr 5, 11:13 PM
Mill <mill@mail.com> committed rCUMIN45d0157ae812: (lbaaaaaaaaaaa (authored by Volans).
Fri, Apr 5, 10:29 PM

Thu, Apr 4

Volans added a comment to T198939: Decommission servermon.

And when we do, can we also drop the package_updates custom fact?

Thu, Apr 4, 4:53 PM · Patch-For-Review, Operations
Volans added a comment to T219908: Build an API for generating boot options for iPXE from Netbox et al. based on Serial Number.

I suppose the conversation we need is:

  • Where will this live?
Thu, Apr 4, 10:42 AM · User-crusnov, Operations-Software-Development
Volans added a comment to T219854: Broken disk on ms-be2026.

@fgiunchedi what are your thoughts on T219854#5076968 ? That's the last remaining part of this task I guess.

Thu, Apr 4, 10:27 AM · Patch-For-Review, Operations, ops-codfw

Tue, Apr 2

Volans updated subscribers of T219854: Broken disk on ms-be2026.

Forgot to mention that during the reboot it printed:

Slot 3 Port 1 : Smart Array P840 Controller - (4096 MB, V3.56) 14 Logical
Drive(s) - Operation Failed
 - 1719-Slot 3 Drive Array - A controller failure event occurred prior
   to this power-up.  (Previous lock up code = 0x13) Action: Install the
   latest controller firmware. If the problem persists, replace the
   controller.
Tue, Apr 2, 11:47 PM · Patch-For-Review, Operations, ops-codfw
Volans committed rCUMIN21a07c436585: Make the puppetdb backend process primitive types for queries (authored by crusnov).
Tue, Apr 2, 3:16 PM
Volans added a comment to T219854: Broken disk on ms-be2026.

After the reboot the host is back up and running, all seems good so far. Keeping open for a bit to see if it holds.

Tue, Apr 2, 1:47 PM · Patch-For-Review, Operations, ops-codfw
Volans added a comment to T219854: Broken disk on ms-be2026.

So the dsa-check-hpssacli check is happily returning 0 exit code and this output:

OK: Slot 0: no logical drives --- Slot 0: no drives

Given that, IIRC, we add the HP RAID check only on the hosts that have it, we might consider patching this imported script to fail when there is a controller but it has no drives configured (both no logical and no physical?)
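A rough sketch of the proposed behaviour, in Python rather than as a patch to the imported script. It assumes the output format matches the single sample line quoted above, and it flags a slot that reports either "no logical drives" or "no drives"; requiring both is the open question in the comment. The function name `evaluate` is hypothetical.

```python
import re

# Standard Nagios/Icinga plugin exit codes.
OK, CRITICAL = 0, 2


def evaluate(output):
    """Return (exit_code, message), treating a controller without drives as a failure."""
    # Collect every slot that reports no logical drives or no drives at all.
    empty_slots = sorted(set(re.findall(r"(Slot \d+): no (?:logical )?drives", output)))
    if empty_slots:
        return CRITICAL, "CRITICAL: no drives configured on " + ", ".join(empty_slots)
    return OK, output


if __name__ == "__main__":
    code, message = evaluate("OK: Slot 0: no logical drives --- Slot 0: no drives")
    print(code, message)
```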

Tue, Apr 2, 9:50 AM · Patch-For-Review, Operations, ops-codfw
Volans added a comment to T219854: Broken disk on ms-be2026.

I can ssh into it via cumin.

Tue, Apr 2, 9:17 AM · Patch-For-Review, Operations, ops-codfw

Mon, Apr 1

Volans added a comment to T190992: prometheus: slow dashboards due to suboptimal query_range performance.

@ema given the speedup due to prometheus 2 do you think this still needs to be worked on or could be resolved?

Mon, Apr 1, 3:20 PM · Traffic, monitoring, Operations
Volans closed T217599: Create an external check for Icinga as Resolved.

The check has been live for a while and the contact list will slowly grow. Resolving.

Mon, Apr 1, 3:18 PM · Patch-For-Review, monitoring
Volans closed T217599: Create an external check for Icinga, a subtask of T213084: Build an understanding of our needs around external monitoring services - Q3 2018/19 goal, as Resolved.
Mon, Apr 1, 3:18 PM · User-CDanis, monitoring, Goal
Volans closed T219775: wmf-auto-reimage-host: puppet first run error leads to some weird behaviour as Resolved.
Mon, Apr 1, 2:57 PM · Patch-For-Review, Operations-Software-Development, Operations
Volans moved T219775: wmf-auto-reimage-host: puppet first run error leads to some weird behaviour from Backlog to In Code Review on the Operations-Software-Development board.
Mon, Apr 1, 11:18 AM · Patch-For-Review, Operations-Software-Development, Operations
Volans added a comment to T219775: wmf-auto-reimage-host: puppet first run error leads to some weird behaviour.

It's all in the logs mentioned at the start of the script:

Could not retrieve catalog from remote server: Error 500 on SERVER: Server Error: Evaluation Error: Error while evaluating a Function Call, Could not find data item profile::openstack::codfw1dev::observer_password in any Hiera data file and no default supplied at /etc/puppet/modules/profile/manifests/openstack/codfw1dev/observerenv.pp:4:26 on node cloudcontrol2001-dev.wikimedia.org
Mon, Apr 1, 10:58 AM · Patch-For-Review, Operations-Software-Development, Operations
Volans claimed T219775: wmf-auto-reimage-host: puppet first run error leads to some weird behaviour.

It's kind of expected; the line where it says:

Scheduled delayed downtime on Icinga

spawns a subprocess that downtimes the host on Icinga with a delay. This is because we use exported resources for Icinga checks: to downtime the host, the puppetmaster must first compile the catalog for the host and export it into PuppetDB. Only after that can we run puppet on Icinga to gather the new host configuration and then downtime it.

Mon, Apr 1, 10:54 AM · Patch-For-Review, Operations-Software-Development, Operations

Thu, Mar 28

Volans added a comment to T218736: Discussions around having a Ganeti RAPI R/W User.

I'd start with the RO user and see where we're going with the spicerack Ganeti module, and re-evaluate when we start feeling blocked by this.
Having all RO operations done via the API and just the RW ones via ssh might also be an option as a final solution if we have concerns about the security of the RW API user.

Thu, Mar 28, 4:02 PM · Operations-Software-Development, User-crusnov
Volans added a comment to T219454: Make Spicerack cookbook to resize ganeti VM.

Cluster capacity is already directly exposed by Ganeti, see https://wikitech.wikimedia.org/wiki/Ganeti#Listing_cluster_nodes

Thu, Mar 28, 10:13 AM · Operations-Software-Development

Wed, Mar 27

Volans created T219400: Make authdns-update compatible with local emergency changes.
Wed, Mar 27, 3:00 PM · Traffic, Operations

Tue, Mar 26

Volans renamed T219333: apt-get update broken on jessie: jessie-updates and jessie-backports removed by Debian from jessie-updates and jessie-backports removed by Debian to apt-get update broken on jessie: jessie-updates and jessie-backports removed by Debian.
Tue, Mar 26, 10:02 PM · Patch-For-Review, Operations
Volans triaged T219333: apt-get update broken on jessie: jessie-updates and jessie-backports removed by Debian as High priority.
Tue, Mar 26, 10:01 PM · Patch-For-Review, Operations
Volans created T219333: apt-get update broken on jessie: jessie-updates and jessie-backports removed by Debian.
Tue, Mar 26, 10:01 PM · Patch-For-Review, Operations

Sun, Mar 24

Volans added a comment to T184435: Puppet tox: properly lint both Py2 and Py3 files.

Given that py2 EOL is at the end of 2019, I'm not sure it's worth spending our energy on making this check smart at this point.
An alternative proposal could be to migrate CI at some point to test our Python files with Python 3.4 (jessie), 3.5 (stretch) and 3.7 (buster) [optionally 3.6 too, but that seems unnecessary in our current infrastructure].

Sun, Mar 24, 12:33 PM · Patch-For-Review, Operations-Software-Development, Operations

Thu, Mar 21

Volans closed T218441: Cumin: allow querying PuppetDB over HTTP as Resolved.
Thu, Mar 21, 7:24 PM · Patch-For-Review, Operations-Software-Development
Volans committed rCUMIN8e49a21f5758: PuppetDB backend: allow to override URL scheme in config (authored by TheAnarcat).
Thu, Mar 21, 6:43 PM
Volans moved T218441: Cumin: allow querying PuppetDB over HTTP from In Progress to In Code Review on the Operations-Software-Development board.
Thu, Mar 21, 6:41 PM · Patch-For-Review, Operations-Software-Development
Volans moved T218441: Cumin: allow querying PuppetDB over HTTP from Backlog to In Progress on the Operations-Software-Development board.
Thu, Mar 21, 6:41 PM · Patch-For-Review, Operations-Software-Development

Mar 20 2019

Volans added a comment to T218441: Cumin: allow querying PuppetDB over HTTP.

@TheAnarcat yes we open the firewall on the PuppetDB hosts only from specific hosts (puppetmasters and cumin masters basically) and use the Puppet certificate for the HTTPS part.

Mar 20 2019, 3:42 PM · Patch-For-Review, Operations-Software-Development

Mar 19 2019

Krenair awarded T218723: Unable to push to a certain gerrit changeset due to "missing revisions" a Barnstar token.
Mar 19 2019, 8:18 PM · Gerrit
Volans closed T218723: Unable to push to a certain gerrit changeset due to "missing revisions" as Resolved.

They should all be OK. Feel free to re-open in case of issues.

Mar 19 2019, 8:16 PM · Gerrit

Mar 18 2019

Volans added a comment to T218544: ms-be1043 sdk failed.

Also worth mentioning that tracking the IDs for missing ones is not enough because if the last one fails we should know in advance how many are supposed to be there.
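To illustrate the point with a small sketch: it assumes disk IDs are contiguous integers starting at 0, which is an assumption made here for illustration and is not stated in the task; both function names are hypothetical.

```python
def missing_ids(present_ids, expected_count):
    """Return the sorted IDs in range(expected_count) that are not present."""
    return sorted(set(range(expected_count)) - set(present_ids))


def missing_ids_from_gaps(present_ids):
    """Naive approach: only look for holes below the highest ID seen."""
    if not present_ids:
        return []
    return missing_ids(present_ids, max(present_ids) + 1)


if __name__ == "__main__":
    # Twelve disks expected, the last one (ID 11) has dropped out.
    present = list(range(11))
    print(missing_ids_from_gaps(present))  # finds nothing, there is no gap
    print(missing_ids(present, 12))        # finds the missing last disk
```

Without knowing the expected count in advance, a failure of the highest-numbered disk leaves no gap to detect.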

Mar 18 2019, 10:50 AM · Patch-For-Review, monitoring, Operations-Software-Development, Operations, ops-eqiad
Volans added a comment to T218544: ms-be1043 sdk failed.

@fgiunchedi agree that this is a new issue, and we need to fix two different scripts to have an automatic task created for this:

Mar 18 2019, 10:36 AM · Patch-For-Review, monitoring, Operations-Software-Development, Operations, ops-eqiad

Mar 15 2019

Volans triaged T218441: Cumin: allow querying PuppetDB over HTTP as Normal priority.

First of all thanks a lot for the report and the patch @TheAnarcat.
The choice of hardcoding HTTPS as the protocol in the PuppetDB backend was made on the assumption that Puppet, PuppetDB and Cumin would most likely be installed on different hosts; given the private nature of the data stored in PuppetDB, requiring HTTPS seemed a natural choice. Of course, if all of them are on the same host and listen only on localhost, it doesn't really matter.
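A minimal sketch of what a configurable scheme with an HTTPS default could look like. The config keys (`host`, `port`, `scheme`), the default port, and the function name are hypothetical and are not taken from Cumin's code.

```python
ALLOWED_SCHEMES = ("http", "https")


def puppetdb_base_url(config):
    """Build the base URL from a config mapping, defaulting to HTTPS."""
    scheme = config.get("scheme", "https")  # hypothetical key, HTTPS unless overridden
    if scheme not in ALLOWED_SCHEMES:
        raise ValueError("unsupported scheme: {0}".format(scheme))
    host = config.get("host", "localhost")
    port = config.get("port", 443)
    return "{0}://{1}:{2}".format(scheme, host, port)


if __name__ == "__main__":
    # Everything on one host, listening on localhost only: plain HTTP is acceptable.
    print(puppetdb_base_url({"host": "localhost", "port": 8080, "scheme": "http"}))
    # Separate hosts: keep the HTTPS default.
    print(puppetdb_base_url({"host": "puppetdb.example.org"}))
```

Keeping HTTPS as the default preserves the original behaviour for multi-host setups, while the override covers the single-host case.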

Mar 15 2019, 11:15 PM · Patch-For-Review, Operations-Software-Development
Volans added a comment to T215229: Keep Ganeti VMs synchronized in Netbox.

One thing that is missing is the physical devices that belong to a cluster, see https://netbox.wikimedia.org/virtualization/clusters/3/

Mar 15 2019, 11:45 AM · Patch-For-Review, User-crusnov, Operations-Software-Development
Volans merged task T172708: HP RAID (Service Check Timed Out) on swift hosts into T210723: Address recurrent service check time out for "HP RAID" on swift backend hosts.
Mar 15 2019, 11:04 AM · media-storage, Operations, monitoring
Volans merged T172708: HP RAID (Service Check Timed Out) on swift hosts into T210723: Address recurrent service check time out for "HP RAID" on swift backend hosts.
Mar 15 2019, 11:04 AM · User-fgiunchedi, Operations, monitoring
Volans updated subscribers of T172708: HP RAID (Service Check Timed Out) on swift hosts.

Given that this is quite old, I'm closing it as a duplicate of T210723, which has a more recent discussion of possible solutions. (CC @colewhite )

Mar 15 2019, 11:04 AM · media-storage, Operations, monitoring

Mar 14 2019

Volans added a comment to T218188: Import issue (bug?) on Python 3.4/3.5 + multiprocessing affecting Cumin.

So, I tried upstream by opening https://bugs.python.org/issue36284 but it got closed because 3.4 and 3.5 receive security fixes only at this point.
I'll look into adding a workaround in Cumin itself, but it's not trivial because the reload() messes a bit with existing things.

Mar 14 2019, 10:49 AM · Operations-Software-Development

Mar 13 2019

Gerrit Code Review <gerrit@wikimedia.org> committed rOSNBdf709fbf558c: Modify access rules (authored by Volans).
Mar 13 2019, 7:01 PM
Volans moved T218188: Import issue (bug?) on Python 3.4/3.5 + multiprocessing affecting Cumin from Backlog to In Progress on the Operations-Software-Development board.
Mar 13 2019, 12:06 PM · Operations-Software-Development
Volans updated subscribers of T218188: Import issue (bug?) on Python 3.4/3.5 + multiprocessing affecting Cumin.

So, I was able to repro this on Python 3.4 and 3.5 but not on 3.6 and 3.7 where it works like a charm.

Mar 13 2019, 11:50 AM · Operations-Software-Development
Volans added a comment to T218188: Import issue (bug?) on Python 3.4/3.5 + multiprocessing affecting Cumin.

OK, I was able to repro without any Cumin involvement. I've created the following structure:

$ tree repro/
repro/
├── fail.py
└── __init__.py
Mar 13 2019, 11:35 AM · Operations-Software-Development
Volans added a comment to T218188: Import issue (bug?) on Python 3.4/3.5 + multiprocessing affecting Cumin.

This is quite weird and unfortunately requires some more in-depth analysis.

Mar 13 2019, 11:27 AM · Operations-Software-Development

Mar 12 2019

Volans added a comment to T214760: icinga1001 crashed.

@RobH Icinga runs on both hosts and generates the same load; being active or passive changes very little. I've been monitoring this host with our beta Icinga meta-monitoring and so far so good. It now has 11 days of uptime, and I have a proposal for tomorrow's Foundation's meeting to fail back to icinga1001 as the active server either this Thu. or next Mon., given that it seems stable now.

Mar 12 2019, 6:09 PM · Patch-For-Review, ops-eqiad, monitoring, Operations

Mar 11 2019

Volans added a comment to T205897: Netbox: fill network topology.

I've had a chat with @ayounsi about this.

Mar 11 2019, 5:04 PM · Operations

Mar 6 2019

Volans added a comment to T213527: Prepare our base system layer for Debian buster.

@jcrespo that's T216832 and we were thinking of just creating a home for the user (cc @MoritzMuehlenhoff )

Mar 6 2019, 4:38 PM · Patch-For-Review, Operations
Volans added a comment to T217686: Document service owner in Netbox.

My main concern here is that the concept of a single service owner is limited and doesn't reflect reality.
We have multiple roles for each server/service. Not all of those "roles" exist in every case, but I think it can be boiled down to:

Mar 6 2019, 11:15 AM · Operations

Mar 5 2019

Volans moved T217599: Create an external check for Icinga from Backlog to In progress on the monitoring board.
Mar 5 2019, 1:30 PM · Patch-For-Review, monitoring
Dvorapa awarded T191764: CI: run tests with multiple Python3 versions a Love token.
Mar 5 2019, 9:22 AM · Patch-For-Review, User-ArielGlenn, Continuous-Integration-Infrastructure

Mar 4 2019

Volans created T217599: Create an external check for Icinga.
Mar 4 2019, 8:35 PM · Patch-For-Review, monitoring
Volans added a comment to T212526: Implement netbox reports which check against PuppetDB.

Unless it has a spare::system role, in which case it should be staged 😉

Mar 4 2019, 4:57 PM · Patch-For-Review, User-crusnov, Operations-Software-Development
Volans added a comment to T217429: Update several hosts status in Netbox.

@Marostegui the different states and their transitions (when they are supposed to be updated) are described here:

Mar 4 2019, 4:56 PM · Operations, ops-eqiad

Feb 27 2019

Volans added a comment to T214760: icinga1001 crashed.

@Cmjohnson my tests on icinga1001 are completed, so feel free to shut it down at will when parts are available.

Feb 27 2019, 7:20 PM · Patch-For-Review, ops-eqiad, monitoring, Operations
Volans awarded T217231: puppet leaks sensitive cryptographic acme-chief material a Like token.
Feb 27 2019, 8:54 AM · Operations, Traffic, Acme-chief

Feb 26 2019

Volans added a comment to T187987: 100% of Prometheus traffic served by Prometheus v2.

Would that mean that the missing hours will be totally lost? In that case it's probably better to ask the users that were asking for longer retention, to make sure we're not losing any required data. (my 2 cents)

Feb 26 2019, 11:06 AM · Patch-For-Review, monitoring, Operations
Volans added a comment to T187987: 100% of Prometheus traffic served by Prometheus v2.

From the error reported in the upstream issue it seems that it is data-dependent. Have you by any chance tried any other retention between 8500h and 10500h?
Can we enable any more debugging to get a better idea of which metric is throwing the error so that maybe we can just skip it instead?

Feb 26 2019, 10:33 AM · Patch-For-Review, monitoring, Operations

Feb 25 2019

Volans moved T217038: Cumin: replace colorama from Backlog to Up next on the Operations-Software-Development board.
Feb 25 2019, 2:24 PM · Patch-For-Review, Operations-Software-Development
Volans triaged T217038: Cumin: replace colorama as High priority.
Feb 25 2019, 2:24 PM · Patch-For-Review, Operations-Software-Development
Volans added a comment to T216985: google safe browsing icinga checks sporadic UNKNOWN due to 404.

There could be some throttling ongoing. Also from a very quick look at [1] we might be using an older API version/url...

Feb 25 2019, 11:13 AM · Patch-For-Review, monitoring, Operations

Feb 22 2019

Volans added a comment to T214760: icinga1001 crashed.

FYI it crashed again:

--------------------------------------------------------------------------------
SeqNumber       = 481
Message ID      = SYS1003
Category        = Audit
AgentID         = DE
Severity        = Information
Timestamp       = 2019-02-22 04:04:12
Message         = System CPU Resetting.
FQDD            = iDRAC.Embedded.1#HostPowerCtrl
--------------------------------------------------------------------------------
SeqNumber       = 480
Message ID      = RAC0703
Category        = Audit
AgentID         = RACLOG
Severity        = Information
Timestamp       = 2019-02-22 04:04:10
Message         = Requested system hardreset.
FQDD            = iDRAC.Embedded.1
--------------------------------------------------------------------------------
Feb 22 2019, 11:23 AM · Patch-For-Review, ops-eqiad, monitoring, Operations

Feb 21 2019

Volans added a comment to T215229: Keep Ganeti VMs synchronized in Netbox.

I've merged https://gerrit.wikimedia.org/r/c/operations/puppet/+/492202 to fix the configuration and forced a puppet run on A:ganeti as ferm failed on all of them

Feb 21 2019, 10:22 PM · Patch-For-Review, User-crusnov, Operations-Software-Development
Volans added a comment to T214760: icinga1001 crashed.

@RobH and it crashed again already! I'll leave it down in case @Cmjohnson wants to attach a physical console.
Anyway, it's all yours; it can be shut down/rebooted at will.

Feb 21 2019, 11:25 AM · Patch-For-Review, ops-eqiad, monitoring, Operations
Volans added a comment to T214760: icinga1001 crashed.

Hardware logs:

Feb 21 2019, 9:41 AM · Patch-For-Review, ops-eqiad, monitoring, Operations
Volans reopened T214760: icinga1001 crashed as "Open".

icinga1001 is unresponsive this morning (no ping, no ssh, black console), re-opening

Feb 21 2019, 9:35 AM · Patch-For-Review, ops-eqiad, monitoring, Operations
Volans reopened T214760: icinga1001 crashed, a subtask of T210108: icinga1001 mysterious reboots, as Open.
Feb 21 2019, 9:35 AM · ops-eqiad, DC-Ops, Operations

Feb 20 2019

Volans added a comment to T215378: Figure out how to make Netbox Reports actionable / alertable.

yeah, I think they are a bit overwhelmed by the activity on GitHub, all the issues, etc. In the contributing page they state:

Due to an excessive backlog of feature requests, we are not currently accepting any proposals which substantially extend NetBox's functionality beyond its current feature set
Feb 20 2019, 10:28 PM · Operations-Software-Development
Volans added a comment to T212016: Create a repository for sharing ad-hoc local development tools.

I was made aware today of the existence of a wmf-utils repository that might have been created with that in mind but doesn't seem to be used much:
https://gerrit.wikimedia.org/g/wmf-utils/+/refs/heads/master

Feb 20 2019, 5:51 PM · Developer Productivity
Volans added a comment to T215378: Figure out how to make Netbox Reports actionable / alertable.

Any insight on upstream plans for reports based on open issues or their mailing list?
Because we can surely work around it for some specific cases like the sort, but if we want to expand the usage of reports those kinds of "hacks" will not be enough.

Feb 20 2019, 9:33 AM · Operations-Software-Development

Feb 19 2019

Volans closed T212990: Degraded RAID on helium as Resolved.

It seems all good from megacli:

$ sudo /usr/local/lib/nagios/plugins/get-raid-status-megacli
=== RaidStatus (does not include components in optimal state)
=== RaidStatus completed
Feb 19 2019, 11:31 AM · ops-eqiad, Operations
Volans added a comment to T212010: Degraded RAID on sodium.

It seems that one disk has failed in a way that is not even reported by megacli. The new version of the script reports:

=== RaidStatus (does not include components in optimal state)
name: Adapter #0
Feb 19 2019, 11:29 AM · ops-eqiad, Operations
Volans added a comment to T212990: Degraded RAID on helium.
Feb 19 2019, 11:28 AM · ops-eqiad, Operations
Volans added a comment to T215892: Degraded RAID on cloudvirt1024.

It seems that PD: 8 has also failed now:

			PD: 8 Information
			Enclosure Device ID: 32
			Slot Number: 8
			Drive's position: DiskGroup: 0, Span: 0, Arm: 8
			Media Error Count: 0
			Other Error Count: 110
			Predictive Failure Count: 0
			Last Predictive Failure Event Seq Number: 0
Feb 19 2019, 11:21 AM · cloud-services-team (Kanban), ops-eqiad, Operations
Volans moved T216469: Netbox: cable termination names report from Backlog to Up next on the Operations-Software-Development board.
Feb 19 2019, 10:28 AM · Operations-Software-Development, Operations
Volans added a project to T216469: Netbox: cable termination names report: Operations-Software-Development.
Feb 19 2019, 10:28 AM · Operations-Software-Development, Operations

Feb 14 2019

Volans closed T199911: Systemd session creation fails under I/O load as Resolved.
Feb 14 2019, 6:47 PM · Operations, Operations-Software-Development
Volans updated the task description for T205867: Expand Spicerack library and SRE Cookbooks - Q2 2018-19 Goal.
Feb 14 2019, 5:40 PM · Patch-For-Review, Operations-Software-Development, Operations, Goal
Volans added a parent task for T204789: wmf-auto-reimage tries to remove from Debmonitor even with --new: T205885: Cookbooks: convert wmf-auto-reimage scripts to Cookbooks.
Feb 14 2019, 5:39 PM · Operations, Operations-Software-Development