Volans (Riccardo Coccioli)
Operations Software Engineer

User Details

User Since
Feb 10 2016, 11:25 AM (83 w, 6 d)
Availability
Available
IRC Nick
volans
LDAP User
Volans
MediaWiki User
RCoccioli (WMF)

Recent Activity

Wed, Sep 13

Volans removed a project from T174008: Cumin: setup.py installs data_files in wrong directory: Patch-For-Review.
Wed, Sep 13, 4:32 PM · Operations-Software-Development
Volans moved T175711: Cumin: create backend for OpenStack from Backlog to In Progress on the Operations-Software-Development board.
Wed, Sep 13, 4:32 PM · Cloud-VPS, Operations-Software-Development
Volans moved T175712: Install cumin in the WMCS infrastructure from In Progress to In Code Review on the Operations-Software-Development board.
Wed, Sep 13, 4:31 PM · Patch-For-Review, Cloud-VPS, Operations-Software-Development
Volans merged T175820: Degraded RAID on lvs3001 into T168619: Degraded RAID on lvs3001.
Wed, Sep 13, 11:32 AM · ops-esams, Operations
Volans merged T175808: Degraded RAID on lvs3001 into T168619: Degraded RAID on lvs3001.
Wed, Sep 13, 9:57 AM · ops-esams, Operations

Tue, Sep 12

Volans merged T175715: Degraded RAID on db2010 into T175228: Degraded RAID on db2010.
Tue, Sep 12, 5:28 PM · DBA, Operations, ops-codfw
Volans moved T175712: Install cumin in the WMCS infrastructure from Backlog to In Progress on the Operations-Software-Development board.
Tue, Sep 12, 4:44 PM · Patch-For-Review, Cloud-VPS, Operations-Software-Development
Volans created T175712: Install cumin in the WMCS infrastructure.
Tue, Sep 12, 4:43 PM · Patch-For-Review, Cloud-VPS, Operations-Software-Development
Volans created T175711: Cumin: create backend for OpenStack.
Tue, Sep 12, 4:43 PM · Cloud-VPS, Operations-Software-Development
Volans moved T149230: wmf-auto-reimage: allow to specify the conftool state from In Progress to In Code Review on the Operations-Software-Development board.
Tue, Sep 12, 4:37 PM · Operations-Software-Development
Volans moved T169555: puppet.service systemctl failures after reimage from In Progress to In Code Review on the Operations-Software-Development board.
Tue, Sep 12, 4:37 PM · Operations-Software-Development
Volans moved T148817: wmf-auto-reimage: remove dependency on wmf-reimage from In Progress to In Code Review on the Operations-Software-Development board.
Tue, Sep 12, 4:37 PM · Operations-Software-Development
Volans moved T166300: Remove Salt from wmf-auto-reimage / wmf-reimage from In Progress to In Code Review on the Operations-Software-Development board.
Tue, Sep 12, 4:37 PM · Patch-For-Review, Technical-Debt, Operations-Software-Development, Operations
Volans moved T148814: wmf-auto-reimage improvements from In Progress to In Code Review on the Operations-Software-Development board.
Tue, Sep 12, 4:37 PM · Patch-For-Review, Operations-Software-Development
Volans added a comment to T167992: rack/setup/install new kafka nodes kafka-jumbo100[1-6].

For the record, they were reimaged correctly; the new reimage script hit a small bug in the post-reimage part. I've already re-run it for the "failed" host to complete the post-reimage steps.

Tue, Sep 12, 9:53 AM · User-Elukey, Patch-For-Review, ops-eqiad, Analytics, Analytics-Cluster, Operations

Fri, Sep 8

Volans closed T149213: wmf-auto-reimage: handle multiple conftool services per host as Resolved.

With the refactoring for T166300, this problem will be solved naturally by moving from shelling out to confctl to using the conftool library. Resolving.

Fri, Sep 8, 7:48 AM · Operations-Software-Development
Volans closed T159127: Cumin: fine tuning configuration as Resolved.

The configuration has looked stable for a while; resolving.

Fri, Sep 8, 7:45 AM · Operations-Software-Development
Volans closed T171684: Cumin: improve target management as Resolved.
Fri, Sep 8, 7:43 AM · Operations-Software-Development
Volans closed T174854: Icinga raid handler improvements as Resolved.
Fri, Sep 8, 7:43 AM · Operations-Software-Development
Volans triaged T174008: Cumin: setup.py installs data_files in wrong directory as Normal priority.

Issue fixed in the master branch. Leaving the task open so we don't forget to revert the hotfix made in the debian branch in https://gerrit.wikimedia.org/r/#/c/373513/

Fri, Sep 8, 7:42 AM · Operations-Software-Development
Volans closed T174911: Cumin: clustershell worker should return if no target are specified as Resolved.
Fri, Sep 8, 7:24 AM · Patch-For-Review, Operations-Software-Development

Thu, Sep 7

Volans closed T175252: Degraded RAID on ms-be2023 as Invalid.
Thu, Sep 7, 2:17 PM · Operations, ops-codfw
Volans closed T175253: Degraded RAID on ms-be2023 as Invalid.
Thu, Sep 7, 2:17 PM · Operations, ops-codfw
Volans closed T175267: Degraded RAID on ms-be2023 as Invalid.
Thu, Sep 7, 2:17 PM · Operations, ops-codfw
Volans closed T175277: Degraded RAID on ms-be2023 as Invalid.
Thu, Sep 7, 1:51 PM · Operations, ops-codfw
Volans closed T175276: Degraded RAID on ms-be2023 as Invalid.
Thu, Sep 7, 1:51 PM · Operations, ops-codfw
Volans closed T175275: Degraded RAID on ms-be2023 as Invalid.
Thu, Sep 7, 1:50 PM · Operations, ops-codfw
Volans closed T175278: Degraded RAID on ms-be2023 as Invalid.
Thu, Sep 7, 1:50 PM · Operations, ops-codfw
Volans closed T175271: Degraded RAID on ms-be2023 as Invalid.

False positive; I'll check why it was not blacklisted.

Thu, Sep 7, 1:36 PM · Operations, ops-codfw

Wed, Sep 6

Volans merged T175174: Degraded RAID on ms-be2023 into T174777: Degraded RAID on ms-be2023.
Wed, Sep 6, 4:46 PM · Operations, ops-codfw
Volans merged T175168: Degraded RAID on lvs3001 into T168619: Degraded RAID on lvs3001.
Wed, Sep 6, 4:01 PM · ops-esams, Operations

Tue, Sep 5

Volans updated the task description for T171704: Switch all hosts to the future parser.
Tue, Sep 5, 2:28 PM · Patch-For-Review, User-Joe, Puppet, Operations

Mon, Sep 4

Volans moved T174911: Cumin: clustershell worker should return if no target are specified from In Progress to In Code Review on the Operations-Software-Development board.
Mon, Sep 4, 11:03 AM · Patch-For-Review, Operations-Software-Development
Volans moved T174911: Cumin: clustershell worker should return if no target are specified from Backlog to In Progress on the Operations-Software-Development board.
Mon, Sep 4, 10:35 AM · Patch-For-Review, Operations-Software-Development
Volans created T174911: Cumin: clustershell worker should return if no target are specified.
Mon, Sep 4, 10:35 AM · Patch-For-Review, Operations-Software-Development
Volans moved T174854: Icinga raid handler improvements from In Progress to In Code Review on the Operations-Software-Development board.
Mon, Sep 4, 9:02 AM · Operations-Software-Development

Sun, Sep 3

Volans added a comment to T162013: etcd cluster in codfw has raft consensus issues.

I might have a good candidate for what is causing it: mdadm checkarray

Sun, Sep 3, 9:58 AM · Patch-For-Review, User-Joe, Operations

Sat, Sep 2

Volans moved T169555: puppet.service systemctl failures after reimage from Backlog to In Progress on the Operations-Software-Development board.
Sat, Sep 2, 4:53 PM · Operations-Software-Development
Volans moved T149230: wmf-auto-reimage: allow to specify the conftool state from Backlog to In Progress on the Operations-Software-Development board.
Sat, Sep 2, 4:40 PM · Operations-Software-Development
Volans moved T148817: wmf-auto-reimage: remove dependency on wmf-reimage from Backlog to In Progress on the Operations-Software-Development board.
Sat, Sep 2, 4:39 PM · Operations-Software-Development
Volans moved T174854: Icinga raid handler improvements from Backlog to In Progress on the Operations-Software-Development board.
Sat, Sep 2, 4:11 PM · Operations-Software-Development
Volans triaged T174857: Degraded RAID on db1059 as Normal priority.
Sat, Sep 2, 4:11 PM · DBA, ops-eqiad, Operations
Volans created T174854: Icinga raid handler improvements.
Sat, Sep 2, 2:33 PM · Operations-Software-Development

Wed, Aug 30

Volans updated subscribers of T164341: Decommission old memcached hosts - mc1001->mc1018.

@elukey, @Joe, @Cmjohnson: to test the migration of the reimage script from Salt to Cumin, could I grab mc100[1-2] in the next few days as test hosts?

Wed, Aug 30, 4:30 PM · Patch-For-Review, User-Elukey, Operations, ops-eqiad
Volans added a project to T174534: Degraded RAID on ms-be2024: media-storage.
Wed, Aug 30, 8:02 AM · media-storage, Operations, ops-codfw

Mon, Aug 28

Volans updated subscribers of T166965: Degraded RAID on lvs3001.
Mon, Aug 28, 9:21 PM · Traffic, ops-esams, Operations
Volans updated subscribers of T168619: Degraded RAID on lvs3001.
Mon, Aug 28, 9:21 PM · ops-esams, Operations
Volans added a project to T174392: Disk errors: restbase1010.eqiad.wmnet: ops-eqiad.
Mon, Aug 28, 9:03 PM · Services (watching), Operations
Volans added a comment to T174392: Disk errors: restbase1010.eqiad.wmnet.

Adding ops-eqiad; it looks like we'll probably end up replacing the disk.

Mon, Aug 28, 9:02 PM · Services (watching), Operations
Volans updated subscribers of T169035: bast3002 sdb broken.
Mon, Aug 28, 4:31 PM · Operations, ops-esams
Volans moved T166300: Remove Salt from wmf-auto-reimage / wmf-reimage from Backlog to In Progress on the Operations-Software-Development board.
Mon, Aug 28, 7:56 AM · Patch-For-Review, Technical-Debt, Operations-Software-Development, Operations
Volans moved T174008: Cumin: setup.py installs data_files in wrong directory from In Progress to In Code Review on the Operations-Software-Development board.
Mon, Aug 28, 7:56 AM · Operations-Software-Development
Volans moved T174008: Cumin: setup.py installs data_files in wrong directory from Backlog to In Progress on the Operations-Software-Development board.
Mon, Aug 28, 7:56 AM · Operations-Software-Development

Fri, Aug 25

Volans added a comment to T170598: Extending our HSTS value beyond ~1y.

+1 from me, I see this almost as a noop. Over 2y, it is more likely that the user changes the physical device (in particular if mobile) than that HSTS expires 😉

Fri, Aug 25, 3:57 PM · Operations, Traffic
Volans added a comment to T173315: Review check_ping settings.

The check_ping on our Icinga hosts doesn't seem to have an option equivalent to the -i option of the ping command. Reducing from 5 to 3 packets halves the time to 2s per check, but I'm not sure it's worth it given the increased risk of false positives (although with 3 packets the risk should be low enough inside our prod network).
I'm also curious about what the current issue/bottleneck is.

Fri, Aug 25, 2:30 PM · Operations, monitoring
Volans added a comment to T173427: Review check_puppetrun frequency.

An alternative option could be to make this check passive, with a freshness threshold of around 35m, with the data pushed directly by run-puppet-agent/puppet-run after each run. If the check is stale (no data received by Icinga within the threshold), then an active check can be performed automatically, allowing us (I think) to keep the current warning/critical logic for the last puppet run.
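
The passive-check-with-fallback idea can be sketched roughly as below. This is only an illustration of the logic, not how Icinga implements freshness checking; the function names, the `run_active_check` callable and the exact 35-minute value are assumptions made for the example.

```python
import time

# Assumed threshold from the comment above: roughly 35 minutes, in seconds.
FRESHNESS_THRESHOLD = 35 * 60


def is_stale(last_result_ts, now=None, threshold=FRESHNESS_THRESHOLD):
    """Return True if no passive result arrived within the threshold."""
    if now is None:
        now = time.time()
    return (now - last_result_ts) > threshold


def evaluate(last_result_ts, run_active_check, now=None):
    """Fall back to an active check only when the passive data is stale.

    run_active_check is a hypothetical callable standing in for the
    existing active check; it is invoked only on stale data.
    """
    if is_stale(last_result_ts, now=now):
        return run_active_check()
    return "fresh"
```

With this shape, hosts that push results after every puppet run never trigger the active check, so the active load would be limited to hosts that stopped reporting.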

Fri, Aug 25, 2:25 PM · Operations, monitoring
Volans added a comment to T109903: add pdu redundancy checking to server/router/switch checks in icinga.

I agree. The only drawback I see in bundling them together is that we couldn't use stalking to tell them apart, given that the temperature will change on each check.

Fri, Aug 25, 9:37 AM · Patch-For-Review, Operations, monitoring
Volans added a comment to T168613: Broken disk on mw1228.

What is the status of this server? I can see it all red in Icinga; trying to SSH gives the key-changed warning, but Puppet is not aware of the new key.

Fri, Aug 25, 8:25 AM · Operations, ops-eqiad

Thu, Aug 24

Volans added a comment to T173999: CI job debian-glue-non-voting: add support for BACKPORTS=yes.

Thanks @thcipriani for the answers. With my limited knowledge of the zuul-jenkins relationship and (in)direct variable settings, it seems to me a fairly normal requirement to be able to configure a repository to run a CI job and set some parameters for it.

Thu, Aug 24, 9:45 PM · Release-Engineering-Team (Kanban), Patch-For-Review, Continuous-Integration-Config
Volans added a comment to T86552: Monitor and alarm on SMART attributes.

The above should not be needed on megaraid hosts where smartctl --scan-open works well AFAICT (see on ms-be2014).

Thu, Aug 24, 11:59 AM · Patch-For-Review, User-fgiunchedi, Operations, monitoring
Volans created T174008: Cumin: setup.py installs data_files in wrong directory.
Thu, Aug 24, 9:20 AM · Operations-Software-Development
Volans created T173999: CI job debian-glue-non-voting: add support for BACKPORTS=yes.
Thu, Aug 24, 7:41 AM · Release-Engineering-Team (Kanban), Patch-For-Review, Continuous-Integration-Config

Wed, Aug 23

Volans created P5907 ack_handler.py.
Wed, Aug 23, 4:41 PM

Tue, Aug 22

Volans added a comment to P5900 (An Untitled Masterwork).

This works but is super ugly:

Tue, Aug 22, 11:36 AM
Volans added a comment to T171167: Evaluate LibreNMS' Graphite backend.

At first sight, it might just be that the update frequency of the data and the smallest retention period set in Graphite do not match each other, with the retention period being much shorter than the interval between updates.
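
To illustrate the suspected mismatch, here is a small sketch assuming "smallest retention period" means the time step of the finest Graphite archive. The function name and the numbers in the usage note are hypothetical, not taken from the actual LibreNMS or Graphite configuration.

```python
def filled_fraction(update_interval_s, step_s):
    """Fraction of slots in the finest archive that receive a datapoint.

    update_interval_s: seconds between two updates sent by the poller.
    step_s: seconds per slot in the finest retention archive.
    """
    if update_interval_s <= 0 or step_s <= 0:
        raise ValueError("intervals must be positive")
    # When updates arrive at least once per step, every slot is filled.
    return min(1.0, step_s / update_interval_s)
```

For example, `filled_fraction(300, 60)` gives 0.2: with data arriving every 5 minutes into 1-minute slots, four out of five slots would stay empty, which typically shows up as gaps in the graphs.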

Tue, Aug 22, 10:43 AM · Patch-For-Review, User-fgiunchedi, netops, monitoring, Operations
Volans added a comment to T151632: Fix Icinga checks for test/decom servers.

I think so too, but it might need some parameter or hiera value to define those as "provisioning", given that they will already have the production MariaDB role but will not be fully provisioned. So if I'm understanding it correctly, yes, it's possible, but it will require an additional commit to remove the "provisioning" param/hiera once the provisioning is completed.

Tue, Aug 22, 9:23 AM · Patch-For-Review, monitoring, Operations
Volans created T173806: Icinga: evaluate stalking options for some checks.
Tue, Aug 22, 9:09 AM · monitoring
Volans added a comment to T151632: Fix Icinga checks for test/decom servers.

My answers to the above questions are: YES, YES, YES (but I'd like them to be separated in the UI, unfortunately this is not possible in Icinga), NO

Tue, Aug 22, 9:06 AM · Patch-For-Review, monitoring, Operations
Volans triaged T172809: Degraded RAID on analytics1055 as Normal priority.
Tue, Aug 22, 8:01 AM · Analytics-Kanban, ops-eqiad, Operations
Volans triaged T173679: Degraded RAID on logstash1006 as Normal priority.
Tue, Aug 22, 8:00 AM · Discovery-Search (Current work), ops-eqiad, Operations

Aug 3 2017

Volans added a comment to T170353: Icinga: timeseries checks should have the link to a graph with the data.

I've started working on this. I hoped to finish it today, but the list of checks is long. I will complete it when I'm back.

Aug 3 2017, 1:56 PM · Operations, monitoring

Aug 2 2017

Volans closed T171679: Cumin: allow to ignore exit codes of executed commands as Resolved.
Aug 2 2017, 1:36 PM · Operations-Software-Development
Volans closed T170394: Cumin: add multi-query support as Resolved.
Aug 2 2017, 1:35 PM · Operations-Software-Development
Volans triaged T161545: Cumin: PuppetDB backend, allow to specify boolean values for resource parameters as Low priority.
Aug 2 2017, 10:02 AM · Operations-Software-Development
Volans triaged T170394: Cumin: add multi-query support as High priority.
Aug 2 2017, 10:02 AM · Operations-Software-Development
Volans triaged T169304: Cumin masters: simplify usage in case of emergency as Normal priority.
Aug 2 2017, 10:01 AM · Patch-For-Review, Operations-Software-Development
Volans triaged T171684: Cumin: improve target management as Normal priority.
Aug 2 2017, 10:01 AM · Operations-Software-Development
Volans triaged T171679: Cumin: allow to ignore exit codes of executed commands as Normal priority.
Aug 2 2017, 10:01 AM · Operations-Software-Development

Jul 31 2017

Volans merged T172062: Degraded RAID on ms-be1017 into T171926: Degraded RAID on ms-be1017.
Jul 31 2017, 7:32 AM · ops-eqiad, Operations
Volans merged T172054: Degraded RAID on ms-be1017 into T171926: Degraded RAID on ms-be1017.
Jul 31 2017, 7:31 AM · ops-eqiad, Operations
Volans merged T172051: Degraded RAID on ms-be1017 into T171926: Degraded RAID on ms-be1017.
Jul 31 2017, 7:31 AM · ops-eqiad, Operations
Volans added a comment to T134893: Unhandled pybal error causing services to be depooled in etcd but not in lvs.

The added PyBal IPVS diff check is flapping a bit with UNKNOWN for some hosts (lvs100[3,6,9], lvs200[3,6]) with message:

HTTPConnectionPool(host='localhost', port=9090): Read timed out. (read timeout=1.0)
$ grep -c "PyBal IPVS diff check" icinga.log
34

When specifying the timeout in Requests you can use a tuple to put different values for connect and read timeouts. My guess is that sometimes on those hosts PyBal is not able to reply within the 1s timeout and we might need a larger one.
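
As a minimal sketch of the tuple form: Requests accepts `timeout=(connect, read)`. The helper name `fetch`, the 3.0s read timeout and the injectable `http_get` parameter are assumptions made for illustration; they are not taken from the actual check.

```python
def fetch(url, connect_timeout=1.0, read_timeout=3.0, http_get=None):
    """GET url with separate connect and read timeouts.

    http_get defaults to requests.get; it is injectable (a hypothetical
    convenience) so the sketch can be exercised without a network.
    """
    if http_get is None:
        import requests  # third-party library, imported lazily
        http_get = requests.get
    # Requests interprets a 2-tuple as (connect timeout, read timeout).
    return http_get(url, timeout=(connect_timeout, read_timeout))
```

Keeping the connect timeout short while raising only the read timeout would still fail fast when PyBal is down, but tolerate a slow reply from a busy instance.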

Jul 31 2017, 7:09 AM · Patch-For-Review, Operations-Software-Development, Pybal, Traffic, Operations

Jul 26 2017

Volans updated subscribers of T170740: PuppetDB misbehaving on 2017-07-15.

We had a small hiccup today in which PuppetDB responded with 503s 28 times between 16:20:13 and 16:20:39 UTC. Of those, 17 were POSTs to update the hosts' facts, and we had a bit of failure spam on IRC. It recovered by itself.

Jul 26 2017, 5:06 PM · Patch-For-Review, Puppet, Operations
Volans triaged T171723: Degraded RAID on db1068 as High priority.

This is s4 master.

Jul 26 2017, 11:10 AM · DBA, ops-eqiad, Operations

Jul 25 2017

Volans moved T171684: Cumin: improve target management from In Progress to In Code Review on the Operations-Software-Development board.
Jul 25 2017, 11:22 PM · Operations-Software-Development
Volans moved T171684: Cumin: improve target management from Backlog to In Progress on the Operations-Software-Development board.
Jul 25 2017, 11:12 PM · Operations-Software-Development
Volans created T171684: Cumin: improve target management.
Jul 25 2017, 11:11 PM · Operations-Software-Development
Volans moved T171679: Cumin: allow to ignore exit codes of executed commands from In Progress to In Code Review on the Operations-Software-Development board.
Jul 25 2017, 11:03 PM · Operations-Software-Development
Volans moved T171679: Cumin: allow to ignore exit codes of executed commands from Backlog to In Progress on the Operations-Software-Development board.
Jul 25 2017, 10:28 PM · Operations-Software-Development
Volans created T171679: Cumin: allow to ignore exit codes of executed commands.
Jul 25 2017, 10:28 PM · Operations-Software-Development

Jul 24 2017

Volans added a comment to T168142: Cleanup phabricator.wikimedia.org uploaded files, WP zero abuse.

I've disabled (if not already) and removed files for the following users:

Soufianehamouda
Houssamista
Marama12
Oussama177
Jul 24 2017, 8:42 AM · Wikimedia-Incident, Wikimedia-Site-requests, Phabricator

Jul 21 2017

Volans moved T170394: Cumin: add multi-query support from In Progress to In Code Review on the Operations-Software-Development board.
Jul 21 2017, 8:22 AM · Operations-Software-Development