Volans (Riccardo Coccioli)
Operations Software Engineer

User Details

User Since
Feb 10 2016, 11:25 AM (122 w, 5 d)
Availability
Available
IRC Nick
volans
LDAP User
Volans
MediaWiki User
RCoccioli (WMF)

Recent Activity

Fri, Jun 15

Volans created T197458: Cumin: add option when --batch=1 to skip deduplication.
Fri, Jun 15, 10:36 AM · Operations-Software-Development
Volans added a watcher for monitoring: Volans.
Fri, Jun 15, 9:53 AM
Volans removed a project from T184562: Upgrade Puppet Master Infrastructure to Debian Stretch: Security.
Fri, Jun 15, 9:07 AM · User-fgiunchedi, Patch-For-Review, Puppet, Operations
Volans assigned T184562: Upgrade Puppet Master Infrastructure to Debian Stretch to fgiunchedi.
Fri, Jun 15, 9:06 AM · User-fgiunchedi, Patch-For-Review, Puppet, Operations
Volans claimed T184563: Investigate landscape of PuppetDB Frontends and Provision One.
Fri, Jun 15, 8:52 AM · Patch-For-Review, Operations, Puppet
Volans changed the visibility for T184564: Plan Puppet 5 upgrade.
Fri, Jun 15, 8:46 AM · Security, Puppet, Operations
Volans removed a project from T184456: Exclude 'admin-monitoring' and 'contintcloud' projects from cumin openstack queries: Security.
Fri, Jun 15, 8:45 AM · Security, Operations-Software-Development
Volans changed the visibility for T184561: Modernize Puppet Configuration Management (2017-18 Q3 Goal).
Fri, Jun 15, 8:45 AM · Security, Goal, Puppet, Operations
Volans updated subscribers of T184456: Exclude 'admin-monitoring' and 'contintcloud' projects from cumin openstack queries.
Fri, Jun 15, 8:44 AM · Security, Operations-Software-Development
Volans changed the visibility for T184456: Exclude 'admin-monitoring' and 'contintcloud' projects from cumin openstack queries.
Fri, Jun 15, 8:44 AM · Security, Operations-Software-Development
Volans assigned T184444: Puppet hosts with their cert revoked can still run puppet to herron.
Fri, Jun 15, 8:35 AM · Security, Patch-For-Review, Puppet, Operations
Volans changed the visibility for T184444: Puppet hosts with their cert revoked can still run puppet.
Fri, Jun 15, 8:32 AM · Security, Patch-For-Review, Puppet, Operations
Volans removed a project from T184435: Puppet tox: properly lint both Py2 and Py3 files: Security.
Fri, Jun 15, 8:31 AM · Security, Operations-Software-Development, Continuous-Integration-Config, Operations
Volans changed the visibility for T184435: Puppet tox: properly lint both Py2 and Py3 files.
Fri, Jun 15, 8:30 AM · Security, Operations-Software-Development, Continuous-Integration-Config, Operations
Volans assigned T184796: Configure puppetdb to export metrics via Prometheus JMX Agent to elukey.
Fri, Jun 15, 8:29 AM · User-Elukey, Patch-For-Review, monitoring, Operations
Volans removed a project from T184337: ModuleDeprecationWrapper doesn't show a deprecation warning as expected: Security.
Fri, Jun 15, 8:24 AM · Patch-For-Review, Pywikibot-core
Volans changed the visibility for T184337: ModuleDeprecationWrapper doesn't show a deprecation warning as expected.
Fri, Jun 15, 8:23 AM · Patch-For-Review, Pywikibot-core
Volans assigned T184337: ModuleDeprecationWrapper doesn't show a deprecation warning as expected to Dalba.
Fri, Jun 15, 8:23 AM · Patch-For-Review, Pywikibot-core

Thu, Jun 14

Volans added a comment to T191300: Debmonitor: deploy the agent across the fleet.

Thanks a lot @elukey!

Thu, Jun 14, 9:31 AM · Patch-For-Review, Operations-Software-Development, Operations
Volans added a comment to T197126: Create tool to handle the state of database configuration in MediaWiki in etcd.

@Joe ack to all your replies, thanks for integrating the suggestions!

Thu, Jun 14, 8:28 AM · User-Joe, MediaWiki-Configuration, Operations, DBA

Wed, Jun 13

Volans added a comment to T197126: Create tool to handle the state of database configuration in MediaWiki in etcd.

Quick first feedback/questions on the proposal:

Wed, Jun 13, 6:02 PM · User-Joe, MediaWiki-Configuration, Operations, DBA
Volans moved T191300: Debmonitor: deploy the agent across the fleet from In Progress to In Code Review on the Operations-Software-Development board.
Wed, Jun 13, 7:26 AM · Patch-For-Review, Operations-Software-Development, Operations

Tue, Jun 12

Volans added a comment to T196336: Icinga passive checks go awal and downtime stops working.

Unfortunately this happened again today. Since I don't see any logs of spurious passive checks from other frack hosts, I guess we have to discard the hypothesis that those were the cause of the issue.

Tue, Jun 12, 8:49 PM · Icinga, monitoring

Mon, Jun 11

Gerrit Code Review <gerrit@wikimedia.org> committed rOSMDd7a0c92918ed: Update patch set 13 (authored by Volans).
Update patch set 13
Mon, Jun 11, 3:48 AM
Gerrit Code Review <gerrit@wikimedia.org> committed rOSMDe26021ab33b3: Update patch set 4 (authored by Volans).
Update patch set 4
Mon, Jun 11, 3:47 AM
Gerrit Code Review <gerrit@wikimedia.org> committed rOSMD6e84c52ae1db: Update patch set 3 (authored by Volans).
Update patch set 3
Mon, Jun 11, 3:47 AM
Gerrit Code Review <gerrit@wikimedia.org> committed rOSMD60f17e4452fa: Update patch set 1 (authored by Volans).
Update patch set 1
Mon, Jun 11, 3:46 AM
Gerrit Code Review <gerrit@wikimedia.org> committed rOSMD0a84e1064875: Update patch set 1 (authored by Volans).
Update patch set 1
Mon, Jun 11, 3:46 AM

Sat, Jun 9

Volans added a comment to T183234: Gerrit: autocomplete to add reviewers slow.

From a quick test, the slowest one-letter search took ~1s and was for less common letters like z or q. As of now I cannot reproduce the issue; feel free to resolve the task if you think the new version has solved it too. It can be re-opened in case we find a repro.

Sat, Jun 9, 2:46 PM · Gerrit

Fri, Jun 8

Volans moved T191299: Debmonitor: deploy the service in production from In Progress to In Code Review on the Operations-Software-Development board.
Fri, Jun 8, 8:08 PM · Patch-For-Review, Operations-Software-Development, Operations
Volans moved T167504: New tool to track package updates/status for hosts and images (debmonitor) from In Progress to In Code Review on the Operations-Software-Development board.
Fri, Jun 8, 8:08 PM · Patch-For-Review, Operations-Software-Development, Operations
Volans moved T191300: Debmonitor: deploy the agent across the fleet from Backlog to In Progress on the Operations-Software-Development board.
Fri, Jun 8, 8:08 PM · Patch-For-Review, Operations-Software-Development, Operations

Thu, Jun 7

Volans created T196628: CI: upgrade tox, currently running 2.6.0.
Thu, Jun 7, 11:36 AM · Continuous-Integration-Infrastructure

Tue, Jun 5

Volans added a comment to T195569: Degraded RAID on ms-be1034.

@Cmjohnson which disk is a tricky question in this case.

Tue, Jun 5, 5:40 PM · ops-eqiad, Operations
Volans updated subscribers of T196336: Icinga passive checks go awal and downtime stops working.

While investigating the possible root causes for this, I discovered that some new frack hosts, installed just last week, were sending metrics although not yet fully configured. In particular, they are not present in the Icinga host list, so Icinga discards those messages, which in theory shouldn't do any harm. But to avoid having too many variables in play, I asked @Jgreen whether it was possible to stop sending the metrics entirely until the hosts are fully configured; he very kindly agreed and has already implemented it.

Tue, Jun 5, 7:58 AM · Icinga, monitoring

Thu, May 31

Volans updated the task description for T196046: Scap required manual 'git update-server-info' on first run.
Thu, May 31, 9:05 AM · Scap
Volans created T196046: Scap required manual 'git update-server-info' on first run.
Thu, May 31, 8:49 AM · Scap
Volans added a comment to T196045: elastic2018 not rebooting.

Having a look around in the system utility (ESC+9) I found that:

Thu, May 31, 8:43 AM · ops-codfw, Discovery-Search (Current work), DC-Ops, Operations
Volans merged task T196014: Degraded RAID on labvirt1019 into T194907: Degraded RAID on labvirt1019.
Thu, May 31, 8:15 AM · ops-eqiad, Operations
Volans merged T196014: Degraded RAID on labvirt1019 into T194907: Degraded RAID on labvirt1019.
Thu, May 31, 8:15 AM · cloud-services-team, ops-eqiad, Operations

Tue, May 29

Volans added a comment to T192875: Debmonitor: request for misc DB allocation.

@jynus got it, thanks for the info. FYI if you want to test your workaround solution, there is another DB missing: frimpressions. I didn't re-create it though, as I have no context on it. I would have told you tomorrow ;)

Tue, May 29, 8:45 PM · Patch-For-Review, DBA, Operations-Software-Development
Volans added a comment to T192875: Debmonitor: request for misc DB allocation.

@jcrespo FYI I was deploying debmonitor today and the replication broke on db1065 and db1117 because of missing debmonitor database.

Tue, May 29, 8:19 PM · Patch-For-Review, DBA, Operations-Software-Development
Volans updated subscribers of T195569: Degraded RAID on ms-be1034.

I now see a SAL entry from @akosiaris:
11:18 akosiaris: powercycling ms-be1034, box is unresposive, tons of logs "sd 0:1:0:1: rejecting I/O to offline device"

Tue, May 29, 4:52 PM · ops-eqiad, Operations
Volans added a comment to T195569: Degraded RAID on ms-be1034.

Actually it seems that this already recovered: OK: Active: 4, Working: 4, Failed: 0, Spare: 0

Tue, May 29, 4:40 PM · ops-eqiad, Operations
Volans updated subscribers of T195569: Degraded RAID on ms-be1034.
Tue, May 29, 4:38 PM · ops-eqiad, Operations
Volans updated subscribers of T194907: Degraded RAID on labvirt1019.

@Cmjohnson is this controller really missing the battery, or is it a software problem and the battery is just not recognized?

Tue, May 29, 4:32 PM · cloud-services-team, ops-eqiad, Operations
Volans merged T195862: Degraded RAID on labvirt1019 into T194907: Degraded RAID on labvirt1019.
Tue, May 29, 4:22 PM · cloud-services-team, ops-eqiad, Operations
Volans merged task T195862: Degraded RAID on labvirt1019 into T194907: Degraded RAID on labvirt1019.
Tue, May 29, 4:22 PM · ops-eqiad, Operations
Volans closed T173050: Investigate icinga (einsteinium) load as Resolved.

Ack, I propose to leave it as is for now and re-evaluate once Filippo is also back. Resolving for now, feel free to re-open.

Tue, May 29, 9:25 AM · Patch-For-Review, monitoring

Thu, May 24

Volans added a comment to T194907: Degraded RAID on labvirt1019.

Forgot to mention that the above message and output was taken on labvirt1020 as I cannot ssh to 1019 right now.

Thu, May 24, 5:39 PM · cloud-services-team, ops-eqiad, Operations
Volans added a project to T194907: Degraded RAID on labvirt1019: cloud-services-team.

I've double-checked both the report script that populates this task and the Icinga check script that raised the alarm. The issue here seems to be that the controller in Slot 1 (the P840 actually used) doesn't have/recognize the battery, hence the CRITICAL:

Thu, May 24, 5:38 PM · cloud-services-team, ops-eqiad, Operations
Volans placed T193394: Degraded RAID on wasat up for grabs.

I just discovered that this host is planned for reimage in the next few days. I'm not bothering to fix the md array, as the host is not seeing the replaced disk and might need a reboot anyway; going directly for the reimage at this point.

Thu, May 24, 5:19 PM · Operations, ops-codfw
Volans claimed T193394: Degraded RAID on wasat.
Thu, May 24, 5:08 PM · Operations, ops-codfw
Volans merged task T195339: Degraded RAID on wasat into T193394: Degraded RAID on wasat.
Thu, May 24, 5:03 PM · Operations, ops-codfw
Volans merged T195339: Degraded RAID on wasat into T193394: Degraded RAID on wasat.
Thu, May 24, 5:03 PM · Operations, ops-codfw
Volans updated subscribers of T195306: Degraded RAID on elastic2020.

It looks to me that the battery is broken/not recognized.

Thu, May 24, 4:58 PM · Operations, ops-codfw
Volans merged task T195501: Degraded RAID on bast3002 into T183814: Degraded RAID on bast3002.
Thu, May 24, 4:54 PM · ops-esams, Operations
Volans merged T195501: Degraded RAID on bast3002 into T183814: Degraded RAID on bast3002.
Thu, May 24, 4:54 PM · ops-esams, Operations
Volans added a comment to T192875: Debmonitor: request for misc DB allocation.

@jcrespo naos has been reimaged to deploy2001.codfw.wmnet, so I guess it can now be added to the grants. Mentioning it here just so it's not forgotten; there is absolutely no hurry to do it.

Thu, May 24, 2:00 PM · Patch-For-Review, DBA, Operations-Software-Development
Volans added a comment to T195423: Reduce false positive icinga alerts during host reimages.

The proposed approach doesn't take into account hosts installed for the first time. Detecting a newly added host in the Icinga configuration is not trivial at all, and the same goes for disabling notifications, which as of now requires a commit to hiera. Unless that part is moved to a more dynamic storage, I don't see an easy fix going down that path.

Thu, May 24, 10:14 AM · Patch-For-Review, monitoring, Operations

Wed, May 23

Volans added a comment to T173050: Investigate icinga (einsteinium) load.

@akosiaris the current EDAC check is sum(increase($metric[4d])), so it is checking the increase over the last 4 days; I'd say it is not time-sensitive at all.

Wed, May 23, 8:18 PM · Patch-For-Review, monitoring

Tue, May 22

Volans added a comment to T173050: Investigate icinga (einsteinium) load.

The CPU usage is already back to 40%, we can decide tomorrow if we want to increase the check_interval further.

Tue, May 22, 7:47 PM · Patch-For-Review, monitoring
Volans added a comment to T193470: Mapframe maps with an image in the GeoJSON don't display on mobile website or mobile apps .

To keep everyone in the loop, I chatted with @Catrope the other day about this and we debugged it a bit together.

Tue, May 22, 9:33 AM · Collaboration-Team-Triage (Collab-Team-This-Quarter), Readers-Web-Backlog (Tracking), Discovery, Mobile, Wikipedia-Android-App-Backlog, Wikipedia-iOS-App-Backlog, Maps
Volans added a comment to T173050: Investigate icinga (einsteinium) load.

For reference, last month's CPU trend with the two clear increases:
https://grafana.wikimedia.org/dashboard/db/prometheus-machine-stats?orgId=1&var-server=einsteinium&var-datasource=eqiad%20prometheus%2Fops&from=1524243342926&to=1526128864600&refresh=1m&panelId=3&fullscreen

Tue, May 22, 9:19 AM · Patch-For-Review, monitoring
Volans updated subscribers of T173050: Investigate icinga (einsteinium) load.

Thanks @ArielGlenn for re-opening this. From a quick look we had two big increases, one on May 2nd and one on May 8th. I think they are related to these two changes, each of which basically adds a check for every host:

Tue, May 22, 8:21 AM · Patch-For-Review, monitoring

May 18 2018

Volans updated subscribers of T193470: Mapframe maps with an image in the GeoJSON don't display on mobile website or mobile apps .
May 18 2018, 10:07 PM · Collaboration-Team-Triage (Collab-Team-This-Quarter), Readers-Web-Backlog (Tracking), Discovery, Mobile, Wikipedia-Android-App-Backlog, Wikipedia-iOS-App-Backlog, Maps
Volans merged T194851: Degraded RAID on labvirt1019 into T194907: Degraded RAID on labvirt1019.
May 18 2018, 9:43 AM · cloud-services-team, ops-eqiad, Operations
Volans merged task T194851: Degraded RAID on labvirt1019 into T194907: Degraded RAID on labvirt1019.
May 18 2018, 9:43 AM · ops-eqiad, Operations
Volans added a comment to T194907: Degraded RAID on labvirt1019.

@Dzahn That usually happens if the alarm flaps on Icinga for some reason: the handler opens a new task for each CRITICAL/HARD state triggered by Icinga.

May 18 2018, 9:42 AM · cloud-services-team, ops-eqiad, Operations

May 14 2018

Volans added a comment to T187962: Rack/cable/configure asw2-c-eqiad switch stack.

Yeah, puppetdb1001 will probably just generate some spam on IRC for failing puppet runs, transient.

May 14 2018, 5:00 PM · Patch-For-Review, Operations, ops-eqiad, netops

May 3 2018

Volans added a comment to T192875: Debmonitor: request for misc DB allocation.

Great! Thanks a lot.

May 3 2018, 9:24 AM · Patch-For-Review, DBA, Operations-Software-Development
Volans moved T191299: Debmonitor: deploy the service in production from Backlog to In Progress on the Operations-Software-Development board.
May 3 2018, 8:47 AM · Patch-For-Review, Operations-Software-Development, Operations

May 1 2018

Volans created T193470: Mapframe maps with an image in the GeoJSON don't display on mobile website or mobile apps .
May 1 2018, 9:04 AM · Collaboration-Team-Triage (Collab-Team-This-Quarter), Readers-Web-Backlog (Tracking), Discovery, Mobile, Wikipedia-Android-App-Backlog, Wikipedia-iOS-App-Backlog, Maps

Apr 30 2018

Volans added a comment to T178690: Better organization for ops grafana dashboards.

As discussed in the monitoring meeting, here is some feedback:

Apr 30 2018, 5:05 PM · User-fgiunchedi, monitoring, Operations
Volans added a comment to T193226: Test MySQL 8.0 with production data and evaluate its fit for WMF databases.

@jcrespo ack, no blocker for me, I'm actually not using it.

Apr 30 2018, 5:01 PM · Patch-For-Review, DBA
Volans added a comment to T192875: Debmonitor: request for misc DB allocation.

@Volans you can speed up the process by setting some password on the private repository (and some non-private equivalent in the labs/private one), and suggesting a charset/collation for the database (utf8mb4?). Name debmonitor?

Apr 30 2018, 2:55 PM · Patch-For-Review, DBA, Operations-Software-Development
Volans added a comment to T191299: Debmonitor: deploy the service in production.

Set up DNS, DHCP and netboot, and created 2 VMs on Ganeti: debmonitor[12]001.

Apr 30 2018, 1:52 PM · Patch-For-Review, Operations-Software-Development, Operations
Volans updated subscribers of T193394: Degraded RAID on wasat.
Apr 30 2018, 1:21 PM · Operations, ops-codfw
Volans added a comment to T192771: mcrouter production architecture.

I have an additional question: what is the expected behaviour in the following failure scenarios for each option?

Apr 30 2018, 9:51 AM · User-Joe, Patch-For-Review, Performance-Team (Radar), Availability (MediaWiki-MultiDC), Operations

Apr 29 2018

Volans added a comment to T193331: db1098 crashed and got rebooted.

I've downtimed db1098 on Icinga until Wed mid EU day and disabled notifications.

Apr 29 2018, 12:09 AM · Patch-For-Review, ops-eqiad, DBA, Operations

Apr 28 2018

Volans triaged T193331: db1098 crashed and got rebooted as High priority.
Apr 28 2018, 11:57 PM · Patch-For-Review, ops-eqiad, DBA, Operations
Volans created T193331: db1098 crashed and got rebooted.
Apr 28 2018, 11:55 PM · Patch-For-Review, ops-eqiad, DBA, Operations

Apr 26 2018

Volans triaged T193160: Monitor the BIOS boot order and parameters as Normal priority.
Apr 26 2018, 12:38 PM · monitoring, Operations
Volans triaged T193155: IPMI Audit 2018-04 as Normal priority.
Apr 26 2018, 11:22 AM · Operations
Volans created T193155: IPMI Audit 2018-04.
Apr 26 2018, 11:22 AM · Operations

Apr 24 2018

Volans added a comment to T192875: Debmonitor: request for misc DB allocation.

From where will you be querying this DB? (just to see which (new) grants you might need)

Apr 24 2018, 3:25 PM · Patch-For-Review, DBA, Operations-Software-Development
Volans added a comment to T192875: Debmonitor: request for misc DB allocation.

Sorry, I didn't mention the multi-DC setup :)

Apr 24 2018, 9:47 AM · Patch-For-Review, DBA, Operations-Software-Development
Volans triaged T192875: Debmonitor: request for misc DB allocation as Normal priority.
Apr 24 2018, 9:11 AM · Patch-For-Review, DBA, Operations-Software-Development

Apr 23 2018

Volans added a comment to T192551: atop on stretch overloading a host.

Personally I've never used it, +1 to drop it.

Apr 23 2018, 7:55 AM · Upstream, Patch-For-Review, monitoring, Operations

Apr 19 2018

Volans triaged T192547: Improve remote IPMI monitoring as Normal priority.
Apr 19 2018, 12:49 PM · monitoring, Operations
Volans created T192547: Improve remote IPMI monitoring.
Apr 19 2018, 12:49 PM · monitoring, Operations
Volans closed T162857: Some Core availability Catchpoint tests might be more expensive than they need to be as Resolved.

To summarize the work done recently: I've audited the existing checks and fixed/improved some of them that had clear errors or needed to be updated. @chasemp has very kindly offered to review the WMCS-related checks, users and groups.

Apr 19 2018, 9:10 AM · monitoring, Patch-For-Review, Operations

Apr 18 2018

Volans added a project to T191393: Puppet: tlsproxy localssl default_server make a Notify at each run: Traffic.

@Joe no it would not be super easy to solve in a DRY way, I agree.

Apr 18 2018, 11:19 AM · Traffic, Operations, Puppet

Apr 12 2018

Volans closed T191977: remote ipmi doesn't work for es2013 as Resolved.

I've fixed it, it was a case of password misalignment, see one of the cases described in T150160,

Apr 12 2018, 4:46 PM · ops-codfw, DC-Ops, Patch-For-Review, Operations, DBA

Apr 10 2018

Volans added a comment to T191905: eqsin hosts don't allow remote ipmi.

Reporting it here too for the future: to fix it, it's sufficient to replace the --diff of the above command with --commit and then re-run the --diff to ensure that this time it shows no error.

Apr 10 2018, 7:02 PM · Traffic, Operations, ops-eqsin

Apr 9 2018

Volans claimed T167504: New tool to track package updates/status for hosts and images (debmonitor).
Apr 9 2018, 4:50 PM · Patch-For-Review, Operations-Software-Development, Operations
Volans added a comment to T188112: cumin 3.0.1-1 is broken on labs master.

Patch updated to overcome this problem, once reviewed and merged it should solve the issue.

Apr 9 2018, 8:29 AM · Patch-For-Review, Continuous-Integration-Infrastructure, Operations-Software-Development
Volans created T191764: CI: run tests with multiple Python3 versions.
Apr 9 2018, 8:18 AM · Continuous-Integration-Infrastructure

Apr 4 2018

Volans moved T167504: New tool to track package updates/status for hosts and images (debmonitor) from Backlog to In Progress on the Operations-Software-Development board.
Apr 4 2018, 10:34 AM · Patch-For-Review, Operations-Software-Development, Operations
Volans moved T191298: Release and deploy Debmonitor (patch management software) [Technology Goal 2017-18_Q4] from Backlog to In Progress on the Operations-Software-Development board.
Apr 4 2018, 10:34 AM · Patch-For-Review, Operations-Software-Development, Operations, Goal