Volans (Riccardo Coccioli)
Operations Software Engineer

User Details

User Since
Feb 10 2016, 11:25 AM (114 w, 6 d)
Availability
Available
IRC Nick
volans
LDAP User
Volans
MediaWiki User
RCoccioli (WMF)

Recent Activity

Yesterday

Volans added a comment to T192875: Debmonitor: request for misc DB allocation.

From where will you be querying this DB? (Just to see which (new) grants you might need.)

Tue, Apr 24, 3:25 PM · DBA, Operations-Software-Development
Volans added a comment to T192875: Debmonitor: request for misc DB allocation.

Sorry, I didn't mention the multi-DC setup :)

Tue, Apr 24, 9:47 AM · DBA, Operations-Software-Development
Volans triaged T192875: Debmonitor: request for misc DB allocation as Normal priority.
Tue, Apr 24, 9:11 AM · DBA, Operations-Software-Development

Mon, Apr 23

Volans added a comment to T192551: atop on stretch overloading a host.

Personally never used, +1 to drop it.

Mon, Apr 23, 7:55 AM · Upstream, Patch-For-Review, monitoring, Operations

Thu, Apr 19

Volans triaged T192547: Improve remote IPMI monitoring as Normal priority.
Thu, Apr 19, 12:49 PM · monitoring, Operations
Volans created T192547: Improve remote IPMI monitoring.
Thu, Apr 19, 12:49 PM · monitoring, Operations
Volans closed T162857: Some Core availability Catchpoint tests might be more expensive than they need to be as Resolved.

To summarize the work done recently, I've audited the existing checks and fixed or improved those that had clear errors or needed updating. @chasemp has very kindly offered to review the WMCS-related checks, users and groups.

Thu, Apr 19, 9:10 AM · monitoring, Patch-For-Review, Operations

Wed, Apr 18

Volans added a project to T191393: Puppet: tlsproxy localssl default_server make a Notify at each run: Traffic.

@Joe no it would not be super easy to solve in a DRY way, I agree.

Wed, Apr 18, 11:19 AM · Traffic, Operations, Puppet

Thu, Apr 12

Volans closed T191977: remote ipmi doesn't work for es2013 as Resolved.

I've fixed it, it was a case of password misalignment, see one of the cases described in T150160,

Thu, Apr 12, 4:46 PM · ops-codfw, DC-Ops, Patch-For-Review, DBA, Operations

Tue, Apr 10

Volans added a comment to T191905: eqsin hosts don't allow remote ipmi.

Reporting it here too for future reference: to fix it, it's sufficient to replace the --diff of the above command with --commit and then re-run with --diff to ensure that this time it shows no error.

Tue, Apr 10, 7:02 PM · Traffic, Operations, ops-eqsin

Mon, Apr 9

Volans claimed T167504: New tool to track package updates/status for hosts and images (debmonitor).
Mon, Apr 9, 4:50 PM · Patch-For-Review, Continuous-Integration-Infrastructure (shipyard), Operations-Software-Development, Operations
Volans added a comment to T188112: cumin 3.0.1-1 is broken on labs master.

Patch updated to overcome this problem; once reviewed and merged, it should solve the issue.

Mon, Apr 9, 8:29 AM · Patch-For-Review, Continuous-Integration-Infrastructure, Operations-Software-Development
Volans created T191764: CI: run tests with multiple Python3 versions.
Mon, Apr 9, 8:18 AM · Continuous-Integration-Infrastructure

Wed, Apr 4

Volans moved T167504: New tool to track package updates/status for hosts and images (debmonitor) from Backlog to In Progress on the Operations-Software-Development board.
Wed, Apr 4, 10:34 AM · Patch-For-Review, Continuous-Integration-Infrastructure (shipyard), Operations-Software-Development, Operations
Volans moved T191298: Release and deploy Debmonitor (patch management software) [Technology Goal 2017-18_Q4] from Backlog to In Progress on the Operations-Software-Development board.
Wed, Apr 4, 10:34 AM · Operations-Software-Development, Operations, Goal
Volans closed T190918: Puppet: enable reports to puppetdb as Resolved.

Reports have been enabled for ~1 day without any incident. Resolving.

Wed, Apr 4, 10:04 AM · Patch-For-Review, Puppet, Operations
Volans closed T190918: Puppet: enable reports to puppetdb, a subtask of T184561: Modernize Puppet Configuration Management (2017-18 Q3 Goal), as Resolved.
Wed, Apr 4, 10:04 AM · Goal, Puppet, Operations
Volans triaged T191388: Puppet: tracking catalogs that changes at every run as Normal priority.
Wed, Apr 4, 10:02 AM · Tracking, Operations, Puppet
Volans triaged T191393: Puppet: tlsproxy localssl default_server make a Notify at each run as Normal priority.
Wed, Apr 4, 10:02 AM · Traffic, Operations, Puppet
Volans created T191388: Puppet: tracking catalogs that changes at every run.
Wed, Apr 4, 9:05 AM · Tracking, Operations, Puppet

Tue, Apr 3

Volans triaged T191298: Release and deploy Debmonitor (patch management software) [Technology Goal 2017-18_Q4] as Normal priority.
Tue, Apr 3, 2:19 PM · Operations-Software-Development, Operations, Goal
Volans updated the task description for T191299: Debmonitor: deploy the service in production.
Tue, Apr 3, 2:16 PM · Operations-Software-Development, Operations
Volans updated the task description for T191298: Release and deploy Debmonitor (patch management software) [Technology Goal 2017-18_Q4].
Tue, Apr 3, 2:16 PM · Operations-Software-Development, Operations, Goal
Volans renamed T191299: Debmonitor: deploy the service in production from Debmonitor: deploy it in production to Debmonitor: deploy the service in production.
Tue, Apr 3, 2:15 PM · Operations-Software-Development, Operations
Volans triaged T191300: Debmonitor: deploy the agent across the fleet as Normal priority.
Tue, Apr 3, 2:14 PM · Operations-Software-Development, Operations
Volans triaged T191299: Debmonitor: deploy the service in production as Normal priority.
Tue, Apr 3, 2:13 PM · Operations-Software-Development, Operations
Volans added a parent task for T167504: New tool to track package updates/status for hosts and images (debmonitor): T191298: Release and deploy Debmonitor (patch management software) [Technology Goal 2017-18_Q4].
Tue, Apr 3, 2:11 PM · Patch-For-Review, Continuous-Integration-Infrastructure (shipyard), Operations-Software-Development, Operations
Volans added a subtask for T191298: Release and deploy Debmonitor (patch management software) [Technology Goal 2017-18_Q4]: T167504: New tool to track package updates/status for hosts and images (debmonitor).
Tue, Apr 3, 2:11 PM · Operations-Software-Development, Operations, Goal
Volans created T191298: Release and deploy Debmonitor (patch management software) [Technology Goal 2017-18_Q4].
Tue, Apr 3, 2:10 PM · Operations-Software-Development, Operations, Goal

Sat, Mar 31

Volans added a comment to T191149: labsdb1009 crashed.

@jcrespo LMK if you'd like me to do anything about it during the weekend.

Sat, Mar 31, 10:59 PM · Patch-For-Review, Data-Services, DBA

Fri, Mar 30

Volans added a comment to T191129: msw-c6-codfw offline.

I've agreed with @RobH on IRC that this is not UBN for now for the DBA part.

Fri, Mar 30, 10:47 PM · DC-Ops, media-storage, DBA, Operations, ops-codfw
Volans created T191116: Wikimedia\Rdbms\Database::tableName: use of subqueries is not supported this way..
Fri, Mar 30, 4:15 PM · Security, MediaWiki-Special-pages, Wikimedia-log-errors, MediaWiki-Database, DBA
Volans added a comment to T190960: 1.31.0-wmf.27 rolled back due to increase in fatals: "Replication wait failed: lost connection to MySQL server during query".

For reference, this is the max replication lag across all eqiad DBs in that time frame (Mar. 28th, ~18:30-20:30), from which it seems pretty clear that there was no significant lag at all in that time frame.

Fri, Mar 30, 8:27 AM · Wikimedia-Incident, MW-1.31-release-notes (WMF-deploy-2018-03-27 (1.31.0-wmf.27)), User-notice, Patch-For-Review, DBA, Wikimedia-log-errors

Thu, Mar 29

Volans added a comment to T191020: labsdb1004: s51541_sulwatcher.logging is out of sync.

@MarcoAurelio thanks for checking in and for the additional info. At this point I think this is due to some older discrepancy between the two hosts for this table, because your query deleted only 14678 rows on the master while the slave has many more rows that meet this condition:

Thu, Mar 29, 7:35 PM · Tool-stewardbots, Tools
Volans closed T191001: labsdb1004: broken replication as Resolved.

Replication lag back to zero, no other errors, the two tables are different though. I've opened T191020 for tracking it, while resolving this one.

Thu, Mar 29, 1:09 PM · DBA
Volans created T191020: labsdb1004: s51541_sulwatcher.logging is out of sync.
Thu, Mar 29, 1:08 PM · Tool-stewardbots, Tools
Volans added a comment to T190918: Puppet: enable reports to puppetdb.

The plan as of now is to enable it on next Tuesday, to avoid issues in the long weekend.

Thu, Mar 29, 11:24 AM · Patch-For-Review, Puppet, Operations
Volans added a comment to T190918: Puppet: enable reports to puppetdb.

From the quick test I made yesterday, enabling reporting to puppetdb as well for a few minutes, I got ~200 hosts reported and showing data in Puppetboard. I didn't notice any significant load/RAM/disk usage increase on the puppetdb hosts.
Moreover our report-ttl parameter is set to 1d, so I don't expect this huge amount of data to be kept longer term.

Thu, Mar 29, 11:15 AM · Patch-For-Review, Puppet, Operations
Volans added a comment to T191001: labsdb1004: broken replication.

I've found the 20 rows that were missing on labsdb1004 in this delete, which was deleting 14677 rows, re-added them to labsdb1004 and restarted the replication.

Thu, Mar 29, 10:43 AM · DBA
Volans triaged T191001: labsdb1004: broken replication as High priority.
Thu, Mar 29, 9:09 AM · DBA
Volans created T191001: labsdb1004: broken replication.
Thu, Mar 29, 9:09 AM · DBA
Volans added a comment to T190425: GlobalPreferences deploy caused a significant increase in reads on s3.

@Niharika as you might know, our DBAs are out for the end of this week. Do you think this can wait until Monday? If not, let me know and I'll try to have a look, although I'm missing some context on what analysis was already done.

Thu, Mar 29, 8:24 AM · MW-1.32-release-notes (WMF-deploy-2018-04-24 (1.32.0-wmf.1)), MW-1.31-release-notes (WMF-deploy-2018-04-10 (1.31.0-wmf.29)), Patch-For-Review, Community-Tech-Sprint, MediaWiki-extensions-GlobalPreferences

Wed, Mar 28

Volans triaged T190918: Puppet: enable reports to puppetdb as Normal priority.
Wed, Mar 28, 10:58 AM · Patch-For-Review, Puppet, Operations

Mon, Mar 26

Volans updated the task description for T184561: Modernize Puppet Configuration Management (2017-18 Q3 Goal).
Mon, Mar 26, 6:36 PM · Goal, Puppet, Operations
Volans closed T184563: Investigate landscape of PuppetDB Frontends and Provision One as Resolved.

Puppetboard is now reachable via https://puppetboard.wikimedia.org (LDAP auth), resolving.

Mon, Mar 26, 6:35 PM · Patch-For-Review, Operations, Puppet
Volans closed T184563: Investigate landscape of PuppetDB Frontends and Provision One, a subtask of T184561: Modernize Puppet Configuration Management (2017-18 Q3 Goal), as Resolved.
Mon, Mar 26, 6:35 PM · Goal, Puppet, Operations

Mar 22 2018

Volans triaged T190446: Degraded RAID on db1052 as Normal priority.

It's now rebuilding; AFAIK a disk was replaced:

Mar 22 2018, 8:08 PM · DBA, ops-eqiad, Operations
Volans updated the task description for T189891: Failover puppet ca service from eqiad to codfw.
Mar 22 2018, 11:14 AM · Patch-For-Review, Puppet, Operations

Mar 21 2018

Volans added a comment to T177253: Upgrade PuppetDB to version 4.4.

List of hosts with puppet disabled since before the migration, that are missing in the new puppetdb and would disappear from Icinga upon re-enabling puppet there:

Mar 21 2018, 10:04 AM · Puppet, Operations

Mar 20 2018

Volans updated the task description for T170144: Evaluate NetBox as a Racktables replacement & IPAM.
Mar 20 2018, 4:49 PM · Patch-For-Review, netops, Operations
Volans triaged T190184: Netbox: setup backups as Normal priority.
Mar 20 2018, 4:48 PM · netops, Operations
Volans updated the task description for T170144: Evaluate NetBox as a Racktables replacement & IPAM.
Mar 20 2018, 4:11 PM · Patch-For-Review, netops, Operations

Mar 19 2018

Volans updated the task description for T184634: Netbox: postgres cannot be restarted w/ current config.
Mar 19 2018, 4:00 PM · Patch-For-Review, Operations
Volans closed T185505: Netbox: add Icinga check for the website as Resolved.

Agreed in the meeting that for now the simple HTTP check is enough, given that we also check that the uWSGI web app is running.

Mar 19 2018, 3:59 PM · monitoring, Operations
Volans closed T185505: Netbox: add Icinga check for the website, a subtask of T184634: Netbox: postgres cannot be restarted w/ current config, as Resolved.
Mar 19 2018, 3:59 PM · Patch-For-Review, Operations

Mar 14 2018

Volans added a member for Security: Vgutierrez.
Mar 14 2018, 11:59 AM

Mar 13 2018

Volans added a comment to T185967: Cumin: add custom backend to WMCS.

@chasemp I think it might be kept open for https://gerrit.wikimedia.org/r/c/406779/ but up to @madhuvishy

Mar 13 2018, 1:40 PM · Patch-For-Review, cloud-services-team, Operations-Software-Development
Volans added a comment to T188112: cumin 3.0.1-1 is broken on labs master.

I've split the WMCS part into a separate CR that can be merged independently of production: https://gerrit.wikimedia.org/r/c/419131

Mar 13 2018, 9:45 AM · Patch-For-Review, Continuous-Integration-Infrastructure, Operations-Software-Development

Mar 5 2018

Volans moved T188922: EtcdConfig: add Icinga check from Backlog to In progress on the monitoring board.
Mar 5 2018, 4:46 PM · Patch-For-Review, monitoring, MediaWiki-Configuration, Operations
Volans triaged T188922: EtcdConfig: add Icinga check as Normal priority.
Mar 5 2018, 4:46 PM · Patch-For-Review, monitoring, MediaWiki-Configuration, Operations

Mar 2 2018

Volans closed T188627: Cumin: "No such file or directory" when log_file has no directory as Resolved.

@aggro the fix has been merged into master and will be included in the next cumin release.
For now, as a quick workaround to generate the log file in the current directory, you could use ./cumin.log in the configuration file instead of just cumin.log.
Thanks again for reporting the issue.

Mar 2 2018, 9:16 AM · Operations-Software-Development

Mar 1 2018

Volans added a comment to T170740: PuppetDB misbehaving on 2017-07-15.

We had OOMs also with puppet disabled on tegmen, so that's not the culprit.

Mar 1 2018, 10:03 PM · Patch-For-Review, Puppet, Operations
Volans committed rCUMINc982c6119b41: CLI: fix setup_logging() when without path (authored by Volans).
CLI: fix setup_logging() when without path
Mar 1 2018, 5:25 PM
Volans moved T188627: Cumin: "No such file or directory" when log_file has no directory from In Progress to In Code Review on the Operations-Software-Development board.
Mar 1 2018, 4:41 PM · Operations-Software-Development
Volans added a comment to T188627: Cumin: "No such file or directory" when log_file has no directory.

@aggro Thanks a lot for reporting the issue and I can confirm it.
In effect there is a missing check on the log path once it has been split from the filename. I'm sending a fix shortly.

Mar 1 2018, 4:31 PM · Operations-Software-Development
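
The missing check described in the comment above can be illustrated with a minimal Python sketch. This is not Cumin's actual setup_logging() code: the function name setup_file_logging, its parameters and the logger names are hypothetical, chosen only to show why a log_file without a directory component can fail with "No such file or directory" and how a guard on the split path avoids it.

```python
import logging
import os


def setup_file_logging(log_file, logger_name="sketch"):
    """Attach a file handler for log_file (hypothetical helper, not Cumin's API)."""
    log_dir = os.path.dirname(log_file)
    # os.path.dirname("cumin.log") returns "" and os.makedirs("") raises
    # FileNotFoundError, so create the directory only when one is given.
    if log_dir:
        os.makedirs(log_dir, exist_ok=True)

    logger = logging.getLogger(logger_name)
    logger.addHandler(logging.FileHandler(log_file))
    logger.setLevel(logging.INFO)
    return logger
```

With a guard like this, both a bare filename such as cumin.log and a path such as ./cumin.log are accepted, since the directory is created only when the path actually contains one.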
Volans moved T188627: Cumin: "No such file or directory" when log_file has no directory from Backlog to In Progress on the Operations-Software-Development board.
Mar 1 2018, 4:24 PM · Operations-Software-Development
Volans claimed T188627: Cumin: "No such file or directory" when log_file has no directory.
Mar 1 2018, 3:54 PM · Operations-Software-Development

Feb 27 2018

Volans created T188380: Horizon: RAM limit should be rounded to GB.
Feb 27 2018, 10:05 AM · Horizon

Feb 23 2018

Volans added a comment to T188112: cumin 3.0.1-1 is broken on labs master.

This will be fixed by https://gerrit.wikimedia.org/r/c/412894/ , which is pending the full release of Cumin 3.0.1 in prod, which in turn is waiting for the full release of conftool 1.0.0 in prod, which is pending final testing and also an issue with the python3-etcd Debian package.

Feb 23 2018, 9:00 PM · Patch-For-Review, Continuous-Integration-Infrastructure, Operations-Software-Development
Volans added a comment to T188016: db2037 IPMI not working.

The pasted command is missing the 'mgmt' part; it seems to work for me when adding it:

Feb 23 2018, 9:34 AM · monitoring, Operations, ops-codfw

Feb 21 2018

Volans moved T167504: New tool to track package updates/status for hosts and images (debmonitor) from In Progress to Backlog on the Operations-Software-Development board.
Feb 21 2018, 10:03 AM · Patch-For-Review, Continuous-Integration-Infrastructure (shipyard), Operations-Software-Development, Operations
Volans moved T187773: Cumin: upgrade it to 3.0.1 in production from In Progress to In Code Review on the Operations-Software-Development board.
Feb 21 2018, 10:03 AM · Patch-For-Review, Operations-Software-Development
Volans moved T187751: wmf-auto-reimage: migrate script to Python3 from In Progress to In Code Review on the Operations-Software-Development board.
Feb 21 2018, 10:03 AM · Operations-Software-Development

Feb 20 2018

Volans triaged T187773: Cumin: upgrade it to 3.0.1 in production as Normal priority.
Feb 20 2018, 12:52 PM · Patch-For-Review, Operations-Software-Development
Volans triaged T187751: wmf-auto-reimage: migrate script to Python3 as Normal priority.
Feb 20 2018, 12:52 PM · Operations-Software-Development
Volans added a subtask for T187773: Cumin: upgrade it to 3.0.1 in production: T187751: wmf-auto-reimage: migrate script to Python3.
Feb 20 2018, 12:52 PM · Patch-For-Review, Operations-Software-Development
Volans added a parent task for T187751: wmf-auto-reimage: migrate script to Python3: T187773: Cumin: upgrade it to 3.0.1 in production.
Feb 20 2018, 12:52 PM · Operations-Software-Development
Volans moved T187773: Cumin: upgrade it to 3.0.1 in production from Backlog to In Progress on the Operations-Software-Development board.
Feb 20 2018, 12:42 PM · Patch-For-Review, Operations-Software-Development
Volans created T187773: Cumin: upgrade it to 3.0.1 in production.
Feb 20 2018, 12:42 PM · Patch-For-Review, Operations-Software-Development

Feb 19 2018

Volans moved T187751: wmf-auto-reimage: migrate script to Python3 from Backlog to In Progress on the Operations-Software-Development board.
Feb 19 2018, 10:26 PM · Operations-Software-Development
Volans created T187751: wmf-auto-reimage: migrate script to Python3.
Feb 19 2018, 10:26 PM · Operations-Software-Development
Volans committed rCUMIN6c7ac60db9b6: Upstream release v3.0.1 (authored by Volans).
Upstream release v3.0.1
Feb 19 2018, 9:36 PM
Volans committed rCUMINfc584f75bf9e: Merge tag 'tags/v3.0.1' into debian (authored by Volans).
Merge tag 'tags/v3.0.1' into debian
Feb 19 2018, 9:36 PM
Volans committed rCUMIN67bb5d46e806: CHANGELOG: add changelogs for release v3.0.1 (authored by Volans).
CHANGELOG: add changelogs for release v3.0.1
Feb 19 2018, 9:27 PM
Volans committed rCUMIN380ce3c29cc8: CLI: fix help message (authored by Volans).
CLI: fix help message
Feb 19 2018, 9:15 PM
Volans committed rCUMIN77397bb54184: CLI: fix help message (authored by Volans).
CLI: fix help message
Feb 19 2018, 9:11 PM
Volans committed rCUMINc56e34333bb7: Upstream release v3.0.0 (authored by Volans).
Upstream release v3.0.0
Feb 19 2018, 6:13 PM
Volans committed rCUMIN0e7624a55b87: Upstream release v3.0.0 (authored by Volans).
Upstream release v3.0.0
Feb 19 2018, 6:09 PM
Volans committed rCUMIN82c22ef232e3: Upstream release v3.0.0 (authored by Volans).
Upstream release v3.0.0
Feb 19 2018, 4:19 PM
Volans committed rCUMINf91089f03bcd: Merge tag 'tags/v3.0.0' into debian (authored by Volans).
Merge tag 'tags/v3.0.0' into debian
Feb 19 2018, 3:58 PM
Volans added a project to T187722: Replace db2030 from m5 with another host (WAS: Degraded RAID on db2030): DBA.
Feb 19 2018, 3:34 PM · Patch-For-Review, DBA, Operations, ops-codfw
Volans committed rCUMIN32b2e06327d5: CHANGELOG: add changelogs for release v3.0.0 (authored by Volans).
CHANGELOG: add changelogs for release v3.0.0
Feb 19 2018, 3:32 PM
Volans committed rCUMIN8a6f98fbfe7c: CHANGELOG: add changelogs for release v3.0.0 (authored by Volans).
CHANGELOG: add changelogs for release v3.0.0
Feb 19 2018, 3:32 PM
Volans updated subscribers of T187709: Cumin feature idea: Prometheus backend.

Thanks for the proposal. It seems to me a nice-to-have backend, and I don't see any conceptual problem with adding it to cumin's backends. For example, other non-configuration backends, like Icinga in our case, might also be useful sometimes.

Feb 19 2018, 12:34 PM · Operations-Software-Development

Feb 15 2018

Volans moved T162857: Some Core availability Catchpoint tests might be more expensive than they need to be from Backlog to In progress on the monitoring board.
Feb 15 2018, 3:55 PM · monitoring, Patch-For-Review, Operations
Volans added a project to T162857: Some Core availability Catchpoint tests might be more expensive than they need to be: monitoring.
Feb 15 2018, 3:55 PM · monitoring, Patch-For-Review, Operations
Volans updated subscribers of T186818: Cumin: add --limit to randomly select N hosts.
Feb 15 2018, 11:33 AM · Patch-For-Review, Operations-Software-Development
Volans created T187429: Beta: Y-axis units and rounding issues.
Feb 15 2018, 10:52 AM · Analytics, Analytics-Wikistats

Feb 14 2018

Volans closed T187185: Cumin: CLI, allow to specify percentage too in --batch-size as Resolved.
Feb 14 2018, 3:49 PM · Operations-Software-Development
Volans claimed T184563: Investigate landscape of PuppetDB Frontends and Provision One.
Feb 14 2018, 3:20 PM · Patch-For-Review, Operations, Puppet