Volans (Riccardo Coccioli)
Operations Software Engineer

User Details

User Since
Feb 10 2016, 11:25 AM (71 w, 1 d)
Availability
Available
IRC Nick
volans
LDAP User
Volans
MediaWiki User
RCoccioli (WMF)

Recent Activity

Yesterday

Volans added a comment to T167504: New tool to track package updates/status for hosts and images (debmonitor).

We should also investigate other available tools in the container space, for example the recently released https://github.com/puppetlabs/lumogon, or https://github.com/coreos/clair from CoreOS (thanks @Joe for this one). Disclaimer: I haven't yet done an extensive search for other available tools ;)

Wed, Jun 21, 9:43 AM · Operations-Software-Development, Operations

Mon, Jun 19

Volans added a comment to T156933: Improve purging for analytics-slave data on Eventlogging.

What @jcrespo said, see also my comment on https://gerrit.wikimedia.org/r/#/c/356383/12/modules/role/files/mariadb/eventlogging_cleaner.py@206 regarding the addition of an ORDER BY.

Mon, Jun 19, 11:07 AM · Patch-For-Review, User-Elukey, Analytics-Kanban

Sat, Jun 17

Volans added a comment to T168142: Cleanup phabricator.wikimedia.org uploaded files, WP zero abuse.

Sorry for the late reply, partly because I was too busy cleaning things up to reply here (thanks Reedy for the help) and partly to avoid giving real-time feedback to the abusers.
Thanks to everyone here who helped by notifying us and limiting the impact whenever possible.

Sat, Jun 17, 9:38 PM · Wikimedia-Site-requests, Phabricator
Volans added a comment to T159922: pdfrender fails to serve requests since Mar 8 00:30:32 UTC on scb1003.

Is there an ETA for a permanent fix? It seems to me that we've already delayed this too long, given how frequently it's been happening lately.

Sat, Jun 17, 4:55 PM · Patch-For-Review, Operations, Services, Electron-PDFs

Fri, Jun 16

Volans moved T164838: Cumin: allow to specify a timeout per command from In Progress to In Code Review on the Operations-Software-Development board.
Fri, Jun 16, 4:42 PM · Patch-For-Review, Operations-Software-Development
Volans added a hashtag to Operations-Software-Development: #cumin.
Fri, Jun 16, 10:40 AM

Wed, Jun 14

Volans added a comment to T156120: Update gerrit to 2.14.1.

@Paladox FYI I'm still getting Invalid SSH Key when trying to add my key

Wed, Jun 14, 1:30 PM · Release-Engineering-Team (Backlog), Patch-For-Review, Gerrit

Tue, Jun 13

Volans moved T166371: Monitoring: create an alert for daemonized puppet from Backlog to Done on the Operations-Software-Development board.
Tue, Jun 13, 8:27 AM · Patch-For-Review, Operations-Software-Development, monitoring, Operations

Mon, Jun 12

Volans added a comment to T167504: New tool to track package updates/status for hosts and images (debmonitor).

@akosiaris yes we were aware of it and I spoke with @Joe last week about the requirements for the Docker part, sorry to not have mentioned/referenced it here too. The idea is to have a single tool at this point that can work for both physical hosts and Docker images, so it should overlap fully with the requirements of T167269.

Mon, Jun 12, 3:00 PM · Operations-Software-Development, Operations
Volans closed T167394: Cumin: fix ok_codes when set to empty list as Resolved.
Mon, Jun 12, 1:43 PM · Operations-Software-Development
Volans closed T167392: Cumin: fix --success-percentage 0 as Resolved.
Mon, Jun 12, 1:42 PM · Operations-Software-Development

Fri, Jun 9

Volans updated subscribers of T167504: New tool to track package updates/status for hosts and images (debmonitor).
Fri, Jun 9, 1:34 PM · Operations-Software-Development, Operations
Volans updated the task description for T167504: New tool to track package updates/status for hosts and images (debmonitor).
Fri, Jun 9, 1:21 PM · Operations-Software-Development, Operations

Thu, Jun 8

Volans created T167422: Monitoring: add link to graph for Icinga timeseries alarms.
Thu, Jun 8, 2:47 PM · Operations, monitoring
Volans moved T167392: Cumin: fix --success-percentage 0 from In Progress to In Code Review on the Operations-Software-Development board.
Thu, Jun 8, 10:20 AM · Operations-Software-Development
Volans moved T167394: Cumin: fix ok_codes when set to empty list from In Progress to In Code Review on the Operations-Software-Development board.
Thu, Jun 8, 10:20 AM · Operations-Software-Development
Volans moved T167394: Cumin: fix ok_codes when set to empty list from Backlog to In Progress on the Operations-Software-Development board.
Thu, Jun 8, 10:00 AM · Operations-Software-Development
Volans created T167394: Cumin: fix ok_codes when set to empty list.
Thu, Jun 8, 10:00 AM · Operations-Software-Development
Volans triaged T167392: Cumin: fix --success-percentage 0 as High priority.
Thu, Jun 8, 9:46 AM · Operations-Software-Development
Volans moved T167392: Cumin: fix --success-percentage 0 from Backlog to In Progress on the Operations-Software-Development board.
Thu, Jun 8, 9:46 AM · Operations-Software-Development
Volans created T167392: Cumin: fix --success-percentage 0.
Thu, Jun 8, 9:46 AM · Operations-Software-Development

Wed, Jun 7

Volans closed T166203: Upgrade facter to version 2.4.6 as Resolved.

Facter is upgraded in production on the whole fleet apart from cp3003.esams.wmnet and labstore[1001-1002].eqiad.wmnet, which will need to be reimaged anyway. Labs was also upgraded by Faidon via Salt.

Wed, Jun 7, 3:10 PM · Patch-For-Review, Labs, Operations
Volans added a project to T167268: Degraded RAID on ms-be1016: media-storage.
Wed, Jun 7, 8:42 AM · media-storage, ops-eqiad, Operations

Tue, Jun 6

Volans edited projects for T167118: Degraded RAID on ms-be2001, added: media-storage; removed Traffic.
Tue, Jun 6, 11:48 AM · media-storage, Operations, ops-codfw
Volans added a project to T167118: Degraded RAID on ms-be2001: Traffic.
Tue, Jun 6, 11:48 AM · media-storage, Operations, ops-codfw

Mon, Jun 5

Volans moved T158747: Cumin: better error message if no config file is available from In Progress to In Code Review on the Operations-Software-Development board.
Mon, Jun 5, 6:35 PM · Patch-For-Review, Operations-Software-Development
Volans moved T158747: Cumin: better error message if no config file is available from Backlog to In Progress on the Operations-Software-Development board.
Mon, Jun 5, 6:35 PM · Patch-For-Review, Operations-Software-Development
Volans closed T166962: Degraded RAID on terbium as Resolved.

Fix merged.

Mon, Jun 5, 3:33 PM · ops-eqiad, Operations
Volans claimed T166962: Degraded RAID on terbium.

False positive, I'll add the error message to the list of ones to be skipped.

Mon, Jun 5, 3:26 PM · ops-eqiad, Operations
Volans added a comment to T166964: Degraded RAID on lvs3001.

Relating it to T166965

Mon, Jun 5, 3:23 PM · Traffic, ops-esams, Operations
Volans added a project to T166964: Degraded RAID on lvs3001: Traffic.
Mon, Jun 5, 3:16 PM · Traffic, ops-esams, Operations
Volans added a project to T166965: Degraded RAID on lvs3001: Traffic.
Mon, Jun 5, 3:15 PM · Traffic, ops-esams, Operations
Volans closed T145191: Fix retcode in wmfpuppet Salt module as Declined.

Salt is now deprecated and we're using Cumin instead. We also have new tools to properly manage puppet runs such as run-puppet-agent.

Mon, Jun 5, 2:45 PM · Operations-Software-Development
Volans added a comment to T149589: Puppet tab in Horizon unusably slow.

To add some data here, I'm getting very slow responses when opening an instance page, like https://horizon.wikimedia.org/project/instances/edbb1ea0-6e77-4159-8e6f-29886fad5dfa/. It takes around 15 seconds the first time and is then quicker for a while, I guess for as long as some of the results are cached. Opening the Puppet Configuration tab then takes another 4~5 seconds. See the timings below with the details for the instance GET:

Mon, Jun 5, 2:04 PM · Patch-For-Review, Horizon, Operations, Puppet, Labs
Volans moved T144169: Flake8 for python files without extension in puppet repo from In Progress to In Code Review on the Operations-Software-Development board.
Mon, Jun 5, 11:41 AM · Patch-For-Review, Continuous-Integration-Config, Operations, Operations-Software-Development
Volans moved T144169: Flake8 for python files without extension in puppet repo from Backlog to In Progress on the Operations-Software-Development board.
Mon, Jun 5, 11:21 AM · Patch-For-Review, Continuous-Integration-Config, Operations, Operations-Software-Development

Thu, Jun 1

Volans added a project to T166777: Degraded RAID on ms-be1020: media-storage.

@Cmjohnson @Papaul FYI: given that the RAID alarm in Icinga can now also be triggered by a faulty BBU or a wrong WritePolicy, I've added the Icinga error on top of the get raid output.
If the error reports problems related to the BBU or the WritePolicy, the disk status output will most likely report everything as OK and not be very helpful.
This is a temporary solution until we have some time to work on refactoring/improving the RAID checks as a whole.

Thu, Jun 1, 11:52 AM · media-storage, ops-eqiad, Operations

Wed, May 31

Volans closed T166519: Raid handler: manage new alarms as Resolved.
Wed, May 31, 4:38 PM · Operations-Software-Development
Volans merged T166700: Degraded RAID on db1094 into T166518: Degraded BBU on db1094 (was: Degraded RAID on db1094).
Wed, May 31, 4:37 PM · Patch-For-Review, DBA, ops-eqiad, Operations
Volans merged task T166700: Degraded RAID on db1094 into T166518: Degraded BBU on db1094 (was: Degraded RAID on db1094).
Wed, May 31, 4:37 PM · ops-eqiad, Operations
Volans moved T166519: Raid handler: manage new alarms from In Progress to In Code Review on the Operations-Software-Development board.
Wed, May 31, 11:00 AM · Operations-Software-Development
Volans moved T166519: Raid handler: manage new alarms from Backlog to In Progress on the Operations-Software-Development board.
Wed, May 31, 10:45 AM · Operations-Software-Development
Volans moved T164838: Cumin: allow to specify a timeout per command from In Code Review to In Progress on the Operations-Software-Development board.
Wed, May 31, 10:45 AM · Patch-For-Review, Operations-Software-Development

Tue, May 30

Volans closed T165842: Cumin: add a simple txt/json output as Resolved.
Tue, May 30, 5:03 PM · Operations-Software-Development
Volans closed T165838: Cumin: add a simple interactive mode as Resolved.
Tue, May 30, 5:02 PM · Operations-Software-Development
Volans placed T166203: Upgrade facter to version 2.4.6 up for grabs.
Tue, May 30, 4:29 PM · Patch-For-Review, Labs, Operations
Volans added a comment to T166570: Do something to better handle wmf-reimage runs cleanups/failures.

I think most of this will go away when working on T166300, probably in Q1, as part of the Salt deprecation goal. My plan is to get rid of wmf-reimage completely and have a single script that handles the whole process.

Tue, May 30, 4:27 PM · Operations-Software-Development
Volans added a parent task for T166300: Remove Salt from wmf-auto-reimage / wmf-reimage: T148814: wmf-auto-reimage improvements.
Tue, May 30, 4:24 PM · Technical-Debt, Operations-Software-Development, Operations
Volans added a subtask for T148814: wmf-auto-reimage improvements: T166300: Remove Salt from wmf-auto-reimage / wmf-reimage.
Tue, May 30, 4:24 PM · Operations-Software-Development
Volans updated subscribers of T166203: Upgrade facter to version 2.4.6.
Tue, May 30, 4:17 PM · Patch-For-Review, Labs, Operations
Volans added a comment to T166372: Puppet: test non stringified facts across the fleet .

Then a few diffs that are labs-only:

  • I now have an ec2_metadata fact that I was not getting before
  • Getting the facts from PuppetDB right after the first run with stringify_facts = false, I got 17 additional ec2_* facts, which then disappeared after the second and subsequent puppet runs.
Tue, May 30, 3:08 PM · Patch-For-Review, Operations
Volans updated subscribers of T166372: Puppet: test non stringified facts across the fleet .

@akosiaris @Joe @faidon
I've changed my labs project to stringify_facts = false and these are the facts that differ. Bear in mind that with v3 of the PuppetDB API the facts are still reported "stringified", in the sense that they have a value key that is a string, which is now a JSON-encoded string.
Below are the diffs of the value property:

Tue, May 30, 11:16 AM · Patch-For-Review, Operations
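
As an illustration of the point in the comment above about the value key, here is a minimal Python sketch of how a consumer might decode such a value. It assumes only what the comment states (each value arrives as a string that may contain JSON); the function name decode_fact_value and the sample inputs are hypothetical and not taken from the task.

```python
import json


def decode_fact_value(raw):
    """Decode a fact value that may be a JSON-encoded string.

    Hypothetical helper: falls back to the raw string when the
    value is not valid JSON (e.g. a plain, already-stringified fact).
    """
    try:
        return json.loads(raw)
    except ValueError:  # json.JSONDecodeError is a subclass of ValueError
        return raw


if __name__ == "__main__":
    # Made-up sample values, for illustration only.
    for sample in ("8", '"8"', "Linux", '{"eth0": {"mtu": 1500}}'):
        print(repr(sample), "->", repr(decode_fact_value(sample)))
```

One caveat of this approach: a plain stringified value such as 8 and a JSON-encoded integer 8 look identical on the wire, so a decoder like this cannot tell which mode produced it.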

Mon, May 29

Volans created T166519: Raid handler: manage new alarms.
Mon, May 29, 5:49 PM · Operations-Software-Development
Volans added a project to T166518: Degraded BBU on db1094 (was: Degraded RAID on db1094): DBA.
Mon, May 29, 5:46 PM · Patch-For-Review, DBA, ops-eqiad, Operations
Volans updated the task description for T166518: Degraded BBU on db1094 (was: Degraded RAID on db1094).
Mon, May 29, 5:45 PM · Patch-For-Review, DBA, ops-eqiad, Operations
Volans merged T166517: Degraded RAID on ms-be1020 into T163777: Debug HP raid cache disabled errors on ms-be1019/20/21.
Mon, May 29, 5:43 PM · User-fgiunchedi, ops-eqiad, Operations
Volans merged task T166517: Degraded RAID on ms-be1020 into T163777: Debug HP raid cache disabled errors on ms-be1019/20/21.
Mon, May 29, 5:43 PM · ops-eqiad, Operations
Volans closed T163087: Degraded RAID on heze as Resolved.

All looks good, resolving for now:

Mon, May 29, 11:18 AM · Operations, ops-codfw
Volans added a project to T166422: Degraded RAID on db1046: DBA.
Mon, May 29, 11:14 AM · DBA, Analytics-Kanban, ops-eqiad, Operations
Volans added a comment to T165220: Degraded RAID on labstore1003.

Should this be resolved? There is still a disk with predictive failure, but not yet failed:

Mon, May 29, 11:13 AM · Labs, ops-eqiad, Operations
Volans closed T164833: Cumin: allow to specify successful exit codes as Resolved.
Mon, May 29, 11:10 AM · Operations-Software-Development
Volans added a comment to T164206: Icinga randomly forgets downtimes, causing alert and page spam.

Yes @akosiaris, every time it happened it was during a cron puppet run and, it seems to me, only when there were changes in the generated puppet_hosts.cfg config file.

Mon, May 29, 9:27 AM · Patch-For-Review, Icinga, Operations, monitoring
Volans added a comment to T164206: Icinga randomly forgets downtimes, causing alert and page spam.

@akosiaris actually this happened ~2h after I killed the daemonized puppet on tegmen... I'm not sure this explanation is still valid, thoughts?

Mon, May 29, 9:12 AM · Patch-For-Review, Icinga, Operations, monitoring
Volans added a comment to T166372: Puppet: test non stringified facts across the fleet .

Now that the above has been merged, none of the labvirt* instances have a diff, so all the remaining differences are just the string vs. integer type of $::processorcount as a class parameter.

Mon, May 29, 8:52 AM · Patch-For-Review, Operations

Sun, May 28

Volans updated subscribers of T161553: Remove OpenStackManager from Wikitech.

Adding @MoritzMuehlenhoff too.

Sun, May 28, 2:23 PM · MW-1.30-release-notes (WMF-deploy-2017-06-06_(1.30.0-wmf.4)), Patch-For-Review, wikitech.wikimedia.org, MediaWiki-extensions-OpenStackManager, Labs
Volans added a comment to T166372: Puppet: test non stringified facts across the fleet .

All but two diffs are related to $::processorcount:

Sun, May 28, 9:48 AM · Patch-For-Review, Operations

Sat, May 27

Volans added a comment to T160731: db1048 BBU Faulty - slave lagging.

And db1048 returned to WriteBack policy less than 1h ago 😛

Sat, May 27, 5:45 PM · Patch-For-Review, ops-eqiad, Operations, Phabricator, DBA
Volans added a comment to T166397: Cumin fails on huge nodelists emitted by its own outputs.

Once we upgrade to PuppetDB API v4, I will move the PuppetDB queries in Cumin from GET to POST to overcome this limit and see whether we hit any other limit. Unfortunately, v3 of the PuppetDB API doesn't accept POST.

Sat, May 27, 4:21 PM · Operations-Software-Development
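
To make the GET vs. POST difference in the comment above concrete, here is a rough Python sketch that builds (but does not send) both kinds of request for the same list of hosts. It is not Cumin's code: the host name puppetdb.example.org is hypothetical, and the endpoint path and the certname query form are assumptions about the v4 query API.

```python
import json
from urllib.parse import urlencode
from urllib.request import Request

# Hypothetical host; the v4 endpoint path is an assumption.
BASE_URL = "https://puppetdb.example.org/pdb/query/v4/nodes"


def build_query(hosts):
    """Build an 'or' query matching any of the given certnames."""
    return ["or"] + [["=", "certname", host] for host in hosts]


def get_request(hosts):
    """GET: the whole query travels in the URL, which grows with the host list."""
    params = urlencode({"query": json.dumps(build_query(hosts))})
    return Request(BASE_URL + "?" + params)


def post_request(hosts):
    """POST: the query travels in the body, so the URL length stays constant."""
    body = json.dumps({"query": build_query(hosts)}).encode("utf-8")
    headers = {"Content-Type": "application/json"}
    return Request(BASE_URL, data=body, headers=headers, method="POST")


if __name__ == "__main__":
    hosts = ["host%04d.example.org" % i for i in range(1000)]
    print("GET URL length: ", len(get_request(hosts).full_url))
    print("POST URL length:", len(post_request(hosts).full_url))
```

With GET, the URL length scales with the number of hosts, which is where server-side limits on request size come into play; with POST, only the body grows.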
Volans added a comment to T160731: db1048 BBU Faulty - slave lagging.

So far the lag is limited to 3~4 seconds according to tendril, while in Grafana it is flat zero; maybe the dashboard is not graphing the right data?
See db1048 replication lag dashboard.

Sat, May 27, 10:23 AM · Patch-For-Review, ops-eqiad, Operations, Phabricator, DBA
Volans reopened T160731: db1048 BBU Faulty - slave lagging as "Open".

Re-opening as it alarmed again today for the write policy... the battery is reported to be from 2010, wasn't it swapped a few days ago?

Sat, May 27, 10:17 AM · Patch-For-Review, ops-eqiad, Operations, Phabricator, DBA
Volans renamed T162850: CPU throttling on DELL PowerEdge R320 from acpi_pad issues to CPU throttling on DELL PowerEdge R320.
Sat, May 27, 9:53 AM · Patch-For-Review, Operations
Volans reopened T162850: CPU throttling on DELL PowerEdge R320 as "Open".

tin hit this today. I tried rmmod mei_me and rmmod mei as suggested above, but that didn't fix the problem live. It probably needs a reboot, but I'm not rebooting it right now (see below).

Sat, May 27, 9:52 AM · Patch-For-Review, Operations

Fri, May 26

Volans added a comment to T166397: Cumin fails on huge nodelists emitted by its own outputs.

@BBlack yes, that is a PuppetDB error when the limit is reached.
If you already have an authoritative list of hosts in NodeSet notation (the one printed by cumin), you can use --backend direct to use it as is without querying PuppetDB.

Fri, May 26, 5:29 PM · Operations-Software-Development
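
For readers unfamiliar with the NodeSet notation mentioned in the comment above, here is a deliberately simplified Python sketch of how a folded pattern such as labstore[1001-1002].eqiad.wmnet maps to individual host names. It is not the ClusterShell implementation and covers only a single bracket group with comma-separated numbers or ranges; the function name expand is hypothetical.

```python
import re

# One bracket group only: prefix[body]suffix. Real NodeSet syntax is richer.
_PATTERN = re.compile(r"^([^\[\]]*)\[([^\[\]]+)\]([^\[\]]*)$")


def expand(pattern):
    """Expand a simplified NodeSet-like pattern into a list of host names."""
    match = _PATTERN.match(pattern)
    if match is None:
        # No bracket group: treat the pattern as a single host name.
        return [pattern]
    prefix, body, suffix = match.groups()
    hosts = []
    for part in body.split(","):
        start, _, end = part.partition("-")
        end = end or start  # a bare number is a one-element range
        width = len(start)  # preserve zero padding, e.g. 01-03
        for number in range(int(start), int(end) + 1):
            hosts.append(prefix + str(number).zfill(width) + suffix)
    return hosts


if __name__ == "__main__":
    print(expand("labstore[1001-1002].eqiad.wmnet"))
    print(expand("ms-be[1019-1021]"))
```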
Volans added a comment to T166372: Puppet: test non stringified facts across the fleet .

It seems expected to me: it is used through $::processorcount across different modules in puppet, and the reported diff is only in the parameters of the class.

Fri, May 26, 11:34 AM · Patch-For-Review, Operations
Volans added a comment to T166372: Puppet: test non stringified facts across the fleet .

First diff found on scb1004:

Fri, May 26, 11:00 AM · Patch-For-Review, Operations
Volans added a comment to T166372: Puppet: test non stringified facts across the fleet .

The command to run this across the fleet (skipping the hosts currently down) is:

Fri, May 26, 9:28 AM · Patch-For-Review, Operations
Volans created T166372: Puppet: test non stringified facts across the fleet .
Fri, May 26, 8:54 AM · Patch-For-Review, Operations
Volans closed T166203: Upgrade facter to version 2.4.6 as Resolved.
Fri, May 26, 8:46 AM · Patch-For-Review, Labs, Operations
Volans updated the task description for T166371: Monitoring: create an alert for daemonized puppet.
Fri, May 26, 8:26 AM · Patch-For-Review, Operations-Software-Development, monitoring, Operations
Volans added a comment to T166203: Upgrade facter to version 2.4.6.

So it seems that those flapping results are due to puppet ALSO running as a daemon on those hosts (thanks @faidon): if at any time a puppet agent run has a typo in the options around -t, puppet smartly decides to ignore the wrong option and run as a daemon in the background.
Some examples were:

Fri, May 26, 8:24 AM · Patch-For-Review, Labs, Operations
Volans created T166371: Monitoring: create an alert for daemonized puppet.
Fri, May 26, 8:24 AM · Patch-For-Review, Operations-Software-Development, monitoring, Operations

Thu, May 25

Volans added a comment to T166344: db1016 m1 master: Possibly faulty BBU.

I've ack'ed the Icinga alarm with this task.

Thu, May 25, 9:12 PM · ops-eqiad, Operations, DBA
Volans updated subscribers of T166203: Upgrade facter to version 2.4.6.

Facter upgraded and verified to be a noop across the fleet.

Thu, May 25, 6:57 PM · Patch-For-Review, Labs, Operations
Volans reopened T166300: Remove Salt from wmf-auto-reimage / wmf-reimage, a subtask of T164780: Sunset our use of Salt, as Open.
Thu, May 25, 10:52 AM · Technical-Debt, Operations-Software-Development, Operations
Volans created T166300: Remove Salt from wmf-auto-reimage / wmf-reimage.
Thu, May 25, 10:52 AM · Technical-Debt, Operations-Software-Development, Operations

Wed, May 24

Volans added a comment to T166203: Upgrade facter to version 2.4.6.

The upgrade will be performed with these steps:

  • disable puppet reliably (waiting for any in-flight run)
  • compile the catalog and output the facts to a directory
  • upgrade facter
  • compile the catalog again and output the facts to another directory
  • compare the result of the two runs
  • enable puppet
  • remove temporary files
Wed, May 24, 9:21 AM · Patch-For-Review, Labs, Operations
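
The "compare the result of the two runs" step in the list above could be sketched roughly as follows in Python. This is an illustration under assumptions, not the script actually used: it assumes the facts from each run have already been loaded into a dictionary, and the function name diff_facts and the sample data are hypothetical.

```python
ABSENT = "<absent>"  # marker for a fact that exists in only one of the two runs


def diff_facts(before, after):
    """Return {fact_name: (old_value, new_value)} for every fact that differs."""
    changes = {}
    for name in sorted(set(before) | set(after)):
        old = before.get(name, ABSENT)
        new = after.get(name, ABSENT)
        if old != new:
            changes[name] = (old, new)
    return changes


if __name__ == "__main__":
    # Made-up sample facts, for illustration only.
    before = {"kernel": "Linux", "processorcount": "8"}
    after = {"kernel": "Linux", "processorcount": "8"}
    changes = diff_facts(before, after)
    if changes:
        for name, (old, new) in changes.items():
            print("%s: %r -> %r" % (name, old, new))
    else:
        print("noop: facts are identical before and after the upgrade")
```

An empty result corresponds to the upgrade being a noop for that host.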
Volans created T166203: Upgrade facter to version 2.4.6.
Wed, May 24, 8:27 AM · Patch-For-Review, Labs, Operations

Tue, May 23

Volans updated the task description for T166177: Degraded RAID on ms-be1008.
Tue, May 23, 9:00 PM · ops-eqiad, Operations

May 23 2017

Volans placed T150560: More verbose messages from service-checker-swagger up for grabs.
May 23 2017, 5:28 PM · Patch-For-Review, Services (watching), Operations-Software-Development, Operations
Volans closed T166137: Degraded RAID on dataset1001 as Invalid.

This was a raid check false positive

May 23 2017, 3:17 PM · ops-eqiad, Operations
Volans closed T166136: Degraded RAID on tin as Invalid.

This was a raid check false positive

May 23 2017, 3:17 PM · ops-eqiad, Operations
Volans closed T165583: Puppet compiler: sync facts from all workers as Resolved.

Documentation updated on https://wikitech.wikimedia.org/wiki/Nova_Resource:Puppet3-diffs/Documentation

May 23 2017, 9:42 AM · Patch-For-Review, Operations, Operations-Software-Development

May 20 2017

Volans moved T165842: Cumin: add a simple txt/json output from In Progress to In Code Review on the Operations-Software-Development board.
May 20 2017, 10:37 AM · Operations-Software-Development
Volans moved T165838: Cumin: add a simple interactive mode from In Progress to In Code Review on the Operations-Software-Development board.
May 20 2017, 10:37 AM · Operations-Software-Development
Volans moved T165842: Cumin: add a simple txt/json output from Backlog to In Progress on the Operations-Software-Development board.
May 20 2017, 8:57 AM · Operations-Software-Development
Volans created T165842: Cumin: add a simple txt/json output.
May 20 2017, 8:56 AM · Operations-Software-Development
Volans moved T165838: Cumin: add a simple interactive mode from Backlog to In Progress on the Operations-Software-Development board.
May 20 2017, 8:44 AM · Operations-Software-Development
Volans created T165838: Cumin: add a simple interactive mode.
May 20 2017, 8:44 AM · Operations-Software-Development

May 17 2017

Volans moved T165583: Puppet compiler: sync facts from all workers from In Progress to In Code Review on the Operations-Software-Development board.
May 17 2017, 11:17 AM · Patch-For-Review, Operations, Operations-Software-Development
Volans moved T165583: Puppet compiler: sync facts from all workers from Backlog to In Progress on the Operations-Software-Development board.
May 17 2017, 10:46 AM · Patch-For-Review, Operations, Operations-Software-Development