Volans (Riccardo Coccioli)
SRE

Projects (8)

User Details

User Since
Feb 10 2016, 11:25 AM (184 w, 4 d)
Availability
Available
IRC Nick
volans
LDAP User
Volans
MediaWiki User
RCoccioli (WMF) [ Global Accounts ]

Recent Activity

Fri, Aug 23

Volans updated the task description for T231068: Spicerack: improve support for Ganeti VMs.
Fri, Aug 23, 10:46 AM · SRE-tools
Volans triaged T231068: Spicerack: improve support for Ganeti VMs as Normal priority.
Fri, Aug 23, 10:39 AM · SRE-tools
Volans created T231068: Spicerack: improve support for Ganeti VMs.
Fri, Aug 23, 10:39 AM · SRE-tools
Volans moved T231066: Host decommission improvements from Backlog to In Progress on the SRE-tools board.
Fri, Aug 23, 10:31 AM · Patch-For-Review, DC-Ops, SRE-tools
Volans triaged T231066: Host decommission improvements as Normal priority.
Fri, Aug 23, 9:18 AM · Patch-For-Review, DC-Ops, SRE-tools

Thu, Aug 22

Volans added a comment to T229657: Switchover m5 primary master: db1073 to db1133.

@CDanis @Volans can you confirm that this command will set wikitech (db1073 is its master) to read-only?

# set read-only
dbctl --scope eqiad section wikitech ro "Maintenance on wikitech T229657 " && dbctl config commit -m "Set wikitech as read-only for maintenance T229657"

Thanks!

Thu, Aug 22, 1:04 PM · Patch-For-Review, cloud-services-team (Kanban), wikitech.wikimedia.org, Operations, DBA

Mon, Aug 19

Volans updated subscribers of T230712: sre.ganeti.makevm cook book only allows specifying RAM size in full gigabytes.

That's because we pass memory={memory}g to the gnt-instance add command. We could instead accept a float in the cookbook, convert it to MB and use m in the command.
I'm fine either way. CCing @elukey, who used it a lot, and @crusnov
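
The conversion suggested in the comment above could look roughly like the sketch below. The function name `gnt_memory_arg` is hypothetical and not taken from the cookbook, and the factor of 1024 MB per gigabyte is an assumption about how the sizes are meant to be interpreted.

```python
def gnt_memory_arg(memory_gb):
    """Build a memory=<N>m argument from a RAM size given in (possibly fractional) gigabytes.

    Hypothetical helper: the name and the 1024 MB-per-GB factor are assumptions.
    """
    if memory_gb <= 0:
        raise ValueError("memory must be a positive number of gigabytes")
    # Round to a whole number of megabytes so the argument stays an integer.
    memory_mb = int(round(memory_gb * 1024))
    return "memory={}m".format(memory_mb)


print(gnt_memory_arg(0.5))
print(gnt_memory_arg(4))
```

With this approach whole-gigabyte values keep working unchanged, while fractional values such as 0.5 or 1.5 become expressible.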

Mon, Aug 19, 11:23 AM · SRE-tools

Wed, Aug 14

Volans added a comment to T217072: Spicerack module for Netbox.

Test results of the module on cumin2001:

  • fetch_host() doesn't raise if no host is found and returns None, should probably raise
  • dry_run is not respected if using fetch_host() as the user can modify the object and call host.save(). I'm not sure what the easy solution is there.
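
A minimal sketch of the "should probably raise" behaviour from the first point, assuming only that the lookup returns None when nothing is found. `HostNotFoundError` and `fetch_host_or_raise` are hypothetical names, and the `fetch_host` argument is a stand-in callable, not the module's real API.

```python
class HostNotFoundError(Exception):
    """Raised when no host matches the requested hostname (hypothetical exception)."""


def fetch_host_or_raise(fetch_host, hostname):
    """Call a lookup that returns None on a miss and turn the miss into an exception."""
    host = fetch_host(hostname)
    if host is None:
        raise HostNotFoundError("host not found: {}".format(hostname))
    return host


# Stand-in inventory: dict.get returns None for unknown keys, like the lookup described above.
inventory = {"example1001": {"status": "active"}}
print(fetch_host_or_raise(inventory.get, "example1001"))
```

Raising makes a missing host fail loudly at the lookup instead of surfacing later as an attribute error on None.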
Wed, Aug 14, 4:44 PM · netbox, Patch-For-Review, User-crusnov, SRE-tools
Volans reopened T209182: Setup Swift Storage for Netbox image (was: netbox won't allow me to upload photos of the rack) as "Open".

Re-opening as there is still some work to do, given that the current setup is less redundant than it was before.
Namely, I think we need to:

  • ensure images are replicated across both main datacenters
  • ensure Content-Type is properly set so that image attachments are shown instead of downloaded
  • decide on a strategy to back up the attachments
  • check if Netbox supports/is supposed to show thumbs of attached images
Wed, Aug 14, 1:34 PM · netbox, Operations

Mon, Aug 12

Volans added a comment to T229677: #dbctl: add 'comment'/'description' metadata to instances.

I'm ok with the proposed UI, but I'm wondering if we could reduce the duplication for the DBAs, who would have to set both a commit message and a note message when very often they will be the same.
The problem is that the note is set on the instance, while the commit message is set only later, when actually committing.
It would also be hard to avoid overwriting pre-existing notes and to clean up when a host is back in normal shape.
So for now it is probably better to keep it simple, at the cost of some duplication in some cases.

Mon, Aug 12, 11:00 AM · DBA, conftool
Volans closed T230002: puppetdb queue size went up since July 30, a subtask of T228657: Upgrade Puppet Masters and Puppet DB servers, as Resolved.
Mon, Aug 12, 10:28 AM · Patch-For-Review, Puppet
Volans closed T230002: puppetdb queue size went up since July 30 as Resolved.

As the queue on grafana has gone back to zero too I'll resolve it for now. Thanks a lot for the fix @jbond

Mon, Aug 12, 10:28 AM · Patch-For-Review, Operations

Wed, Aug 7

Volans added a comment to T229998: decom cookbook: dry-run mode not working / PuppetDB and Debmonitor removals can fail.

The solution that was agreed at the SRE summit for this is to add a dd to overwrite the bootloader(s) so that the host cannot boot anymore, and then shut down the host (most likely via IPMI; for VMs, probably directly via ganeti).
At that point we can revoke certs in puppet, remove from debmonitor, etc... without the risk of any race or re-appearance.

Wed, Aug 7, 9:22 AM · Operations
Volans added a comment to T229998: decom cookbook: dry-run mode not working / PuppetDB and Debmonitor removals can fail.

@MoritzMuehlenhoff the dry-run mode is passed to all modules, and all modules must implement it unless they only do RO actions.
The difference in logging is because dry-run automatically sets debug logging to stdout; if you look at the file logs (normal and detailed) they show you all the steps.
The "skip" part was not added to the logging given the DRY-RUN in front, but ofc that could be done if this is confusing.
PuppetDB has a queue, and it certainly has an issue that needs to be investigated, see https://grafana.wikimedia.org/d/000000477/puppetdb?panelId=19&fullscreen&orgId=1&from=1564236470272&to=1565167416489

Wed, Aug 7, 8:44 AM · Operations

Tue, Aug 6

Volans committed rOSCTc4dac46a79d9: Bump debian release (authored by Volans).
Tue, Aug 6, 1:50 PM
Volans committed rOSCTa5488ba86516: debian: re-add the tests directory in the package (authored by Volans).
Tue, Aug 6, 1:37 PM

Mon, Aug 5

Volans closed T205886: Cookbooks: convert remaining wmf-* scripts, a subtask of T205867: Expand Spicerack library and SRE Cookbooks - Q2 2018-19 Goal, as Resolved.
Mon, Aug 5, 1:38 PM · SRE-tools, Operations, Goal
Volans closed T205886: Cookbooks: convert remaining wmf-* scripts as Resolved.
Mon, Aug 5, 1:38 PM · Patch-For-Review, SRE-tools
Volans closed T229706: helium.mgmt down, a subtask of T224794: Degraded RAID on helium, as Resolved.
Mon, Aug 5, 8:52 AM · ops-eqiad, Operations
Volans closed T229706: helium.mgmt down as Resolved.

@Dzahn have you tried to follow https://wikitech.wikimedia.org/wiki/Management_Interfaces#Troubleshooting_Commands, in particular Management_Interfaces#Does_IPMI_works_but_SSH_to_the_management_console_doesn't? ?

Mon, Aug 5, 8:52 AM · ops-eqiad, Operations
Volans triaged T229782: SRE firefighting improvements - 2019-20 Q1 Goal as Normal priority.
Mon, Aug 5, 8:09 AM · Goal, Operations
Volans created T229782: SRE firefighting improvements - 2019-20 Q1 Goal.
Mon, Aug 5, 8:09 AM · Goal, Operations
Volans updated Volans.
Mon, Aug 5, 8:01 AM
Volans added a member for conftool: Volans.
Mon, Aug 5, 8:01 AM

Thu, Aug 1

Volans added a comment to T97972: Figure out a security model for etcd.

@Joe any feedback on the above proposal? I'd really like to split the users ASAP given that dbctl is being deployed.

Thu, Aug 1, 8:31 AM · conftool, Patch-For-Review, Operations, services-tooling, discovery-system, Traffic

Wed, Jul 31

Volans added a comment to T229449: db2058: Broken storage.

Host downtimed on icinga until Friday ~15 UTC. I chatted with @Marostegui and the host is due for decommission, so no hurry; he'll take a look tomorrow.

Wed, Jul 31, 5:26 PM · DBA, Operations
Volans added a comment to T229449: db2058: Broken storage.

Unable to run hpssacli utility due to I/O error, I've depooled the host on dbctl and from db-codfw.php with the above patch (shortly). I'll look into logs after that.

Wed, Jul 31, 5:17 PM · DBA, Operations
Volans triaged T229449: db2058: Broken storage as Normal priority.
Wed, Jul 31, 5:12 PM · DBA, Operations
Volans added a comment to T229397: Puppet: get row/rack info from Netbox.

If we go with the one-file-per-host approach, then I'd say we can read the file before writing, so that we write/overwrite it only if it's not there or has the wrong info. This should limit the re-write operations, which in turn should reduce to a negligible amount the race conditions where puppet doesn't find the file at the exact moment it's reading it.
FYI row/rack don't change for most hosts during their lifetime, but in some cases we move hosts around, so it's a use case to take into account.
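
The read-before-write idea above can be sketched as follows. `write_if_changed` is a hypothetical helper, and the write-to-temp-then-rename step is an extra assumption of this sketch (not something the comment proposes) to shrink the window in which a reader could see a missing or partial file.

```python
import os
import tempfile


def write_if_changed(path, content):
    """Write content to path only if the file is missing or differs. Return True if written."""
    try:
        with open(path, encoding="utf-8") as existing:
            if existing.read() == content:
                return False  # Already up to date: skip the write entirely.
    except FileNotFoundError:
        pass  # No file yet: fall through and create it.

    directory = os.path.dirname(path) or "."
    # Assumption of this sketch: write to a temporary file in the same directory,
    # then rename it over the target so readers never observe a half-written file.
    # Note that mkstemp creates the file with mode 0600; adjust with os.chmod if needed.
    fd, tmp_path = tempfile.mkstemp(dir=directory)
    try:
        with os.fdopen(fd, "w", encoding="utf-8") as tmp:
            tmp.write(content)
        os.replace(tmp_path, path)
    except BaseException:
        os.unlink(tmp_path)
        raise
    return True
```

For hosts whose row/rack never changes, every run after the first returns False without touching the file; a moved host triggers exactly one rewrite.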

Wed, Jul 31, 2:01 PM · Patch-For-Review, Puppet, Operations
Volans triaged T229397: Puppet: get row/rack info from Netbox as Normal priority.

It seems to me that the simplest option would be #1, it would also be the one that optimizes API calls to Netbox (just one per puppetmaster every X minutes) and has a natural caching mechanism.
In addition we could add an alert if the file is stale (too old).
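
A staleness check of the kind mentioned above could be as small as the sketch below. `is_stale` and its parameters are hypothetical, and it assumes the file's modification time is refreshed on every successful sync.

```python
import os
import time


def is_stale(path, max_age_seconds, now=None):
    """Return True if path is missing or was last modified more than max_age_seconds ago."""
    now = time.time() if now is None else now
    try:
        mtime = os.path.getmtime(path)
    except FileNotFoundError:
        return True  # A missing file is treated as stale.
    return (now - mtime) > max_age_seconds
```

If the sync skips writes when nothing changed, the modification time would have to be refreshed some other way (for example with os.utime) for this check to stay meaningful.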

Wed, Jul 31, 11:40 AM · Patch-For-Review, Puppet, Operations
Volans created T229397: Puppet: get row/rack info from Netbox.
Wed, Jul 31, 11:19 AM · Patch-For-Review, Puppet, Operations
Volans added a comment to T226331: Upgrade Netbox to 2.6.1.

What was the issue?

Wed, Jul 31, 8:20 AM · Patch-For-Review, netbox

Tue, Jul 30

Volans created P8826 wdqs updated on wdqs1009.
Tue, Jul 30, 1:11 PM

Jul 25 2019

Volans added a comment to T228122: DB reload for WDQS.

Errata corrige: I ran the above with 'A:wdqs-internal and not P{wdqs1003.eqiad.wmnet}' instead, to avoid restarting 1003 again, as it had already been manually tested

Jul 25 2019, 5:36 PM · Discovery-Wikidata-Query-Service-Sprint, Wikidata-Query-Service, Wikidata

Jul 24 2019

Volans committed rOSCTcb7f08880235: Release 1.1.3 (authored by Volans).
Jul 24 2019, 9:33 AM
Volans committed rOSCT10aaa2f00f11: setup.py: re-include tests in the distribution (authored by Volans).
Jul 24 2019, 9:24 AM

Jul 23 2019

Volans committed rOSCT22843026109e: Fix extras_require key for use in console_scripts (authored by Volans).
Jul 23 2019, 6:27 PM
Volans updated subscribers of T228606: Degraded RAID on elastic1046.
Jul 23 2019, 6:43 AM · ops-eqiad, Operations

Jul 22 2019

Volans lowered the priority of T97972: Figure out a security model for etcd from High to Normal.
Jul 22 2019, 10:39 AM · conftool, Patch-For-Review, Operations, services-tooling, discovery-system, Traffic
Volans reopened T97972: Figure out a security model for etcd, a subtask of T97978: Create a tool to sync static configuration from a repository to the consistent k/v store, as Open.
Jul 22 2019, 10:39 AM · Operations, services-tooling, discovery-system, Traffic
Volans reopened T97972: Figure out a security model for etcd as "Open".
In T97972#5352851, @Joe wrote:

IIRC we already have an account specialized for accessing only mwconfig, we could expand on the concept.

Jul 22 2019, 10:39 AM · conftool, Patch-For-Review, Operations, services-tooling, discovery-system, Traffic

Jul 19 2019

Volans committed rCUMINefb86bc64c39: dependency: replace colorama with custom module (authored by Volans).
Jul 19 2019, 10:38 PM
Volans committed rCUMIN75498de6e408: tests: temporarily limit max version of prospector (authored by Volans).
Jul 19 2019, 10:38 PM
Volans committed rCUMIN79d83010acf7: setup.py: limit max version of tqdm (authored by Volans).
Jul 19 2019, 10:38 PM
Volans committed rCUMIN988d3f3d9891: docstrings: fix newly reported pep257 violations (authored by Volans).
Jul 19 2019, 10:38 PM
Volans closed T217038: Cumin: replace colorama as Resolved.
Jul 19 2019, 10:19 PM · Patch-For-Review, SRE-tools
Volans closed T207037: Cumin: allow to query for Puppet primitive types as Resolved.

Resolving as this was merged already. The release process is out of scope for this task.

Jul 19 2019, 9:24 PM · Patch-For-Review, SRE-tools
Volans moved T217038: Cumin: replace colorama from Up next to In Progress on the SRE-tools board.
Jul 19 2019, 9:22 PM · Patch-For-Review, SRE-tools

Jul 18 2019

Volans renamed SRE-tools from Operations-Software-Development to SRE-tools.
Jul 18 2019, 9:50 AM
Volans added a comment to T228388: Configuration management for network operations.

[stretch] Evaluate Netbox to store network secrets

Jul 18 2019, 9:46 AM · Patch-For-Review, Operations, Goal, netops, SRE-tools
Volans triaged T228388: Configuration management for network operations as Normal priority.
Jul 18 2019, 9:34 AM · Patch-For-Review, Operations, Goal, netops, SRE-tools
Volans triaged T228387: Bare metal cloud: management interfaces as Normal priority.
Jul 18 2019, 9:31 AM · User-crusnov, Goal, SRE-tools
Volans triaged T220395: TEC6: Database Automation as Normal priority.
Jul 18 2019, 8:39 AM · Goal, Operations
Volans closed T228288: debmonitor send status update before the package actually got upgraded as Resolved.

Ack, thanks.

Jul 18 2019, 8:01 AM · SRE-tools

Jul 17 2019

Volans triaged T228288: debmonitor send status update before the package actually got upgraded as Normal priority.

@hashar debmonitor uses Dpkg::Pre-Install-Pkgs for this feature because it's the only available hook from APT/DPKG that gives us the details of the operation that is made.
Any other option seemed very suboptimal. Things like save some temporary data with all the concurrency and stale risks involved + run something in Post-Invoke or send the full list each time in Post-Invoke or having a 2-way commit on the server side.

Jul 17 2019, 6:40 PM · SRE-tools

Jul 16 2019

Volans added a comment to T224794: Degraded RAID on helium.

@wiki_willy sorry but cannot help as I've no special knowledge of this host or backup1001.

Jul 16 2019, 10:42 PM · ops-eqiad, Operations
Volans added a comment to T224260: restbase-dev1006 has a broken disk.

@Cmjohnson I've nothing to do with this host, I just commented because I was unable to remove a package as part of the conftool upgrade.

Jul 16 2019, 10:39 PM · Core Platform Team (Needs Cleaning - Cassandra Operational), Cassandra, RESTBase, Services (watching), Operations
Volans added a comment to T220246: Management of Cassandra schema and keyspace/table configuration.

I had a chat on IRC with @Eevans, here are some additional proposals that came up:

Jul 16 2019, 4:42 PM · Patch-For-Review, User-WDoran, Core Platform Team Workboards (Clinic Duty Team), serviceops-radar

Jul 15 2019

Volans added a comment to T220246: Management of Cassandra schema and keyspace/table configuration.

First of all, I have some questions due to my lack of knowledge of Cassandra specifics:

  • Is there any configuration that could require a code change in the application? Can those be changed at will without any coordination with the application?
  • Is there an easy way to connect to the cluster apart from localhost?
  • If this configuration is not managed by Puppet, how will it be managed? Would this system have a way to automatically apply those, or would we prefer to still do it manually for safety reasons?
Jul 15 2019, 2:24 PM · Patch-For-Review, User-WDoran, Core Platform Team Workboards (Clinic Duty Team), serviceops-radar

Jul 10 2019

jijiki awarded T227636: wikimediafoundation.org: ltr/rtl not properly handled in yellow box a Pterodactyl token.
Jul 10 2019, 9:27 AM · I18n, RTL, wikimediafoundation.org
Volans created T227636: wikimediafoundation.org: ltr/rtl not properly handled in yellow box.
Jul 10 2019, 7:20 AM · I18n, RTL, wikimediafoundation.org

Jul 9 2019

Volans added a comment to T225140: Icinga alerts that should open tasks instead of alerting.

@ayounsi I'm not sure that the last two added in the latest update should not alarm. What are the criteria used? According to the proposed document for incident response, only incidents of level 5 should open tasks instead of alerting IMHO.

Jul 9 2019, 6:32 AM · observability

Jul 5 2019

Volans renamed T184634: Error in postgres puppettization for new installation (was Netbox: postgres cannot be restarted w/ current config) from Netbox: postgres cannot be restarted w/ current config to Error in postgres puppettization for new installation (was Netbox: postgres cannot be restarted w/ current config).
Jul 5 2019, 3:04 PM · netbox, Operations
Volans added a comment to T184634: Error in postgres puppettization for new installation (was Netbox: postgres cannot be restarted w/ current config).

Most things were indeed fixed; I'm not sure about the status of the last 2 in the description's checkbox list, but they should no longer affect reboots/restarts, only new installations at most. I'll rename it.

Jul 5 2019, 3:03 PM · netbox, Operations
Volans added a comment to T197126: Create tool to handle the state of database configuration in MediaWiki in etcd.

All patches for v1 of dbconfig are merged, including the ones to make a new conftool release. I'll rebuild the packages on Monday and upload them to our APT and I'll start testing the upgrade of python3-conftool (the base package, pretty much untouched).
The testing of python3-conftool-dbctl will follow after.

Jul 5 2019, 6:46 AM · Performance-Team (Radar), User-ArielGlenn, Patch-For-Review, User-Joe, MediaWiki-Configuration, Operations, DBA
Volans closed T197531: Data model for dbconfig, a subtask of T197126: Create tool to handle the state of database configuration in MediaWiki in etcd, as Resolved.
Jul 5 2019, 6:36 AM · Performance-Team (Radar), User-ArielGlenn, Patch-For-Review, User-Joe, MediaWiki-Configuration, Operations, DBA
Volans closed T197531: Data model for dbconfig as Resolved.

The data model is now part of the software and will evolve with it, wikitech documentation will be provided for it. I'm resolving this.

Jul 5 2019, 6:36 AM · Patch-For-Review, MediaWiki-Configuration, Operations, DBA

Jul 4 2019

Volans lowered the priority of T227298: elastic2054 unresponsive from High to Normal.

@Gehel I'll leave the task open if you want to investigate more tomorrow for potential hardware parts to replace. (see above for hardware logs).

Jul 4 2019, 9:43 PM · Discovery, Operations
Volans added a comment to T227298: elastic2054 unresponsive.

Both clusters back to green:

 elastic2054  0 ~$ curl -s localhost:9600/_cluster/health?pretty
{
  "cluster_name" : "production-search-psi-codfw",
  "status" : "green",
  "timed_out" : false,
  "number_of_nodes" : 15,
  "number_of_data_nodes" : 15,
  "active_primary_shards" : 1450,
  "active_shards" : 4349,
  "relocating_shards" : 0,
  "initializing_shards" : 0,
  "unassigned_shards" : 0,
  "delayed_unassigned_shards" : 0,
  "number_of_pending_tasks" : 0,
  "number_of_in_flight_fetch" : 0,
  "task_max_waiting_in_queue_millis" : 0,
  "active_shards_percent_as_number" : 100.0
}
 elastic2054  0 ~$ curl -s localhost:9200/_cluster/health?pretty
{
  "cluster_name" : "production-search-codfw",
  "status" : "green",
  "timed_out" : false,
  "number_of_nodes" : 30,
  "number_of_data_nodes" : 30,
  "active_primary_shards" : 1254,
  "active_shards" : 3741,
  "relocating_shards" : 2,
  "initializing_shards" : 0,
  "unassigned_shards" : 0,
  "delayed_unassigned_shards" : 0,
  "number_of_pending_tasks" : 0,
  "number_of_in_flight_fetch" : 0,
  "task_max_waiting_in_queue_millis" : 0,
  "active_shards_percent_as_number" : 100.0
}
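
For scripted checks, health output like the above can be evaluated with a small helper. This is only a sketch: `cluster_is_healthy` is a hypothetical name, and treating "green, not timed out, no unassigned shards" as healthy is an assumption based on the fields shown in the output.

```python
import json


def cluster_is_healthy(health_json):
    """Return True if a _cluster/health JSON document reports a fully green cluster."""
    health = json.loads(health_json)
    return (
        health.get("status") == "green"
        and health.get("timed_out") is False
        and health.get("unassigned_shards") == 0
    )


# Trimmed-down sample using only fields that appear in the output above.
sample = '{"status": "green", "timed_out": false, "unassigned_shards": 0}'
print(cluster_is_healthy(sample))
```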
Jul 4 2019, 9:42 PM · Discovery, Operations
Volans added a comment to T227298: elastic2054 unresponsive.

Nothing in syslog.
It first detected a CPU error and then a memory one, here are the hardware logs:

-------------------------------------------------------------------------------
Record:      4
Date/Time:   07/04/2019 21:12:40
Source:      system
Severity:    Critical
Description: CPU 1 machine check error detected.
-------------------------------------------------------------------------------
Record:      5
Date/Time:   07/04/2019 21:12:40
Source:      system
Severity:    Ok
Description: An OEM diagnostic event occurred.
-------------------------------------------------------------------------------
Record:      6
Date/Time:   07/04/2019 21:12:41
Source:      system
Severity:    Ok
Description: An OEM diagnostic event occurred.
-------------------------------------------------------------------------------
Record:      7
Date/Time:   07/04/2019 21:12:41
Source:      system
Severity:    Ok
Description: An OEM diagnostic event occurred.
-------------------------------------------------------------------------------
Record:      8
Date/Time:   07/04/2019 21:12:41
Source:      system
Severity:    Ok
Description: An OEM diagnostic event occurred.
-------------------------------------------------------------------------------
Record:      9
Date/Time:   07/04/2019 21:15:21
Source:      system
Severity:    Ok
Description: A problem was detected related to the previous server boot.
-------------------------------------------------------------------------------
Record:      10
Date/Time:   07/04/2019 21:15:21
Source:      system
Severity:    Critical
Description: Multi-bit memory errors detected on a memory device at location(s) DIMM_B2.
-------------------------------------------------------------------------------
Jul 4 2019, 9:41 PM · Discovery, Operations
Volans triaged T227298: elastic2054 unresponsive as High priority.

The host is part of the main and psi clusters:

$ confctl --quiet select name="elastic2054.codfw.wmnet" get
{"elastic2054.codfw.wmnet": {"weight": 10, "pooled": "yes"}, "tags": "dc=codfw,cluster=elasticsearch,service=elasticsearch"}
{"elastic2054.codfw.wmnet": {"weight": 10, "pooled": "yes"}, "tags": "dc=codfw,cluster=elasticsearch,service=elasticsearch-psi-ssl"}
{"elastic2054.codfw.wmnet": {"weight": 10, "pooled": "yes"}, "tags": "dc=codfw,cluster=elasticsearch,service=elasticsearch-ssl"}
Jul 4 2019, 9:31 PM · Discovery, Operations
Volans created T227298: elastic2054 unresponsive.
Jul 4 2019, 9:26 PM · Discovery, Operations
Volans closed T226965: conftool: upgrade fleet to use existing python3-conftool as Resolved.
Jul 4 2019, 2:01 PM · serviceops, Operations
Volans closed T226965: conftool: upgrade fleet to use existing python3-conftool, a subtask of T220395: TEC6: Database Automation, as Resolved.
Jul 4 2019, 2:01 PM · Goal, Operations
Volans added a comment to T224260: restbase-dev1006 has a broken disk.

I assume that this host will be reimaged, but in case it's not, please manually run:

apt-get remove python-conftool

once fixed.

Jul 4 2019, 2:01 PM · Core Platform Team (Needs Cleaning - Cassandra Operational), Cassandra, RESTBase, Services (watching), Operations
Volans added a comment to T226965: conftool: upgrade fleet to use existing python3-conftool.

What we need to do is:

  • Upgrade python3-etcd to the latest version
  • Upgrade python3-conftool to the latest version
  • Remove python-conftool if present
Jul 4 2019, 1:59 PM · serviceops, Operations
Volans added a comment to T226965: conftool: upgrade fleet to use existing python3-conftool.

The blocker above has been fixed in scap, released and rolled out to the fleet. I'll proceed with the removal of python-conftool.

Jul 4 2019, 1:28 PM · serviceops, Operations
Volans closed T227225: release a scap that contains I85a2161 (Remove functionality to talk to conftool) as Resolved.
Jul 4 2019, 1:26 PM · Release-Engineering-Team (Deployment services), Release-Engineering-Team-TODO (201907), Scap
Volans added a comment to T227225: release a scap that contains I85a2161 (Remove functionality to talk to conftool).

Scap has been rolled out to the fleet, I've tested a scap pull on a mwdebug host and there was a scap sync-file deployment for mediawiki-config (DB stuff).

Jul 4 2019, 1:26 PM · Release-Engineering-Team (Deployment services), Release-Engineering-Team-TODO (201907), Scap
Volans triaged T227272: keyholder: continue to arm keys if one fails as Normal priority.
Jul 4 2019, 12:02 PM · Operations
Volans created T227272: keyholder: continue to arm keys if one fails.
Jul 4 2019, 12:00 PM · Operations

Jul 2 2019

Volans added a comment to T203963: Convert makevm to spicerack cookbook.

@akosiaris the "plan" was partially explained as part of the bare metal/host provisioning breakout session at the SRE Summit. You can find more details in the notes of the summit but basically the TL;DR is that as part of the effort to automate host provisioning we're aiming to have a system in which we don't need to hardcode MAC addresses anymore.
The details of the plan are evolving with the plan itself but the gist is that it will involve DHCP option 82 (or IPv6 autoconf alternatively) and iPXE (or equivalent) to dynamically map a physical host to data available in Netbox and from there drive the whole installation process with the required parameters.
Ping me offline if you want more details.

Jul 2 2019, 10:25 AM · serviceops-radar, Patch-For-Review, User-crusnov, SRE-tools, User-jijiki, User-Joe, Operations
Volans added a comment to T203963: Convert makevm to spicerack cookbook.

Thanks a lot @elukey. I've just added a small detail and formatted hosts and paths.
For the status of the task, the port of makevm script is done, the remaining additional parts are still pending.

Jul 2 2019, 9:44 AM · serviceops-radar, Patch-For-Review, User-crusnov, SRE-tools, User-jijiki, User-Joe, Operations

Jul 1 2019

Volans added a comment to T203963: Convert makevm to spicerack cookbook.

@elukey: yes, because of the temporary suppression of cumin's default output (to allow each cookbook to decide what to do with it), this specific one is printing its output at the end. This will soon-ish be more flexible on the cumin side and should allow spicerack to expose it in a way that each cookbook can choose what to do with it. Sorry for the non-optimal experience for now.

Jul 1 2019, 8:05 PM · serviceops-radar, Patch-For-Review, User-crusnov, SRE-tools, User-jijiki, User-Joe, Operations
Volans added a comment to T203963: Convert makevm to spicerack cookbook.

Not yet as the script has clearly not been tested:

$ sudo cookbook sre.ganeti.makevm -h
Exception raised while parsing arguments for cookbook sre.ganeti.makevm:
Traceback (most recent call last):
  File "/usr/lib/python3/dist-packages/spicerack/cookbook.py", line 460, in _parse_args
    args = self.module.argument_parser().parse_args(self.args)
  File "/srv/deployment/spicerack/cookbooks/sre/ganeti/makevm.py", line 37, in argument_parser
    clusters_and_rows = [cluster + '_' + row for cluster, row in CLUSTERS_AND_ROWS.items()]
  File "/srv/deployment/spicerack/cookbooks/sre/ganeti/makevm.py", line 37, in <listcomp>
    clusters_and_rows = [cluster + '_' + row for cluster, row in CLUSTERS_AND_ROWS.items()]
TypeError: Can't convert 'tuple' object to str implicitly
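
The TypeError says that `row` is a tuple, so each cluster presumably maps to several rows. A sketch of one possible fix, under that assumption, is a nested comprehension; the contents of CLUSTERS_AND_ROWS below are made up for illustration and are not the cookbook's real values.

```python
# Hypothetical data: each cluster maps to a tuple of rows, as the TypeError suggests.
CLUSTERS_AND_ROWS = {
    "eqiad": ("A", "B"),
    "codfw": ("A",),
}

# Iterate over every row of every cluster instead of concatenating the tuple itself.
clusters_and_rows = [
    cluster + "_" + row
    for cluster, rows in CLUSTERS_AND_ROWS.items()
    for row in rows
]

print(sorted(clusters_and_rows))
```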
Jul 1 2019, 1:15 PM · serviceops-radar, Patch-For-Review, User-crusnov, SRE-tools, User-jijiki, User-Joe, Operations
Volans lowered the priority of T226908: ops-monitoring-bot creating dupes from High to Normal.
Jul 1 2019, 10:25 AM · SRE-tools, Icinga, observability, Operations
Volans added a project to T226908: ops-monitoring-bot creating dupes: SRE-tools.

Yes, it's confirmed that the Icinga check flaps between critical and unknown due to timeouts, and as a result the event handler created the dupes. See more specific info in T224794#5295606; basically megacli takes a very long time to gather info from the broken disk.

Jul 1 2019, 10:25 AM · SRE-tools, Icinga, observability, Operations
Volans added a comment to T224794: Degraded RAID on helium.

The automatic gathering times out because megacli takes ~3 minutes to return the status of the disks: it blocks at PD7 (the broken one) and takes a very long time to get info from that disk.

Jul 1 2019, 10:18 AM · ops-eqiad, Operations
Volans created T226965: conftool: upgrade fleet to use existing python3-conftool.
Jul 1 2019, 9:05 AM · serviceops, Operations
Volans added a comment to T226952: Failover m2 master db1065 to db1132.

For debmonitor it connects to m2-master.eqiad.wmnet and I'm not sure if Django's connection pooling would be smart enough to reconnect given that the old one will still work, just RO. It might need a:

sudo cumin 'A:debmonitor' 'systemctl restart uwsgi-debmonitor.service'

just after the switch.
Alternatively if you plan to kill all existing connections to the old master that would do the trick already, because debmonitor will automatically reconnect and the proxy will send it to the new one.

Jul 1 2019, 8:31 AM · SRE-tools, OTRS, Recommendation-API, Operations, DBA

Jun 30 2019

Volans added a comment to T226908: ops-monitoring-bot creating dupes.

Sorry for the spam. My guess is that the check is flapping between critical and unknown. The script ignores the unknowns but it doesn't know if there is already a task opened (long story).
I can check tomorrow, I'm without a laptop at the moment. (It might also be related to the CPU governor task).
For now I've disabled the event handler on icinga for that check on that host, so it should not spam anymore. Let me know in case it generates any additional noise and I can try to have a deeper look tonight or silence it even more.

Jun 30 2019, 1:06 PM · SRE-tools, Icinga, observability, Operations

Jun 26 2019

Volans added a comment to T226599: (OoW) Degraded RAID on analytics1039.

@elukey you can disable icinga notifications on a cluster via hiera. Alternatively, to disable only this kind of check, you can disable the event handler from the Icinga UI. I'm not sure if we have any more fine-tuned way to skip these.

Jun 26 2019, 8:51 PM · ops-eqiad, Operations
Volans added a comment to T226599: (OoW) Degraded RAID on analytics1039.

I'll let them reply :) we have also an hpssacli version of kinda the same script fwiw.

Jun 26 2019, 11:23 AM · ops-eqiad, Operations
Volans added a comment to T226599: (OoW) Degraded RAID on analytics1039.

@jbond FYI if you want to mimic the automation, just run:

Jun 26 2019, 11:18 AM · ops-eqiad, Operations

Jun 25 2019

Volans added a comment to T210723: Address recurrent service check time out for "HP RAID" on swift backend hosts.

Great! Thanks a lot.

Jun 25 2019, 3:51 PM · Patch-For-Review, User-fgiunchedi, Operations, observability
Volans added a comment to T210723: Address recurrent service check time out for "HP RAID" on swift backend hosts.

@fgiunchedi the sort should be fixed, otherwise it will keep alerting on icinga because of the changing text in the message, an undesirable behaviour ;)

Jun 25 2019, 3:25 PM · Patch-For-Review, User-fgiunchedi, Operations, observability
Volans triaged T226470: Spicerack: improve Icinga module to support mgmt interfaces as Normal priority.
Jun 25 2019, 7:05 AM · SRE-tools

Jun 24 2019

Volans moved T200706: rack/setup/install centrallog1001.eqiad.wmnet from Up next to In Dev/Progress on the Wikimedia-Logstash board.
Jun 24 2019, 3:19 PM · observability, User-herron, Wikimedia-Logstash, User-fgiunchedi, Operations
Volans assigned T214183: Setup graphs for power usage readings in Grafana to fgiunchedi.
Jun 24 2019, 3:05 PM · DC-Ops, observability
Volans moved T218544: ms-be1043 sdk failed from In progress to Radar on the observability board.
Jun 24 2019, 3:04 PM · User-fgiunchedi, observability, SRE-tools, Operations, ops-eqiad