faidon (Faidon Liambotis)
Principal Operations Engineer

User Details

User Since
Oct 7 2014, 10:21 AM (154 w, 3 d)
Availability
Available
IRC Nick
paravoid
LDAP User
Faidon Liambotis
MediaWiki User
Faidon Liambotis (WMF)

Recent Activity

Today

faidon added a comment to T109903: add pdu redundancy checking to server/router/switch checks in icinga.

Check_ipmi_sensor is showing failures on 3 out of 4 of the Dell PowerEdge R620 class systems that UnitedLayer recently reported as having failed PSUs.

Fri, Sep 22, 7:37 AM · Patch-For-Review, Operations, monitoring

Yesterday

faidon added a comment to T176386: upload@ulsfo strange ethernet / power / switch issues, etc....

Confirmed from UnitedLayer email:

Assad Kermanshahi, Sep 20, 21:13 PDT
Thu, Sep 21, 9:09 AM · Patch-For-Review, Operations, Traffic
faidon added a comment to T176386: upload@ulsfo strange ethernet / power / switch issues, etc....

Recoveries of whatever the hell is happening in ulsfo:

04:26 <+icinga-wm> RECOVERY - Juniper alarms on asw-ulsfo is OK: JNX_ALARMS OK - 0 red alarms, 0 yellow alarms
04:28 <+icinga-wm> RECOVERY - Host cp4007 is UP: PING OK - Packet loss = 0%, RTA = 78.60 ms
04:30 <+icinga-wm> RECOVERY - Host ripe-atlas-ulsfo IPv6 is UP: PING OK - Packet loss = 0%, RTA = 78.69 ms
04:30 <+icinga-wm> RECOVERY - Host ripe-atlas-ulsfo is UP: PING OK - Packet loss = 0%, RTA = 78.67 ms
04:30 <+icinga-wm> RECOVERY - Host cp4007.mgmt is UP: PING OK - Packet loss = 0%, RTA = 79.19 ms
Thu, Sep 21, 9:01 AM · Patch-For-Review, Operations, Traffic

Wed, Sep 20

faidon added a comment to T174587: Blog post about the server switch project.

Thank you all (and especially @Whatamidoing-WMF for spearheading this) and sorry for not being very responsive!

Wed, Sep 20, 6:40 PM · Community-Liaisons (Oct-Dec 2017), Wikimedia-Blog-Content
faidon added a comment to T176314: Replace salt on integration and deployment-prep projects.

beta, CI and other WMCS VPS projects are not environments that either TechOps or WMCS operate and, as such, we hadn't incorporated them into our plans for the Salt deprecation (and that's also why it's not listed in our goals). To be honest, I wasn't even aware of this use of Salt, but even if I had known about it, I'm not sure how we could have reasonably done anything about it other than just give you a heads-up, given our unfamiliarity with this environment. Due to the dependency on Trebuchet, this was a quarterly goal that was planned and coordinated with Release-Engineering-Team, so I don't think this was a surprise to you regardless? I'm being a little defensive because I see that you made this a subtask of T164780, tagged this as Goal and Operations etc., so I guess you disagree and/or this may be all a surprise to you after all? If not, then feel free to ignore this whole paragraph :)

Wed, Sep 20, 2:56 PM · Continuous-Integration-Infrastructure, Beta-Cluster-Infrastructure, Technical-Debt, Operations-Software-Development

Tue, Sep 19

faidon added a comment to T176044: Replace kernel and reboot labvirt1015, 1016, 1017, 1018.

Was network connectivity lost to the server at large or to the VMs running on that labvirt instance?

It was the host itself. For the most part these systems aren't hosting any VMs.

root@labtestvirt2002:~# ifconfig up eth0
eth0: Host name lookup failure
ifconfig: `--help' gives usage information.
Tue, Sep 19, 2:49 PM · Patch-For-Review, cloud-services-team (Kanban)

Mon, Sep 18

Krinkle awarded T150256: Re-setup lvs1007-lvs1012, replace lvs1001-lvs1006 an Orange Medal token.
Mon, Sep 18, 11:42 PM · Patch-For-Review, Traffic, netops, Operations
faidon added a comment to T176044: Replace kernel and reboot labvirt1015, 1016, 1017, 1018.

@chasemp mentioned this odd issue at the meeting today. If there are no (useful?) logs, are there perhaps any hosts that exhibit the non-working behavior or can be easily triggered to? Let me know (here or on IRC) if you reboot and get the broken behavior, and I can attempt to debug or gather more information from the live (broken) system.

Mon, Sep 18, 10:54 PM · Patch-For-Review, cloud-services-team (Kanban)
faidon added a comment to T176175: connect second ethernet interface for fundraising codfw hosts.

Again, I fear that this gives a false sense of redundancy

This does not concern me--I think we're all pretty clear on what this accomplishes. It's a hedge against switch or port failure and we're under no illusion that it will fix any other SPOF.

  • plus, LAGs (and especially multi-chassis, or virtual-chassis ones) are not without their own risks.

True. FWIW we're only doing active-backup, which is simple to enable/disable and doesn't require network-side configuration beyond enabling ports and putting them in the right vlan. If there is a problem, it is pretty straightforward to remedy from either side: we can disable switch ports and/or adjust host OS config.

Mon, Sep 18, 10:33 PM · ops-codfw, netops, fundraising-tech-ops, Operations
faidon added a comment to T176175: connect second ethernet interface for fundraising codfw hosts.

All of them? Wasn't the plan to only do it for the few hosts that are important SPOFs? Again, I fear that this gives a false sense of redundancy -- plus, LAGs (and especially multi-chassis, or virtual-chassis ones) are not without their own risks.

Mon, Sep 18, 9:19 PM · ops-codfw, netops, fundraising-tech-ops, Operations
faidon moved T175636: prometheus -> grafana stats for per-numa-node meminfo from Backlog to In progress on the monitoring board.
Mon, Sep 18, 3:28 PM · Patch-For-Review, monitoring, Operations, Traffic
faidon reassigned T171157: Monitor internal CA expirations from Dzahn to akosiaris.
Mon, Sep 18, 3:19 PM · monitoring, Operations
faidon moved T175980: Upgrade grafana to 4.5 from Backlog to Up next on the monitoring board.
Mon, Sep 18, 3:17 PM · monitoring, Operations
faidon moved T165348: Check long-running screen/tmux sessions from Up next to In progress on the monitoring board.
Mon, Sep 18, 2:34 PM · Patch-For-Review, monitoring, Operations
faidon updated subscribers of T176126: Update node-rdkafka version to 2.0.

I think @elukey and @Ottomata have some plans around the librdkafka version that needs to be deployed fleet-wide, since there's an implicit dependency on the Kafka TLS work. Is node-rdkafka's dependency on 0.9.5 specifically, or >= 0.9? Can we use 0.11.0?

Mon, Sep 18, 12:12 PM · Services (next), EventBus, Analytics, Trending-Service, Reading-Infrastructure-Team-Backlog, ChangeProp

Thu, Sep 14

faidon added a comment to T171704: Switch all hosts to the future parser.

I think as of today, with the latest compiler run (#7882) plus another hotfix (28111a9), all manifests are compatible with the future parser and we can (and should!) migrate all hosts to the future environment, plus CI and the compiler.

Thu, Sep 14, 2:58 PM · Patch-For-Review, User-Joe, Puppet, Operations
faidon added a comment to T165348: Check long-running screen/tmux sessions.

See the patch above. Based on the cumin results and feedback from the first few users, I suggest whitelisting the following in the first round:

  • package building hosts (copper)
  • mediawiki maintenance servers (terbium/wasat)
  • salt masters (neodymium)
  • puppet masters (frond and backend)
  • analytics_cluster::client (stat1004, notebook)
  • mariadb::core and all other mariadb::* roles (db*)
  • restbase-dev and restbase-test (but not restbase-prod)
  • labtest* (various wmcs::labtest and labtestn roles)
  • analytics_cluster::coordinator (analytics1003, data imports happen here)
  • analytics_cluster::druid::worker (druid* otto/joal replacing pivot)
Thu, Sep 14, 2:52 PM · Patch-For-Review, monitoring, Operations
faidon added a comment to T165348: Check long-running screen/tmux sessions.

@jcrespo, fully agreed that alerts should be actionable and I don't particularly disagree with your alert definitions. This task exists precisely because a long-running forgotten screen caused a real, user-facing outage (we discussed it at an ops meeting at the time).

Thu, Sep 14, 2:49 PM · Patch-For-Review, monitoring, Operations

Wed, Sep 13

faidon updated subscribers of T171704: Switch all hosts to the future parser.

I pushed and merged a bunch of changes under Gerrit's topic:future-parser today. I also switched a couple of other patchsets to that topic as well, for referencing them easily. For the record, @ema used topic:varnish-future-parser for the Varnish work, but all this has been merged.

Wed, Sep 13, 1:41 AM · Patch-For-Review, User-Joe, Puppet, Operations

Fri, Sep 8

faidon created T175362: Split MXes into inbound and outbound.
Fri, Sep 8, 1:05 PM · Operations, Mail
faidon moved T175361: Upgrade mx1001/mx2001 to stretch from Backlog to Up Next on the Mail board.
Fri, Sep 8, 1:00 PM · Operations, Mail
faidon created T175361: Upgrade mx1001/mx2001 to stretch.
Fri, Sep 8, 1:00 PM · Operations, Mail

Wed, Sep 6

faidon assigned T169600: Enable diamond PowerDNSRecursor collector on dnsrecursors to akosiaris.
Wed, Sep 6, 3:16 PM · Patch-For-Review, Diamond, monitoring, Traffic, Prometheus-metrics-monitoring, Operations
faidon moved T151632: Fix Icinga checks for test/decom servers from Up next to In progress on the monitoring board.
Wed, Sep 6, 3:12 PM · Patch-For-Review, monitoring, Operations
faidon moved T171823: Grafana dashboards for librenms graphite data from Up next to In progress on the monitoring board.
Wed, Sep 6, 3:11 PM · User-fgiunchedi, netops, monitoring, Operations
faidon moved T158837: Consolidate performance website and related software from In progress to Externally blocked on the monitoring board.
Wed, Sep 6, 3:09 PM · monitoring, Performance-Team, Operations
faidon moved T169860: Investigate/setup prometheus blackbox_exporter from In progress to Externally blocked on the monitoring board.
Wed, Sep 6, 3:06 PM · monitoring, User-fgiunchedi, Patch-For-Review, Prometheus-metrics-monitoring
faidon moved T170150: Evaluate Grafana's LDAP group options and deprecate grafana-admin if possible from In progress to Up next on the monitoring board.
Wed, Sep 6, 3:05 PM · monitoring, Operations
faidon moved T170353: Icinga: timeseries checks should have the link to a graph with the data from In progress to Up next on the monitoring board.
Wed, Sep 6, 3:02 PM · Operations, monitoring
faidon added a comment to T173050: Investigate icinga (einsteinium) load.

I agree -- this doesn't look heavily loaded. That said, investigating whether our check intervals make sense (in any direction) is still worthwhile. @herron, is your investigation (that resulted in those three subtasks above) done?

Wed, Sep 6, 1:23 PM · monitoring
faidon added a comment to T158837: Consolidate performance website and related software.

Am I right to understand that the current plan is 2 VMs? If so, yeah, that sounds absolutely fine :)

Wed, Sep 6, 1:21 PM · monitoring, Performance-Team, Operations
faidon moved T169600: Enable diamond PowerDNSRecursor collector on dnsrecursors from Backlog to Up next on the monitoring board.
Wed, Sep 6, 1:15 PM · Patch-For-Review, Diamond, monitoring, Traffic, Prometheus-metrics-monitoring, Operations
faidon moved T109903: add pdu redundancy checking to server/router/switch checks in icinga from Backlog to Up next on the monitoring board.
Wed, Sep 6, 1:14 PM · Patch-For-Review, Operations, monitoring
faidon moved T158837: Consolidate performance website and related software from Backlog to In progress on the monitoring board.
Wed, Sep 6, 1:11 PM · monitoring, Performance-Team, Operations
faidon added a comment to T171710: pybal: add prometheus metrics.

I know a bunch of work happened during the Wikimania hackathon, but what's the status of this?

Wed, Sep 6, 1:10 PM · Patch-For-Review, monitoring, Pybal, Operations, Traffic

Mon, Sep 4

faidon added a comment to T174637: Setup esams atlas anchor.

@mark assigned asset tag WMF4203 to this device. The image has also been generated (for AS43821) and can be found on install1002.

Mon, Sep 4, 12:16 PM · Operations, netops, ops-esams

Wed, Aug 30

faidon added a comment to T41785: Create a labs SMTP smarthost.

Also see T47827, T47828, T47829 and T61142. This task is supposed to be for the smarthost which sounds like a good first step. I'd recommend keeping separate instances for inbound and outbound email for configuration simplicity too (something that we really should do for production as well).

Wed, Aug 30, 7:08 PM · Operations, Cloud-Services, Mail
faidon renamed T174577: Request for python package csvsort on stat1005.eqiad.wmnet from Request for python package csvsort on stat1005.equiad.wmnet to Request for python package csvsort on stat1005.eqiad.wmnet.
Wed, Aug 30, 4:09 PM · Analytics
faidon added a comment to T41785: Create a labs SMTP smarthost.

Indeed! Note that ToolForge already has something like that for tool authors that does LDAP calls etc. if I recall correctly, so perhaps these two efforts could complement each other or even be coalesced. Let's split into separate relays first, then all kinds of possibilities exist on how to route WMCS emails :)

Wed, Aug 30, 4:08 PM · Operations, Cloud-Services, Mail
faidon edited projects for T41785: Create a labs SMTP smarthost, added: Cloud-Services, Operations; removed Cloud-VPS.
Wed, Aug 30, 2:40 PM · Operations, Cloud-Services, Mail
faidon assigned T41785: Create a labs SMTP smarthost to herron.

This is a WMCS task, but since this use case is currently supported by the production mailservers and that's a long-standing problem (and risk) for us, perhaps it's still worth it for prod ops to spend the time for setting it up. @herron, is that something that you could help with?

Wed, Aug 30, 2:40 PM · Operations, Cloud-Services, Mail
faidon added a comment to T171965: [Spike - 8 hours] How should the PDF post-processing script be exposed for use by Extension:Collection.

Honestly... I'm not exactly sure what you're proposing :) Is there a design document or something that describes the architecture of the system you're thinking of implementing?

Wed, Aug 30, 11:51 AM · Proton, Readers-Web-Kanban-Board, Electron-PDFs, Readers-Web-Backlog (Tracking), Spike
faidon added a comment to T174081: mail.wikimedia.org SSL cert expiring Mon 23 Oct 2017.

For the history side of it :), mx1002/mx2002 never existed; it was just me hoping to get around to building additional MXes (and possibly splitting roles, e.g. inbound and outbound), and since adding SANs later costs money, I just added them there to be on the safe side. As for mail.wikimedia.org... this was just a made-up subject to avoid picking one out of four hostnames/SANs as subject.

Wed, Aug 30, 11:26 AM · Patch-For-Review, Mail, Operations
faidon added a comment to T168584: Labsdb* servers need to be rebooted.

JFTR, since I didn't see it mentioned either here or in T142807, how imminent is that decomm? Days/weeks/months?

Wed, Aug 30, 11:19 AM · Scoring-platform-team (Current), DBA, cloud-services-team, Operations

Tue, Aug 29

faidon moved T87790: decom amslvs1-4 (dc work) from Backlog to Decommission on the ops-esams board.
Tue, Aug 29, 3:07 PM · DC-Ops, Operations, ops-esams
faidon moved T94215: decommission cp3001 & cp3002 from Backlog to Decommission on the ops-esams board.
Tue, Aug 29, 3:05 PM · DC-Ops, Operations, Patch-For-Review, ops-esams
faidon moved T95742: Decomission amssq31-62 (32 hosts) from Backlog to Decommission on the ops-esams board.
Tue, Aug 29, 3:05 PM · hardware-requests, DC-Ops, Operations, ops-esams
faidon moved T130883: decom cp3011-22 (12 machines) from Backlog to Decommission on the ops-esams board.
Tue, Aug 29, 3:05 PM · ops-esams, hardware-requests, Operations
faidon moved T159480: Decommission bast3001 from Backlog to Decommission on the ops-esams board.
Tue, Aug 29, 3:05 PM · ops-esams, hardware-requests, Operations
faidon moved T167376: Decommission cp300[3456] from Backlog to Decommission on the ops-esams board.
Tue, Aug 29, 3:05 PM · hardware-requests, Operations, ops-esams
faidon moved T169518: Decommission esams ms-fe / ms-be from Backlog to Decommission on the ops-esams board.
Tue, Aug 29, 3:05 PM · User-fgiunchedi, Patch-For-Review, ops-esams, Operations
faidon added a comment to T93579: Restructure so that citoid can be run without Zotero.

Is there any progress on this not captured here? I saw that on the recent 5.0 announcement someone asked about the timeline of Electron support, only to get a response that there isn't one.

Tue, Aug 29, 12:29 PM · Services (watching), VisualEditor, Parsing-Team, Technical-Debt, Citoid
faidon closed T146391: eeden ethernet outage as Resolved.
Tue, Aug 29, 11:29 AM · Operations, ops-esams, netops, DNS, Traffic

Mon, Aug 28

faidon added a project to T173427: Review check_puppetrun frequency: Operations.
Mon, Aug 28, 4:31 PM · Operations, monitoring

Fri, Aug 25

faidon added a comment to T173315: Review check_ping settings.

1 ping is going to be too error-prone though :/ A single packet may be dropped for whatever reason on either side or in transport. Especially when talking about cross-DC checks, this isn't too uncommon. We monitor levels of packet loss with smokeping, but we wouldn't want to see a large number of random hosts alerting when such events happen.
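
To make the concern about single-packet checks concrete, here is a minimal sketch of how the chance of a spurious failure shrinks as a check sends more packets. It assumes packet loss is independent between packets and that a check only fails when every packet is lost; real check_ping thresholds are expressed as loss percentages and RTA, so this is a simplification. The 1% loss rate and the function name are illustrative assumptions, not measured or existing values.

```python
def false_alert_probability(loss_rate: float, packets: int) -> float:
    """Chance that every packet in one check is lost.

    Assumes independent loss per packet and that the check fails
    only on 100% loss (a simplification of real ping thresholds).
    """
    if not 0.0 <= loss_rate <= 1.0:
        raise ValueError("loss_rate must be between 0 and 1")
    if packets < 1:
        raise ValueError("packets must be at least 1")
    return loss_rate ** packets


# Illustrative loss rate only; not taken from any real measurement.
for count in (1, 3, 5):
    print(count, false_alert_probability(0.01, count))
```

Under these assumptions a one-packet check fails as often as the link drops packets, while each extra packet multiplies that chance by the loss rate again.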

Fri, Aug 25, 2:13 PM · Operations, monitoring
faidon added a comment to T173427: Review check_puppetrun frequency.

Indeed, 1 minute may be a bit excessive. I'm also not sure of the point of doing 3 checks spaced by 1 minute before alerting either -- that feels useless unless I'm missing something.

Fri, Aug 25, 2:09 PM · Operations, monitoring
faidon added a project to T173315: Review check_ping settings: Operations.
Fri, Aug 25, 2:04 PM · Operations, monitoring
faidon added a project to T173311: Review check_raid_hpssacli frequency: Operations.
Fri, Aug 25, 2:03 PM · Operations, monitoring
faidon updated subscribers of T173311: Review check_raid_hpssacli frequency.

It really depends on the server. For some of them (e.g. databases, and especially masters cc @jcrespo @Marostegui) it's probably best to know as soon as possible, in order to depool, fix or take some other action. Furthermore, 4h-8h could cost us a day, if say, it's at the beginning or middle of Chris/Papaul's work day.

Fri, Aug 25, 2:03 PM · Operations, monitoring
faidon moved T172131: Investigate check_nrpe -u option to reduce critical alerts from Backlog to Up next on the monitoring board.

Sounds good to me, feel free to go ahead :)

Fri, Aug 25, 1:56 PM · Operations, monitoring
faidon added a project to T172131: Investigate check_nrpe -u option to reduce critical alerts: Operations.
Fri, Aug 25, 1:56 PM · Operations, monitoring

Thu, Aug 24

faidon added a comment to T171275: ms-be2024 not powering on.

The reported (by dmidecode etc.) serial number for the system changed from MXQ62300TQ to HZ6BNV8315. I changed Racktables to reflect that. I'm not sure what our policy is supposed to be -- I know that some BIOSes allow you to override the reported serial number, but I don't recall us ever doing that. Maybe @RobH knows more?

Thu, Aug 24, 2:02 PM · User-fgiunchedi, Operations, ops-codfw

Aug 23 2017

faidon awarded T169498: Investigate load spikes on the elasticsearch cluster in eqiad a Love token.
Aug 23 2017, 1:39 AM · Patch-For-Review, Discovery-Search (Current work), Discovery
faidon added a comment to T169498: Investigate load spikes on the elasticsearch cluster in eqiad.

@EBernhardson this is all incredibly impressive, kudos!

Aug 23 2017, 1:39 AM · Patch-For-Review, Discovery-Search (Current work), Discovery
faidon awarded T173370: Support restricted execution of external commands a Like token.
Aug 23 2017, 1:30 AM · Security-Team, MediaWiki-General-or-Unknown, MediaWiki-Platform-Team

Aug 22 2017

faidon added a comment to T171167: Evaluate LibreNMS' Graphite backend.

I was looking for PDU power usage metrics. Since we don't have a Grafana dashboard yet, I tried to query Graphite manually with e.g. this query: librenms.ps*eqiad*.sensor.sensor.current.*.*.sensor. (actually, what we really need is the sum() of that, but it's less obvious to see what's happening in that one).
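
As a rough sketch of the sum() variant mentioned above, the snippet below wraps a metric pattern in Graphite's sumSeries() function (sum() is an alias for it) and builds a render API URL. The base URL and the metric pattern in the usage lines are placeholders, and the helper name is hypothetical; none of them come from the task.

```python
from urllib.parse import parse_qs, urlencode, urlsplit


def graphite_sum_url(base_url: str, pattern: str, window: str = "-1h") -> str:
    """Build a Graphite render URL that sums every series matching pattern."""
    target = f"sumSeries({pattern})"
    query = urlencode({"target": target, "from": window, "format": "json"})
    return f"{base_url.rstrip('/')}/render?{query}"


# Placeholder host and pattern, for illustration only.
url = graphite_sum_url("https://graphite.example.org", "example.pdu.*.current")
print(url)
# Decode the query string again to show the target Graphite would receive.
print(parse_qs(urlsplit(url).query)["target"][0])
```

Summing collapses all matching sensors into one series, which is handy for a total but hides which individual series are missing or odd, as the comment notes.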

Aug 22 2017, 10:40 AM · Patch-For-Review, User-fgiunchedi, netops, monitoring, Operations

Aug 21 2017

faidon added a comment to T172815: Improve stability and maintainability of our browser-based PDF render service.

I don't really mind who owns the service (Services or Readers), as long as it's owned by someone :)

Aug 21 2017, 6:08 PM · Electron-PDFs, OfflineContentGenerator, Operations, Services (designing)
faidon moved T86556: monitor SSD wear levels from Backlog to In progress on the monitoring board.
Aug 21 2017, 3:20 PM · User-fgiunchedi, Operations-Software-Development, Operations, monitoring
faidon reassigned T86556: monitor SSD wear levels from Volans to fgiunchedi.
Aug 21 2017, 3:20 PM · User-fgiunchedi, Operations-Software-Development, Operations, monitoring
faidon moved T169860: Investigate/setup prometheus blackbox_exporter from Backlog to In progress on the monitoring board.
Aug 21 2017, 3:18 PM · monitoring, User-fgiunchedi, Patch-For-Review, Prometheus-metrics-monitoring
faidon moved T173311: Review check_raid_hpssacli frequency from Backlog to Up next on the monitoring board.
Aug 21 2017, 3:17 PM · Operations, monitoring
faidon moved T173315: Review check_ping settings from Backlog to Up next on the monitoring board.
Aug 21 2017, 3:17 PM · Operations, monitoring
faidon moved T173427: Review check_puppetrun frequency from Backlog to Up next on the monitoring board.
Aug 21 2017, 3:17 PM · Operations, monitoring
faidon reassigned T165348: Check long-running screen/tmux sessions from MoritzMuehlenhoff to Dzahn.
Aug 21 2017, 3:14 PM · Patch-For-Review, monitoring, Operations
faidon changed the status of T171157: Monitor internal CA expirations from Open to Stalled.

Setting to stalled until we decide what to actually do with the internal CA, as we're considering dropping it entirely in favour of other options.

Aug 21 2017, 3:13 PM · monitoring, Operations
faidon moved T136312: Encrypt syslog traffic from Up next to In progress on the monitoring board.
Aug 21 2017, 3:08 PM · Patch-For-Review, monitoring, User-fgiunchedi, Operations
faidon moved T171167: Evaluate LibreNMS' Graphite backend from In progress to Externally blocked on the monitoring board.
Aug 21 2017, 3:08 PM · Patch-For-Review, User-fgiunchedi, netops, monitoring, Operations
faidon moved T171823: Grafana dashboards for librenms graphite data from Backlog to Up next on the monitoring board.
Aug 21 2017, 3:07 PM · User-fgiunchedi, netops, monitoring, Operations
faidon lowered the priority of T173698: Backfill librenms data in graphite with historical RRDs from Normal to Low.
Aug 21 2017, 3:07 PM · User-fgiunchedi, netops, monitoring, Operations
faidon closed T97635: Update diamond to latest upstream version as Resolved.

Fixed for our purposes, we can follow-up on upstream's/Debian's bug reports for the long-term fixes.

Aug 21 2017, 3:02 PM · User-fgiunchedi, Operations, monitoring
faidon added a comment to T172681: Analytics Kafka cluster causing timeouts to Varnishkafka since July 28th.
  1. Make sure to install/deploy pmacct 1.6.2 (follow up with Debian maintainers too? Stretch comes with librdkafka 0.11 and this combination is dangerous for 0.9 kafka clusters).
Aug 21 2017, 1:32 PM · Patch-For-Review, Operations, Analytics-Kanban, User-Elukey

Aug 1 2017

faidon added a comment to T171962: bonded/redundant network connections for fundraising hosts.

(T119654 is a restricted task, I have no access to it)

Aug 1 2017, 8:57 PM · netops, fundraising-tech-ops, Operations

Jul 27 2017

faidon added a comment to T138591: Backport iproute2 4.x from debian testing -> our jessie.

This has been open for a while :) Which new features that our kernels support do we need, and on which systems? Are these a priority now or can they wait until we upgrade that particular set of systems to stretch?

Jul 27 2017, 2:24 PM · Traffic, Operations
faidon added a comment to T171850: Backport ipvsadm.

stretch has 1.28, so perhaps it's just simpler to upgrade the LVS systems to stretch, which we'll need to do anyway at some point? We're already running the stretch kernel, and they don't have much of a userspace apart from Pybal and its dependencies.

Jul 27 2017, 2:22 PM · Pybal, Traffic, Operations
faidon raised the priority of T171580: Diamond log level set to DEBUG spams syslog from Normal to High.
Jul 27 2017, 1:22 PM · Patch-For-Review, User-fgiunchedi, Operations, monitoring
faidon closed T125205: Monitor hardware thermal issues as Resolved.

So I thought about it a little bit and think we can resolve this after all. I don't know of any cases where temperatures are an issue that the current IPMI check doesn't catch. Writing yet another thermal check is more work for dubious gains at this point -- and it also means that we'll be checking the same values twice, from two different places, and get unnecessarily spammed on failure.

Jul 27 2017, 12:35 AM · Operations, monitoring

Jul 26 2017

faidon added a project to T171714: "MySQL server has gone away" from librenms logs: monitoring.
Jul 26 2017, 10:25 AM · monitoring, Operations, netops
faidon added a project to T170932: prometheus-puppet-agent-stats cronspam on missing puppet stats: monitoring.
Jul 26 2017, 10:23 AM · Patch-For-Review, monitoring, User-fgiunchedi, Operations

Jul 25 2017

faidon assigned T168881: Rename mw2148 / mw2149 / mw2259 / mw2260 to thumbor200[1234] to Papaul.

@Papaul, this needs to be fixed in the server labels and Racktables.

Jul 25 2017, 11:07 PM · ops-codfw, User-fgiunchedi, Operations, Performance-Team, Thumbor
faidon closed T157853: Replace nrpe 2.15 (& evaluate alternatives) as Resolved.

Yeah, I thought about it some more and I concur. 2.15's "SSL" is a joke, but in our case it doesn't matter much as pretty much everything that we send over NRPE is public anyway.

Jul 25 2017, 12:36 PM · monitoring, Operations
faidon added a comment to T171498: Implement machine-local forwarding DNS caches.

I think this is a good idea overall and that we should be doing that. A few points:

  • I'm worried a little bit that this will hide issues like the ones you mentioned under the carpet. The cases where services are latency/failure-sensitive especially are issues we should be fixing. I'm worried that with a local recursor we'll just make them manifest even less often and in even more corner-cases :/
  • For the other case of services flooding our recursors, we should probably be gathering statistics from the local recursor and monitor them in a similar fashion as we do in the "central" recursors, right?
  • The glibc resolver issues with multiple recursors/timeouts are something we can't get around addressing, I think :( The local recursor can fail (and will regularly fail when e.g. restarting it), so the system needs to operate even without it...
  • I think designing our DNS data in a way where we never need to flush caches is a bit too optimistic, but I think the proposed solution of just using cumin for this use case sounds like a perfect fit. I wonder if we could get away with just flushing the whole cache altogether rather than flushing specific records and thus potentially put systemd-resolved back on the table?
Jul 25 2017, 12:59 AM · Traffic, Operations
faidon added a comment to T171538: Degraded RAID on labsdb1001.

hmm

# cat /proc/mdstat
 Personalities :
 unused devices: <none>
Jul 25 2017, 12:40 AM · Patch-For-Review, cloud-services-team (Kanban), Data-Services, ops-eqiad, Operations

Jul 24 2017

faidon moved T171018: decom netmon1001 from In progress to Externally blocked on the monitoring board.
Jul 24 2017, 3:40 PM · ops-eqiad, monitoring, Operations, hardware-requests
faidon closed T162327: certspotter on einsteinium has issues talking to external, a subtask of T132324: Tracking and Reducing cron-spam from root@, as Declined.
Jul 24 2017, 3:39 PM · Patch-For-Review, Operations
faidon closed T162327: certspotter on einsteinium has issues talking to external as Declined.

This is basically an artifact of the CT logs failing to respond every now and then, which certspotter complains about. It doesn't happen often.

Jul 24 2017, 3:39 PM · monitoring, Operations
faidon assigned T170353: Icinga: timeseries checks should have the link to a graph with the data to Volans.
Jul 24 2017, 3:26 PM · Operations, monitoring
faidon moved T170353: Icinga: timeseries checks should have the link to a graph with the data from Backlog to In progress on the monitoring board.
Jul 24 2017, 3:26 PM · Operations, monitoring
faidon assigned T97635: Update diamond to latest upstream version to fgiunchedi.
Jul 24 2017, 3:24 PM · User-fgiunchedi, Operations, monitoring
faidon claimed T157853: Replace nrpe 2.15 (& evaluate alternatives).
Jul 24 2017, 3:23 PM · monitoring, Operations
faidon moved T162327: certspotter on einsteinium has issues talking to external from Up next to In progress on the monitoring board.
Jul 24 2017, 3:23 PM · monitoring, Operations