faidon (Faidon Liambotis)
Principal Operations Engineer

User Details

User Since: Oct 7 2014, 10:21 AM (162 w, 3 d)
Availability: Available
IRC Nick: paravoid
LDAP User: Faidon Liambotis
MediaWiki User: Faidon Liambotis (WMF)

Recent Activity

Today

faidon added a comment to T150651: Information missing from racktables.

@faidon
db1018 and db1022 are confirmed in racktables but are both decommissioned and removed from the rack.

I updated the following asset tags with the purchase and warranty expiration dates.

WMF7043
WMF7068 - WMF7071
WMF7091, WMF7093 - WMF7097
WMF7113 - WMF7114
WMF7127 - WMF7134

I did not update these; @RobH, can you help with them?
WMF6980 - WMF6987 PDUs
WMF7135 - WMF7139 Network gear

Fri, Nov 17, 2:25 PM · Operations, DC-Ops
faidon assigned T150651: Information missing from racktables to RobH.

We've fixed so many issues over the past few months that I can't even count them :) Thanks all. I did another sweep today and found these that need fixing:

Fri, Nov 17, 4:50 AM · Operations, DC-Ops
Restricted Application assigned T179036: Request VM for webperf (metrics processing) to R3609901.

What's the status of this?

Fri, Nov 17, 4:21 AM · Patch-For-Review, Performance-Team (Radar), vm-requests, Operations
Restricted Application assigned T97701: Can't visit login page or https wiki pages in IE7/8 on Windows XP (on SauceLabs) to R3609901.

However, https://stats.wikimedia.org/wikimedia/squids/SquidReportClients.htm doesn't break down by underlying OS version, which is why IE8 is higher there (it's including e.g. IE8-on-Vista and others).

Fri, Nov 17, 4:18 AM · MediaWiki-General-or-Unknown

Tue, Nov 14

faidon added a comment to T180384: Turn off Trending Service.

Do we really need all this for an endpoint marked as "experimental"?

Tue, Nov 14, 12:01 PM · Operations, Services (designing), Reading-Infrastructure-Team-Backlog (Kanban), Trending-Service

Wed, Nov 8

faidon closed T156256: Allocate address space for Singapore (APNIC) as Resolved.

RPKI is all done as far as I know. @mark said he'll create his account later, if at all. I think we can resolve.

Wed, Nov 8, 4:02 PM · Patch-For-Review, Operations, Traffic
faidon closed T156256: Allocate address space for Singapore (APNIC), a subtask of T156027: Configuration for Asia Cache DC hosts, as Resolved.
Wed, Nov 8, 4:02 PM · Patch-For-Review, Traffic, Operations
faidon closed T156256: Allocate address space for Singapore (APNIC), a subtask of T162684: Network hardware configuration for Asia Cache DC, as Resolved.
Wed, Nov 8, 4:02 PM · Operations, Traffic

Tue, Nov 7

faidon added a comment to T170817: Upgrade Thumbor servers to Stretch.

I wouldn't recommend reviving MgOpen for basically the reasons I described in #819026. TL;DR is that it had serious unresolved issues to begin with (hinting, missing Euro sign) and has been abandoned upstream for years. Meanwhile, there are plenty of good and free (as in OFL) fonts nowadays with Greek glyphs, including DejaVu, Liberation, the Google fonts (Droid, Roboto, CrOS), the Adobe fonts (Source Sans/Serif).

Tue, Nov 7, 9:46 AM · Patch-For-Review, User-fgiunchedi, Performance-Team (Radar), Operations, Thumbor

Mon, Nov 6

faidon added a project to T179461: Use the term "developer account" for Wikimedia LDAP accounts: Operations.
Mon, Nov 6, 3:05 PM · Operations, Cloud-Services, Developer-Relations
faidon awarded T179461: Use the term "developer account" for Wikimedia LDAP accounts a Love token.
Mon, Nov 6, 3:04 PM · Operations, Cloud-Services, Developer-Relations

Thu, Oct 26

faidon added a comment to T179042: Setup eqsin RIPE Atlas anchor.

Image has been downloaded to the install* servers.

Thu, Oct 26, 5:24 PM · ops-eqiad, netops, Operations
faidon updated the task description for T179042: Setup eqsin RIPE Atlas anchor.
Thu, Oct 26, 5:23 PM · ops-eqiad, netops, Operations
faidon renamed T179042: Setup eqsin RIPE Atlas anchor from Setup eqsin atlas anchor to Setup eqsin RIPE Atlas anchor.
Thu, Oct 26, 5:22 PM · ops-eqiad, netops, Operations

Tue, Oct 24

faidon added a comment to T171508: Investigate and implement alternative for showmount based check at instance boot time.

Going one step further to the original assumptions:

there could be a temporary state in which /home isn't mounted yet, a user logs in, /home gets created, and then something whacky happens and the directory is overridden with the NFS mount

Tue, Oct 24, 1:51 PM · cloud-services-team (Kanban), Patch-For-Review, Cloud-Services
faidon added a comment to T171508: Investigate and implement alternative for showmount based check at instance boot time.

The pam_nologin behavior you're reporting sounds very odd indeed. If it's actually the case it will be CVE-worthy! It's an old, popular and well-audited piece of code though, so it'd be surprising to me if the root cause lies with pam_nologin and not somewhere in our configuration. It's not impossible of course, bugs and CVEs do happen :)

Tue, Oct 24, 1:48 PM · cloud-services-team (Kanban), Patch-For-Review, Cloud-Services

Mon, Oct 23

faidon added a comment to T177498: Provide a forward port of ICU 52 for stretch / Investigate best ICU update strategy.

I investigated the upgrade procedure for "provide icu57 in jessie and migrate before moving to stretch": This allows for a much less invasive transition (mostly because libxml2 in jessie doesn't link against ICU yet):
(snip)

That makes a lot of sense to me. Thanks for all the background work to support this :)

Mon, Oct 23, 7:51 PM · User-Elukey, HHVM, Operations
faidon added a comment to T173489: pmacct should be upgraded to 1.6.2 on Stretch.

pmacct 1.7.0-1 (with GeoIP2 support too!) was uploaded to sid yesterday. This should be as easy as a backport-and-install now.

Mon, Oct 23, 3:58 PM · Patch-For-Review, User-Elukey, Operations, netops, monitoring
faidon added a comment to T159137: certspotter: Error retrieving STH from log.

We get occasional rare failures depending on the availability of the CT log servers. I don't see a way around this unless we make our cronjobs quite a bit more sophisticated (e.g. ignore transient errors but complain when we get more than X number of errors for N hours).

Mon, Oct 23, 3:47 PM · Traffic, Operations
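
The "more than X errors in N hours" idea from the certspotter comment above can be sketched as a small threshold check. This is only an illustration under assumptions: the function name, its parameters, and the idea of keeping a list of error timestamps are hypothetical and are not part of certspotter or of the existing cronjobs.

```python
from datetime import datetime, timedelta


def should_alert(error_times, now, max_errors, window_hours):
    """Return True when more than max_errors errors fall inside the window.

    error_times: datetimes at which a (possibly transient) error was seen.
    All names here are hypothetical; nothing below is a certspotter API.
    """
    cutoff = now - timedelta(hours=window_hours)
    # Keep only the errors that happened within the last window_hours.
    recent = [t for t in error_times if cutoff <= t <= now]
    return len(recent) > max_errors


if __name__ == "__main__":
    now = datetime(2017, 10, 23, 15, 0)
    errors = [now - timedelta(hours=h) for h in (1, 2, 30)]
    # Two errors within the last 24 hours: quiet with a threshold of 3.
    print(should_alert(errors, now, max_errors=3, window_hours=24))
```

A wrapper along these lines would stay silent for an isolated failure and only complain once failures persist, which is the trade-off the comment describes.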

Fri, Oct 20

faidon added a comment to T167840: Merge AS14907 with AS43821.

Sounds fine to me. Before we resolve this task, let's not forget that we'll need to clean up our RIPE objects by removing the old route(6) ones.

Fri, Oct 20, 1:20 PM · Performance-Team (Radar), Performance-Team-notice, Patch-For-Review, Operations, netops
faidon updated subscribers of T156256: Allocate address space for Singapore (APNIC).

OK, so APNIC fixed the "57 duplicate objects" situation, so I proceeded with the rest and specifically:

  • Updated our objects for the new office address
  • Updated to use the right mailbox per object and type (instead of abuse@ everywhere)
  • Created route objects for the /24 and /48 with origin: AS14907
  • Created domain objects for in-addr.arpa/ip6.arpa (reverse delegation)
  • Added the zones (with just SOA) to operations/dns, and verified the delegation works
Fri, Oct 20, 1:16 PM · Patch-For-Review, Operations, Traffic

Thu, Oct 19

faidon added a comment to T177228: Multiple systems in esams OE10 showing PSU failures.

IIRC, @mark said that the rack in question doesn't have a secondary PDU. New PDUs for esams are in the budget this year, so I guess this is planned?

Thu, Oct 19, 6:21 PM · Traffic, ops-esams, DC-Ops, Operations
faidon closed T109903: Add PDU redundancy server/router/switch checks in Icinga as Resolved.

For switches/routers we have alerts on Juniper's system/chassis alarms, which we know trip when they lose PDU redundancy or hit any other kind of error. I don't think our disk shelves are connected to the network at all, so I don't see how we'd be able to monitor that? Resolving for now; if there is additional work to be done, feel free to reopen :)

Thu, Oct 19, 6:20 PM · Patch-For-Review, Operations, monitoring
faidon added a comment to T177227: Multiple servers in eqiad D8 showing PSU failures.

@Cmjohnson, both analytics1036 and analytics1037 are still showing PSU redundancy errors. analytics1035 is fine now, though.

Thu, Oct 19, 6:18 PM · ops-eqiad, DC-Ops, Operations
faidon added a comment to T156256: Allocate address space for Singapore (APNIC).

Yup, that's fine, as is creating the zones in the DNS and puppet repositories (but not doing the reverse delegation yet).

Thu, Oct 19, 4:41 PM · Patch-For-Review, Operations, Traffic
faidon renamed T156256: Allocate address space for Singapore (APNIC) from Select or Acquire Address Space for Asia Cache DC to Allocate address space for Singapore (APNIC).
Thu, Oct 19, 8:53 AM · Patch-For-Review, Operations, Traffic
faidon added a comment to T156256: Allocate address space for Singapore (APNIC).

We now have an APNIC account, and today we were assigned this IP space:

  • 103.102.166.0/24
  • 2001:df2:e500::/48
Thu, Oct 19, 8:52 AM · Patch-For-Review, Operations, Traffic
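
For the two prefixes listed above, the names of the reverse DNS zones that would later be delegated (in-addr.arpa for IPv4, ip6.arpa for IPv6) can be derived mechanically. The sketch below uses only Python's standard `ipaddress` module; the helper name `reverse_zone` is hypothetical, and it only handles prefixes aligned on an octet (IPv4) or nibble (IPv6) boundary.

```python
import ipaddress


def reverse_zone(cidr):
    """Return the reverse DNS zone name for an octet/nibble-aligned prefix."""
    net = ipaddress.ip_network(cidr)
    if net.version == 4:
        if net.prefixlen % 8:
            raise ValueError("IPv4 prefix must fall on an octet boundary")
        # Keep the octets covered by the prefix, then reverse their order.
        octets = str(net.network_address).split(".")[: net.prefixlen // 8]
        return ".".join(reversed(octets)) + ".in-addr.arpa"
    if net.prefixlen % 4:
        raise ValueError("IPv6 prefix must fall on a nibble boundary")
    # Expand to all 32 hex digits, keep the ones covered by the prefix.
    nibbles = net.network_address.exploded.replace(":", "")[: net.prefixlen // 4]
    return ".".join(reversed(nibbles)) + ".ip6.arpa"


if __name__ == "__main__":
    for prefix in ("103.102.166.0/24", "2001:df2:e500::/48"):
        print(prefix, "->", reverse_zone(prefix))
```

These zone names are only the arithmetic result for the assigned prefixes; the actual zone files and delegation are handled in the DNS repository and at the registry.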

Oct 18 2017

faidon added a comment to T178457: nutcracker fails to start due to lack of /var/run/nutcracker (ex: deployment-videoscaler01 has memcached failures).

Ah! Yes, that all makes sense now, thanks!

Oct 18 2017, 1:20 PM · Patch-For-Review, Operations, Release-Engineering-Team (Kanban), Beta-Cluster-Infrastructure
faidon added a comment to T178457: nutcracker fails to start due to lack of /var/run/nutcracker (ex: deployment-videoscaler01 has memcached failures).

nutcracker ships /usr/lib/tmpfiles.d/nutcracker.conf, which should be creating the directory in (/var)/run. This has been working fine in production for months now. Not sure why it doesn't work in your case; could you troubleshoot a little more and provide more information?

Oct 18 2017, 12:17 PM · Patch-For-Review, Operations, Release-Engineering-Team (Kanban), Beta-Cluster-Infrastructure
faidon added a comment to T178346: uprightdiff fails to build with opencv 3.2.

Ideally uprightdiff would detect that at runtime and adjust as necessary. That'd be a little difficult with the plain Makefile we have; have you considered switching to autoconf/automake or something fancier/newer than that (Meson, CMake etc.)?

Oct 18 2017, 11:32 AM · uprightdiff

Oct 17 2017

faidon closed Unknown Object (Task), a subtask of T156844: Prep to decommission old dbstore hosts (db1046, db1047), as Resolved.
Oct 17 2017, 5:18 PM · Patch-For-Review, User-Elukey, Analytics, Operations, DBA

Oct 16 2017

faidon added a comment to T109903: Add PDU redundancy server/router/switch checks in Icinga.

What's the status and what's left here? @herron?

Oct 16 2017, 4:08 PM · Patch-For-Review, Operations, monitoring
faidon moved T175798: Port non-deprecated Diamond collectors to Prometheus from Backlog to Up next on the monitoring board.
Oct 16 2017, 3:27 PM · monitoring, Operations
faidon moved T178008: ensure that services on labtest machines never create SMS from Icinga (not send sms pages for labtest* things to non-cloud folks) from Backlog to In progress on the monitoring board.
Oct 16 2017, 3:24 PM · Patch-For-Review, monitoring, Operations
faidon moved T177225: Uninstall ganglia from the fleet from Backlog to In progress on the monitoring board.
Oct 16 2017, 1:32 PM · Patch-For-Review, Operations, monitoring
faidon moved T178220: Fix cronspam from /usr/local/sbin/pdns_gmetric from Backlog to In progress on the monitoring board.
Oct 16 2017, 1:28 PM · monitoring

Oct 12 2017

faidon closed T176505: rack/setup/install flerovium.eqiad.wmnet as Resolved.

In production for about a week now.

Oct 12 2017, 5:26 PM · Patch-For-Review, ops-eqiad, Operations
faidon closed T176506: rack/setup/install furud.codfw.wmnet as Resolved.

This is all installed and in production for about a week now.

Oct 12 2017, 5:25 PM · Analytics, Operations
faidon closed T178087: furud /mnt/2a 97% full as Declined.

Yeah that's temporary and fine. The test in general is a bit flawed in that way, but we can ignore that for this particular host.

Oct 12 2017, 5:25 PM · Operations
faidon closed T178087: furud /mnt/2a 97% full, a subtask of T176506: rack/setup/install furud.codfw.wmnet, as Declined.
Oct 12 2017, 5:25 PM · Analytics, Operations

Oct 11 2017

faidon changed the status of Unknown Object (Task), a subtask of T162683: Network hardware purchasing for Asia Cache DC, from Stalled to Open.
Oct 11 2017, 5:14 PM · Operations, Traffic
faidon created T177931: Decommission OCG from production.
Oct 11 2017, 1:04 PM · Patch-For-Review, Services (watching), OCG-General, Operations

Oct 9 2017

faidon added a comment to T177225: Uninstall ganglia from the fleet.

I saw some of these commits fly by. These are obviously well agreed in principle but I think it's important to not have regressions here -- if we remove a service from being monitored by Ganglia, we should have the equivalent metrics in Prometheus and Graphite, and these need to show up in a suitable Grafana dashboard. Has this been taken into account?

Oct 9 2017, 6:01 PM · Patch-For-Review, Operations, monitoring

Oct 4 2017

faidon merged T177403: esams rack OE10 power redundancy issues? (cp3030-9) into T177228: Multiple systems in esams OE10 showing PSU failures.
Oct 4 2017, 3:03 PM · Traffic, ops-esams, DC-Ops, Operations
faidon merged task T177403: esams rack OE10 power redundancy issues? (cp3030-9) into T177228: Multiple systems in esams OE10 showing PSU failures.
Oct 4 2017, 3:03 PM · Traffic, ops-esams, Operations
faidon added a comment to T177371: Phase out DSA keys for SSH access (ssh-dss).

We have at least one other use, the Ganeti key (cf. modules/role/manifests/ganeti.pp). This was for legacy reasons -- Ganeti didn't support RSA, but I think it does now, at least in the version in stretch (also available in jessie-backports).

Oct 4 2017, 1:54 PM · Operations

Oct 3 2017

faidon added a comment to T173311: Review check_raid_hpssacli frequency.

Would it make sense to lower the interval for all role::mariadb::core, irrespective of mysql_role, to make this a simpler target? We can take the extra hit of more frequent checks for all of those hosts, I think.

Oct 3 2017, 4:51 PM · Patch-For-Review, Operations, monitoring
faidon added a comment to T173311: Review check_raid_hpssacli frequency.

What is the high-level/human description of the policy we want to enforce for database servers? (e.g. "HP servers in the active datacenter need the check to run every 5 minutes, the rest every 10 minutes"). I ask because I'm wondering if/how any of this can be fixed with tools like role/profile classes and facts, without hardcoding specific hosts/hiera keys.

Oct 3 2017, 3:21 PM · Patch-For-Review, Operations, monitoring

Oct 2 2017

faidon added a comment to T176370: Migrate to PHP 7 in WMF production.

Taking into account the lack of funding for appserver work, as well as the end-of-year fundraising and Christmas freezes, the (tentative!) timeline I proposed is:

  • Upgrade the appserver fleet (w/ HHVM) to Debian stretch, including the ICU migration, in Q3 FY17-18 (circa February/March 2018)
  • Begin PHP7 planning and initial implementation work in Q4 FY17-18, e.g. including a few test servers
  • Fund the work in FY18-19 and complete it early in the year (Q1 or Q2 at the latest)
Oct 2 2017, 6:10 PM · NewPHP, HHVM, TechCom-RfC, MediaWiki-Platform-Team, Operations
faidon assigned T136312: Encrypt syslog traffic to fgiunchedi.
Oct 2 2017, 3:47 PM · Patch-For-Review, monitoring, User-fgiunchedi, Operations
faidon renamed T109903: Add PDU redundancy server/router/switch checks in Icinga from add pdu redundancy checking to server/router/switch checks in icinga to Add PDU redundancy server/router/switch checks in Icinga.
Oct 2 2017, 3:47 PM · Patch-For-Review, Operations, monitoring
faidon closed T173315: Review check_ping settings as Declined.

No, no per-process statistics that I know of :(

Oct 2 2017, 3:44 PM · Operations, monitoring
faidon closed T173315: Review check_ping settings, a subtask of T173050: Investigate icinga (einsteinium) load, as Declined.
Oct 2 2017, 3:44 PM · monitoring
faidon renamed T177227: Multiple servers in eqiad D8 showing PSU failures from Multiple servers in equad D8 showing PSU failures to Multiple servers in eqiad D8 showing PSU failures.
Oct 2 2017, 3:42 PM · ops-eqiad, DC-Ops, Operations
faidon assigned T82937: re-create script for manual paging to Dzahn.
Oct 2 2017, 3:39 PM · Operations, monitoring, Icinga
faidon moved T82937: re-create script for manual paging from Backlog to Up next on the monitoring board.
Oct 2 2017, 3:39 PM · Operations, monitoring, Icinga
faidon moved T170353: Icinga: timeseries checks should have the link to a graph with the data from Backlog to Up next on the monitoring board.
Oct 2 2017, 3:38 PM · Patch-For-Review, Operations, monitoring
faidon moved T175636: prometheus -> grafana stats for per-numa-node meminfo from In progress to Externally blocked on the monitoring board.
Oct 2 2017, 3:29 PM · Patch-For-Review, monitoring, Traffic, Operations
faidon moved T175922: Use Prometheus for Kafka JMX metrics instead of jmxtrans from Backlog to In progress on the monitoring board.
Oct 2 2017, 3:07 PM · monitoring, User-Elukey, Patch-For-Review, Analytics-Kanban, Analytics-Cluster
faidon added a project to T175922: Use Prometheus for Kafka JMX metrics instead of jmxtrans: monitoring.
Oct 2 2017, 3:06 PM · monitoring, User-Elukey, Patch-For-Review, Analytics-Kanban, Analytics-Cluster
faidon added a comment to T177214: install2002 free disk space warning.

install1002 and install2002 should be identical, so why is one alerting and the other isn't? I think there's an rsync from one to the other to keep them in sync; perhaps we aren't passing --delete to it?

Oct 2 2017, 2:14 PM · Patch-For-Review, Operations
faidon added a comment to T173427: Review check_puppetrun frequency.

I think the proposal is to bump check interval from 1 minute to 5 minutes, right? Any other actionables here?

Oct 2 2017, 2:09 PM · Patch-For-Review, Operations, monitoring
faidon moved T109903: Add PDU redundancy server/router/switch checks in Icinga from Up next to In progress on the monitoring board.
Oct 2 2017, 2:08 PM · Patch-For-Review, Operations, monitoring
faidon added a comment to T173315: Review check_ping settings.

I'm not sure if a different implementation (like fping) is going to make a difference. check_ping is "slow" because it sends multiple packets over 1-second intervals -- that will always consume real time (but not CPU time) irrespective of implementation.

Oct 2 2017, 2:07 PM · Operations, monitoring
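
The distinction drawn above between real (wall-clock) time and CPU time can be illustrated with a toy probe loop. This is a simulation under assumptions, not check_ping or fping: no packets are sent, the function name and parameters are hypothetical, and the pause between probes is modelled with `time.sleep`.

```python
import time


def simulate_probe(packets, interval_s):
    """Pretend to send `packets` probes spaced `interval_s` seconds apart.

    Returns (wall_seconds, cpu_seconds). No network traffic is generated;
    the sleep stands in for the fixed gap between consecutive probes.
    """
    wall_start = time.perf_counter()
    cpu_start = time.process_time()
    for i in range(packets):
        # A real check would send a probe and wait for the reply here.
        if i < packets - 1:
            time.sleep(interval_s)  # waiting costs wall time, not CPU time
    wall = time.perf_counter() - wall_start
    cpu = time.process_time() - cpu_start
    return wall, cpu


if __name__ == "__main__":
    wall, cpu = simulate_probe(packets=3, interval_s=0.2)
    print(f"wall={wall:.3f}s cpu={cpu:.3f}s")
```

With any implementation that keeps the same packet count and spacing, the wall-clock cost is roughly (packets - 1) times the interval, while the CPU cost stays small.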
faidon added a comment to T173311: Review check_raid_hpssacli frequency.

Given that we're not under load/pressure and increasing the frequency has the potential of hiding issues during troubleshooting, I'd be inclined to leave it unchanged, at 10 minutes. Thoughts/disagreements?

Oct 2 2017, 2:03 PM · Patch-For-Review, Operations, monitoring
faidon moved T175980: Upgrade grafana to 4.5.2 from Up next to In progress on the monitoring board.
Oct 2 2017, 2:01 PM · Graphite, User-fgiunchedi, monitoring, Operations
faidon added a comment to T177131: adjust flerovium power draw.

Note that the storage shelves are only there temporarily, for 2-3 weeks. I'll leave the decision on whether to balance power in the meantime to you guys though, you know best :)

Oct 2 2017, 1:10 PM · ops-eqiad, Operations
faidon merged T177130: labsdb1001's switch port negociating at 100M into T137555: labsdb1001: Investigate eth0 wrong negotiated interface speed.
Oct 2 2017, 1:08 PM · Operations, ops-eqiad
faidon merged task T177130: labsdb1001's switch port negociating at 100M into T137555: labsdb1001: Investigate eth0 wrong negotiated interface speed.
Oct 2 2017, 1:08 PM · Operations, ops-eqiad, netops, Cloud-Services

Sep 26 2017

faidon added a comment to T176337: esams: networking audit for support contract renewal.

@mark confirmed that S/N: TA3717090152 and S/N: TA3717090331 are the new QFX5100 that were delivered at esams a few weeks ago (WMF4201/asw-oe16-esams and WMF4202/asw-oe15-esams). I've updated Racktables to reflect that, although we're still unsure which one is which, so I put the S/Ns at random.

Sep 26 2017, 5:37 PM · ops-esams, Operations, netops, procurement, DC-Ops
faidon added a comment to T175672: Make apache/maintenance hosts TLS connections to mariadb work.

That isn't needed. We import the puppet CA to the host's certificate store in base and it should thus be available as /etc/ssl/certs/Puppet_Internal_CA.pem. Instead of using that though, the preferred, future-proof way to support it would be just using the (c_rehashed) /etc/ssl/certs as the CA path (in OpenSSL applications), or as /etc/ssl/certs/ca-certificates.crt (in GnuTLS/NSS applications).

Sep 26 2017, 12:20 PM · Patch-For-Review, Performance-Team (Radar), Availability (Multiple-active-datacenters), DBA, Operations
faidon added a comment to T175672: Make apache/maintenance hosts TLS connections to mariadb work.

I may be missing something, but why do we need client certificates? Just setting the CA path to /etc/ssl/certs and the rest of the arguments to NULL should suffice, I think?

Sep 26 2017, 12:05 PM · Patch-For-Review, Performance-Team (Radar), Availability (Multiple-active-datacenters), DBA, Operations
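
The two comments above suggest verifying the server against the hashed CA directory and skipping client certificates altogether. As a rough analogue in Python's standard `ssl` module (not the MySQL/MariaDB client API discussed in the task), a client context of that shape could look like the sketch below. The helper name is hypothetical, and the default path is simply the one named in the comment.

```python
import ssl


def make_client_context(capath="/etc/ssl/certs"):
    """Build a TLS client context that trusts CAs from a hashed directory.

    No client certificate or key is loaded: the client only verifies the
    server's certificate chain and hostname.
    """
    # create_default_context enables certificate and hostname verification;
    # passing capath makes it load trust anchors from that directory.
    return ssl.create_default_context(capath=capath)


if __name__ == "__main__":
    ctx = make_client_context()
    print(ctx.verify_mode == ssl.CERT_REQUIRED, ctx.check_hostname)
```

Pointing at the directory rather than at a single CA file means a future CA change only requires updating the host's certificate store, which matches the "future-proof" argument in the comment.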

Sep 25 2017

faidon renamed T173684: Update WMF's office address in mediawiki-config from Wikimedia Foundation is moving by the end of September 2017 to Update WMF's office address in mediawiki-config.
Sep 25 2017, 7:29 PM · WMF-Legal, Patch-For-Review, User-Urbanecm, Wikimedia-Site-requests
faidon moved T170353: Icinga: timeseries checks should have the link to a graph with the data from Up next to Backlog on the monitoring board.
Sep 25 2017, 4:43 PM · Patch-For-Review, Operations, monitoring
faidon added a comment to T156256: Allocate address space for Singapore (APNIC).

Status update: back in April, APNIC had requested documentation supporting that we have or are about to have a presence in the Asia-Pacific region. We didn't have any besides our internal ones to support that, so the request has been stalled ever since.

Sep 25 2017, 11:40 AM · Patch-For-Review, Operations, Traffic

Sep 22 2017

faidon added a comment to T109903: Add PDU redundancy server/router/switch checks in Icinga.

Check_ipmi_sensor is showing failures on 3 out of 4 of the Dell PowerEdge R620 class systems that UnitedLayer recently reported as having failed PSUs.

Sep 22 2017, 7:37 AM · Patch-For-Review, Operations, monitoring

Sep 21 2017

faidon added a comment to T176386: upload@ulsfo strange ethernet / power / switch issues, etc....

Confirmed from UnitedLayer email:

Assad Kermanshahi, Sep 20, 21:13 PDT
Sep 21 2017, 9:09 AM · Patch-For-Review, Operations, Traffic
faidon added a comment to T176386: upload@ulsfo strange ethernet / power / switch issues, etc....

Recoveries of whatever the hell is happening in ulsfo:

04:26 <+icinga-wm> RECOVERY - Juniper alarms on asw-ulsfo is OK: JNX_ALARMS OK - 0 red alarms, 0 yellow alarms
04:28 <+icinga-wm> RECOVERY - Host cp4007 is UP: PING OK - Packet loss = 0%, RTA = 78.60 ms
04:30 <+icinga-wm> RECOVERY - Host ripe-atlas-ulsfo IPv6 is UP: PING OK - Packet loss = 0%, RTA = 78.69 ms
04:30 <+icinga-wm> RECOVERY - Host ripe-atlas-ulsfo is UP: PING OK - Packet loss = 0%, RTA = 78.67 ms
04:30 <+icinga-wm> RECOVERY - Host cp4007.mgmt is UP: PING OK - Packet loss = 0%, RTA = 79.19 ms
Sep 21 2017, 9:01 AM · Patch-For-Review, Operations, Traffic

Sep 20 2017

faidon added a comment to T174587: Blog post about the server switch project.

Thank you all (and especially @Whatamidoing-WMF for spearheading this) and sorry for not being very responsive!

Sep 20 2017, 6:40 PM · Community-Liaisons (Oct-Dec 2017), Wikimedia-Blog-Content
faidon added a comment to T176314: Replace salt on integration and deployment-prep projects.

beta, CI and other WMCS VPS projects are not environments that either TechOps or WMCS operate and as such, we hadn't incorporated them into our plans for the Salt deprecation (and that's also why it's not listed in our goals). To be honest, I wasn't even aware of this use of Salt, but even if I had known about it, I'm not sure how we could have reasonably done anything about it other than just give you a heads-up, given our unfamiliarity with this environment. Due to the dependency on Trebuchet, this was a quarterly goal that was planned and coordinated with Release-Engineering-Team, so I don't think this was a surprise to you regardless? I'm being a little defensive because I see that you made this a subtask of T164780, tagged this as Goal and Operations etc., so I guess you disagree and/or this may be all a surprise to you after all? If not, then feel free to ignore this whole paragraph :)

Sep 20 2017, 2:56 PM · RelEng-Archive-FY201718-Q1, Patch-For-Review, Continuous-Integration-Infrastructure, Beta-Cluster-Infrastructure, Technical-Debt, Operations-Software-Development

Sep 19 2017

faidon added a comment to T176044: Replace kernel and reboot labvirt1015, 1016, 1017, 1018.

Was network connectivity lost to the server at large or to the VMs running on that labvirt instance?

It was the host itself. For the most part these systems aren't hosting any VMs.

root@labtestvirt2002:~# ifconfig up eth0
eth0: Host name lookup failure
ifconfig: `--help' gives usage information.
Sep 19 2017, 2:49 PM · Patch-For-Review, cloud-services-team (Kanban)

Sep 18 2017

Krinkle awarded T150256: Re-setup lvs1007-lvs1012, replace lvs1001-lvs1006 an Orange Medal token.
Sep 18 2017, 11:42 PM · Patch-For-Review, Traffic, netops, Operations
faidon added a comment to T176044: Replace kernel and reboot labvirt1015, 1016, 1017, 1018.

@chasemp mentioned this odd issue at the meeting today. If there are no (useful?) logs, are there perhaps any hosts that exhibit the non-working behavior or can be easily triggered to? Let me know (here or on IRC) if you reboot and get the broken behavior, and I can attempt to debug or gather more information from the live (broken) system.

Sep 18 2017, 10:54 PM · Patch-For-Review, cloud-services-team (Kanban)
faidon added a comment to T176175: connect second ethernet interface for fundraising codfw hosts.

Again, I fear that this gives a false sense of redundancy

This does not concern me--I think we're all pretty clear on what this accomplishes. It's a hedge against switch or port failure and we're under no illusion that it will fix any other SPOF.

  • plus, LAGs (and especially multi-chassis, or virtual-chassis ones) are not without their own risks.

True. FWIW we're only doing active-backup which is simple to enable/disable and doesn't require network-side configuration beyond enabling ports and putting them in the right vlan. If there is a problem it is pretty straightforward to remedy from either side, we can disable switch ports and/or adjust host OS config.

Sep 18 2017, 10:33 PM · ops-codfw, netops, fundraising-tech-ops, Operations
faidon added a comment to T176175: connect second ethernet interface for fundraising codfw hosts.

All of them? Wasn't the plan to only do it for the few hosts that are important SPOFs? Again, I fear that this gives a false sense of redundancy -- plus, LAGs (and especially multi-chassis, or virtual-chassis ones) are not without their own risks.

Sep 18 2017, 9:19 PM · ops-codfw, netops, fundraising-tech-ops, Operations
faidon moved T175636: prometheus -> grafana stats for per-numa-node meminfo from Backlog to In progress on the monitoring board.
Sep 18 2017, 3:28 PM · Patch-For-Review, monitoring, Traffic, Operations
faidon reassigned T171157: Monitor internal CA expirations from Dzahn to akosiaris.
Sep 18 2017, 3:19 PM · monitoring, Operations
faidon moved T175980: Upgrade grafana to 4.5.2 from Backlog to Up next on the monitoring board.
Sep 18 2017, 3:17 PM · Graphite, User-fgiunchedi, monitoring, Operations
faidon moved T165348: Check long-running screen/tmux sessions from Up next to In progress on the monitoring board.
Sep 18 2017, 2:34 PM · Patch-For-Review, monitoring, Operations
faidon updated subscribers of T176126: Update node-rdkafka version to v2.x.

I think @elukey and @Ottomata have some plans around the librdkafka version that needs to be deployed fleet-wide, since there's an implicit dependency on the Kafka TLS work. Is node-rdkafka's dependency on 0.9.5 specifically, or >= 0.9? Can we use 0.11.0?

Sep 18 2017, 12:12 PM · Services (blocked), EventBus, Analytics, Trending-Service, ChangeProp, Reading-Infrastructure-Team-Backlog

Sep 14 2017

faidon added a comment to T171704: Switch all hosts to the future parser.

I think as of today, with the latest compiler run (#7882) plus another hotfix (28111a9), all manifests are compatible with the future parser and we can (and should!) migrate all hosts to the future environment, plus CI and the compiler.

Sep 14 2017, 2:58 PM · Patch-For-Review, User-Joe, Puppet, Operations
faidon added a comment to T165348: Check long-running screen/tmux sessions.

See the patch above. Based on the cumin results and feedback from the first few users, I suggest that the following be whitelisted in the first round:

  • package building hosts (copper)
  • mediawiki maintenance servers (terbium/wasat)
  • salt masters (neodymium)
  • puppet masters (frond and backend)
  • analytics_cluster::client (stat1004, notebook)
  • mariadb::core and all other mariadb::* roles (db*)
  • restbase-dev and restbase-test (but not restbase-prod)
  • labtest* (various wmcs::labtest and labtestn roles)
  • analytics_cluster::coordinator (analytics1003, data imports happen here)
  • analytics_cluster::druid::worker (druid* otto/joal replacing pivot)
Sep 14 2017, 2:52 PM · Patch-For-Review, monitoring, Operations
faidon added a comment to T165348: Check long-running screen/tmux sessions.

@jcrespo, fully agreed that alerts should be actionable and I don't particularly disagree with your alert definitions. This task exists precisely because a long-running forgotten screen caused a real, user-facing outage (we discussed it at an ops meeting at the time).

Sep 14 2017, 2:49 PM · Patch-For-Review, monitoring, Operations

Sep 13 2017

faidon updated subscribers of T171704: Switch all hosts to the future parser.

I pushed and merged a bunch of changes under Gerrit's topic:future-parser today. I also switched a couple of other patchsets to that topic as well, for referencing them easily. For the record, @ema used topic:varnish-future-parser for the Varnish work, but all this has been merged.

Sep 13 2017, 1:41 AM · Patch-For-Review, User-Joe, Puppet, Operations

Sep 8 2017

faidon created T175362: Split MXes into inbound and outbound.
Sep 8 2017, 1:05 PM · Operations, Mail
faidon moved T175361: Upgrade mx1001/mx2001 to stretch from Backlog to Up Next on the Mail board.
Sep 8 2017, 1:00 PM · Operations, Mail
faidon created T175361: Upgrade mx1001/mx2001 to stretch.
Sep 8 2017, 1:00 PM · Operations, Mail

Sep 6 2017

faidon assigned T169600: Enable diamond PowerDNSRecursor collector on dnsrecursors to akosiaris.
Sep 6 2017, 3:16 PM · Patch-For-Review, Diamond, monitoring, Traffic, Prometheus-metrics-monitoring, Operations
faidon moved T151632: Fix Icinga checks for test/decom servers from Up next to In progress on the monitoring board.
Sep 6 2017, 3:12 PM · Patch-For-Review, monitoring, Operations
faidon moved T171823: Grafana dashboards for librenms graphite data from Up next to In progress on the monitoring board.
Sep 6 2017, 3:11 PM · User-fgiunchedi, netops, monitoring, Operations