faidon (Faidon Liambotis)
Principal Operations Engineer

User Details

User Since
Oct 7 2014, 10:21 AM (171 w, 4 d)
Availability
Available
IRC Nick
paravoid
LDAP User
Faidon Liambotis
MediaWiki User
Faidon Liambotis (WMF)

Recent Activity

Today

faidon added a comment to T185345: os_version strict distro check doesn't work.

I just pushed a change to make the rspec more extensive, including a test case for the scenario that you described here. It seems to pass fine, so I merged the change. Is there a specific server/VPS you're experiencing this on? I could give it a look.

Sat, Jan 20, 1:19 AM · Patch-For-Review, Operations, Puppet
faidon added a comment to T185345: os_version strict distro check doesn't work.

I can't reproduce. I'm testing with this for example:

$is_trusty = os_version('ubuntu trusty')
notice("is trusty is ${is_trusty}")
Sat, Jan 20, 12:56 AM · Patch-For-Review, Operations, Puppet

Yesterday

faidon added a project to T185319: IRC RecentChanges feed: code stewardship request: Operations.
Fri, Jan 19, 3:46 PM · Tools, Operations, Analytics, Wikimedia-IRC-RC-Server, Code-Stewardship-Reviews
faidon created T185319: IRC RecentChanges feed: code stewardship request.
Fri, Jan 19, 3:44 PM · Tools, Operations, Analytics, Wikimedia-IRC-RC-Server, Code-Stewardship-Reviews
faidon added a comment to T181036: Pull netflow data in realtime from Kafka via Tranquillity/Spark.

Thanks for working on this task, very much appreciated!

Fri, Jan 19, 1:12 PM · Analytics-Kanban, User-Elukey, monitoring, netops, Operations

Wed, Jan 17

faidon closed T184338: Update people.wikimedia.org with the 2017 Wikimedia hackathon group photo as Resolved.

Merged -- thanks :)

Wed, Jan 17, 1:02 PM · Patch-For-Review, Operations

Tue, Jan 16

fgiunchedi awarded T170150: Evaluate Grafana's LDAP group options and deprecate grafana-admin if possible a Love token.
Tue, Jan 16, 9:20 AM · Patch-For-Review, monitoring, Operations

Mon, Jan 15

faidon added a comment to T170150: Evaluate Grafana's LDAP group options and deprecate grafana-admin if possible.

Thanks so much for this, kudos! Any reason to not just 301 grafana-admin to grafana for a few months (and then just drop it)? Also, wmfall probably sounds excessive, I'd guess all of our users are in the ops list (which isn't just opsens).

Mon, Jan 15, 7:20 PM · Patch-For-Review, monitoring, Operations

Fri, Jan 12

faidon assigned T184230: Disavow emails from wikipedia.com to herron.
Fri, Jan 12, 4:32 PM · Operations, Mail
faidon awarded T167907: Incorporate data from the GeoIP2 ISP database to webrequest a Love token.
Fri, Jan 12, 3:59 PM · Patch-For-Review, Analytics-Kanban

Wed, Jan 10

faidon updated subscribers of T178690: Better organization for ops grafana dashboards.

@ori recently sent his thoughts about this to the ops list, and I found it a very eloquent description of the issues I was thinking of too. His full email was:

Wed, Jan 10, 4:17 PM · User-fgiunchedi, monitoring, Operations
faidon added a comment to T182215: install_server: switch to stretch as default install image.

@Dzahn, yes, that sounds like a good idea. Please do :)

Wed, Jan 10, 12:01 PM · Patch-For-Review, Operations

Mon, Jan 8

faidon added a comment to T184338: Update people.wikimedia.org with the 2017 Wikimedia hackathon group photo.

That's not the Wikimania 2017 Hackathon (which was in Montreal), but the 2017 Hackathon in Vienna. Both the task title and the commit title are incorrect. Any particular reason you chose a hackathon group photo instead of a Wikimania group photo? I don't feel strongly either way, just wondering :)

Mon, Jan 8, 11:08 AM · Patch-For-Review, Operations
faidon added a comment to T174637: Setup esams atlas anchor.

Have we acquired a new image for AS14907 yet?

Mon, Jan 8, 10:59 AM · Operations, netops, ops-esams
faidon added a comment to T156544: Create backups of Wikimedia content in diverse geographic places.

Correct, that is the intention. I can confirm that's the case and that it's not a rumour, but of course take it with a grain of salt, given that the annual plan hasn't even been drafted yet, let alone finalized.

Mon, Jan 8, 10:57 AM · Internet-Archive, Offline-Working-Group, Operations
faidon added a parent task for T170144: Evaluate NetBox as a Racktables replacement & IPAM: T116063: Hardware Automation Workflow - Overall Tracking.
Mon, Jan 8, 10:52 AM · Patch-For-Review, netops, Operations
faidon added a subtask for T116063: Hardware Automation Workflow - Overall Tracking: T170144: Evaluate NetBox as a Racktables replacement & IPAM.
Mon, Jan 8, 10:52 AM · Tracking, Operations
faidon added a comment to T170144: Evaluate NetBox as a Racktables replacement & IPAM.

No, not resolved yet, but in progress :) You're absolutely right we haven't updated this task though (my fault!)

Mon, Jan 8, 10:51 AM · Patch-For-Review, netops, Operations
faidon added a watcher for Code-Stewardship-Reviews: faidon.
Mon, Jan 8, 8:37 AM

Thu, Dec 21

faidon added a comment to T181647: create 'attended' upgrade workflow for cloud with Toolforge as canonical case.
  • It feels a little odd to have specific apt sources.list here. It's not very DRY, could potentially conflict with the system's configuration and ultimately with what's in configuration management (puppet).

Well, since all is in the same repo (puppet), one simply should take care to use consistent repos when doing commits?

Thu, Dec 21, 6:16 PM · cloud-services-team (Kanban), Patch-For-Review, Toolforge

Dec 18 2017

faidon added a comment to T99531: [Task] move wikiba.se webhosting to wikimedia misc-cluster.

For wikiba.se, another option (4, I guess!) is to just host it outside of the Traffic infrastructure, with a separate setup using Let's Encrypt etc. However, broadly speaking, having the ability to route more domains than the dozen canonical ones through our main traffic infrastructure sounds to me like a useful thing to have and would probably be better than having to maintain an entirely separate thing. Depends on how strongly you feel about it, @BBlack?

Dec 18 2017, 2:47 PM · Patch-For-Review, Traffic, wikiba.se, Operations, Wikidata-Sprint-2016-11-08, Wikidata

Dec 15 2017

faidon closed T181724: Disconnect flerovium's disk shelves as Resolved.

Confirmed this was shipped now. Documentation was sent out of band, this can be resolved :)

Dec 15 2017, 5:14 PM · Operations, ops-eqiad

Dec 14 2017

faidon closed T182456: Forward katherine@wikipedia.org and jimmy@wikipedia.org emails to katherine@wikimedia.org and jimmy@wikimedia.org, respectively as Resolved.

Done!

Dec 14 2017, 4:31 PM · Operations, Mail, fundraising-tech-ops
faidon added a comment to T182894: Trusty puppet 4 approach.

Option (1) is not really an option -- as I said in the other task, those packages are horrible, a bigger maintenance burden, and they are also very different from everything else we use. Between (2) and (3) I think option (2) is the way to go here. This is what we have done in the past, and past experience shows that the extra maintenance burden isn't much. Note that at some point we'll have to take on maintenance for even jessie/stretch systems, to go with e.g. Puppet 5.

Dec 14 2017, 4:14 PM · Puppet, Operations
faidon added a comment to T182819: custom fact interface_primary breaks under newer versions of facter.

Facter 3 is quite different than Facter 2, and we're not ready to use this -- that would be a transition of its own, I think. For what it's worth, it also ships with a structured fact, networking[primary] (if memory serves), that does effectively the same as interface_primary.

Dec 14 2017, 7:31 AM · Patch-For-Review, Puppet, Operations
faidon added a comment to T182812: Forward security@tools.wmflabs.org to security@wikimedia.org.

You can use exim4 -bt foo@example.org to test how/where exim4 would deliver a specific address (if at all).

Dec 14 2017, 7:27 AM · Toolforge, Security, Mail, Operations

Dec 13 2017

faidon added a comment to T182812: Forward security@tools.wmflabs.org to security@wikimedia.org.

tools.wmflabs.org isn't a relay that is in production, so it is not (and cannot be) a "trusted" relay. This means that e.g. Gmail will consider all aliased spam to originate from tools.wmflabs.org, which would affect its reputation and possibly mark those emails as spam. Aliasing between different administrative domains has dangers and should be avoided -- I think it'd be preferable to just communicate that security@wikimedia.org is the canonical place to report security issues.

Dec 13 2017, 6:53 PM · Toolforge, Security, Mail, Operations
faidon changed the visibility for T81543: Enable IPSec between datacenters.
Dec 13 2017, 5:09 PM · Operations, Traffic, Interdatacenter-IPsec

Dec 11 2017

faidon added a comment to T181724: Disconnect flerovium's disk shelves.

I disconnected the disk shelves and powered down. @faidon please let me know when and if it's okay to coordinate the drop off.

Dec 11 2017, 11:44 AM · Operations, ops-eqiad

Dec 6 2017

faidon added a comment to T174587: Blog post about the server switch project.

Yeah, let's push it back to Q3 at minimum. My intention throughout has been to tie this to this year's multiDC program, which hasn't been progressing much. We're deliberating on how to proceed with that program across tech -- the program is currently orphaned post-Gabriel's departure (and I -clearly- don't have the bandwidth to pick it up). Until that happens and work there picks up, I don't see much of a point in talking about a one-off switchover, especially one that happened 6+ months ago.

Dec 6 2017, 12:22 PM · Community-Liaisons (Jan-Mar-2018), Wikimedia-Blog-Content

Dec 4 2017

faidon changed the status of T181725: Disconnect furud's disk shelves from Stalled to Open.

Per T181724, let's proceed with the original plan -- cc @Papaul :)

Dec 4 2017, 2:38 PM · Operations, ops-codfw

Dec 1 2017

faidon added a comment to T180998: Switch on http/2 in phabricator apache.

I was thinking hypothetically, if it were possible: would we actually gain anything from an http2 (or h2c even) connection internally since the service is going through Varnish? My assumption is that the gains would be minimal, at best.

Dec 1 2017, 1:42 PM · Traffic, Operations, Phabricator

Nov 30 2017

faidon closed T180998: Switch on http/2 in phabricator apache as Declined.

Do we even gain much from having Phab speak http2 for an internal connection to Varnish?

Nov 30 2017, 6:13 PM · Traffic, Operations, Phabricator
faidon created T181725: Disconnect furud's disk shelves.
Nov 30 2017, 3:19 PM · Operations, ops-codfw
faidon renamed T181724: Disconnect flerovium's disk shelves from Please disconnect flerovium's disk shelves to Disconnect flerovium's disk shelves.
Nov 30 2017, 3:19 PM · Operations, ops-eqiad
faidon created T181724: Disconnect flerovium's disk shelves.
Nov 30 2017, 3:18 PM · Operations, ops-eqiad

Nov 29 2017

faidon added subtasks for T156031: Turn up network links for Asia Cache DC: Unknown Object (Task), Unknown Object (Task).
Nov 29 2017, 12:45 AM · Traffic, Operations

Nov 27 2017

faidon added a comment to T181446: Access to logstash (LDAP group 'nda') for Paladox.

LDAP NDA access effectively means getting access to private and sensitive information, on multiple servers and services, across the board. As such, it's more than just signing a piece of paper or making a good faith promise; it's about one proving they are trustworthy to handle sensitive information (either ours, or of our users'), to be careful with what they access, to listen to instructions about what to do (and not do) with the information they access, and to think carefully before they act.

Nov 27 2017, 10:54 PM · Gerrit, Operations, Ops-Access-Requests
faidon moved T178008: ensure that services on labtest machines never create SMS from Icinga (not send sms pages for labtest* things to non-cloud folks) from In progress to Externally blocked on the monitoring board.
Nov 27 2017, 4:16 PM · Patch-For-Review, monitoring, Operations
faidon moved T178690: Better organization for ops grafana dashboards from Backlog to Up next on the monitoring board.
Nov 27 2017, 4:15 PM · User-fgiunchedi, monitoring, Operations
faidon added a comment to T181264: Refresh or replace oxygen.

I use the sampled-1000 logs from time to time (and the 5xx ones, but less frequently), especially in incident-worthy situations, where speed is of the essence.

Nov 27 2017, 2:48 PM · hardware-requests, Operations, Analytics

Nov 23 2017

faidon created T181264: Refresh or replace oxygen.
Nov 23 2017, 10:37 PM · hardware-requests, Operations, Analytics
faidon added a comment to T138396: Create ops dashboard with info like ipv6 traffic split .

Also see T167907, for a similar request (from the network side of things).

Nov 23 2017, 9:22 PM · Analytics
faidon added a comment to T167907: Incorporate data from the GeoIP2 ISP database to webrequest.

This came up again this week: I was looking into our network traffic in our various PoPs, to plan capacity and procure network links for eqsin (Singapore). There is traffic on our peering port in ulsfo, and there is no easy way to identify where it's coming from using our own tooling. Analyzing Netflows using a pmacct/Druid/Tranquility pipeline would be ideal, but we're very far from that being usable and useful, despite Analytics (very graciously!) helping us slowly get there (cf. T181036).

Nov 23 2017, 9:21 PM · Patch-For-Review, Analytics-Kanban
faidon added a comment to T181202: Enable http/2 for planet apache.

Planet is behind misc-web, right? If so, that's a fairly pointless task, unless I'm missing something. Even if HTTP/2 made a difference (doubtful) for the internal low-latency traffic, Varnish doesn't support HTTP/2 on the client-side anyway, so it won't get used. 008b62f4048 should probably be reverted, its effects manually reverted (a2dismod http2) and this task be declined, unless I'm missing something.

Nov 23 2017, 11:23 AM · Patch-For-Review, Wikimedia-Planet, Operations

Nov 22 2017

faidon closed T150651: Information missing from racktables as Resolved.

Good enough for now. Thanks everyone!

Nov 22 2017, 1:12 AM · Operations, DC-Ops
faidon closed T150651: Information missing from racktables, a subtask of T88424: Migrate racktables to servermon, as Resolved.
Nov 22 2017, 1:12 AM · Operations

Nov 21 2017

faidon merged T181069: Degraded RAID on wtp2017 into T180211: Degraded RAID on wtp2017.
Nov 21 2017, 5:09 PM · Operations, ops-codfw
faidon merged task T181069: Degraded RAID on wtp2017 into T180211: Degraded RAID on wtp2017.
Nov 21 2017, 5:09 PM · Operations, ops-codfw

Nov 17 2017

faidon added a comment to T150651: Information missing from racktables.

@faidon
db1018 and db1022 are confirmed in racktables but are both decommissioned and removed from the rack.

I updated the following asset tags with the purchase and warranty expiration dates.

WMF7043
WMF7068 - WMF7071
WMF7091, WMF7093 - WMF7097
WMF7113 - WMF7114
WMF7127 - WMF7134

I did not update these; @RobH, can you help with them?
WMF6980 - WMF6987 PDUs
WMF7135 - WMF7139 Network gear

Nov 17 2017, 2:25 PM · Operations, DC-Ops
faidon assigned T150651: Information missing from racktables to RobH.

We've fixed so many issues over the past few months that I can't even count them :) Thanks all. I did another sweep today and found these that need fixing:

Nov 17 2017, 4:50 AM · Operations, DC-Ops
Restricted Application assigned T179036: Request VM for webperf (metrics processing) to R3609901.

What's the status of this?

Nov 17 2017, 4:21 AM · Performance-Team, vm-requests, Operations
Restricted Application assigned T97701: Can't visit login page or https wiki pages in IE7/8 on Windows XP (on SauceLabs) to R3609901.

However, https://stats.wikimedia.org/wikimedia/squids/SquidReportClients.htm doesn't break down by underlying OS version, which is why IE8 is higher there (it's including e.g. IE8-on-Vista and others).

Nov 17 2017, 4:18 AM · MediaWiki-General-or-Unknown

Nov 14 2017

faidon added a comment to T180384: Turn off Trending Service.

Do we really need all this for an endpoint marked as "experimental"?

Nov 14 2017, 12:01 PM · Services (done), User-Joe, Operations, Reading-Infrastructure-Team-Backlog (Kanban), Trending-Service

Nov 8 2017

faidon closed T156256: Allocate address space for Singapore (APNIC) as Resolved.

RPKI is all done as far as I know. @mark said he'll create his account later, if at all. I think we can resolve.

Nov 8 2017, 4:02 PM · Patch-For-Review, Operations, Traffic
faidon closed T156256: Allocate address space for Singapore (APNIC), a subtask of T156027: Configuration for Asia Cache DC hosts, as Resolved.
Nov 8 2017, 4:02 PM · Patch-For-Review, Operations, Traffic
faidon closed T156256: Allocate address space for Singapore (APNIC), a subtask of T162684: Network hardware configuration for Asia Cache DC, as Resolved.
Nov 8 2017, 4:02 PM · Operations, Traffic

Nov 7 2017

faidon added a comment to T170817: Upgrade Thumbor servers to Stretch.

I wouldn't recommend reviving MgOpen for basically the reasons I described in #819026. TL;DR is that it had serious unresolved issues to begin with (hinting, missing Euro sign) and has been abandoned upstream for years. Meanwhile, there are plenty of good and free (as in OFL) fonts nowadays with Greek glyphs, including DejaVu, Liberation, the Google fonts (Droid, Roboto, CrOS), the Adobe fonts (Source Sans/Serif).

Nov 7 2017, 9:46 AM · Patch-For-Review, User-fgiunchedi, Performance-Team (Radar), Operations, Thumbor

Nov 6 2017

faidon added a project to T179461: Use the term "developer account" for Wikimedia LDAP accounts: Operations.
Nov 6 2017, 3:05 PM · Operations, Cloud-Services, Developer-Relations
faidon awarded T179461: Use the term "developer account" for Wikimedia LDAP accounts a Love token.
Nov 6 2017, 3:04 PM · Operations, Cloud-Services, Developer-Relations

Oct 26 2017

faidon added a comment to T179042: Setup eqsin RIPE Atlas anchor.

Image has been downloaded to the install* servers.

Oct 26 2017, 5:24 PM · ops-eqsin, netops, Operations
faidon updated the task description for T179042: Setup eqsin RIPE Atlas anchor.
Oct 26 2017, 5:23 PM · ops-eqsin, netops, Operations
faidon renamed T179042: Setup eqsin RIPE Atlas anchor from Setup eqsin atlas anchor to Setup eqsin RIPE Atlas anchor.
Oct 26 2017, 5:22 PM · ops-eqsin, netops, Operations

Oct 24 2017

faidon added a comment to T171508: Investigate and implement alternative for showmount based check at instance boot time.

Going one step further to the original assumptions:

there could be a temporary state in which /home isn't mounted yet, a user logs in, /home gets created, and then something whacky happens and the directory is overridden with the NFS mount

Oct 24 2017, 1:51 PM · cloud-services-team (Kanban), Patch-For-Review, Cloud-Services
faidon added a comment to T171508: Investigate and implement alternative for showmount based check at instance boot time.

The pam_nologin behavior you're reporting sounds very odd indeed. If it's actually the case it will be CVE-worthy! It's an old, popular and well-audited piece of code though, so it'd be surprising to me if the root cause lies with pam_nologin and not somewhere in our configuration. It's not impossible of course, bugs and CVEs do happen :)

Oct 24 2017, 1:48 PM · cloud-services-team (Kanban), Patch-For-Review, Cloud-Services

Oct 23 2017

faidon added a comment to T177498: Provide a forward port of ICU 52 for stretch / Investigate best ICU update strategy.

I investigated the upgrade procedure for "provide icu57 in jessie and migrate before moving to stretch": This allows for a much less invasive transition (mostly because libxml2 in jessie doesn't link against ICU yet):
(snip)

That makes a lot of sense to me. Thanks for all the background work to support this :)

Oct 23 2017, 7:51 PM · User-Elukey, HHVM, Operations
faidon added a comment to T173489: pmacct should be upgraded to 1.6.2 on Stretch.

pmacct 1.7.0-1 (with GeoIP2 support too!) was uploaded to sid yesterday. This should be as easy as a backport-and-install now.

Oct 23 2017, 3:58 PM · Patch-For-Review, User-Elukey, Operations, netops, monitoring
faidon added a comment to T159137: certspotter: Error retrieving STH from log.

We get occasional failures depending on the availability of the CT log servers. I don't see a way around this unless we make our cronjobs quite a bit more sophisticated (e.g. ignore transient errors but complain when we get more than X errors within N hours).
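
As a rough illustration of the "more than X errors within N hours" idea, here is a minimal Python sketch. It is not how the certspotter cronjob works; the class name ErrorBudget and its parameters are hypothetical, and the caller is assumed to pass in the outcome and time of each run.

```python
from collections import deque
from datetime import datetime, timedelta


class ErrorBudget:
    """Hypothetical helper: tolerate isolated failures, alert when they pile up."""

    def __init__(self, max_errors, window):
        self.max_errors = max_errors  # "X": failures tolerated inside the window
        self.window = window          # "N hours", as a timedelta
        self.errors = deque()         # timestamps of recent failures, oldest first

    def record(self, ok, now):
        """Record one run's outcome; return True if an alert should be raised."""
        # Forget failures that have aged out of the window.
        while self.errors and now - self.errors[0] > self.window:
            self.errors.popleft()
        if not ok:
            self.errors.append(now)
        # Alert only once the number of recent failures exceeds the budget.
        return len(self.errors) > self.max_errors
```

A cron wrapper would still need to persist the timestamps between runs (for example in a small state file), which this sketch leaves out.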

Oct 23 2017, 3:47 PM · Traffic, Operations

Oct 20 2017

faidon added a comment to T167840: Merge AS14907 with AS43821.

Sounds fine to me. Before we resolve this task, let's not forget that we'll need to clean up a) our RIPE objects, by removing the old route(6) ones, and b) our RPKI ROAs.

Oct 20 2017, 1:20 PM · Performance-Team (Radar), Performance-Team-notice, Patch-For-Review, Operations, netops
faidon updated subscribers of T156256: Allocate address space for Singapore (APNIC).

OK, so APNIC fixed the "57 duplicate objects" situation, so I proceeded with the rest and specifically:

  • Updated our objects for the new office address
  • Updated to use the right mailbox per object and type (instead of abuse@ everywhere)
  • Created route objects for the /24 and /48 with origin: AS14907
  • Created domain objects for in-addr.arpa/ip6.arpa (reverse delegation)
  • Added the zones (with just SOA) to operations/dns, and verified the delegation works
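
For reference, the reverse zone names that correspond to the two prefixes assigned in this task can be derived with Python's standard ipaddress module. This is only a sketch: the helper name reverse_zone is hypothetical, and it handles only prefixes that fall on a label boundary (multiples of 8 bits for IPv4, 4 bits for IPv6).

```python
import ipaddress


def reverse_zone(cidr):
    """Return the in-addr.arpa / ip6.arpa zone name for an aligned prefix."""
    net = ipaddress.ip_network(cidr)
    # One DNS label covers an octet in IPv4 and a nibble in IPv6.
    bits_per_label = 8 if net.version == 4 else 4
    if net.prefixlen % bits_per_label:
        raise ValueError("prefix does not fall on a label boundary")
    # Labels that belong to the host part are dropped from the front.
    host_labels = (net.max_prefixlen - net.prefixlen) // bits_per_label
    labels = net.network_address.reverse_pointer.split(".")
    return ".".join(labels[host_labels:])


for prefix in ("103.102.166.0/24", "2001:df2:e500::/48"):
    print(prefix, "->", reverse_zone(prefix))
```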
Oct 20 2017, 1:16 PM · Patch-For-Review, Operations, Traffic

Oct 19 2017

faidon added a comment to T177228: Multiple systems in esams OE10 showing PSU failures.

IIRC, @mark said that the rack in question doesn't have a secondary PDU. New PDUs for esams are in the budget this year, so I guess this is planned?

Oct 19 2017, 6:21 PM · Traffic, ops-esams, DC-Ops, Operations
faidon closed T109903: Add PDU redundancy server/router/switch checks in Icinga as Resolved.

For switches/routers we have alerts on Juniper's system/chassis alarms, which we know trip when they lose PDU redundancy, or on any other kind of error. I don't think our disk shelves are connected to the network at all, so I don't see how we'd be able to monitor that? Resolving for now; if there is additional work to be done, feel free to reopen :)

Oct 19 2017, 6:20 PM · Patch-For-Review, Operations, monitoring
faidon added a comment to T177227: Multiple servers in eqiad D8 showing PSU failures.

@Cmjohnson, both analytics1036 and analytics1037 are still showing PSU redundancy errors. analytics1035 is fine now, though.

Oct 19 2017, 6:18 PM · ops-eqiad, DC-Ops, Operations
faidon added a comment to T156256: Allocate address space for Singapore (APNIC).

Yup, that's fine, as is creating the zones in the DNS and puppet repositories (but without doing the reverse delegation).

Oct 19 2017, 4:41 PM · Patch-For-Review, Operations, Traffic
faidon renamed T156256: Allocate address space for Singapore (APNIC) from Select or Acquire Address Space for Asia Cache DC to Allocate address space for Singapore (APNIC).
Oct 19 2017, 8:53 AM · Patch-For-Review, Operations, Traffic
faidon added a comment to T156256: Allocate address space for Singapore (APNIC).

We now have an APNIC account, and we were assigned this IP space today:

  • 103.102.166.0/24
  • 2001:df2:e500::/48
Oct 19 2017, 8:52 AM · Patch-For-Review, Operations, Traffic

Oct 18 2017

faidon added a comment to T178457: nutcracker fails to start due to lack of /var/run/nutcracker (ex: deployment-videoscaler01 has memcached failures).

Ah! Yes, that all makes sense now, thanks!

Oct 18 2017, 1:20 PM · Patch-For-Review, Operations, Release-Engineering-Team (Kanban), Beta-Cluster-Infrastructure
faidon added a comment to T178457: nutcracker fails to start due to lack of /var/run/nutcracker (ex: deployment-videoscaler01 has memcached failures).

nutcracker ships /usr/lib/tmpfiles.d/nutcracker.conf which should be creating the file in (/var)/run. This has been working in production fine for months now. Not sure why it doesn't work in your case, could you troubleshoot a little more and provide more information?

Oct 18 2017, 12:17 PM · Patch-For-Review, Operations, Release-Engineering-Team (Kanban), Beta-Cluster-Infrastructure
faidon added a comment to T178346: uprightdiff fails to build with opencv 3.2.

Ideally uprightdiff would detect that at runtime and adjust as necessary. That'd be a little difficult with the plain Makefile we have; have you considered switching to autoconf/automake or something fancier/newer than that (Meson, CMake etc.)?

Oct 18 2017, 11:32 AM · uprightdiff

Oct 17 2017

faidon closed Unknown Object (Task), a subtask of T156844: Decommission old dbstore hosts (db1046, db1047), as Resolved.
Oct 17 2017, 5:18 PM · Analytics-Kanban, Patch-For-Review, User-Elukey, Operations, DBA

Oct 16 2017

faidon added a comment to T109903: Add PDU redundancy server/router/switch checks in Icinga.

What's the status and what's left here? @herron?

Oct 16 2017, 4:08 PM · Patch-For-Review, Operations, monitoring
faidon moved T175798: Port non-deprecated Diamond collectors to Prometheus from Backlog to Up next on the monitoring board.
Oct 16 2017, 3:27 PM · monitoring, Operations
faidon moved T178008: ensure that services on labtest machines never create SMS from Icinga (not send sms pages for labtest* things to non-cloud folks) from Backlog to In progress on the monitoring board.
Oct 16 2017, 3:24 PM · Patch-For-Review, monitoring, Operations
faidon moved T177225: Uninstall ganglia from the fleet from Backlog to In progress on the monitoring board.
Oct 16 2017, 1:32 PM · Patch-For-Review, Operations, monitoring
faidon moved T178220: Fix cronspam from /usr/local/sbin/pdns_gmetric from Backlog to In progress on the monitoring board.
Oct 16 2017, 1:28 PM · monitoring

Oct 12 2017

faidon closed T176505: rack/setup/install flerovium.eqiad.wmnet as Resolved.

In production for about a week now.

Oct 12 2017, 5:26 PM · Patch-For-Review, ops-eqiad, Operations
faidon closed T176506: rack/setup/install furud.codfw.wmnet as Resolved.

This is all installed and in production for about a week now.

Oct 12 2017, 5:25 PM · Analytics, Operations
faidon closed T178087: furud /mnt/2a 97% full as Declined.

Yeah that's temporary and fine. The test in general is a bit flawed in that way, but we can ignore that for this particular host.

Oct 12 2017, 5:25 PM · Operations
faidon closed T178087: furud /mnt/2a 97% full, a subtask of T176506: rack/setup/install furud.codfw.wmnet, as Declined.
Oct 12 2017, 5:25 PM · Analytics, Operations

Oct 11 2017

faidon changed the status of Unknown Object (Task), a subtask of T162683: Network hardware purchasing for Asia Cache DC, from Stalled to Open.
Oct 11 2017, 5:14 PM · Operations, Traffic
faidon changed the status of Unknown Object (Task), a subtask of T166179: singapore caching center: eqiad staging tracking task, from Stalled to Open.
Oct 11 2017, 5:14 PM · ops-eqsin, DC-Ops, Operations
faidon created T177931: Decommission OCG from production.
Oct 11 2017, 1:04 PM · Patch-For-Review, Services (watching), OCG-General, Operations

Oct 9 2017

faidon added a comment to T177225: Uninstall ganglia from the fleet.

I saw some of these commits fly by. These are obviously well agreed in principle but I think it's important to not have regressions here -- if we remove a service from being monitored by Ganglia, we should have the equivalent metrics in Prometheus and Graphite, and these need to show up in a suitable Grafana dashboard. Has this been taken into account?

Oct 9 2017, 6:01 PM · Patch-For-Review, Operations, monitoring

Oct 4 2017

faidon merged T177403: esams rack OE10 power redundancy issues? (cp3030-9) into T177228: Multiple systems in esams OE10 showing PSU failures.
Oct 4 2017, 3:03 PM · Traffic, ops-esams, DC-Ops, Operations
faidon merged task T177403: esams rack OE10 power redundancy issues? (cp3030-9) into T177228: Multiple systems in esams OE10 showing PSU failures.
Oct 4 2017, 3:03 PM · Traffic, Operations, ops-esams
faidon added a comment to T177371: Phase out DSA keys for SSH access (ssh-dss).

We have at least another usage, the Ganeti key (cf. modules/role/manifests/ganeti.pp). This was for legacy reasons -- Ganeti didn't support RSA, but I think it does now, at least in the version in stretch (also available in jessie-backports).

Oct 4 2017, 1:54 PM · Operations

Oct 3 2017

faidon added a comment to T173311: Review check_raid_hpssacli frequency .

Would it make sense to lower the interval for all role::mariadb::core, irrespective of mysql_role, to make this a simpler target? We can take the extra hit of more frequent checks for all of those hosts, I think.

Oct 3 2017, 4:51 PM · Patch-For-Review, Operations, monitoring
faidon added a comment to T173311: Review check_raid_hpssacli frequency .

What is the high-level/human description of the policy we want to enforce for database servers? (e.g. "HP servers in the active datacenter need the check to run every 5 minutes, the rest every 10 minutes"). I ask because I'm wondering if/how any of this can be fixed with tools like role/profile classes and facts, without hardcoding specific hosts/hiera keys.
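
To make the example policy above concrete, here is a minimal Python sketch of it as a function of host facts rather than hostnames. The function name and its parameters (vendor, datacenter, active_datacenter) are hypothetical stand-ins for whatever facts or Hiera data would actually be available; the real change would live in Puppet, not Python.

```python
def raid_check_interval_minutes(vendor, datacenter, active_datacenter):
    """Return the check interval implied by the example policy.

    All inputs are assumed facts about the host; no hostnames are hardcoded.
    """
    # HP servers in the active datacenter get the more frequent check.
    if vendor == "HP" and datacenter == active_datacenter:
        return 5
    # Everything else falls back to the slower interval.
    return 10
```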

Oct 3 2017, 3:21 PM · Patch-For-Review, Operations, monitoring

Oct 2 2017

faidon added a comment to T176370: Migrate to PHP 7 in WMF production.

Taking into account the lack of funding for appserver work, as well as the end-of-year fundraising and Christmas freezes, the (tentative!) timeline I proposed is:

  • Upgrade the appserver fleet (w/ HHVM) to Debian stretch, including the ICU migration, in Q3 FY17-18 (circa February/March 2018)
  • Begin PHP7 planning and initial implementation work in Q4 FY17-18, e.g. including a few test servers
  • Fund the work in FY18-19 and complete it early in the year (Q1 or Q2 at the latest)
Oct 2 2017, 6:10 PM · TechCom-RFC (TechCom-Approved), User-ArielGlenn, NewPHP, HHVM, MediaWiki-Platform-Team, Operations
faidon assigned T136312: Encrypt syslog traffic to fgiunchedi.
Oct 2 2017, 3:47 PM · Patch-For-Review, monitoring, User-fgiunchedi, Operations