faidon (Faidon Liambotis)
Principal Operations Engineer

User Details

User Since
Oct 7 2014, 10:21 AM (145 w, 4 d)
Availability
Available
IRC Nick
paravoid
LDAP User
Faidon Liambotis
MediaWiki User
Faidon Liambotis (WMF)

Recent Activity

Yesterday

faidon added a comment to T99531: [Task] move wikiba.se webhosting to wikimedia misc-cluster.

Hi there! I saw @Ladsgroup's email to the ops list (thanks for bringing that to our attention!), so I'll respond to some of the questions he raised there -- sorry if it sounds a bit incoherent with regards to the context above :)

Fri, Jul 21, 7:30 AM · Operations, Wikidata-Sprint-2016-11-08, Wikidata
faidon added a comment to T168871: Introduct toc with page numbers during pdf post-processing.

I don't have a strong preference for either. I think the post-processing approach makes sense overall and without looking at it very closely, it seems to me like Electron (and headless Chrome) would be better bets compared to wkhtmltopdf with regards to maintainability, compatibility, security etc.

Fri, Jul 21, 6:57 AM · Reading-Web-Backlog, Reading-Web-Kanban-Board, Reading-Infrastructure-Team-Backlog (Kanban), Patch-For-Review, Electron-PDFs

Thu, Jul 20

faidon created T171188: Move the main WMCS puppetmaster into the Labs realm.
Thu, Jul 20, 4:15 PM · Cloud-Services, Operations
faidon closed T111301: pmtpa remnants in trebuchet redis as Declined.

I think we can safely decline this ahead of time by 2½ months :)

Thu, Jul 20, 2:04 PM · Trebuchet, Operations
faidon added a comment to T97635: update diamond to latest upstream version.

If you've backported it already, yeah, we can go forward I'd say :) We can leave trusty behind too, I don't see this as a big deal at all.

Thu, Jul 20, 1:47 PM · User-fgiunchedi, Operations, monitoring
faidon moved T171167: Evaluate LibreNMS' Graphite backend from Backlog to Monitoring on the netops board.
Thu, Jul 20, 1:41 PM · Patch-For-Review, User-fgiunchedi, netops, monitoring, Operations
faidon moved T171167: Evaluate LibreNMS' Graphite backend from Backlog to Up next on the monitoring board.
Thu, Jul 20, 1:41 PM · Patch-For-Review, User-fgiunchedi, netops, monitoring, Operations
faidon created T171167: Evaluate LibreNMS' Graphite backend.
Thu, Jul 20, 1:40 PM · Patch-For-Review, User-fgiunchedi, netops, monitoring, Operations
faidon closed T67478: Graph User::pingLimiter() actions in Grafana as Declined.

gdash has been retired since ~February 2016, having been replaced with Grafana.

Thu, Jul 20, 1:37 PM · monitoring, Wikimedia-Incident, Performance-Team
faidon closed T85841: apimethods gdash is broken: Shows "No data" as Resolved.

gdash was retired a year and a half ago, so…

Thu, Jul 20, 1:36 PM · Patch-For-Review, MediaWiki-API, monitoring, Performance
faidon closed T87840: Retire Torrus as Resolved.

So @godog mentioned today that we can't actually recover the Torrus data from Bacula, as these were lost forever :(

Thu, Jul 20, 1:28 PM · monitoring, Operations, Technical-Debt
faidon raised the priority of T171018: decom netmon1001 from Normal to High.
Thu, Jul 20, 1:25 PM · ops-eqiad, monitoring, Operations, hardware-requests
faidon added a project to T97635: update diamond to latest upstream version: Operations.
Thu, Jul 20, 1:24 PM · User-fgiunchedi, Operations, monitoring
faidon added a comment to T97635: update diamond to latest upstream version.

We run 4.0 on stretch systems nowadays. Would it be worthwhile to backport it to jessie and trusty? Anything that we're missing from 3.5?

Thu, Jul 20, 1:23 PM · User-fgiunchedi, Operations, monitoring
faidon removed projects from T85326: shinken.wmflabs.org redirects on https-login to http: Patch-For-Review, monitoring.
Thu, Jul 20, 1:21 PM · Upstream, Cloud-Services, Shinken
faidon removed a project from T139423: Spike: Identify most offensive issues in the DonationInterface logs: monitoring.
Thu, Jul 20, 1:20 PM · FR-2016-17-Q2-Bugs, Spike, MediaWiki-extensions-DonationInterface, Fundraising-Backlog
faidon triaged T163033: Create grafana dashboard for video scaler job runners as Low priority.
Thu, Jul 20, 1:20 PM · Operations, Multimedia, monitoring
faidon triaged T152967: Investigate usage of service dependencies in icinga as Normal priority.
Thu, Jul 20, 1:20 PM · monitoring
faidon added a comment to T152369: toolschecker fell to pieces when labs-ns0 went down.

What's needed to be done here, from whom and with what priority? (Asking because it shows up in our monitoring workboard)

Thu, Jul 20, 1:19 PM · Wikimedia-Incident, monitoring, Cloud-Services, Cloud-VPS
faidon triaged T170353: Icinga: timeseries checks should have the link to a graph with the data as Normal priority.
Thu, Jul 20, 1:17 PM · Operations, monitoring
faidon triaged T168085: tune gearman alarms as Low priority.
Thu, Jul 20, 1:17 PM · monitoring, Continuous-Integration-Infrastructure
faidon removed a project from T138110: [Epic] Clean up DonationInterface logging: monitoring.
Thu, Jul 20, 1:16 PM · Epic, MediaWiki-extensions-DonationInterface, Fundraising-Backlog
faidon removed a project from T136169: [EPIC] Improve Fundraising monitoring, alert, and high-level error handling: monitoring.
Thu, Jul 20, 1:16 PM · Epic, Fundraising-Backlog
faidon moved T171157: Monitor internal CA expirations from Backlog to Up next on the monitoring board.
Thu, Jul 20, 10:03 AM · monitoring, Operations
faidon edited projects for T171157: Monitor internal CA expirations, added: monitoring; removed community-labs-monitoring.
Thu, Jul 20, 10:03 AM · monitoring, Operations
faidon created T171157: Monitor internal CA expirations.
Thu, Jul 20, 10:03 AM · monitoring, Operations

Wed, Jul 19

faidon added a project to T158837: Consolidate performance website and related software: monitoring.
Wed, Jul 19, 6:33 PM · monitoring, Performance-Team, Operations

Fri, Jul 14

faidon edited projects for T170546: Optimize Wikipedia PNG Logo, added: Performance-Team; removed Operations, Traffic.
Fri, Jul 14, 1:59 AM · Performance-Team, Wikimedia-Site-requests

Thu, Jul 13

faidon removed a project from T164206: Icinga loses downtime entries, causing alert and page spam: Patch-For-Review.
Thu, Jul 13, 4:02 PM · Icinga, Operations, monitoring
faidon added a comment to T162946: Write Ganglia monitors for SmashPig database things.

Given that we're phasing out Ganglia, is that task moot now?

Thu, Jul 13, 2:19 PM · Fundraising-Backlog, FR-Smashpig
faidon removed a project from T162946: Write Ganglia monitors for SmashPig database things: monitoring.
Thu, Jul 13, 2:19 PM · Fundraising-Backlog, FR-Smashpig
faidon removed a project from T170307: mw2201, mw2202 - contact Dell and replace main board: monitoring.
Thu, Jul 13, 2:18 PM · Patch-For-Review, Operations, ops-codfw
faidon assigned T150651: Information missing from racktables to RobH.

The updated list of devices missing model/number can be found below.

Thu, Jul 13, 1:14 AM · Operations, DC-Ops
faidon created P5741 Devices missing vendor/model.
Thu, Jul 13, 1:09 AM · Operations

Wed, Jul 12

faidon added a comment to T161101: Performance impact evaluation of enabling nginx-lua and nginx-lua-prometheus on tlsproxy.

@ema, it seems like the task as described has been completed (awesome work and great presentation btw!). Is there anything left to be done or shall we resolve this task?

Wed, Jul 12, 5:27 PM · Patch-For-Review, Operations, monitoring, Traffic
faidon moved T162327: certspotter on einsteinium has issues talking to external from Backlog to Up next on the monitoring board.
Wed, Jul 12, 5:25 PM · Operations, monitoring
faidon closed T150160: Remote IPMI doesn't work for ~2% of the fleet as Resolved.

All of the hosts listed here and most of those in T169360 are fixed now. What isn't fixed is due to hardware troubles that are tracked separately (and it's just 5 now, instead of ~2% :). Resolving!

Wed, Jul 12, 5:24 PM · monitoring, Operations
faidon closed T169360: Unresponsive/misconfigured iDRACs over the host-BMC interface as Resolved.

So it seems like the remaining ones are:

  • labsdb100{1,3}: Ciscos, ignore (T142807)
  • mw1196: broken, to be decom'ed (T170441)
  • mw2201/2202: broken, to be replaced (T170307)
Wed, Jul 12, 5:22 PM · monitoring, Operations, ops-codfw, ops-eqiad
faidon closed T169360: Unresponsive/misconfigured iDRACs over the host-BMC interface, a subtask of T150160: Remote IPMI doesn't work for ~2% of the fleet, as Resolved.
Wed, Jul 12, 5:22 PM · monitoring, Operations
faidon updated the task description for T169360: Unresponsive/misconfigured iDRACs over the host-BMC interface.
Wed, Jul 12, 5:19 PM · monitoring, Operations, ops-codfw, ops-eqiad
faidon closed T167279: Create "network" icinga group as Resolved.

Done :)

Wed, Jul 12, 5:16 PM · Patch-For-Review, Operations, Icinga, monitoring
faidon added a comment to T170394: Cumin: add multi-query support.

I'd add another PRO to the second option (which I prefer for backwards compatibility): it allows us to be more concise.

While I do the occasional fancy query, 95% of my cumin queries are

R:class = some_class and *.some-site.wmnet

or similar. I would very much appreciate not having to wrap those around with 'P{}'. Writing cumin queries is verbose enough as it is - aliases are going to help a lot in this respect, admittedly, but still.

Wed, Jul 12, 11:56 AM · Patch-For-Review, Operations-Software-Development
faidon closed Unknown Object (Task), a subtask of T166342: New SCB nodes, as Resolved.
Wed, Jul 12, 7:31 AM · Services (watching), Operations, hardware-requests, Analytics, User-mobrovac, EventBus
faidon closed Unknown Object (Task), a subtask of T161753: eqiad: (1) hardware access request for labnodepool1002, as Resolved.
Wed, Jul 12, 7:31 AM · hardware-requests, Cloud-Services, Operations
faidon closed Unknown Object (Task), a subtask of T161766: Codfw: (2) hardware access request for labtest [region 2], as Resolved.
Wed, Jul 12, 7:30 AM · hardware-requests, Cloud-Services, Operations
faidon closed Unknown Object (Task), a subtask of T154664: codfw: (2) hardware access request for labtest, as Resolved.
Wed, Jul 12, 7:30 AM · hardware-requests, Operations

Tue, Jul 11

faidon added a comment to T150160: Remote IPMI doesn't work for ~2% of the fleet.

Chris fixed the cables for conf1003, kafka1018, kafka1020 and db1063. All fixed!

Tue, Jul 11, 7:53 PM · monitoring, Operations
faidon updated the task description for T150160: Remote IPMI doesn't work for ~2% of the fleet.
Tue, Jul 11, 7:52 PM · monitoring, Operations
faidon updated the task description for T169360: Unresponsive/misconfigured iDRACs over the host-BMC interface.
Tue, Jul 11, 7:45 PM · monitoring, Operations, ops-codfw, ops-eqiad
faidon updated the task description for T169360: Unresponsive/misconfigured iDRACs over the host-BMC interface.
Tue, Jul 11, 7:25 PM · monitoring, Operations, ops-codfw, ops-eqiad
faidon updated the task description for T169360: Unresponsive/misconfigured iDRACs over the host-BMC interface.
Tue, Jul 11, 5:18 PM · monitoring, Operations, ops-codfw, ops-eqiad
faidon updated the task description for T150160: Remote IPMI doesn't work for ~2% of the fleet.
Tue, Jul 11, 5:18 PM · monitoring, Operations
faidon added a comment to T170193: revoke eventdonations.wikimedia.org SSL cert if there is one....

Looks like it expires in September:

Validity
    Not Before: Jul 18 18:16:03 2016 GMT
    Not After : Sep  4 12:10:02 2017 GMT
Subject: C = US, ST = California, L = San Francisco, O = "Wikimedia Foundation, Inc.", CN = eventdonations.wikimedia.org
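
To put a number on "expires in September", the sketch below parses the Not After line quoted above and counts the days remaining from the date of the comment. The year 2017 for the comment date is an assumption (the feed shows only "Tue, Jul 11"), and the helper name parse_openssl_time is hypothetical. Month-name parsing with %b assumes an English/C locale.

```python
from datetime import date, datetime


def parse_openssl_time(text: str) -> datetime:
    """Parse a timestamp in the 'Sep  4 12:10:02 2017 GMT' style shown above."""
    # Collapse the double space that pads single-digit days.
    normalized = " ".join(text.split())
    return datetime.strptime(normalized, "%b %d %H:%M:%S %Y GMT")


not_after = parse_openssl_time("Sep  4 12:10:02 2017 GMT")

# Assumption: the comment was written on Jul 11, 2017 (year not shown in the feed).
comment_date = date(2017, 7, 11)

days_left = (not_after.date() - comment_date).days
print(f"Certificate expires in {days_left} days")
```

Comparing calendar dates rather than full timestamps sidesteps the unknown timezone of the feed's "4:17 PM".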
Tue, Jul 11, 4:17 PM · Patch-For-Review, Domains, Traffic, Operations, fundraising-tech-ops
faidon renamed T169360: Unresponsive/misconfigured iDRACs over the host-BMC interface from Unresponsive/misconfigured iDRACs to Unresponsive/misconfigured iDRACs over the host-BMC interface.
Tue, Jul 11, 12:39 AM · monitoring, Operations, ops-codfw, ops-eqiad
faidon updated the task description for T169360: Unresponsive/misconfigured iDRACs over the host-BMC interface.
Tue, Jul 11, 12:30 AM · monitoring, Operations, ops-codfw, ops-eqiad
faidon updated the task description for T169360: Unresponsive/misconfigured iDRACs over the host-BMC interface.
Tue, Jul 11, 12:26 AM · monitoring, Operations, ops-codfw, ops-eqiad
faidon added a comment to T169360: Unresponsive/misconfigured iDRACs over the host-BMC interface.

I ran racreset on all of the ones in the list whose IP configuration was inconsistent with the output (showing 192.168.0.1 as gateway), and they're all fixed now.

Tue, Jul 11, 12:24 AM · monitoring, Operations, ops-codfw, ops-eqiad
faidon updated the task description for T169360: Unresponsive/misconfigured iDRACs over the host-BMC interface.
Tue, Jul 11, 12:21 AM · monitoring, Operations, ops-codfw, ops-eqiad

Mon, Jul 10

faidon updated the task description for T150160: Remote IPMI doesn't work for ~2% of the fleet.
Mon, Jul 10, 11:38 PM · monitoring, Operations
faidon updated the task description for T150160: Remote IPMI doesn't work for ~2% of the fleet.
Mon, Jul 10, 11:33 PM · monitoring, Operations
faidon added a comment to T150160: Remote IPMI doesn't work for ~2% of the fleet.

So I did the following:

  • mw1302: had Volatile_Channel_Privilege_Limit and Non_Volatile_Channel_Privilege_Limit set to Operator instead of Administrator; fixed with bmc-config
  • stat1003: had wrong DNS, fixed that
  • a bunch of the rest had the issue that I described in T160392 (IPMI password had gotten out of sync with iDRAC password); fixed with sshpass -e ssh root@$hostname racadm config -g cfgUserAdmin -o cfgUserAdminPassword -i 2 $password
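
For the last fix, a minimal dry-run sketch of how the same command could be assembled for several hosts. It only builds and prints the command line quoted above and does not execute anything. The host names and the function name build_password_sync_command are hypothetical. sshpass -e reads the SSH password from the SSHPASS environment variable, so the only password on the command line is the one handed to racadm.

```python
import shlex
from typing import List


def build_password_sync_command(hostname: str, password: str) -> List[str]:
    """Return the argv for the racadm password reset quoted in the comment."""
    return [
        "sshpass", "-e",                # SSH password comes from $SSHPASS
        "ssh", f"root@{hostname}",
        "racadm", "config",
        "-g", "cfgUserAdmin",
        "-o", "cfgUserAdminPassword",
        "-i", "2",                      # user index 2, as in the original command
        password,
    ]


# Hypothetical host list, for illustration only.
hosts = ["host1001.mgmt.example", "host1002.mgmt.example"]

for host in hosts:
    argv = build_password_sync_command(host, "PLACEHOLDER")
    # Dry run: print instead of running, so nothing is changed on any BMC.
    print(shlex.join(argv))
```

Note that ssh joins the remote arguments into a single string, so a password containing spaces or shell metacharacters would need quoting that this sketch does not attempt.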
Mon, Jul 10, 11:27 PM · monitoring, Operations
faidon closed T158893: dbstore1001 troubleshoot IPMI issue as Resolved.
Mon, Jul 10, 10:36 PM · DBA, Operations
faidon closed T158893: dbstore1001 troubleshoot IPMI issue, a subtask of T150160: Remote IPMI doesn't work for ~2% of the fleet, as Resolved.
Mon, Jul 10, 10:36 PM · monitoring, Operations
faidon added a comment to T158893: dbstore1001 troubleshoot IPMI issue.

Same issue as T160392. From the iDRAC web interface, I set the password to something random then back to our password and this seems to have done the trick.

Mon, Jul 10, 10:36 PM · DBA, Operations
faidon closed T160392: Reset db1070 idrac as Resolved.

OK, so I noticed that the Error: Unable to establish IPMI v2 / RMCP+ session response was immediate, like the password was wrong. So I tried changing the password to something else from the iDRAC web interface, and then changing it back to our regular one, and this seems to have done the trick for both db1070 and db1071.

Mon, Jul 10, 10:32 PM · Patch-For-Review, DBA, ops-eqiad, Operations
faidon closed T160392: Reset db1070 idrac, a subtask of T137191: Defragment db1070, db1082, db1087, db1092, as Resolved.
Mon, Jul 10, 10:32 PM · Patch-For-Review, DBA
faidon closed T160392: Reset db1070 idrac, a subtask of T150160: Remote IPMI doesn't work for ~2% of the fleet, as Resolved.
Mon, Jul 10, 10:32 PM · monitoring, Operations
faidon added a comment to T160392: Reset db1070 idrac.

FYI, db1071 is in a similar state, I'm not sure why.

Mon, Jul 10, 6:33 PM · Patch-For-Review, DBA, ops-eqiad, Operations
faidon closed T155690: troubleshoot drac on ms-be2010.codfw.wmnet as Declined.

ms-be2010 is decom'ed now, resolving.

Mon, Jul 10, 6:11 PM · ops-codfw, Operations
faidon closed T155690: troubleshoot drac on ms-be2010.codfw.wmnet, a subtask of T150160: Remote IPMI doesn't work for ~2% of the fleet, as Declined.
Mon, Jul 10, 6:11 PM · monitoring, Operations
faidon added a subtask for T150160: Remote IPMI doesn't work for ~2% of the fleet: T169360: Unresponsive/misconfigured iDRACs over the host-BMC interface.
Mon, Jul 10, 6:11 PM · monitoring, Operations
faidon added a parent task for T169360: Unresponsive/misconfigured iDRACs over the host-BMC interface: T150160: Remote IPMI doesn't work for ~2% of the fleet.
Mon, Jul 10, 6:11 PM · monitoring, Operations, ops-codfw, ops-eqiad
faidon closed T104258: Create instrumentation to monitor load on geoiplookup.wikimedia.org as Resolved.

Long resolved, geoiplookup doesn't exist anymore (T100902).

Mon, Jul 10, 3:57 PM · monitoring, Operations
faidon reassigned T87840: Retire Torrus from akosiaris to fgiunchedi.
Mon, Jul 10, 3:19 PM · monitoring, Operations, Technical-Debt
faidon edited projects for T150160: Remote IPMI doesn't work for ~2% of the fleet, added: monitoring; removed Patch-For-Review.
Mon, Jul 10, 3:14 PM · monitoring, Operations
faidon removed a project from T162629: Admin request for user paladox and Luke081515 in the project shinken: monitoring.
Mon, Jul 10, 3:04 PM · Shinken, Cloud-Services
faidon moved T167279: Create "network" icinga group from Backlog to In progress on the monitoring board.
Mon, Jul 10, 3:00 PM · Patch-For-Review, Operations, Icinga, monitoring
faidon moved T170150: Evaluate Grafana's LDAP group options and deprecate grafana-admin if possible from Backlog to Up next on the monitoring board.
Mon, Jul 10, 2:12 PM · monitoring, Operations
faidon created T170150: Evaluate Grafana's LDAP group options and deprecate grafana-admin if possible.
Mon, Jul 10, 2:11 PM · monitoring, Operations
faidon added a project to T167279: Create "network" icinga group: Operations.
Mon, Jul 10, 1:17 PM · Patch-For-Review, Operations, Icinga, monitoring
faidon moved T170144: Evaluate NetBox as a Racktables replacement & IPAM from Backlog to Up next on the monitoring board.
Mon, Jul 10, 1:17 PM · netops, monitoring, Operations
faidon created T170144: Evaluate NetBox as a Racktables replacement & IPAM.
Mon, Jul 10, 1:16 PM · netops, monitoring, Operations
faidon moved T151632: Fix Icinga checks for test/decom servers from Backlog to Up next on the monitoring board.
Mon, Jul 10, 1:10 PM · monitoring, Operations
faidon edited projects for T151632: Fix Icinga checks for test/decom servers, added: monitoring; removed Patch-For-Review.

What's left to be done here, @Dzahn?

Mon, Jul 10, 1:10 PM · monitoring, Operations
faidon archived Prometheus-metrics-monitoring.
Mon, Jul 10, 1:08 PM
faidon added a project to T152445: Move prometheus entry point off port 80: monitoring.
Mon, Jul 10, 1:08 PM · monitoring, Prometheus-metrics-monitoring, Operations
faidon added a project to T143896: MySQL monitoring with prometheus: monitoring.
Mon, Jul 10, 1:08 PM · monitoring, DBA, Patch-For-Review, Operations, Prometheus-metrics-monitoring
faidon added a project to T145072: Create a script to regenerate prometheus mysqld exporter listing that works with puppetdb: monitoring.
Mon, Jul 10, 1:07 PM · monitoring, DBA, Operations, Prometheus-metrics-monitoring
faidon added a project to T160677: Effects on adjusting Prometheus retention: monitoring.
Mon, Jul 10, 1:06 PM · monitoring, User-fgiunchedi, Operations, Prometheus-metrics-monitoring
faidon moved T1075: Audit groups of metrics in Graphite that allocate a lot of disk space from Backlog to In progress on the monitoring board.
Mon, Jul 10, 1:06 PM · monitoring, User-fgiunchedi, Operations, Graphite
faidon added a project to T1075: Audit groups of metrics in Graphite that allocate a lot of disk space: monitoring.
Mon, Jul 10, 1:06 PM · monitoring, User-fgiunchedi, Operations, Graphite
faidon set the icon for Graphite to Tag.
Mon, Jul 10, 1:04 PM
faidon edited projects for T101141: udp rcvbuferrors and inerrors on graphite1001, added: monitoring; removed MW-1.27-release (WMF-deploy-2016-04-05_(1.27.0-wmf.20)), MW-1.27-release (WMF-deploy-2016-04-26_(1.27.0-wmf.22)), Patch-For-Review.
Mon, Jul 10, 1:03 PM · monitoring, MW-1.27-release-notes, MW-1.27-release (WMF-deploy-2016-04-12_(1.27.0-wmf.21)), Operations, Graphite
faidon added a project to T136312: encrypt syslog traffic: monitoring.
Mon, Jul 10, 1:02 PM · monitoring, User-fgiunchedi, Operations
faidon added a project to T126989: MediaWiki logging & encryption: monitoring.
Mon, Jul 10, 1:01 PM · monitoring, Wikimedia-Logstash, MediaWiki-Debug-Logger, Operations
faidon moved T133110: Check for an oversized exim4 queue indicating mail delivery failures from Backlog to In progress on the monitoring board.
Mon, Jul 10, 1:00 PM · Patch-For-Review, Operations, monitoring
faidon moved T165348: Check long-running screen/tmux sessions from Backlog to Up next on the monitoring board.
Mon, Jul 10, 12:59 PM · monitoring, Operations
faidon added a project to T165348: Check long-running screen/tmux sessions: monitoring.
Mon, Jul 10, 12:59 PM · monitoring, Operations
faidon moved T125205: Monitor hardware thermal issues from Backlog to Up next on the monitoring board.
Mon, Jul 10, 12:58 PM · Operations, monitoring
faidon lowered the priority of T125205: Monitor hardware thermal issues from High to Normal.

So the IPMI checks have been deployed for a while. Quite a few hosts had BMC issues (some of them are fixed), and it remains to be seen whether the IPMI checks are going to be reliable enough for our uses.

Mon, Jul 10, 12:57 PM · Operations, monitoring