lmata (Leo Mata)
Manager SRE Observability

User Details

User Since
May 14 2020, 7:26 PM (48 w, 6 d)
Availability
Available
IRC Nick
lmata
LDAP User
LMata
MediaWiki User
LMata (WMF) [ Global Accounts ]

Recent Activity

Tue, Apr 20

lmata added a comment to T141038: implement paging for non-ops teams.

Do we want to keep the same scope for Icinga, or consider our other paging tools?

Tue, Apr 20, 4:33 PM · observability, Icinga, SRE

Thu, Apr 15

lmata added a comment to T280242: Requesting access to graphite hosts for awight.

Looks good to me, approved, thanks @MoritzMuehlenhoff !

Thu, Apr 15, 12:44 PM · SRE, SRE-Access-Requests, Graphite, observability

Thu, Mar 25

lmata moved T276697: Implement central logging for mailman3 from Radar to Inbox on the observability board.

Sure thing @Legoktm, will discuss with the team and share notes here.

Thu, Mar 25, 12:57 PM · Patch-For-Review, observability, SRE, Wikimedia-Mailing-lists

Mar 23 2021

lmata added a comment to T240685: MediaWiki Prometheus support.

@AMooney yes please

Mar 23 2021, 2:49 PM · Platform Team Workboards (External Code Reviews), Patch-For-Review, serviceops, SRE, MediaWiki-General, observability

Mar 22 2021

lmata moved T276468: Unable to exclude "error" field in Logstash from Inbox to Backlog on the observability board.
Mar 22 2021, 3:35 PM · observability, Wikimedia-Logstash
lmata moved T276492: Notifications when prometheus daemons are wedged from Inbox to Radar on the observability board.
Mar 22 2021, 3:34 PM · observability, Discovery-Search
lmata added a comment to T276492: Notifications when prometheus daemons are wedged.

Hello @EBernhardson, moving to radar for now, please let us know how you'd like to proceed and if you need assistance. thanks!

Mar 22 2021, 3:34 PM · observability, Discovery-Search
lmata moved T276623: Convert udp2log init script to use systemd from Backlog to Radar on the observability board.
Mar 22 2021, 3:33 PM · Patch-For-Review, observability, SRE
lmata moved T276623: Convert udp2log init script to use systemd from Radar to Backlog on the observability board.
Mar 22 2021, 3:32 PM · Patch-For-Review, observability, SRE
lmata moved T276623: Convert udp2log init script to use systemd from Inbox to Radar on the observability board.
Mar 22 2021, 3:30 PM · Patch-For-Review, observability, SRE
lmata moved T277445: Hourly log rotation for large MW logs from Inbox to Backlog on the observability board.
Mar 22 2021, 3:29 PM · Developer Productivity, Platform Team Workboards (Clinic Duty Team), observability
lmata assigned T277445: Hourly log rotation for large MW logs to herron.
Mar 22 2021, 3:29 PM · Developer Productivity, Platform Team Workboards (Clinic Duty Team), observability
lmata added a comment to T228838: Consider enabling all MW log channels by default for WMF.

@thcipriani would it be helpful to set a time to chat about this further? I don't know if there is an immediate plan to move MW to ECS, but let's discuss the options available and see if there is a suitable path forward.

Mar 22 2021, 3:28 PM · Release-Engineering-Team (Radar), observability, Platform Engineering (Icebox), Developer Productivity, MediaWiki-Debug-Logger
lmata moved T277739: rsyslog-kubernetes missing in buster-wikimedia from Inbox to Radar on the observability board.
Mar 22 2021, 3:22 PM · SRE, observability
lmata added a comment to T277927: Add monitoring for performance.wikimedia.org.

hi @Legoktm let us (o11y) know if you need some help!

Mar 22 2021, 3:19 PM · observability, SRE, Performance-Team
lmata moved T277927: Add monitoring for performance.wikimedia.org from Inbox to Radar on the observability board.
Mar 22 2021, 3:19 PM · observability, SRE, Performance-Team

Mar 16 2021

lmata added a project to T240685: MediaWiki Prometheus support: Platform Team Workboards (Clinic Duty Team).

Hi @AMooney, I'd like to present this patch as the other of the two I was hoping to bring to your attention for next clinic duty... Please let me know if/how to proceed. thanks!

Mar 16 2021, 6:39 PM · Platform Team Workboards (External Code Reviews), Patch-For-Review, serviceops, SRE, MediaWiki-General, observability
lmata added a project to T269676: Mediawiki logging indexing conflict on 'status' for 'authevents': Platform Team Workboards (Clinic Duty Team).

Greetings @AMooney, this patch is one of the two I was hoping to bring to your attention for next clinic duty... This one is for some changes around logging and trying out the new clinic workflow regarding the "happy path" for these types of patches. Please let me know if/how to proceed. thanks!

Mar 16 2021, 6:35 PM · MW-1.36-notes, MW-1.37-notes (1.37.0-wmf.3; 2021-04-27), Platform Team Workboards (External Code Reviews), Patch-For-Review, MW-1.35-notes, observability, MediaWiki-General

Mar 15 2021

lmata moved T276972: Set up cross DC topic mirroring for Kafka logging clusters from Radar to Backlog on the SRE board.
Mar 15 2021, 4:24 PM · Analytics-Radar, observability, SRE
lmata moved T276972: Set up cross DC topic mirroring for Kafka logging clusters from Inbox to Radar on the observability board.
Mar 15 2021, 4:24 PM · Analytics-Radar, observability, SRE
lmata moved T276972: Set up cross DC topic mirroring for Kafka logging clusters from Backlog to Radar on the SRE board.
Mar 15 2021, 4:23 PM · Analytics-Radar, observability, SRE
lmata triaged T277163: Prometheus PoPs disk space utilization as Medium priority.

Moving to short term backlog

Mar 15 2021, 4:20 PM · User-fgiunchedi, observability
lmata added a comment to T277445: Hourly log rotation for large MW logs.

hi @tstarling we can help, how would you like to proceed?

Mar 15 2021, 4:17 PM · Developer Productivity, Platform Team Workboards (Clinic Duty Team), observability

Mar 8 2021

lmata triaged T276303: logmsgbot auth issues as Medium priority.
Mar 8 2021, 4:32 PM · observability
lmata moved T276501: Pontoon enroll fails to complete from Inbox to In progress on the observability board.
Mar 8 2021, 4:22 PM · observability
lmata moved T276595: Upgrade prometheus-jmx-exporter from Inbox to In progress on the observability board.
Mar 8 2021, 4:22 PM · Analytics-Clusters, wdwb-tech, SRE, Wikidata, Wikidata-Query-Service, CirrusSearch, observability
lmata updated subscribers of T276623: Convert udp2log init script to use systemd.

@herron this might be worth looking into as part of the mwlog buster upgrade

Mar 8 2021, 4:19 PM · Patch-For-Review, observability, SRE
lmata moved T276697: Implement central logging for mailman3 from Inbox to Radar on the observability board.
Mar 8 2021, 4:17 PM · Patch-For-Review, observability, SRE, Wikimedia-Mailing-lists
lmata moved T276749: Flapping Prometheus metrics for netbox_device_statistics from Inbox to Radar on the observability board.
Mar 8 2021, 4:16 PM · observability, netbox
lmata moved T276792: Remove cloud contacts from legacy paging from Inbox to In progress on the observability board.
Mar 8 2021, 4:15 PM · cloud-services-team (Kanban), User-fgiunchedi, observability

Feb 22 2021

lmata created T275405: Logstash collector nodes hang indefinitely on reboot.
Feb 22 2021, 4:34 PM · Patch-For-Review, observability
lmata added a comment to T274987: Review and purge deprecated Graphite metrics for CodeMirror.

Hello @awight, could you let me know the level of assistance you'd like with this task, or if it's just here for information purposes? Thanks!

Feb 22 2021, 4:21 PM · WMDE-TechWish, observability, WMDE-Templates-FocusArea

Feb 16 2021

lmata added a comment to T273450: Purge and migrate deprecated metrics paths.

howdy @awight saw some chatter around this on the #wikimedia-sre-observability channel and am wondering if there is still input you would like from the team on this matter. Thanks!

Feb 16 2021, 4:33 PM · Epic, WMDE-TechWish (Sprint-2021-02-03), WMDE-Templates-FocusArea
lmata updated the image for observability from F34107832: profile to F34107835: profile.
Feb 16 2021, 3:52 PM
lmata updated the image for observability from F34107824: profile to F34107832: profile.
Feb 16 2021, 3:51 PM
lmata updated the image for observability from F34107816: profile to F34107824: profile.
Feb 16 2021, 3:49 PM
lmata updated the image for observability from F8447740: profile to F34107816: profile.
Feb 16 2021, 3:49 PM

Feb 12 2021

lmata moved T274665: Design and implement SLO Dashboard tooling from Inbox to In progress on the observability board.
Feb 12 2021, 6:11 PM · observability
lmata created T274665: Design and implement SLO Dashboard tooling.
Feb 12 2021, 4:25 PM · observability

Feb 2 2021

lmata created T273641: Security Issue Access Request for (lmata).
Feb 2 2021, 4:32 PM · SecTeam-Processed, Security-Team, Security

Feb 1 2021

lmata added a comment to T265876: Logging options for apache httpd in k8s.

noted @Joe! I'll reach out to you to coordinate a time to talk with the team.

Feb 1 2021, 5:44 PM · observability, SRE, serviceops, MW-on-K8s
lmata moved T265876: Logging options for apache httpd in k8s from Backlog to Inbox on the observability board.
Feb 1 2021, 4:16 PM · observability, SRE, serviceops, MW-on-K8s

Jan 25 2021

lmata closed T141520: "MediaWiki exceptions and fatals per minute" alarm is too slow (half an hour delay!) as Resolved.

3M delay seems like a short but acceptable window for alerting. If there is a need to shorten this, we can discuss. Closing this ticket; please reopen if you'd like to revisit the conversation.

Jan 25 2021, 4:50 PM · Release-Engineering-Team (Deployment services), Release-Engineering-Team-TODO, SRE, observability
lmata moved T265876: Logging options for apache httpd in k8s from Inbox to Backlog on the observability board.
Jan 25 2021, 4:23 PM · observability, SRE, serviceops, MW-on-K8s
lmata added a comment to T271138: Some Observability clusters apparently do not support IPv6..

Is there a specific timeline you'd like us to meet with this? Mainly the goal is to understand urgency for prioritization. Thanks!

Jan 25 2021, 4:23 PM · IPv6, User-crusnov, observability, SRE-tools
lmata moved T271298: Add Icinga check for SRX cluster status from Inbox to Radar on the observability board.

Hi Arzhel,

Jan 25 2021, 4:20 PM · netops, SRE, observability
lmata moved T271822: Add support for scraping php applications to the kubernetes prometheus scraper from Inbox to Radar on the observability board.

Hi Joe,

Jan 25 2021, 4:17 PM · observability, MW-on-K8s, serviceops, SRE

Dec 14 2020

lmata added a project to T269937: Investigate how to aggregate Wikibase Timeout errors by their api-action or special page: observability.
Dec 14 2020, 4:23 PM · observability, Wikimedia-Logstash, Wikidata Infrastructure Reliability Sprint Dec 2020
lmata moved T269941: Investigate how to get data from logstash to Grafana for Timeout and Out of Memory errors from Inbox to Radar on the observability board.
Dec 14 2020, 4:21 PM · observability, Wikimedia-Logstash, Wikidata Infrastructure Reliability Sprint Dec 2020
lmata added a project to T269941: Investigate how to get data from logstash to Grafana for Timeout and Out of Memory errors : observability.
Dec 14 2020, 4:21 PM · observability, Wikimedia-Logstash, Wikidata Infrastructure Reliability Sprint Dec 2020

Dec 7 2020

lmata moved T269272: Sign-in links from Grafana dashboards don't work when not signed into SSO from Inbox to Backlog on the observability board.
Dec 7 2020, 4:21 PM · Patch-For-Review, User-fgiunchedi, Performance-Team (Radar), CAS-SSO, observability, SRE
lmata moved T269333: Switch default Grafana datasource to Thanos from Inbox to Backlog on the observability board.
Dec 7 2020, 4:20 PM · observability
lmata assigned T269560: Increased icinga check latency since 05/12 to colewhite.
Dec 7 2020, 4:16 PM · SRE, observability
lmata moved T269563: HP RAID failed on ms-be1054 didn't open a task from Inbox to Radar on the observability board.
Dec 7 2020, 4:13 PM · SRE, SRE-tools, observability

Nov 30 2020

lmata assigned T266570: Two close pages for idle workers api + appserver didn't auto-resolve on recovery to herron.
Nov 30 2020, 4:32 PM · observability, SRE
lmata closed T266800: VictorOps ~5min delay from email received to incident paging as Resolved.

Closing for now; we can reopen if we see another occurrence of this event.

Nov 30 2020, 4:30 PM · observability, SRE
lmata moved T268369: how to deal with cumin alias alerts from Inbox to Radar on the observability board.
Nov 30 2020, 4:27 PM · SRE-tools, observability, SRE
lmata moved T268806: ELK: uniquely identify network syslog from Inbox to Backlog on the observability board.
Nov 30 2020, 4:27 PM · observability
lmata moved T268995: Add alertmanager@ email user/alias or equivalent from Inbox to In progress on the observability board.
Nov 30 2020, 4:24 PM · User-fgiunchedi, observability
lmata moved T269000: thanos: 404 error trying to fetch js library from Inbox to Backlog on the observability board.
Nov 30 2020, 4:22 PM · SRE, observability

Nov 23 2020

lmata moved T268091: Capture usage metrics for Kibana saved objects from Inbox to Backlog on the observability board.
Nov 23 2020, 4:21 PM · observability
lmata moved T268233: thanos u/i gives errors if left idle for a few hours from Inbox to In progress on the observability board.
Nov 23 2020, 4:21 PM · CAS-SSO, observability, SRE
lmata moved T268282: Kibana deprecation warnings on startup from Radar to Backlog on the observability board.
Nov 23 2020, 4:20 PM · observability
lmata moved T268282: Kibana deprecation warnings on startup from Inbox to Radar on the observability board.
Nov 23 2020, 4:20 PM · observability
lmata moved T268355: cronspam from prometheus-directory-size (on labstore1004) from Inbox to Radar on the observability board.
Nov 23 2020, 4:19 PM · cloud-services-team (Kanban), observability, SRE
lmata moved T268355: cronspam from prometheus-directory-size (on labstore1004) from Backlog to Radar on the SRE board.
Nov 23 2020, 4:19 PM · cloud-services-team (Kanban), observability, SRE
lmata added a project to T268369: how to deal with cumin alias alerts: SRE-tools.
Nov 23 2020, 4:18 PM · SRE-tools, observability, SRE

Nov 16 2020

lmata moved T267901: SMART data dump healthy metric can contain None from Inbox to Backlog on the observability board.
Nov 16 2020, 4:21 PM · observability
lmata moved T267664: Enhance smart_data_dump to support gathering metrics from both raid and standalone disks from Inbox to Backlog on the observability board.
Nov 16 2020, 4:21 PM · observability
lmata moved T267660: Add ssacli support to smart_data_dump from Inbox to Backlog on the observability board.
Nov 16 2020, 4:20 PM · observability
lmata moved T267650: LibreNMS supports more than one Alertmanager address from Inbox to Backlog on the observability board.
Nov 16 2020, 4:20 PM · Upstream, User-fgiunchedi, observability
lmata moved T267645: Wrong redirect when logging into grafana-rw from a grafana.w.o dashboard from Inbox to In progress on the observability board.
Nov 16 2020, 4:19 PM · User-fgiunchedi, observability, SRE
lmata moved T265435: codfw: Testing Out Sample PDUs from Inbox to Radar on the observability board.
Nov 16 2020, 4:18 PM · observability, ops-codfw, DC-Ops, SRE

Nov 9 2020

lmata moved T267019: Alert design guidelines for teams are produced from Inbox to In progress on the observability board.
Nov 9 2020, 4:18 PM · observability
lmata moved T267176: alert on too many close-to-saturated appservers / apiservers from Inbox to Radar on the observability board.
Nov 9 2020, 4:16 PM · Patch-For-Review, User-jijiki, serviceops, observability, SRE
lmata moved T267186: alerts.w.o / idp.w.o interaction and CORS from Inbox to Radar on the observability board.
Nov 9 2020, 4:16 PM · CAS-SSO, Patch-For-Review, observability
lmata moved T267018: LibreNMS sends its alerts to Alertmanager, resulting in email notifications to network operations from Inbox to In progress on the observability board.
Nov 9 2020, 4:15 PM · SRE, netops, User-fgiunchedi, observability
lmata moved T266535: Silencing alerts through the alerts dashboard is supported and functional from Inbox to In progress on the observability board.
Nov 9 2020, 4:15 PM · User-fgiunchedi, observability
lmata moved T266515: Set ENV SERVERGROUP for jobrunner MW web requests from Inbox to Radar on the observability board.

Please let us know if there is anything we can assist with... moving to radar meanwhile.

Nov 9 2020, 4:14 PM · Platform Team Workboards (External Code Reviews), Developer Productivity, serviceops, observability

Nov 3 2020

lmata added a comment to T265876: Logging options for apache httpd in k8s.

Just dropping a quick update here, we should schedule some time to review options. Had a brief exchange with @akosiaris and we'll get the team together for a discussion on proposed paths and collaboration.

Nov 3 2020, 5:51 PM · observability, SRE, serviceops, MW-on-K8s

Oct 26 2020

lmata moved T266019: Several days of metrics not uploaded to Thanos object storage from Prometheus on PoPs from Backlog to In progress on the observability board.
Oct 26 2020, 3:54 PM · User-fgiunchedi, observability
lmata assigned T266019: Several days of metrics not uploaded to Thanos object storage from Prometheus on PoPs to herron.
Oct 26 2020, 3:54 PM · User-fgiunchedi, observability
lmata added a comment to T265938: Create a separate logstash ElasticSearch index for schemaed events.

We've been working on a consolidated logging schema that might prove to be very helpful for this particular task. We'd love to talk to you about it; what is the best way to do this? We can set up a meeting or just share details in this Phab task. Thanks!

Oct 26 2020, 3:47 PM · Wikimedia-Logstash, observability, Analytics, Product-Data-Infrastructure
lmata moved T265938: Create a separate logstash ElasticSearch index for schemaed events from Inbox to Radar on the observability board.
Oct 26 2020, 3:47 PM · Wikimedia-Logstash, observability, Analytics, Product-Data-Infrastructure
lmata removed a project from T221904: swift backend decomms / rebalances are noisy: observability.

I'm going to untag Observability for now, as this is more Swift related and less o11y related. :-) If this changes, please retag.

Oct 26 2020, 3:45 PM · Patch-For-Review, User-fgiunchedi, SRE-swift-storage, SRE
lmata moved T263423: librenms page didn't auto-resolve in VO from Inbox to Backlog on the observability board.
Oct 26 2020, 3:42 PM · SRE, observability
lmata moved T265590: ulog: filter out diffscan from ulog from Inbox to In progress on the observability board.
Oct 26 2020, 3:41 PM · observability, Security, SRE, User-jbond
lmata moved T265649: PuppetDB grafana graphs not matching logs from Inbox to Radar on the observability board.
Oct 26 2020, 3:41 PM · User-jbond, SRE, observability, Puppet
lmata set Due Date to Dec 9 2020, 4:00 PM on T266019: Several days of metrics not uploaded to Thanos object storage from Prometheus on PoPs.
Oct 26 2020, 3:33 PM · User-fgiunchedi, observability
lmata moved T266017: Implement alerting roadmap phase 2 from Inbox to In progress on the observability board.
Oct 26 2020, 3:31 PM · Patch-For-Review, User-fgiunchedi, observability
lmata moved T266019: Several days of metrics not uploaded to Thanos object storage from Prometheus on PoPs from Inbox to Backlog on the observability board.
Oct 26 2020, 3:31 PM · User-fgiunchedi, observability
lmata moved T266216: Increase visibility of container/pod ressource exhaustion from Inbox to Radar on the observability board.
Oct 26 2020, 3:30 PM · observability, serviceops, Prod-Kubernetes, Kubernetes

Oct 20 2020

lmata updated lmata.
Oct 20 2020, 4:16 AM
lmata updated lmata.
Oct 20 2020, 4:16 AM

Oct 19 2020

lmata moved T263103: Compress graphite carbon-cache log files from Inbox to Backlog on the observability board.
Oct 19 2020, 3:36 PM · Patch-For-Review, observability
lmata moved T263027: Missing 'notify' for some Icinga configuration files from Inbox to Backlog on the observability board.
Oct 19 2020, 3:35 PM · SRE, observability
lmata added a comment to T184086: Add prometheus exporter to Gerrit.

Moving to radar, but will probably close eventually as the GitLab move progresses.

Oct 19 2020, 3:34 PM · Release-Engineering-Team (Radar), Patch-For-Review, observability, Gerrit, SRE
lmata moved T184086: Add prometheus exporter to Gerrit from Inbox to Radar on the observability board.
Oct 19 2020, 3:33 PM · Release-Engineering-Team (Radar), Patch-For-Review, observability, Gerrit, SRE
lmata moved T182759: Add Prometheus exporter to Jenkins instances from Inbox to Radar on the observability board.
Oct 19 2020, 3:32 PM · Release-Engineering-Team (Seen), observability, Continuous-Integration-Infrastructure, User-fgiunchedi, Goal, SRE
lmata moved T209709: Feature: enable prometheus-nginx-exporter for nginx metrics from Inbox to Backlog on the observability board.
Oct 19 2020, 3:32 PM · observability