fgiunchedi (Filippo Giunchedi)
/* No comment */

Projects (19)

User Details

User Since
Oct 3 2014, 8:06 AM (270 w, 12 h)
Availability
Available
IRC Nick
godog
LDAP User
Filippo Giunchedi
MediaWiki User
Filippo Giunchedi [ Global Accounts ]

Recent Activity

Yesterday

fgiunchedi removed a project from T191659: Configure a threshold for earlier notification of /srv/cassandra/instance-data: User-fgiunchedi.
Thu, Dec 5, 1:35 PM · Core Platform Team Legacy (Later), Patch-For-Review, Operations, Services (next), RESTBase-Cassandra, User-Eevans, Cassandra
fgiunchedi created T239907: mtail stuck on some mw hosts.
Thu, Dec 5, 1:26 PM · observability
fgiunchedi archived P9823 Masterwork From Distant Lands.
Thu, Dec 5, 1:25 PM
fgiunchedi added a comment to T236573: "etcd" Cloud VPS project jessie deprecation.

For what it's worth, I have no usage of these machines nor the project.

Thu, Dec 5, 11:16 AM · Cloud-VPS (Debian Jessie Deprecation)
fgiunchedi added a comment to T174432: Unclear LVS bandwidth graph in "load balancers" dashboard.

Are the non-icmp graphs somehow LVS-specific?

Yes, the metrics are: node_ipvs_backend_connections_active, node_ipvs_incoming_packets_total, node_ipvs_incoming_bytes_total. The icmp graph instead plots node_netstat_Icmp_InMsgs.
The text panel @fgiunchedi added is correct, so I guess that should be enough to clarify the ambiguity? Alternatively, we could move the ICMP graphs to a new dashboard with host-specific metrics only.

Thu, Dec 5, 10:37 AM · Traffic, Operations
fgiunchedi closed T236700: 'LVS connections' graph on Load Balancers dashboard takes a rate of a gauge as Resolved.

Fixed now and 'load balancers' dashboard adjusted

Thu, Dec 5, 10:35 AM · Traffic, Operations, observability
fgiunchedi added a comment to T226373: Swift object servers become briefly unresponsive on a regular basis.

I've investigated the scope and impact of this issue a bit, namely by joining the transaction IDs for which swift reported ConnectionTimeout in server.log with swift proxy-access.log. The idea is to see what swift sent back to ATS and with what latency.
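
A minimal sketch of that kind of join, in Python. The transaction ID pattern and the sample log lines are assumptions for illustration only; the real server.log and proxy-access.log layouts are not shown in this comment, so the helper returns the raw matching proxy lines rather than parsing status or latency out of them.

```python
import re

# Assumption: transaction IDs look like "tx<hex>-<hex>". Adjust the pattern
# to whatever the actual logs contain.
TXID_RE = re.compile(r"\btx[0-9a-f]+-[0-9a-f]+")


def timed_out_txids(server_lines):
    """Collect transaction IDs from lines that mention ConnectionTimeout."""
    ids = set()
    for line in server_lines:
        if "ConnectionTimeout" in line:
            match = TXID_RE.search(line)
            if match:
                ids.add(match.group(0))
    return ids


def join_with_proxy(server_lines, proxy_lines):
    """Return (txid, proxy line) pairs for transactions that timed out."""
    wanted = timed_out_txids(server_lines)
    matches = []
    for line in proxy_lines:
        match = TXID_RE.search(line)
        if match and match.group(0) in wanted:
            matches.append((match.group(0), line.rstrip("\n")))
    return matches
```

Both arguments can be open file objects, so the join runs in one pass over each log while keeping only the set of timed-out IDs in memory.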

Thu, Dec 5, 10:21 AM · Performance-Team (Radar), User-jijiki, serviceops, Patch-For-Review, SRE-swift-storage, Operations
fgiunchedi moved T237438: (Need By 8/15/19) rack/setup/install ms-be105[7-9].eqiad.wmnet from Doing to Radar on the User-fgiunchedi board.
Thu, Dec 5, 8:44 AM · ops-eqiad, User-fgiunchedi, Operations
fgiunchedi updated the task description for T237438: (Need By 8/15/19) rack/setup/install ms-be105[7-9].eqiad.wmnet.
Thu, Dec 5, 8:39 AM · ops-eqiad, User-fgiunchedi, Operations
fgiunchedi reassigned T237438: (Need By 8/15/19) rack/setup/install ms-be105[7-9].eqiad.wmnet from fgiunchedi to Cmjohnson.

Hosts are fully in service now!

Thu, Dec 5, 8:37 AM · ops-eqiad, User-fgiunchedi, Operations

Wed, Dec 4

fgiunchedi added a comment to T239805: ms-fe2007 NIC failure.

@fgiunchedi the 10G NIC is dead.
Option 1: replace the server with another server
https://netbox.wikimedia.org/dcim/devices/1099/
Option 2: buy another 10G NIC

Wed, Dec 4, 5:45 PM · Operations, ops-codfw
fgiunchedi renamed T239805: ms-fe2007 NIC failure from ms-fe2007 nic failure to ms-fe2007 NIC failure.
Wed, Dec 4, 3:11 PM · Operations, ops-codfw
fgiunchedi created T239805: ms-fe2007 NIC failure.
Wed, Dec 4, 12:32 PM · Operations, ops-codfw

Tue, Dec 3

fgiunchedi added a comment to T180051: Reduce the number of fields declared in elasticsearch by logstash.

We've been working with service owners to fix the obvious offenders in terms of "fields spam" and bumped the fields limit to 2048. We're also alerting on indexing failures when Logstash gets errors from Elasticsearch. ATM only kartotherian bumps into the limit, although that doesn't necessarily mean kartotherian is the "fields spammer" in this case. I'll be following up with a patch to further bump the limit to 4096, that should be plenty to fully ingest all logs we're producing now.

Tue, Dec 3, 3:09 PM · Patch-For-Review, observability, Core Platform Team Legacy (Watching / External), Services (watching), Operations, Wikimedia-Logstash
fgiunchedi added a comment to T189333: Changing Kibana filters is ridiculously slow.

I re-ran my analysis today, and oddly enough the total number of fields is not only similar but equal to the number of fields there were three months ago. Currently at 7,665 table columns.

That's indeed unexpected, can you share how you are doing the analysis/pulling the field names?

  1. Open a Logstash dashboard in a Chromium browser, and open the Dev Tools.
  2. Edit or create a filter bubble in the Kibana UI, and open the channel dropdown.
  3. Then, from the Console tab in Dev Tools, execute copy($$('ul.uiSelectChoices--autoWidth.ui-select-dropdown')[0].textContent)

This queries the DOM for the <ul> node that represents the channel dropdown menu, then uses textContent (recursively aggregates the textual content of all child list items and concatenates it), and copies it to your clipboard.
Then paste it into a text editor, remove the empty lines by some method, and count the remaining ones :)
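
The final counting step could also be scripted instead of done in a text editor. A minimal sketch in Python, assuming the pasted clipboard text holds one field name per line with blank lines in between; the function name is hypothetical.

```python
def count_fields(pasted_text):
    """Count the non-empty lines in text pasted from the dropdown."""
    return sum(1 for line in pasted_text.splitlines() if line.strip())
```

Lines containing only whitespace are treated as empty, so stray indentation copied from the DOM does not inflate the count.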

The more direct place to get this information is to click the Management (gear/cog) link in the sidebar and select Index Patterns. This will report all the fields Kibana knows about, along with counts. Today it lists 11091 fields. I'm not sure when exactly this metadata updates, or if it's real time. The refresh button, which gives a big warning about resetting popularity counters, suggests it might not auto-update? We can compare to the actual indices with a bit of jq magic, but it would take a bit to work up.
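
As an alternative to jq, the comparison against the actual indices could be sketched in Python. This assumes the index mapping has already been fetched and parsed into a dict, and that the dict passed in is the level holding the top-level "properties" key (the nesting above it varies by Elasticsearch version). It counts leaf fields only and ignores multi-fields, which is a simplification.

```python
def count_leaf_fields(mapping):
    """Recursively count leaf fields under a mapping's "properties" key."""
    total = 0
    for spec in mapping.get("properties", {}).values():
        if "properties" in spec:
            # Object field: descend into its sub-fields.
            total += count_leaf_fields(spec)
        else:
            total += 1
    return total
```

Summing this over each index's mapping gives a number that can be set against the count Kibana reports for the index pattern.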

Tue, Dec 3, 1:56 PM · User-fgiunchedi, observability, Traffic, Operations, User-Addshore, Wikimedia-Logstash
fgiunchedi updated the task description for T239713: Citoid is logging all request / response headers as separate fields.
Tue, Dec 3, 1:49 PM · Citoid
fgiunchedi updated subscribers of T239713: Citoid is logging all request / response headers as separate fields.
Tue, Dec 3, 1:47 PM · Citoid
fgiunchedi created T239713: Citoid is logging all request / response headers as separate fields.
Tue, Dec 3, 1:47 PM · Citoid
fgiunchedi added a comment to T239458: Mediawiki logging indexing conflict.

Similar message but for errors

Tue, Dec 3, 11:26 AM · MediaWiki-Logging, User-fgiunchedi, VisualEditor, MediaWiki-General
fgiunchedi added projects to T239458: Mediawiki logging indexing conflict: User-fgiunchedi, MediaWiki-Logging.
Tue, Dec 3, 11:25 AM · MediaWiki-Logging, User-fgiunchedi, VisualEditor, MediaWiki-General
fgiunchedi added a comment to T234854: Upgrade ELK Stack.

Hello! I took the liberty of acking a lot of criticals/unknowns in icinga that were related to these new hosts. IIUC these are not in production :)

Tue, Dec 3, 9:07 AM · Patch-For-Review, Operations, Wikimedia-Logstash

Mon, Dec 2

fgiunchedi added a comment to T233934: Collects metrics for CAS.

While talking metrics and such for java, please consider also adding jmx_exporter (in addition to the native metrics) to CAS' jvm as we are doing for other JVMs across the fleet in T177197: Export Prometheus-compatible JVM metrics from JVMs in production

Mon, Dec 2, 2:13 PM · User-jbond, Operations
fgiunchedi updated the task description for T156955: Standardizing our partman recipes.
Mon, Dec 2, 2:10 PM · Patch-For-Review, Operations
fgiunchedi added a comment to T151009: Provide authenticated access to Prometheus native web interface.

I'm tempted to add this directly to Apereo CAS (time permitting); however, I'm curious what you had in mind for the service domain names, considering we need one each for codfw and eqiad?
Something like:

https://prometheous.codfw.wikimedia.org/
https://prometheous.eqiad.wikimedia.org/

or did you have something else in mind?

Mon, Dec 2, 1:55 PM · observability, Patch-For-Review, User-fgiunchedi, Operations, Prometheus-metrics-monitoring
fgiunchedi updated the task description for T156955: Standardizing our partman recipes.
Mon, Dec 2, 12:26 PM · Patch-For-Review, Operations
fgiunchedi added a comment to T221904: swift backend decomms / rebalances are noisy.

AFAICS through the latest rebalances we haven't observed any alerts, possibly also due to using multiple servers per port (T222366)

Mon, Dec 2, 12:20 PM · observability, SRE-swift-storage, Operations
fgiunchedi updated the task description for T239054: Reimage all mediawiki servers .
Mon, Dec 2, 11:51 AM · Operations, serviceops

Fri, Nov 29

fgiunchedi created T239458: Mediawiki logging indexing conflict.
Fri, Nov 29, 9:08 AM · MediaWiki-Logging, User-fgiunchedi, VisualEditor, MediaWiki-General

Thu, Nov 28

fgiunchedi closed T187708: Monitor prometheus exporters "up" status as Resolved.

All deployed now, boldly resolving

Thu, Nov 28, 10:25 AM · User-fgiunchedi, observability

Wed, Nov 27

fgiunchedi updated the task description for T187708: Monitor prometheus exporters "up" status.
Wed, Nov 27, 5:39 PM · User-fgiunchedi, observability
fgiunchedi updated the task description for T156955: Standardizing our partman recipes.
Wed, Nov 27, 4:46 PM · Patch-For-Review, Operations
fgiunchedi changed the status of T215904: Better understanding of Logstash performance from Stalled to Open.
Wed, Nov 27, 11:05 AM · User-fgiunchedi, observability, Wikimedia-Logstash
fgiunchedi added a comment to T215904: Better understanding of Logstash performance.

Thanks for the in depth investigation and the numbers @colewhite ! Indeed looks like we'll need to tweak logstash pipeline parameters to >= 1000

Wed, Nov 27, 11:04 AM · User-fgiunchedi, observability, Wikimedia-Logstash
fgiunchedi created T239321: Deprecate msdos partition scheme in favor of GPT.
Wed, Nov 27, 10:49 AM · Operations
fgiunchedi moved T237438: (Need By 8/15/19) rack/setup/install ms-be105[7-9].eqiad.wmnet from Backlog to Doing on the User-fgiunchedi board.
Wed, Nov 27, 8:08 AM · ops-eqiad, User-fgiunchedi, Operations
fgiunchedi added a project to T237438: (Need By 8/15/19) rack/setup/install ms-be105[7-9].eqiad.wmnet: User-fgiunchedi.
Wed, Nov 27, 8:07 AM · ops-eqiad, User-fgiunchedi, Operations

Tue, Nov 26

fgiunchedi added a comment to T237587: Determine & implement near-term method for escalating network alerts.

FTR, re: paging on librenms alerts, see this plan: https://phabricator.wikimedia.org/T224888#5690188

Tue, Nov 26, 3:50 PM · Operations, netops, observability
fgiunchedi moved T187708: Monitor prometheus exporters "up" status from Up next to Doing on the User-fgiunchedi board.
Tue, Nov 26, 3:42 PM · User-fgiunchedi, observability
fgiunchedi added a comment to T224888: Network port utilization alerts should be paging .

Any preferences or thoughts re: the special tag? Right now I'm leaning towards #page as that seems the most self-explanatory.

Tue, Nov 26, 3:39 PM · observability, Traffic, Operations, netops
fgiunchedi added a comment to T238909: Proposal: simplify set up of a new load-balanced service on kubernetes.

First thank you for getting the ball rolling on this proposal! A question: are all approaches proposed targeting group B actions only or some approaches would also tackle group A? Also I think it'll be helpful if the (only most promising?) approaches have an outline of what group B actions will turn into.

Tue, Nov 26, 2:37 PM · Operations, Prod-Kubernetes, Pybal, Traffic, serviceops
fgiunchedi added a comment to T224888: Network port utilization alerts should be paging .

I've a proposal for doing this:

  • Add some special tag like #NRPE or #page to the names of any LibreNMS alert rules we'd like to make page. For our purpose here this would just be #6 Primary outbound port utilisation over 80% and #25 Primary inbound port utilisation over 80%.
  • In a Python NRPE:

This will prevent turning any LibreNMS critical into a page for the whole team (e.g. the currently-firing "Sensor over limit" for cr3-esams). It will also mean that ACKing alerts within LibreNMS does the right thing. And it makes it fairly straightforward to add/remove alert rules from the set that pages the team.
SGTU?

Tue, Nov 26, 10:43 AM · observability, Traffic, Operations, netops
fgiunchedi added a comment to T237438: (Need By 8/15/19) rack/setup/install ms-be105[7-9].eqiad.wmnet.

@Cmjohnson @Jclark-ctr I'm not blocked on this (thus no reassigning) but ms-be1059 is in row D judging by its ip address and netbox says row C. I believe netbox will need updating

Tue, Nov 26, 10:34 AM · ops-eqiad, User-fgiunchedi, Operations
fgiunchedi claimed T237438: (Need By 8/15/19) rack/setup/install ms-be105[7-9].eqiad.wmnet.
Tue, Nov 26, 10:31 AM · ops-eqiad, User-fgiunchedi, Operations
fgiunchedi added a comment to T237438: (Need By 8/15/19) rack/setup/install ms-be105[7-9].eqiad.wmnet.

@fgiunchedi These are ready for you for implementation. I removed the ops-eqiad tag. If you have an issue, please assign to me and add the ops-eqiad tag back.

Tue, Nov 26, 10:13 AM · ops-eqiad, User-fgiunchedi, Operations
fgiunchedi added a project to T239090: Restbase logging indexing conflict: User-fgiunchedi.
Tue, Nov 26, 9:51 AM · User-fgiunchedi, Wikimedia-Logstash, RESTBase

Mon, Nov 25

fgiunchedi moved T230570: De-noise systemd alerts (Reduce Icinga alert noise goal) from In progress to Backlog on the observability board.
Mon, Nov 25, 4:11 PM · Patch-For-Review, Goal, observability
fgiunchedi updated subscribers of T239039: Varnish traffic drop alert @ codfw is noisy / codfw incoming traffic is spikey.

Took a quick look at the expression and the idea LGTM, thanks @CDanis. Also cc @ayounsi as the original implementor of the alert

Mon, Nov 25, 2:39 PM · Patch-For-Review, Traffic, Operations, observability
fgiunchedi moved T236700: 'LVS connections' graph on Load Balancers dashboard takes a rate of a gauge from Backlog to In progress on the observability board.
Mon, Nov 25, 2:00 PM · Traffic, Operations, observability
fgiunchedi moved T237587: Determine & implement near-term method for escalating network alerts from Backlog to Up next on the observability board.
Mon, Nov 25, 1:53 PM · Operations, netops, observability
fgiunchedi moved T97297: Select a standard log shipping solution to use with applications that cannot be configured to send log events directly to Logstash and/or fluorine from Up next to Backlog on the observability board.
Mon, Nov 25, 1:51 PM · observability, Operations, Wikimedia-Logstash
fgiunchedi moved T205856: Retire udp2log: onboard its producers and consumers to the logging pipeline from Up next to Backlog on the observability board.
Mon, Nov 25, 1:51 PM · Analytics, observability, Wikimedia-Logstash, Operations
fgiunchedi moved T217340: Change logstash plugin deployment to use deb packaging and deployment from Backlog to Up next on the observability board.
Mon, Nov 25, 1:50 PM · Patch-For-Review, observability, Operations, Discovery-Search
fgiunchedi moved T236075: Evaluate, suggest and choose an alert escalation solution from Backlog to In progress on the observability board.
Mon, Nov 25, 1:49 PM · User-fgiunchedi, observability
fgiunchedi moved T237407: basic prometheus monitoring for PoolCounter from Backlog to Radar on the observability board.
Mon, Nov 25, 1:48 PM · Operations, observability, serviceops
fgiunchedi moved T101141: UDP rcvbuferrors and inerrors on graphite hosts from Up next to Backlog on the observability board.
Mon, Nov 25, 1:48 PM · observability, MW-1.27-release-notes, MW-1.27-release (WMF-deploy-2016-04-12_(1.27.0-wmf.21)), Operations, Graphite
fgiunchedi closed T238416: Logstash doesn't parse ulogd source and destination ports as Resolved.

Looks like this is all done, resolving

Mon, Nov 25, 1:47 PM · Operations, observability
fgiunchedi closed T238791: dropped packets to conf1004/5/6 2379/tcp as Resolved.

Fixed!

Mon, Nov 25, 1:47 PM · Operations, serviceops, observability
fgiunchedi moved T225604: log spam from mtail 3.0.0~rc19 on wezen from Radar to Up next on the observability board.
Mon, Nov 25, 1:46 PM · Patch-For-Review, observability
fgiunchedi moved T148541: Replace Torrus with Prometheus snmp_exporter for PDUs monitoring from Up next to In progress on the observability board.
Mon, Nov 25, 1:46 PM · User-fgiunchedi, Patch-For-Review, Prometheus-metrics-monitoring, Operations, observability
fgiunchedi closed T150106: Type collisions in log events causing indexing failures in ELK Elasticsearch as Resolved.
Mon, Nov 25, 1:45 PM · observability, MW-1.31-release-notes (WMF-deploy-2017-10-31 (1.31.0-wmf.6)), Analytics, Patch-For-Review, MW-1.29-release-notes, Event-Platform, Wikimedia-Logstash
fgiunchedi closed T150106: Type collisions in log events causing indexing failures in ELK Elasticsearch, a subtask of T157850: Interacting with Wikimedia logs should be a pleasant experience, as Resolved.
Mon, Nov 25, 1:45 PM · Epic, Wikimedia-General-or-Unknown
fgiunchedi moved T238791: dropped packets to conf1004/5/6 2379/tcp from Backlog to In progress on the observability board.
Mon, Nov 25, 1:45 PM · Operations, serviceops, observability
fgiunchedi added a comment to T150106: Type collisions in log events causing indexing failures in ELK Elasticsearch.

We're now alerting when logstash index failures happen (T236343) thus I'm boldly resolving this task as followup will happen on Wikimedia-Logstash as issues come up.

Mon, Nov 25, 1:45 PM · observability, MW-1.31-release-notes (WMF-deploy-2017-10-31 (1.31.0-wmf.6)), Analytics, Patch-For-Review, MW-1.29-release-notes, Event-Platform, Wikimedia-Logstash
fgiunchedi added a comment to T214183: Setup graphs for power usage readings in Grafana.

Status update: I've been working on a dashboard with wattage from sentry3 + sentry4. It has got a global stacked graph + drilldown per-site: https://grafana.wikimedia.org/d/OBD1jy1Zk/filippo-pdu

Mon, Nov 25, 1:42 PM · DC-Ops, observability
fgiunchedi closed T236367: Tune HTTP availability alerts as Resolved.

Thresholds adjusted for global availability and I've updated "frontend traffic" dashboard

Mon, Nov 25, 1:41 PM · Operations, observability
fgiunchedi created T239090: Restbase logging indexing conflict.
Mon, Nov 25, 11:50 AM · User-fgiunchedi, Wikimedia-Logstash, RESTBase
fgiunchedi added a comment to T238414: Write ulogd logs to a dedicated logfile.

FWIW I'm ok with doing whichever is easiest, IIRC we can ship to kafka first and then add rules to log to a separate file.

Mon, Nov 25, 10:41 AM · observability, Operations
fgiunchedi added a comment to T238707: Migrate from deployment-logstash2 (jessie) to deployment-logstash03 (stretch).

AFAIK if dashboards have been migrated then deployment-logstash02 should be ready to be turned off

Mon, Nov 25, 10:28 AM · Cloud-VPS (Debian Jessie Deprecation), Beta-Cluster-Infrastructure
fgiunchedi awarded T238807: Clean up ORES metrics a Like token.
Mon, Nov 25, 10:25 AM · observability, Operations
fgiunchedi reopened T238973: Appservers rising GET latency might have triggered LVS pages as "Open".

I find it hard to believe this is the case. Text-lb checks request a cached url, so the backend latency should not matter.

Mon, Nov 25, 9:45 AM · Operations, serviceops
fgiunchedi added a comment to T238973: Appservers rising GET latency might have triggered LVS pages.

The cause was indeed appservers latency, resolving in favor of T238939

Mon, Nov 25, 8:54 AM · Operations, serviceops
fgiunchedi merged T238973: Appservers rising GET latency might have triggered LVS pages into T238939: Increased latency in appservers - 22 Nov 2019.
Mon, Nov 25, 8:54 AM · Traffic, Operations, serviceops
fgiunchedi merged task T238973: Appservers rising GET latency might have triggered LVS pages into T238939: Increased latency in appservers - 22 Nov 2019.
Mon, Nov 25, 8:54 AM · Operations, serviceops

Sat, Nov 23

fgiunchedi added a comment to T238939: Increased latency in appservers - 22 Nov 2019.

Found this task only now, but see also T238973: Appservers rising GET latency might have triggered LVS pages

Sat, Nov 23, 9:54 AM · Traffic, Operations, serviceops
fgiunchedi updated the task description for T238973: Appservers rising GET latency might have triggered LVS pages.
Sat, Nov 23, 9:53 AM · Operations, serviceops
fgiunchedi created T238973: Appservers rising GET latency might have triggered LVS pages.
Sat, Nov 23, 9:49 AM · Operations, serviceops

Fri, Nov 22

fgiunchedi moved T187708: Monitor prometheus exporters "up" status from Backlog to Up next on the User-fgiunchedi board.
Fri, Nov 22, 12:13 PM · User-fgiunchedi, observability
fgiunchedi added a comment to T238795: The "logstash-*" index pattern does not contain any of the following field types: ip .

Looks good! I won't have time to look into this in depth but I'm happy to help if patches need review

Fri, Nov 22, 10:20 AM · Operations, observability

Thu, Nov 21

fgiunchedi added a comment to T238707: Migrate from deployment-logstash2 (jessie) to deployment-logstash03 (stretch).

looks like each of the logstash hosts runs its own elasticsearch cluster locally, would our source be something like deployment-logstash2.deployment-prep.eqiad.wmflabs:9200 or deployment-logstash2.deployment-prep.eqiad.wmflabs:9300 ? it seems we'd need to configure reindex.remote.whitelist somewhere too though I have no idea where

Thu, Nov 21, 10:51 AM · Cloud-VPS (Debian Jessie Deprecation), Beta-Cluster-Infrastructure
fgiunchedi added a comment to T238083: Citoid logs fields explosion.

LGTM so far, thanks @mobrovac for working on this!

Thu, Nov 21, 10:47 AM · Patch-For-Review, Core Platform Team Workboards (Clinic Duty Team), Citoid, Wikimedia-Logstash
fgiunchedi added a comment to T238795: The "logstash-*" index pattern does not contain any of the following field types: ip .

Yes we can, if you know the name of the field we can add an explicit mapping to force the type in modules/profile/files/logstash/elasticsearch-template.json

Thu, Nov 21, 10:42 AM · Operations, observability
fgiunchedi added a comment to T238791: dropped packets to conf1004/5/6 2379/tcp.

@fgiunchedi you need to use https, and it works locally.
If you want to access metrics remotely, you can switch to port 4001, which is publicly exposed. We have different ports in codfw and eqiad until we've migrated codfw to etcd3 as well.

Thu, Nov 21, 9:52 AM · Operations, serviceops, observability
fgiunchedi lowered the priority of T238727: Include zone+subnet checks for DNS validation from Medium to Low.

@fgiunchedi I think it's a fair request, but given we're in the process of auto-generating all mgmt and then server DNS records, this might have less benefit than in the current situation. Would it be OK to treat it as lower priority?

Thu, Nov 21, 9:34 AM · Traffic, Operations, DNS, SRE-tools
fgiunchedi added a comment to T238695: File on Commons not found: File:Nl-gegourmet.ogg.

Indeed, the file is a Nov 2013 upload; we could search for it in the archive containers as well in case it got moved there. Re: finding all orphan files, my understanding is that mediawiki has maintenance scripts to achieve that, but we're not running them on a regular basis or investigating the results.

Thu, Nov 21, 9:29 AM · SRE-swift-storage, Operations, Commons
fgiunchedi added a comment to T238807: Clean up ORES metrics.

@fgiunchedi, would you mind having a quick look at P9701? I'd like to run it on production.

Thu, Nov 21, 9:22 AM · observability, Operations
fgiunchedi updated the task description for T224549: Track remaining jessie systems in production.
Thu, Nov 21, 9:14 AM · Operations
fgiunchedi closed T224564: Reimage wezen to Buster (and rename to centrallog2001) as Resolved.

This is complete!

Thu, Nov 21, 9:13 AM · User-fgiunchedi, observability, Operations
fgiunchedi closed T224564: Reimage wezen to Buster (and rename to centrallog2001), a subtask of T224549: Track remaining jessie systems in production, as Resolved.
Thu, Nov 21, 9:13 AM · Operations
fgiunchedi updated the task description for T224564: Reimage wezen to Buster (and rename to centrallog2001).
Thu, Nov 21, 9:13 AM · User-fgiunchedi, observability, Operations
fgiunchedi added a comment to T238791: dropped packets to conf1004/5/6 2379/tcp.

Indeed looks like prometheus is trying to fetch conf1004.eqiad.wmnet:2379/metrics with no success. Locally on conf1004 even past the firewall the endpoint doesn't seem to work:

Thu, Nov 21, 9:13 AM · Operations, serviceops, observability
fgiunchedi added a comment to T238794: dropped packets to kafkamon 9000/tcp.

Indeed that's prometheus@analytics trying to reach burrow-exporter on port 9000 on kafkamon hosts, burrow-exporter is listening there but clearly no ferm. @Ottomata @elukey is port 9000 a legacy configuration we need to clean up or expected to be working ?

Thu, Nov 21, 9:07 AM · Operations, observability

Wed, Nov 20

fgiunchedi added a comment to T237587: Determine & implement near-term method for escalating network alerts.

Thanks! I think we should go with (2) (i.e. investigate integration between icinga (or grafana alerts, and from there icinga checks) for fastnetmon and librenms) so we get all niceties like irc, silence/acknowledge, contact groups etc

Wed, Nov 20, 3:30 PM · Operations, netops, observability
fgiunchedi updated the task description for T224564: Reimage wezen to Buster (and rename to centrallog2001).
Wed, Nov 20, 1:08 PM · User-fgiunchedi, observability, Operations
fgiunchedi moved T224564: Reimage wezen to Buster (and rename to centrallog2001) from Up next to Doing on the User-fgiunchedi board.
Wed, Nov 20, 12:33 PM · User-fgiunchedi, observability, Operations
fgiunchedi added a comment to T238707: Migrate from deployment-logstash2 (jessie) to deployment-logstash03 (stretch).

Wondering what we need to do next. Do we need to copy dashboards over somehow?

Wed, Nov 20, 10:07 AM · Cloud-VPS (Debian Jessie Deprecation), Beta-Cluster-Infrastructure
fgiunchedi added a comment to T225604: log spam from mtail 3.0.0~rc19 on wezen.

The spam is back after centrallog2001 reimage (was wezen) running buster, I've bandaided the issue but it seems we should try one of the latest mtail releases (cc @colewhite)

Wed, Nov 20, 9:09 AM · Patch-For-Review, observability
fgiunchedi created T238727: Include zone+subnet checks for DNS validation.
Wed, Nov 20, 9:03 AM · Traffic, Operations, DNS, SRE-tools

Tue, Nov 19

fgiunchedi created T238677: "unknown session id" from bird on centrallog hosts.
Tue, Nov 19, 5:32 PM · netops, Operations
fgiunchedi edited projects for T238540: Delete grafana dashboard, https://grafana.wikimedia.org/d/000000599/wikibase-wb_terms-newitemidformatter, added: Traffic, observability; removed Graphite.
Tue, Nov 19, 2:55 PM · User-Addshore, Operations, observability, Traffic, Wikidata
fgiunchedi added a comment to T238540: Delete grafana dashboard, https://grafana.wikimedia.org/d/000000599/wikibase-wb_terms-newitemidformatter.

I can confirm that a DELETE of https://grafana.wikimedia.org/api/dashboards/uid/000000599 results in a 403, further I don't see the request reaching grafana1001's apache logs. I'm adding Traffic since this looks like a regression, perhaps ATS is involved.

Tue, Nov 19, 2:55 PM · User-Addshore, Operations, observability, Traffic, Wikidata