fgiunchedi (Filippo Giunchedi)
/* No comment */

Projects (15)

User Details

User Since
Oct 3 2014, 8:06 AM (380 w, 6 d)
Availability
Available
IRC Nick
godog
LDAP User
Filippo Giunchedi
MediaWiki User
Filippo Giunchedi

Recent Activity

Yesterday

fgiunchedi updated the task description for T296199: Prometheus hardware refresh (+ Bullseye upgrade).
Thu, Jan 20, 1:59 PM · Patch-For-Review, SRE Observability (FY2021/2022-Q3), User-fgiunchedi
fgiunchedi moved T296199: Prometheus hardware refresh (+ Bullseye upgrade) from Up next to Doing on the User-fgiunchedi board.
Thu, Jan 20, 1:23 PM · Patch-For-Review, SRE Observability (FY2021/2022-Q3), User-fgiunchedi
fgiunchedi created T299633: Add search.discovery.wmnet to search certs SAN.
Thu, Jan 20, 10:18 AM · SRE Observability (FY2021/2022-Q3), User-fgiunchedi
fgiunchedi created T299628: Debug / fine tune puppet failed metrics and alerts on alert* hosts.
Thu, Jan 20, 9:01 AM · User-fgiunchedi, Observability-Alerting, SRE

Wed, Jan 19

fgiunchedi moved T288937: Run Thanos backend on Bullseye from Up next to Doing on the User-fgiunchedi board.
Wed, Jan 19, 4:34 PM · User-fgiunchedi, SRE-swift-storage
fgiunchedi closed T299383: Move graphite back to eqiad as Resolved.

This is done, graphite is back in eqiad

Wed, Jan 19, 1:04 PM · SRE Observability (FY2021/2022-Q3)

Tue, Jan 18

fgiunchedi closed T297604: cergen should include the cert's name in SAN too, a subtask of T291946: Move service::catalog checks (“monitoring” section) to blackbox exporter and Alertmanager, as Resolved.
Tue, Jan 18, 4:01 PM · SRE Observability (FY2021/2022-Q3), Patch-For-Review, User-fgiunchedi
fgiunchedi closed T297604: cergen should include the cert's name in SAN too as Resolved.

This is done, cergen 0.2.6 includes this feature

Tue, Jan 18, 4:01 PM · SRE Observability (FY2021/2022-Q3), User-fgiunchedi
fgiunchedi committed rOBGP1fa1374fb8c8: README: clarify USENETWORK=yes usage (authored by fgiunchedi).
README: clarify USENETWORK=yes usage
Tue, Jan 18, 3:38 PM
fgiunchedi added a comment to T291946: Move service::catalog checks (“monitoring” section) to blackbox exporter and Alertmanager.

Most of the entries in service::catalog now have a probes section; these are the failures as of today:

Tue, Jan 18, 1:42 PM · SRE Observability (FY2021/2022-Q3), Patch-For-Review, User-fgiunchedi
fgiunchedi created T299383: Move graphite back to eqiad.
Tue, Jan 18, 10:14 AM · SRE Observability (FY2021/2022-Q3)
fgiunchedi added a comment to T297145: Ask for regular backups of our Wikidata Graphite data .

This is complete I believe, we're backing up the daily hierarchy now

Tue, Jan 18, 10:05 AM · Wikidata-Campsite (Team A Hearth 🏰🔥), Wikidata

Mon, Jan 17

fgiunchedi added a comment to T236954: Hieradata yaml style checking.

Thanks for the work on this, it looks really good. In relation to linting vs automatic formatting, I agree with the conclusion that automatic formatting would be the better option. In relation to tooling, I think it would be useful to check some other tools to see if they are free of the aforementioned bugs.

In relation to vinyl itself, it seems (from a scan of the commit histories) that the original author has renamed this to yamlfmt. Do you know any more of the history, i.e. did this project use to be called vinyl, and why didn't we go with yamlfmt?

Copying @Joe's comments from the Gerrit patch here:

a justification of why we're reformatting yaml, and of why vinyl was specifically chosen. For instance, I'd like to see a consistent use of double quotes rather than single ones, but that doesn't seem to be included here.
why the formatting choices were made (for instance, I don't like the removal of blank lines)

This is an artefact of the tool we pick. From my point of view, and similar to the conversation on Python black, I don't care too much what style ultimately gets picked as long as there is consistency.

Mon, Jan 17, 4:24 PM · Infrastructure-Foundations, Patch-For-Review, Puppet, SRE, User-jbond
fgiunchedi moved T294564: Migrate Foundations Prometheus alerts to AlertManager from FY2021/2022-Q2 to Backlog on the SRE Observability board.
Mon, Jan 17, 2:08 PM · Observability-Alerting
fgiunchedi moved T293970: Alert on systemd units crashloops from FY2021/2022-Q2 to Backlog on the SRE Observability board.
Mon, Jan 17, 2:08 PM · Observability-Alerting
fgiunchedi moved T297604: cergen should include the cert's name in SAN too from FY2021/2022-Q2 to FY2021/2022-Q3 on the SRE Observability board.
Mon, Jan 17, 1:56 PM · SRE Observability (FY2021/2022-Q3), User-fgiunchedi
fgiunchedi moved T278514: Wishlist for AlertManager alerts from Grafana from FY2021/2022-Q3 to Backlog on the SRE Observability board.
Mon, Jan 17, 1:55 PM · Observability-Alerting, User-fgiunchedi, Performance-Team (Radar)
fgiunchedi moved T267019: Alert design guidelines for teams are produced from FY2021/2022-Q3 to Backlog on the SRE Observability board.
Mon, Jan 17, 1:55 PM · Observability-Alerting
fgiunchedi moved T288196: Retire Prometheus 'global' instance from FY2021/2022-Q3 to Backlog on the SRE Observability board.
Mon, Jan 17, 1:54 PM · Observability-Metrics, Performance-Team (Radar)
fgiunchedi moved T297231: Sending Apache Spark metrics to PushGateway from FY2021/2022-Q3 to Backlog on the SRE Observability board.
Mon, Jan 17, 1:35 PM · Observability-Metrics
fgiunchedi added a comment to T298945: Investigate false WebPageTest LCP alerts.

I suspect this has to do with Grafana 8 test instance at grafana2001 evaluating alerts and for some reason (to be investigated) reaching different conclusions about the alert's state (cfr T282863: Upgrade Grafana to 8.x)

Mon, Jan 17, 10:47 AM · Performance-Team (Radar), WebPageTest
fgiunchedi added a comment to T282863: Upgrade Grafana to 8.x.

Another data point: today while investigating T298945 I ran into this on grafana2001's logs:

Mon, Jan 17, 10:45 AM · SRE Observability (FY2021/2022-Q3), Performance-Team (Radar)

Wed, Jan 5

fgiunchedi created T298610: Upgrade Karma (alerts dashboard).
Wed, Jan 5, 1:55 PM · Observability-Alerting

Tue, Jan 4

fgiunchedi added a comment to T298521: Numerous Graphite queries stopped working (timeout after 6s).

I took a quick look and the timeout seems related to the amount of metrics matching MediaWiki.resourceloader_build.*.sample_rate (about 5k). For example, changing to MediaWiki.resourceloader_build.wikibase*.sample_rate I can load 2/30 days of data. The change might be due to (either or both):

Tue, Jan 4, 3:42 PM · Performance-Team (Radar), observability
fgiunchedi added a comment to T297933: ms-be2065 failed drive sdq.

Thank you @Papaul

Tue, Jan 4, 10:24 AM · SRE, ops-codfw
fgiunchedi updated the task description for T272559: Unused puppet resources audit, 2021.
Tue, Jan 4, 10:22 AM · Infrastructure-Foundations, Patch-For-Review, SRE, Puppet
fgiunchedi awarded T222795: Re-evaluate service-runner's (ab)use of statsd timing metric for nodejs GC stats a Love token.
Tue, Jan 4, 10:05 AM · observability, serviceops-radar, Services (later), service-runner, SRE

Dec 17 2021

fgiunchedi added a comment to T297959: thanos-be hosts filing up root filesystem with logs.

I've bandaided the immediate issue, leaving the task open since we haven't addressed the high volume of logs

Dec 17 2021, 4:41 PM · serviceops, SRE-swift-storage, Observability-Metrics
fgiunchedi created T297959: thanos-be hosts filing up root filesystem with logs.
Dec 17 2021, 4:01 PM · serviceops, SRE-swift-storage, Observability-Metrics
fgiunchedi created T297933: ms-be2065 failed drive sdq.
Dec 17 2021, 9:36 AM · SRE, ops-codfw
fgiunchedi added a comment to T60773: VisualEditor: Adding newlines at start/end of preformatted text is broken.

And ran into this again :( quite annoying for preformatted sections in VE, I need to switch back to source editing

Dec 17 2021, 9:15 AM · VisualEditor, VisualEditor-ContentEditable
fgiunchedi added a comment to T291946: Move service::catalog checks (“monitoring” section) to blackbox exporter and Alertmanager.

With the last set of patches we're able to add probes for the majority of internal/discovery services, including use cases like sending JSON strings and checking responses with regular expressions.

Dec 17 2021, 8:32 AM · SRE Observability (FY2021/2022-Q3), Patch-For-Review, User-fgiunchedi

Dec 16 2021

fgiunchedi added a comment to T294137: Q2:(Need By: TBD) rack/setup/install ms-fe1009-1012.

yes +1 to spread around rows as much as we can

Dec 16 2021, 12:49 PM · SRE, SRE-swift-storage, ops-eqiad, DC-Ops

Dec 15 2021

fgiunchedi moved T296199: Prometheus hardware refresh (+ Bullseye upgrade) from Backlog to Up next on the User-fgiunchedi board.
Dec 15 2021, 10:07 AM · Patch-For-Review, SRE Observability (FY2021/2022-Q3), User-fgiunchedi
fgiunchedi moved T297604: cergen should include the cert's name in SAN too from Backlog to Doing on the User-fgiunchedi board.
Dec 15 2021, 10:07 AM · SRE Observability (FY2021/2022-Q3), User-fgiunchedi
fgiunchedi updated the task description for T288726: Move > 60% of observability Prometheus-based checks to Alertmanager.
Dec 15 2021, 10:06 AM · SRE Observability (FY2021/2022-Q3), Patch-For-Review, User-fgiunchedi
fgiunchedi moved T296373: Define owners email address for Pontoon from Doing to Backlog on the User-fgiunchedi board.
Dec 15 2021, 10:06 AM · User-fgiunchedi
fgiunchedi changed the status of T296373: Define owners email address for Pontoon from Open to Stalled.

Blackholing emails is Good Enough™ for now

Dec 15 2021, 10:06 AM · User-fgiunchedi
fgiunchedi closed T297265: graphite1004 freezing as Resolved.

Reverting to 5.10.0-9 has brought back stability, resolving. We still have T297433 to update firmware, which will happen when dcops can

Dec 15 2021, 10:05 AM · User-fgiunchedi, Patch-For-Review, Wikimedia-Incident, SRE, Graphite
fgiunchedi added a comment to T230733: Expose pooled status of gdnsd and conftool managed services as metrics.

I have implemented part of this work for service::catalog network probes; specifically, I needed to export the per-service state field. Even easier if the "calculations" of metrics can happen locally on the Prometheus hosts, see also https://gerrit.wikimedia.org/r/c/operations/puppet/+/747140 and https://gerrit.wikimedia.org/r/c/operations/puppet/+/747139

Dec 15 2021, 9:36 AM · User-CDanis, SRE, observability
fgiunchedi added a comment to T222795: Re-evaluate service-runner's (ab)use of statsd timing metric for nodejs GC stats.

I believe this is now (partially?) done, and service-runner supports Prometheus natively these days. What do you think @akosiaris ?

Dec 15 2021, 9:34 AM · observability, serviceops-radar, Services (later), service-runner, SRE
fgiunchedi removed a project from T228970: Test memsniff as possible replacement of memkeys: observability.
Dec 15 2021, 9:32 AM · User-Elukey, SRE, serviceops
fgiunchedi removed a project from T222075: Prevent puppet catalog compiler workers from running out of disk space: observability.
Dec 15 2021, 9:27 AM · User-jbond, Infrastructure-Foundations, User-herron, puppet-compiler, SRE
fgiunchedi removed a project from T222362: swift-drive-audit unmounting a drive doesn't produce any alerts or notifications: observability.
Dec 15 2021, 9:26 AM · SRE-swift-storage, SRE
fgiunchedi added a comment to T196994: Open Phab tasks on SMART failure.

This is possible now with Prometheus and Alertmanager, cc T294564

Dec 15 2021, 9:24 AM · Patch-For-Review, SRE, observability
fgiunchedi renamed T169860: Replace smokeping with a Prometheus-based solution from Investigate/setup prometheus blackbox_exporter to Replace smokeping with a Prometheus-based solution.
Dec 15 2021, 9:22 AM · observability, User-fgiunchedi, Patch-For-Review, Prometheus-metrics-monitoring
fgiunchedi closed T267186: alerts.w.o / idp.w.o interaction and CORS as Resolved.

I just got the session expired on alerts.w.o and the application correctly detected the situation and reloaded, resolving

Dec 15 2021, 9:20 AM · Infrastructure-Foundations, CAS-SSO, Patch-For-Review, observability
fgiunchedi closed T267186: alerts.w.o / idp.w.o interaction and CORS, a subtask of T266017: Implement alerting roadmap phase 2, as Resolved.
Dec 15 2021, 9:20 AM · Patch-For-Review, User-fgiunchedi, observability

Dec 13 2021

fgiunchedi added a comment to T297604: cergen should include the cert's name in SAN too.

I have some vague feeling that there was a reason not to do this, but I can't recall why and I can't find any docs by my past self to indicate that it was (which I usually do if intentional).

I'm ok with it! But, I wonder if it is better to be explicit about this and manually add it to the list of alt_names instead of always doing it? Perhaps there is some reason to not have the CN in the SAN?

Dec 13 2021, 3:30 PM · SRE Observability (FY2021/2022-Q3), User-fgiunchedi
fgiunchedi created T297604: cergen should include the cert's name in SAN too.
Dec 13 2021, 1:22 PM · SRE Observability (FY2021/2022-Q3), User-fgiunchedi
fgiunchedi closed T294355: Several Wikidata Grafana boards missing data before October 2021 as Resolved.

I'm tentatively resolving the task since all short-term mitigations are completed, feel free to reopen if something is amiss

Dec 13 2021, 9:18 AM · Data-Persistence, Data-Persistence-Backup, bacula, SRE Observability (FY2021/2022-Q2), Wikidata Analytics, Wikidata, Graphite

Dec 10 2021

fgiunchedi moved T297265: graphite1004 freezing from Backlog to Doing on the User-fgiunchedi board.
Dec 10 2021, 4:06 PM · User-fgiunchedi, Patch-For-Review, Wikimedia-Incident, SRE, Graphite
fgiunchedi added a comment to T294355: Several Wikidata Grafana boards missing data before October 2021.

Thank you for the suggestion @fgiunchedi! Do we have an explanation somewhere of how to do this?

Dec 10 2021, 3:17 PM · Data-Persistence, Data-Persistence-Backup, bacula, SRE Observability (FY2021/2022-Q2), Wikidata Analytics, Wikidata, Graphite
fgiunchedi added a comment to T294355: Several Wikidata Grafana boards missing data before October 2021.

@Manuel @Lydia_Pintscher going forward I suggest also investing resources to switch to Prometheus as the supported metric system. Graphite is deprecated and in "life support" mode while all producers (essentially mediawiki and related) are being ported over, thanks!

Dec 10 2021, 2:30 PM · Data-Persistence, Data-Persistence-Backup, bacula, SRE Observability (FY2021/2022-Q2), Wikidata Analytics, Wikidata, Graphite
fgiunchedi added a comment to T294355: Several Wikidata Grafana boards missing data before October 2021.

I’m not sure I understand the discussion correctly :) do you still need a list of paths to back up, or does it look like we can back up everything now?

What's "everything" in this context? :) If you are talking about daily then yes it does look like it!

I was thinking of everything, even non-daily stuff, but it looks like daily would actually be enough for us. Manuel created a list of important dashboards in T297145; the topics they use are:

Dec 10 2021, 1:32 PM · Data-Persistence, Data-Persistence-Backup, bacula, SRE Observability (FY2021/2022-Q2), Wikidata Analytics, Wikidata, Graphite
fgiunchedi added a comment to T297231: Sending Apache Spark metrics to PushGateway.

Thank you for the summary @gmodena ! Some replies inline

Dec 10 2021, 9:51 AM · Observability-Metrics
fgiunchedi added a comment to T297435: Send metrics of db errors of mediawiki to prometheus .

Yes, once you have logs in Elasticsearch you can turn search queries into Prometheus metrics; from there you have dashboards and alerts too (either based on Grafana, or as Prometheus alerting rules in operations/alerts.git). HTH!

Dec 10 2021, 8:18 AM · MW-1.38-notes (1.38.0-wmf.19; 2022-01-24), Patch-For-Review, DBA, observability, SRE, User-Ladsgroup

Dec 9 2021

fgiunchedi added a comment to T297265: graphite1004 freezing.

I've rolled back graphite2003 to 5.10.0-9-amd64. Next steps as per IRC convo are to wait for graphite2003's stability, and consider upgrading firmware on graphite1004 since we might want that anyways

Dec 9 2021, 7:44 PM · User-fgiunchedi, Patch-For-Review, Wikimedia-Incident, SRE, Graphite
fgiunchedi added a comment to T297265: graphite1004 freezing.

I looked at the stack trace and to me it looks like either a kernel bug (we've never run graphite with 5.10.0-8-amd64 as per thanos metrics link) or faulty hardware (the SSDs are kinda old, but I believe we should be seeing different failures, at least from one of the drives)

Dec 9 2021, 10:07 AM · User-fgiunchedi, Patch-For-Review, Wikimedia-Incident, SRE, Graphite
fgiunchedi added a comment to T297265: graphite1004 freezing.

The temporary netconsole client on graphite1004 paid off, see https://phabricator.wikimedia.org/P18076 for logs from the host (journalctl -u netconsole on centrallog1001).

Dec 9 2021, 8:46 AM · User-fgiunchedi, Patch-For-Review, Wikimedia-Incident, SRE, Graphite
fgiunchedi added a comment to T297265: graphite1004 freezing.

Thank you folks for taking care of this!

Dec 9 2021, 8:44 AM · User-fgiunchedi, Patch-For-Review, Wikimedia-Incident, SRE, Graphite
fgiunchedi awarded T296289: swift-proxy not starting on ms-fe2009 due to missing python-monotonic a Like token.
Dec 9 2021, 8:26 AM · SRE-swift-storage

Dec 8 2021

fgiunchedi added a project to T297265: graphite1004 freezing: User-fgiunchedi.
Dec 8 2021, 4:33 PM · User-fgiunchedi, Patch-For-Review, Wikimedia-Incident, SRE, Graphite
fgiunchedi added a comment to T297265: graphite1004 freezing.

For the record, for testing purposes I've manually enabled netconsole on graphite1004 and pointed it to centrallog1001. Once the patch series above is merged the same config will be in puppet too

Dec 8 2021, 4:13 PM · User-fgiunchedi, Patch-For-Review, Wikimedia-Incident, SRE, Graphite
fgiunchedi added a comment to T296289: swift-proxy not starting on ms-fe2009 due to missing python-monotonic.

OK, I know what the problem is (at least at one level). Our swift front-ends use a bit of middleware, wmf.rewrite, which is shipped by us from puppet; that calls import monotonic. But there's nothing to say that python-monotonic (or python3-monotonic) should be installed on swift front-ends as far as I can see.

I'm not sure what was causing python-monotonic to get installed in the past, but it might have just been luck! I'll have a look at puppetizing this.

@fgiunchedi while investigating this, I find in modules/swift/files/python3.9/SwiftMedia/README the note

The files in the wmf directory come from the svn repo for the SwiftMedia extension. That is the canonical copy; any changes here should be copied there as well.

I can only find signs of a long-deleted https://www.mediawiki.org/wiki/Extension:SwiftMedia ...? If this is now part of core MW could we move the rewriting work there?
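
As a rough illustration of the failure mode described above (this is not the actual wmf.rewrite code), a middleware that needs a monotonic clock could guard the import so that a missing python-monotonic package does not prevent startup. The helper name get_monotonic_clock is hypothetical, and the fallback to time.monotonic assumes Python 3.3 or later:

```python
import time


def get_monotonic_clock():
    """Return a monotonic clock function (hypothetical helper, for illustration)."""
    try:
        # Third-party backport, packaged as python-monotonic / python3-monotonic.
        from monotonic import monotonic
        return monotonic
    except ImportError:
        # The standard library provides time.monotonic since Python 3.3.
        return time.monotonic


clock = get_monotonic_clock()
start = clock()
elapsed = clock() - start  # never negative with a monotonic clock
```

A guard like this only hides the missing dependency; declaring the package explicitly in puppet, as suggested in the comment, remains the more robust fix.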

Dec 8 2021, 11:30 AM · SRE-swift-storage
fgiunchedi added a comment to T282863: Upgrade Grafana to 8.x.

Just a note to indicate that given the recent Grafana 8 vulnerability we should make sure to upgrade to the latest 8 version

Dec 8 2021, 11:16 AM · SRE Observability (FY2021/2022-Q3), Performance-Team (Radar)
fgiunchedi added a comment to T297231: Sending Apache Spark metrics to PushGateway.

Thanks @gmodena for the summary, do you have a list of metrics and labels you pushed in your local environment? It'll help review the names/practices/etc

Dec 8 2021, 8:54 AM · Observability-Metrics
fgiunchedi awarded T297231: Sending Apache Spark metrics to PushGateway a Like token.
Dec 8 2021, 8:52 AM · Observability-Metrics
fgiunchedi added a comment to T212231: Remove Diamond from production.

Even though T210993 is open? Thanks! I am uploading a change to delete them.

Dec 8 2021, 8:52 AM · observability, SRE
fgiunchedi added a comment to T297265: graphite1004 freezing.

Thank you folks for investigating this! I am taking a look too and so far have failed to find anything of note

Dec 8 2021, 8:18 AM · User-fgiunchedi, Patch-For-Review, Wikimedia-Incident, SRE, Graphite

Dec 7 2021

fgiunchedi updated the task description for T297180: Revert 5.10.70 from bullseye hosts.
Dec 7 2021, 2:41 PM · Infrastructure-Foundations, SRE
fgiunchedi added a comment to T297180: Revert 5.10.70 from bullseye hosts.

I chatted with @MoritzMuehlenhoff re: the rollback, apt won't let you remove a running kernel though there's a way to ask grub to reboot into another menu entry (the second entry of the first submenu in this case). Therefore the procedure can look like this:

Dec 7 2021, 11:35 AM · Infrastructure-Foundations, SRE
fgiunchedi updated the task description for T297180: Revert 5.10.70 from bullseye hosts.
Dec 7 2021, 11:33 AM · Infrastructure-Foundations, SRE

Dec 6 2021

fgiunchedi added a comment to T288549: Indexing errors from logs generated by Activator.

This is back just now FWIW

Dec 6 2021, 3:45 PM · Observability-Logging, Machine-Learning-Team
fgiunchedi updated the task description for T288726: Move > 60% of observability Prometheus-based checks to Alertmanager.
Dec 6 2021, 3:04 PM · SRE Observability (FY2021/2022-Q3), Patch-For-Review, User-fgiunchedi
fgiunchedi added a comment to T212231: Remove Diamond from production.

The unused classes

diamond::collector::servicestats
diamond::collector::servicestats_lib

still exist and pop up in T272559

Dec 6 2021, 8:19 AM · observability, SRE

Dec 3 2021

herron awarded T296892: New VictorOps user request a Party Time token.
Dec 3 2021, 5:19 PM · observability
fgiunchedi moved T265435: codfw: Testing Out Sample PDUs from Doing to Radar on the User-fgiunchedi board.
Dec 3 2021, 10:13 AM · Patch-For-Review, User-fgiunchedi, observability, ops-codfw, SRE, DC-Ops
fgiunchedi added a comment to T296892: New VictorOps user request.

Very strange indeed, I just sent a test email and another invitation from VO to jhathaway@wikimedia.org. I've also sent you an email from an external address, let's see what makes it

Dec 3 2021, 8:43 AM · observability

Dec 2 2021

fgiunchedi closed T296892: New VictorOps user request as Resolved.

Hello Jesse, thanks for reaching out. You have the VO invitation in your inbox now, please see also https://wikitech.wikimedia.org/wiki/VictorOps for general documentation and next steps. I'll resolve the task, though please feel free to reopen if something is amiss!

Dec 2 2021, 2:22 PM · observability

Dec 1 2021

fgiunchedi updated subscribers of T295706: Improve TransactionProfiler as replacement for tendril's slow queries.

It should be but for whatever reason logstash still thinks the fields are not numeric. I'm on it (by on it I mean I'm bugging Filippo)

Dec 1 2021, 8:35 AM · Performance-Team-publish, MW-1.38-notes (1.38.0-wmf.9; 2021-11-16), Patch-For-Review, Performance-Team (Radar), Developer Productivity, Wikimedia-Rdbms, DBA, User-Ladsgroup

Nov 29 2021

fgiunchedi created P17894 (An Untitled Masterwork).
Nov 29 2021, 4:23 PM
fgiunchedi created P17893 (An Untitled Masterwork).
Nov 29 2021, 4:14 PM
fgiunchedi closed T126574: puppet should try to mount all mountable swift filesystems as Invalid.

Went another route, namely having proper permissions on the mount directory

Nov 29 2021, 10:31 AM · Patch-Needs-Improvement, SRE

Nov 26 2021

fgiunchedi added a comment to T294355: Several Wikidata Grafana boards missing data before October 2021.

I’m not sure I understand the discussion correctly :) do you still need a list of paths to back up, or does it look like we can back up everything now?

Nov 26 2021, 2:45 PM · Data-Persistence, Data-Persistence-Backup, bacula, SRE Observability (FY2021/2022-Q2), Wikidata Analytics, Wikidata, Graphite
fgiunchedi created T296540: logstash2028.mgmt flapping.
Nov 26 2021, 2:15 PM · SRE, ops-codfw
fgiunchedi added a subtask for T296199: Prometheus hardware refresh (+ Bullseye upgrade): T294967: (Need By: TBD) rack/setup/install prometheus100[56].
Nov 26 2021, 1:40 PM · Patch-For-Review, SRE Observability (FY2021/2022-Q3), User-fgiunchedi
fgiunchedi added a parent task for T294967: (Need By: TBD) rack/setup/install prometheus100[56]: T296199: Prometheus hardware refresh (+ Bullseye upgrade).
Nov 26 2021, 1:40 PM · SRE, SRE Observability (FY2021/2022-Q2), ops-eqiad, DC-Ops
fgiunchedi added a parent task for T296199: Prometheus hardware refresh (+ Bullseye upgrade): T247962: Migrate role::prometheus to Bullseye.
Nov 26 2021, 1:37 PM · Patch-For-Review, SRE Observability (FY2021/2022-Q3), User-fgiunchedi
fgiunchedi added a subtask for T247962: Migrate role::prometheus to Bullseye: T296199: Prometheus hardware refresh (+ Bullseye upgrade).
Nov 26 2021, 1:37 PM · SRE Observability (FY2021/2022-Q3)
fgiunchedi edited projects for T292180: rsyslog error: queue directory '/var/spool/rsyslog' and file name prefix 'output_kafka_json' already used, added: Observability-Logging; removed SRE Observability (FY2021/2022-Q2).
Nov 26 2021, 1:36 PM · Observability-Logging, User-ema, SRE
fgiunchedi edited projects for T292175: rsyslog errors about duplicate module includes, added: Observability-Logging; removed SRE Observability (FY2021/2022-Q2).
Nov 26 2021, 1:36 PM · Observability-Logging, User-ema, SRE
fgiunchedi edited projects for T293761: statsd and gunicorn metrics for superset, added: Observability-Metrics; removed SRE Observability (FY2021/2022-Q2).
Nov 26 2021, 1:35 PM · Data-Engineering, Superset, Observability-Metrics, SRE
fgiunchedi moved T288726: Move > 60% of observability Prometheus-based checks to Alertmanager from Inbox to In progress on the SRE Observability (FY2021/2022-Q2) board.
Nov 26 2021, 1:34 PM · SRE Observability (FY2021/2022-Q3), Patch-For-Review, User-fgiunchedi
fgiunchedi moved T294355: Several Wikidata Grafana boards missing data before October 2021 from Inbox to In progress on the SRE Observability (FY2021/2022-Q2) board.
Nov 26 2021, 1:34 PM · Data-Persistence, Data-Persistence-Backup, bacula, SRE Observability (FY2021/2022-Q2), Wikidata Analytics, Wikidata, Graphite
fgiunchedi added a project to T296199: Prometheus hardware refresh (+ Bullseye upgrade): User-fgiunchedi.
Nov 26 2021, 1:34 PM · Patch-For-Review, SRE Observability (FY2021/2022-Q3), User-fgiunchedi
fgiunchedi moved T296373: Define owners email address for Pontoon from Backlog to Doing on the User-fgiunchedi board.
Nov 26 2021, 1:33 PM · User-fgiunchedi
fgiunchedi updated subscribers of T296373: Define owners email address for Pontoon.

A different but related solution by @Kormat would be to disable/blackhole emails altogether

Nov 26 2021, 1:22 PM · User-fgiunchedi

Nov 25 2021

fgiunchedi updated subscribers of T296373: Define owners email address for Pontoon.

A different but related solution by @Kormat would be to disable/blackhole emails altogether

Nov 25 2021, 3:25 PM · User-fgiunchedi
fgiunchedi moved T264016: Host page did not auto-resolve in VO from Backlog to Up next on the User-fgiunchedi board.
Nov 25 2021, 2:14 PM · User-fgiunchedi, Observability-Alerting