
ema (Emanuele Rocca)
Disabled

Projects

User Details

User Since
Sep 29 2015, 8:49 PM (351 w, 3 d)
Roles
Disabled
IRC Nick
ema
LDAP User
Ema
MediaWiki User
Unknown
This account has been disabled.

Recent Activity

Feb 12 2022

Aklapper defrocked ema.
Feb 12 2022, 4:35 PM

Feb 9 2022

ema added a comment to T288106: Experiment with single backend CDN nodes.

Although we did briefly discuss the results of this experiment within Traffic, I don't think we ever publicly disclosed our analysis.

Feb 9 2022, 9:04 AM · Performance-Team (Radar), User-ema, Patch-For-Review, SRE, Traffic

Feb 1 2022

ema added a comment to T300247: Remove old and unused libvarnishapi.

We should probably add a Conflicts: libvarnishapi1 to our varnish 6 packaging, or whatever relationship magic is the right one to ensure that if varnish 6 is installed, libvarnishapi1 is not.

Feb 1 2022, 10:53 AM · SRE, Traffic

Jan 31 2022

ema triaged T300525: Beta cluster down: Error: 502, Next Hop Connection Failed as Medium priority.

Apparently deployment-mediawiki11.deployment-prep.eqiad1.wikimedia.cloud is gone and has been replaced by deployment-mediawiki12.

Jan 31 2022, 3:22 PM · Beta-Cluster-reproducible, Beta-Cluster-Infrastructure

Jan 12 2022

ema triaged T299054: Make varnish-frontend-restart work on Beta Cluster as Low priority.
Jan 12 2022, 2:08 PM · Beta-Cluster-Infrastructure, SRE, Traffic
ema created T299054: Make varnish-frontend-restart work on Beta Cluster.
Jan 12 2022, 2:08 PM · Beta-Cluster-Infrastructure, SRE, Traffic

Jan 7 2022

ema added a comment to T298758: Package and deploy Varnish 6.0.9.

Smoke testing of 6.0.9 looks fine on deployment-prep; I'll start upgrading production nodes next week.

Jan 7 2022, 2:14 PM · Performance-Team (Radar), Patch-For-Review, User-ema, SRE, Traffic
ema moved T298758: Package and deploy Varnish 6.0.9 from Backlog to Doing on the User-ema board.
Jan 7 2022, 2:10 PM · Performance-Team (Radar), Patch-For-Review, User-ema, SRE, Traffic
ema updated the task description for T298758: Package and deploy Varnish 6.0.9.
Jan 7 2022, 1:56 PM · Performance-Team (Radar), Patch-For-Review, User-ema, SRE, Traffic
ema added a comment to T298758: Package and deploy Varnish 6.0.9.

Change 752151 had a related patch set uploaded (by Ema; author: Ema):

[operations/debs/varnish4@debian-wmf] Use libunwind for backtraces

https://gerrit.wikimedia.org/r/752151

Jan 7 2022, 1:56 PM · Performance-Team (Radar), Patch-For-Review, User-ema, SRE, Traffic
ema triaged T298758: Package and deploy Varnish 6.0.9 as Medium priority.
Jan 7 2022, 9:53 AM · Performance-Team (Radar), Patch-For-Review, User-ema, SRE, Traffic
ema created T298758: Package and deploy Varnish 6.0.9.
Jan 7 2022, 9:53 AM · Performance-Team (Radar), Patch-For-Review, User-ema, SRE, Traffic
ema moved T288106: Experiment with single backend CDN nodes from Doing to Backlog on the User-ema board.
Jan 7 2022, 8:56 AM · Performance-Team (Radar), User-ema, Patch-For-Review, SRE, Traffic

Dec 15 2021

ema renamed T297544: Frequent server errors (503 and 502), happened several times in the last 2 days from Frequent backend server errors (503), happened several times in the last 2 days to Frequent server errors (503 and 502), happened several times in the last 2 days.
Dec 15 2021, 9:03 AM · Traffic
ema triaged T297544: Frequent server errors (503 and 502), happened several times in the last 2 days as High priority.

Hello @Ade56facc and @Yann, thanks for the bug reports.

Dec 15 2021, 8:56 AM · Traffic

Dec 8 2021

Dzahn awarded T210411: Applayer services without TLS a Barnstar token.
Dec 8 2021, 3:50 PM · Traffic-Icebox, Patch-For-Review, serviceops, SRE
ema closed T108580: HTTPS for internal service traffic as Resolved.

Many of the assumptions made when this task was created have changed since the migration to ATS for cache backends (no more IPSec, the difference between Tier1 and Tier2 DCs is now gone, ...). We are now in a world where all backend caches access the origins via TLS, which I think largely covers what we wanted to achieve here. @BBlack: I'm marking the task as resolved, but of course feel free to reopen / create other tasks as needed if you think that anything is missing.

Dec 8 2021, 8:32 AM · Traffic-Icebox, codfw-rollout, SRE, HTTPS
ema updated the task description for T210411: Applayer services without TLS.
Dec 8 2021, 8:22 AM · Traffic-Icebox, Patch-For-Review, serviceops, SRE
ema closed T210411: Applayer services without TLS, a subtask of T207048: ATS production-ready as a backend cache layer, as Resolved.
Dec 8 2021, 8:22 AM · Patch-For-Review, SRE, Traffic
ema closed T210411: Applayer services without TLS as Resolved.
Dec 8 2021, 8:22 AM · Traffic-Icebox, Patch-For-Review, serviceops, SRE
ema updated the task description for T210411: Applayer services without TLS.
Dec 8 2021, 8:19 AM · Traffic-Icebox, Patch-For-Review, serviceops, SRE

Dec 7 2021

ema created T297187: Upgrade pybal-test200[23] from Stretch to Buster.
Dec 7 2021, 10:12 AM · Traffic

Dec 6 2021

ema moved T293879: varnishmtail metric loss due to mtail not reading from pipe fast enough from Doing to Radar on the User-ema board.
Dec 6 2021, 9:14 AM · Observability-Metrics, Patch-For-Review, User-ema, SRE, Traffic
ema lowered the priority of T293879: varnishmtail metric loss due to mtail not reading from pipe fast enough from High to Low.

In the last 24 hours we had just one overrun on 4 nodes:

Dec 6 2021, 9:11 AM · Observability-Metrics, Patch-For-Review, User-ema, SRE, Traffic

Nov 19 2021

ema added a comment to T293879: varnishmtail metric loss due to mtail not reading from pipe fast enough.

The situation has improved significantly: we are now processing up to 13K lines per second, compared with the ~8K plateau from last week:

varnishmtail-lines-processed.png (image attachment, 186 KB)

Nov 19 2021, 3:03 PM · Observability-Metrics, Patch-For-Review, User-ema, SRE, Traffic

Nov 18 2021

ema added a comment to T288106: Experiment with single backend CDN nodes.

After setting cache::single_backend_fqdn: cp4021.ulsfo.wmnet in hiera, cp4021 is now gone from the list of cache backends on all upload@ulsfo nodes, see for instance cp4022:

Nov 18 2021, 9:55 AM · Performance-Team (Radar), User-ema, Patch-For-Review, SRE, Traffic

Nov 8 2021

ema added a comment to T295253: compiler1003.puppet-diffs.eqiad1.wikimedia.cloud out of disk space.

This is a recurring problem, see for example T273599, T222072, and T295253. T222075 has ideas on how to tackle the issue. I've tried to access the instance to free up some space, but I don't seem to have the permissions to do so.

Nov 8 2021, 10:28 AM · Infrastructure-Foundations, SRE, puppet-compiler

Nov 5 2021

ema closed T295120: Varnish packages installed from the wrong component on host reimage as Resolved.
Nov 5 2021, 10:21 AM · SRE, Traffic
ema added a comment to T295120: Varnish packages installed from the wrong component on host reimage.

The puppet code lacks a "priority => 1002", which is needed if you want to override "main" (which also has priority=1001). See the comments in the apt::package_from_component define for further context.
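
The effect of the missing priority can be sketched with a deliberately simplified model of pin-based selection. This is an illustration only, not apt's actual algorithm: it assumes the candidate with the highest pin priority wins and that ties fall back to the highest version. The component name and version numbers are hypothetical.

```python
from typing import NamedTuple, Sequence, Tuple


class Candidate(NamedTuple):
    source: str                # repository component (hypothetical names below)
    version: Tuple[int, ...]   # simplified version, e.g. (6, 0, 8)
    priority: int              # pin priority


def pick(candidates: Sequence[Candidate]) -> Candidate:
    # Simplified rule: highest pin priority wins, ties go to the highest version.
    return max(candidates, key=lambda c: (c.priority, c.version))


# Hypothetical versions: "main" carries a newer package than the component.
main = Candidate("main", (6, 0, 9), 1001)
tied = Candidate("component/example", (6, 0, 8), 1001)
pinned = tied._replace(priority=1002)

print(pick([main, tied]).source)    # equal priorities: the newer version in main wins
print(pick([main, pinned]).source)  # 1002 beats 1001: the component wins
```

With equal priorities the outcome depends on which repository happens to carry the newer version, which is why an explicit higher priority is the safer choice.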

Nov 5 2021, 9:56 AM · SRE, Traffic
ema triaged T295120: Varnish packages installed from the wrong component on host reimage as Medium priority.
Nov 5 2021, 9:54 AM · SRE, Traffic
ema updated the task description for T295120: Varnish packages installed from the wrong component on host reimage.
Nov 5 2021, 9:49 AM · SRE, Traffic
ema created T295120: Varnish packages installed from the wrong component on host reimage.
Nov 5 2021, 9:47 AM · SRE, Traffic

Oct 28 2021

ema added a comment to T293879: varnishmtail metric loss due to mtail not reading from pipe fast enough.

The optimizations to varnishxcache.mtail and varnishreqstats.mtail paid off: time spent in tryBacktrack has decreased significantly:

cp3062-varnishmtail-cpu-profile-2.png (image attachment, 1 MB)

Oct 28 2021, 8:07 AM · Observability-Metrics, Patch-For-Review, User-ema, SRE, Traffic

Oct 27 2021

ema added a comment to T293879: varnishmtail metric loss due to mtail not reading from pipe fast enough.

Change 734893 merged by Ema:

[operations/puppet@production] varnishxcache.mtail: avoid unnecessary filtering

https://gerrit.wikimedia.org/r/734893

Oct 27 2021, 8:24 AM · Observability-Metrics, Patch-For-Review, User-ema, SRE, Traffic

Oct 25 2021

ema created P17594 (An Untitled Masterwork).
Oct 25 2021, 4:04 PM
ema added a comment to T293879: varnishmtail metric loss due to mtail not reading from pipe fast enough.

Change 732925 merged by Ema:

[operations/puppet@production] varnishttfb.mtail: use native histogram type

https://gerrit.wikimedia.org/r/732925

Oct 25 2021, 1:08 PM · Observability-Metrics, Patch-For-Review, User-ema, SRE, Traffic
ema committed rOALE692897c5f79c: Use ats-tls metrics for edge traffic drop alert (authored by ema).
Oct 25 2021, 7:37 AM

Oct 22 2021

ema closed T294116: Varnish reload failing on deployment-cache-upload06 as Resolved.

I upgraded varnish to 6.0.8 everywhere (see T292290) and forgot to restart the service on deployment-cache-upload06. It should be fixed now, thanks @Majavah.

Oct 22 2021, 1:47 PM · SRE, Beta-Cluster-Infrastructure, Traffic
ema added a comment to T293879: varnishmtail metric loss due to mtail not reading from pipe fast enough.

Mentioned in SAL (#wikimedia-operations) [2021-10-22T08:23:48Z] <ema> cp3062: test 0008-vsl_check_e_inval_assertion.patch https://gerrit.wikimedia.org/r/c/operations/debs/varnish4/+/732913/ T293879

Oct 22 2021, 12:53 PM · Observability-Metrics, Patch-For-Review, User-ema, SRE, Traffic
ema updated the task description for T293879: varnishmtail metric loss due to mtail not reading from pipe fast enough.
Oct 22 2021, 7:20 AM · Observability-Metrics, Patch-For-Review, User-ema, SRE, Traffic

Oct 21 2021

ema created P17568 (An Untitled Masterwork).
Oct 21 2021, 2:04 PM
ema added a comment to T293879: varnishmtail metric loss due to mtail not reading from pipe fast enough.

I've tried using a separate mtail instance with a subset of the scripts used by the production instance, namely:

Oct 21 2021, 12:37 PM · Observability-Metrics, Patch-For-Review, User-ema, SRE, Traffic
ema added a comment to T293879: varnishmtail metric loss due to mtail not reading from pipe fast enough.

With a very large vsl_space on cp3062 (3G instead of the default 80M), the issue happens less often but still occurs. On all text@esams nodes, there has been no overrun between 00:44 and 04:46 (when esams traffic is at its lowest), while on cp3062 the last overrun happened at 21:56 and the first one this EU morning at 06:38. Bumping vsl_space alone does not fix the issue.

Oct 21 2021, 9:17 AM · Observability-Metrics, Patch-For-Review, User-ema, SRE, Traffic

Oct 20 2021

ema updated the task description for T293879: varnishmtail metric loss due to mtail not reading from pipe fast enough.
Oct 20 2021, 1:14 PM · Observability-Metrics, Patch-For-Review, User-ema, SRE, Traffic
ema added a comment to T293879: varnishmtail metric loss due to mtail not reading from pipe fast enough.

Trying the lowest-hanging fruit first, namely raising vsl_space. I've first tried setting it to 512M as mentioned in the SAL entry above, but that failed due to the size of /var/lib/varnish (512M, but it needs space for other stuff too). We now have vsl_space=480M on cp3062, let's see if that changes anything at all.
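
The sizing constraint in this comment can be checked with a few lines of arithmetic. This is a minimal sketch that assumes binary units and uses only the two values mentioned above; how much space the other files in /var/lib/varnish actually need is not stated, so the sketch only reports what is left over.

```python
UNITS = {"K": 1024, "M": 1024**2, "G": 1024**3}


def parse_size(text: str) -> int:
    """Convert a size such as '480M' into bytes (binary units assumed)."""
    return int(text[:-1]) * UNITS[text[-1]]


def headroom(fs_size: str, vsl_space: str) -> int:
    """Bytes left on the filesystem once vsl_space has been allocated."""
    return parse_size(fs_size) - parse_size(vsl_space)


for candidate in ("512M", "480M"):
    left_mib = headroom("512M", candidate) // 1024**2
    print(f"vsl_space={candidate}: {left_mib}M left for other files")
```

A vsl_space of 512M leaves nothing for the other files on a 512M filesystem, while 480M leaves 32M.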

Oct 20 2021, 1:10 PM · Observability-Metrics, Patch-For-Review, User-ema, SRE, Traffic
ema moved T293879: varnishmtail metric loss due to mtail not reading from pipe fast enough from Backlog to Doing on the User-ema board.
Oct 20 2021, 12:35 PM · Observability-Metrics, Patch-For-Review, User-ema, SRE, Traffic
ema closed T292290: Package and deploy Varnish 6.0.8 as Resolved.

All hosts upgraded.

Oct 20 2021, 12:14 PM · Performance-Team (Radar), User-ema, SRE, Traffic
ema renamed T293879: varnishmtail metric loss due to mtail not reading from pipe fast enough from varnishmtail metric loss due to performance issues to varnishmtail metric loss due to mtail not reading from pipe fast enough.
Oct 20 2021, 12:10 PM · Observability-Metrics, Patch-For-Review, User-ema, SRE, Traffic
ema updated the task description for T293879: varnishmtail metric loss due to mtail not reading from pipe fast enough.
Oct 20 2021, 11:52 AM · Observability-Metrics, Patch-For-Review, User-ema, SRE, Traffic
ema updated the task description for T293879: varnishmtail metric loss due to mtail not reading from pipe fast enough.
Oct 20 2021, 11:47 AM · Observability-Metrics, Patch-For-Review, User-ema, SRE, Traffic
ema triaged T293879: varnishmtail metric loss due to mtail not reading from pipe fast enough as High priority.
Oct 20 2021, 11:41 AM · Observability-Metrics, Patch-For-Review, User-ema, SRE, Traffic
ema created T293879: varnishmtail metric loss due to mtail not reading from pipe fast enough.
Oct 20 2021, 11:40 AM · Observability-Metrics, Patch-For-Review, User-ema, SRE, Traffic

Oct 19 2021

ema added a comment to T292290: Package and deploy Varnish 6.0.8.

Caches have now filled up. Response start looks good on cp3060 compared to one week ago:

Oct 19 2021, 7:18 AM · Performance-Team (Radar), User-ema, SRE, Traffic

Oct 18 2021

ema moved T293605: purged rdkafka crashes: assert: rkq->rkq_refcnt > 0 from Backlog to Radar on the User-ema board.
Oct 18 2021, 3:29 PM · User-ema, SRE, Traffic
ema triaged T293605: purged rdkafka crashes: assert: rkq->rkq_refcnt > 0 as Low priority.

Setting priority to low for now, as these seem to be isolated, sporadic crashes, and systemd took care of the restarts as expected, so there was no production impact.

Oct 18 2021, 9:28 AM · User-ema, SRE, Traffic
ema created T293605: purged rdkafka crashes: assert: rkq->rkq_refcnt > 0.
Oct 18 2021, 9:26 AM · User-ema, SRE, Traffic
ema added a comment to T292290: Package and deploy Varnish 6.0.8.

I've made some improvements to the by-host dash that may be of use:
https://grafana.wikimedia.org/d/M7xQ_BeWk/response-time-by-host

Oct 18 2021, 7:33 AM · Performance-Team (Radar), User-ema, SRE, Traffic

Oct 14 2021

ema moved T201317: wmf-auto-reimage: 'execution expired' on first puppet run from Doing to Radar on the User-ema board.
Oct 14 2021, 10:03 AM · User-ema, Infrastructure-Foundations, SRE, SRE-tools

Oct 13 2021

ema moved T293157: Toolhub API requests with PATCH verbs blocked by CDN from Backlog to Radar on the User-ema board.
Oct 13 2021, 11:36 AM · User-bd808, User-ema, SRE, Traffic, Toolhub
ema added a project to T293157: Toolhub API requests with PATCH verbs blocked by CDN: User-ema.
Oct 13 2021, 11:35 AM · User-bd808, User-ema, SRE, Traffic, Toolhub
ema added a comment to T201317: wmf-auto-reimage: 'execution expired' on first puppet run.

Trying another reimage as follows:

Oct 13 2021, 11:33 AM · User-ema, Infrastructure-Foundations, SRE, SRE-tools
ema closed T292820: Create runbook for VarnishTrafficDrop alert, change dashboard link as Resolved.

Runbook and updated dashboard link are shown correctly. Closing.

Oct 13 2021, 9:24 AM · User-ema, SRE, Traffic
ema moved T201317: wmf-auto-reimage: 'execution expired' on first puppet run from Backlog to Doing on the User-ema board.
Oct 13 2021, 9:16 AM · User-ema, Infrastructure-Foundations, SRE, SRE-tools
ema added a project to T201317: wmf-auto-reimage: 'execution expired' on first puppet run: User-ema.
Oct 13 2021, 9:16 AM · User-ema, Infrastructure-Foundations, SRE, SRE-tools
ema added a comment to T293157: Toolhub API requests with PATCH verbs blocked by CDN.

@bd808: looks like we're all set!

Oct 13 2021, 9:02 AM · User-bd808, User-ema, SRE, Traffic, Toolhub

Oct 12 2021

ema committed rOALE3843b0bc4c70: VarnishTrafficDrop: add runbook and change dashboard link (authored by ema).
Oct 12 2021, 3:10 PM
ema awarded T289787: Clean up Traffic tag/workboard a Love token.
Oct 12 2021, 1:25 PM · PM, SRE, Traffic
ema added a comment to T292820: Create runbook for VarnishTrafficDrop alert, change dashboard link.

Runbook created: https://wikitech.wikimedia.org/wiki/Monitoring/VarnishTrafficDrop

Oct 12 2021, 12:18 PM · User-ema, SRE, Traffic
ema moved T292820: Create runbook for VarnishTrafficDrop alert, change dashboard link from Backlog to Doing on the User-ema board.
Oct 12 2021, 11:38 AM · User-ema, SRE, Traffic

Oct 11 2021

ema added a project to T292290: Package and deploy Varnish 6.0.8: Performance-Team.

Heads up Performance-Team: as with all Varnish upgrades, this may have an impact (positive or negative) on performance. You may want to keep the upgrade process on your radar: beta has been running with Varnish 6.0.8 for a few days now without obvious issues. I'll upgrade one prod text node in ulsfo today and then carry on relatively quickly unless things break.

Oct 11 2021, 9:48 AM · Performance-Team (Radar), User-ema, SRE, Traffic
ema updated the task description for T288106: Experiment with single backend CDN nodes.
Oct 11 2021, 8:59 AM · Performance-Team (Radar), User-ema, Patch-For-Review, SRE, Traffic
ema updated ema.
Oct 11 2021, 8:42 AM

Oct 8 2021

ema triaged T292175: rsyslog errors about duplicate module includes as Medium priority.
Oct 8 2021, 11:54 AM · Patch-For-Review, Observability-Logging, User-ema, SRE
ema triaged T292180: rsyslog error: queue directory '/var/spool/rsyslog' and file name prefix 'output_kafka_json' already used as Medium priority.
Oct 8 2021, 11:54 AM · Observability-Logging, User-ema, SRE
ema triaged T292815: ATS should alert if the number of total or active connections reached maximum as High priority.
Oct 8 2021, 11:53 AM · SRE, User-ema, Traffic
ema triaged T292817: Multiple ATS HTTP2 stats missing from Prometheus as Medium priority.
Oct 8 2021, 11:53 AM · Traffic-Icebox, User-ema, SRE
ema triaged T292820: Create runbook for VarnishTrafficDrop alert, change dashboard link as Medium priority.
Oct 8 2021, 11:53 AM · User-ema, SRE, Traffic
ema created T292820: Create runbook for VarnishTrafficDrop alert, change dashboard link.
Oct 8 2021, 9:02 AM · User-ema, SRE, Traffic
ema added a project to T292815: ATS should alert if the number of total or active connections reached maximum: SRE.
Oct 8 2021, 8:25 AM · SRE, User-ema, Traffic
ema created T292817: Multiple ATS HTTP2 stats missing from Prometheus.
Oct 8 2021, 8:25 AM · Traffic-Icebox, User-ema, SRE
ema created T292815: ATS should alert if the number of total or active connections reached maximum.
Oct 8 2021, 7:56 AM · SRE, User-ema, Traffic
ema moved T289974: Prometheus Varnish exporter alert: add runbook and link to dashboard from Backlog to Radar on the User-ema board.
Oct 8 2021, 7:28 AM · User-ema, Observability-Alerting, SRE, Traffic
ema moved T290870: rsyslog service should fail on configuration errors from Backlog to Radar on the User-ema board.
Oct 8 2021, 7:28 AM · SRE Observability, User-ema, User-fgiunchedi, SRE
ema moved T292175: rsyslog errors about duplicate module includes from Backlog to Radar on the User-ema board.
Oct 8 2021, 7:28 AM · Patch-For-Review, Observability-Logging, User-ema, SRE
ema moved T292180: rsyslog error: queue directory '/var/spool/rsyslog' and file name prefix 'output_kafka_json' already used from Backlog to Radar on the User-ema board.
Oct 8 2021, 7:28 AM · Observability-Logging, User-ema, SRE
ema moved T288106: Experiment with single backend CDN nodes from Backlog to Doing on the User-ema board.
Oct 8 2021, 7:28 AM · Performance-Team (Radar), User-ema, Patch-For-Review, SRE, Traffic
ema moved T292290: Package and deploy Varnish 6.0.8 from Backlog to Doing on the User-ema board.
Oct 8 2021, 7:27 AM · Performance-Team (Radar), User-ema, SRE, Traffic
ema added a project to T290870: rsyslog service should fail on configuration errors: User-ema.
Oct 8 2021, 7:25 AM · SRE Observability, User-ema, User-fgiunchedi, SRE
ema added a project to T292180: rsyslog error: queue directory '/var/spool/rsyslog' and file name prefix 'output_kafka_json' already used: User-ema.
Oct 8 2021, 7:25 AM · Observability-Logging, User-ema, SRE
ema added a project to T292175: rsyslog errors about duplicate module includes: User-ema.
Oct 8 2021, 7:25 AM · Patch-For-Review, Observability-Logging, User-ema, SRE
ema added a project to T289974: Prometheus Varnish exporter alert: add runbook and link to dashboard: User-ema.
Oct 8 2021, 7:25 AM · User-ema, Observability-Alerting, SRE, Traffic
ema added a project to T292506: Investigate cp5006 crash: User-ema.
Oct 8 2021, 7:25 AM · SRE Observability, User-ema, SRE, Traffic
ema added a project to T292290: Package and deploy Varnish 6.0.8: User-ema.
Oct 8 2021, 7:24 AM · Performance-Team (Radar), User-ema, SRE, Traffic
ema added a project to T288106: Experiment with single backend CDN nodes: User-ema.
Oct 8 2021, 7:17 AM · Performance-Team (Radar), User-ema, Patch-For-Review, SRE, Traffic
ema created User-ema.
Oct 8 2021, 7:13 AM

Oct 6 2021

ema added a comment to T288106: Experiment with single backend CDN nodes.

Change 726912 had a related patch set uploaded (by Ema; author: Ema):

[operations/puppet@production] cache: exclude single backend experiment from pooled ATS backends

https://gerrit.wikimedia.org/r/726912

Oct 6 2021, 2:36 PM · Performance-Team (Radar), User-ema, Patch-For-Review, SRE, Traffic
ema closed T286502: Figure out why deployment-cache-text06 keeps crashing as Resolved.

After lowering the amount of memory used for the ATS backend ram cache, there's now some more available on the system:

Oct 6 2021, 12:30 PM · SRE, Traffic, Beta-Cluster-Infrastructure

Oct 5 2021

ema added a comment to T292290: Package and deploy Varnish 6.0.8.

Preliminary testing in beta looks good, uploading the package to the archive.

Oct 5 2021, 12:53 PM · Performance-Team (Radar), User-ema, SRE, Traffic
ema added a project to T286502: Figure out why deployment-cache-text06 keeps crashing: Traffic.
Oct 5 2021, 12:27 PM · SRE, Traffic, Beta-Cluster-Infrastructure
ema added a comment to T286502: Figure out why deployment-cache-text06 keeps crashing.

The instance has 4G of memory, of which up to 1G is used by the varnish cache and 2G by the ATS backend ram cache (setting proxy.config.cache.ram_cache.size). Indeed the system is sometimes running out of memory, and the OOM killer is sacrificing the ATS process in those cases:
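
As a rough illustration of the memory budget described in this comment, the figures can be added up as follows. The sizes are the ones quoted above; treating "G" as GiB is an assumption, and the sketch says nothing about what the remaining processes actually consume.

```python
GIB = 1024**3

total_memory = 4 * GIB    # instance memory, from the comment
varnish_cache = 1 * GIB   # upper bound used by the varnish cache
ats_ram_cache = 2 * GIB   # proxy.config.cache.ram_cache.size

# Whatever is left must cover the kernel, both proxy processes' own
# overhead, and everything else running on the instance.
remaining = total_memory - varnish_cache - ats_ram_cache
share = remaining / total_memory

print(f"remaining: {remaining // GIB}G ({share:.0%} of total)")
```

Only a quarter of the instance memory is left once both caches are full, which is consistent with the out-of-memory events described above.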

Oct 5 2021, 12:17 PM · SRE, Traffic, Beta-Cluster-Infrastructure