
ema (Emanuele Rocca)
Staff Site Reliability Engineer, Traffic Team · Administrator

User Details

User Since: Sep 29 2015, 8:49 PM (293 w, 6 d)
Roles: Administrator
Availability: Available
IRC Nick: ema
LDAP User: Ema
MediaWiki User: Unknown

Recent Activity

Yesterday

ema triaged T282880: Revisit varnish dynamic backends mechanism as Medium priority.
Mon, May 17, 7:58 AM · Patch-For-Review, SRE, Traffic

Fri, May 14

ema updated the task description for T282880: Revisit varnish dynamic backends mechanism.
Fri, May 14, 3:33 PM · Patch-For-Review, SRE, Traffic
ema created T282880: Revisit varnish dynamic backends mechanism.
Fri, May 14, 3:30 PM · Patch-For-Review, SRE, Traffic

Fri, Apr 30

ema added a comment to T264398: 8-10% response start regression (Varnish 5.1.3-1wm15 -> 6.0.6-1wm1).

p75 averages for the past 12 hours:

Fri, Apr 30, 8:26 AM · Patch-For-Review, Performance-Team (Radar), SRE, Traffic

Thu, Apr 29

ema added a comment to T264398: 8-10% response start regression (Varnish 5.1.3-1wm15 -> 6.0.6-1wm1).

Change 683572 merged by Ema:

[operations/puppet@production] vcl: make vcl_hit invoke default VCL

https://gerrit.wikimedia.org/r/683572

Thu, Apr 29, 3:46 PM · Patch-For-Review, Performance-Team (Radar), SRE, Traffic
ema updated the task description for T281344: SRE Onboarding for Marc Mandere.
Thu, Apr 29, 8:42 AM · SRE, SRE-Access-Requests
ema triaged T281344: SRE Onboarding for Marc Mandere as Medium priority.
Thu, Apr 29, 8:42 AM · SRE, SRE-Access-Requests
ema added a comment to T264398: 8-10% response start regression (Varnish 5.1.3-1wm15 -> 6.0.6-1wm1).
Thu, Apr 29, 8:03 AM · Patch-For-Review, Performance-Team (Radar), SRE, Traffic
ema awarded T280484: debmonitor-client.service stays in failed state in case of server errors a Baby Tequila token.
Thu, Apr 29, 7:22 AM · SRE-tools, SRE

Wed, Apr 28

ema added a comment to T264398: 8-10% response start regression (Varnish 5.1.3-1wm15 -> 6.0.6-1wm1).

The change's claimed behaviour is definitely consistent with the change we observed in hit/miss ratio.

Wed, Apr 28, 1:21 PM · Patch-For-Review, Performance-Team (Radar), SRE, Traffic
ema created P15619 (An Untitled Masterwork).
Wed, Apr 28, 9:55 AM

Tue, Apr 27

ema added a comment to T277769: Expose cache host that served the response via Server Timing and collect it with navtiming daemon.
Tue, Apr 27, 7:52 AM · MW-1.36-notes (1.36.0-wmf.38; 2021-04-06), Patch-For-Review, Performance-Team
ema added a comment to T264398: 8-10% response start regression (Varnish 5.1.3-1wm15 -> 6.0.6-1wm1).

Vanilla 6.0.1 was performing worse than 5.1.3 and similarly to 6.0.7 when we tested it in January:

Tue, Apr 27, 7:51 AM · Patch-For-Review, Performance-Team (Radar), SRE, Traffic

Mon, Apr 26

ema added a project to T281090: Various debmonitor-client systemdtimer errors starting April 21st: SRE.
Mon, Apr 26, 8:17 AM · SRE, SRE-tools
ema created T281090: Various debmonitor-client systemdtimer errors starting April 21st.
Mon, Apr 26, 8:16 AM · SRE, SRE-tools

Wed, Apr 21

ema added a comment to T280439: File:Chessboard480.svg WEBP thumbnail version not visible on safari when size is fixed at 208px.

Thanks for pointing me to this task @akosiaris. The issue here is webp; this is broken on Safari: https://upload.wikimedia.org/wikipedia/commons/thumb/d/d7/Chessboard480.svg/312px-Chessboard480.svg.png.webp

Wed, Apr 21, 1:33 PM · Traffic, SRE, MediaWiki-General, Browser-Support-Apple-Safari

Mon, Apr 19

ema created T280484: debmonitor-client.service stays in failed state in case of server errors.
Mon, Apr 19, 8:33 AM · SRE-tools, SRE

Apr 15 2021

ema added a comment to T275809: cache_upload cache policy + large_objects_cutoff concerns.

I've added a dashboard called Varnish Anomalies, currently plotting when nuke_limit is reached, as well as Varnish fetch failures. On non-eqsin upload there's basically a one-to-one match between the two at the moment:

Apr 15 2021, 9:05 AM · Patch-For-Review, SRE, Traffic

Apr 14 2021

ema added a comment to T275809: cache_upload cache policy + large_objects_cutoff concerns.

Apparently we do occasionally reach nuke_limit on cache_upload nodes in DCs other than eqsin -- unrelated to the exp policy, we haven't moved away from the static "size threshold" policy anywhere other than Singapore yet.

Apr 14 2021, 3:32 PM · Patch-For-Review, SRE, Traffic

Apr 13 2021

ema added a comment to T275809: cache_upload cache policy + large_objects_cutoff concerns.

cp5001 has been running with the exp policy for 5 days now: compared to other upload nodes in eqsin, its hitrate is higher (~3%) and it is storing significantly more objects (~5.8M vs ~3.9M). Extending the policy change to the rest of upload@eqsin.

Apr 13 2021, 9:53 AM · Patch-For-Review, SRE, Traffic
ema added a comment to T265864: Remove 185.15.56.0/24 from network::external.

Now to figure out the actual service impact and if it's safe to merge.
To re-iterate, 185.15.56.0/24 is dedicated to WMCS CloudVPS

Apr 13 2021, 9:23 AM · Patch-For-Review, cloud-services-team (Kanban), SRE, netops
ema closed T279147: Grant access to Superset for Mikeraish as Resolved.
Apr 13 2021, 7:59 AM · SRE, LDAP-Access-Requests

Apr 8 2021

ema added a comment to T275809: cache_upload cache policy + large_objects_cutoff concerns.

Today I've added exp_policy.py, a trivial script to show the probability of caching objects of various sizes with the "exp" policy. Based on that script, I found that with rate=0.1 and base=-20.3 we get the following probabilities for a 384G cache:

obj size      admission
1.0 KB        99.9%
2.0 KB        99.8%
4.0 KB        99.7%
8.0 KB        99.3%
16.0 KB       98.6%
32.0 KB       97.2%
64.0 KB       94.6%
128.0 KB      89.4%
256.0 KB      79.9%
512.0 KB      63.9%
1024.0 KB     40.8%
2048.0 KB     16.7%
4096.0 KB     2.78%
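
The figures above can be approximately reproduced with a short sketch. This is not the exp_policy.py script itself. It assumes that the admission probability is exp(-size / ADM_PARAM), that ADM_PARAM is computed as in the rendered VCL from T279533 (pow(MEMORY, RATE) / pow(2.0, BASE)), and that the cache size is expressed in TiB, so that 384G becomes 0.375. These assumptions are inferred from the numbers in this feed, and the function names are hypothetical.

```python
import math


def adm_param(memory_tib: float, rate: float = 0.1, base: float = -20.3) -> float:
    """ADM_PARAM as defined in the rendered VCL: pow(MEMORY, RATE) / pow(2.0, BASE)."""
    return memory_tib ** rate / 2.0 ** base


def admission_probability(size_bytes: float, memory_tib: float,
                          rate: float = 0.1, base: float = -20.3) -> float:
    """Assumed form of the "exp" policy: probability decays exponentially with size."""
    return math.exp(-size_bytes / adm_param(memory_tib, rate, base))


if __name__ == "__main__":
    memory_tib = 384 / 1024  # assumption: a 384G cache expressed in TiB
    for exponent in range(13):  # 1 KB up to 4096 KB, doubling each step
        size_kb = 2 ** exponent
        p = admission_probability(size_kb * 1024, memory_tib)
        print(f"{size_kb:.1f} KB  {p:.1%}")
```

Under the same assumptions, adm_param(1 / 1024) evaluates to roughly 645474.24, which agrees with the ADM_PARAM value printed in the P15260 x.c output for MEMORY 0.0009765625.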
Apr 8 2021, 4:19 PM · Patch-For-Review, SRE, Traffic
ema added a comment to T279147: Grant access to Superset for Mikeraish.

@MRaishWMF: you should be all set! Let me know if you can now access Superset.

Apr 8 2021, 1:16 PM · SRE, LDAP-Access-Requests
ema added a comment to P15260 x.c.
$ gcc x.c  -lm ; ./a.out 
ADM_PARAM=645474.242184 x=645474.242184 -103275878.749405 -0.000062
Apr 8 2021, 11:29 AM
ema created P15260 x.c.
Apr 8 2021, 11:28 AM
ema closed T279533: Add exp cache admission policy parameters to hiera as Resolved.

After changing exp_policy_rate and exp_policy_base in hiera for traffic-cache-atstext-buster, the rendered VCL now looks like this:

+// Includes for Exp cache admission policy, admission probability exponentially
+// decreasing with size. See wm_admission_policies T144187
+C{
+   #include <stdlib.h>
+   #include <math.h>
+   #include <errno.h>
+
+   #define RATE 0.1
+   #define BASE -20.3
+   #define MEMORY 0.0009765625
+   #define ADM_PARAM pow(MEMORY, RATE) / pow(2.0, BASE)
+}C

Closing!

Apr 8 2021, 8:04 AM · Patch-For-Review, SRE, Traffic
ema closed T279533: Add exp cache admission policy parameters to hiera, a subtask of T275809: cache_upload cache policy + large_objects_cutoff concerns, as Resolved.
Apr 8 2021, 8:03 AM · Patch-For-Review, SRE, Traffic

Apr 7 2021

ema moved T279147: Grant access to Superset for Mikeraish from Backlog to Awaiting User Input on the LDAP-Access-Requests board.
Apr 7 2021, 3:21 PM · SRE, LDAP-Access-Requests
ema triaged T279147: Grant access to Superset for Mikeraish as Medium priority.
Apr 7 2021, 3:20 PM · SRE, LDAP-Access-Requests
ema added a comment to T279147: Grant access to Superset for Mikeraish.

@MRaishWMF: hi, we need approval from your manager here on the ticket. Thanks!

Apr 7 2021, 3:20 PM · SRE, LDAP-Access-Requests
ema moved T279531: Add Lena Meintrup to the ldap/wmde and ldap/nda group from Backlog to NDA Pending on the LDAP-Access-Requests board.
Apr 7 2021, 3:16 PM · SRE, LDAP-Access-Requests
ema added a comment to T279531: Add Lena Meintrup to the ldap/wmde and ldap/nda group.

@RStallman-legalteam, @KFrancis: hello! We have a NDA request for WMDE. Thanks!

Apr 7 2021, 3:16 PM · SRE, LDAP-Access-Requests
ema triaged T279531: Add Lena Meintrup to the ldap/wmde and ldap/nda group as Medium priority.
Apr 7 2021, 3:16 PM · SRE, LDAP-Access-Requests
ema added a comment to T279244: CAS SSO for reedy.

I think racktables is replaced by netbox for Reedy's needs and he does have access to that.

Apr 7 2021, 3:11 PM · CAS-SSO, SRE, LDAP-Access-Requests
ema triaged T279533: Add exp cache admission policy parameters to hiera as Medium priority.
Apr 7 2021, 12:46 PM · Patch-For-Review, SRE, Traffic
ema created T279533: Add exp cache admission policy parameters to hiera.
Apr 7 2021, 12:46 PM · Patch-For-Review, SRE, Traffic

Apr 6 2021

ema moved T279310: Need access to noc@wikimedia.org (associated with Analytics' MaxMind account) from Untriaged to Awaiting User Input on the SRE-Access-Requests board.
Apr 6 2021, 12:36 PM · SRE, SRE-Access-Requests
ema moved T277629: Create new group for root access to snapshot*, dumpsdata* and labstore1006,7 with holger in it from SRE Meeting Review to Awaiting User Input on the SRE-Access-Requests board.
Apr 6 2021, 12:33 PM · SRE, SRE-Access-Requests, Dumps-Generation
ema triaged T279310: Need access to noc@wikimedia.org (associated with Analytics' MaxMind account) as Medium priority.
Apr 6 2021, 12:25 PM · SRE, SRE-Access-Requests

Mar 31 2021

ema added a comment to T264398: 8-10% response start regression (Varnish 5.1.3-1wm15 -> 6.0.6-1wm1).

T277769 needs to be completed for the dashboard to be restored. Essentially the host data has to go through a new pipeline. The EventGate collection part is ready to merge, next I'll write the changes to the navtiming daemon.

Mar 31 2021, 9:36 AM · Patch-For-Review, Performance-Team (Radar), SRE, Traffic
ema added a comment to T264398: 8-10% response start regression (Varnish 5.1.3-1wm15 -> 6.0.6-1wm1).

The Response Time By Host dashboard shows no data, last bits of information on February 22. Is there anything we can do to fix it?

Mar 31 2021, 9:09 AM · Patch-For-Review, Performance-Team (Radar), SRE, Traffic

Mar 26 2021

ema created P15083 0007-LJ_GC64-mode-by-default.patch.
Mar 26 2021, 10:39 AM

Mar 25 2021

ema added a comment to T275809: cache_upload cache policy + large_objects_cutoff concerns.

We have such a strategy in our puppetized VCL, currently unused, called "exp" - https://gerrit.wikimedia.org/r/plugins/gitiles/operations/puppet/+/refs/heads/production/modules/varnish/templates/wikimedia-frontend.vcl.erb#914 (and see also the definition of adm_param near the top of the file). This policy admits files into cache storage based on a probabilistic function with an exponential response. This gives us a probability-based solution that is very likely to cache smaller objects the first time they're seen, but less likely to cache larger objects unless they're seen enough times to make it past a probability filter (e.g. a 64MB file might have an 8% chance to enter cache each time we see it, or whatever).

The tuning parameters for the function, as presently puppetized, auto-tune themselves to maximize object hitrate given our total cache storage size as a parameter. Some of the other fixed parameters that contribute to the calculation of the adm_param were based on research from our webrequest data years ago to figure out our size and popularity distributions, and might still be "close", but are no longer current.

In practice, with the parameters that are puppetized there today, it skews heavily in favor of avoiding the caching of large objects in the ranges we care about for our failure scenario, again because it wasn't driven by defending against this "attack", but instead by maximizing hitrates.

Mar 25 2021, 9:59 AM · Patch-For-Review, SRE, Traffic

Mar 23 2021

ema added a comment to T255973: Balance Kafka topic partitions on Kafka Jumbo to take advantage of the new brokers.

Although, it is a go library, which I'm not sure we have much tooling around dealing with. I think maybe @ema has made some Go based .debs before?

Mar 23 2021, 1:45 PM · Analytics-Kanban, Analytics-Clusters

Jan 22 2021

ema added a comment to T264398: 8-10% response start regression (Varnish 5.1.3-1wm15 -> 6.0.6-1wm1).

Awesome, glad to see that the bisecting paid off! Still 361 commits between those 2 versions, though 😕
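
For a rough sense of scale, a binary search over the remaining commits needs about log2(N) further build-and-measure rounds. The sketch below assumes that every round gives a clear good/bad verdict, which is optimistic for a performance regression measured over hours or days. The function name is hypothetical.

```python
import math


def bisect_rounds(commits: int) -> int:
    """Approximate number of test rounds for a binary search over `commits` candidates."""
    if commits <= 1:
        return 0
    return math.ceil(math.log2(commits))


print(bisect_rounds(361))  # prints 9
```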

Jan 22 2021, 1:40 PM · Patch-For-Review, Performance-Team (Radar), SRE, Traffic

Jan 20 2021

ema added a comment to T271953: Add client TCP source port to webrequest.
  • In Varnish we'll need to add VCL code to set the new parameters to X-Analytics, so that Varnishkafka will pick them up and we'll be able to use them in Analytics-land.

Something like:

Jan 20 2021, 10:52 AM · Patch-For-Review, Analytics-Kanban, Analytics
ema added a comment to T271953: Add client TCP source port to webrequest.

@ema quick question - is the client src port something that we could pass from ATS-TLS to Varnish frontend? Via HTTP header etc..

Jan 20 2021, 9:50 AM · Patch-For-Review, Analytics-Kanban, Analytics

Jan 14 2021

ema closed T265625: ats-be occasional system CPU usage increase as Resolved.

Erratic CPU usage gone:

Jan 14 2021, 10:08 AM · Performance-Team, SRE, Traffic

Jan 13 2021

ema added a comment to T265625: ats-be occasional system CPU usage increase.

Disabling JIT in all Lua scripts resulted in significantly decreased CPU usage as well as TTFB:

Jan 13 2021, 8:53 AM · Performance-Team, SRE, Traffic

Jan 12 2021

ema added a comment to T265625: ats-be occasional system CPU usage increase.

Disabling JIT in all Lua scripts on cp5008 resulted in ats-be not calling lj_vm_hotcall/mmap anymore and CPU usage went down significantly:

Jan 12 2021, 5:02 PM · Performance-Team, SRE, Traffic
ema added a project to T265625: ats-be occasional system CPU usage increase: Performance-Team.
Jan 12 2021, 11:28 AM · Performance-Team, SRE, Traffic
ema raised the priority of T265625: ats-be occasional system CPU usage increase from Medium to High.

Lowering the number of Lua states on cp3050 did reduce system CPU usage a bit, without having any visible performance difference, hence it seems safe to proceed with 64 as the default everywhere. That's just a minor improvement however, and the big "mmap issue" still remains.

Jan 12 2021, 11:27 AM · Performance-Team, SRE, Traffic

Jan 11 2021

ema added a comment to T265625: ats-be occasional system CPU usage increase.

A low-hanging fruit when it comes to Lua overhead seems to be tuning the number of allowed Lua states. By looking at the internal tslua statistics on cp3050, it seems that most of our 256 states have a fairly low number of associated threads (most between 0 and 3, some outliers up to 6). Let's try to lower that value from the default (256) to 64 on cp3050 and evaluate the impact on kernel cpu usage.

Jan 11 2021, 9:55 AM · Performance-Team, SRE, Traffic

Jan 8 2021

ema added a comment to T264398: 8-10% response start regression (Varnish 5.1.3-1wm15 -> 6.0.6-1wm1).

Over the past 2 days, 6.0.1 has been performing worse than 5.1.3, and similarly to 6.0.7. This would seem to indicate that the regression was introduced between 6.0.0 and 6.0.1.

Jan 8 2021, 11:39 AM · Patch-For-Review, Performance-Team (Radar), SRE, Traffic

Jan 6 2021

ema added a comment to T265625: ats-be occasional system CPU usage increase.

It turns out that malloc(3) does not really tell the whole truth: the threshold for choosing when to use mmap vs brk is dynamic, and not hardcoded to 128K. For instance, I found values of ~400K with the following SystemTap script:
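
The SystemTap script itself does not appear in this activity summary. Separately, the dynamic threshold can be illustrated with a toy Python model of the rule described in mallopt(3): the threshold starts at 128 KiB and is raised to the size of a freed mmapped block whenever that block is larger than the current threshold and no larger than DEFAULT_MMAP_THRESHOLD_MAX (32 MiB on 64-bit systems). This is a simplified model, not glibc code; the class and method names are hypothetical, and chunk overhead and free-list reuse are ignored.

```python
KIB = 1024
DEFAULT_MMAP_THRESHOLD = 128 * KIB            # initial threshold, per mallopt(3)
DEFAULT_MMAP_THRESHOLD_MAX = 32 * KIB * KIB   # upper bound on 64-bit systems


class ToyAllocator:
    """Toy model of glibc's dynamic mmap threshold (illustration only)."""

    def __init__(self) -> None:
        self.mmap_threshold = DEFAULT_MMAP_THRESHOLD

    def uses_mmap(self, size: int) -> bool:
        # Requests at or above the current threshold are served with mmap.
        return size >= self.mmap_threshold

    def free(self, size: int, was_mmapped: bool) -> None:
        # Freeing a large mmapped block raises the threshold to that block's size.
        if was_mmapped and self.mmap_threshold < size <= DEFAULT_MMAP_THRESHOLD_MAX:
            self.mmap_threshold = size


if __name__ == "__main__":
    heap = ToyAllocator()
    first = heap.uses_mmap(400 * KIB)       # above 128K, so mmap is used
    heap.free(400 * KIB, was_mmapped=first)
    print(heap.mmap_threshold // KIB)        # threshold is now 400 (KiB)
    print(heap.uses_mmap(300 * KIB))         # a 300K request no longer uses mmap
```

In this model a single 400K allocate/free cycle is enough to move the threshold to ~400K, which is the order of magnitude mentioned in the comment above.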

Jan 6 2021, 1:30 PM · Performance-Team, SRE, Traffic

Jan 5 2021

ema updated subscribers of T265625: ats-be occasional system CPU usage increase.

The plot thickens. I now have more questions than I do have answers, but here's the story so far.

Jan 5 2021, 12:42 PM · Performance-Team, SRE, Traffic
ema moved T265625: ats-be occasional system CPU usage increase from Caching to Bug Reports on the Traffic board.
Jan 5 2021, 11:45 AM · Performance-Team, SRE, Traffic
ema added a comment to T264398: 8-10% response start regression (Varnish 5.1.3-1wm15 -> 6.0.6-1wm1).

6.0.1-1wm1 has now been working fine for a day on the beta cluster, upgrading cp3054.

Jan 5 2021, 9:21 AM · Patch-For-Review, Performance-Team (Radar), SRE, Traffic

Jan 4 2021

ema added a comment to T265625: ats-be occasional system CPU usage increase.

All that CPU time is spent in the kernel, specifically in a large number of mmap calls. ksys_mmap_pgoff features prominently in the perf report of affected nodes, and tracing for 10 seconds how many times that function is called by various PIDs shows this:

Jan 4 2021, 5:08 PM · Performance-Team, SRE, Traffic
ema added a comment to T264398: 8-10% response start regression (Varnish 5.1.3-1wm15 -> 6.0.6-1wm1).

Varnish 6.0.0 does not seem to be affected by the regression, here is the average webperf_navtiming_responsestart_by_host_seconds (p75) during the past 10 days:

Jan 4 2021, 9:45 AM · Patch-For-Review, Performance-Team (Radar), SRE, Traffic
ema closed T270074: 3rd part of blog post series: the evolution of Wikimedia's Content Delivery Network as Resolved.

@ema these should all be fixed now. :-) I'll send out an announcement today.

Jan 4 2021, 9:30 AM · SRE, Technical-blog-posts, Traffic
ema updated the task description for T270223: FY2021-2022: Enable basic Multi-DC operations for read traffic (tracking).
Jan 4 2021, 8:58 AM · Performance-Team, Epic

Dec 18 2020

ema closed T270270: docker pull from docker-registry fails with `ERROR: missing or empty Content-Length header` as Resolved.

I think we can revert the no-streaming patch then, will check tomorrow. Thanks @kostajh!

Dec 18 2020, 2:28 PM · Patch-For-Review, Traffic, serviceops, SRE, User-zeljkofilipin, MediaWiki-Docker
ema added a comment to T270074: 3rd part of blog post series: the evolution of Wikimedia's Content Delivery Network.

@ema I published this

Thanks!

Dec 18 2020, 12:43 PM · SRE, Technical-blog-posts, Traffic

Dec 17 2020

ema added a comment to T270270: docker pull from docker-registry fails with `ERROR: missing or empty Content-Length header`.

Change 650191 had a related patch set uploaded (by Ema; owner: Ema):
[operations/puppet@production] Revert "vcl: do not stream responses to docker"

https://gerrit.wikimedia.org/r/650191

Dec 17 2020, 9:16 PM · Patch-For-Review, Traffic, serviceops, SRE, User-zeljkofilipin, MediaWiki-Docker
ema added a comment to T270270: docker pull from docker-registry fails with `ERROR: missing or empty Content-Length header`.

That being said, I suspect that our VCL trying to do_gzip on a HEAD request might confuse things. Let's try to only do_gzip for GET.

Dec 17 2020, 9:09 PM · Patch-For-Review, Traffic, serviceops, SRE, User-zeljkofilipin, MediaWiki-Docker
ema added a comment to T270270: docker pull from docker-registry fails with `ERROR: missing or empty Content-Length header`.

The current theory is that the problem boils down to the following HEAD request returning no Content-Length if served by cp3052:

Dec 17 2020, 8:31 PM · Patch-For-Review, Traffic, serviceops, SRE, User-zeljkofilipin, MediaWiki-Docker
ema created P13596 (An Untitled Masterwork).
Dec 17 2020, 7:59 PM
ema added a comment to T270270: docker pull from docker-registry fails with `ERROR: missing or empty Content-Length header`.

Maybe Mac sends a different user-agent? That would be fun...

Dec 17 2020, 5:27 PM · Patch-For-Review, Traffic, serviceops, SRE, User-zeljkofilipin, MediaWiki-Docker
ema added a comment to T270270: docker pull from docker-registry fails with `ERROR: missing or empty Content-Length header`.

This should now be fixed by using https://gerrit.wikimedia.org/r/c/operations/puppet/+/650156/ as a VCL workaround, please try to reproduce again!

Dec 17 2020, 4:31 PM · Patch-For-Review, Traffic, serviceops, SRE, User-zeljkofilipin, MediaWiki-Docker
ema added a comment to T270074: 3rd part of blog post series: the evolution of Wikimedia's Content Delivery Network.

I looked at the doc and was able to copy edit it! If you are able to go through and accept changes in the next day, I'm confident we can get this published before I leave for vacation.

Dec 17 2020, 12:21 PM · SRE, Technical-blog-posts, Traffic

Dec 16 2020

ema added a comment to T270074: 3rd part of blog post series: the evolution of Wikimedia's Content Delivery Network.

@ema Awesome! Let me know when your first draft is ready. Looking forward to reading and editing this!

Dec 16 2020, 2:10 PM · SRE, Technical-blog-posts, Traffic
ema triaged T270074: 3rd part of blog post series: the evolution of Wikimedia's Content Delivery Network as Medium priority.
Dec 16 2020, 2:09 PM · SRE, Technical-blog-posts, Traffic
ema awarded T270195: Reflected Cross-Site scripting (XSS) vulnerability in analytics-quarry-web a Love token.
Dec 16 2020, 12:08 PM · cloud-services-team (Kanban), Vuln-XSS, Analytics, Quarry, Security, Security-Team

Dec 14 2020

ema created T270074: 3rd part of blog post series: the evolution of Wikimedia's Content Delivery Network.
Dec 14 2020, 10:07 AM · SRE, Technical-blog-posts, Traffic

Dec 11 2020

ema added a comment to T264398: 8-10% response start regression (Varnish 5.1.3-1wm15 -> 6.0.6-1wm1).

Initial results of the 6.0.0 experiment on cp3054 are encouraging: for the past 12 hours performance has been in line with 5.1.3 on cp3052:

Dec 11 2020, 12:44 PM · Patch-For-Review, Performance-Team (Radar), SRE, Traffic

Dec 10 2020

ema triaged T269828: X-Cache-Status: distinguish between fresh and stale hits/misses as Medium priority.
Dec 10 2020, 9:14 AM · SRE, Traffic
ema created T269828: X-Cache-Status: distinguish between fresh and stale hits/misses.
Dec 10 2020, 9:13 AM · SRE, Traffic
ema triaged T269825: Incorrect X-Cache-Status reported by deployment-prep caches as Lowest priority.
Dec 10 2020, 9:00 AM · Patch-For-Review, SRE, Traffic
ema created T269825: Incorrect X-Cache-Status reported by deployment-prep caches.
Dec 10 2020, 9:00 AM · Patch-For-Review, SRE, Traffic

Dec 9 2020

ema added a comment to T264398: 8-10% response start regression (Varnish 5.1.3-1wm15 -> 6.0.6-1wm1).

OK, the amount of work needed to get 5.2.1 into a usable state really seems excessive. Let's give 6.0.0 a try: it is the version immediately after 5.2.1 and, according to the changelog, should address the VSM bugs introduced in 5.2.1. I've rebuilt all the dependencies, here are the versions for reference:

Dec 9 2020, 4:09 PM · Patch-For-Review, Performance-Team (Radar), SRE, Traffic
ema added a comment to T264398: 8-10% response start regression (Varnish 5.1.3-1wm15 -> 6.0.6-1wm1).

The list of VSM-related issues affecting 5.2.1 according to upstream's changelog is: 2430, 2470, 2518, 2535, 2541, 2545, 2546. I suspect they forgot to list some important changes given that 2586 isn't mentioned in changes.rst at all though it looks precisely like the infinite "Log overrun" problem mentioned yesterday.

Dec 9 2020, 11:11 AM · Patch-For-Review, Performance-Team (Radar), SRE, Traffic

Dec 8 2020

ema added a comment to T264398: 8-10% response start regression (Varnish 5.1.3-1wm15 -> 6.0.6-1wm1).

It's likely that 5.2.1 is affected by T264074, we need to backport our fix to 5.2.x too. However I don't think that during the 6.0.x upgrades we ran into this specific varnishlog issue, there might be more work to do in that regard too.

Dec 8 2020, 12:38 PM · Patch-For-Review, Performance-Team (Radar), SRE, Traffic
ema added a comment to T264398: 8-10% response start regression (Varnish 5.1.3-1wm15 -> 6.0.6-1wm1).

I've downgraded Varnish to version 5.2.1-1wm1 on cp3054 and had to revert due to an issue with varnishlog. After the upgrade, I've noticed that systemd-journald was maxing-out a CPU and suppressing millions of messages.

Dec 8 2020, 10:28 AM · Patch-For-Review, Performance-Team (Radar), SRE, Traffic

Dec 7 2020

ema added a comment to T264398: 8-10% response start regression (Varnish 5.1.3-1wm15 -> 6.0.6-1wm1).

It took a while to package and rebuild libvmod-re2, libvmod-netmapper, varnish-modules, and varnishkafka against varnish 5.2.1. Long story short, we now have the following packages available on deneb and working with Varnish 5.2.1:

Dec 7 2020, 4:48 PM · Patch-For-Review, Performance-Team (Radar), SRE, Traffic

Dec 3 2020

ema closed T130904: Host rewrite for /static/ not applied to purges as Resolved.

I tried a test purge on cp3054:

curl -v -X PURGE -H "Host: fr.wikipedia.org" http://127.0.0.1/static/ema-test

And indeed varnish is now doing the Host header rewrite as expected:

Dec 3 2020, 3:19 PM · Patch-For-Review, SRE, Traffic
ema closed T268736: Package and deploy varnish 6.0.7 as Resolved.

6.0.7-1wm1 deployed fleet-wide, closing.

Dec 3 2020, 1:36 PM · SRE, Traffic
ema added a comment to T268736: Package and deploy varnish 6.0.7.

Varnish 6.0.7-1wm1 has been working well on both cp4032 and cp3054, upgrading all other nodes (except for cp3052 which is running 5.1.3-1wm15 as part of T264398).

Dec 3 2020, 9:37 AM · SRE, Traffic

Nov 30 2020

ema closed T268883: fifo-log-tailer: gracefully handle missing unix socket as Resolved.
root@cp4028:~# fifo-log-tailer -socket this-does-not-exist-at-all.socket
2020/11/30 16:42:38 Unable to read from socket: dial unix this-does-not-exist-at-all.socket: connect: no such file or directory
2020/11/30 16:42:39 Unable to read from socket: dial unix this-does-not-exist-at-all.socket: connect: no such file or directory
[...]
2020/11/30 16:42:48 Could not connect to this-does-not-exist-at-all.socket after 10 attempts. Exiting.
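
The behaviour visible in this log (one attempt per second, giving up after 10 attempts) can be sketched as a bounded retry loop. This is an illustrative Python sketch only, not the fifo-log-tailer code; the function name, parameters, and messages are hypothetical.

```python
import socket
import time


def connect_with_retries(path: str, attempts: int = 10, delay: float = 1.0):
    """Try to connect to a UNIX socket, retrying a fixed number of times.

    Returns the connected socket, or None if every attempt failed.
    """
    for attempt in range(1, attempts + 1):
        sock = socket.socket(socket.AF_UNIX, socket.SOCK_STREAM)
        try:
            sock.connect(path)
            return sock
        except OSError as err:
            sock.close()
            print(f"Unable to connect to {path}: {err}")
            if attempt < attempts:
                time.sleep(delay)  # wait before the next attempt
    print(f"Could not connect to {path} after {attempts} attempts. Exiting.")
    return None
```

Returning None rather than raising leaves the exit decision to the caller, which can then terminate cleanly instead of crashing on a missing socket.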
Nov 30 2020, 4:43 PM · SRE, Traffic
ema added a comment to T264398: 8-10% response start regression (Varnish 5.1.3-1wm15 -> 6.0.6-1wm1).

Varnish 6.0.7 is behaving well in terms of functionality on cp4032 (T268736).

Nov 30 2020, 2:15 PM · Patch-For-Review, Performance-Team (Radar), SRE, Traffic
ema closed T256467: Make atsmtail-backend.service depend on fifo-log-demux as Resolved.

Unit ordering at boot time is now correct:

Nov 30 2020, 10:34 AM · SRE, Traffic

Nov 27 2020

ema moved T268883: fifo-log-tailer: gracefully handle missing unix socket from Triage to Bug Reports on the Traffic board.
Nov 27 2020, 10:50 AM · SRE, Traffic
ema triaged T268883: fifo-log-tailer: gracefully handle missing unix socket as Low priority.
Nov 27 2020, 10:50 AM · SRE, Traffic
ema created T268883: fifo-log-tailer: gracefully handle missing unix socket.
Nov 27 2020, 10:50 AM · SRE, Traffic
ema added a comment to T265625: ats-be occasional system CPU usage increase.

This happened again last night at 2020-11-27T00:08, we had alerts on cp1089, cp1077, cp1087, cp1083 and cp1075 in eqiad, cp2029 (codfw), cp3062 and cp3064 (esams), and cp5009 (eqsin):

Nov 27 2020, 10:09 AM · Performance-Team, Traffic, SRE

Nov 26 2020

ema closed T256302: Certain links being rejected by caching if opened in Internet Explorer with a HTTP 400 error as Invalid.

Timing out given that 5 months have passed since this issue was reported and to the best of my knowledge it was an isolated case. Feel free to reopen if it happens again obviously.

Nov 26 2020, 3:05 PM · SRE, Traffic
ema updated subscribers of T268736: Package and deploy varnish 6.0.7.

@Gilles: FYI during the next few weeks we'll be upgrading to this latest bugfix release. The list of changes (see task description) does not seem to suggest anything that could have an obvious performance impact, but you never know. I am going to upgrade one single node first, see how it behaves for a while and then proceed with the rest.

Nov 26 2020, 9:11 AM · Traffic, SRE
ema moved T268736: Package and deploy varnish 6.0.7 from Triage to Feature Requests on the Traffic board.
Nov 26 2020, 9:06 AM · Traffic, SRE
ema added a comment to T266857: 2nd part of blog post series: the evolution of Wikimedia's Content Delivery Network.

Can you look it over and make sure that everything looks correct before I announce on Twitter?

Nov 26 2020, 8:46 AM · Traffic, SRE, Technical-blog-posts